TangoFlux
Introduction
TangoFlux is a text-to-audio generation model developed by DECLARE-LAB. It combines Flow Matching with CLAP-Ranked Preference Optimization (CRPO) to produce high-quality audio from text prompts. The model generates audio at a 44.1 kHz sample rate for durations of up to 30 seconds.
Architecture
TangoFlux is built from FluxTransformer blocks, which combine Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT) layers. These blocks are conditioned on the text prompt and a duration embedding. The model learns a rectified flow trajectory toward audio latent representations encoded by a variational autoencoder (VAE), so generation amounts to moving a noise sample along that trajectory and decoding the resulting latent. The architecture is designed for efficient and faithful audio generation.
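The rectified flow idea can be illustrated with a minimal sketch. It is not the TangoFlux implementation: it assumes plain linear interpolation between a noise sample at t=0 and a data latent at t=1, uses a three-element list as a stand-in for a VAE latent, and replaces the trained network with a hypothetical "oracle" that returns the true velocity. The names `interpolate`, `target_velocity`, and `euler_sample` are illustrative only.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data latent x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path; the regression target during training."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
x1 = [0.5, -1.0, 2.0]                      # stand-in for a VAE audio latent (assumption)
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample


def oracle(x, t):
    """Hypothetical perfect velocity field standing in for the trained model."""
    return target_velocity(x0, x1)


recovered = euler_sample(x0, oracle, steps=25)
```

With a perfect velocity field the path is a straight line, so the Euler integration lands on the data latent regardless of the step count. A learned field is only approximately straight, which is one way to read why more sampling steps can improve quality.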
Training
The training process of TangoFlux involves three stages:
- Pre-training: Initial training of the model on a large dataset.
- Fine-tuning: Adjusting the model parameters for specific tasks or datasets.
- Preference Optimization: Conducted through CLAP-Ranked Preference Optimization (CRPO), which generates new synthetic data and constructs preference pairs to refine the model's performance.
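The pair-construction step in the last stage can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that, for each prompt, the highest-scoring candidate becomes the preferred sample and the lowest-scoring one the rejected sample, and it replaces real CLAP text-audio similarity with a hypothetical lookup table. `build_preference_pair`, `fake_scores`, and the clip names are all made up for the example.

```python
def build_preference_pair(prompt, candidates, score_fn):
    """Rank candidates by score_fn and return a dict with the best and worst one."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: score_fn(prompt, c), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}


# Hypothetical scores standing in for CLAP text-audio similarity (assumption).
fake_scores = {"clip_a": 0.41, "clip_b": 0.63, "clip_c": 0.28}


def score(prompt, clip):
    """Placeholder scorer; a real pipeline would embed the prompt and the audio."""
    return fake_scores[clip]


pair = build_preference_pair("dog barking in the rain", list(fake_scores), score)
```

Here `pair["chosen"]` is `"clip_b"` and `pair["rejected"]` is `"clip_c"`. In a real pipeline the candidates would be audio clips sampled from the current model for the same prompt.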
Guide: Running Locally
To run TangoFlux locally, follow these steps:
- Clone the Repository: Clone the TangoFlux repository from GitHub.
- Install Dependencies: Ensure you have the necessary Python packages, including torchaudio.
- Load the Model: The model is downloaded and cached automatically on first use. Subsequent runs load it directly from the cache.
- Generate Audio:
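    import torchaudio
    from tangoflux import TangoFluxInference
    from IPython.display import Audio

    # Weights are downloaded and cached on first use.
    model = TangoFluxInference(name='declare-lab/TangoFlux')

    # Generate 10 seconds of audio using 50 sampling steps.
    audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)

    # Play the result in a notebook at 44.1 kHz.
    Audio(data=audio, rate=44100)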
- Steps: The steps argument defaults to 25. Increasing it to 50 gives higher quality at the cost of longer run time.
For optimal performance, consider using cloud GPUs such as AWS EC2 P3 instances or Google Cloud's Compute Engine with NVIDIA GPUs.
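Outside a notebook, the generated waveform usually needs to be written to disk. The sketch below shows one way to do that with only the Python standard library. It makes several assumptions: the audio has already been converted to a flat list of float samples in the range [-1, 1], it is mono, and a synthetic 440 Hz tone stands in for model output. `write_wav` is a hypothetical helper, not part of TangoFlux.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 44100  # output rate stated for TangoFlux


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Synthetic half-second tone standing in for generated audio (assumption).
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 2)]

path = os.path.join(tempfile.gettempdir(), "tangoflux_demo.wav")
write_wav(path, tone)
```

Since torchaudio is already listed as a dependency, its own save function is likely the more direct route for tensors. The standard-library version is shown because it runs without the model or a GPU.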
License
TangoFlux is released under the MIT License, allowing for flexibility in use and distribution.