TangoFlux-base
declare-lab
Introduction
TangoFlux is a model designed for fast and faithful text-to-audio generation, leveraging flow matching and CLAP-Ranked Preference Optimization (CRPO). It uses FluxTransformer blocks to generate audio conditioned on textual prompts and duration embeddings, and can produce 44.1 kHz audio for up to 30 seconds.
Architecture
The model architecture comprises FluxTransformer blocks, including Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT) blocks. It operates on audio latent representations encoded by a variational autoencoder (VAE) and learns a rectified flow trajectory between noise and those latents. The training pipeline involves three stages: pre-training, fine-tuning, and preference optimization, with alignment achieved through CLAP-Ranked Preference Optimization (CRPO).
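The rectified flow trajectory mentioned above can be sketched in a few lines: the model is trained to predict the constant velocity along a straight line between a Gaussian noise sample and a data latent. This is an illustrative sketch of the general rectified-flow objective, not the actual TangoFlux code; all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_pair(x1, t, rng):
    """Return a point on the straight noise-to-data path and its velocity target."""
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise endpoint
    xt = (1.0 - t) * x0 + t * x1         # linear interpolation at time t
    v_target = x1 - x0                   # constant velocity along the line
    return xt, v_target

# stand-in for a VAE-encoded audio latent
latent = rng.standard_normal((8, 64))
xt, v = rectified_flow_pair(latent, 0.5, rng)
```

A flow-matching model would regress its predicted velocity at `(xt, t)` toward `v_target`; at inference, integrating the learned velocity field from noise recovers a latent that the VAE decodes back to audio.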
Training
TangoFlux undergoes a multi-stage training process. It is first pre-trained to establish a foundational understanding of audio generation, then fine-tuned for specific tasks. Finally, preference optimization aligns the model with user preferences by generating synthetic audio candidates and constructing preference pairs from them.
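The preference-pair construction in CRPO can be sketched as follows: for each prompt, several candidate audios are generated, ranked by a CLAP text-audio similarity score, and the best- and worst-ranked candidates form a chosen/rejected pair. This is a minimal illustration; `clap_score` below is a toy stand-in for the real CLAP model, and the strings stand in for generated audio.

```python
def build_preference_pair(prompt, candidates, clap_score):
    """Rank candidates by CLAP score; pair the best (chosen) with the worst (rejected)."""
    ranked = sorted(candidates, key=lambda audio: clap_score(prompt, audio))
    loser, winner = ranked[0], ranked[-1]
    return {"prompt": prompt, "chosen": winner, "rejected": loser}

# toy scorer: counts matching leading characters (a stand-in for CLAP similarity)
def clap_score(prompt, audio):
    return sum(a == b for a, b in zip(prompt, audio))

pair = build_preference_pair(
    "dog barking",
    ["dog barking far away", "rain", "dog"],
    clap_score,
)
```

Pairs built this way can then feed a DPO-style preference-optimization objective, which is how the synthetic data described above improves alignment.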
Guide: Running Locally
- Download the Model: Clone the repository from GitHub.
- Installation: Ensure dependencies such as `torchaudio` are installed.
- Load the Model: Use the `TangoFluxInference` class to load the base model.
- Generate Audio: Call the `generate` function with a textual prompt, specifying steps (50 recommended for better quality) and duration:

```python
import torchaudio
from tangoflux import TangoFluxInference
from IPython.display import Audio

model = TangoFluxInference(name='declare-lab/TangoFlux-base')
audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)
Audio(data=audio, rate=44100)
```

- Hardware Suggestions: For optimal performance, consider a cloud GPU service such as Google Colab or an AWS EC2 instance with GPU support.
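The generation code above returns audio data at 44.1 kHz; a common follow-up is writing it to a WAV file. The sketch below uses only the Python standard library and a synthetic sine tone as stand-in model output (the real `audio` object may first need converting to a flat list of float samples).

```python
import math
import struct
import wave

SAMPLE_RATE = 44100  # TangoFlux outputs 44.1 kHz audio

def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] to `path` as 16-bit PCM."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit
        wf.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        wf.writeframes(frames)

# one second of a 440 Hz tone in place of generated audio
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
save_wav("tone.wav", tone)
```

Using the standard-library `wave` module keeps the example dependency-free; with `torchaudio` installed, `torchaudio.save` would be the more idiomatic choice for tensor output.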
License
TangoFlux is available for non-commercial research use only. It is governed by the Stable Audio Open license, the WavCaps license, and the original licenses of each training dataset. The model is licensed under the Stability AI Community License, Copyright © Stability AI Ltd. All Rights Reserved.