TangoFlux-base

declare-lab

Introduction

TangoFlux is a text-to-audio model designed for fast and faithful generation. It combines flow matching with CLAP-Ranked Preference Optimization (CRPO) and uses FluxTransformer blocks to generate audio conditioned on a textual prompt and a duration embedding. TangoFlux can produce 44.1 kHz audio up to 30 seconds long.

Architecture

The model is built from FluxTransformer blocks, which include Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT) blocks. It learns a rectified flow trajectory over audio latent representations encoded by a variational autoencoder (VAE). The training pipeline has three stages: pre-training, fine-tuning, and preference optimization, with alignment achieved through CRPO (CLAP-Ranked Preference Optimization).
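
As a rough illustration of what learning a rectified flow trajectory means, the sketch below works on plain Python lists instead of real VAE latents. It assumes the common convention that t = 0 is noise and t = 1 is data, and that the network regresses the constant velocity along the straight line between them; the function names are hypothetical and are not part of the TangoFlux codebase.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Regression target for the velocity field: constant along the line."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and x1 - x0."""
    target = velocity_target(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)

def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: with a perfect (oracle) velocity field, sampling recovers the data.
random.seed(0)
data = [0.5, -1.0, 2.0]                          # stand-in for a VAE latent
noise = [random.gauss(0.0, 1.0) for _ in data]   # starting point at t=0

def oracle(x, t):
    # Hypothetical perfect model: always returns the true velocity.
    return velocity_target(noise, data)

sample = euler_sample(oracle, noise, steps=50)
```

In a real model the oracle is replaced by a trained network, and the `steps` argument plays the same role as the number of sampling steps chosen at inference time.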

Training

TangoFlux is trained in three stages. It is first pre-trained to learn general audio generation and then fine-tuned. Finally, preference optimization improves alignment: CRPO generates synthetic data, ranks it with CLAP, and constructs preference pairs that are used for optimization.
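
The pair-construction step can be sketched as follows. This is a minimal illustration, assuming that the highest-scoring candidate for a prompt becomes the preferred sample and the lowest-scoring one the rejected sample. `build_preference_pair` and `toy_score` are hypothetical names, and `toy_score` is a simple word-overlap stand-in for a CLAP text-audio similarity score, with strings standing in for audio.

```python
def build_preference_pair(prompt, candidates, score_fn):
    """Rank candidates by score and keep the best and worst as one pair."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: score_fn(prompt, c), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

def toy_score(prompt, candidate):
    # Hypothetical stand-in for CLAP similarity: fraction of prompt words
    # that also appear in the candidate description.
    words = set(prompt.lower().split())
    return len(words & set(candidate.lower().split())) / len(words)

pair = build_preference_pair(
    "dog barking loudly",
    ["dog barking", "rain falling", "dog barking loudly"],
    toy_score,
)
```

Swapping `toy_score` for a real text-audio similarity model, and the strings for generated waveforms, would give the kind of chosen/rejected pairs that a preference optimization stage consumes.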

Guide: Running Locally

  1. Download the Model: Clone the repository from GitHub.

  2. Installation: Ensure dependencies like torchaudio are installed.

  3. Load the Model: Use the TangoFluxInference class to load the base model.

  4. Generate Audio: Call the generate method with a textual prompt, the number of steps (50 is recommended for better quality), and the duration in seconds.

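    import torchaudio  # dependency from step 2; not called directly in this snippet
    from tangoflux import TangoFluxInference
    from IPython.display import Audio
    
    # Load the base model by name
    model = TangoFluxInference(name='declare-lab/TangoFlux-base')
    # Generate 10 seconds of audio using 50 sampling steps
    audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)
    
    # Play the result in a notebook at the model's 44.1 kHz sample rate
    Audio(data=audio, rate=44100)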
    
  5. Hardware Suggestions: For best performance, run the model on a GPU, for example through Google Colab or a GPU-enabled AWS EC2 instance.
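
When choosing a duration, it can help to check the request against the limits stated in the introduction (44.1 kHz, up to 30 seconds). The helper below is only arithmetic on those two numbers; `expected_num_samples` is a hypothetical name, and the exact length of the tensor returned by the model is not confirmed here.

```python
SAMPLE_RATE_HZ = 44_100   # output sample rate stated for TangoFlux
MAX_DURATION_S = 30       # maximum clip length stated for TangoFlux

def expected_num_samples(duration_s):
    """Return the nominal sample count per channel for a requested duration."""
    if not 0 < duration_s <= MAX_DURATION_S:
        raise ValueError(
            f"duration must be in (0, {MAX_DURATION_S}] seconds, got {duration_s}"
        )
    return round(duration_s * SAMPLE_RATE_HZ)

ten_second_clip = expected_num_samples(10)
longest_clip = expected_num_samples(MAX_DURATION_S)
```

For the 10-second example above this gives 441,000 samples per channel, and 1,323,000 for the 30-second maximum.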

License

TangoFlux is available for non-commercial research use only. It is subject to Stable Audio Open's license, the WavCaps license, and the original licenses of each training dataset. The model is licensed under the Stability AI Community License, Copyright © Stability AI Ltd. All Rights Reserved.
