Tango Flux LLM Model — Open LLM List

Introduction

TangoFlux is a text-to-audio generation model developed by DECLARE-LAB. It combines flow matching with CLAP-Ranked Preference Optimization (CRPO) to produce high-quality audio from text prompts, and it can generate clips of up to 30 seconds at a 44.1 kHz sample rate.

Architecture

TangoFlux is built from FluxTransformer blocks, which combine Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT) layers. These blocks are conditioned on the text prompt and a duration embedding. The model learns a rectified flow trajectory to the audio latent representation encoded by a variational autoencoder (VAE). This design targets efficient generation that stays faithful to the prompt.
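To make the rectified flow idea concrete, here is a minimal sketch in plain Python. It assumes the textbook straight-line formulation (x_t = (1 - t) * noise + t * latent, with target velocity latent - noise), which may differ in detail from the schedule TangoFlux actually uses. The function names, the three-element toy "latent", and the oracle velocity function are all hypothetical stand-ins; in the real model a transformer predicts the velocity from the noisy latent, the text prompt, and the duration embedding.

```python
import random


def rectified_flow_pair(noise, latent, t):
    """Return the point x_t on the straight noise->latent path and its target velocity."""
    x_t = [(1.0 - t) * n + t * z for n, z in zip(noise, latent)]
    velocity = [z - n for n, z in zip(noise, latent)]  # constant along the path
    return x_t, velocity


def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, noise, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
latent = [0.5, -1.0, 2.0]  # toy stand-in for a VAE audio latent
noise = [rng.gauss(0.0, 1.0) for _ in latent]

# Training-time quantities at one sampled time step.
x_t, target_velocity = rectified_flow_pair(noise, latent, t=0.25)

# Hypothetical oracle that already knows the true velocity; a trained network
# would approximate this function instead.
def oracle(x, t):
    return target_velocity

generated = euler_sample(oracle, noise, steps=50)
```

With a perfect velocity predictor the Euler integration lands on the target latent, which is why a small number of sampling steps can suffice when the learned trajectory is close to straight.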

Training

The training process of TangoFlux involves three stages:

Pre-training: Initial training of the model on a large dataset.
Fine-tuning: Adjusting the model parameters for specific tasks or datasets.
Preference Optimization: Alignment through CLAP-Ranked Preference Optimization (CRPO), which generates new synthetic audio, ranks it, and constructs preference pairs that are then used to refine the model.
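The pair-construction step of the third stage can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes that several candidates are sampled per prompt, that each is scored against the prompt, and that the highest- and lowest-scoring candidates form the chosen/rejected pair. The names build_preference_pairs, generate_fn, and score_fn are hypothetical, and the toy generator and scorer below stand in for the audio model and a CLAP text-audio similarity score.

```python
def build_preference_pairs(prompts, generate_fn, score_fn, num_candidates=4):
    """For each prompt, sample candidates and pair the best-scored with the worst-scored."""
    pairs = []
    for prompt in prompts:
        # Sample several candidates for the same prompt (one per seed).
        candidates = [generate_fn(prompt, seed) for seed in range(num_candidates)]
        # Rank candidates by how well they match the prompt, lowest score first.
        ranked = sorted(candidates, key=lambda audio: score_fn(prompt, audio))
        pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
    return pairs


# Toy stand-ins so the sketch runs without any model (both are hypothetical).
def toy_generate(prompt, seed):
    return f"{prompt}#take{seed}"


def toy_score(prompt, audio):
    # Deterministic fake "similarity" derived from the take number.
    take = int(audio.rsplit("take", 1)[1])
    return (take * 7) % 5


pairs = build_preference_pairs(["dog barking"], toy_generate, toy_score)
```

The resulting pairs would then feed a preference-optimization loss. Repeating the loop with the updated model is one way to read the "generates new synthetic data" part of the description.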

Guide: Running Locally

To run TangoFlux locally, follow these steps:

Clone the Repository: Get the code from the TangoFlux GitHub repository (declare-lab/TangoFlux).
Install Dependencies: Ensure you have the necessary Python packages, including torchaudio.
Load the Model: The model will be downloaded and cached automatically. For subsequent runs, it will load directly from the cache.

Generate Audio:

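import torchaudio
from tangoflux import TangoFluxInference
from IPython.display import Audio

# Downloads the weights on first use, then loads them from the local cache.
model = TangoFluxInference(name='declare-lab/TangoFlux')

# steps: number of sampling steps; duration: clip length in seconds (up to 30).
audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)

# Save the clip to disk at the model's 44.1 kHz sample rate.
torchaudio.save('output.wav', audio, 44100)

# In a Jupyter notebook, play the clip inline.
Audio(data=audio, rate=44100)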