MARS5-TTS

CAMB-AI

Introduction

MARS5-TTS is a text-to-speech (TTS) model developed by CAMB.AI. It uses a two-stage AR-NAR (autoregressive, then non-autoregressive) pipeline to produce high-quality speech, even for prosodically difficult material such as sports commentary and anime. The model can generate speech from as little as 5 seconds of reference audio and a text snippet.

Architecture

MARS5-TTS uses a two-stage pipeline. First, an autoregressive (AR) transformer generates coarse (L0) EnCodec speech features from the text and the reference audio. A multinomial Denoising Diffusion Probabilistic Model (DDPM), which forms the non-autoregressive (NAR) stage, then refines these features to produce the final EnCodec codebook values. Prosody can be steered through text features such as punctuation and capitalization. Speaker identity is extracted from a reference audio file; clips of around 6 seconds give the best results.
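Because the description above singles out roughly 6 seconds as the best reference length, it can help to check a clip before using it. The following is a minimal sketch and not part of MARS5-TTS: the function name `check_reference_length` and the 3-second tolerance are illustrative assumptions; only the 6-second target comes from the text above.

```python
def check_reference_length(num_samples, sample_rate, target_s=6.0, tolerance_s=3.0):
    """Return (duration_in_seconds, is_near_target) for a reference clip.

    target_s follows the ~6 s guidance above; tolerance_s is an assumed,
    illustrative margin, not a documented MARS5-TTS limit.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration_s = num_samples / sample_rate
    return duration_s, abs(duration_s - target_s) <= tolerance_s


# Example: a clip of 144,000 samples at 24 kHz lasts 6 seconds.
duration, ok = check_reference_length(144_000, 24_000)
print(f"{duration:.1f} s, near target: {ok}")
```

With the variables from the guide below, this could be called as `check_reference_length(len(wav), mars5.sr)` once the reference audio has been loaded.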

Training

MARS5-TTS is trained on a combination of raw audio and byte-pair-encoded text. The model supports two cloning modes, shallow and deep; deep cloning additionally uses a transcript of the reference audio to improve quality.
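The practical difference between the two modes comes down to whether a transcript of the reference audio is available. A minimal sketch of that decision follows, assuming a hypothetical helper `choose_deep_clone` that is not part of MARS5-TTS; the resulting flag corresponds to the `deep_clone` option used in the guide below.

```python
def choose_deep_clone(ref_transcript):
    """Pick the cloning mode from the reference transcript.

    Deep cloning needs a transcript of the reference audio, so this
    hypothetical helper enables it only when a non-blank one is given.
    """
    return bool(ref_transcript and ref_transcript.strip())


print(choose_deep_clone("The quick brown fox."))  # transcript available
print(choose_deep_clone(""))                      # no transcript
```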

Guide: Running Locally

  1. Install Dependencies: Make sure Python 3.10 or later is installed, then install the required packages:

    pip install --upgrade torch torchaudio librosa vocos encodec huggingface_hub
    
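  2. Load the Model: Import the model classes and load the pretrained Mars5 AR and NAR weights from the Hugging Face Hub. This assumes the inference module from the MARS5-TTS repository is on the Python path:

    import torch
    import librosa
    from inference import Mars5TTS, InferenceConfig as config_class
    # Load the pretrained checkpoint by its Hugging Face repository ID
    mars5 = Mars5TTS.from_pretrained("CAMB-AI/MARS5-TTS")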
    
  3. Prepare Reference Audio and Transcript:

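    # Load the reference clip as mono audio, resampled to the model's sample rate
    wav, sr = librosa.load('<path to 24kHz waveform>.wav', sr=mars5.sr, mono=True)
    wav = torch.from_numpy(wav)
    # Transcript of the reference clip, used by deep cloning
    ref_transcript = "<transcript of the reference audio>"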
    
  4. Perform Synthesis:

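    # Deep cloning uses ref_transcript; set to False for shallow cloning
    deep_clone = True
    cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100,
                       top_k=100, temperature=0.7, freq_penalty=3)
    # Returns the AR-stage codes and the synthesized audio
    ar_codes, output_audio = mars5.tts("Your text here", wav, ref_transcript, cfg=cfg)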
    

For optimal performance, use a cloud GPU with at least 20GB of VRAM.
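The guide stops once `output_audio` is returned. One way to write it to disk is sketched below using only the Python standard library. It assumes `output_audio` is a one-dimensional sequence of floats in the range [-1, 1] sampled at `mars5.sr`, which the guide does not state; `write_wav` is an illustrative helper, not part of MARS5-TTS.

```python
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples (assumed to lie in [-1, 1]) as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))            # clip out-of-range values
        frames += struct.pack("<h", int(s * 32767))  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample (16-bit)
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Example with a short synthetic signal. For real output, pass
# output_audio.tolist() and mars5.sr instead (assumed shape, see above).
write_wav("example.wav", [0.0, 0.5, -0.5, 1.0], 24000)
```

Converting through a plain Python list keeps the sketch free of extra dependencies; for long clips, a vectorized conversion would be faster.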

License

MARS5 is open-sourced under the GNU AGPL 3.0 license. Alternative licensing can be requested by contacting help@camb.ai.
