Introduction

MaskGCT is a zero-shot text-to-speech (TTS) model built on a Masked Generative Codec Transformer. It generates speech from text and a prompt recording without requiring explicit alignment between text and speech.
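
To make "masked generative" concrete, the following is a minimal sketch of iterative mask-and-predict decoding. It assumes a confidence-based unmasking loop with a cosine schedule, which is a common choice for masked generative models and is not confirmed here as MaskGCT's exact procedure. The names MASK, toy_predict, and masked_decode are hypothetical, and toy_predict returns random values in place of a trained transformer.

```python
import math
import random

MASK = -1  # placeholder id for a masked position (assumption for this sketch)


def toy_predict(tokens, rng):
    """Hypothetical stand-in for the transformer: one (token, confidence) pair per position."""
    return [(rng.randrange(1024), rng.random()) for _ in tokens]


def masked_decode(length, steps=8, seed=0):
    """Fill `length` masked positions over at most `steps` rounds of parallel prediction."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        predictions = toy_predict(tokens, rng)
        # Cosine schedule: number of positions that should stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the still-masked positions.
        by_confidence = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        for i in by_confidence[:n_commit]:
            tokens[i] = predictions[i][0]
    return tokens
```

Calling masked_decode(20) returns 20 token ids with no masked positions left. Because many positions are filled per round, the number of model calls is bounded by steps rather than by sequence length.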

Architecture

The architecture includes the following components:

  • Semantic Codec: Converts speech to semantic tokens.
  • Acoustic Codec: Converts speech to acoustic tokens and reconstructs waveforms from these tokens.
  • MaskGCT-T2S: Predicts semantic tokens from text and prompt semantic tokens.
  • MaskGCT-S2A: Generates acoustic tokens based on semantic tokens.
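
The order in which these components run at inference time can be sketched as follows. This is a toy data-flow illustration only: every function is a hypothetical stand-in using simple arithmetic rather than a neural network, and the constants (frame size, tokens per character, number of codebooks) are assumptions chosen for readability, not values from MaskGCT.

```python
from typing import List, Tuple

SAMPLES_PER_TOKEN = 4  # toy frame size, not the real codec rate
TOKENS_PER_CHAR = 2    # toy duration rule, not how T2S decides length
NUM_CODEBOOKS = 3      # toy number of acoustic codebook layers


def semantic_codec_encode(speech: List[float]) -> List[int]:
    """Semantic Codec: speech -> semantic tokens (toy quantizer)."""
    return [int(abs(s) * 1000) % 8192 for s in speech[::SAMPLES_PER_TOKEN]]


def t2s_predict(text: str, prompt_semantic: List[int]) -> List[int]:
    """MaskGCT-T2S: text + prompt semantic tokens -> target semantic tokens."""
    offset = sum(prompt_semantic) % 8192  # stand-in for conditioning on the prompt
    return [(ord(ch) + offset + k) % 8192 for ch in text for k in range(TOKENS_PER_CHAR)]


def s2a_predict(semantic: List[int]) -> List[Tuple[int, ...]]:
    """MaskGCT-S2A: semantic tokens -> acoustic tokens (one tuple of codebook ids per frame)."""
    return [tuple((tok + layer) % 1024 for layer in range(NUM_CODEBOOKS)) for tok in semantic]


def acoustic_codec_decode(acoustic: List[Tuple[int, ...]]) -> List[float]:
    """Acoustic Codec decoder: acoustic tokens -> waveform samples."""
    return [sum(frame) / (1024 * NUM_CODEBOOKS) for frame in acoustic for _ in range(SAMPLES_PER_TOKEN)]


def synthesize(text: str, prompt_speech: List[float]) -> List[float]:
    """Chain the four components: prompt speech and text in, waveform out."""
    prompt_semantic = semantic_codec_encode(prompt_speech)
    target_semantic = t2s_predict(text, prompt_semantic)
    acoustic = s2a_predict(target_semantic)
    return acoustic_codec_decode(acoustic)
```

For example, synthesize("hi", [0.1] * 16) passes through all four stages and returns a list of float samples. In the real system each stand-in would be replaced by the corresponding pretrained model.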

Training

The model is trained on Emilia, a multilingual speech dataset designed for large-scale speech generation. Training uses the English and Chinese portions of the dataset, each with 50,000 hours of speech.

Guide: Running Locally

  1. Clone the Repository:

    git clone https://github.com/open-mmlab/Amphion.git
    cd Amphion
    
  2. Create Environment:
    Run the environment setup script:

    bash ./models/tts/maskgct/env.sh
    
  3. Download Pretrained Checkpoints:
    Use the huggingface_hub library to download the pretrained model checkpoints:

    from huggingface_hub import hf_hub_download
    
    semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")
    codec_encoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model.safetensors")
    codec_decoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors")
    t2s_model_ckpt = hf_hub_download("amphion/MaskGCT", filename="t2s_model/model.safetensors")
    s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_1layer/model.safetensors")
    s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")
    
  4. Inference:
    Use the inference script provided in the repository's models/tts/maskgct directory to generate speech from input text and a prompt speech recording.
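
The six downloads in step 3 can also be collected into one mapping, which is convenient when passing checkpoint paths to loading code. The sketch below is one possible arrangement, not part of the repository: CHECKPOINT_FILES and download_checkpoints are hypothetical names, while the repository id, file names, and the hf_hub_download call are taken from step 3. The download function is injectable so the mapping can be checked without network access.

```python
from typing import Callable, Dict, Optional

REPO_ID = "amphion/MaskGCT"

# Checkpoint files from step 3, keyed by the variable names used there.
CHECKPOINT_FILES: Dict[str, str] = {
    "semantic_code_ckpt": "semantic_codec/model.safetensors",
    "codec_encoder_ckpt": "acoustic_codec/model.safetensors",
    "codec_decoder_ckpt": "acoustic_codec/model_1.safetensors",
    "t2s_model_ckpt": "t2s_model/model.safetensors",
    "s2a_1layer_ckpt": "s2a_model/s2a_model_1layer/model.safetensors",
    "s2a_full_ckpt": "s2a_model/s2a_model_full/model.safetensors",
}


def download_checkpoints(download: Optional[Callable[..., str]] = None) -> Dict[str, str]:
    """Return a mapping from checkpoint name to local file path."""
    if download is None:
        # Imported lazily so the mapping can be inspected without huggingface_hub installed.
        from huggingface_hub import hf_hub_download as download
    return {
        name: download(REPO_ID, filename=filename)
        for name, filename in CHECKPOINT_FILES.items()
    }
```

Calling download_checkpoints() with no argument fetches the files through huggingface_hub; passing a custom callable is mainly useful for dry runs and tests.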

Cloud GPUs: If no local GPU is available, a cloud GPU instance (for example on AWS, Google Cloud, or Azure) can be used to run the model.

License

MaskGCT is released under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).
