MaskGCT
amphion
Introduction
MaskGCT is a zero-shot text-to-speech (TTS) model built on a Masked Generative Codec Transformer. It generates speech without requiring explicit alignment information between text and speech, so no separate alignment step is needed.
Architecture
The architecture includes the following components:
- Semantic Codec: Converts speech to semantic tokens.
- Acoustic Codec: Converts speech to acoustic tokens and reconstructs waveforms from these tokens.
- MaskGCT-T2S: Predicts semantic tokens from text and prompt semantic tokens.
- MaskGCT-S2A: Generates acoustic tokens based on semantic tokens.
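The way these components hand data to one another can be sketched as a small pipeline. This is only an illustration of the data flow, not the Amphion API: the class name `MaskGCTPipelineSketch`, its field names, and the toy stand-in functions are all hypothetical. Two simplifications are assumptions: tokens are plain lists of integers, and the S2A stage also receives the prompt's acoustic tokens.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[int]
Waveform = List[float]


@dataclass
class MaskGCTPipelineSketch:
    """Hypothetical wiring of the four components (not the Amphion API)."""
    semantic_encode: Callable[[Waveform], Tokens]  # Semantic Codec: speech -> semantic tokens
    acoustic_encode: Callable[[Waveform], Tokens]  # Acoustic Codec: speech -> acoustic tokens
    acoustic_decode: Callable[[Tokens], Waveform]  # Acoustic Codec: acoustic tokens -> waveform
    t2s: Callable[[str, Tokens], Tokens]           # MaskGCT-T2S: text + prompt semantic tokens
    s2a: Callable[[Tokens, Tokens], Tokens]        # MaskGCT-S2A: semantic tokens -> acoustic tokens

    def synthesize(self, text: str, prompt_wave: Waveform) -> Waveform:
        # Encode the prompt speech into both token spaces.
        prompt_semantic = self.semantic_encode(prompt_wave)
        prompt_acoustic = self.acoustic_encode(prompt_wave)
        # Stage 1: predict semantic tokens from text and prompt semantic tokens.
        target_semantic = self.t2s(text, prompt_semantic)
        # Stage 2: generate acoustic tokens from the semantic tokens
        # (conditioning on prompt acoustic tokens is an assumption here).
        target_acoustic = self.s2a(target_semantic, prompt_acoustic)
        # Reconstruct the waveform from the acoustic tokens.
        return self.acoustic_decode(target_acoustic)


def build_toy_pipeline() -> MaskGCTPipelineSketch:
    """Toy stand-ins so the sketch runs end to end; they model nothing real."""
    return MaskGCTPipelineSketch(
        semantic_encode=lambda wave: [int(abs(x) * 10) for x in wave],
        acoustic_encode=lambda wave: [int(abs(x) * 100) for x in wave],
        acoustic_decode=lambda toks: [t / 100 for t in toks],
        t2s=lambda text, prompt: prompt + [ord(c) % 16 for c in text],
        s2a=lambda sem, prompt_ac: prompt_ac + [s * 2 for s in sem],
    )


if __name__ == "__main__":
    pipeline = build_toy_pipeline()
    print(pipeline.synthesize("hi", [0.5, -0.25]))
```

In a real setup each callable would wrap a loaded model; keeping them as plain callables makes the order of the two stages easy to see and to test in isolation.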
Training
The model is trained on Emilia, a multilingual speech dataset designed for large-scale speech generation. Training uses the English and Chinese data from Emilia, each with about 50,000 hours of speech.
Guide: Running Locally
- Clone the Repository:
    git clone https://github.com/open-mmlab/Amphion.git
- Create Environment:
  Run the environment setup script:
    bash ./models/tts/maskgct/env.sh
- Download Pretrained Checkpoints:
  Use the Hugging Face API to download the necessary pretrained model checkpoints:
    from huggingface_hub import hf_hub_download

    semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")
    codec_encoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model.safetensors")
    codec_decoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors")
    t2s_model_ckpt = hf_hub_download("amphion/MaskGCT", filename="t2s_model/model.safetensors")
    s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_1layer/model.safetensors")
    s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")
- Inference:
  Use the provided script to run inference and generate speech from text and prompt speech.
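The six checkpoint downloads in the step above could also be gathered into one helper. The following is a minimal sketch, not part of the Amphion repository: the function name `download_checkpoints`, the `download` parameter, and the short dictionary keys are hypothetical. Only the repository ID and file names come from the download step.

```python
from typing import Callable, Dict, Optional

REPO_ID = "amphion/MaskGCT"

# File names exactly as listed in the download step; the keys are
# hypothetical short names chosen for this sketch.
CHECKPOINT_FILES: Dict[str, str] = {
    "semantic_codec": "semantic_codec/model.safetensors",
    "codec_encoder": "acoustic_codec/model.safetensors",
    "codec_decoder": "acoustic_codec/model_1.safetensors",
    "t2s_model": "t2s_model/model.safetensors",
    "s2a_1layer": "s2a_model/s2a_model_1layer/model.safetensors",
    "s2a_full": "s2a_model/s2a_model_full/model.safetensors",
}


def download_checkpoints(
    download: Optional[Callable[..., str]] = None,
) -> Dict[str, str]:
    """Return a mapping from short name to the local path of each checkpoint.

    `download` defaults to huggingface_hub.hf_hub_download; passing another
    callable with the same (repo_id, filename=...) shape allows a dry run.
    """
    if download is None:
        # Imported lazily so this module loads without huggingface_hub installed.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return {
        name: download(REPO_ID, filename=filename)
        for name, filename in CHECKPOINT_FILES.items()
    }


if __name__ == "__main__":
    # Dry run with a stand-in downloader: prints the files that would be fetched.
    fake = lambda repo_id, filename: f"<cache>/{repo_id}/{filename}"
    for name, path in download_checkpoints(fake).items():
        print(f"{name}: {path}")
```

Calling `download_checkpoints()` with no argument performs the real downloads and needs network access. Keeping the file list in one dictionary makes it easy to check which checkpoints are expected before loading any model.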
Cloud GPUs: Consider using cloud-based GPU services like AWS, Google Cloud, or Azure for efficient model execution.
License
MaskGCT is released under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).