CosyVoice2-0.5B
Introduction
CosyVoice2-0.5B is a multilingual text-to-speech (TTS) model developed by FunAudioLLM. It supports zero-shot, cross-lingual, and streaming inference and delivers high-quality synthesized speech across multiple languages, combining dedicated text normalization with a flow-matching-based synthesis pipeline.
Architecture
CosyVoice2-0.5B distributes its weights in ONNX and Safetensors formats and is built as a scalable framework for TTS applications. It supports zero-shot and cross-lingual inference modes and uses repetition-aware sampling for improved decoding stability. The project roadmap includes broader multilingual data support and further optimization of streaming inference.
Training
Training relies on flow matching for speech generation, with text normalization handled by WeTextProcessing; planned improvements cover repetition-aware sampling and streaming inference. The model can also be fine-tuned on domain-specific datasets to improve performance in targeted applications.
Guide: Running Locally
Basic Steps
- Clone the Repository:

```bash
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git submodule update --init --recursive
```
- Install Conda and Dependencies:
  - Install Conda (Miniconda or Anaconda).
  - Create and activate a Conda environment, then install the Python requirements:

```bash
conda create -n cosyvoice python=3.10
conda activate cosyvoice
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```
- Install System Dependencies (if needed):
  - For Ubuntu:

```bash
sudo apt-get install sox libsox-dev
```

  - For CentOS:

```bash
sudo yum install sox sox-devel
```
- Download Pretrained Models: use the `modelscope` SDK to download the model:

```python
from modelscope import snapshot_download
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
```
- Basic Usage for TTS:
  - Set up the PYTHONPATH:

```bash
export PYTHONPATH=third_party/Matcha-TTS
```

  - Run an inference script to generate audio (a streaming variant is sketched after this list):

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Load the CosyVoice2-0.5B model downloaded in the previous step.
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)

# 16 kHz reference audio used as the zero-shot voice prompt.
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)

# Zero-shot synthesis: pass the text to synthesize, the transcript of the prompt audio, and the prompt audio.
for i, j in enumerate(cosyvoice.inference_zero_shot('Your text here', 'Transcript of the prompt audio.', prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```
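The non-streaming zero-shot call above has streaming and cross-lingual counterparts. The sketch below is a minimal example, assuming the call signatures of the upstream CosyVoice repository (`stream=True` on `inference_zero_shot`, and `inference_cross_lingual(text, prompt_speech_16k, stream=...)`); verify both against the installed version before relying on them.

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)

# Streaming zero-shot synthesis: with stream=True the generator yields audio
# chunks as they become available instead of a single final waveform.
for i, chunk in enumerate(cosyvoice.inference_zero_shot(
        'Your text here', 'Transcript of the prompt audio.', prompt_speech_16k, stream=True)):
    torchaudio.save('zero_shot_stream_{}.wav'.format(i), chunk['tts_speech'], cosyvoice.sample_rate)

# Cross-lingual synthesis (assumed upstream signature): the target text may be in a
# different language from the prompt audio.
for i, chunk in enumerate(cosyvoice.inference_cross_lingual(
        'Text in another language.', prompt_speech_16k, stream=False)):
    torchaudio.save('cross_lingual_{}.wav'.format(i), chunk['tts_speech'], cosyvoice.sample_rate)
```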
Cloud GPUs
For faster inference, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure.
License
CosyVoice2-0.5B is intended for academic and demonstration purposes. Ensure compliance with relevant licenses and terms of use when deploying or modifying the software.