CosyVoice2-0.5B

FunAudioLLM

Introduction

CosyVoice2-0.5B is a multilingual text-to-speech (TTS) synthesizer developed by FunAudioLLM. It is designed for zero-shot, cross-lingual, and streaming inference, delivering high-quality audio outputs. The model supports multiple languages and uses advanced techniques for text processing and speech synthesis.

Architecture

CosyVoice2-0.5B is distributed in ONNX and Safetensors formats and implements a scalable framework for TTS applications. The architecture supports several inference modes, including zero-shot, cross-lingual, and streaming synthesis, and uses repetition-aware sampling during decoding for improved stability. The project roadmap includes broader multilingual data support and further optimization of streaming capabilities.
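
The repetition-aware sampling mentioned above can be illustrated with a minimal sketch. The window size, threshold, and function name below are illustrative assumptions, not the model's actual implementation; the idea is simply to fall back from greedy decoding to random sampling when the decoder starts looping:

```python
import random

def repetition_aware_sample(greedy_token, recent_tokens, vocab_size,
                            window=10, threshold=0.5, rng=random):
    """Illustrative sketch: fall back to random sampling when the greedy
    token dominates the recent decoding window, curbing the repetition
    loops that can destabilize autoregressive TTS decoders."""
    window_tokens = recent_tokens[-window:]
    if window_tokens:
        repetition = window_tokens.count(greedy_token) / len(window_tokens)
        if repetition >= threshold:
            # Too repetitive: draw a token at random instead of taking argmax.
            return rng.randrange(vocab_size)
    return greedy_token

# A decoder stuck emitting token 7 gets nudged off the repeated token:
history = [7] * 10
resampled = repetition_aware_sample(7, history, vocab_size=100,
                                    rng=random.Random(0))
```

In the real model the fallback would sample from the predicted token distribution rather than uniformly; the uniform draw here just keeps the sketch self-contained.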

Training

The model is trained with flow matching and uses WeTextProcessing for text normalization, with planned improvements to repetition-aware sampling and streaming inference. It can be fine-tuned on domain-specific datasets for better performance in targeted applications.
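
Flow matching trains a network to predict the velocity field that transports noise samples to data along a straight interpolation path. The following is a minimal, framework-agnostic sketch of that training objective with NumPy stand-ins; the real model additionally conditions on text and speaker features, and `zero_model` is just a dummy predictor for illustration:

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """Conditional flow-matching loss for one batch.
    x1: clean data samples (e.g. mel-spectrogram frames), shape (B, D).
    Along the straight path x_t = (1-t)*x0 + t*x1, the target
    velocity is simply x1 - x0."""
    x0 = rng.standard_normal(x1.shape)      # noise samples
    t = rng.uniform(size=(x1.shape[0], 1))  # random timesteps in [0, 1)
    xt = (1.0 - t) * x0 + t * x1            # point on the interpolation path
    v_target = x1 - x0                      # ground-truth velocity
    v_pred = model(xt, t)                   # network prediction
    return np.mean((v_pred - v_target) ** 2)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))                 # fake batch of "data"
zero_model = lambda xt, t: np.zeros_like(xt)     # dummy network
loss = flow_matching_loss(zero_model, x1, rng)   # positive scalar
```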

Guide: Running Locally

Basic Steps

  1. Clone the Repository:

    git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
    cd CosyVoice
    git submodule update --init --recursive
    
  2. Install Conda and Dependencies:

    • Install Conda (e.g. Miniconda or Anaconda).
    • Create and activate a Conda environment:
      conda create -n cosyvoice python=3.10
      conda activate cosyvoice
      conda install -y -c conda-forge pynini==2.1.5
      pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
      
  3. Install System Dependencies (if needed):

    • For Ubuntu:
      sudo apt-get install sox libsox-dev
      
    • For CentOS:
      sudo yum install sox sox-devel
      
  4. Download Pretrained Models: Use the modelscope SDK for downloading models:

    from modelscope import snapshot_download
    snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
    
  5. Basic Usage for TTS:

    • Set up the PYTHONPATH:
      export PYTHONPATH=third_party/Matcha-TTS
      
    • Run the inference script to generate audio:
      from cosyvoice.cli.cosyvoice import CosyVoice2
      from cosyvoice.utils.file_utils import load_wav
      import torchaudio
      
      # Load the CosyVoice2-0.5B model downloaded in step 4
      cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)
      prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
      # Zero-shot inference takes the target text, a transcript of the prompt audio, and the prompt waveform
      for i, j in enumerate(cosyvoice.inference_zero_shot('Your text here', 'Transcript of the prompt audio', prompt_speech_16k, stream=False)):
          torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
      

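With stream=True, inference_zero_shot yields audio chunks incrementally instead of one finished utterance, so downstream code typically plays or buffers each chunk as it arrives and concatenates them at the end. The sketch below shows only that consumer side; fake_streaming_tts is a dummy generator standing in for the model, and the chunk shapes are illustrative assumptions (the real model yields torch tensors):

```python
import numpy as np

def fake_streaming_tts(num_chunks=3, chunk_len=2400):
    """Stand-in for cosyvoice.inference_zero_shot(..., stream=True):
    yields one dict with a 'tts_speech' array per chunk."""
    for _ in range(num_chunks):
        yield {'tts_speech': np.zeros((1, chunk_len), dtype=np.float32)}

chunks = []
for out in fake_streaming_tts():
    chunks.append(out['tts_speech'])    # play or buffer each chunk here
audio = np.concatenate(chunks, axis=1)  # assemble the full utterance
```
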
Cloud GPUs

For faster inference, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure.

License

CosyVoice2-0.5B is intended for academic and demonstration purposes. Ensure compliance with relevant licenses and terms of use when deploying or modifying the software.
