Introduction

Llasa-3B by HKUST-Audio is a text-to-speech (TTS) model built on the LLaMA-3B language model and extended with speech tokens from the XCodec2 codebook. It supports both Chinese and English, and can synthesize speech from text alone or from text plus a reference speech prompt.

Architecture

The Llasa-3B model extends the LLaMA-3B vocabulary with XCodec2 speech tokens, one for each of the codec's 65,536 codebook entries, so the language model can generate speech autoregressively the same way it generates text. It was trained on 250,000 hours of Chinese and English speech data.
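
Concretely, each codebook entry has a corresponding token in the tokenizer, written <|s_0|> through <|s_65535|> in the upstream examples. A minimal check of this (the token naming is an assumption based on those examples; contiguity of the ids is not guaranteed):

  from transformers import AutoTokenizer

  # Verify the speech tokens exist in the extended LLaMA vocabulary.
  tok = AutoTokenizer.from_pretrained('HKUST-Audio/Llasa-3B')
  first = tok.convert_tokens_to_ids('<|s_0|>')
  last = tok.convert_tokens_to_ids('<|s_65535|>')
  print(first, last)  # valid ids, not the unknown-token id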

Training

To train the Llasa-3B model from scratch, use the LLaSA Training Repository. For scaling test-time compute, refer to the LLaSA Testing Repository.

Guide: Running Locally

  1. Setup Environment:

    • Install necessary dependencies:
      conda create -n xcodec2 python=3.9
      conda activate xcodec2
      pip install xcodec2==0.1.1
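
      A quick smoke test that the installation worked (this assumes a CUDA-capable machine, which the synthesis example below requires):

      # Verify the core packages import and a GPU is visible.
      import torch
      import xcodec2
      print(torch.cuda.is_available())  # should print True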
      
  2. Run Speech Synthesis:

    • For speech synthesis from text:

      from transformers import AutoTokenizer, AutoModelForCausalLM
      import torch
      import soundfile as sf
      
      llasa_3b = 'HKUST-Audio/Llasa-3B'
      tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
      model = AutoModelForCausalLM.from_pretrained(llasa_3b)
      model.eval().to('cuda')
      
      from xcodec2.modeling_xcodec2 import XCodec2Model
      Codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2")
      Codec_model.eval().cuda()
      
      input_text = 'Dealing with family secrets is never easy...'
      # Generation and decoding continue in the sketch below, ending with 'gen.wav'.
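
      The upstream model card completes this example by wrapping the text in special markers, sampling speech tokens through the chat template, and decoding them with XCodec2. Below is a condensed sketch of those steps; the special tokens and sampling settings mirror the model card's example and should be treated as defaults rather than requirements:

      with torch.no_grad():
          # Wrap the text in the model's text-understanding markers.
          formatted = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
          chat = [
              {"role": "user", "content": "Convert the text to speech:" + formatted},
              {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"},
          ]
          input_ids = tokenizer.apply_chat_template(
              chat, tokenize=True, return_tensors='pt', continue_final_message=True
          ).to('cuda')

          # Sample speech tokens until the speech-end marker is produced.
          outputs = model.generate(
              input_ids,
              max_length=2048,
              eos_token_id=tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>'),
              do_sample=True,
              top_p=1.0,
              temperature=0.8,
          )

          # Keep only the newly generated tokens, dropping the prompt and EOS.
          generated = tokenizer.batch_decode(outputs[0][input_ids.shape[1]:-1])
          # Map tokens like <|s_23456|> back to integer codebook indices.
          speech_ids = [int(t[4:-2]) for t in generated if t.startswith('<|s_')]
          codes = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)

          # Decode the codec indices to a 16 kHz waveform.
          gen_wav = Codec_model.decode_code(codes)

      sf.write('gen.wav', gen_wav[0, 0, :].cpu().numpy(), 16000)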
      
    • For synthesis using a speech prompt (voice cloning):

      • Proceed as in the text-only example, but first encode the prompt waveform to XCodec2 tokens and include them, together with the prompt's transcript, in the input; see the sketch below.
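
      A minimal sketch of the prompt-conditioned path, continuing from the objects created above. It assumes a 16 kHz mono recording and its transcript (both hypothetical here); the encode_code call and token handling follow the upstream XCodec2 and Llasa examples:

      # Encode the reference recording to XCodec2 codebook indices.
      prompt_wav, sr = sf.read('prompt.wav')  # hypothetical 16 kHz mono file
      prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)
      with torch.no_grad():
          vq_code = Codec_model.encode_code(input_waveform=prompt_wav)  # 1 x 1 x T

      # Render the prompt's codes as speech tokens the language model consumes.
      prompt_tokens = ''.join(f'<|s_{i}|>' for i in vq_code[0, 0, :].tolist())

      # Prepend the prompt transcript to the target text, and seed the assistant
      # turn with the prompt's speech tokens so generation continues in the
      # reference voice.
      prompt_text = 'Transcript of prompt.wav.'  # hypothetical transcript
      full_text = f"<|TEXT_UNDERSTANDING_START|>{prompt_text + ' ' + input_text}<|TEXT_UNDERSTANDING_END|>"
      chat = [
          {"role": "user", "content": "Convert the text to speech:" + full_text},
          {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + prompt_tokens},
      ]
      # Generation and decoding then proceed exactly as in the text-only example.
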
  3. Hardware Recommendations:

    • The example code expects a CUDA GPU. Cloud GPUs from providers such as AWS, Google Cloud, or Azure are a practical way to meet the model's memory and compute demands.
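
      For reference, a 3B-parameter model needs roughly 6 GB of VRAM for weights in half precision, before counting the codec and activations. If memory is tight, loading the weights in float16 is a standard transformers option (a sketch, not something specific to Llasa):

      import torch
      from transformers import AutoModelForCausalLM

      # Load the weights in fp16 to roughly halve memory versus fp32.
      model = AutoModelForCausalLM.from_pretrained(
          'HKUST-Audio/Llasa-3B', torch_dtype=torch.float16
      ).eval().to('cuda')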

License

The Llasa-3B model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0).
