Llasa-3B
HKUST-Audio

Introduction
Llasa-3B by HKUST-Audio is a text-to-speech (TTS) model built on the LLaMA-3B language model and extended with speech tokens from the XCodec2 codebook. It supports both Chinese and English, and can synthesize speech from text alone or conditioned on a speech prompt.
Architecture
The Llasa-3B model augments the LLaMA-3B framework with the XCodec2 speech codebook of 65,536 tokens, extending the language model's vocabulary so it can emit discrete speech tokens directly. It is trained on 250,000 hours of Chinese and English speech data.
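To make this coupling concrete, here is a minimal sketch of how a causal language model's vocabulary can be extended with 65,536 speech tokens. The base checkpoint name and the '<|s_i|>' token format are illustrative assumptions, not details taken from the model card:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    base = 'meta-llama/Llama-3.2-3B-Instruct'  # assumed base checkpoint, for illustration only
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # Add one token per XCodec2 codebook entry, then grow the embedding matrix to match.
    speech_tokens = [f'<|s_{i}|>' for i in range(65536)]  # token format is an assumption
    tokenizer.add_tokens(speech_tokens)
    model.resize_token_embeddings(len(tokenizer))

After this step, speech tokens are ordinary vocabulary entries, so the model can be trained and sampled with the standard causal language-modeling machinery.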
Training
To train the Llasa-3B model from scratch, use the LLaSA Training Repository. For scaling test-time computations, refer to the LLaSA Testing Repository.
Guide: Running Locally
- Setup Environment: install the necessary dependencies:

    conda create -n xcodec2 python=3.9
    conda activate xcodec2
    pip install xcodec2==0.1.1
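  As a quick sanity check that the environment is ready (a hypothetical smoke test, not part of the official instructions):

    python -c "from xcodec2.modeling_xcodec2 import XCodec2Model; print('xcodec2 OK')"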
- Run Speech Synthesis: for speech synthesis from text, first load the language model and the XCodec2 codec:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch
    import soundfile as sf
    from xcodec2.modeling_xcodec2 import XCodec2Model

    llasa_3b = 'HKUST-Audio/Llasa-3B'
    tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
    model = AutoModelForCausalLM.from_pretrained(llasa_3b)
    model.eval().to('cuda')

    Codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2")
    Codec_model.eval().cuda()

    input_text = 'Dealing with family secrets is never easy...'
    # Convert text to speech and save the output to 'gen.wav' (see the sketch below)
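  A minimal sketch of the remaining generation and decoding step, following the pattern shown on the Hugging Face model card. The special token names ('<|TEXT_UNDERSTANDING_START|>', '<|SPEECH_GENERATION_START|>', '<|s_...|>') and the sampling parameters should be verified against the model card rather than taken as authoritative:

    # Wrap the text in the model's text-understanding tags and prompt for speech tokens.
    formatted = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"},
    ]
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, return_tensors='pt', continue_final_message=True
    ).to('cuda')

    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            max_length=2048,
            eos_token_id=tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>'),
            do_sample=True,
            top_p=1.0,
            temperature=0.8,
        )

    # Decode generated tokens like '<|s_12345|>' back to codec ids, then to audio.
    token_strs = tokenizer.batch_decode(outputs[0][input_ids.shape[1]:-1], skip_special_tokens=True)
    speech_ids = [int(t[4:-2]) for t in token_strs if t.startswith('<|s_') and t.endswith('|>')]
    codes = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)  # shape (1, 1, T)
    gen_wav = Codec_model.decode_code(codes)
    sf.write('gen.wav', gen_wav[0, 0, :].cpu().numpy(), 16000)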
  For synthesis using a speech prompt, follow the same procedure, but additionally encode a prompt waveform with the XCodec2 codec and prepend its transcript and speech tokens to the input, so the model continues speaking in the prompt's voice (a sketch follows below).
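  A sketch of the prompt-encoding step, assuming XCodec2's encode_code interface and a 16 kHz mono reference file named 'prompt.wav' (both assumptions to check against the model card):

    # Assumed interface: encode a 16 kHz mono reference waveform into speech token ids.
    prompt_wav, sr = sf.read('prompt.wav')  # hypothetical reference recording
    prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)
    with torch.no_grad():
        vq_codes = Codec_model.encode_code(input_waveform=prompt_wav)  # shape (1, 1, T)
    prompt_tokens = ''.join(f'<|s_{int(i)}|>' for i in vq_codes[0, 0, :])
    # Then set input_text to the prompt transcript followed by the target text, and
    # start the assistant message with '<|SPEECH_GENERATION_START|>' + prompt_tokens.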
- Hardware Recommendations: utilize cloud GPUs, such as those offered by AWS, Google Cloud, or Azure, to handle the computational demands efficiently.
License
The Llasa-3B model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0).