Introduction

Llasa-3B by HKUST-Audio is a text-to-speech (TTS) model built on the LLaMA-3B language model and extended with speech tokens from the XCodec2 codebook. It supports both Chinese and English, and can synthesize speech from text alone or from text plus a reference speech prompt.

Architecture

The Llasa-3B model extends the LLaMA-3B vocabulary with XCodec2 speech tokens, one for each of the codec's 65,536 codebook entries, so the language model can generate speech autoregressively the same way it generates text. It was trained on 250,000 hours of Chinese and English speech data.
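
Concretely, each codebook entry has a corresponding token in the tokenizer, written <|s_0|> through <|s_65535|> in the upstream examples. A minimal check of this (the token naming is an assumption based on those examples; contiguity of the ids is not guaranteed):

  from transformers import AutoTokenizer

  # Verify the speech tokens exist in the extended LLaMA vocabulary.
  tok = AutoTokenizer.from_pretrained('HKUST-Audio/Llasa-3B')
  first = tok.convert_tokens_to_ids('<|s_0|>')
  last = tok.convert_tokens_to_ids('<|s_65535|>')
  print(first, last)  # valid ids, not the unknown-token id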

Training

To train the Llasa-3B model from scratch, use the LLaSA Training Repository. For scaling test-time compute, refer to the LLaSA Testing Repository.

Guide: Running Locally

  1. Setup Environment:

    • Install necessary dependencies:
      conda create -n xcodec2 python=3.9
      conda activate xcodec2
      pip install xcodec2==0.1.1
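
      A quick smoke test that the installation worked (this assumes a CUDA-capable machine, which the synthesis example below requires):

      # Verify the core packages import and a GPU is visible.
      import torch
      import xcodec2
      print(torch.cuda.is_available())  # should print True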
      
  2. Run Speech Synthesis:

    • For speech synthesis from text:

      from transformers import AutoTokenizer, AutoModelForCausalLM
      import torch
      import soundfile as sf
      
      llasa_3b = 'HKUST-Audio/Llasa-3B'
      tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
      model = AutoModelForCausalLM.from_pretrained(llasa_3b)
      model.eval().to('cuda')
      
      from xcodec2.modeling_xcodec2 import XCodec2Model
      Codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2")
      Codec_model.eval().cuda()
      
      input_text = 'Dealing with family secrets is never easy...'
      # Generation and decoding continue in the sketch below, ending with 'gen.wav'.
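
      The upstream model card completes this example by wrapping the text in special markers, sampling speech tokens through the chat template, and decoding them with XCodec2. Below is a condensed sketch of those steps; the special tokens and sampling settings mirror the model card's example and should be treated as defaults rather than requirements:

      with torch.no_grad():
          # Wrap the text in the model's text-understanding markers.
          formatted = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
          chat = [
              {"role": "user", "content": "Convert the text to speech:" + formatted},
              {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"},
          ]
          input_ids = tokenizer.apply_chat_template(
              chat, tokenize=True, return_tensors='pt', continue_final_message=True
          ).to('cuda')

          # Sample speech tokens until the speech-end marker is produced.
          outputs = model.generate(
              input_ids,
              max_length=2048,
              eos_token_id=tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>'),
              do_sample=True,
              top_p=1.0,
              temperature=0.8,
          )

          # Keep only the newly generated tokens, dropping the prompt and EOS.
          generated = tokenizer.batch_decode(outputs[0][input_ids.shape[1]:-1])
          # Map tokens like <|s_23456|> back to integer codebook indices.
          speech_ids = [int(t[4:-2]) for t in generated if t.startswith('<|s_')]
          codes = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)

          # Decode the codec indices to a 16 kHz waveform.
          gen_wav = Codec_model.decode_code(codes)

      sf.write('gen.wav', gen_wav[0, 0, :].cpu().numpy(), 16000)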
      
    • For synthesis using a speech prompt (voice cloning):

      • Proceed as in the text-only example, but first encode the prompt waveform to XCodec2 tokens and include them, together with the prompt's transcript, in the input; see the sketch below.
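
      A minimal sketch of the prompt-conditioned path, continuing from the objects created above. It assumes a 16 kHz mono recording and its transcript (both hypothetical here); the encode_code call and token handling follow the upstream XCodec2 and Llasa examples:

      # Encode the reference recording to XCodec2 codebook indices.
      prompt_wav, sr = sf.read('prompt.wav')  # hypothetical 16 kHz mono file
      prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)
      with torch.no_grad():
          vq_code = Codec_model.encode_code(input_waveform=prompt_wav)  # 1 x 1 x T

      # Render the prompt's codes as speech tokens the language model consumes.
      prompt_tokens = ''.join(f'<|s_{i}|>' for i in vq_code[0, 0, :].tolist())

      # Prepend the prompt transcript to the target text, and seed the assistant
      # turn with the prompt's speech tokens so generation continues in the
      # reference voice.
      prompt_text = 'Transcript of prompt.wav.'  # hypothetical transcript
      full_text = f"<|TEXT_UNDERSTANDING_START|>{prompt_text + ' ' + input_text}<|TEXT_UNDERSTANDING_END|>"
      chat = [
          {"role": "user", "content": "Convert the text to speech:" + full_text},
          {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + prompt_tokens},
      ]
      # Generation and decoding then proceed exactly as in the text-only example.
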
  3. Hardware Recommendations:

    • The example code expects a CUDA GPU. Cloud GPUs from providers such as AWS, Google Cloud, or Azure are a practical way to meet the model's memory and compute demands.
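
      For reference, a 3B-parameter model needs roughly 6 GB of VRAM for weights in half precision, before counting the codec and activations. If memory is tight, loading the weights in float16 is a standard transformers option (a sketch, not something specific to Llasa):

      import torch
      from transformers import AutoModelForCausalLM

      # Load the weights in fp16 to roughly halve memory versus fp32.
      model = AutoModelForCausalLM.from_pretrained(
          'HKUST-Audio/Llasa-3B', torch_dtype=torch.float16
      ).eval().to('cuda')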

License

The Llasa-3B model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0).
