# SenseVoice Small (FunAudioLLM)

## Introduction
SenseVoice is a speech foundation model capable of automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). It offers high-accuracy multilingual capabilities and is optimized for low-latency inference.
## Architecture
The SenseVoice-Small model uses a non-autoregressive end-to-end framework, which yields very low inference latency. It supports multilingual speech recognition, emotion recognition, and audio event detection while running efficiently at a parameter count comparable to models such as Whisper.
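To make the latency claim concrete, the sketch below times a single pass with the `funasr` `AutoModel` API used in the guide further down; the bundled `en.mp3` example clip is assumed to be present, and absolute numbers will vary with hardware.

```python
import time

from funasr import AutoModel

# Load SenseVoiceSmall without a VAD front end so the timing reflects the
# non-autoregressive model itself (assumes a CUDA device is available).
model = AutoModel(model="FunAudioLLM/SenseVoiceSmall", device="cuda:0", hub="hf")

start = time.perf_counter()
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    language="auto",
)
elapsed = time.perf_counter() - start

print(f"Transcribed in {elapsed:.2f}s: {res[0]['text']}")
```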
## Training
SenseVoice was trained on over 400,000 hours of data and supports more than 50 languages, including Mandarin, Cantonese, English, Japanese, and Korean. It is designed for easy finetuning and efficient inference, and reports strong results on multilingual speech recognition and emotion recognition tasks.
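Since the model covers many languages, the `language` argument to `generate` can be pinned instead of left on auto-detection. A minimal sketch, reusing the API from the guide below; the codes `"en"`, `"zh"`, and `"ja"` and the per-language example clips are assumptions based on the official examples, not an exhaustive list.

```python
from funasr import AutoModel

model = AutoModel(model="FunAudioLLM/SenseVoiceSmall", device="cuda:0", hub="hf")

# Pin the language when it is known in advance; "auto" (used elsewhere in
# this guide) asks the model to detect it instead.
for lang in ("en", "zh", "ja"):  # assumed codes; consult the official docs for the full list
    res = model.generate(
        input=f"{model.model_path}/example/{lang}.mp3",  # assumed per-language example clips
        language=lang,
    )
    print(lang, "->", res[0]["text"])
```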
## Guide: Running Locally
- **Installation**: Install all dependencies (the `requirements.txt` ships with the SenseVoice repository):

  ```bash
  pip install -r requirements.txt
  ```
- **Inference setup**:

  ```python
  from funasr import AutoModel
  from funasr.utils.postprocess_utils import rich_transcription_postprocess

  model_dir = "FunAudioLLM/SenseVoiceSmall"

  model = AutoModel(
      model=model_dir,
      vad_model="fsmn-vad",                            # segment long audio with a VAD front end
      vad_kwargs={"max_single_segment_time": 30000},   # cap each VAD segment at 30 s
      device="cuda:0",
      hub="hf",
  )

  res = model.generate(
      input=f"{model.model_path}/example/en.mp3",
      language="auto",       # auto-detect the spoken language
      use_itn=True,          # inverse text normalization (punctuation, numbers)
      batch_size_s=60,
      merge_vad=True,
      merge_length_s=15,
  )

  text = rich_transcription_postprocess(res[0]["text"])
  print(text)
  ```
- **Optimizations**: For short audio clips, disable the VAD front end and adjust `batch_size` for better throughput (see the sketch after this list).
- **Cloud GPUs**: Consider cloud GPU services such as AWS, Google Cloud, or Azure for accelerated processing.
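A minimal sketch of the short-audio path under the same `funasr` API, assuming that omitting `vad_model` disables VAD and that a plain `batch_size` (rather than `batch_size_s`) controls batching; the value 64 is an illustrative starting point, not a tuned setting. It also picks a GPU when one is available, local or cloud-hosted, and falls back to CPU otherwise.

```python
import torch
from funasr import AutoModel

# Use a GPU when one is available (local or cloud), otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# For short clips, skip the VAD front end entirely by not passing vad_model.
model = AutoModel(model="FunAudioLLM/SenseVoiceSmall", device=device, hub="hf")

res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    language="auto",
    use_itn=True,
    batch_size=64,  # assumed knob for batched short-audio inference; tune to GPU memory
)
print(res[0]["text"])
```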
## License
The model is released under its own model license; refer to the model license file for the full terms.