# SenseVoice Small (FunAudioLLM)

## Introduction
SenseVoice is a speech foundation model capable of automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). It offers high-accuracy multilingual capabilities and is optimized for low-latency inference.
## Architecture
The SenseVoice-Small model uses a non-autoregressive end-to-end framework, which yields very low inference latency. It supports multilingual speech recognition, emotion recognition, and audio event detection while running efficiently at a parameter count comparable to models such as Whisper.
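To make the latency claim concrete, the sketch below times a single pass with the `funasr` `AutoModel` API used in the guide further down; the bundled `en.mp3` example clip is assumed to be present, and absolute numbers will vary with hardware.

```python
import time

from funasr import AutoModel

# Load SenseVoiceSmall without a VAD front end so the timing reflects the
# non-autoregressive model itself (assumes a CUDA device is available).
model = AutoModel(model="FunAudioLLM/SenseVoiceSmall", device="cuda:0", hub="hf")

start = time.perf_counter()
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    language="auto",
)
elapsed = time.perf_counter() - start

print(f"Transcribed in {elapsed:.2f}s: {res[0]['text']}")
```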
## Training
SenseVoice was trained on over 400,000 hours of data and supports more than 50 languages, including Mandarin, Cantonese, English, Japanese, and Korean. It is designed for easy finetuning and efficient inference, and reports strong results on multilingual speech recognition and emotion recognition tasks.
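Since the model covers many languages, the `language` argument to `generate` can be pinned instead of left on auto-detection. A minimal sketch, reusing the API from the guide below; the codes `"en"`, `"zh"`, and `"ja"` and the per-language example clips are assumptions based on the official examples, not an exhaustive list.

```python
from funasr import AutoModel

model = AutoModel(model="FunAudioLLM/SenseVoiceSmall", device="cuda:0", hub="hf")

# Pin the language when it is known in advance; "auto" (used elsewhere in
# this guide) asks the model to detect it instead.
for lang in ("en", "zh", "ja"):  # assumed codes; consult the official docs for the full list
    res = model.generate(
        input=f"{model.model_path}/example/{lang}.mp3",  # assumed per-language example clips
        language=lang,
    )
    print(lang, "->", res[0]["text"])
```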
## Guide: Running Locally
- **Installation**: Install all dependencies (the `requirements.txt` ships with the SenseVoice repository):

  ```bash
  pip install -r requirements.txt
  ```
- **Inference setup**:

  ```python
  from funasr import AutoModel
  from funasr.utils.postprocess_utils import rich_transcription_postprocess

  model_dir = "FunAudioLLM/SenseVoiceSmall"

  model = AutoModel(
      model=model_dir,
      vad_model="fsmn-vad",                            # segment long audio with a VAD front end
      vad_kwargs={"max_single_segment_time": 30000},   # cap each VAD segment at 30 s
      device="cuda:0",
      hub="hf",
  )

  res = model.generate(
      input=f"{model.model_path}/example/en.mp3",
      language="auto",       # auto-detect the spoken language
      use_itn=True,          # inverse text normalization (punctuation, numbers)
      batch_size_s=60,
      merge_vad=True,
      merge_length_s=15,
  )

  text = rich_transcription_postprocess(res[0]["text"])
  print(text)
  ```
- **Optimizations**: For short audio clips, disable the VAD front end and adjust `batch_size` for better throughput (see the sketch after this list).
- **Cloud GPUs**: Consider cloud GPU services such as AWS, Google Cloud, or Azure for accelerated processing.
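A minimal sketch of the short-audio path under the same `funasr` API, assuming that omitting `vad_model` disables VAD and that a plain `batch_size` (rather than `batch_size_s`) controls batching; the value 64 is an illustrative starting point, not a tuned setting. It also picks a GPU when one is available, local or cloud-hosted, and falls back to CPU otherwise.

```python
import torch
from funasr import AutoModel

# Use a GPU when one is available (local or cloud), otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# For short clips, skip the VAD front end entirely by not passing vad_model.
model = AutoModel(model="FunAudioLLM/SenseVoiceSmall", device=device, hub="hf")

res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    language="auto",
    use_itn=True,
    batch_size=64,  # assumed knob for batched short-audio inference; tune to GPU memory
)
print(res[0]["text"])
```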
## License
The model is released under its own model license; refer to the model license file for the full terms.