ViWhisper-medium
Introduction
We introduce ViWhisper-medium, a Vietnamese speech recognition model fine-tuned from OpenAI's Whisper Medium on the VSV-1100 dataset.
Architecture
The model is based on the openai/whisper-medium architecture, optimized for Vietnamese language processing.
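For reference, the key dimensions of the Whisper Medium architecture can be read off the checkpoint's configuration (a quick sketch using standard WhisperConfig fields):

from transformers import WhisperConfig

config = WhisperConfig.from_pretrained("NhutP/ViWhisper-medium")
# Whisper Medium: 24 encoder layers, 24 decoder layers, hidden size 1024
print(config.encoder_layers, config.decoder_layers, config.d_model)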
Training
The model was trained using a composite dataset:
- VSV-1100: 1100 hours
- CMV14-vi: 11 hours
- VIVOS: 3.04 hours
- VLSP2021: 180 hours
The listed corpora sum to about 1294 hours; together with synthetic speech from a text-to-speech model, employed to augment the dataset with sentences containing infrequent words, total training data amounts to 1308 hours.
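As a rough illustration (not the published training recipe), corpora like these can be merged with the datasets library. The Hub IDs and column names below are assumptions, and VSV-1100 and VLSP2021 are omitted:

from datasets import load_dataset, concatenate_datasets, Audio

# Assumed Hub IDs; only the public corpora are shown here.
vivos = load_dataset("AILAB-VNUHCM/vivos", split="train")
cv14 = load_dataset("mozilla-foundation/common_voice_14_0", "vi", split="train")

# Resample everything to the 16 kHz rate Whisper expects.
vivos = vivos.cast_column("audio", Audio(sampling_rate=16_000))
cv14 = cv14.cast_column("audio", Audio(sampling_rate=16_000))

# Both corpora are assumed to expose "audio" and "sentence" columns.
combined = concatenate_datasets([
    vivos.select_columns(["audio", "sentence"]),
    cv14.select_columns(["audio", "sentence"]),
])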
WER Results
The Word Error Rate (WER, %) was evaluated on the following test sets:
- CMV14-vi: 8.1
- VIVOS: 4.69
- VLSP2020-T1: 13.22
- VLSP2020-T2: 28.76
- VLSP2021-T1: 11.78
- VLSP2021-T2: 8.28
- Bud500: 5.38
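For context, scores like these can be computed with the evaluate library's WER metric (a minimal sketch; the strings below are placeholders, not real model outputs):

import evaluate

wer_metric = evaluate.load("wer")
# Placeholder strings; in practice, predictions come from the model and
# references from the test set's transcripts.
predictions = ["xin chao viet nam"]
references = ["xin chào việt nam"]
print(100 * wer_metric.compute(predictions=predictions, references=references))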
Guide: Running Locally
Basic Steps
- Install Dependencies:
pip install transformers librosa torch
- Load Model and Processor:
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("NhutP/ViWhisper-medium")
model = WhisperForConditionalGeneration.from_pretrained("NhutP/ViWhisper-medium")
- Inference:
import librosa

# Whisper expects 16 kHz mono audio
array, sampling_rate = librosa.load('path_to_audio', sr=16000)
input_features = processor(array, sampling_rate=sampling_rate, return_tensors="pt").input_features
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
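Alternatively, the load-and-transcribe steps can be wrapped in a single call with the transformers automatic-speech-recognition pipeline:

from transformers import pipeline

# The pipeline handles audio loading, resampling, and decoding in one step.
pipe = pipeline("automatic-speech-recognition", model="NhutP/ViWhisper-medium")
print(pipe('path_to_audio')["text"])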
Suggested Cloud GPUs
Consider using a cloud GPU service such as AWS, Google Cloud, or Azure for faster inference.
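On a GPU machine, move the model and features to the device before generating (a minimal sketch assuming CUDA; the variables come from the inference step above):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
input_features = input_features.to(device)
predicted_ids = model.generate(input_features)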
License
The model is released under the MIT License.