whisper large v3 russian
antony66
Introduction
whisper-large-v3-russian is a fine-tuned version of OpenAI's Whisper large-v3 model, optimized for Russian automatic speech recognition (ASR). It was fine-tuned on the Russian portion of the Common Voice 17.0 dataset to improve recognition accuracy.
Architecture
The model keeps the encoder-decoder transformer architecture of OpenAI's Whisper. Its weights can be loaded in the safetensors format, which stores tensors in a simple layout that is efficient to read.
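To make the storage format concrete, here is a minimal sketch of how a safetensors file is laid out: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw tensor bytes. The sketch builds a toy file in memory rather than reading the real checkpoint; the tensor name `toy.weight` and both helper functions are hypothetical and exist only for illustration.

```python
import json
import struct


def read_safetensors_header(data: bytes) -> dict:
    """Parse the JSON header at the start of a safetensors byte string."""
    # First 8 bytes: little-endian unsigned 64-bit header length.
    (header_len,) = struct.unpack("<Q", data[:8])
    # Next header_len bytes: UTF-8 JSON mapping tensor names to metadata.
    return json.loads(data[8 : 8 + header_len].decode("utf-8"))


def build_toy_safetensors() -> bytes:
    """Build a tiny in-memory file holding one hypothetical 2x2 float32 tensor."""
    payload = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # 16 bytes of raw data
    header = {
        "toy.weight": {  # hypothetical name, not a tensor of the real model
            "dtype": "F32",
            "shape": [2, 2],
            "data_offsets": [0, len(payload)],
        }
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


if __name__ == "__main__":
    blob = build_toy_safetensors()
    for name, info in read_safetensors_header(blob).items():
        print(name, info["dtype"], info["shape"], info["data_offsets"])
```

Because the header lists every tensor's dtype, shape, and byte offsets, a loader can locate a single tensor without parsing the rest of the file. In practice the `safetensors` library handles this for you.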
Training
The model was fine-tuned using the Russian portion of the Common Voice 17.0 dataset, which contains over 200,000 entries. The data was split into training and testing sets in a 95/5 ratio. The fine-tuning process, conducted over 60 hours on dual Tesla A100 GPUs, reduced the Word Error Rate (WER) from 9.84 to 6.39.
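For readers unfamiliar with the metric, the sketch below shows one common way to compute WER: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. It is an illustration only, not the evaluation script used for this model; the function names and sample sentences are assumptions. The last helper puts the reported numbers in perspective: going from 9.84 to 6.39 is a relative reduction of about 35%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its old value."""
    return (before - after) / before


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(f"{word_error_rate('привет как дела', 'привет как дело'):.3f}")
    # Relative improvement implied by the figures reported above.
    print(f"{relative_reduction(9.84, 6.39):.1%}")
```

Real evaluations usually normalize case and punctuation before scoring, so scores from this sketch may differ from published ones.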
Guide: Running Locally
- Preprocess Audio Files: Normalize and adjust the volume of your audio files using a tool like Sox.

    sox record.wav -r 16k record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-70,-60,-20,0,0 -5 0 0.2
- Setup Python Environment:
  - Install the necessary libraries, preferably in a virtual environment.
  - Ensure you have torch and transformers installed.
- Load and Run the Model:
  - Use the Whisper model and processor from the transformers library.
  - Configure device settings for GPU or CPU.
  - Process the audio file and obtain transcriptions.

    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    whisper = WhisperForConditionalGeneration.from_pretrained(
        "antony66/whisper-large-v3-russian", use_safetensors=True
    )
    processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

    # The model is passed as an object, so the tokenizer and the
    # feature extractor both have to be supplied explicitly.
    asr_pipeline = pipeline(
        "automatic-speech-recognition",
        model=whisper,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        device=device,
    )

    # The pipeline accepts the raw bytes of an audio file (decoding needs ffmpeg).
    with open('record-normalized.wav', 'rb') as f:
        wav = f.read()

    asr = asr_pipeline(wav, generate_kwargs={"language": "russian"})
    print(asr['text'])
- Cloud GPUs: If no local GPU is available, consider cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
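The sox command in the preprocessing step resamples the recording to 16 kHz, the sample rate Whisper models expect. As a quick sanity check before transcription, the sketch below reads the WAV header with Python's standard `wave` module. The helper names `describe_wav` and `is_whisper_ready` are hypothetical, and the check assumes an uncompressed PCM WAV file, which is all the `wave` module can read.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Return sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        frames = wav_file.getnframes()
        return {
            "sample_rate": rate,
            "channels": wav_file.getnchannels(),
            "duration_s": frames / rate,
        }


def is_whisper_ready(path: str, expected_rate: int = 16000) -> bool:
    """True when the file already uses the expected sample rate."""
    return describe_wav(path)["sample_rate"] == expected_rate


if __name__ == "__main__":
    # Demo on one second of generated silence, so the sketch runs
    # without a real recording; point it at record-normalized.wav instead.
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "silence.wav")
        with wave.open(demo_path, "wb") as out:
            out.setnchannels(1)      # mono
            out.setsampwidth(2)      # 16-bit samples
            out.setframerate(16000)  # 16 kHz
            out.writeframes(b"\x00\x00" * 16000)
        print(describe_wav(demo_path))
        print(is_whisper_ready(demo_path))
```

If the check fails, rerunning the sox command with `-r 16k` is one way to bring the file to the expected rate.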
License
The model is available on Hugging Face, and the usage terms are defined by the respective license agreements associated with the datasets and tools used in its development.