whisper large v3 russian

antony66

Introduction

whisper-large-v3-russian is a fine-tuned version of OpenAI's Whisper large-v3 model for Russian automatic speech recognition (ASR). It was fine-tuned on the Russian portion of the Common Voice 17.0 dataset to lower the word error rate on Russian speech.

Architecture

The model keeps the transformer encoder-decoder architecture of OpenAI's Whisper. Its weights are distributed in the SafeTensors format for efficient tensor storage and loading.

Training

The model was fine-tuned using the Russian portion of the Common Voice 17.0 dataset, which contains over 200,000 entries. The data was split into training and testing sets in a 95/5 ratio. The fine-tuning process, conducted over 60 hours on dual Tesla A100 GPUs, reduced the Word Error Rate (WER) from 9.84 to 6.39.
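To make the reported metric concrete, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The source does not say which tool or text normalization produced the 9.84 and 6.39 figures, so the function names word_error_rate and relative_reduction are illustrative assumptions, not the author's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its old value."""
    return (before - after) / before


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("привет как дела", "привет так дела"))
    # Relative improvement implied by the figures quoted above.
    print(relative_reduction(9.84, 6.39))
```

Applied to the figures above, the drop from 9.84 to 6.39 is a relative reduction of roughly 35%. Real evaluations usually normalize case and punctuation before scoring, which this sketch leaves out.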

Guide: Running Locally

  1. Preprocess Audio Files: Use a tool such as SoX to resample the recording to 16 kHz, normalize its volume, and compress its dynamic range.

    sox record.wav -r 16k record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-70,-60,-20,0,0 -5 0 0.2
    
  2. Setup Python Environment:

    • Install necessary libraries, preferably in a virtual environment.
    • Ensure you have torch and transformers installed.

  3. Load and Run the Model:

    • Use the Whisper model and processor from the transformers library.
    • Configure device settings for GPU or CPU.
    • Process the audio file and obtain transcriptions.

    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
    
    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Load the fine-tuned weights (stored as SafeTensors) and the matching processor.
    whisper = WhisperForConditionalGeneration.from_pretrained("antony66/whisper-large-v3-russian", use_safetensors=True)
    processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")
    # The pipeline needs both the tokenizer and the feature extractor from the processor.
    asr_pipeline = pipeline(
        "automatic-speech-recognition", model=whisper, device=device,
        tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor,
    )
    
    # Read the preprocessed recording as raw bytes; the pipeline decodes them (this requires ffmpeg).
    with open('record-normalized.wav', 'rb') as f:
        wav = f.read()
    
    # Transcribe, telling Whisper that the audio is Russian.
    asr = asr_pipeline(wav, generate_kwargs={"language": "russian"})
    print(asr['text'])
    
  4. Cloud GPUs: For optimal performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
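Steps 1 and 2 above can be checked from Python before the model is loaded. The sketch below reports which required packages are missing and builds the SoX command from step 1 as an argument list. The helper names missing_python_packages and build_sox_command are hypothetical and not part of any library; the SoX arguments are copied from step 1 unchanged.

```python
import importlib.util
import shutil


def missing_python_packages(names=("torch", "transformers")):
    """Return the packages from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def build_sox_command(src, dst, rate="16k"):
    """Build the SoX invocation from step 1 as an argument list."""
    return [
        "sox", src, "-r", rate, dst,
        "norm", "-0.5",
        "compand", "0.3,1", "-90,-90,-70,-70,-60,-20,0,0", "-5", "0", "0.2",
    ]


if __name__ == "__main__":
    # Report problems instead of failing later in the middle of a run.
    if shutil.which("sox") is None:
        print("sox was not found on PATH")
    for name in missing_python_packages():
        print(f"missing Python package: {name}")
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(" ".join(build_sox_command("record.wav", "record-normalized.wav")))
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.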

License

The model is available on Hugging Face, and the usage terms are defined by the respective license agreements associated with the datasets and tools used in its development.
