CrisperWhisper

nyrahealth

Introduction

CrisperWhisper is an enhanced version of OpenAI's Whisper model, focusing on fast, precise, and verbatim speech recognition with accurate word-level timestamps. It includes features such as filler detection and hallucination mitigation. The model is designed to transcribe every spoken word, including disfluencies, pauses, stutters, and false starts.

Architecture

CrisperWhisper builds upon Whisper Large v3, employing Dynamic Time Warping (DTW) on cross-attention scores to achieve word-level timestamps. Enhancements include an adjusted tokenizer and a custom attention loss function developed to improve timestamp accuracy.
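
As a rough illustration of the DTW step, the sketch below aligns tokens to audio frames on a toy attention matrix. It is not CrisperWhisper's actual implementation: the function names `dtw_path` and `token_start_times`, the cost definition `1 - attention`, and the default of 0.02 seconds per frame are assumptions made for this example.

```python
import numpy as np


def dtw_path(cost):
    """Return the minimum-cost monotonic path through a (tokens x frames) cost matrix."""
    n, m = cost.shape
    # acc[i, j] holds the cheapest accumulated cost of reaching cell (i - 1, j - 1).
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the last cell to the first, always taking the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
        k = int(np.argmin(steps))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def token_start_times(attention, seconds_per_frame=0.02):
    """Map each token to the time of the first frame that DTW assigns to it."""
    # Assumed cost for this sketch: high attention means low alignment cost.
    cost = 1.0 - np.asarray(attention, dtype=float)
    starts = {}
    for token, frame in dtw_path(cost):
        starts.setdefault(token, frame * seconds_per_frame)
    return [starts[t] for t in range(cost.shape[0])]


# Toy attention: 3 tokens over 6 frames, each token attending to two consecutive frames.
toy_attention = np.array([
    [0.9, 0.8, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.8, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.9, 0.8],
])
print(token_start_times(toy_attention))
```

In this toy case each token starts at the first of its two high-attention frames. Word-level timestamps would then follow by grouping the tokens that belong to the same word.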

Training

The model is trained on datasets with word-level annotations, and the PyTorch CTC aligner is used to generate additional time-aligned training data. Training applies WavLM augmentations and a specialized loss function to fine-tune the alignment heads. It covers both English and German datasets and proceeds in three stages, with a focus on verbatim transcription.

Guide: Running Locally

  1. Installation:

    pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper
    
  2. Python Script Setup:

    import torch
    from datasets import load_dataset
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
    
    # Use the GPU with half precision when available; otherwise fall back to CPU and float32.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    model_id = "nyrahealth/CrisperWhisper"
    
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
    )
    model.to(device)
    
    processor = AutoProcessor.from_pretrained(model_id)
    
    # Build an ASR pipeline that processes 30-second chunks and returns word-level timestamps.
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,
        batch_size=16,
        return_timestamps="word",
        torch_dtype=torch_dtype,
        device=device,
    )
    
    # Transcribe the first sample of a long-form LibriSpeech validation set.
    dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
    sample = dataset[0]["audio"]
    result = pipe(sample)
    print(result)
    
  3. Cloud GPUs: The script falls back to CPU when no CUDA device is available. For faster inference, consider running the model on a cloud GPU service such as AWS, Google Cloud, or Azure.
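
The `result` printed in step 2 can be post-processed to inspect individual words and the pauses between them. The sketch below assumes the usual output shape of the Transformers speech recognition pipeline with `return_timestamps="word"`, namely a dict with a `chunks` list whose entries carry `text` and a `(start, end)` `timestamp`. The helper name `words_with_pauses` and the sample data are hypothetical and only serve the illustration.

```python
def words_with_pauses(result):
    """Flatten pipeline output into (word, start, end, pause_before) tuples."""
    rows = []
    prev_end = None
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        pause = None
        # The pause is the gap between the previous word's end and this word's start.
        if prev_end is not None and start is not None:
            pause = round(start - prev_end, 3)
        rows.append((chunk["text"].strip(), start, end, pause))
        if end is not None:
            prev_end = end
    return rows


# Hypothetical output in the assumed shape; real results come from `pipe(sample)`.
example_result = {
    "text": " so um hello",
    "chunks": [
        {"text": " so", "timestamp": (0.0, 0.3)},
        {"text": " um", "timestamp": (0.5, 0.8)},
        {"text": " hello", "timestamp": (0.8, 1.2)},
    ],
}

for word, start, end, pause in words_with_pauses(example_result):
    print(word, start, end, pause)
```

Because the model transcribes fillers and disfluencies verbatim, a table like this makes it easy to see where a speaker hesitated and for how long.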

License

CrisperWhisper is licensed under CC-BY-NC-4.0, which permits non-commercial use with attribution.
