CrisperWhisper
Introduction
CrisperWhisper is an enhanced version of OpenAI's Whisper model, focused on fast, precise, and verbatim speech recognition with accurate word-level timestamps. It adds filler detection and hallucination mitigation, and it is designed to transcribe every spoken word, including disfluencies, pauses, stutters, and false starts.
Architecture
CrisperWhisper builds upon Whisper Large v3, employing Dynamic Time Warping (DTW) on cross-attention scores to achieve word-level timestamps. Enhancements include an adjusted tokenizer and a custom attention loss function developed to improve timestamp accuracy.
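To make the mechanism concrete, below is a minimal DTW sketch over a toy token-by-frame cost matrix built by negating attention weights, so strong attention means low alignment cost. The function, the step rules, and the toy matrix are illustrative assumptions rather than CrisperWhisper's actual alignment code; in Whisper-family models each encoder frame covers roughly 20 ms of audio, so the first and last frame on a token's path convert directly to start and end times.

```python
import numpy as np

def dtw_path(cost: np.ndarray):
    """Return the monotonic (token, frame) path minimizing cumulative cost.

    cost: [num_tokens, num_frames], e.g. negated cross-attention weights,
    so high attention corresponds to low alignment cost.
    """
    T, F = cost.shape
    acc = np.full((T + 1, F + 1), np.inf)
    acc[0, 0] = 0.0
    for t in range(1, T + 1):
        for f in range(1, F + 1):
            # Allowed moves: advance a frame, a token, or both (diagonal).
            acc[t, f] = cost[t - 1, f - 1] + min(
                acc[t - 1, f - 1], acc[t - 1, f], acc[t, f - 1]
            )
    # Backtrace from the bottom-right corner to recover the path.
    path, t, f = [], T, F
    while t > 0 and f > 0:
        path.append((t - 1, f - 1))
        candidates = {
            (t - 1, f - 1): acc[t - 1, f - 1],
            (t - 1, f): acc[t - 1, f],
            (t, f - 1): acc[t, f - 1],
        }
        t, f = min(candidates, key=candidates.get)
    return path[::-1]

# Toy example: 3 tokens attending over 6 encoder frames.
attn = np.array([
    [0.8, 0.7, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.2, 0.8, 0.7, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.2, 0.8, 0.9],
])
path = dtw_path(-attn)  # negate: high attention -> low cost
for token in range(attn.shape[0]):
    frames = [f for t, f in path if t == token]
    print(f"token {token}: frames {frames[0]}..{frames[-1]}")
```

In the real models, attention from a set of designated alignment heads is typically averaged into a single map before running DTW, rather than using one head as in this toy.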
Training
The model is trained using datasets with word-level annotations, employing PyTorch CTC aligner for additional data generation. The training process includes augmentations with WavLM and a specialized loss function to fine-tune alignment heads. The model is trained on both English and German datasets over three stages, focusing on verbatim transcription.
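The custom loss is not spelled out here, but to sketch the idea: with word-level annotations giving each token an audio span, one can penalize cross-attention mass that falls outside the annotated span, which sharpens the attention patterns that DTW relies on. Everything in the snippet (names, span format, log-mass formulation) is a hypothetical illustration, not the actual training objective.

```python
import torch

def attention_span_loss(attn: torch.Tensor, spans) -> torch.Tensor:
    """Penalize cross-attention mass outside each token's annotated span.

    attn:  [num_tokens, num_frames]; each row is an attention distribution.
    spans: per-token (start_frame, end_frame) from word-level annotations.
    """
    loss = attn.new_zeros(())
    for token, (start, end) in enumerate(spans):
        # Mass the token places inside its true span; clamp for log stability.
        in_span = attn[token, start:end].sum().clamp_min(1e-8)
        loss = loss - torch.log(in_span)  # maximize in-span attention mass
    return loss / len(spans)

# Toy usage: 2 tokens over 5 frames, spans from a forced aligner.
attn = torch.softmax(torch.randn(2, 5), dim=-1)
print(attention_span_loss(attn, [(0, 2), (2, 5)]))
```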
Guide: Running Locally
- Installation:

```bash
pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper
```
- Python script setup (the result format is unpacked in the sketch after this list):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Pick the GPU if available; use float16 on GPU, float32 on CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "nyrahealth/CrisperWhisper"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# ASR pipeline with word-level timestamps, processing 30 s chunks.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps="word",
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a sample from a long-form LibriSpeech validation split.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result)
```
- Cloud GPUs: For optimal performance, consider using cloud GPU services like AWS, Google Cloud, or Azure to run the model.
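With return_timestamps="word", the pipeline returns a dict holding the full transcript under "text" and per-word entries under "chunks", each carrying a (start, end) timestamp in seconds; an end value can occasionally be None at chunk boundaries. A minimal loop over that structure (assuming the result variable from the script above) prints time-aligned words:

```python
# 'result' comes from the pipeline call in the script above.
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    if start is None or end is None:  # rare, e.g. at chunk boundaries
        continue
    print(f"{start:7.2f}s - {end:7.2f}s  {chunk['text'].strip()}")
```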
License
CrisperWhisper is licensed under CC BY-NC 4.0, which allows non-commercial use with attribution.