Ichigo Whisper v0.1
homebrewltd
Introduction
Ichigo Whisper is a compact (22M-parameter) open-source speech tokenizer developed for the Whisper-medium model. It is designed to enhance multilingual performance with minimal impact on Whisper's original English capabilities. Unlike models that output continuous embeddings, Ichigo Whisper compresses speech into discrete tokens, allowing for immediate speech understanding. It was trained on approximately 400 hours of English audio and 1,000 hours of Vietnamese audio.
Architecture
The model uses the WhisperVQ architecture and functions as a quantizer for the Whisper model. It supports both English and Vietnamese.
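The core idea of a quantizer like WhisperVQ can be illustrated with vector quantization: each continuous encoder frame is mapped to the index of its nearest codebook vector, yielding a discrete token. The sketch below is illustrative only; the codebook size, embedding dimension, and function names are assumptions, not Ichigo Whisper's actual implementation.

```python
import numpy as np

# Illustrative codebook: 512 discrete tokens, 64-dim embeddings.
# (Sizes are made up for the example, not taken from the model card.)
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Return the discrete token id for each frame (nearest codebook entry by L2)."""
    # Squared L2 distance between every frame and every codebook vector.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

frames = rng.normal(size=(10, 64))   # 10 continuous encoder frames
tokens = quantize(frames)            # shape (10,), integer ids in [0, 512)
```

Downstream components then consume these integer ids instead of continuous embeddings, which is what makes the representation a "speech tokenizer".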
Training
Hardware Specifications
- GPUs: 8 × NVIDIA A6000
Training Time
- Phase 1: 75 hours (50 epochs)
- Phase 2: 29 hours (20 epochs)
- Total Training: 104 hours
Training Details
- Phase 1: With KL Loss
- Initialization: WhisperVQ-Large-v3
- Epochs: 50
- Batch Size: 336
- Learning Rate: 1e-3
- Scheduler: Linear warm-up with Cosine decay
- Optimizer: AdamW
- Max Audio Length: 30 seconds
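Phase 1's KL loss is, in the usual distillation setup, a KL-divergence term that keeps the quantized model's output distribution close to the original (teacher) Whisper's. The exact loss Ichigo Whisper uses is not specified here; the following is a generic sketch of such a term.

```python
import numpy as np

def kl_distill_loss(student_logits: np.ndarray, teacher_logits: np.ndarray) -> float:
    """Mean KL(teacher || student) over a batch of logit rows.

    Generic distillation term; not necessarily the exact loss used in training.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)       # for numerical stability
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(teacher_logits)             # teacher (original Whisper)
    log_q = log_softmax(student_logits)             # quantized student
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```

The loss is zero when the two distributions match and grows as the quantized model drifts from the teacher.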
- Phase 2: Without KL Loss
- Initialization: Phase 1 checkpoint
- Epochs: 20
- Batch Size: 336
- Learning Rate: 1e-3
- Scheduler: Linear warm-up with Cosine decay
- Optimizer: AdamW
- Max Audio Length: 30 seconds
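Both phases use linear warm-up followed by cosine decay. A minimal sketch of that schedule, assuming a peak learning rate of 1e-3 as listed above (the warm-up length and total step count are illustrative, not from the model card):

```python
import math

def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float = 1e-3) -> float:
    """Linear warm-up from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example: 1000 total steps with a 100-step warm-up.
schedule = [lr_at(s, 1000, 100) for s in range(1001)]
```

The learning rate peaks exactly at the end of warm-up and decays smoothly to zero at the final step.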
Evaluation
- Vietnamese: Word Error Rate (WER) of 11.68 on the viVoice dataset
- English: WER of 11.89 on the LibriTTS-R dataset
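For reference, WER is the word-level edit distance between a hypothesis transcript and the reference, divided by the reference length. A self-contained implementation (standard dynamic-programming Levenshtein, not taken from the Ichigo Whisper codebase):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    prev = list(range(len(h) + 1))          # DP row for the empty reference prefix
    for i, rw in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,                  # deletion
                         cur[j - 1] + 1,               # insertion
                         prev[j - 1] + (rw != hw))     # substitution (0 if equal)
        prev = cur
    return prev[len(h)] / len(r)
```

A score of 11.68 in the table above corresponds to roughly 11.68 errors per 100 reference words.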
Guide: Running Locally
- Clone the official Ichigo Whisper repository from GitHub.
- Install necessary dependencies.
- Run the inference script:
python demo/inference.py --input path/to/your/audio.wav
- For optimal performance, consider using cloud GPUs like those provided by AWS, Google Cloud, or Azure.
License
Ichigo Whisper is released under the Apache 2.0 license, allowing for open-source use and modification.