Ichigo Whisper v0.1

homebrewltd

Introduction

Ichigo Whisper is a compact (22M-parameter) open-source speech tokenizer built on the Whisper-medium model. It is designed to improve multilingual performance while minimally affecting Whisper's original English capabilities. Unlike models that output continuous embeddings, Ichigo Whisper compresses speech into discrete tokens, which token-based language models can consume directly. It was trained on approximately 400 hours of English and 1,000 hours of Vietnamese audio.

Architecture

The model uses the WhisperVQ architecture and functions as a quantizer for the Whisper model. It supports both English and Vietnamese.
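The quantization step can be illustrated with a minimal sketch: a learned codebook maps each continuous encoder frame to the index of its nearest code vector, turning a sequence of embeddings into a sequence of discrete tokens. The sizes below (512 codes, 1024-dim embeddings) are illustrative assumptions, not Ichigo Whisper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
codebook = rng.normal(size=(512, 1024)).astype(np.float32)   # 512 learned codes
frames = rng.normal(size=(100, 1024)).astype(np.float32)     # 100 encoder frames

def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each continuous frame to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every code.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # One discrete token per frame.
    return d2.argmin(axis=1)

tokens = quantize(frames, codebook)
```

Decoding back to embeddings is just a table lookup, `codebook[tokens]`, which is why the representation remains usable by downstream models despite being discrete.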

Training

Hardware Specifications

  • GPUs: 8 × NVIDIA A6000

Training Time

  • Phase 1: 75 hours (50 epochs)
  • Phase 2: 29 hours (20 epochs)
  • Total Training: 104 hours

Training Details

  • Phase 1: With KL Loss

    • Initialization: WhisperVQ-Large-v3
    • Epochs: 50
    • Batch Size: 336
    • Learning Rate: 1e-3
    • Scheduler: Linear warm-up with Cosine decay
    • Optimizer: AdamW
    • Max Audio Length: 30 seconds
  • Phase 2: Without KL Loss

    • Initialization: Phase 1 checkpoint
    • Epochs: 20
    • Batch Size: 336
    • Learning Rate: 1e-3
    • Scheduler: Linear warm-up with Cosine decay
    • Optimizer: AdamW
    • Max Audio Length: 30 seconds
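Both phases use the same schedule: linear warm-up followed by cosine decay from the 1e-3 peak. A minimal sketch of such a schedule (warm-up length, total steps, and the zero floor are assumptions, since the card does not state them):

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Linear warm-up to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        # Ramp linearly up to peak_lr over the warm-up window.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a real training loop this would typically be handed to the optimizer via a scheduler (e.g. PyTorch's `LambdaLR`) rather than called by hand.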

Evaluation

  • Vietnamese
    • Word Error Rate (WER): 11.68 on viVoice dataset
  • English
    • WER: 11.89 on LibriTTS-R dataset
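For reference, WER is the word-level edit distance (substitutions, insertions, deletions) between the hypothesis and reference transcripts, divided by the number of reference words. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3. Production evaluations normally apply text normalization (casing, punctuation) before scoring.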

Guide: Running Locally

  1. Clone the official Ichigo Whisper repository from GitHub.
  2. Install necessary dependencies.
  3. Run the inference script:
    python demo/inference.py --input path/to/your/audio.wav
    
  4. For optimal performance, consider using cloud GPUs like those provided by AWS, Google Cloud, or Azure.

License

Ichigo Whisper is released under the Apache 2.0 license, allowing for open-source use and modification.
