Ichigo Whisper v0.1

homebrewltd

Introduction

Ichigo Whisper is a compact (22M-parameter) open-source speech tokenizer built on the Whisper-medium model. It is designed to improve multilingual performance while minimally affecting Whisper's original English capabilities. Unlike models that output continuous embeddings, Ichigo Whisper compresses speech into discrete tokens, which token-based language models can consume directly. It was trained on approximately 400 hours of English and 1,000 hours of Vietnamese audio.

Architecture

The model uses the WhisperVQ architecture and functions as a quantizer for the Whisper model. It supports both English and Vietnamese.
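The quantization step can be illustrated with a minimal sketch: a learned codebook maps each continuous encoder frame to the index of its nearest code vector, turning a sequence of embeddings into a sequence of discrete tokens. The sizes below (512 codes, 1024-dim embeddings) are illustrative assumptions, not Ichigo Whisper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
codebook = rng.normal(size=(512, 1024)).astype(np.float32)   # 512 learned codes
frames = rng.normal(size=(100, 1024)).astype(np.float32)     # 100 encoder frames

def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each continuous frame to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every code.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # One discrete token per frame.
    return d2.argmin(axis=1)

tokens = quantize(frames, codebook)
```

Decoding back to embeddings is just a table lookup, `codebook[tokens]`, which is why the representation remains usable by downstream models despite being discrete.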

Training

Hardware Specifications

  • GPUs: 8 × NVIDIA A6000

Training Time

  • Phase 1: 75 hours (50 epochs)
  • Phase 2: 29 hours (20 epochs)
  • Total Training: 104 hours

Training Details

  • Phase 1: With KL Loss

    • Initialization: WhisperVQ-Large-v3
    • Epochs: 50
    • Batch Size: 336
    • Learning Rate: 1e-3
    • Scheduler: Linear warm-up with Cosine decay
    • Optimizer: AdamW
    • Max Audio Length: 30 seconds
  • Phase 2: Without KL Loss

    • Initialization: Phase 1 checkpoint
    • Epochs: 20
    • Batch Size: 336
    • Learning Rate: 1e-3
    • Scheduler: Linear warm-up with Cosine decay
    • Optimizer: AdamW
    • Max Audio Length: 30 seconds
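Both phases use the same schedule: linear warm-up followed by cosine decay from the 1e-3 peak. A minimal sketch of such a schedule (warm-up length, total steps, and the zero floor are assumptions, since the card does not state them):

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Linear warm-up to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        # Ramp linearly up to peak_lr over the warm-up window.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a real training loop this would typically be handed to the optimizer via a scheduler (e.g. PyTorch's `LambdaLR`) rather than called by hand.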

Evaluation

  • Vietnamese
    • Word Error Rate (WER): 11.68 on viVoice dataset
  • English
    • WER: 11.89 on LibriTTS-R dataset
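For reference, WER is the word-level edit distance (substitutions, insertions, deletions) between the hypothesis and reference transcripts, divided by the number of reference words. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3. Production evaluations normally apply text normalization (casing, punctuation) before scoring.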

Guide: Running Locally

  1. Clone the official Ichigo Whisper repository from GitHub.
  2. Install necessary dependencies.
  3. Run the inference script:
    python demo/inference.py --input path/to/your/audio.wav
    
  4. For optimal performance, consider using cloud GPUs like those provided by AWS, Google Cloud, or Azure.

License

Ichigo Whisper is released under the Apache 2.0 license, allowing for open-source use and modification.
