Kotoba-Whisper-v2.2

kotoba-tech

Introduction

Kotoba-Whisper-v2.2 is a Japanese automatic speech recognition (ASR) model developed by Kotoba Technologies. It extends Kotoba-Whisper-v2.0 with post-processing features, namely speaker diarization and punctuation. The model runs through the Hugging Face Transformers library and targets Japanese audio transcription.

Architecture

The model is based on OpenAI's Whisper architecture, which is built for automatic speech recognition. On top of the ASR output, the pipeline adds two components: a diarizer that segments the audio by speaker and a punctuator that inserts punctuation. Together they make the transcript easier to read and attribute each passage to a speaker.
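To illustrate how ASR output and diarization output can be combined, here is a minimal sketch that labels each transcribed chunk with the speaker whose turn overlaps it the most. It is an illustration of the general idea, not the pipeline's actual implementation: the data shapes, the `overlap` and `assign_speakers` helpers, and the `SPEAKER_00`-style labels are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical shapes, chosen for illustration only:
#   ASR chunk:    {"text": str, "timestamp": (start_s, end_s)}
#   speaker turn: (speaker_label, start_s, end_s)
Chunk = Dict[str, object]
Turn = Tuple[str, float, float]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks: List[Chunk], turns: List[Turn]) -> List[Chunk]:
    """Label each chunk with the speaker whose turn overlaps it the most."""
    labeled = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        best_label, best_overlap = "UNKNOWN", 0.0
        for label, t_start, t_end in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_label, best_overlap = label, shared
        # Copy the chunk so the input list is left unmodified.
        labeled.append({**chunk, "speaker": best_label})
    return labeled


# Made-up example data: two chunks and two speaker turns.
chunks = [
    {"text": "こんにちは。", "timestamp": (0.0, 2.0)},
    {"text": "よろしくお願いします。", "timestamp": (2.0, 5.0)},
]
turns = [("SPEAKER_00", 0.0, 2.4), ("SPEAKER_01", 2.4, 5.0)]

labeled = assign_speakers(chunks, turns)
for item in labeled:
    print(item["speaker"], item["text"])
```

Maximum overlap is a simple rule that tolerates small misalignments between ASR timestamps and diarization boundaries. A chunk that overlaps no turn is labeled `UNKNOWN` rather than guessed.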

Training

Kotoba-Whisper-v2.2 was developed through a collaboration between Asahi Ushio and Kotoba Technologies. It builds on OpenAI's Whisper architecture and is integrated with the Hugging Face Transformers library, through which the pipeline uses pre-trained diarization models. Datasets such as ReazonSpeech are used for training and evaluation.

Guide: Running Locally

To run Kotoba-Whisper-v2.2 locally, follow these steps:

  1. Install Dependencies:

    pip install --upgrade pip
    pip install --upgrade transformers accelerate torchaudio
    pip install "punctuators==0.0.5"
    pip install "pyannote.audio"
    pip install git+https://github.com/huggingface/diarizers.git
    
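  2. Accept Model Terms: The diarization step relies on gated pyannote models hosted on Hugging Face. Accept the user conditions on the model pages (the upstream model card names pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1), then authenticate with your Hugging Face account before running the model.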

  3. Download Audio Sample:

    wget https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
    
  4. Run the Model:

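    import torch
    from transformers import pipeline
    
    # Model configuration: use half precision and SDPA attention only on GPU.
    model_id = "kotoba-tech/kotoba-whisper-v2.2"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
    
    # trust_remote_code=True lets Transformers load the custom pipeline code
    # shipped in the model repository; review that code before enabling it.
    pipe = pipeline(
        model=model_id,
        torch_dtype=torch_dtype,
        device=device,
        model_kwargs=model_kwargs,
        batch_size=8,
        trust_remote_code=True,
    )
    
    # Transcribe the sample in 15-second chunks and print the result.
    result = pipe("sample_diarization_japanese.mp3", chunk_length_s=15)
    print(result)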
    
  5. Optional Enhancements:

    • Enable the punctuator by passing add_punctuation=True when calling the pipeline.
    • Fix the number of speakers with num_speakers, or bound it with min_speakers and max_speakers.
    • Pad the audio with silence at the start and end, which can improve transcription quality.
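The options above can be gathered into a single set of keyword arguments before calling the pipeline. The sketch below assumes that `add_punctuation`, `num_speakers`, `min_speakers`, and `max_speakers` are passed at call time alongside `chunk_length_s`. The `build_call_kwargs` helper is hypothetical, and its validation rules (an exact count or a range, not both) are a sanity check added for this example rather than documented behavior of the model.

```python
from typing import Any, Dict, Optional


def build_call_kwargs(
    add_punctuation: bool = False,
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
    chunk_length_s: int = 15,
) -> Dict[str, Any]:
    """Collect optional call-time arguments, omitting any that are unset."""
    has_range = min_speakers is not None or max_speakers is not None
    if num_speakers is not None and has_range:
        # Assumption for this sketch: an exact count and a range conflict.
        raise ValueError("Pass either num_speakers or a min/max range, not both.")
    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers must not exceed max_speakers.")

    kwargs: Dict[str, Any] = {"chunk_length_s": chunk_length_s}
    if add_punctuation:
        kwargs["add_punctuation"] = True
    for name, value in (
        ("num_speakers", num_speakers),
        ("min_speakers", min_speakers),
        ("max_speakers", max_speakers),
    ):
        if value is not None:
            kwargs[name] = value
    return kwargs


call_kwargs = build_call_kwargs(add_punctuation=True, num_speakers=2)
print(call_kwargs)

# With the pipeline from step 4 already created as `pipe`:
# result = pipe("sample_diarization_japanese.mp3", **call_kwargs)
```

Leaving unset options out of the dictionary means the pipeline's own defaults apply, so the helper does not need to know what those defaults are.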

Suggestion for Cloud GPUs: If no local GPU is available, consider a GPU instance from a cloud provider such as AWS, Google Cloud, or Azure. The script in step 4 also runs on CPU in float32, but inference is likely to be much slower.

License

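Kotoba-Whisper-v2.2 is licensed under the Apache-2.0 license, which allows commercial and non-commercial use, distribution, and modification. Redistributions must include a copy of the license, retain existing copyright and attribution notices, and state significant changes to modified files. Apache-2.0 is not a copyleft license, so modifications do not have to be released under the same license.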
