Kotoba-Whisper-v2.2
kotoba-tech
Introduction
Kotoba-Whisper-v2.2 is a Japanese Automatic Speech Recognition (ASR) model developed by Kotoba Technologies. It extends the capabilities of Kotoba-Whisper-v2.0 by integrating additional post-processing features, such as speaker diarization and punctuation addition. This model is supported by the Hugging Face Transformers library and is designed to handle Japanese audio transcription with enhanced features.
Architecture
The model uses OpenAI's Whisper architecture for automatic speech recognition. On top of the ASR model, the pipeline adds a diarizer for speaker segmentation and a punctuator that restores punctuation, which makes the transcripts easier to read.
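To make the role of the diarizer concrete, here is a minimal sketch of one common way to combine timestamped transcript chunks with speaker turns: each chunk is given the speaker whose turns overlap it the longest. This is an illustration only, not the pipeline's actual implementation; the tuple layouts, the function names `overlap` and `assign_speakers`, and the sample values are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical data shapes, chosen for illustration only:
#   transcript chunk: (start_s, end_s, text)
#   speaker turn:     (start_s, end_s, speaker_label)
Chunk = Tuple[float, float, str]
Turn = Tuple[float, float, str]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    chunks: List[Chunk], turns: List[Turn]
) -> List[Tuple[Optional[str], str]]:
    """Label each chunk with the speaker whose turns overlap it the most."""
    labeled = []
    for c_start, c_end, text in chunks:
        # Accumulate overlap per speaker, since one speaker may have several turns.
        totals: Dict[str, float] = {}
        for t_start, t_end, speaker in turns:
            shared = overlap(c_start, c_end, t_start, t_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        # A chunk that overlaps no turn stays unlabeled (None).
        best = max(totals, key=totals.get) if totals else None
        labeled.append((best, text))
    return labeled


chunks = [(0.0, 2.0, "こんにちは"), (2.0, 5.0, "今日はいい天気ですね")]
turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 5.0, "SPEAKER_01")]
print(assign_speakers(chunks, turns))
```

Summing overlap per speaker rather than taking the single longest turn keeps the result stable when the diarizer splits one speaker's speech into several short turns.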
Training
Kotoba-Whisper-v2.2 was developed through a collaboration between Asahi Ushio and Kotoba Technologies. It builds on OpenAI's Whisper architecture and uses the Hugging Face Transformers library together with pre-trained diarization models. Datasets such as ReazonSpeech are used for training and evaluation.
Guide: Running Locally
To run Kotoba-Whisper-v2.2 locally, follow these steps:
- Install Dependencies:
    pip install --upgrade pip
    pip install --upgrade transformers accelerate torchaudio
    pip install "punctuators==0.0.5"
    pip install "pyannote.audio"
    pip install git+https://github.com/huggingface/diarizers.git
- Accept Model Terms:
  - Visit pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1 on Hugging Face and accept their terms of use.
  - Log in with a Hugging Face authentication token:
      huggingface-cli login
- Download Audio Sample:
    wget https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
- Run the Model:
    import torch
    from transformers import pipeline

    # Configuration: use half precision and SDPA attention when a GPU is available.
    model_id = "kotoba-tech/kotoba-whisper-v2.2"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}

    # Load the pipeline; trust_remote_code=True runs the custom pipeline code from the model repository.
    pipe = pipeline(
        model=model_id,
        torch_dtype=torch_dtype,
        device=device,
        model_kwargs=model_kwargs,
        batch_size=8,
        trust_remote_code=True,
    )

    # Transcribe the sample in 15-second chunks.
    result = pipe("sample_diarization_japanese.mp3", chunk_length_s=15)
    print(result)
- Optional Enhancements:
  - Activate the punctuator by adding add_punctuation=True.
  - Control the number of speakers using num_speakers, min_speakers, and max_speakers.
  - Improve transcription quality by adding silence before/after the audio.
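The optional arguments listed above can be collected in one place before calling the pipeline. The sketch below is a hypothetical helper, `build_call_kwargs`, which is not part of the model's code: it forwards only the options that were set, and its sanity checks are assumptions, since the pipeline may enforce different rules. The argument names themselves (`add_punctuation`, `num_speakers`, `min_speakers`, `max_speakers`, `chunk_length_s`) are the ones used in this guide.

```python
from typing import Any, Dict, Optional


def build_call_kwargs(
    add_punctuation: bool = False,
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
    chunk_length_s: int = 15,
) -> Dict[str, Any]:
    """Collect optional pipeline arguments, omitting any that were not set."""
    speaker_options = {
        "num_speakers": num_speakers,
        "min_speakers": min_speakers,
        "max_speakers": max_speakers,
    }
    # Assumed sanity checks; the pipeline itself may behave differently.
    for name, value in speaker_options.items():
        if value is not None and value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers must not exceed max_speakers")

    kwargs: Dict[str, Any] = {"chunk_length_s": chunk_length_s}
    if add_punctuation:
        kwargs["add_punctuation"] = True
    # Forward only the speaker options the caller actually provided.
    for name, value in speaker_options.items():
        if value is not None:
            kwargs[name] = value
    return kwargs


kwargs = build_call_kwargs(add_punctuation=True, min_speakers=1, max_speakers=3)
print(kwargs)

# With the pipeline from the "Run the Model" step loaded as `pipe`:
# result = pipe("sample_diarization_japanese.mp3", **kwargs)
```

Leaving unset options out of the dictionary lets the pipeline fall back to its own defaults instead of receiving explicit None values.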
Suggestion for Cloud GPUs: Consider using cloud services like AWS, Google Cloud, or Azure for access to powerful GPUs to run large models efficiently.
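The silence-padding suggestion from the optional enhancements can be sketched as follows. This is a minimal illustration in plain Python, assuming the audio is already decoded into a list of mono float samples; the helper name `pad_with_silence` and the 0.5-second defaults are assumptions chosen for the example, and the pipeline may provide its own way to do this. In practice the samples would usually be held in a NumPy array or a tensor rather than a list.

```python
from typing import List


def pad_with_silence(
    samples: List[float],
    sample_rate: int,
    start_s: float = 0.5,  # assumed default, not taken from the model card
    end_s: float = 0.5,    # assumed default, not taken from the model card
) -> List[float]:
    """Return a copy of `samples` with zero-valued samples added at both ends."""
    if start_s < 0 or end_s < 0:
        raise ValueError("padding durations must be non-negative")
    # Convert each duration in seconds to a whole number of samples.
    lead = [0.0] * round(start_s * sample_rate)
    tail = [0.0] * round(end_s * sample_rate)
    return lead + list(samples) + tail


audio = [0.1, -0.2, 0.3]  # stand-in for real decoded mono samples
padded = pad_with_silence(audio, sample_rate=16000)
print(len(padded))
```

The function returns a new list and leaves the input untouched, so the original audio remains available for comparison.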
License
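Kotoba-Whisper-v2.2 is licensed under the Apache-2.0 license, which permits commercial and non-commercial use, distribution, and modification, provided that the license text and existing copyright notices are retained and modified files carry a notice that they were changed. Modified versions do not have to be released under the same license.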