Anime Whisper - Open LLM List

Introduction

Anime Whisper is a Japanese automatic speech recognition (ASR) model fine-tuned for anime-style dialogue. It is based on kotoba-whisper-v2.0 and was trained on the Galgame_Speech_ASR_16kHz dataset, which contains approximately 5,300 hours of anime-style audio paired with scripts. The model is optimized for anime voice acting but is also reported to perform well on other kinds of audio.

Architecture

Anime Whisper is built on kotoba-whisper-v2.0, a distilled version of openai/whisper-large-v3, which makes it lightweight and fast. It is fine-tuned to transcribe non-verbal utterances such as laughter and breathing, and to place punctuation according to the rhythm and emotion of the speech.

Training

The model was trained on the Galgame_Speech_ASR_16kHz dataset, with the last tar file held out as test data. Training proceeded in two stages: first the encoder was frozen and only the decoder was trained for several epochs; then the encoder was unfrozen and the entire model was trained. The final model was produced by averaging model checkpoints taken at different points in training, with Optuna used to optimize the result for character error rate (CER).
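The two ingredients of that last step, a CER metric and a weighted average of checkpoints, can be illustrated with a minimal pure-Python sketch. This is not the author's training code: the function names `edit_distance`, `cer`, and `average_checkpoints` are hypothetical, and plain lists of floats stand in for model tensors.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def average_checkpoints(
    checkpoints: Sequence[Dict[str, List[float]]],
    weights: Sequence[float],
) -> Dict[str, List[float]]:
    """Weighted average of parameter dicts (lists stand in for tensors)."""
    if len(checkpoints) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    averaged: Dict[str, List[float]] = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [
            sum(w * v for w, v in zip(weights, col)) / total
            for col in columns
        ]
    return averaged
```

For example, `cer("こんにちは", "こんにちわ")` is 0.2, since one of the five reference characters is wrong. In a setup like the one described above, a hyperparameter search tool such as Optuna would propose the `weights`, and each averaged model would be scored by CER on held-out data; that search loop is omitted here.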

Guide: Running Locally

Installation:

Ensure you have PyTorch and Transformers installed.

Use the following code snippet to set up the pipeline:

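import torch
from transformers import pipeline

# Generation settings: transcribe Japanese, with n-gram blocking and the
# repetition penalty both switched off (0 and 1.0 disable them).
generate_kwargs = {
    "language": "Japanese",
    "no_repeat_ngram_size": 0,
    "repetition_penalty": 1.0,
}

# Build the ASR pipeline. As written, this requires a CUDA-capable GPU
# and loads the model in half precision.
pipe = pipeline(
    "automatic-speech-recognition",
    model="litagin/anime-whisper",
    device="cuda",
    torch_dtype=torch.float16,
    chunk_length_s=30.0,  # long audio is processed in 30-second chunks
    batch_size=64,
)

# Transcribe a single file and print the recognized text.
audio_path = "test.wav"
result = pipe(audio_path, generate_kwargs=generate_kwargs)
print(result["text"])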

Running Inference:
- To transcribe several audio files in one call, pass a list of file paths to the pipeline; it processes them in batches.

Cloud GPUs:
- For larger workloads, consider renting a cloud GPU, for example from vast.ai.
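The batch-processing note under Running Inference can be sketched as a small helper. The name `transcribe_files` is hypothetical, and the sketch assumes `pipe` behaves like the Transformers ASR pipeline built earlier: given a list of inputs, it returns one dict with a "text" key per input, in the same order.

```python
from typing import Any, Callable, Dict, Optional, Sequence


def transcribe_files(
    pipe: Callable[..., Any],
    audio_paths: Sequence[str],
    generate_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, str]:
    """Transcribe several files in one pipeline call; map each path to its text."""
    paths = [str(p) for p in audio_paths]
    if not paths:
        return {}
    # A list input lets the pipeline group files according to its batch_size.
    outputs = pipe(paths, generate_kwargs=generate_kwargs or {})
    return {path: out["text"] for path, out in zip(paths, outputs)}
```

With the pipeline and `generate_kwargs` from the setup snippet, a call such as `transcribe_files(pipe, ["a.wav", "b.wav"], generate_kwargs)` would return a dictionary keyed by file path. The file names here are placeholders.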

License

Anime Whisper is released under the MIT License, a permissive license that allows reuse, modification, and redistribution as long as the copyright and license notice are retained.
