Kokoro-82M
hexgrad
Introduction
Kokoro-82M is a text-to-speech (TTS) model with 82 million parameters that converts text input into audio output. Released on December 25, 2024, under the Apache 2.0 license, Kokoro has been recognized for its efficiency, achieving competitive quality with fewer parameters and less training data than comparable models.
Architecture
Kokoro is built on the StyleTTS 2 architecture described by Li et al., paired with an ISTFTNet vocoder. The release is decoder-only: it uses no diffusion component, and the text encoder was not released. At launch, the model supports American and British English.
Training
Kokoro was trained on A100 80GB VRAM instances rented from Vast.ai, chosen for their cost-effectiveness. The training set consisted of fewer than 100 hours of permissive or non-copyrighted audio paired with IPA phoneme labels, and training ran for fewer than 20 epochs.
Model Stats
- Parameters: 82 million
- Training Audio Duration: < 100 hours
- Training Epochs: < 20
- Compute: A100 80GB VRAM instances
Guide: Running Locally
1. Install Dependencies: Clone the repository and install the necessary packages.
```
!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch
```
2. Load the Model and Voicepack: Use the provided Python scripts to build the model and load a default voicepack, as sketched below.
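A minimal sketch of this step, assuming the repository layout from the original release (a `models.py` exposing `build_model`, a `kokoro-v0_19.pth` checkpoint, and voicepacks under `voices/`); exact file names may differ between releases:
```python
import torch
from models import build_model  # provided by the cloned repository (assumption)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Build the 82M-parameter model from the released checkpoint.
MODEL = build_model('kokoro-v0_19.pth', device)

# Load a default voicepack ('af' = default American English voice; assumption).
VOICE_NAME = 'af'
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')
```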
3. Generate Audio: Use the `generate` function to produce 24 kHz audio from text input, as in the example below.
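Continuing from the previous step, a sketch assuming the repository's `kokoro.generate` helper returns the waveform samples plus the phonemes used for synthesis:
```python
from kokoro import generate  # provided by the cloned repository (assumption)

text = "Kokoro is an open-weight text-to-speech model with 82 million parameters."

# generate returns 24 kHz audio samples and the output phoneme string.
# The first letter of the voice name selects the language variant:
# 'a' for American English, 'b' for British English (assumption).
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
```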
4. Display Audio Output: Use IPython's display and Audio functions to play the generated audio, as in the snippet below.
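For playback in a notebook, something like the following should work, with the rate set to the model's 24 kHz output sample rate:
```python
from IPython.display import display, Audio

# Play the generated waveform at the model's native 24 kHz sample rate.
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)  # phonemes actually used for synthesis
```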
Cloud GPUs
For faster inference, consider using cloud GPUs such as those available through Google Colab.
License
- Model Weights: Apache 2.0
- Inference Code: MIT License
- Dependencies: include the GPLv3-licensed espeak-ng
The Apache 2.0 license allows for wide usage and distribution, while specific components and dependencies may have different licensing terms.