Kokoro-82M

hexgrad

Introduction

Kokoro-82M is a text-to-speech (TTS) model with 82 million parameters that converts text input into audio output. It was released on December 25, 2024, under the Apache 2.0 license. Kokoro is noted for its efficiency: it uses fewer parameters and less training data than other models in its category.

Architecture

Kokoro is built on the StyleTTS 2 architecture described in the paper by Li et al., paired with ISTFTNet for waveform synthesis. The released model is decoder-only: it uses no diffusion, and the encoder has not been released. It supports American English and British English.

Training

Kokoro was trained on A100 80GB VRAM instances rented from Vast.ai, a provider chosen for its cost-effectiveness. The training set consisted of less than 100 hours of permissive/non-copyrighted audio data with IPA phoneme labels, and training ran for fewer than 20 epochs.

Model Stats

  • Parameters: 82 million
  • Training Audio Duration: < 100 hours
  • Training Epochs: < 20
  • Compute: A100 80GB VRAM instances
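
As a rough illustration of what 82 million parameters means for storage, the sketch below estimates the raw weight size at a few common numeric precisions. The precisions listed and the helper name weight_size_mb are assumptions made for illustration; the stats above do not say which precision the released checkpoint uses, and a real checkpoint file may also hold buffers or metadata.

```python
# Back-of-the-envelope estimate of raw weight storage for an 82M-parameter model.
# Assumption: every parameter is stored at the same precision; buffers and
# file metadata are ignored.

PARAMETERS = 82_000_000  # taken from the model stats

# Bytes per parameter for common storage precisions (illustrative choices).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Return the raw weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


if __name__ == "__main__":
    for name, nbytes in BYTES_PER_PARAM.items():
        print(f"{name}: ~{weight_size_mb(PARAMETERS, nbytes):.0f} MB")
```

At 32-bit precision this comes to roughly 328 MB of weights, which is small enough to fit comfortably on consumer hardware.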

Guide: Running Locally

  1. Install Dependencies:

    • Clone the repository and install the necessary packages. The commands below use notebook syntax (the ! and % prefixes), as in Jupyter or Google Colab.
    !git clone https://huggingface.co/hexgrad/Kokoro-82M
    %cd Kokoro-82M
    !apt-get -qq -y install espeak-ng > /dev/null 2>&1
    !pip install -q phonemizer torch transformers scipy munch
    
  2. Load the Model and Voicepack:

    • Use the Python scripts provided in the repository to build the model and load a default voicepack.

  3. Generate Audio:

    • Use the generate function to produce 24 kHz audio from text input.

  4. Display Audio Output:

    • Use IPython's display and Audio functions to play the generated audio.
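
Outside a notebook, the generated audio can be written to disk instead of played inline. The sketch below is a minimal example of saving 24 kHz mono samples as a 16-bit WAV file using only the Python standard library. It does not call the repository's code: placeholder_audio is a hypothetical stand-in that produces a sine tone, and save_wav is an illustrative helper, not part of Kokoro. It also assumes the audio is a sequence of floats in the range [-1.0, 1.0], which should be checked against what the generate function actually returns.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # output rate stated in the guide (24 kHz)


def placeholder_audio(seconds: float = 1.0, freq_hz: float = 440.0) -> list:
    """Hypothetical stand-in for generated speech: a sine tone in [-1.0, 1.0]."""
    n = int(seconds * SAMPLE_RATE)
    return [0.5 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def save_wav(samples, path: str, sample_rate: int = SAMPLE_RATE) -> None:
    """Write mono float samples to a 16-bit PCM WAV file."""
    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)  # 24 kHz by default
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


if __name__ == "__main__":
    audio = placeholder_audio()  # replace with the model's output
    save_wav(audio, "output.wav")
```

In a notebook, the same samples could instead be played with IPython's Audio class by passing the data together with rate=24000, as step 4 describes.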

Cloud GPUs

If no local GPU is available, consider a cloud GPU, such as those offered through Google Colab, for faster processing.
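
Whether the code runs locally or in the cloud, it helps to check which device is actually available before loading the model. The following is a minimal sketch, assuming PyTorch (installed in step 1 of the guide) is the framework in use; pick_device is a hypothetical helper name, not something provided by the repository.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # installed in step 1 of the guide
    except ImportError:
        return "cpu"  # PyTorch missing: fall back instead of crashing
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Running on: {pick_device()}")
```

The returned string can then be passed wherever the loading code expects a device name.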

License

  • Model Weights: Apache 2.0
  • Inference Code: MIT License
  • Dependencies: Include the GPLv3-licensed espeak-ng

The Apache 2.0 license on the model weights permits broad use and distribution, but the inference code and dependencies carry their own licensing terms, as listed above.
