Kokoro-82M Text-to-Speech Model

Introduction

Kokoro-82M is a text-to-speech (TTS) model with 82 million parameters that converts text input into audio output. Its weights were released on December 25, 2024, under the Apache 2.0 license. Kokoro is noted for its efficiency: it uses fewer parameters and less training data than other models in its category.

Architecture

Kokoro is built on the StyleTTS 2 architecture, as described in the paper by Li et al., and uses ISTFTNet for waveform generation. The release is decoder-only: the model involves no diffusion, and no encoder has been released. It supports American and British English.

Training

Kokoro was trained on A100 80GB VRAM instances rented from Vast.ai, a provider chosen for its cost-effectiveness. The training set consisted of fewer than 100 hours of permissive/non-copyrighted audio data with IPA phoneme labels, and training ran for fewer than 20 epochs.

Model Stats

Parameters: 82 million
Training Audio Duration: < 100 hours
Training Epochs: < 20
Compute: A100 80GB VRAM instances
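
To put the parameter count in perspective, the following is a rough back-of-the-envelope sketch of how much memory the weights alone would occupy. It assumes exactly 82,000,000 parameters (the figure above is rounded) stored as float32 or float16; the helper name weight_megabytes is hypothetical, and the actual checkpoint size may differ.

```python
# Rough weight-memory estimate from the parameter count.
# Assumptions: 82,000,000 parameters (rounded figure), weights only,
# no optimizer state, activations, or file-format overhead.
PARAMS = 82_000_000
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_megabytes(n_params, dtype):
    """Return the approximate weight size in megabytes (10**6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_megabytes(PARAMS, dtype):.0f} MB")
```

Under these assumptions the weights come to roughly 328 MB in float32 and 164 MB in float16.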

Guide: Running Locally

Install Dependencies:

Clone the repository and install the necessary packages. The commands below use notebook syntax (the ! and % prefixes), as in Jupyter or Google Colab.

!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch
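
Because the install commands above suppress their output, a failed install can go unnoticed. The following is a minimal sketch of a check that the packages and the espeak-ng binary are visible to the current Python environment. The function name check_dependencies is hypothetical and not part of the Kokoro repository; the package list simply mirrors the pip command above.

```python
import importlib.util
import shutil

# Names taken from the install commands in this guide.
PYTHON_PACKAGES = ["phonemizer", "torch", "transformers", "scipy", "munch"]
SYSTEM_BINARIES = ["espeak-ng"]


def check_dependencies():
    """Return a dict mapping each dependency name to True if it was found."""
    status = {}
    for name in PYTHON_PACKAGES:
        # find_spec returns None when a top-level package is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    for name in SYSTEM_BINARIES:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

The check only confirms that each dependency can be located; it does not verify versions or that the packages import cleanly.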

Load the Model and Voicepack:

Use the provided Python scripts to build and load the model with a default voicepack.

Generate Audio:

Use the generate function to produce 24 kHz audio from text input.

Display Audio Output:

Use IPython's display and Audio functions to play the generated audio.
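
Outside a notebook there is no inline audio player, so it can be useful to write the generated samples to a WAV file instead. The following is a minimal sketch using only the Python standard library. It assumes the audio is a mono sequence of floats in the range [-1.0, 1.0] at 24 kHz; the helpers save_wav and placeholder_tone are hypothetical, and the sine tone merely stands in for real model output.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Hz, the output rate stated in this guide


def save_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overflow
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


def placeholder_tone(seconds=0.5, freq=440.0, sample_rate=SAMPLE_RATE):
    """Stand-in for model output: a quiet sine tone as a list of floats."""
    n_samples = int(seconds * sample_rate)
    return [
        0.3 * math.sin(2 * math.pi * freq * i / sample_rate)
        for i in range(n_samples)
    ]


if __name__ == "__main__":
    audio = placeholder_tone()  # replace with the audio returned by the model
    save_wav(audio, "output.wav")
    print(f"Wrote {len(audio) / SAMPLE_RATE:.2f} s of audio to output.wav")
```

In a notebook, the same samples can instead be passed to IPython.display.Audio with rate=24000, as described in the step above.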