metavoice-1B-v0.1

metavoiceio

Introduction

MetaVoice-1B is a 1.2-billion-parameter base model for text-to-speech (TTS). It focuses on emotional speech rhythm and tone in English without hallucinations. It supports zero-shot cloning for American and British voices, and cloning of other voices with minimal training data. The model is released under the Apache 2.0 license, which permits commercial use, modification, and distribution subject to the license's attribution and notice terms.

Architecture

MetaVoice-1B predicts EnCodec tokens from text and speaker information. These tokens are then diffused up to the waveform level, and post-processing is applied to clean up the audio. The architecture includes:

  • A causal GPT that predicts the first two hierarchies of EnCodec tokens. Text is part of the model's context, and speaker information is supplied as conditioning derived from a speaker verification network.
  • A non-causal transformer that predicts the remaining hierarchies from the first two. Because it is non-causal, it can predict all timesteps in parallel, and it generalizes zero-shot to new speakers.
  • Multi-band diffusion for waveform generation, with DeepFilterNet used for artifact cleanup.
  • Optimizations including KV-caching via Flash Decoding and batching support.
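
The data flow between the two token-prediction stages above can be illustrated with a toy sketch. It is not the MetaVoice implementation: random numbers stand in for both models, and the total number of hierarchies, the codebook size, and every function name are assumptions chosen for illustration. The point is only the control flow, where the causal stage loops over timesteps while the non-causal stage fills in all timesteps in one pass.

```python
import random

# Assumed sizes for illustration only; the text above does not state them.
NUM_HIERARCHIES = 8   # hypothetical total number of EnCodec token hierarchies
NUM_CAUSAL = 2        # the "first two hierarchies" handled by the causal GPT
CODEBOOK_SIZE = 1024  # hypothetical number of token ids per hierarchy


def causal_stage(text, speaker_embedding, num_steps, rng):
    """Stand-in for the causal GPT: emits tokens one timestep at a time.

    A real model would condition each step on the text, the speaker
    embedding, and all previously generated steps, never on future ones.
    Here the inputs are ignored and tokens are drawn at random.
    """
    generated = []
    for _ in range(num_steps):
        step = [rng.randrange(CODEBOOK_SIZE) for _ in range(NUM_CAUSAL)]
        generated.append(step)
    return generated


def non_causal_stage(coarse_tokens, rng):
    """Stand-in for the non-causal transformer.

    It sees every timestep of the first two hierarchies at once, so the
    remaining hierarchies for all timesteps come out of a single call.
    """
    remaining = NUM_HIERARCHIES - NUM_CAUSAL
    return [
        [rng.randrange(CODEBOOK_SIZE) for _ in range(remaining)]
        for _ in coarse_tokens
    ]


def generate_tokens(text, speaker_embedding, num_steps=5, seed=0):
    """Run both stages and return a grid shaped [num_steps][NUM_HIERARCHIES]."""
    rng = random.Random(seed)
    coarse = causal_stage(text, speaker_embedding, num_steps, rng)
    fine = non_causal_stage(coarse, rng)
    # Join the two stages per timestep into one row of token ids.
    return [c + f for c, f in zip(coarse, fine)]


tokens = generate_tokens("Hello there.", speaker_embedding=[0.0] * 4)
print(len(tokens), len(tokens[0]))
```

In the real system the resulting token grid would then go to multi-band diffusion and DeepFilterNet. Those stages are omitted here because they operate on audio rather than on token ids.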

Training

MetaVoice-1B is trained on 100,000 hours of speech data, with an emphasis on emotional rhythm and tone and on avoiding hallucinations. Voice cloning needs only a small amount of training data, and zero-shot cloning works from a short reference audio clip.

Guide: Running Locally

  1. Installation: Clone the MetaVoice source from GitHub.
  2. Environment Setup: Install necessary dependencies using a package manager like pip.
  3. Model Download: Access the pretrained MetaVoice-1B model via Hugging Face's model hub.
  4. Running Inference: Use the provided scripts to run TTS tasks, adjusting parameters as needed for cloning or long-form synthesis.
  5. Finetuning: Follow the finetuning instructions available on the GitHub repository.

For optimal performance, consider using a cloud GPU service like AWS, Google Cloud, or Azure to handle the computational load.
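
Steps 1 and 3 of the guide can be scripted roughly as follows. This is a minimal sketch, not an official installer. The repository URL and model id are the ones I understand the project to use and should be checked against the GitHub and Hugging Face pages. The helper names and the local directory paths are hypothetical. Dependency installation, inference, and finetuning (steps 2, 4, and 5) are left to the repository's own instructions.

```python
import subprocess
from pathlib import Path

# Assumed locations; verify both against the project's pages before use.
REPO_URL = "https://github.com/metavoiceio/metavoice-src"
MODEL_ID = "metavoiceio/metavoice-1B-v0.1"


def clone_command(dest):
    """Build the git command that fetches the source tree (step 1)."""
    return ["git", "clone", REPO_URL, str(dest)]


def clone_source(dest="metavoice-src"):
    """Clone the repository unless the target directory already exists."""
    dest = Path(dest)
    if not dest.exists():
        subprocess.run(clone_command(dest), check=True)
    return dest


def download_model(local_dir="checkpoints/metavoice-1B-v0.1"):
    """Fetch the pretrained weights from the Hugging Face hub (step 3).

    Needs the `huggingface_hub` package. The import is deferred so the
    helpers above stay usable without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=MODEL_ID, local_dir=local_dir)
```

Calling `clone_source()` followed by `download_model()` needs network access and `git` on the PATH. Nothing runs at import time, so the module is safe to load on a machine without either.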

License

MetaVoice-1B is available under the Apache 2.0 license, which permits use, modification, and distribution, including commercially, provided the license's attribution and notice conditions are met.
