MetaVoice-1B v0.1
Introduction
MetaVoice-1B is a 1.2 billion parameter base model for text-to-speech (TTS), built to produce emotional speech rhythm and tone in English without hallucinations. It supports voice cloning with minimal training data and zero-shot cloning of American and British voices from short reference audio. The model is released under the permissive Apache 2.0 license.
Architecture
MetaVoice-1B predicts EnCodec tokens from text and speaker information; these tokens are then diffused up to the waveform level, with post-processing applied to clean up the audio. The architecture includes (see the pipeline sketch after this list):
- A causal GPT that predicts the first two hierarchies of EnCodec tokens, conditioned on the text and on speaker information supplied by a separately trained speaker verification network.
- A non-causal transformer for predicting the remaining hierarchies, enabling zero-shot generalization and parallel timestep prediction.
- Multi-band diffusion for waveform generation, with DeepFilterNet used for artifact cleanup.
- Optimizations including KV-caching via Flash Decoding and batching support.
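To make the interaction between these components concrete, here is a minimal, hypothetical sketch of the inference pipeline. Every function name, shape, and constant below is an illustrative stand-in (zero-filled so the data flow runs end to end), not the metavoice-src API; the total hierarchy count and the samples-per-frame ratio are assumptions.

```python
import numpy as np

# Zero-filled stand-ins so the data flow runs end to end; none of these
# names come from metavoice-src. Hierarchy count (8 total, 2 coarse) and
# all shapes are illustrative assumptions.
N_TIMESTEPS = 75          # assumed EnCodec frames for a short utterance
N_COARSE, N_TOTAL = 2, 8  # first two hierarchies + the rest (assumed)

def speaker_embedding(ref_wav: np.ndarray) -> np.ndarray:
    """Speaker verification network: reference audio -> speaker vector."""
    return np.zeros(256)

def causal_gpt(text: str, spk: np.ndarray) -> np.ndarray:
    """Predicts the first two EnCodec token hierarchies, conditioned on
    the text and the speaker embedding."""
    return np.zeros((N_COARSE, N_TIMESTEPS), dtype=np.int64)

def noncausal_transformer(coarse: np.ndarray) -> np.ndarray:
    """Predicts the remaining hierarchies; each timestep is handled
    independently, so all timesteps can be predicted in parallel."""
    rest = np.zeros((N_TOTAL - N_COARSE, coarse.shape[1]), dtype=np.int64)
    return np.concatenate([coarse, rest], axis=0)

def multi_band_diffusion(tokens: np.ndarray) -> np.ndarray:
    """Diffuses EnCodec tokens up to the waveform level."""
    return np.zeros(tokens.shape[1] * 320)  # assumed samples-per-frame ratio

def deepfilternet(wav: np.ndarray) -> np.ndarray:
    """Cleans up artefacts introduced by multi-band diffusion."""
    return wav

def synthesise(text: str, ref_wav: np.ndarray) -> np.ndarray:
    spk = speaker_embedding(ref_wav)        # speaker verification network
    coarse = causal_gpt(text, spk)          # first two token hierarchies
    tokens = noncausal_transformer(coarse)  # remaining hierarchies
    return deepfilternet(multi_band_diffusion(tokens))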
Training
MetaVoice-1B is trained on 100,000 hours of speech data, with an emphasis on capturing emotional tone and rhythm while avoiding hallucinations. This training enables effective voice cloning from minimal data, as well as zero-shot cloning from short reference audio.
Guide: Running Locally
- Installation: Clone the MetaVoice source from GitHub.
- Environment Setup: Install necessary dependencies using a package manager like pip.
- Model Download: Fetch the pretrained MetaVoice-1B weights from Hugging Face's model hub (see the download sketch after this list).
- Running Inference: Use the provided scripts to run TTS tasks, adjusting parameters as needed for cloning or long-form synthesis (see the inference sketch after this list).
- Finetuning: Follow the finetuning instructions in the GitHub repository.
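For the model download step, `snapshot_download` from the `huggingface_hub` library can fetch the pretrained weights; the repo id `metavoiceio/metavoice-1B-v0.1` is assumed from the model card title.

```python
from huggingface_hub import snapshot_download

# Fetch the pretrained weights to the local Hugging Face cache.
# Repo id assumed from the model card title; verify on the hub.
local_dir = snapshot_download(repo_id="metavoiceio/metavoice-1B-v0.1")
print(f"Model downloaded to: {local_dir}")
```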
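For the inference step, the sketch below follows the usage pattern shown in the metavoice-src README; the `TTS` class, `synthesise` method, and bundled `assets/bria.mp3` reference clip are taken from that repository and should be verified against its current README.

```python
# Assumes the repository has been cloned and its dependencies installed:
#   git clone https://github.com/metavoiceio/metavoice-src
#   cd metavoice-src && pip install -e .

from fam.llm.fast_inference import TTS

# Instantiating TTS loads the MetaVoice-1B weights (fetched from the
# Hugging Face hub if not already cached).
tts = TTS()

# Zero-shot cloning: pass a short reference clip of the target speaker.
wav_path = tts.synthesise(
    text="MetaVoice-1B is a 1.2 billion parameter base model for TTS.",
    spk_ref_path="assets/bria.mp3",  # reference audio bundled with the repo
)
print(f"Synthesised audio written to: {wav_path}")
```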
For optimal performance, consider using a cloud GPU service like AWS, Google Cloud, or Azure to handle the computational load.
License
MetaVoice-1B is available under the Apache 2.0 license, permitting free use, modification, and distribution, including for commercial purposes.