WhisperSpeech
Introduction
WhisperSpeech is an open-source text-to-speech system built by inverting the Whisper model. It aims to provide a powerful and customizable solution for speech generation, much as Stable Diffusion did for image generation. The project uses only properly licensed speech recordings, making the model safe for commercial use. WhisperSpeech is currently trained on the English LibriLight dataset, with plans to support multiple languages in the future.
Architecture
WhisperSpeech's architecture is influenced by models like AudioLM, SPEAR TTS, and MusicGen. It uses OpenAI's Whisper to produce semantic tokens and transcriptions, Meta's EnCodec for acoustic modeling, and Vocos from Charactr Inc. as a high-quality vocoder. Whisper's encoder generates embeddings that are quantized into semantic tokens, EnCodec compresses the audio waveform into discrete acoustic tokens, and Vocos reconstructs high-quality audio from those tokens.
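To make the division of labor concrete, the sketch below wires the three components together directly, assuming the openai-whisper, encodec, and vocos Python packages and the charactr/vocos-encodec-24khz checkpoint. It only illustrates where semantic tokens, acoustic tokens, and vocoding come from; WhisperSpeech's own quantizer and token-to-token models are omitted, so this is not the project's actual inference code.

```python
# Illustrative sketch only: how Whisper, EnCodec, and Vocos relate.
import torch
import whisper
from encodec import EncodecModel
from vocos import Vocos

# 1) Whisper encoder embeddings -- WhisperSpeech quantizes these into
#    semantic tokens (the quantization step is not shown here).
whisper_model = whisper.load_model("base")
audio = torch.zeros(16000 * 30)                 # 30 s of 16 kHz audio (silence as a stand-in)
mel = whisper.log_mel_spectrogram(audio)
with torch.no_grad():
    embeddings = whisper_model.encoder(mel.unsqueeze(0))   # (1, 1500, d_model)

# 2) EnCodec acoustic tokens -- the discrete targets of the acoustic model.
encodec = EncodecModel.encodec_model_24khz()
encodec.set_target_bandwidth(1.5)               # 2 codebooks at 1.5 kbps
wav = torch.zeros(1, 1, 24000)                  # 1 s of 24 kHz mono audio
with torch.no_grad():
    frames = encodec.encode(wav)
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)  # (1, n_q, T)

# 3) Vocos -- reconstructs a waveform from EnCodec tokens at higher quality
#    than EnCodec's built-in decoder.
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
features = vocos.codes_to_features(codes.squeeze(0))
waveform = vocos.decode(features, bandwidth_id=torch.tensor([0]))  # id 0 = 1.5 kbps
```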
Training
The WhisperSpeech models are currently trained on English, with ongoing efforts to expand to multilingual training. The model leverages semantic tokens generated by Whisper; the semantic token model has been trained on English and Polish, which indicates the potential to support a wide range of languages, including those not well supported by Whisper. Inference has also been optimized for performance, achieving generation speeds over 12 times faster than real time on consumer GPUs.
Guide: Running Locally
- Setup: Begin by accessing the WhisperSpeech Google Colab notebook to test the model.
- Dependencies: The Colab notebook installation takes less than 30 seconds, thanks to optimized dependencies.
- Local Execution: To run locally, download the pre-trained models and datasets from Hugging Face; a minimal usage sketch follows this list.
- Hardware Recommendation: For optimal performance, use a fast GPU such as an NVIDIA RTX 4090, either locally or through a cloud provider.
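As a starting point for local execution, the snippet below is a minimal sketch assuming the whisperspeech package and its Pipeline class; the checkpoint reference shown is only an example, and the exact names should be checked against the project's README.

```python
# Minimal local-inference sketch, assuming the whisperspeech package.
# The model reference below is an example checkpoint; available models are
# published under the collabora/whisperspeech repository on Hugging Face.
from whisperspeech.pipeline import Pipeline

# Downloads the pre-trained models from Hugging Face on first use.
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Synthesize speech and write it to a WAV file.
pipe.generate_to_file(
    "output.wav",
    "WhisperSpeech is an open-source text-to-speech system built by inverting Whisper.",
)
```

This single call wraps the full text to semantic tokens, semantic tokens to acoustic tokens, and vocoder chain described in the Architecture section.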
License
WhisperSpeech is provided under the MIT License, ensuring the model and its code are open source and available for commercial use.