Fish Speech 1.5
Introduction
Fish Speech V1.5 is an advanced text-to-speech (TTS) model that supports 13 languages and was trained on more than one million hours of audio data. It is designed to produce high-quality speech synthesis across all supported languages.
Architecture
Fish Speech V1.5 leverages large language models for multilingual text-to-speech synthesis, enabling it to handle diverse linguistic features and accents. The model has been trained extensively on datasets for each supported language, ensuring high-quality and natural-sounding speech output.
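LLM-based TTS systems of this kind typically split synthesis into two stages: an autoregressive language model maps text to a sequence of discrete acoustic tokens, and a decoder (vocoder) turns those tokens into a waveform. The sketch below only illustrates that two-stage composition; the function names and token scheme are invented for the example and are not Fish Speech's actual API:

```python
# Conceptual two-stage TTS pipeline (illustrative only; the names and the
# token scheme are invented, not Fish Speech's real interface).

def text_to_acoustic_tokens(text: str) -> list[int]:
    """Stage 1: an autoregressive LM would predict discrete acoustic
    tokens from text. Here a deterministic mapping stands in for it."""
    return [ord(ch) % 256 for ch in text]

def tokens_to_waveform(tokens: list[int]) -> list[float]:
    """Stage 2: a neural decoder (vocoder) would synthesize audio from
    the tokens. Here each token becomes one placeholder sample."""
    return [t / 255.0 for t in tokens]

def synthesize(text: str) -> list[float]:
    # The two stages compose: text -> acoustic tokens -> waveform.
    return tokens_to_waveform(text_to_acoustic_tokens(text))
```

The design point is the separation of concerns: the language model handles linguistic variation (scripts, accents, prosody), while the decoder handles audio fidelity, which is what lets one architecture scale across many languages.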
Training
The model has been trained on a substantial amount of audio data, totaling more than 1 million hours across various languages. The training data breaks down approximately as follows:
- English (en): >300k hours
- Chinese (zh): >300k hours
- Japanese (ja): >100k hours
- German (de), French (fr), Spanish (es), Korean (ko), Arabic (ar), Russian (ru): ~20k hours each
- Dutch (nl), Italian (it), Polish (pl), Portuguese (pt): <10k hours each
The model's architecture and training process allow it to synthesize speech with high fidelity in these languages.
Guide: Running Locally
To run Fish Speech V1.5 locally, follow these steps:
- Clone the Fish Speech repository from GitHub:
git clone https://github.com/fishaudio/fish-speech
- Navigate to the cloned directory:
cd fish-speech
- Set up the environment and install dependencies:
pip install -r requirements.txt
- Run inference using the scripts provided in the repository (see its README for the current entry points and configurations).
For optimal performance, consider using a cloud GPU service such as AWS, Google Cloud, or Azure to handle the computational load of running the model.
License
Fish Speech V1.5 is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This allows sharing and adapting the model for non-commercial purposes, provided that appropriate credit is given and that adaptations are distributed under the same license.