Fish Speech 1.5

fishaudio

Introduction

Fish Speech V1.5 is an advanced text-to-speech (TTS) model that supports 13 languages and was trained on more than one million hours of audio data. It is designed to produce high-quality, natural-sounding speech synthesis across all supported languages.

Architecture

Fish Speech V1.5 leverages large language models for multilingual text-to-speech synthesis, enabling it to handle diverse linguistic features and accents. The model has been trained extensively on datasets for each supported language, ensuring high-quality and natural-sounding speech output.

Training

The model has been trained on a substantial amount of audio data, totaling more than 1 million hours across various languages. The training data includes:

  • English (en): >300k hours
  • Chinese (zh): >300k hours
  • Japanese (ja): >100k hours
  • German (de), French (fr), Spanish (es), Korean (ko), Arabic (ar), Russian (ru): ~20k hours each
  • Dutch (nl), Italian (it), Polish (pl), Portuguese (pt): <10k hours each

The model's architecture and training process allow it to synthesize speech with high fidelity in these languages.
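As a quick sanity check, the listed figures can be tallied in Python. The numbers below simply transcribe the bullet list, treating each ">Xk" entry as its floor and each "<10k" entry as 10k hours, so the sum is only indicative and understates the reported total of over one million hours.

```python
# Per-language training data in thousands of hours, transcribed from
# the list above ("<10k" entries are capped at 10; ">Xk" entries use X).
hours_k = {
    "en": 300, "zh": 300, "ja": 100,
    "de": 20, "fr": 20, "es": 20, "ko": 20, "ar": 20, "ru": 20,
    "nl": 10, "it": 10, "pl": 10, "pt": 10,
}
total_hours = sum(hours_k.values()) * 1000
print(f"{len(hours_k)} languages, ~{total_hours:,} hours tallied")
# → 13 languages, ~860,000 hours tallied (the ">" margins on the large
#   corpora account for the gap up to the reported >1M hours)
```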

Guide: Running Locally

To run Fish Speech V1.5 locally, follow these steps:

  1. Clone the Fish Speech repository from GitHub:
    git clone https://github.com/fishaudio/fish-speech
    
  2. Navigate to the cloned directory:
    cd fish-speech
    
  3. Set up a Python environment (a virtual environment is recommended) and install the dependencies:
    pip install -r requirements.txt
    
  4. Run the model using the inference scripts and configurations provided in the repository; see its README for the current entry points.

For optimal performance, consider using a cloud GPU service such as AWS, Google Cloud, or Azure to handle the computational load of running the model.

License

Fish Speech V1.5 is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This allows for sharing and adapting the model for non-commercial purposes, provided appropriate credit is given, and adaptations are shared under the same license.
