fish agent v0.1 3b

fishaudio

Introduction

Fish Agent V0.1 3B is a Voice-to-Voice model capable of capturing and generating environmental audio information with high accuracy. It features a semantic-token-free architecture, which eliminates the need for traditional semantic encoders/decoders. The model is also presented as a state-of-the-art text-to-speech (TTS) system, trained on roughly 700,000 hours of multilingual audio content. It is a continue-pretrained version of Qwen-2.5-3B-Instruct, trained further on 200 billion voice and text tokens.

Architecture

The architecture of Fish Agent V0.1 3B is semantic-token-free: it processes and generates audio without relying on separate semantic encoders/decoders such as Whisper and CosyVoice. Removing that intermediate stage is intended to make audio handling more efficient and accurate.

Training

The model was trained on roughly 700,000 hours of multilingual audio content. The supported languages and their approximate training data sizes are:

  • English (en): ~300,000 hours
  • Chinese (zh): ~300,000 hours
  • German (de): ~20,000 hours
  • Japanese (ja): ~20,000 hours
  • French (fr): ~20,000 hours
  • Spanish (es): ~20,000 hours
  • Korean (ko): ~20,000 hours
  • Arabic (ar): ~20,000 hours
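
As a quick sanity check on the figures above, the per-language estimates can be summed and compared with the stated total. The following is a minimal sketch in Python; the dictionary only restates the approximate hours from the list, and the variable names are illustrative rather than part of any Fish Audio tooling.

```python
# Approximate training hours per language, copied from the list above.
hours_by_language = {
    "en": 300_000,
    "zh": 300_000,
    "de": 20_000,
    "ja": 20_000,
    "fr": 20_000,
    "es": 20_000,
    "ko": 20_000,
    "ar": 20_000,
}

# Sum of the rounded per-language estimates.
total_hours = sum(hours_by_language.values())

# Fraction of the summed hours contributed by each language.
share_by_language = {
    lang: hours / total_hours for lang, hours in hours_by_language.items()
}

print(f"Sum of per-language estimates: {total_hours:,} hours")
for lang, share in share_by_language.items():
    print(f"{lang}: {share:.1%}")
```

The per-language estimates sum to about 720,000 hours, slightly above the headline figure of 700,000, which is consistent with every number being a rounded approximation. English and Chinese together account for roughly five sixths of the data.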

Guide: Running Locally

To run the Fish Agent V0.1 3B model locally, follow these basic steps:

  1. Clone the Fish Speech GitHub repository: git clone https://github.com/fishaudio/fish-speech
  2. Install the required dependencies as listed in the repository's documentation.
  3. Configure your environment according to the guidelines provided in the repository.
  4. Execute the model using the provided scripts and instructions.
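
Step 1 can also be scripted. The sketch below is one possible way to do it in Python, assuming git is installed and on the PATH; the helper names build_clone_command and clone_repo are hypothetical and are not part of the Fish Speech repository. Steps 2 to 4 depend on the repository's own documentation and are deliberately left out.

```python
import shutil
import subprocess
from pathlib import Path

# Repository URL as given in step 1 above.
REPO_URL = "https://github.com/fishaudio/fish-speech"


def build_clone_command(dest: Path) -> list[str]:
    """Return the git command that clones the repository into dest."""
    return ["git", "clone", REPO_URL, str(dest)]


def clone_repo(dest: Path) -> Path:
    """Clone the repository unless dest already exists, then return dest."""
    if shutil.which("git") is None:
        raise RuntimeError("git was not found on PATH")
    if not dest.exists():
        # check=True raises CalledProcessError if git exits with an error.
        subprocess.run(build_clone_command(dest), check=True)
    return dest
```

Calling clone_repo(Path("fish-speech")) performs the clone once and is a no-op on later runs, which makes the step safe to repeat in a setup script.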

For optimal performance, consider using cloud-based GPUs, such as those available from Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure.
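
When choosing a GPU, a rough lower bound on memory can be estimated from the parameter count. The sketch below assumes about 3 billion parameters (inferred only from the "3B" in the model name) and counts the weights alone; activations, caches, and any audio codec components are not included, so real usage will be higher.

```python
# Rough estimate of GPU memory needed just to hold the model weights.
# Assumption: about 3 billion parameters, taken from the "3B" in the name.
ASSUMED_PARAMETERS = 3_000_000_000

# Bytes used per parameter for common numeric formats.
BYTES_PER_PARAMETER = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(parameters: int, dtype: str) -> float:
    """Return the weight footprint in GiB for the given dtype."""
    return parameters * BYTES_PER_PARAMETER[dtype] / 2**30


for dtype in BYTES_PER_PARAMETER:
    print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAMETERS, dtype):.1f} GiB")
```

Under these assumptions the weights occupy about 5.6 GiB in bfloat16, so a GPU with comfortably more memory than that is a sensible starting point.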

License

This model and its associated code are released under the Creative Commons BY-NC-SA 4.0 license, which permits non-commercial use with proper attribution and requires derivative works to be shared under the same terms.