OuteTTS-0.2-500M

OuteAI

Introduction

OuteTTS-0.2-500M is a text-to-speech (TTS) model designed to produce natural, coherent speech. It builds on the Qwen-2.5-0.5B foundation model, with improvements that include multilingual support for English, Chinese, Japanese, and Korean.

Architecture

The model uses the Qwen-2.5-0.5B architecture, with roughly 500 million parameters. It is designed to work efficiently with a range of audio prompts while keeping speech synthesis accurate.
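
As a rough guide to what the parameter count means for local hardware, the sketch below estimates the memory needed just to hold the weights at common numeric precisions. The 500 million figure comes from the description above; the function name `weight_memory_gib` is hypothetical, and the estimate deliberately ignores activations, caches, and any audio codec, so real usage will be higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str = "float16") -> float:
    """Return the approximate size of the model weights in GiB (weights only)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unsupported dtype: {dtype}")
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # 500 million parameters, as stated for OuteTTS-0.2-500M.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(500_000_000, dtype):.2f} GiB")
```

At 16-bit precision the weights alone come to a little under 1 GiB, which is why a model of this size can plausibly run on modest hardware.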

Training

OuteTTS-0.2-500M was trained on diverse datasets such as Emilia-Dataset, LibriTTS-R, and Multilingual LibriSpeech (MLS) to improve its accuracy, naturalness, and multilingual capabilities. The training process benefited from a GPU grant provided by Hugging Face.

Guide: Running Locally

  1. Installation:

    • Install the main package via pip:
      pip install outetts --upgrade
      
    • If using GGUF or EXL2 support, follow the respective installation guides.

  2. Configuration:

    • Use the provided Python code to configure and initialize the model:
      import outetts
      
      model_config = outetts.HFModelConfig_v1(
          model_path="OuteAI/OuteTTS-0.2-500M",
          language="en"
      )
      
      interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
      
  3. Generate Speech:

    • Load a speaker and generate speech with custom settings:
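      # Load a default speaker profile by name
      speaker = interface.load_default_speaker(name="male_1")

      # Generate speech from text using the chosen speaker and sampling settings
      output = interface.generate(
          text="Speech synthesis is the artificial production of human speech.",
          temperature=0.1,
          repetition_penalty=1.1,
          max_length=4096,
          speaker=speaker
      )

      # Write the generated audio to a WAV file
      output.save("output.wav")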
      
  4. Cloud GPUs:

    • For better performance, consider using cloud-based GPU services such as AWS, Google Cloud, or Azure to handle intensive computations.
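
Steps 2 and 3 of the guide can be combined into a small reusable helper. The sketch below is one possible arrangement, not part of the library's own API: `build_interface` and `synthesize_to_file` are hypothetical names, and only the `outetts` calls that already appear in the guide are used. `outetts` is imported lazily so that the helper can be read and exercised without the package installed.

```python
from pathlib import Path


def build_interface(language="en"):
    """Create the OuteTTS interface exactly as in step 2 of the guide."""
    import outetts  # lazy import: only needed when a real model is loaded

    model_config = outetts.HFModelConfig_v1(
        model_path="OuteAI/OuteTTS-0.2-500M",
        language=language,
    )
    return outetts.InterfaceHF(model_version="0.2", cfg=model_config)


def synthesize_to_file(interface, text, out_path, speaker_name="male_1",
                       temperature=0.1, repetition_penalty=1.1,
                       max_length=4096):
    """Generate speech for `text`, save it to `out_path`, and return the path."""
    if not text.strip():
        raise ValueError("text must not be empty")

    # Same calls as step 3 of the guide, with the settings exposed as arguments.
    speaker = interface.load_default_speaker(name=speaker_name)
    output = interface.generate(
        text=text,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        max_length=max_length,
        speaker=speaker,
    )

    # Make sure the target directory exists before saving.
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    output.save(str(out_path))
    return out_path
```

With `outetts` installed, calling `synthesize_to_file(build_interface(), "Speech synthesis is the artificial production of human speech.", "output.wav")` should behave like the guide's example. Passing the interface in as an argument means the model is loaded once and can be reused for several calls.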

License

OuteTTS-0.2-500M is distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This permits sharing and adaptation with attribution, but not commercial use.
