Introduction

Mini-Omni is an open-source multimodal large language model developed by the gpt-omni team and released on Hugging Face. It processes speech input and produces streaming audio output in real time, enabling end-to-end spoken conversation without separate Automatic Speech Recognition (ASR) or Text-to-Speech (TTS) models. The model can talk while thinking, generating text and audio simultaneously, and supports both audio-to-text and audio-to-audio batch inference to further boost performance.

Architecture

Mini-Omni is built on the Qwen2-0.5B base model and combines several open components; a structure-only sketch of how they fit together follows the list:

  • Qwen2 serves as the language model backbone.
  • litGPT handles training and inference.
  • Whisper encodes the audio input.
  • SNAC decodes the audio output.
  • CosyVoice is used to generate synthetic speech.
  • OpenOrca and MOSS are used for model alignment.
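
The listing below is a structure-only Python sketch of how these components relate at inference time. Every class and method name in it is an illustrative assumption, not Mini-Omni's actual code; it exists only to show the streaming shape of the pipeline, in which the backbone emits text and audio codec tokens step by step so playback can begin before the full reply is finished.

    # Structure-only sketch: names are illustrative assumptions, not Mini-Omni's API.
    import numpy as np

    class AudioEncoder:                          # role played by Whisper
        def encode(self, waveform: np.ndarray) -> np.ndarray:
            return np.zeros((1, 512), dtype=np.float32)    # placeholder feature frames

    class LanguageBackbone:                      # role played by Qwen2-0.5B
        def generate(self, features: np.ndarray):
            # Yields one text piece plus matching audio codec tokens per decode step.
            yield "hello", [17, 42, 99]

    class AudioDecoder:                          # role played by SNAC
        def decode(self, codec_tokens: list) -> np.ndarray:
            return np.zeros(240, dtype=np.float32)         # placeholder PCM chunk

    def respond(waveform: np.ndarray):
        """Stream (text_piece, audio_chunk) pairs: 'talking while thinking'."""
        features = AudioEncoder().encode(waveform)
        decoder = AudioDecoder()
        for text_piece, codec_tokens in LanguageBackbone().generate(features):
            yield text_piece, decoder.decode(codec_tokens)

    for text_piece, audio_chunk in respond(np.zeros(16000, dtype=np.float32)):
        print(text_piece, audio_chunk.shape)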

Training

Training and inference are built on the litGPT framework, which is designed to handle large language models efficiently. Detailed training methodology and configurations are available in the Mini-Omni GitHub repository.
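
For orientation, here is a minimal sketch of litGPT's high-level Python API on a small public checkpoint. It is not Mini-Omni's training entry point (that lives in the repository); it only illustrates the framework the project builds on, and the checkpoint name is chosen purely for the example.

    # Minimal litGPT usage sketch; not Mini-Omni's own training or inference code.
    from litgpt import LLM

    llm = LLM.load("EleutherAI/pythia-160m")     # small checkpoint, illustration only
    print(llm.generate("Speech language models can", max_new_tokens=20))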

Guide: Running Locally

  1. Set Up Environment

    • Create a new conda environment and activate it:
      conda create -n omni python=3.10
      conda activate omni
      
  2. Clone Repository and Install Dependencies

    • Clone the Mini-Omni repository and install required packages:
      git clone https://github.com/gpt-omni/mini-omni.git
      cd mini-omni
      pip install -r requirements.txt
      
  3. Start Server

    • Launch the server that hosts the interactive demo (a connectivity check for this endpoint is sketched at the end of this guide):
      python3 server.py --ip '0.0.0.0' --port 60808
      
  4. Run Demos

    • Streamlit Demo: Requires PyAudio for local execution.
      pip install PyAudio==0.2.14
      API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
      
    • Gradio Demo:
      API_URL=http://0.0.0.0:60808/chat python3 webui/omni_gradio.py
      
  5. Local Testing

    • Run inference on the preset audio samples:
      python inference.py
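
Before launching either demo, the sketch below can verify that the server from step 3 is reachable. It assumes nothing about the /chat request schema (that is defined by server.py in the repository); it only parses API_URL and opens a TCP connection to the host and port.

    # Connectivity check for the demo server; assumes only the URL, not the schema.
    import os
    import socket
    from urllib.parse import urlparse

    api_url = os.environ.get("API_URL", "http://0.0.0.0:60808/chat")
    parsed = urlparse(api_url)
    host, port = parsed.hostname, parsed.port or 80

    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"server reachable at {host}:{port}; demos can use API_URL={api_url}")
    except OSError as exc:
        print(f"cannot reach {host}:{port} ({exc}); is server.py running?")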
      

For better performance, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure.

License

Mini-Omni is released under the MIT License, allowing for extensive use, modification, and distribution.
