Introduction

Mini-Omni2 is an omni-interactive model that understands image, audio, and text inputs. It supports real-time voice conversations, and a response in progress can be interrupted and the interaction continued.

Architecture

Mini-Omni2 processes its inputs and outputs as multiple parallel sequences. On the input side, image, audio, and text features are concatenated into a single sequence so that a task can draw on all three modalities at once. On the output side, the model uses text-guided delayed parallel decoding: audio tokens are generated in parallel streams that trail the text stream, which allows speech to be produced in real time.
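
As a rough illustration of this layout, the sketch below concatenates image, audio, and text embeddings into one input sequence and arranges output targets so that each audio-token stream trails the text stream by a fixed delay. All shapes, the number of audio streams, and the delay value are hypothetical placeholders, not the model's actual configuration, which is defined in the repository.

  import torch

  # Hypothetical sizes; the real values come from the Mini-Omni2 configuration.
  D_MODEL = 512          # shared hidden size after each modality-specific adapter
  N_AUDIO_STREAMS = 7    # number of parallel audio-token streams (assumption)
  DELAY = 1              # steps each audio stream lags behind the previous one

  def build_input_sequence(img_feats, audio_feats, text_embeds):
      """Concatenate per-modality features along the time axis into one sequence."""
      return torch.cat([img_feats, audio_feats, text_embeds], dim=0)

  def delayed_parallel_targets(text_ids, audio_codes, pad_id=0):
      """Arrange targets so audio streams trail the text stream.

      text_ids:    (T,)                   text tokens guiding the response
      audio_codes: (N_AUDIO_STREAMS, T)   one row of audio tokens per stream
      Returns a (1 + N_AUDIO_STREAMS, T + N_AUDIO_STREAMS * DELAY) matrix:
      row 0 is text, row k is audio stream k shifted right by k * DELAY steps.
      """
      T = text_ids.shape[0]
      total = T + N_AUDIO_STREAMS * DELAY
      out = torch.full((1 + N_AUDIO_STREAMS, total), pad_id, dtype=torch.long)
      out[0, :T] = text_ids
      for k in range(N_AUDIO_STREAMS):
          shift = (k + 1) * DELAY
          out[k + 1, shift:shift + T] = audio_codes[k]
      return out

  # Toy example: 16 image tokens, 50 audio frames, 12 text tokens.
  seq = build_input_sequence(torch.randn(16, D_MODEL),
                             torch.randn(50, D_MODEL),
                             torch.randn(12, D_MODEL))
  targets = delayed_parallel_targets(torch.arange(1, 13),
                                     torch.randint(1, 100, (N_AUDIO_STREAMS, 12)))
  print(seq.shape, targets.shape)  # torch.Size([78, 512]) torch.Size([8, 19])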

Training

Training is divided into three stages: encoder adaptation, modal alignment, and multimodal fine-tuning. Staging the process this way lets the model absorb each modality step by step before everything is fine-tuned jointly, improving how it handles diverse inputs.
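
One way to picture such a staged schedule is sketched below: a table of which components train in each stage, applied by freezing everything else. The module names and the freeze/unfreeze choices per stage are illustrative assumptions, not the repository's actual recipe.

  import torch.nn as nn

  # Hypothetical schedule: which components train in each stage is an
  # illustrative assumption, not the official Mini-Omni2 recipe.
  STAGES = {
      "encoder_adaptation":  ["vision_adapter", "audio_adapter"],
      "modal_alignment":     ["vision_adapter", "audio_adapter", "language_model"],
      "multimodal_finetune": ["vision_adapter", "audio_adapter",
                              "language_model", "audio_decoder"],
  }

  def set_trainable(parts: dict[str, nn.Module], stage: str) -> None:
      """Enable gradients only for the parts trained in this stage; freeze the rest."""
      trainable = set(STAGES[stage])
      for name, module in parts.items():
          module.requires_grad_(name in trainable)

  # Toy stand-ins for the real components.
  parts = {name: nn.Linear(8, 8) for name in
           ["vision_adapter", "audio_adapter", "language_model", "audio_decoder"]}
  for stage in STAGES:
      set_trainable(parts, stage)
      active = [n for n, m in parts.items() if next(m.parameters()).requires_grad]
      print(f"{stage}: training {active}")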

Guide: Running Locally

  1. Set Up Environment

    • Create a new Conda environment:
      conda create -n omni python=3.10
      conda activate omni
      
    • Clone the repository:
      git clone https://github.com/gpt-omni/mini-omni2.git
      cd mini-omni2
      
    • Install dependencies:
      pip install -r requirements.txt
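
    • Optional: verify the environment before continuing. This is a generic check, not
      a script from the repository, and it assumes PyTorch is pulled in by
      requirements.txt:
      # check_env.py -- hypothetical helper
      import torch
      print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())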
      
  2. Start Server

    • Install FFmpeg and start the server:
      sudo apt-get install ffmpeg
      conda activate omni
      python3 server.py --ip '0.0.0.0' --port 60808
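
    • Optional: confirm the server is accepting connections before launching the UI.
      This is a generic TCP probe, not a script from the repository; the port matches
      the command above:
      # check_server.py -- hypothetical helper
      import socket
      with socket.create_connection(("127.0.0.1", 60808), timeout=5):
          print("server is reachable on port 60808")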
      
  3. Run Streamlit Demo

    • Ensure PyAudio is installed and run the demo:
      pip install PyAudio==0.2.14
      API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
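
    • Optional: if the demo cannot find a microphone, list the available input devices
      with PyAudio. This is a generic snippet, not part of the repository:
      # check_mic.py -- hypothetical helper
      import pyaudio
      pa = pyaudio.PyAudio()
      inputs = [pa.get_device_info_by_index(i)["name"]
                for i in range(pa.get_device_count())
                if pa.get_device_info_by_index(i)["maxInputChannels"] > 0]
      print("input devices:", inputs or "none found")
      pa.terminate()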
      
  4. Local Testing

    • Test with preset audio samples and questions:
      conda activate omni
      python inference_vision.py
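
    • Optional: to try your own recording instead of the presets, convert it to mono
      16 kHz WAV first. The 16 kHz mono format is an assumption (a common choice for
      speech models); the exact format the scripts expect is defined in the repository,
      and my_question.m4a is only a placeholder:
      # prepare_sample.py -- hypothetical helper wrapping the ffmpeg installed in step 2
      import subprocess
      subprocess.run(["ffmpeg", "-y", "-i", "my_question.m4a",
                      "-ar", "16000", "-ac", "1", "my_question.wav"], check=True)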
      

Cloud GPUs: For enhanced performance, consider using cloud GPU services like AWS, Google Cloud, or Azure.

License

Mini-Omni2 is distributed under the MIT License, which permits free use, modification, and redistribution.
