Mini-Omni2
Introduction
Mini-Omni2 is an omni-interactive model capable of understanding image, audio, and text inputs. It supports real-time voice conversations with multimodal understanding, and users can interrupt an ongoing response and continue the interaction.
Architecture
Mini-Omni2 processes inputs and outputs as multiple parallel sequences. Image, audio, and text features are concatenated into a single input sequence for the language model, and outputs are produced with text-guided delayed parallel decoding so that speech can be streamed in real time alongside the text response.
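The flow can be pictured with a short sketch. This is illustrative only: the tensor shapes, feature dimensions, and the delay schedule below are assumptions for exposition, not the released Mini-Omni2 code.

import torch

# Illustrative sketch only: shapes and the delay schedule are assumptions,
# not the actual Mini-Omni2 implementation.
hidden = 768
img_feats = torch.randn(1, 196, hidden)  # patch features from a vision encoder
aud_feats = torch.randn(1, 50, hidden)   # frame features from an audio encoder
txt_feats = torch.randn(1, 12, hidden)   # embedded text prompt tokens

# Concatenate the modality features into a single input sequence.
inputs = torch.cat([img_feats, aud_feats, txt_feats], dim=1)  # (1, 258, hidden)

# Text-guided delayed parallel output: each decoding step emits one text token
# plus several audio-codec tokens, with the audio streams offset by a small
# delay so the text leads and guides the speech that is streamed out.
n_audio_streams = 7
for step in range(10):
    text_token = f"t{step}"  # placeholder for the sampled text token
    audio_tokens = [
        # assumed schedule: each audio stream starts one step later than the
        # previous one; None stands for padding before that stream begins
        f"a{layer}_{step - (layer + 1)}" if step > layer else None
        for layer in range(n_audio_streams)
    ]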
Training
The training process for Mini-Omni2 is divided into three stages: encoder adaptation, modal alignment, and multimodal fine-tuning. Training in stages lets the model incorporate the vision and audio modalities incrementally, which makes it more efficient at handling diverse inputs.
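As a rough illustration of what staged training looks like in code, the sketch below freezes and unfreezes parameter groups per stage. The module names and the choice of what is trainable at each stage are assumptions for illustration, not the published recipe.

import torch
from torch import nn

# Toy stand-in for the model: adapters for the new modalities plus a language
# model. Module names and stage-wise freezing choices are illustrative assumptions.
class OmniToy(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.vision_adapter = nn.Linear(hidden, hidden)
        self.audio_adapter = nn.Linear(hidden, hidden)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=2,
        )

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = OmniToy()

# Stage 1, encoder adaptation: train only the modality adapters, keep the LM frozen.
set_trainable(model, False)
set_trainable(model.vision_adapter, True)
set_trainable(model.audio_adapter, True)

# Stage 2, modal alignment: keep training the adapters so image/audio features
# line up with the language model's embedding space (LM still frozen here).
# Stage 3, multimodal fine-tuning: unfreeze everything and fine-tune jointly.
set_trainable(model, True)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)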
Guide: Running Locally
Set Up Environment
- Create a new Conda environment:
conda create -n omni python=3.10
conda activate omni
- Clone the repository:
git clone https://github.com/gpt-omni/mini-omni2.git
cd mini-omni2
- Install dependencies:
pip install -r requirements.txt
Start Server
- Install FFmpeg and start the server:
sudo apt-get install ffmpeg
conda activate omni
python3 server.py --ip '0.0.0.0' --port 60808
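Before moving on to the demo, it can help to confirm that the server is actually listening. The snippet below is just a generic TCP probe against the port used above; it is not part of the Mini-Omni2 API.

import socket

def server_is_up(host: str = "127.0.0.1", port: int = 60808, timeout: float = 2.0) -> bool:
    # Returns True if something accepts a TCP connection on host:port.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("server reachable" if server_is_up() else "server not reachable yet")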
Run Streamlit Demo
- Ensure PyAudio is installed and run the demo:
pip install PyAudio==0.2.14
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
Local Testing
- Test with preset audio samples and questions:
conda activate omni
python inference_vision.py
Cloud GPUs: For enhanced performance, consider using cloud GPU services like AWS, Google Cloud, or Azure.
License
Mini-Omni2 is distributed under the MIT License, which permits free use, modification, and redistribution.