OmniAudio-2.6B
Introduction
OmniAudio-2.6B, developed by NexaAIDev, is a 2.6 billion-parameter multimodal model designed for efficient audio and text processing directly on edge devices. It combines Gemma-2-2b, Whisper turbo, and a custom projector module to provide seamless audio-text capabilities with minimal latency.
Architecture
OmniAudio-2.6B integrates various modalities into a single architecture that unifies ASR and LLM capabilities, reducing latency and resource overhead compared to traditional models. It supports robust on-device processing, making it suitable for real-time applications.
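The unified design described above can be sketched as a simple dataflow: an audio encoder produces embeddings, a projector maps them into the language model's embedding space, and a single decoder consumes the mixed audio/text sequence. This is a minimal illustrative sketch only; the class names, shapes, and placeholder logic below are assumptions, not the actual OmniAudio implementation.

```python
class AudioEncoder:
    """Stands in for the Whisper-turbo encoder: raw audio -> audio embeddings."""
    def encode(self, audio_samples):
        # A real encoder emits high-dimensional frames; here we fake one
        # "embedding" per 160 samples (10 ms at 16 kHz) for illustration.
        return [sum(audio_samples[i:i + 160])
                for i in range(0, len(audio_samples), 160)]

class Projector:
    """Maps audio embeddings into the LLM's token-embedding space."""
    def project(self, audio_embeddings):
        return [("<audio_embed>", e) for e in audio_embeddings]

class TextDecoder:
    """Stands in for Gemma-2-2b: consumes a mixed audio/text sequence."""
    def generate(self, mixed_sequence):
        n_audio = sum(1 for tag, _ in mixed_sequence if tag == "<audio_embed>")
        return f"(response conditioned on {n_audio} audio frames)"

def run_pipeline(audio_samples, prompt_tokens):
    # Audio embeddings are spliced ahead of the text prompt, so one
    # decoder pass handles both modalities -- the source of the latency
    # savings over a separate ASR-then-LLM cascade.
    enc = AudioEncoder().encode(audio_samples)
    proj = Projector().project(enc)
    mixed = proj + [("<text>", t) for t in prompt_tokens]
    return TextDecoder().generate(mixed)
```

Because the projector output lives in the same embedding space as text tokens, no intermediate transcription step is needed, which is what removes the latency of a traditional ASR-plus-LLM pipeline.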
Training
The model's training involves three key stages:
- Pretraining: Focuses on audio-text alignment using the MLS English 10k transcription dataset, incorporating a special `<|transcribe|>` token to distinguish between transcription and completion tasks.
- Supervised Fine-tuning (SFT): Enhances conversational abilities with synthetic datasets, generating contextually appropriate responses and rich audio-text pairs.
- Direct Preference Optimization (DPO): Refines model quality using the GPT-4o API, correcting inaccuracies and ensuring semantic alignment, with Gemma2's text responses as a benchmark.
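The role of the special token in the pretraining stage can be illustrated with a small prompt-assembly sketch. The template layout, the `<audio>` placeholder, and the function name here are illustrative assumptions; only the `<|transcribe|>` token itself comes from the model card.

```python
TRANSCRIBE_TOKEN = "<|transcribe|>"

def build_prompt(audio_placeholder, task, user_text=""):
    """Assemble a training prompt (illustrative layout, not the exact
    template used for OmniAudio pretraining)."""
    if task == "transcription":
        # The special token signals "transcribe this audio" rather than
        # "continue the conversation".
        return f"{audio_placeholder} {TRANSCRIBE_TOKEN}"
    if task == "completion":
        # Without the token, the model treats the audio as conversational
        # input and generates a free-form response.
        return f"{audio_placeholder} {user_text}".strip()
    raise ValueError(f"unknown task: {task!r}")
```

Conditioning both behaviors on a single token lets one model serve as both an ASR system and a conversational assistant without separate heads.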
Guide: Running Locally
- Install Nexa-SDK: Download and install the Nexa-SDK to enable local on-device inference.
- Run OmniAudio: Execute `nexa run omniaudio -st` in your terminal. The q4_K_M version requires 1.30GB RAM and 1.60GB storage.
- Cloud GPU Recommendation: For enhanced performance, consider cloud GPUs from providers such as AWS, Google Cloud, or Azure.
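Before downloading the model, it can be useful to confirm the machine meets the requirements quoted above. This is a small sketch using only the figures from the guide; the function name and check logic are illustrative, not part of Nexa-SDK.

```python
import shutil

# Published requirements for the q4_K_M build (from the guide above).
REQUIRED_RAM_GB = 1.30
REQUIRED_DISK_GB = 1.60

def enough_disk(path=".", required_gb=REQUIRED_DISK_GB):
    """Return True if `path` has at least `required_gb` of free space
    for the model weights."""
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gb >= required_gb
```

If the check fails locally, that is the point at which the cloud GPU option above becomes the practical route.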
License
OmniAudio-2.6B is released under the Apache-2.0 license, allowing for use, modification, and distribution under specified conditions.