OmniAudio-2.6B

NexaAIDev

Introduction

OmniAudio-2.6B is a 2.6-billion-parameter multimodal model designed for efficient audio and text processing directly on edge devices. It combines Gemma-2-2b, Whisper-turbo, and a custom projector module to provide seamless audio-text capabilities with minimal latency.

Architecture

OmniAudio-2.6B integrates various modalities into a single architecture that unifies ASR and LLM capabilities, reducing latency and resource overhead compared to traditional models. It supports robust on-device processing, making it suitable for real-time applications.

Training

The model's training involves three key stages:

  • Pretraining: Focuses on audio-text alignment using the MLS English 10k transcription dataset, incorporating a special <|transcribe|> token to distinguish between transcription and completion tasks.
  • Supervised Fine-tuning (SFT): Enhances conversational abilities with synthetic datasets, generating contextually appropriate responses and rich audio-text pairs.
  • Direct Preference Optimization (DPO): Refines model quality using the GPT-4o API to correct inaccuracies, using Gemma-2's text responses as the reference for semantic alignment.
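The role of the special <|transcribe|> token can be illustrated with a short sketch. The token name comes from the description above; the prompt template, placeholder string, and helper function are assumptions for illustration only, not the model's actual preprocessing code.

```python
# Hypothetical sketch: routing between transcription and completion tasks
# with the <|transcribe|> token mentioned in the pretraining stage.
# The "<audio>" placeholder stands in for projected audio embeddings.

TRANSCRIBE_TOKEN = "<|transcribe|>"

def build_prompt(audio_placeholder: str, task: str) -> str:
    """Prefix the audio input with a task marker (illustrative template)."""
    if task == "transcribe":
        # Transcription: the model should emit the spoken words verbatim.
        return f"{audio_placeholder}{TRANSCRIBE_TOKEN}"
    # Completion: no marker, the model continues from the audio content.
    return audio_placeholder

print(build_prompt("<audio>", "transcribe"))  # <audio><|transcribe|>
print(build_prompt("<audio>", "complete"))    # <audio>
```

The point of the marker is that a single model can serve both ASR and conversational completion without separate heads: the task is disambiguated purely in the input sequence.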

Guide: Running Locally

  1. Install Nexa-SDK:
    Download and install the Nexa-SDK to enable local on-device inference.
  2. Run OmniAudio:
    Execute the command nexa run omniaudio -st in your terminal. The q4_K_M version requires 1.30 GB of RAM and 1.60 GB of storage.
  3. Cloud GPU Recommendation:
    For enhanced performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure to run the model efficiently.
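The steps above can be wrapped in a small launcher that checks the q4_K_M storage requirement before invoking the CLI. The 1.60 GB threshold comes from the guide; the helper function and its use of the standard library are an illustrative sketch, not part of the Nexa-SDK.

```python
# Sketch: check free disk space against the q4_K_M requirement (~1.60 GB)
# before launching the command from the guide. Helper is illustrative.
import shutil
import subprocess

DISK_GB_NEEDED = 1.60  # storage requirement stated in the guide

def enough_disk(path: str = ".", needed_gb: float = DISK_GB_NEEDED) -> bool:
    """Return True if the filesystem at `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb

if __name__ == "__main__":
    if enough_disk():
        # -st flag as given in the guide above
        subprocess.run(["nexa", "run", "omniaudio", "-st"], check=True)
    else:
        print(f"Need at least {DISK_GB_NEEDED} GB free to download the model.")
```

A check like this is useful on edge devices, where the guide's RAM and storage figures are close to the hardware's actual limits.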

License

OmniAudio-2.6B is released under the Apache-2.0 license, allowing for use, modification, and distribution under specified conditions.