Ultravox v0_4_1 Mistral Nemo
Introduction
Ultravox is a multimodal speech LLM developed by Fixie.ai. It accepts both speech and text inputs, making it suitable for applications such as voice agents and speech-to-speech translation. The model pairs a Mistral-Nemo-Instruct-2407 language-model backbone with a Whisper audio encoder, allowing it to generate text outputs directly from audio inputs.
Architecture
Ultravox combines a pretrained Mistral-Nemo-Instruct-2407 language model with a whisper-large-v3-turbo audio encoder. Input text may contain a special <|audio|> pseudo-token, which the model replaces with embeddings derived from the input audio, so a single forward pass handles multimodal input. Future revisions aim to expand the vocabulary to generate semantic and acoustic tokens for voice output.
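As a rough sketch of that splicing step (not the project's actual code; the tensor names, shapes, and the assumption of a single <|audio|> occurrence are illustrative), the adapter's audio embeddings are inserted where the placeholder token sits in the embedded prompt:

```python
import torch

def splice_audio_embeddings(
    text_embeds: torch.Tensor,   # (seq_len, hidden) embedded prompt tokens
    input_ids: torch.Tensor,     # (seq_len,) token ids for the prompt
    audio_embeds: torch.Tensor,  # (num_audio_frames, hidden) adapter output
    audio_token_id: int,         # id of the special <|audio|> token
) -> torch.Tensor:
    """Replace the single <|audio|> placeholder with the audio embeddings."""
    # Find the placeholder position (assumes exactly one occurrence)
    pos = (input_ids == audio_token_id).nonzero(as_tuple=True)[0].item()
    # Concatenate: text before, audio embeddings, text after
    return torch.cat(
        [text_embeds[:pos], audio_embeds, text_embeds[pos + 1:]], dim=0
    )
```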
Training
Training uses a knowledge-distillation loss that aligns Ultravox's outputs with those of the text-based Mistral backbone. The training data is a mix of Automatic Speech Recognition (ASR) and speech-translation datasets. Only the multimodal adapter is trained; the Whisper encoder and the Mistral backbone remain frozen. Training ran in BF16 mixed precision on 8x H100 GPUs.
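As a minimal sketch of that kind of objective (assuming a standard temperature-scaled KL divergence between the audio-conditioned student logits and the frozen text-based teacher's logits; Ultravox's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,  # (batch, seq_len, vocab) from audio input
    teacher_logits: torch.Tensor,  # (batch, seq_len, vocab) from text input
    temperature: float = 1.0,      # softening temperature (assumed hyperparameter)
) -> torch.Tensor:
    """KL divergence pushing the audio-conditioned student toward the text teacher."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as is conventional for distillation
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```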
Guide: Running Locally
- Installation: install the required libraries:

```bash
pip install transformers peft librosa
```
- Set up the model and run inference on an audio file:
```python
import transformers
import librosa

# Load the Ultravox pipeline (custom code, hence trust_remote_code=True)
pipe = transformers.pipeline(
    model='fixie-ai/ultravox-v0_4_1-mistral-nemo',
    trust_remote_code=True,
)

path = "<path-to-input-audio>"  # Provide the path to your audio file
audio, sr = librosa.load(path, sr=16000)  # Whisper expects 16 kHz audio

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
output = pipe(
    {'audio': audio, 'turns': turns, 'sampling_rate': sr},
    max_new_tokens=30,
)
```
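The pipeline returns the model's generated text response to the audio. A quick sanity check (assuming the default return format of this custom pipeline):

```python
print(output)  # the model's text reply to the spoken input
```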
- Hardware suggestion: for faster inference, consider a cloud GPU such as an A100 or H100.
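To place the model on a GPU at load time, the standard `transformers.pipeline` arguments `device` and `torch_dtype` should work here (a sketch; bfloat16 matches the BF16 precision used during training):

```python
import torch
import transformers

pipe = transformers.pipeline(
    model='fixie-ai/ultravox-v0_4_1-mistral-nemo',
    trust_remote_code=True,
    device=0,                    # first CUDA GPU
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
)
```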
License
Ultravox is licensed under the MIT License, allowing extensive use and modification of the software.