Ultravox v0_4_1 Mistral Nemo
Introduction
Ultravox is a multimodal speech LLM developed by Fixie.ai. It accepts both speech and text inputs, making it suitable for applications such as voice agents and speech-to-speech translation. The model pairs a Mistral-Nemo-Instruct-2407 language-model backbone with a Whisper audio encoder, allowing it to generate text outputs directly from audio inputs.
Architecture
Ultravox combines a pretrained Mistral-Nemo-Instruct-2407 language model with a whisper-large-v3-turbo audio encoder. Input text may contain a special <|audio|> pseudo-token, which the model replaces with embeddings derived from the input audio, so a single forward pass handles multimodal input. Future revisions aim to expand the vocabulary to generate semantic and acoustic tokens for voice output.
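As a rough sketch of that splicing step (not the project's actual code; the tensor names, shapes, and the assumption of a single <|audio|> occurrence are illustrative), the adapter's audio embeddings are inserted where the placeholder token sits in the embedded prompt:

```python
import torch

def splice_audio_embeddings(
    text_embeds: torch.Tensor,   # (seq_len, hidden) embedded prompt tokens
    input_ids: torch.Tensor,     # (seq_len,) token ids for the prompt
    audio_embeds: torch.Tensor,  # (num_audio_frames, hidden) adapter output
    audio_token_id: int,         # id of the special <|audio|> token
) -> torch.Tensor:
    """Replace the single <|audio|> placeholder with the audio embeddings."""
    # Find the placeholder position (assumes exactly one occurrence)
    pos = (input_ids == audio_token_id).nonzero(as_tuple=True)[0].item()
    # Concatenate: text before, audio embeddings, text after
    return torch.cat(
        [text_embeds[:pos], audio_embeds, text_embeds[pos + 1:]], dim=0
    )
```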
Training
Training uses a knowledge-distillation loss that aligns Ultravox's outputs with those of the text-based Mistral backbone. The training data is a mix of Automatic Speech Recognition (ASR) and speech-translation datasets. Only the multimodal adapter is trained; the Whisper encoder and the Mistral backbone remain frozen. Training ran in BF16 mixed precision on 8x H100 GPUs.
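As a minimal sketch of that kind of objective (assuming a standard temperature-scaled KL divergence between the audio-conditioned student logits and the frozen text-based teacher's logits; Ultravox's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,  # (batch, seq_len, vocab) from audio input
    teacher_logits: torch.Tensor,  # (batch, seq_len, vocab) from text input
    temperature: float = 1.0,      # softening temperature (assumed hyperparameter)
) -> torch.Tensor:
    """KL divergence pushing the audio-conditioned student toward the text teacher."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as is conventional for distillation
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```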
Guide: Running Locally
- Installation: install the required libraries:

```bash
pip install transformers peft librosa
```
- Set up the model and run inference on an audio file:
```python
import transformers
import librosa

# Load the Ultravox pipeline (custom code, hence trust_remote_code=True)
pipe = transformers.pipeline(
    model='fixie-ai/ultravox-v0_4_1-mistral-nemo',
    trust_remote_code=True,
)

path = "<path-to-input-audio>"  # Provide the path to your audio file
audio, sr = librosa.load(path, sr=16000)  # Whisper expects 16 kHz audio

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
output = pipe(
    {'audio': audio, 'turns': turns, 'sampling_rate': sr},
    max_new_tokens=30,
)
```
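The pipeline returns the model's generated text response to the audio. A quick sanity check (assuming the default return format of this custom pipeline):

```python
print(output)  # the model's text reply to the spoken input
```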
- Hardware suggestion: for faster inference, consider a cloud GPU such as an A100 or H100.
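To place the model on a GPU at load time, the standard `transformers.pipeline` arguments `device` and `torch_dtype` should work here (a sketch; bfloat16 matches the BF16 precision used during training):

```python
import torch
import transformers

pipe = transformers.pipeline(
    model='fixie-ai/ultravox-v0_4_1-mistral-nemo',
    trust_remote_code=True,
    device=0,                    # first CUDA GPU
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
)
```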
License
Ultravox is licensed under the MIT License, allowing extensive use and modification of the software.