Ultravox v0.4.1 (Llama 3.1 8B) Model Documentation
fixie-ai/ultravox-v0_4_1-llama-3_1-8b
Introduction
Ultravox is a multimodal speech large language model (LLM) developed by Fixie.ai. It combines a pretrained Llama 3.1 8B Instruct backbone with a Whisper-large-v3-turbo audio encoder to process both text and speech inputs, and is designed for applications such as voice agents and speech-to-speech translation.
Architecture
Ultravox pairs the Llama 3.1 8B Instruct backbone with the encoder portion of Whisper-large-v3-turbo. The model accepts a text prompt containing a special <|audio|> pseudo-token, which is replaced with embeddings derived from the input audio. Future revisions aim to support generation of semantic and acoustic audio tokens.
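As an illustration of this splice step, the minimal sketch below shows how the single <|audio|> position in an embedded prompt could be replaced by a variable-length sequence of projected audio embeddings. The function and variable names are hypothetical, not taken from the Ultravox codebase.

```python
import torch

def splice_audio_embeddings(text_embeds, audio_embeds, audio_token_pos):
    """Replace the single <|audio|> embedding at audio_token_pos with the
    projected audio embeddings, yielding the input sequence for the LLM.

    text_embeds:  [seq_len, hidden_dim] embedded prompt tokens
    audio_embeds: [n_audio, hidden_dim] Whisper encoder states after the
                  multimodal projector (names here are illustrative)
    """
    return torch.cat(
        [
            text_embeds[:audio_token_pos],      # prompt tokens before <|audio|>
            audio_embeds,                       # audio-derived embeddings
            text_embeds[audio_token_pos + 1:],  # prompt tokens after <|audio|>
        ],
        dim=0,
    )
```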
Training
Only the multimodal adapter of Ultravox is trained; the Whisper encoder and the Llama backbone remain frozen. Training uses a knowledge-distillation loss that aligns the speech-conditioned model's outputs with the logits of the text-based Llama backbone, as sketched below. The training data comprises ASR and speech-translation datasets, and training is run as supervised speech instruction finetuning with BF16 mixed precision on 8x H100 GPUs. On an A100-40GB GPU, the model achieves a time-to-first-token of around 150 ms.
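A minimal sketch of such a distillation objective, assuming a standard temperature-scaled KL divergence between the frozen text-based teacher's next-token distribution and the speech-conditioned student's; the names are illustrative, not from the Ultravox codebase:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions.

    student_logits: logits from the speech-conditioned model (adapter path)
    teacher_logits: logits from the frozen text-only Llama backbone on the
                    same target text
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature**2
```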
Guide: Running Locally
To use Ultravox locally, follow these steps:
- Install the required libraries:

```bash
pip install transformers peft librosa
```
- Import the libraries:

```python
import transformers
import numpy as np
import librosa
```
- Set up the pipeline:

```python
pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_4_1-llama-3_1-8b', trust_remote_code=True)
```
- Load the audio (the model expects 16 kHz input):

```python
path = "<path-to-input-audio>"
audio, sr = librosa.load(path, sr=16000)
```
- Define the conversation turns:

```python
turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
```
- Run the model:

```python
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```
For better performance, consider running on a cloud GPU such as an A100-40GB, as sketched below.
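One way to do this is with the standard transformers pipeline options; device and torch_dtype are generic pipeline arguments, not Ultravox-specific, and this sketch assumes a CUDA device is available.

```python
import torch
import transformers

pipe = transformers.pipeline(
    model='fixie-ai/ultravox-v0_4_1-llama-3_1-8b',
    trust_remote_code=True,
    device=0,                    # first CUDA device
    torch_dtype=torch.bfloat16,  # matches the BF16 training precision
)
```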
License
Ultravox is licensed under the MIT License, allowing for wide usage and modification with proper attribution.