ultravox-v0_4_1-llama-3_1-8b

fixie-ai

Ultravox Model Documentation

Introduction

Ultravox is a multimodal speech large language model (LLM) developed by Fixie.ai. It is built around a pretrained Llama 3.1-8B-Instruct language model and a Whisper-large-v3-turbo audio backbone, and it accepts both text and speech as input. It is designed for applications such as voice agents and speech-to-speech translation.

Architecture

Ultravox pairs a Llama 3.1-8B-Instruct language-model backbone with the encoder of Whisper-large-v3-turbo. The input is a text prompt containing a special <|audio|> pseudo-token. The model replaces that token with embeddings derived from the input audio, and the merged sequence is then processed by the language model. Future revisions aim to support generation of semantic and acoustic audio tokens.
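
The token replacement described above can be illustrated with a minimal sketch. This is not the Ultravox implementation: the function names, the toy embedding function, and the two-dimensional vectors are all hypothetical, chosen only to show how audio-derived embeddings take the place of the <|audio|> pseudo-token in the embedding sequence.

```python
from typing import Callable, List, Sequence

AUDIO_TOKEN = "<|audio|>"  # pseudo-token named in the model description


def merge_audio_embeddings(
    tokens: Sequence[str],
    text_embed: Callable[[str], List[float]],
    audio_embeddings: Sequence[List[float]],
) -> List[List[float]]:
    """Build the embedding sequence, replacing AUDIO_TOKEN with audio frames."""
    merged: List[List[float]] = []
    for tok in tokens:
        if tok == AUDIO_TOKEN:
            # One pseudo-token expands into every audio-derived embedding.
            merged.extend(audio_embeddings)
        else:
            merged.append(text_embed(tok))
    return merged


def toy_text_embed(token: str) -> List[float]:
    # Hypothetical stand-in for a real embedding table lookup.
    return [float(len(token)), 0.0]


tokens = ["Transcribe", ":", AUDIO_TOKEN]
audio_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # made-up adapter output
merged = merge_audio_embeddings(tokens, toy_text_embed, audio_frames)
print(len(merged))  # 2 text embeddings + 3 audio embeddings
```

In this toy version the merged sequence is longer than the token list, because a single pseudo-token stands in for a variable number of audio frames.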

Training

Only the multimodal adapter of Ultravox is trained. Training uses a knowledge-distillation loss, under which Ultravox learns to match the logits of the text-based Llama backbone. The training data comprises ASR datasets and speech translation datasets. The training regimen is supervised speech instruction finetuning, run in BF16 mixed precision on 8x H100 GPUs. At inference, the model has a time-to-first-token of around 150 ms on an A100-40GB GPU.
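
As a rough illustration of a knowledge-distillation loss on logits, the sketch below computes the KL divergence between a teacher distribution (standing in for the text-based Llama logits) and a student distribution (standing in for the logits produced from audio input). The choice of KL divergence, the function names, and the example logits are assumptions made for illustration; the source does not state the exact form of the loss.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) for a single next-token distribution.

    Assumption: KL divergence is used here as one common distillation loss.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits over a three-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student_far = [-1.0, 0.5, 2.0]
student_near = [1.9, 0.6, -1.0]

loss_far = kd_loss(teacher, student_far)
loss_near = kd_loss(teacher, student_near)
print(loss_far > loss_near)
```

The loss shrinks as the student distribution approaches the teacher's and reaches zero when the two sets of logits induce the same distribution.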

Guide: Running Locally

To use Ultravox locally, follow these steps:

  1. Install Required Libraries:
    pip install transformers peft librosa
    
  2. Import Libraries:
    import transformers
    import numpy as np
    import librosa
    
  3. Set Up Pipeline:
    # trust_remote_code=True runs the custom model code shipped in the model repository
    pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_4_1-llama-3_1-8b', trust_remote_code=True)
    
  4. Load Audio:
    path = "<path-to-input-audio>"  # replace with the path to your audio file
    # sr=16000 makes librosa resample the audio to 16 kHz
    audio, sr = librosa.load(path, sr=16000)
    
  5. Define Conversation Turns:
    turns = [
      {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
      },
    ]
    
  6. Run the Model:
    # Keep the return value so the generated output can be inspected
    result = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
    print(result)
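
Steps 4 to 6 can be wrapped in a small helper that assembles the input dictionary before it is handed to the pipeline. This is a sketch under stated assumptions: `build_ultravox_input` and `EXPECTED_SAMPLE_RATE` are hypothetical names, and the 16 kHz check simply mirrors the `sr=16000` used in step 4 rather than a documented requirement of the pipeline. Only the dictionary keys ('audio', 'turns', 'sampling_rate') come from the steps above.

```python
from typing import Any, Dict, Sequence

EXPECTED_SAMPLE_RATE = 16000  # assumption: matches sr=16000 from step 4


def build_ultravox_input(
    audio: Sequence[float], sampling_rate: int, system_prompt: str
) -> Dict[str, Any]:
    """Assemble the dictionary passed to the pipeline in step 6."""
    if sampling_rate != EXPECTED_SAMPLE_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLE_RATE} Hz audio, got {sampling_rate} Hz"
        )
    if len(audio) == 0:
        raise ValueError("audio is empty")
    # Same structure as the conversation turns defined in step 5.
    turns = [{"role": "system", "content": system_prompt}]
    return {"audio": audio, "turns": turns, "sampling_rate": sampling_rate}


# Placeholder samples standing in for the array returned by librosa.load.
fake_audio = [0.0, 0.1, -0.1, 0.05]
inputs = build_ultravox_input(fake_audio, 16000, "You are a friendly and helpful character.")
print(sorted(inputs.keys()))
```

With the pipeline from step 3, the result would be used as `pipe(inputs, max_new_tokens=30)`.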
    

For lower latency, run the pipeline on a GPU, for example a cloud A100-40GB instance.

License

Ultravox is released under the MIT License, which permits use, modification, and redistribution provided the copyright and license notice are retained.
