ultravox v0_4_1 mistral nemo

fixie-ai

Introduction

Ultravox is a multimodal speech large language model (LLM) developed by Fixie.ai. It accepts both speech and text as input, which makes it suitable for applications such as voice agents and speech-to-speech translation. The model pairs a Mistral-Nemo-Instruct-2407 backbone with a Whisper encoder and generates text output from audio input.

Architecture

Ultravox is built around two pretrained components: a Mistral-Nemo-Instruct-2407 language model backbone and the whisper-large-v3-turbo encoder. The text prompt contains a special <|audio|> pseudo-token, which the model replaces with embeddings derived from the input audio, so a single prompt can mix text and speech. Future revisions aim to expand the token vocabulary so that the model can generate semantic and acoustic audio tokens for voice output.
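The placeholder substitution described above can be sketched in a few lines of plain Python. This is an illustration only, not Ultravox's actual implementation: the function name `splice_audio_embeddings`, the toy 2-dimensional vectors, and the dictionary-based embedding table are all assumptions made for the example.

```python
from typing import Dict, List

AUDIO_TOKEN = "<|audio|>"  # placeholder token named in the text above

Vector = List[float]


def splice_audio_embeddings(
    tokens: List[str],
    text_embed: Dict[str, Vector],
    audio_embeds: List[Vector],
) -> List[Vector]:
    """Build the input embedding sequence (hypothetical helper).

    Text tokens are looked up in the embedding table; the audio
    placeholder is expanded into all audio embedding vectors.
    """
    merged: List[Vector] = []
    for tok in tokens:
        if tok == AUDIO_TOKEN:
            # One placeholder becomes one vector per audio frame.
            merged.extend(audio_embeds)
        else:
            merged.append(text_embed[tok])
    return merged


# Toy data: two text tokens followed by the audio placeholder.
tokens = ["Transcribe", ":", AUDIO_TOKEN]
text_embed = {"Transcribe": [1.0, 0.0], ":": [0.0, 1.0]}
audio_embeds = [[0.5, 0.5], [0.25, 0.75], [0.1, 0.9]]

merged = splice_audio_embeddings(tokens, text_embed, audio_embeds)
print(len(merged))
```

In this sketch the language model would then consume `merged` exactly as it would a sequence of ordinary text embeddings, which is why the backbone itself does not need to change to accept audio.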

Training

The model is trained with a knowledge-distillation loss that aligns its outputs with those of the text-based Mistral backbone. The training data is a mix of Automatic Speech Recognition (ASR) datasets and speech translation datasets. Only the multimodal adapter is trained; the Whisper encoder and the Mistral backbone are kept frozen. Training used BF16 mixed precision on 8x H100 GPUs.
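As a rough illustration of what a knowledge-distillation loss computes, the sketch below measures the KL divergence between a teacher's and a student's next-token distributions for a single position. The KL form, the function names, and the toy logits are assumptions for the example; the text above does not specify the exact loss used to train Ultravox.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits: List[float], teacher_logits: List[float]) -> float:
    """KL(teacher || student) for one token position (assumed loss form)."""
    log_p = log_softmax(teacher_logits)  # teacher: text-only backbone
    log_q = log_softmax(student_logits)  # student: model fed with audio
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


teacher = [2.0, 0.5, -1.0]

# Identical predictions give zero loss; a uniform student is penalized.
same_loss = distillation_loss(teacher, teacher)
diff_loss = distillation_loss([0.0, 0.0, 0.0], teacher)
print(same_loss, diff_loss)
```

The loss is zero only when the student reproduces the teacher's distribution, so minimizing it pushes the audio-conditioned model toward the behavior of the text-only backbone.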

Guide: Running Locally

  1. Installation:

    • Install the required libraries: pip install transformers peft librosa

  2. Set up the model and run inference:

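    import transformers
    import librosa

    # trust_remote_code=True lets transformers load the custom pipeline code
    # that ships with the model repository.
    pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_4_1-mistral-nemo', trust_remote_code=True)

    path = "<path-to-input-audio>"  # Replace with the path to your audio file
    # Load the audio and resample it to 16 kHz.
    audio, sr = librosa.load(path, sr=16000)

    # Conversation history; here only a system prompt that sets the persona.
    turns = [
      {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
      },
    ]
    # Generate up to 30 new tokens in response to the audio.
    output = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
    print(output)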
    
  3. Hardware Suggestion:

    • For faster inference, consider running the model on a cloud GPU such as an A100 or H100.

License

Ultravox is licensed under the MIT License, which permits use, modification, and redistribution, provided the copyright and license notice are retained.
