Introduction

Hertz-dev is an open-source base model for full-duplex conversational audio. It is an 8.5 billion parameter transformer trained on 20 million hours of high-quality audio data, and it supports both mono and full-duplex generation. Because it models human-like speech patterns accurately, including pauses and emotional inflections, it is a suitable starting point for downstream tasks such as live translation and classification.

Architecture

Hertz-dev uses a transformer architecture with 8.5 billion parameters. Its average real-world latency is 120 ms on a single RTX 4090, significantly lower than previous models. Low latency is crucial for conversation that feels natural, because long gaps between turns disrupt the rhythm of live speech.
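
As a rough check on why a single RTX 4090 is a plausible target for a model of this size, the sketch below estimates the memory needed for the weights alone. The 2 bytes per parameter (a 16-bit format) and the nominal 24 GB card capacity are assumptions made for illustration; the precision Hertz-dev actually uses at inference is not stated here, and activations and caches need additional memory on top of the weights.

```python
# Back-of-the-envelope weight memory for an 8.5B-parameter model.
# ASSUMPTION: weights are stored in a 16-bit format (2 bytes per parameter).
# The precision hertz-dev actually uses at inference is not stated in this guide.

PARAMS = 8.5e9          # parameter count stated in the text
BYTES_PER_PARAM = 2     # assumed: fp16 or bf16
GPU_MEMORY_GB = 24      # assumed: nominal RTX 4090 memory


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(PARAMS, BYTES_PER_PARAM)
headroom_gb = GPU_MEMORY_GB - weights_gb

print(f"weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

Under these assumptions the weights take about 17 GB, leaving roughly 7 GB for everything else. With 4 bytes per parameter the weights alone would exceed the card's memory.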

Training

The model was trained on 20 million unique hours of conversational audio. It is a base model: it has not been fine-tuned, trained with Reinforcement Learning from Human Feedback (RLHF), or taught instruction-following behavior. Users can fine-tune Hertz-dev for various audio modeling tasks.

Guide: Running Locally

  1. Clone the Repository:

    git clone https://github.com/Standard-Intelligence/hertz-dev
    cd hertz-dev
    
  2. Set Up Environment:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    
    • For Ubuntu, install additional dependencies:
      sudo apt-get install libportaudio2
      
  3. Install PyTorch with CUDA Support:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    
  4. Run Inference Notebook:
    Use inference.ipynb to generate audio completions. Note that Windows setups may require adjustments due to flash attention dependencies.

  5. Live Interaction:
    Use inference_client.py and inference_server.py for live interaction through a microphone. These have been tested mainly on Ubuntu (server) and macOS (client).

Cloud GPUs: If no suitable local GPU is available, consider a cloud GPU instance from a provider such as AWS, Google Cloud, or Azure, with performance equivalent to an NVIDIA RTX 4090.
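
Before opening the notebook, it can help to confirm that the environment from steps 2 and 3 is usable. The script below is a minimal sketch, not part of the hertz-dev repository: the function name check_environment and the report keys are made up for illustration. It relies only on standard PyTorch calls (torch.cuda.is_available and torch.cuda.get_device_name) and still runs if PyTorch is missing.

```python
# Sanity-check the Python and PyTorch setup before running inference.
# NOTE: check_environment is a hypothetical helper, not part of hertz-dev.
import sys


def check_environment() -> dict:
    """Collect basic facts about the interpreter, PyTorch, and the GPU."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch_installed": False,
        "cuda_available": False,
        "gpu_name": None,
    }
    try:
        import torch  # imported lazily so the script works without PyTorch
    except (ImportError, OSError):
        return report  # PyTorch is missing or failed to load

    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    if report["cuda_available"]:
        report["gpu_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If cuda_available is reported as False after step 3, the installed PyTorch build may not match the local CUDA driver, and the install command in step 3 is the first thing to recheck.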

License

Hertz-dev is released under the Apache-2.0 license, which permits use, modification, and redistribution under its terms.
