Ichigo-llama3.1-s-instruct-v0.4-GGUF

QuantFactory

Introduction

The Ichigo-llama3.1-s-instruct-v0.4-GGUF model, developed by Homebrew Research and published in GGUF form by QuantFactory, is a quantized sound language model that handles both text and audio inputs. It is a supervised fine-tune built on the Llama-3 architecture, trained on over 1 billion tokens from the Instruction Speech WhisperVQ v4 dataset and optimized for multi-turn conversations and noise rejection.

Architecture

The model architecture is based on Llama-3 and supports English-language input. It accepts both text and sound (as discrete sound tokens) and generates text output, extending the base model's capabilities toward sound understanding for research applications.

Training

The model was trained on a cluster of 8x NVIDIA H100-SXM-80GB GPUs over roughly 12 hours using the torchtune library, with a cosine learning-rate schedule with warmup, the Adam optimizer, and a global batch size of 256. Performance was evaluated on benchmarks such as MMLU and AudioBench, where the model showed improvements on specific tasks.
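For reference, a cosine-with-warmup schedule of the kind described is easy to set up. A minimal sketch using the transformers helper, with placeholder step counts and learning rate rather than the values used to train Ichigo:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder model and hyperparameters, for illustration only.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,      # linear ramp from 0 up to the peak LR
    num_training_steps=1_000,  # then cosine decay back toward 0
)

for step in range(1_000):
    # ... forward pass, loss.backward(), and gradient step would go here ...
    optimizer.step()
    scheduler.step()
```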

Guide: Running Locally

  1. Setup Environment:

    • Ensure your Python environment has the necessary libraries, such as torch, torchaudio, and transformers.
    • Download the required model files from the Hugging Face Hub (see the first sketch after this list).
  2. Convert Audio to Sound Tokens:

    • Use the provided script to convert audio files into sound tokens with a pre-trained WhisperVQ model (a sketch of the preprocessing follows this list).
  3. Setup Model Pipeline:

    • Use the setup_pipeline function to initialize the model for text generation (sketched after this list).
    • Adjust the quantization settings if needed, especially when running on limited hardware.
  4. Generate Text:

    • Use the generate_text function to pass in messages and receive generated text outputs (see the final sketch after this list).
  5. Recommended Hardware:

    • Use cloud GPUs, such as NVIDIA instances on AWS EC2 or Google Cloud, for optimal performance.
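For step 1, a minimal setup sketch. The repo id is inferred from the model name and should be verified on the Hugging Face Hub:

```python
# Install dependencies first (shell):
#   pip install torch torchaudio transformers huggingface_hub
from huggingface_hub import snapshot_download

# Assumed repo id, inferred from the model name; verify on the Hub.
local_dir = snapshot_download(repo_id="QuantFactory/Ichigo-llama3.1-s-instruct-v0.4-GGUF")
print(f"Model files downloaded to {local_dir}")
```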
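For step 2, the actual sound-token conversion lives in Homebrew Research's provided script; the sketch below covers only the standard torchaudio preprocessing (Whisper-family models expect 16 kHz mono audio) and leaves the WhisperVQ call as an explicitly hypothetical placeholder:

```python
import torch
import torchaudio

def load_audio_16k(path: str) -> torch.Tensor:
    """Load an audio file and convert it to 16 kHz mono, as Whisper-family models expect."""
    waveform, sample_rate = torchaudio.load(path)
    if sample_rate != 16_000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    return waveform.mean(dim=0)  # downmix stereo to mono

waveform = load_audio_16k("question.wav")
# Hypothetical call: the provided script wraps a pre-trained WhisperVQ model
# and returns a string of discrete sound tokens for the prompt.
# sound_tokens = whisper_vq.quantize(waveform)
```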
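For step 3, a sketch of what a setup_pipeline helper might look like, assuming the transformers text-generation pipeline with optional 4-bit quantization via bitsandbytes. This mirrors the base-model workflow described in the guide; the GGUF files themselves are typically run with llama.cpp instead. The model id passed in is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

def setup_pipeline(model_id: str, use_4bit: bool = False):
    """Initialize a text-generation pipeline, optionally 4-bit quantized for limited hardware."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    quant_config = None
    if use_4bit:
        # Requires the bitsandbytes package; trades some quality for much lower VRAM use.
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # requires the accelerate package
    )
    return pipeline("text-generation", model=model, tokenizer=tokenizer)
```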
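Finally, for step 4, a hedged sketch of a generate_text helper built on the pipeline above; the chat-template usage, generation parameters, and model id are illustrative assumptions, not the model card's exact script:

```python
def generate_text(pipe, messages, max_new_tokens: int = 256) -> str:
    """Render the chat messages with the model's template and return the generated reply."""
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = pipe(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
    return outputs[0]["generated_text"]

# Usage: sound tokens from step 2 can be interleaved with text in the user turn.
# The base-model repo id here is an assumption; see the step 3 sketch.
pipe = setup_pipeline("homebrewltd/Ichigo-llama3.1-s-instruct-v0.4", use_4bit=True)
messages = [{"role": "user", "content": "<sound tokens from step 2> What was said?"}]
print(generate_text(pipe, messages))
```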

License

The model is released under the Apache-2.0 license, which permits commercial use, modification, and redistribution.
