Ichigo-llama3.1-s-instruct-v0.4-GGUF by QuantFactory

Introduction
The Ichigo-llama3.1-s-instruct-v0.4-GGUF model, developed by Homebrew Research, is a quantized sound language model designed to handle both text and audio inputs. It is a supervised fine-tune based on the Llama-3 architecture, optimized for multi-turn conversations and noise rejection, and trained on over 1 billion tokens from the Instruction Speech WhisperVQ v4 dataset.
Architecture
The model architecture is based on Llama-3 and supports English-language input. It accepts both text and sound as input and generates text output, extending sound-understanding capabilities for research applications.
Training
The model was trained on a cluster of 8x NVIDIA H100-SXM-80GB GPUs for about 12 hours using the torchtune library, with a cosine learning-rate schedule with warmup, the Adam optimizer, and a global batch size of 256. Performance was evaluated on benchmarks such as MMLU and AudioBench, showing improvements on specific tasks.
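As a concrete illustration of the optimizer/scheduler pairing described above, here is a minimal sketch using PyTorch's Adam and the cosine-with-warmup helper from the transformers library; the step counts and learning rate are illustrative assumptions, not the card's actual training values.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 10)  # stand-in for the actual language model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is assumed

num_training_steps = 10_000  # assumed; not stated in the model card
num_warmup_steps = 500       # assumed warmup length

# Linear warmup followed by a cosine decay of the learning rate.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

for step in range(num_training_steps):
    # ... forward/backward over a global batch of 256 samples goes here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```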
Guide: Running Locally
1. Setup Environment:
   - Ensure the Python environment has the necessary libraries, such as torch, torchaudio, and transformers.
   - Download the required model files from the Hugging Face Hub (a download sketch follows this guide).
2. Convert Audio to Sound Tokens:
   - Use the provided script to convert audio files into sound tokens with a pre-trained WhisperVQ model (see the conversion sketch below).
3. Setup Model Pipeline:
   - Use the setup_pipeline function to initialize the model for text generation (see the pipeline sketch below).
   - Adjust quantization settings if needed, especially when running on limited hardware.
4. Generate Text:
   - Use the generate_text function to pass in messages and receive generated text outputs (see the generation sketch below).
5. Recommended Hardware:
   - Use cloud GPUs, such as AWS EC2 instances with NVIDIA GPUs or Google Cloud's GPU offerings, for optimal performance.
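Step 1 (download) can be scripted with the huggingface_hub client. A minimal sketch; the repo id is taken from the model name above, and the *.gguf filename pattern is an assumption to verify against the repo's file listing.

```python
# Prerequisites, installed beforehand:
#   pip install torch torchaudio transformers huggingface_hub
from huggingface_hub import snapshot_download

# Assumed repo id (matches the model name); confirm on the Hugging Face Hub.
local_dir = snapshot_download(
    repo_id="QuantFactory/Ichigo-llama3.1-s-instruct-v0.4-GGUF",
    allow_patterns=["*.gguf"],  # fetch only the quantized weight files
)
print("Model files downloaded to:", local_dir)
```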
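For step 2, the model card's own conversion script should be preferred; the sketch below only illustrates the shape of the operation. The quantizer's encode method, the 16 kHz input assumption, and the <|sound_XXXX|> serialization format are all assumptions here, not a documented API.

```python
import torch
import torchaudio

def audio_to_sound_tokens(wav_path: str, quantizer) -> str:
    """Convert an audio file into a string of discrete sound tokens."""
    # Load, downmix to mono, and resample to 16 kHz (typical for
    # Whisper-family encoders; the real script may differ).
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(dim=0, keepdim=True)
    wav = torchaudio.functional.resample(wav, sr, 16_000)

    # `quantizer` stands in for the pre-trained WhisperVQ model: it maps the
    # waveform to a sequence of discrete codebook indices. `encode` is an
    # assumed method name.
    with torch.no_grad():
        codes = quantizer.encode(wav)

    # Serialize indices as special tokens; the exact format (zero-padding,
    # delimiters) should be taken from the provided conversion script.
    body = "".join(f"<|sound_{int(c):04d}|>" for c in codes.flatten())
    return f"<|sound_start|>{body}<|sound_end|>"
```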
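For step 3, a setup_pipeline-style initializer might look like the following, assuming the transformers text-generation pipeline with optional 4-bit quantization via bitsandbytes. Note that the GGUF weights themselves target llama.cpp-compatible runtimes; this sketch assumes a standard Hugging Face checkpoint for the same model.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_id: str, use_4bit: bool = False):
    """Initialize a text-generation pipeline, optionally quantized."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
    if use_4bit:
        # Trade some quality for a much smaller memory footprint on
        # limited hardware.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
        )
    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)
```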
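Finally, for step 4, a generate_text-style helper built on that pipeline could look like this; the chat-template call is standard transformers usage, and the sampling parameters are illustrative defaults rather than the card's recommended settings.

```python
def generate_text(pipe, messages, max_new_tokens: int = 256) -> str:
    """Run a list of chat messages through the pipeline and return the reply."""
    # Render the conversation with the model's chat template.
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,
    )
    return output[0]["generated_text"]

# Example usage: a user turn carrying the serialized sound tokens from step 2.
# messages = [{"role": "user", "content": sound_token_string}]
# print(generate_text(pipe, messages))
```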
License
The model is released under the Apache-2.0 license, allowing for flexible use and distribution.