Ichigo-llama3.1-s-instruct-v0.4-GGUF by QuantFactory

Introduction
The Ichigo-llama3.1-s-instruct-v0.4-GGUF model, developed by Homebrew Research, is a quantized sound language model designed to handle both text and audio inputs. It is a supervised fine-tune based on the Llama-3 architecture, optimized for multi-turn conversations and noise rejection, and trained on over 1 billion tokens from the Instruction Speech WhisperVQ v4 dataset.
Architecture
The model architecture is based on Llama-3 and supports English-language input. It accepts both text and sound as input and generates text output, extending sound-understanding capabilities for research applications.
Training
The model was trained on a cluster of 8x NVIDIA H100-SXM-80GB GPUs for about 12 hours using the torchtune library, with a cosine learning-rate schedule with warmup, the Adam optimizer, and a global batch size of 256. Performance was evaluated on benchmarks such as MMLU and AudioBench, showing improvements on specific tasks.
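As a concrete illustration of the optimizer/scheduler pairing described above, here is a minimal sketch using PyTorch's Adam and the cosine-with-warmup helper from the transformers library; the step counts and learning rate are illustrative assumptions, not the card's actual training values.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 10)  # stand-in for the actual language model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is assumed

num_training_steps = 10_000  # assumed; not stated in the model card
num_warmup_steps = 500       # assumed warmup length

# Linear warmup followed by a cosine decay of the learning rate.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

for step in range(num_training_steps):
    # ... forward/backward over a global batch of 256 samples goes here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```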
Guide: Running Locally
1. Setup Environment:
   - Ensure the Python environment has the necessary libraries, such as torch, torchaudio, and transformers.
   - Download the required model files from the Hugging Face Hub (a download sketch follows this guide).
2. Convert Audio to Sound Tokens:
   - Use the provided script to convert audio files into sound tokens with a pre-trained WhisperVQ model (see the conversion sketch below).
3. Setup Model Pipeline:
   - Use the setup_pipeline function to initialize the model for text generation (see the pipeline sketch below).
   - Adjust quantization settings if needed, especially when running on limited hardware.
4. Generate Text:
   - Use the generate_text function to pass in messages and receive generated text outputs (see the generation sketch below).
5. Recommended Hardware:
   - Use cloud GPUs, such as AWS EC2 instances with NVIDIA GPUs or Google Cloud's GPU offerings, for optimal performance.
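Step 1 (download) can be scripted with the huggingface_hub client. A minimal sketch; the repo id is taken from the model name above, and the *.gguf filename pattern is an assumption to verify against the repo's file listing.

```python
# Prerequisites, installed beforehand:
#   pip install torch torchaudio transformers huggingface_hub
from huggingface_hub import snapshot_download

# Assumed repo id (matches the model name); confirm on the Hugging Face Hub.
local_dir = snapshot_download(
    repo_id="QuantFactory/Ichigo-llama3.1-s-instruct-v0.4-GGUF",
    allow_patterns=["*.gguf"],  # fetch only the quantized weight files
)
print("Model files downloaded to:", local_dir)
```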
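For step 2, the model card's own conversion script should be preferred; the sketch below only illustrates the shape of the operation. The quantizer's encode method, the 16 kHz input assumption, and the <|sound_XXXX|> serialization format are all assumptions here, not a documented API.

```python
import torch
import torchaudio

def audio_to_sound_tokens(wav_path: str, quantizer) -> str:
    """Convert an audio file into a string of discrete sound tokens."""
    # Load, downmix to mono, and resample to 16 kHz (typical for
    # Whisper-family encoders; the real script may differ).
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(dim=0, keepdim=True)
    wav = torchaudio.functional.resample(wav, sr, 16_000)

    # `quantizer` stands in for the pre-trained WhisperVQ model: it maps the
    # waveform to a sequence of discrete codebook indices. `encode` is an
    # assumed method name.
    with torch.no_grad():
        codes = quantizer.encode(wav)

    # Serialize indices as special tokens; the exact format (zero-padding,
    # delimiters) should be taken from the provided conversion script.
    body = "".join(f"<|sound_{int(c):04d}|>" for c in codes.flatten())
    return f"<|sound_start|>{body}<|sound_end|>"
```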
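For step 3, a setup_pipeline-style initializer might look like the following, assuming the transformers text-generation pipeline with optional 4-bit quantization via bitsandbytes. Note that the GGUF weights themselves target llama.cpp-compatible runtimes; this sketch assumes a standard Hugging Face checkpoint for the same model.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_id: str, use_4bit: bool = False):
    """Initialize a text-generation pipeline, optionally quantized."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
    if use_4bit:
        # Trade some quality for a much smaller memory footprint on
        # limited hardware.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
        )
    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)
```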
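Finally, for step 4, a generate_text-style helper built on that pipeline could look like this; the chat-template call is standard transformers usage, and the sampling parameters are illustrative defaults rather than the card's recommended settings.

```python
def generate_text(pipe, messages, max_new_tokens: int = 256) -> str:
    """Run a list of chat messages through the pipeline and return the reply."""
    # Render the conversation with the model's chat template.
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,
    )
    return output[0]["generated_text"]

# Example usage: a user turn carrying the serialized sound tokens from step 2.
# messages = [{"role": "user", "content": sound_token_string}]
# print(generate_text(pipe, messages))
```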
License
The model is released under the Apache-2.0 license, allowing for flexible use and distribution.