Hymba-1.5B-Instruct
Introduction
Hymba-1.5B-Instruct is a 1.5-billion-parameter model developed by NVIDIA, fine-tuned from the Hymba-1.5B-Base model. It is designed for complex tasks such as mathematical reasoning, function calling, and role-playing. Fine-tuning used a mix of open-source and synthetic datasets and combined supervised fine-tuning (SFT) with direct preference optimization (DPO). The model is suitable for commercial use.
Architecture
The Hymba-1.5B-Instruct model features an embedding size of 1600, 25 attention heads, and an MLP intermediate dimension of 5504 across 32 layers. It uses a hybrid design in which standard attention heads and Mamba (SSM) heads run in parallel on the same input, and it employs Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE). Efficiency comes from fusing the outputs of the attention and SSM heads within each layer and from prepending learnable meta tokens that condition how the input sequence is processed.
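The released implementation lives in the model repository; purely as a mental model, a parallel hybrid block might look like the following PyTorch sketch. HybridBlock, the linear stand-in for the SSM branch, the fusion weights, and the meta-token count are illustrative assumptions, not Hymba's actual code.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Conceptual parallel attention + SSM block (not NVIDIA's implementation)."""
    def __init__(self, d_model=1600, n_heads=25, n_meta=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in for a Mamba/selective-SSM mixer; the real head uses
        # state-space recurrence, not a plain linear layer.
        self.ssm = nn.Linear(d_model, d_model)
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ssm = nn.LayerNorm(d_model)
        self.beta = nn.Parameter(torch.ones(2))  # learned per-branch fusion weights
        self.meta = nn.Parameter(torch.randn(1, n_meta, d_model))  # meta tokens

    def forward(self, x):
        # Prepend learnable meta tokens to the input sequence.
        x = torch.cat([self.meta.expand(x.shape[0], -1, -1), x], dim=1)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        ssm_out = self.ssm(x)
        # Both branches see the same input; their normalized outputs are fused.
        return x + self.beta[0] * self.norm_attn(attn_out) + self.beta[1] * self.norm_ssm(ssm_out)

print(HybridBlock()(torch.randn(1, 16, 1600)).shape)  # (1, 24, 1600): 8 meta + 16 input tokens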
Training
The model was trained between September 4, 2024, and November 10, 2024. Fine-tuning drew on a variety of datasets chosen to improve performance on complex language tasks while keeping inference efficient.
Guide: Running Locally
Step 1: Environment Setup
- Local Installation: download and install the required packages with the provided setup.sh script, which supports CUDA 12.1/12.4 (once it finishes, see the sanity check after this list):
wget --header="Authorization: Bearer YOUR_HF_TOKEN" https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/setup.sh
bash setup.sh
- Docker Installation: alternatively, use the provided Docker image:
docker pull ghcr.io/tilmto/hymba:v1
docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
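Whichever route you take, a quick sanity check (not part of the official instructions, just a suggestion) confirms that PyTorch was installed with working CUDA support:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"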
Step 2: Chat with the Model
Load and chat with the model via a script that uses the transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer, StopStringCriteria, StoppingCriteriaList
import torch

repo_name = "nvidia/Hymba-1.5B-Instruct"

# trust_remote_code is required because Hymba ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

# Build the chat prompt from a system message and the user's input.
prompt = input("User: ")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

# Stop generation at the end-of-sequence string.
stopping_criteria = StoppingCriteriaList([StopStringCriteria(tokenizer=tokenizer, stop_strings="</s>")])

# Greedy decoding; temperature is omitted because it has no effect when do_sample=False.
outputs = model.generate(tokenized_chat, max_new_tokens=256, do_sample=False, use_cache=True, stopping_criteria=stopping_criteria)

# Decode only the newly generated tokens, skipping the prompt.
input_length = tokenized_chat.shape[1]
response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
print(f"Model response: {response}")
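For a multi-turn conversation, the same pieces can be wrapped in a loop that accumulates history. This is an illustrative sketch building on the snippet above (it assumes model, tokenizer, and stopping_criteria are already defined), not part of the model card:
# Multi-turn chat loop: append each exchange so the model sees full history.
messages = [{"role": "system", "content": "You are a helpful assistant."}]
while True:
    user_turn = input("User: ")
    if not user_turn:  # empty line ends the session
        break
    messages.append({"role": "user", "content": user_turn})
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=False, use_cache=True, stopping_criteria=stopping_criteria)
    reply = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})
    print(f"Assistant: {reply}")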
Cloud GPUs
For optimal performance, consider cloud GPUs such as AWS EC2 P3 instances or A100 GPUs on Google Cloud.
License
Hymba-1.5B-Instruct is released under the NVIDIA Open Model License Agreement. For detailed terms, refer to the license document.