Hymba-1.5B-Instruct

Introduction

Hymba-1.5B-Instruct is a 1.5-billion-parameter model developed by NVIDIA, fine-tuned from the Hymba-1.5B-Base model. It is designed to handle complex tasks such as math reasoning, function calling, and role-playing. Fine-tuning combined supervised fine-tuning (SFT) and direct preference optimization (DPO) on a mix of open-source and synthetic datasets. The model is suitable for commercial applications.

Architecture

The Hymba-1.5B-Instruct model uses an embedding size of 1600, 25 attention heads, and an MLP intermediate dimension of 5504 across 32 layers. Its hybrid-head architecture runs standard attention heads and Mamba (SSM) heads in parallel within each layer and fuses their outputs, and it employs Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE) for efficiency. Learned meta tokens are prepended to input sequences to improve how the model processes them; a toy sketch of the hybrid-head idea follows.
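
The parallel hybrid-head idea can be illustrated with a toy PyTorch module. This is a minimal sketch for intuition only, not NVIDIA's implementation: the diagonal SSM scan, the averaging fusion, and all dimensions are illustrative assumptions, and meta tokens are omitted.

import torch
import torch.nn as nn

class ToyHybridHeadBlock(nn.Module):
    # Toy sketch of a parallel hybrid-head block: a standard attention path
    # and a simple diagonal SSM path (a stand-in for Mamba heads) process
    # the same input, and their outputs are fused by averaging.
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.log_decay = nn.Parameter(torch.full((d_model,), -0.5))  # per-channel decay
        self.input_gain = nn.Parameter(torch.ones(d_model))
        self.norm = nn.LayerNorm(d_model)

    def ssm(self, x):
        # h_t = exp(log_decay) * h_{t-1} + input_gain * x_t, via a sequential scan.
        decay = torch.exp(self.log_decay)
        h = torch.zeros(x.shape[0], x.shape[-1], device=x.device)
        outs = []
        for t in range(x.shape[1]):
            h = decay * h + self.input_gain * x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        ssm_out = self.ssm(x)
        # Parallel fusion: average the two paths' outputs, then add a residual.
        return self.norm(x + 0.5 * (attn_out + ssm_out))

x = torch.randn(2, 10, 64)            # (batch, sequence, d_model)
print(ToyHybridHeadBlock()(x).shape)  # torch.Size([2, 10, 64])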

Training

The model was trained between September 4, 2024, and November 10, 2024. It was fine-tuned on a mix of open-source and synthetic datasets to improve performance on instruction following, math reasoning, function calling, and role-playing, with the goal of handling complex language tasks efficiently.

Guide: Running Locally

Step 1: Environment Setup

  1. Local Installation
    Download and install the necessary packages with the provided setup.sh script, which supports CUDA 12.1 and 12.4 (a quick sanity check follows this list):

    wget --header="Authorization: Bearer YOUR_HF_TOKEN" https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/setup.sh
    bash setup.sh
    
  2. Docker Installation
    Alternatively, use the provided Docker image:

    docker pull ghcr.io/tilmto/hymba:v1
    docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
    
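After either install path, a quick sanity check (a generic snippet, not part of the official setup script) confirms that PyTorch sees a GPU and that the CUDA build matches the expected 12.1/12.4:

import torch

# Confirm the CUDA build and GPU visibility before loading the model.
print("PyTorch version:", torch.__version__)
print("CUDA build:", torch.version.cuda)  # expect 12.1 or 12.4
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))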

Step 2: Chat with the Model

Load and chat with the model using the following script, which uses the transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer, StopStringCriteria, StoppingCriteriaList
import torch

# Load the tokenizer and model; trust_remote_code is required because
# Hymba ships custom modeling code.
repo_name = "nvidia/Hymba-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

# Build a chat and apply the model's chat template.
prompt = input("Enter your prompt: ")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to('cuda')

# Stop generation at the end-of-sequence string.
stopping_criteria = StoppingCriteriaList([StopStringCriteria(tokenizer=tokenizer, stop_strings="</s>")])

# Greedy decoding; temperature has no effect when do_sample=False, so it is omitted.
outputs = model.generate(tokenized_chat, max_new_tokens=256, do_sample=False, use_cache=True, stopping_criteria=stopping_criteria)

# Decode only the newly generated tokens, skipping the prompt.
input_length = tokenized_chat.shape[1]
response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
print(f"Model response: {response}")

Cloud GPUs

For optimal performance, consider using cloud GPUs such as AWS EC2 P3 instances or Google Cloud's A100 GPUs. Note that the script above casts the model to bfloat16, which has native hardware support on Ampere-class GPUs (such as the A100) and newer; on older cards, load the model in float16 or float32 instead.
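
As a rough sizing guide (a back-of-the-envelope estimate, not a measured figure), the bfloat16 weights alone occupy about 3 GB:

# Rough weight-memory estimate for the bfloat16-cast model in the script above.
params = 1.5e9        # 1.5 billion parameters, per the model name
bytes_per_param = 2   # bfloat16 stores each parameter in 2 bytes
print(f"~{params * bytes_per_param / 1e9:.1f} GB for weights alone")  # ~3.0 GB
# Actual usage is higher once activations and the decoding caches are included.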

License

Hymba-1.5B-Instruct is released under the NVIDIA Open Model License Agreement. For detailed terms, refer to the license document.
