Hymba-1.5B-Base
Introduction
Hymba-1.5B-Base is a base text-to-text model developed by NVIDIA for natural language generation tasks. It features a hybrid architecture that combines Mamba (state-space) heads and attention heads, introduces learnable meta tokens to improve effectiveness, and shares the KV cache across layers for efficiency. It is released under the NVIDIA Open Model License and is available for commercial use.
Architecture
The Hymba-1.5B-Base model features:
- Model Size: 1.5 billion parameters with an embedding size of 1600.
- Layers and Heads: 32 layers with 25 attention heads.
- Hybrid Heads: standard attention heads and Mamba (SSM) heads process the same input in parallel within each layer (see the sketch after this list).
- Attention Layers: 3 full attention layers; the rest use sliding-window attention.
- Innovations: Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE). Learnable meta tokens are prepended to input sequences, and cross-layer KV sharing reduces memory and computation.
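
To make the hybrid-head idea concrete, here is a minimal PyTorch sketch. It is a schematic under stated assumptions, not Hymba's actual implementation: the GRU stands in for a Mamba/SSM head, the mean fusion rule and meta-token count are illustrative, and the name HybridHeadBlock is hypothetical.

```python
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    """Schematic only: learnable meta tokens are prepended to the input, then
    an attention branch and a recurrent state-space-style branch process the
    same sequence in parallel, and their normalized outputs are averaged.
    The GRU is a stand-in for a Mamba head; this is not Hymba's code."""

    def __init__(self, dim: int, n_heads: int, n_meta: int = 4):
        super().__init__()
        self.meta = nn.Parameter(torch.zeros(1, n_meta, dim))  # meta tokens
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm = nn.GRU(dim, dim, batch_first=True)          # Mamba stand-in
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ssm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        meta = self.meta.expand(x.size(0), -1, -1)
        x = torch.cat([meta, x], dim=1)       # prepend meta tokens
        attn_out, _ = self.attn(x, x, x)      # attention branch
        ssm_out, _ = self.ssm(x)              # recurrent/SSM branch
        fused = 0.5 * (self.norm_attn(attn_out) + self.norm_ssm(ssm_out))
        return self.proj(fused)

# Toy usage: batch of 2, sequence of 8 tokens, 64-dim embeddings.
block = HybridHeadBlock(dim=64, n_heads=4)
print(block(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 12, 64])
```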
Training
The model was trained between September 1, 2024 and November 10, 2024. Its hybrid-head architecture is designed to outperform public models under 2B parameters.
Guide: Running Locally
Environment Setup
- Local Installation (supports CUDA 12.1/12.4):
  - Obtain and run the setup script (replace YOUR_HF_TOKEN with your Hugging Face access token):

    ```bash
    wget --header="Authorization: Bearer YOUR_HF_TOKEN" https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/setup.sh
    bash setup.sh
    ```

- Using Docker:
  - Pull and run the Docker image:

    ```bash
    docker pull ghcr.io/tilmto/hymba:v1
    docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
    ```
Running the Model
- Load and interact with the model using Python:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_name = "nvidia/Hymba-1.5B-Base"

# trust_remote_code is required because Hymba ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

prompt = input()
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Greedy decoding; temperature is omitted because it has no effect
# when do_sample=False.
outputs = model.generate(**inputs, max_length=64, do_sample=False, use_cache=True)

# Strip the prompt tokens from the output before decoding.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Model response: {response}")
```
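
Note that temperature and top_p only take effect when sampling is enabled. If you want sampled rather than greedy output, a minimal variant of the generate call looks like this (the parameter values shown are illustrative, not recommendations from the model card):

```python
# Sampled decoding: temperature/top_p apply only when do_sample=True.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    use_cache=True,
)
```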
Cloud GPUs
For optimal performance, consider using cloud-based GPU services like AWS EC2, Google Cloud Platform, or Azure to run the Hymba-1.5B-Base model.
License
Hymba-1.5B-Base is released under the NVIDIA Open Model License. The detailed terms are available from the license link on the model's Hugging Face page.