Hymba-1.5B-Base

Introduction

Hymba-1.5B-Base is a base text-to-text model developed by NVIDIA for natural language generation tasks. It features a hybrid-head architecture in which Mamba (state-space) heads and standard attention heads process the input in parallel. Learnable meta tokens, prepended to every prompt, improve efficacy, and a shared KV cache reduces memory use. The model is released under the NVIDIA Open Model License and is available for commercial use.

Architecture

The Hymba-1.5B-Base model features:

  • Model Size: 1.5 billion parameters with an embedding size of 1600.
  • Layers and Heads: 32 layers with 25 attention heads.
  • Hybrid Heads: Combines standard attention heads with Mamba heads that operate in parallel within each layer.
  • Attention Layers: 3 full attention layers; the remaining layers use sliding-window attention.
  • Innovations: Utilizes Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE). Learnable meta tokens are prepended to input sequences, and cross-layer KV sharing improves memory and computation efficiency (illustrated in the sketch after this list).
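
To make the hybrid-head idea concrete, the sketch below shows one way a layer can run attention heads and an SSM-style path in parallel over the same input and fuse the results. This is a conceptual illustration only, not NVIDIA's implementation: the GRU stands in for Hymba's selective-SSM (Mamba) heads, causal masking is omitted for brevity, and mean fusion of the two paths is an assumption.

    import torch
    import torch.nn as nn

    class HybridHeadBlock(nn.Module):
        """Toy parallel attention + SSM-style block (conceptual, not Hymba's code)."""
        def __init__(self, dim: int, n_heads: int):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.ssm = nn.GRU(dim, dim, batch_first=True)  # stand-in for Mamba heads
            self.norm_attn = nn.LayerNorm(dim)
            self.norm_ssm = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            attn_out, _ = self.attn(x, x, x, need_weights=False)  # causal mask omitted
            ssm_out, _ = self.ssm(x)
            # Fuse the two parallel paths; averaging is an assumed choice here.
            return x + 0.5 * (self.norm_attn(attn_out) + self.norm_ssm(ssm_out))

    # Dimensions mirror the card: embedding size 1600, 25 attention heads.
    x = torch.randn(1, 16, 1600)
    print(HybridHeadBlock(1600, 25)(x).shape)  # torch.Size([1, 16, 1600])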

Training

The model was trained between September 1, 2024 and November 10, 2024, with the goal of outperforming publicly available models under 2B parameters. Its hybrid-head architecture is the main lever for achieving strong performance at this small scale.

Guide: Running Locally

Environment Setup

  1. Local Installation:

    • Obtain the setup script:
      wget --header="Authorization: Bearer YOUR_HF_TOKEN" https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/setup.sh
      bash setup.sh
      
    • The setup script supports CUDA 12.1 and 12.4 (a quick environment check follows this list).
  2. Using Docker:

    • Pull and run the Docker image:
      docker pull ghcr.io/tilmto/hymba:v1
      docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
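
Whichever route you take, you can confirm that the environment sees a supported CUDA build before loading the model. A minimal check using PyTorch's standard introspection; the version strings are the ones the setup script targets:

    import torch

    # The setup script supports CUDA 12.1 and 12.4.
    print(torch.version.cuda)         # e.g. '12.1' or '12.4'
    print(torch.cuda.is_available())  # True if a GPU is visible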
      

Running the Model

  • Load and interact with the model using Python:
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch

    # Load the tokenizer and model; trust_remote_code is required because
    # Hymba ships custom modeling code with the checkpoint.
    repo_name = "nvidia/Hymba-1.5B-Base"
    tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
    model = model.cuda().to(torch.bfloat16)

    # Read a prompt from stdin and generate greedily. temperature is omitted
    # here because it has no effect when do_sample=False; max_length counts
    # the prompt plus the generated tokens.
    prompt = input()
    inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
    outputs = model.generate(**inputs, max_length=64, do_sample=False, use_cache=True)

    # Decode only the newly generated tokens, skipping the prompt.
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

    print(f"Model response: {response}")
    

Cloud GPUs

For optimal performance, consider using cloud-based GPU services like AWS EC2, Google Cloud Platform, or Azure to run the Hymba-1.5B-Base model.

License

Hymba-1.5B-Base is released under the NVIDIA Open Model License. The detailed terms are available in NVIDIA's Open Model License agreement.
