Granite-3.0-3B-A800M-Base-GGUF

QuantFactory

Introduction

The Granite-3.0-3B-A800M-Base is a decoder-only language model developed by IBM's Granite Team. It supports a variety of text-to-text generation tasks and is designed to handle multiple languages and tasks including summarization, text classification, extraction, and question-answering. This repository provides a GGUF quantized version of the model, created using llama.cpp.

Architecture

Granite-3.0-3B-A800M-Base employs a sparse Mixture of Experts (MoE) transformer architecture with components such as fine-grained experts, dropless token routing, and a load-balancing loss. The model features the following configuration (a short top-k routing sketch follows the list):

  • Embedding size: 1536
  • Number of layers: 32
  • Attention head size: 64
  • Number of attention heads: 24
  • Number of KV heads: 8
  • MLP hidden size: 512
  • MLP activation: SwiGLU
  • Number of Experts: 40
  • MoE TopK: 8
  • Sequence Length: 4096
  • Position Embedding: RoPE
  • Total Parameters: 3.3 billion
  • Active Parameters: 800 million
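
The sparse MoE layers are what let a 3.3 billion parameter model activate only about 800 million parameters per token: a router scores all 40 experts for each token and only the top 8 are evaluated. The snippet below is a minimal, illustrative top-k router in Python; the class and layer names are ours, and it omits the load-balancing loss and dropless routing used in the real model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        """Illustrative sparse MoE layer: route each token to its top-k experts."""
        def __init__(self, hidden_size=1536, expert_hidden=512, num_experts=40, top_k=8):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(hidden_size, num_experts, bias=False)
            # Each expert is a small MLP (the real model uses SwiGLU activations).
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(hidden_size, expert_hidden), nn.SiLU(),
                              nn.Linear(expert_hidden, hidden_size))
                for _ in range(num_experts)
            )

        def forward(self, x):                      # x: (num_tokens, hidden_size)
            logits = self.router(x)                # (num_tokens, num_experts)
            weights, indices = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts only
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for token, expert_id in enumerate(indices[:, slot]):
                    out[token] += weights[token, slot] * self.experts[expert_id](x[token])
            return out

    # Example: route 4 token embeddings through the sparse layer.
    tokens = torch.randn(4, 1536)
    print(TopKMoE()(tokens).shape)  # torch.Size([4, 1536])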

Training

Granite-3.0-3B-A800M-Base is trained in two stages:

  1. Stage 1 involves training on 8 trillion tokens from diverse domains such as web, code, academic sources, books, and math data.
  2. Stage 2 uses 2 trillion tokens from a curated mix of high-quality, multilingual, and instruction data to enhance task-specific performance.

Training is conducted on Blue Vela, IBM's supercomputing cluster, using NVIDIA H100 GPUs and 100% renewable energy.

Guide: Running Locally

To run Granite-3.0-3B-A800M-Base locally, follow these steps:

  1. Install Required Libraries:

    pip install torch torchvision torchaudio
    pip install accelerate
    pip install transformers
    
  2. Set Up and Run the Model:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model_path = "ibm-granite/granite-3.0-3b-a800m-base"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # device_map="auto" places the model on available GPUs, or the CPU as a fallback.
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    model.eval()
    
    # Tokenize the prompt and move the tensors to the device the model was placed on.
    input_text = "Where is the Thomas J. Watson Research Center located?"
    input_tokens = tokenizer(input_text, return_tensors="pt").to(model.device)
    
    # max_length caps prompt plus generated tokens (the model's context length is 4096).
    output = model.generate(**input_tokens, max_length=4000)
    output = tokenizer.batch_decode(output)
    print(output)
    
  3. Cloud GPU Recommendation: For better performance, consider using cloud services that offer GPU instances, such as AWS EC2, Google Cloud Platform, or Azure.
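
  4. Run the GGUF File Directly (optional): Because this repository distributes GGUF quantizations created with llama.cpp, you can also run them without transformers, for example with llama-cpp-python (pip install llama-cpp-python). The snippet below is a minimal sketch; the GGUF file name is a placeholder for whichever quantization you downloaded from this repository.

    from llama_cpp import Llama
    
    # Path to a downloaded GGUF file from this repository (placeholder file name).
    llm = Llama(model_path="granite-3.0-3b-a800m-base.Q4_K_M.gguf", n_ctx=4096)
    
    output = llm("Where is the Thomas J. Watson Research Center located?", max_tokens=128)
    print(output["choices"][0]["text"])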

License

The Granite-3.0-3B-A800M-Base model is licensed under the Apache 2.0 license. This allows for both personal and commercial use, with the requirement to provide proper attribution.
