Granite-3.1-3B-A800M-Base-GGUF

QuantFactory

Introduction

Granite-3.1-3B-A800M-Base is a language model whose context length has been extended from 4K to 128K tokens using a progressive training strategy. Built on a decoder-only sparse Mixture of Experts (MoE) transformer architecture, it supports a broad range of text-to-text generation tasks and is primarily intended for summarization, text classification, extraction, and question-answering.

Architecture

Granite-3.1-3B-A800M-Base is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. The core components include Fine-grained Experts, Dropless Token Routing, and Load Balancing Loss. Key architectural specifications are:

  • Embedding size: 1536
  • Number of layers: 32
  • Attention head size: 64
  • Number of attention heads: 24
  • MLP hidden size: 512
  • Number of experts: 40
  • Sequence length: 128K
  • Position embedding: RoPE
  • Total parameters: 3.3 billion
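
Taken together, these specifications describe a feed-forward block in which each token is routed to a small subset of the 40 fine-grained experts (each an MLP with hidden size 512), while a load-balancing loss keeps expert usage roughly even. The PyTorch sketch below illustrates this pattern under stated assumptions: the choice of 8 active experts per token, the SiLU activation, and all class, module, and variable names are illustrative and are not taken from the Granite implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoELayer(nn.Module):
        """Illustrative fine-grained MoE block with top-k routing and a
        load-balancing auxiliary loss. A sketch only, not the Granite code."""
        def __init__(self, d_model=1536, d_expert=512, n_experts=40, top_k=8):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                              nn.Linear(d_expert, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):  # x: (num_tokens, d_model)
            probs = F.softmax(self.router(x), dim=-1)      # routing probability per expert
            weights, idx = probs.topk(self.top_k, dim=-1)  # keep top-k experts per token
            weights = weights / weights.sum(dim=-1, keepdim=True)
            out = torch.zeros_like(x)
            # "dropless" routing: every selected (token, expert) pair is computed
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k:k + 1] * expert(x[mask])
            # load-balancing loss: penalizes uneven assignment of tokens to experts
            frac = F.one_hot(idx, len(self.experts)).float().mean(dim=(0, 1))
            aux_loss = (probs.mean(dim=0) * frac).sum() * len(self.experts)
            return out, aux_loss

In training setups that use such a loss, the auxiliary term is typically added to the language-modeling loss with a small coefficient so that routing stays balanced without dominating the objective.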

Training

The model undergoes a three-stage training process using a mix of open-source and proprietary data:

  1. Stage 1: Involves diverse domain data including web, code, academic sources, books, and math data.
  2. Stage 2: Focuses on high-quality curated data from the same domains with additional multilingual and instruction data.
  3. Stage 3: Incorporates synthetic long-context data in the form of QA/summary pairs.

Training is conducted on IBM's Blue Vela supercomputing cluster equipped with NVIDIA H100 GPUs.

Guide: Running Locally

To run Granite-3.1-3B-A800M-Base locally, follow these steps:

  1. Install Required Libraries:
    pip install torch torchvision torchaudio
    pip install accelerate
    pip install transformers
    
  2. Setup and Run Example:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "ibm-granite/granite-3.1-3b-a800m-base"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # device_map="auto" places the model on available GPUs; drop it to run on CPU
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    model.eval()
    # change the input text as desired
    input_text = "Where is the Thomas J. Watson Research Center located?"
    # tokenize and move the inputs to the same device as the model
    input_tokens = tokenizer(input_text, return_tensors="pt").to(model.device)
    # generate and decode the output tokens
    output = model.generate(**input_tokens, max_length=4000)
    output_text = tokenizer.batch_decode(output)
    print(output_text)
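
  3. (Optional) Run a GGUF Quantization:
    Since this repository provides GGUF quantizations of the model, it can also be run with llama-cpp-python instead of transformers. The snippet below is a minimal sketch; the filename is an assumption and should match whichever quantization level you download from this repository.

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # point model_path at the downloaded GGUF file (filename is illustrative)
    llm = Llama(model_path="granite-3.1-3b-a800m-base.Q4_K_M.gguf", n_ctx=4096)
    output = llm("Where is the Thomas J. Watson Research Center located?", max_tokens=128)
    print(output["choices"][0]["text"])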
    

For optimal performance, consider using cloud services with access to powerful GPUs, such as AWS, Google Cloud, or Azure.

License

The Granite-3.1-3B-A800M-Base model is licensed under the Apache License 2.0.
