QuantFactory/granite-3.1-3b-a800m-base-GGUF
Introduction
Granite-3.1-3B-A800M-Base is a language model whose context length has been extended from 4K to 128K tokens using a progressive training strategy. It is built on a decoder-only sparse Mixture of Experts (MoE) transformer architecture and supports a broad range of text-to-text generation tasks; its primary intended uses are summarization, text classification, extraction, and question-answering.
Architecture
Granite-3.1-3B-A800M-Base is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. The core components include Fine-grained Experts, Dropless Token Routing, and Load Balancing Loss. Key architectural specifications are:
- Embedding size: 1536
- Number of layers: 32
- Attention head size: 64
- Number of attention heads: 24
- MLP hidden size: 512
- Number of experts: 40
- Sequence length: 128K
- Position embedding: RoPE
- Total parameters: 3.3 billion
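These values can be cross-checked against the published checkpoint by loading its configuration with the Hugging Face transformers library. The sketch below is only a minimal example; the attribute names follow the Llama-style conventions used by most transformers configs and may differ between library versions.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-granite/granite-3.1-3b-a800m-base")
# Dump the full configuration, then pick out a few fields that map to the
# specifications listed above. getattr is used because the exact attribute
# names can vary between transformers versions.
print(config)
for name in ("hidden_size", "num_hidden_layers", "num_attention_heads",
             "num_local_experts", "max_position_embeddings"):
    print(name, getattr(config, name, "not present in this version"))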
Training
The model undergoes a three-stage training process using a mix of open-source and proprietary data:
- Stage 1: Involves diverse domain data including web, code, academic sources, books, and math data.
- Stage 2: Focuses on high-quality curated data from the same domains with additional multilingual and instruction data.
- Stage 3: Incorporates synthetic long-context data in the form of QA/summary pairs.
Training is conducted on IBM's Blue Vela supercomputing cluster equipped with NVIDIA H100 GPUs.
Guide: Running Locally
To run Granite-3.1-3B-A800M-Base locally, follow these steps:
- Install Required Libraries:
pip install torch torchvision torchaudio
pip install accelerate
pip install transformers
- Setup and Run Example:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-3b-a800m-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# device_map="auto" places the model on a GPU when one is available
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

input_text = "Where is the Thomas J. Watson Research Center located?"
# move the tokenized input to the same device the model was loaded on
input_tokens = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_length=4000)
output_text = tokenizer.batch_decode(output)
print(output_text)
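Since this repository distributes GGUF quantizations, the model can also be run on CPU-only machines through llama.cpp bindings such as llama-cpp-python. The sketch below is a minimal example under that assumption; the filename granite-3.1-3b-a800m-base.Q4_K_M.gguf is purely illustrative and should be replaced with one of the quantization files actually published in the repository.

# pip install llama-cpp-python
from llama_cpp import Llama

# Path to a locally downloaded GGUF file; the name below is a hypothetical
# example, not the exact filename shipped by the repository.
llm = Llama(model_path="./granite-3.1-3b-a800m-base.Q4_K_M.gguf", n_ctx=4096)

output = llm("Where is the Thomas J. Watson Research Center located?", max_tokens=128)
print(output["choices"][0]["text"])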
For optimal performance, consider using cloud services with access to powerful GPUs, such as AWS, Google Cloud, or Azure.
License
The Granite-3.1-3B-A800M-Base model is licensed under the Apache License 2.0.