QuantFactory/granite-3.1-3b-a800m-base-GGUF
Introduction
Granite-3.1-3B-A800M-Base is a language model whose context length has been extended from 4K to 128K tokens using a progressive training strategy. It is built on a decoder-only sparse Mixture of Experts (MoE) transformer architecture and supports a broad range of text-to-text generation tasks; its primary intended uses are summarization, text classification, extraction, and question-answering.
Architecture
Granite-3.1-3B-A800M-Base is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. The core components include Fine-grained Experts, Dropless Token Routing, and Load Balancing Loss. Key architectural specifications are:
- Embedding size: 1536
- Number of layers: 32
- Attention head size: 64
- Number of attention heads: 24
- MLP hidden size: 512
- Number of experts: 40
- Sequence length: 128K
- Position embedding: RoPE
- Total parameters: 3.3 billion
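These values can be cross-checked against the published checkpoint by loading its configuration with the Hugging Face transformers library. The sketch below is only a minimal example; the attribute names follow the Llama-style conventions used by most transformers configs and may differ between library versions.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-granite/granite-3.1-3b-a800m-base")
# Dump the full configuration, then pick out a few fields that map to the
# specifications listed above. getattr is used because the exact attribute
# names can vary between transformers versions.
print(config)
for name in ("hidden_size", "num_hidden_layers", "num_attention_heads",
             "num_local_experts", "max_position_embeddings"):
    print(name, getattr(config, name, "not present in this version"))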
Training
The model undergoes a three-stage training process using a mix of open-source and proprietary data:
- Stage 1: Involves diverse domain data including web, code, academic sources, books, and math data.
- Stage 2: Focuses on high-quality curated data from the same domains with additional multilingual and instruction data.
- Stage 3: Incorporates synthetic long-context data in the form of QA/summary pairs.
Training is conducted on IBM's Blue Vela supercomputing cluster equipped with NVIDIA H100 GPUs.
Guide: Running Locally
To run Granite-3.1-3B-A800M-Base locally, follow these steps:
- Install Required Libraries:
pip install torch torchvision torchaudio
pip install accelerate
pip install transformers
- Setup and Run Example:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-3b-a800m-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# device_map="auto" places the model on a GPU when one is available
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

input_text = "Where is the Thomas J. Watson Research Center located?"
# move the tokenized input to the same device the model was loaded on
input_tokens = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_length=4000)
output_text = tokenizer.batch_decode(output)
print(output_text)
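Since this repository distributes GGUF quantizations, the model can also be run on CPU-only machines through llama.cpp bindings such as llama-cpp-python. The sketch below is a minimal example under that assumption; the filename granite-3.1-3b-a800m-base.Q4_K_M.gguf is purely illustrative and should be replaced with one of the quantization files actually published in the repository.

# pip install llama-cpp-python
from llama_cpp import Llama

# Path to a locally downloaded GGUF file; the name below is a hypothetical
# example, not the exact filename shipped by the repository.
llm = Llama(model_path="./granite-3.1-3b-a800m-base.Q4_K_M.gguf", n_ctx=4096)

output = llm("Where is the Thomas J. Watson Research Center located?", max_tokens=128)
print(output["choices"][0]["text"])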
For optimal performance, consider using cloud services with access to powerful GPUs, such as AWS, Google Cloud, or Azure.
License
The Granite-3.1-3B-A800M-Base model is licensed under the Apache License 2.0.