Zamba2 7 B Instruct
ZyphraIntroduction
Zamba2-7B-Instruct is a model derived from Zamba2-7B by fine-tuning it on instruction-following and chat datasets. It is a hybrid model that integrates state-space and transformer blocks to enhance its performance in text generation tasks.
Architecture
The model utilizes a hybrid SSM-attention architecture, combining Mamba2 layers with shared attention layers. The attention layers use shared weights to reduce parameter costs. Additionally, LoRA projection matrices are applied to the shared MLP to enhance expressivity. The architecture supports a long-context mode, extending context from 4k to 16k by adjusting the rotary position embeddings.
Training
The model is fine-tuned to follow instructions and engage in chat-like interactions. It achieves strong benchmark scores in various tasks, indicating its capability for instruction-following and reasoning. The model's architecture allows for low inference latency, rapid generation, and a reduced memory footprint.
Guide: Running Locally
Prerequisites
- Clone the Zyphra's fork of transformers:
git clone https://github.com/Zyphra/transformers_zamba2.git cd transformers_zamba2
- Install the repository:
pip install -e .
- Install additional dependencies:
pip install accelerate
Inference
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-7B-instruct")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-7B-instruct", device_map="cuda", torch_dtype=torch.bfloat16)
# Prepare input
sample = [{'role': 'user', 'content': "Your question here"}, {'role': 'assistant', 'content': "Response"}]
chat_sample = tokenizer.apply_chat_template(sample, tokenize=False)
input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda")
# Generate output
outputs = model.generate(**input_ids, max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
For enhanced performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
License
The model is licensed under the Apache 2.0 License, allowing for both personal and commercial use with attribution.