AI21 Jamba 1.5 Mini

ai21labs

Introduction

AI21 Jamba 1.5 models are advanced hybrid SSM-Transformer models designed for efficient long-context instruction following. They provide up to 2.5 times faster inference than comparable models and are suitable for business applications, including function calling and structured output generation.

Architecture

The Jamba models feature a hybrid Joint Attention and Mamba (Jamba) architecture, offering strong long-context handling, speed, and quality. They are optimized for both research and commercial use, support multiple languages, and provide a 256K-token context window.

Training

The AI21 Jamba 1.5 release includes examples of full fine-tuning, LoRA fine-tuning, and QLoRA fine-tuning, along with ExpertsInt8 quantization for efficient deployment on limited GPU resources.
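The LoRA idea mentioned above can be illustrated with a minimal NumPy sketch (not the actual fine-tuning code): a frozen weight matrix W is augmented with two small trainable low-rank factors A and B, and only the factors are trained. All shapes and values here are illustrative.

```python
import numpy as np

# Minimal sketch of the LoRA idea: instead of updating a full weight
# matrix W (d_out x d_in), train two small low-rank factors
# A (r x d_in) and B (d_out x r) and add their scaled product to the
# frozen weights. Dimensions below are arbitrary illustrative values.
rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 128, 8, 16

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                 # trainable up-projection, init 0

# Effective weight after (hypothetical) training: W + (alpha / r) * B @ A
W_eff = W + (alpha / rank) * B @ A

# With B initialized to zero, the adapter starts as an exact no-op.
assert np.allclose(W_eff, W)

# Trainable parameters drop from d_out*d_in to r*(d_in + d_out).
full = d_out * d_in
lora = rank * (d_in + d_out)
print(f"full: {full} params, LoRA: {lora} params")
```

QLoRA applies the same adapter scheme on top of a quantized base model, which is why it pairs naturally with limited GPU resources.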

Guide: Running Locally

  1. Prerequisites: Install necessary packages using pip:

    pip install mamba-ssm "causal-conv1d>=1.2.0"
    pip install "vllm>=0.5.4"
    
  2. Running with vLLM: Deploy the Mini model on at least two 80 GB GPUs:

    from vllm import LLM, SamplingParams
    from transformers import AutoTokenizer
    
    model = "ai21labs/AI21-Jamba-1.5-Mini"
    number_gpus = 2
    
    llm = LLM(model=model, max_model_len=200*1024, tensor_parallel_size=number_gpus)
    tokenizer = AutoTokenizer.from_pretrained(model)
    
    messages = [{"role": "system", "content": "You are an ancient oracle..."}]
    prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    outputs = llm.generate(prompts, SamplingParams(temperature=0.4, top_p=0.95, max_tokens=100))
    print(outputs[0].outputs[0].text)  # generated completion
    
  3. ExpertsInt8 Quantization: Use vLLM version 0.5.5 or higher for efficient quantization:

    import os
    os.environ['VLLM_FUSED_MOE_CHUNK_SIZE'] = '32768'
    from vllm import LLM
    
    llm = LLM(model="ai21labs/AI21-Jamba-1.5-Mini", max_model_len=100*1024, quantization="experts_int8")
    
  4. Cloud GPUs: Consider using cloud providers for access to high-performance GPUs like NVIDIA A100.
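The ExpertsInt8 scheme enabled in step 3 can be illustrated with a small NumPy sketch. This is not vLLM's actual kernel, only the general idea: store an expert's weights as int8 with a per-row scale, and dequantize when the expert is used, roughly halving memory versus fp16.

```python
import numpy as np

# Illustrative sketch (not vLLM's implementation) of int8 expert
# quantization: keep an int8 tensor plus a per-output-row fp32 scale.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32)).astype(np.float32)  # one expert's weight

# Symmetric per-row quantization: map the largest |value| in each row to 127.
scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

# Dequantize on the fly before the expert's matmul.
W_deq = W_q.astype(np.float32) * scale

max_err = np.abs(W - W_deq).max()
print(f"max abs quantization error: {max_err:.4f}")
```

The reconstruction error is bounded by half a quantization step per row, which is why int8 is usually accurate enough for MoE expert weights.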

License

The models are released under the Jamba Open Model License, which permits full research and commercial use in accordance with its terms.
