Llama 3.2 90B Vision

meta-llama

Introduction

The Llama 3.2-Vision collection by Meta is a suite of multimodal large language models optimized for image reasoning, visual recognition, and general question answering. These models support text and image inputs, providing enhanced capabilities for tasks like image captioning and visual question answering.

Architecture

Llama 3.2-Vision builds on the Llama 3.1 text-only model, which uses an optimized transformer architecture. It adds a separately trained vision adapter, a series of cross-attention layers that feed image encoder representations into the core language model so that image and text inputs can be processed together. The instruction-tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to improve alignment with human preferences.
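
As a quick way to see how the vision adapter is wired into the text stack, the sketch below inspects the published configuration with Hugging Face transformers. It assumes access to the gated meta-llama repository and that the Mllama text config exposes a cross_attention_layers field listing the decoder layers that attend to image features:

    from transformers import AutoConfig

    # Requires access to the gated meta-llama repository (e.g. after `huggingface-cli login`).
    config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-90B-Vision")

    # Indices of the decoder layers that cross-attend to the vision adapter's image features;
    # the remaining layers are standard Llama 3.1 self-attention blocks.
    print(config.text_config.cross_attention_layers)
    print(config.text_config.num_hidden_layers)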

Training

Llama 3.2-Vision models were trained on 6 billion image-text pairs, with a pretraining data cutoff of December 2023. Training ran on Meta's custom-built GPU infrastructure and used a cumulative 2.02 million GPU hours on H100 hardware. Because Meta matches its electricity consumption with renewable energy, the market-based greenhouse gas emissions for training are reported as zero.

Guide: Running Locally

  1. Install Transformers: Ensure you have transformers >= 4.45.0 by running pip install --upgrade transformers.
  2. Load Model and Processor (base checkpoint; a sketch for the instruction-tuned variant follows this list):
    import requests
    import torch
    from PIL import Image
    from transformers import MllamaForConditionalGeneration, AutoProcessor
    
    model_id = "meta-llama/Llama-3.2-90B-Vision"
    # Load the checkpoint in bfloat16 and shard it across the available devices.
    model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)
    
    # Download an example image.
    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    # The base (pretrained) checkpoint expects a raw prompt beginning with the <|image|> token.
    prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
    inputs = processor(image, prompt, return_tensors="pt").to(model.device)
    
    output = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(output[0]))
    
  3. Use Cloud GPUs: For optimal performance, consider using cloud GPU services like AWS, GCP, or Azure.
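
The steps above use the pretrained (base) checkpoint. For chat-style prompting, Meta also publishes an instruction-tuned variant; the sketch below assumes the meta-llama/Llama-3.2-90B-Vision-Instruct repository and the processor's chat-template support in transformers >= 4.45.0:

    import requests
    import torch
    from PIL import Image
    from transformers import MllamaForConditionalGeneration, AutoProcessor

    # Instruction-tuned variant (a separate gated repository).
    instruct_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"
    model = MllamaForConditionalGeneration.from_pretrained(instruct_id, torch_dtype=torch.bfloat16, device_map="auto")
    processor = AutoProcessor.from_pretrained(instruct_id)

    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # Chat-style message: an image placeholder followed by the text turn.
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "If I had to write a haiku for this one, it would be:"},
        ]}
    ]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

    inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(output[0]))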

License

Llama 3.2 is licensed under the Llama 3.2 Community License. It permits use, reproduction, and modification under specific terms. Redistribution requires attribution and compliance with the Acceptable Use Policy. Commercial use requires a separate license from Meta if a licensee's products or services exceed 700 million monthly active users. The full license text is distributed with the model.
