Llama 3.2 90B Vision

meta-llama

Introduction

The Llama 3.2-Vision collection by Meta is a suite of multimodal large language models optimized for image reasoning, visual recognition, and general question answering. These models support text and image inputs, providing enhanced capabilities for tasks like image captioning and visual question answering.

Architecture

Llama 3.2-Vision builds on the Llama 3.1 text-only model, which uses an optimized transformer architecture. It adds a separately trained vision adapter, a series of cross-attention layers that feed image encoder representations into the core language model so that image and text inputs can be processed together. The instruction-tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to improve alignment with human preferences.
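
As a quick way to see how the vision adapter is wired into the text stack, the sketch below inspects the published configuration with Hugging Face transformers. It assumes access to the gated meta-llama repository and that the Mllama text config exposes a cross_attention_layers field listing the decoder layers that attend to image features:

    from transformers import AutoConfig

    # Requires access to the gated meta-llama repository (e.g. after `huggingface-cli login`).
    config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-90B-Vision")

    # Indices of the decoder layers that cross-attend to the vision adapter's image features;
    # the remaining layers are standard Llama 3.1 self-attention blocks.
    print(config.text_config.cross_attention_layers)
    print(config.text_config.num_hidden_layers)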

Training

Llama 3.2-Vision models were trained on 6 billion image-text pairs, with a pretraining data cutoff of December 2023. Training ran on Meta's custom-built GPU infrastructure and used a cumulative 2.02 million GPU hours on H100 hardware. Because Meta matches its electricity consumption with renewable energy, the market-based greenhouse gas emissions for training are reported as zero.

Guide: Running Locally

  1. Install Transformers: Ensure you have transformers >= 4.45.0 by running pip install --upgrade transformers.
  2. Load Model and Processor (base checkpoint; a sketch for the instruction-tuned variant follows this list):
    import requests
    import torch
    from PIL import Image
    from transformers import MllamaForConditionalGeneration, AutoProcessor
    
    model_id = "meta-llama/Llama-3.2-90B-Vision"
    # Load the checkpoint in bfloat16 and shard it across the available devices.
    model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)
    
    # Download an example image.
    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    # The base (pretrained) checkpoint expects a raw prompt beginning with the <|image|> token.
    prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
    inputs = processor(image, prompt, return_tensors="pt").to(model.device)
    
    output = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(output[0]))
    
  3. Use Cloud GPUs: For optimal performance, consider using cloud GPU services like AWS, GCP, or Azure.
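
The steps above use the pretrained (base) checkpoint. For chat-style prompting, Meta also publishes an instruction-tuned variant; the sketch below assumes the meta-llama/Llama-3.2-90B-Vision-Instruct repository and the processor's chat-template support in transformers >= 4.45.0:

    import requests
    import torch
    from PIL import Image
    from transformers import MllamaForConditionalGeneration, AutoProcessor

    # Instruction-tuned variant (a separate gated repository).
    instruct_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"
    model = MllamaForConditionalGeneration.from_pretrained(instruct_id, torch_dtype=torch.bfloat16, device_map="auto")
    processor = AutoProcessor.from_pretrained(instruct_id)

    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # Chat-style message: an image placeholder followed by the text turn.
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "If I had to write a haiku for this one, it would be:"},
        ]}
    ]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

    inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(output[0]))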

License

Llama 3.2 is licensed under the Llama 3.2 Community License. It permits use, reproduction, and modification under specific terms. Redistribution requires attribution and compliance with the Acceptable Use Policy. Commercial use requires a separate license from Meta if a licensee's products or services exceed 700 million monthly active users. The full license text is distributed with the model.
