Llama 3.2 90B Vision
Introduction
The Llama 3.2-Vision collection by Meta is a suite of multimodal large language models optimized for image reasoning, visual recognition, and general question answering. These models support text and image inputs, providing enhanced capabilities for tasks like image captioning and visual question answering.
Architecture
Llama 3.2-Vision builds upon the Llama 3.1 text-only model using an optimized transformer architecture. It integrates a separately trained vision adapter with cross-attention layers to process image inputs alongside text. The models are instruction-tuned using supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) for improved alignment with human preferences.
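To make the cross-attention idea concrete, here is a minimal, illustrative PyTorch sketch in which text hidden states act as queries against image features produced by a vision adapter. The module name, dimensions, normalization choice, and use of nn.MultiheadAttention are assumptions for exposition, not Meta's actual implementation.

```python
import torch
import torch.nn as nn

class VisionCrossAttentionBlock(nn.Module):
    """Toy cross-attention block: text hidden states (queries) attend to
    image features (keys/values) from a vision adapter. Illustrative only;
    not the Llama 3.2 implementation."""

    def __init__(self, hidden_size: int = 4096, num_heads: int = 32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, text_states: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        # Queries come from the text stream; keys/values come from the vision adapter.
        attn_out, _ = self.cross_attn(text_states, image_features, image_features)
        # A residual connection keeps the text-only path usable when no image is present.
        return self.norm(text_states + attn_out)

# Example shapes: batch of 1, 16 text tokens, 64 image patch features.
block = VisionCrossAttentionBlock()
text = torch.randn(1, 16, 4096)
image = torch.randn(1, 64, 4096)
print(block(text, image).shape)  # torch.Size([1, 16, 4096])
```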
Training
Llama 3.2-Vision models were trained on 6 billion image-text pairs, with a data cutoff in December 2023. Training employed Meta's custom GPU infrastructure, utilizing 2.02 million GPU hours on H100 hardware. The environmental impact was minimized through renewable energy sources, resulting in zero market-based greenhouse gas emissions.
Guide: Running Locally
- Install Transformers: Ensure you have `transformers >= 4.45.0` by running `pip install --upgrade transformers`.
- Load Model and Processor:
```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-90B-Vision"

# Load the model in bfloat16 and let Accelerate place it across available GPUs.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Fetch an example image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The base (non-instruct) model expects the <|image|> token before the text prompt.
prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
```
- Use Cloud GPUs: For optimal performance, consider using cloud GPU services like AWS, GCP, or Azure; a sketch of loading the model with 4-bit quantization for smaller setups follows this list.
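The 90B checkpoint is large (roughly 180 GB for the weights alone at 2 bytes per parameter in bfloat16), so one common way to fit it on fewer GPUs is 4-bit quantization. The snippet below is a minimal sketch assuming the optional bitsandbytes package is installed; the quantization settings are illustrative choices, not part of the original guide.

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-90B-Vision"

# Assumption: `pip install bitsandbytes` and a CUDA-capable GPU are available.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```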
License
Llama 3.2 is licensed under the Llama 3.2 Community License. It permits use, reproduction, and modification under specific terms. Redistribution requires attribution and compliance with the Acceptable Use Policy. Commercial use by organizations whose products exceed 700 million monthly active users requires a separate license from Meta. The full license is available here.