Llama 3.2 11B Vision
Introduction
The Llama 3.2-Vision model is a collection of multimodal large language models developed by Meta, designed for tasks involving image reasoning, visual recognition, captioning, and answering general questions about images. Meta reports that it outperforms many available open-source and closed multimodal models on common industry benchmarks.
Architecture
Llama 3.2-Vision builds upon the Llama 3.1 text-only model using an optimized transformer architecture. It incorporates a vision adapter trained separately, which integrates with the Llama 3.1 model through cross-attention layers, enabling the processing of image inputs alongside text. The model is available in 11B and 90B parameter sizes and supports multiple languages for text-only tasks.
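To make the cross-attention idea concrete, below is a minimal PyTorch sketch of an adapter layer in which text hidden states attend to projected image features. The class name, dimensions, and gating scheme are illustrative assumptions, not Meta's implementation; the sketch only shows the general mechanism.

```python
import torch
import torch.nn as nn

class VisionCrossAttentionAdapter(nn.Module):
    """Illustrative cross-attention adapter: text tokens attend to image features."""

    def __init__(self, hidden_size: int, vision_size: int, num_heads: int = 8):
        super().__init__()
        # Project vision features into the text model's hidden dimension.
        self.vision_proj = nn.Linear(vision_size, hidden_size)
        # Cross-attention: text hidden states query the projected image features.
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Learned gate (starts at zero) so the adapter begins as a near-identity on text states.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        image_keys = self.vision_proj(vision_feats)               # (B, num_patches, hidden)
        attended, _ = self.cross_attn(text_hidden, image_keys, image_keys)
        return text_hidden + torch.tanh(self.gate) * attended     # gated residual fusion

# Toy usage with made-up sizes: batch of 2, 16 text tokens, 256 image patches.
adapter = VisionCrossAttentionAdapter(hidden_size=4096, vision_size=1280)
text_hidden = torch.randn(2, 16, 4096)
vision_feats = torch.randn(2, 256, 1280)
print(adapter(text_hidden, vision_feats).shape)  # torch.Size([2, 16, 4096])
```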
Training
The Llama 3.2-Vision model was trained using 6 billion image-text pairs with a pretraining data cutoff in December 2023. The model uses supervised fine-tuning and reinforcement learning with human feedback to align with user preferences. The training process was conducted on Meta's custom-built GPU cluster, consuming 2.02 million GPU hours and maintaining net-zero greenhouse gas emissions.
Guide: Running Locally
- Setup Environment:
  - Ensure transformers version >= 4.45.0 is installed:

```bash
pip install --upgrade transformers
```
- Load Model and Processor:
  - Use the following Python code to load the model and processor:

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```
- Inference Example:
  - Run a sample inference using an image:

```python
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
```
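Note that the decoded output includes the prompt as well as the generated continuation. As a small follow-up sketch (assuming the inputs and output variables from the example above), the prompt tokens can be sliced off so only the newly generated text is printed:

```python
# Decode only the tokens generated after the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][prompt_length:], skip_special_tokens=True))
```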
Cloud GPUs: For best performance, consider cloud GPU services such as AWS, Google Cloud, or Azure, which provide accelerators with enough memory to run large models like Llama 3.2-Vision.
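If only a smaller GPU is available, one option not covered in the original guide is to load the 11B checkpoint with 4-bit quantization to reduce VRAM requirements. The sketch below assumes the bitsandbytes package is installed and a CUDA GPU is available:

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-11B-Vision"

# 4-bit NF4 quantization; requires the bitsandbytes package and a CUDA GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```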
License
The use of Llama 3.2-Vision is governed by the Llama 3.2 Community License. Users are granted a non-exclusive, worldwide, non-transferable, royalty-free license to use, reproduce, and modify the Llama Materials. Redistribution of the materials requires attribution and compliance with the Acceptable Use Policy. Additional commercial terms apply for entities with over 700 million monthly active users. The full license details and acceptable use policy are available through Meta's documentation.