Llama 3.2 11B Vision

meta-llama

Introduction

The Llama 3.2-Vision collection is a set of multimodal large language models developed by Meta, designed for tasks involving image reasoning, visual recognition, captioning, and answering general questions about an image. It builds on previous Llama models and is reported to outperform many available open-source and closed multimodal models on common industry benchmarks.

Architecture

Llama 3.2-Vision builds upon the Llama 3.1 text-only model using an optimized transformer architecture. It incorporates a vision adapter trained separately, which integrates with the Llama 3.1 model through cross-attention layers, enabling the processing of image inputs alongside text. The model is available in 11B and 90B parameter sizes and supports multiple languages for text-only tasks.
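
To see this wiring at a glance, the minimal sketch below loads the model configuration with transformers and prints which decoder layers carry the cross-attention to image features. It assumes the Mllama configuration in transformers >= 4.45.0 exposes a cross_attention_layers field on the text config, and that you have been granted access to the gated repository on Hugging Face.

  from transformers import AutoConfig

  # Gated repo: requires accepting the Llama 3.2 Community License and an authenticated HF token
  config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-11B-Vision")

  # Size of the vision encoder and the text decoder
  print(config.vision_config.num_hidden_layers)
  print(config.text_config.num_hidden_layers)

  # Indices of the decoder layers that cross-attend to image features
  # (field name assumed from the transformers Mllama implementation)
  print(config.text_config.cross_attention_layers)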

Training

The Llama 3.2-Vision model was trained using 6 billion image-text pairs with a pretraining data cutoff in December 2023. The model uses supervised fine-tuning and reinforcement learning with human feedback to align with user preferences. The training process was conducted on Meta's custom-built GPU cluster, consuming 2.02 million GPU hours and maintaining net-zero greenhouse gas emissions.

Guide: Running Locally

  1. Setup Environment:
    • Ensure transformers version >= 4.45.0 is installed:
      pip install --upgrade transformers
      
  2. Load Model and Processor:
    • Use the following Python code to load the model and processor:
      import requests
      import torch
      from PIL import Image
      from transformers import MllamaForConditionalGeneration, AutoProcessor
      
      model_id = "meta-llama/Llama-3.2-11B-Vision"
      model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
      processor = AutoProcessor.from_pretrained(model_id)
      
  3. Inference Example:
    • Run a sample inference using an image (a note on decoding the output follows this list):
      url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
      image = Image.open(requests.get(url, stream=True).raw)
      prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
      inputs = processor(image, prompt, return_tensors="pt").to(model.device)
      output = model.generate(**inputs, max_new_tokens=30)
      print(processor.decode(output[0]))
      

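Note that model.generate returns the prompt tokens followed by the newly generated ones, so processor.decode(output[0]) echoes the prompt as well. A minimal follow-up sketch, reusing the inputs and output variables from step 3, decodes only the newly generated tokens:

  # Slice off the prompt portion before decoding
  prompt_len = inputs["input_ids"].shape[-1]
  generated_tokens = output[0][prompt_len:]
  print(processor.decode(generated_tokens, skip_special_tokens=True))
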
Cloud GPUs: For optimal performance, consider using cloud GPU services like AWS, Google Cloud, or Azure, which offer powerful GPUs suitable for running large models like Llama 3.2-Vision.
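
If local GPU memory is limited, one common alternative is to load the 11B checkpoint with 4-bit quantization via bitsandbytes. The sketch below is not part of Meta's documentation; it assumes the bitsandbytes package is installed alongside transformers and reuses the model_id from the guide above.

  import torch
  from transformers import MllamaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

  model_id = "meta-llama/Llama-3.2-11B-Vision"

  # Quantize weights to 4-bit NF4 to reduce VRAM usage; compute in bfloat16
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )

  model = MllamaForConditionalGeneration.from_pretrained(
      model_id,
      quantization_config=bnb_config,
      device_map="auto",
  )
  processor = AutoProcessor.from_pretrained(model_id)

Quantization trades some output quality for a much smaller memory footprint, so the bfloat16 setup from the guide remains preferable when a sufficiently large GPU is available.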

License

The use of Llama 3.2-Vision is governed by the Llama 3.2 Community License. Users are granted a non-exclusive, worldwide, non-transferable, royalty-free license to use, reproduce, and modify the Llama Materials. Redistribution of the materials requires attribution and compliance with the Acceptable Use Policy. Additional commercial terms apply for entities with over 700 million monthly active users. The full license details and acceptable use policy are available through Meta's documentation.
