Llama 3.2 11B Vision Instruct
Introduction
Llama 3.2-Vision is a collection of multimodal large language models (LLMs) developed by Meta. These models are pre-trained and instruction-tuned for tasks such as visual recognition, image reasoning, captioning, and answering questions about images. The models are available in sizes of 11B and 90B parameters and are optimized for performance on industry benchmarks.
Architecture
Llama 3.2-Vision builds upon the Llama 3.1 text-only model, which uses an optimized transformer architecture. It integrates a vision adapter, consisting of cross-attention layers, that feeds image encoder representations into the pre-trained language model. This design accepts combined image-and-text inputs (with text outputs), enhancing the model's ability to handle complex image-text tasks.
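The adapter can be pictured as a gated residual block in which text hidden states attend to projected image features. The sketch below is a minimal PyTorch illustration of that idea only; the dimensions, zero-initialized gate, and layer placement are assumptions for exposition, not Meta's exact implementation.

```python
import torch
import torch.nn as nn

class VisionCrossAttentionBlock(nn.Module):
    """Illustrative cross-attention adapter: text hidden states attend to
    projected image features. All sizes here are assumptions for exposition."""

    def __init__(self, d_model: int = 4096, n_heads: int = 32, d_vision: int = 1280):
        super().__init__()
        self.img_proj = nn.Linear(d_vision, d_model)  # map vision-encoder features into the LM width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init gate: the pre-trained LM is unchanged at start

    def forward(self, text_h: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_h: (batch, seq_len, d_model); image_feats: (batch, n_patches, d_vision)
        kv = self.norm(self.img_proj(image_feats))
        attn_out, _ = self.attn(query=text_h, key=kv, value=kv)
        # Gated residual injection of visual context into the text stream.
        return text_h + torch.tanh(self.gate) * attn_out
```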
Training
The models were trained using a combination of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Pretraining used 6 billion image-text pairs, and fine-tuning used over 3 million synthetically generated examples. Training was conducted on Meta's custom-built GPU cluster and consumed 2.02 million GPU hours of computation. Meta emphasizes responsible training practices and maintains net-zero greenhouse gas emissions.
Guide: Running Locally
To run Llama 3.2-Vision locally, follow these steps:
- Install Prerequisites:
  - Ensure you have Python installed.
  - Install the latest version of `transformers` using pip: `pip install --upgrade transformers`
- Download the Model:
  - Use the `huggingface-cli` to download the model checkpoints: `huggingface-cli download meta-llama/Llama-3.2-11B-Vision-Instruct --include "original/*" --local-dir Llama-3.2-11B-Vision-Instruct`
- Run Inference:
  - Use the `transformers` library to load the model and processor, then run inference as in the Python sketch after this list.
- Hardware Recommendations:
  - For optimal performance, it is recommended to use cloud GPUs, such as NVIDIA V100 or A100, available from providers like AWS, Google Cloud, or Azure; see the rough memory estimate after this list.
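For the inference step, the following is a minimal sketch of loading the instruct model and processor with `transformers` (the Mllama classes require transformers >= 4.45). The image URL and prompt are placeholders; substitute your own inputs.

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the model in bfloat16 and let accelerate place it on available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image; any local or remote image works.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt that interleaves the image with a text question.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```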
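On the hardware point, a quick back-of-the-envelope check motivates the recommendation: in 16-bit precision the 11B weights alone occupy roughly 22 GB, before activations and KV cache, so 40 GB-class cards run comfortably while 16 GB cards generally need quantization or CPU offloading. The figures below are rough assumptions, not measured requirements.

```python
# Rough VRAM estimate for the 11B model (weights only; activations and
# KV cache add to this). All numbers are back-of-the-envelope assumptions.
params = 11e9          # ~11 billion parameters
bytes_per_param = 2    # bfloat16 / float16
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~22 GB
```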
License
The use of Llama 3.2 is governed by the Llama 3.2 Community License. This agreement grants a non-exclusive, worldwide, non-transferable, and royalty-free license to use, reproduce, distribute, and modify the Llama Materials. Redistribution requires adherence to specific terms, including providing a copy of the license agreement and displaying "Built with Llama" prominently. The license also includes restrictions on use cases, particularly those violating laws or involving prohibited content. For full details, refer to the license documentation provided by Meta.