Llama 3.2 11B Vision Instruct

meta-llama

Introduction

Llama 3.2-Vision is a collection of multimodal large language models (LLMs) developed by Meta. These models are pre-trained and instruction-tuned for tasks such as visual recognition, image reasoning, captioning, and answering questions about images. The models are available in sizes of 11B and 90B parameters and are optimized for performance on industry benchmarks.

Architecture

Llama 3.2-Vision builds upon the Llama 3.1 text-only model, which uses an optimized transformer architecture. It adds a vision adapter, consisting of cross-attention layers that feed image encoder representations into the pre-trained language model. This design accepts combined image-and-text input and produces text output, enhancing the model's ability to handle complex image-text tasks.
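
To make the adapter idea concrete, the following is a conceptual sketch, not Meta's actual implementation: it shows one way a cross-attention layer can let text hidden states attend to image features while leaving the pre-trained language model largely untouched. The hidden size, head count, gating scheme, and class name are illustrative assumptions.

import torch
import torch.nn as nn

class CrossAttentionAdapterLayer(nn.Module):
    """Illustrative cross-attention adapter: text hidden states attend to image features."""

    def __init__(self, hidden_size: int = 4096, num_heads: int = 32):
        super().__init__()
        # Text hidden states act as queries; vision-encoder features act as keys/values.
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        # A learnable gate lets the adapter start out as a near-identity function.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.cross_attn(
            query=self.norm(text_hidden),
            key=image_features,
            value=image_features,
        )
        # Gated residual: with the gate at zero the language model's behavior is preserved;
        # during training the gate opens and visual information is mixed in.
        return text_hidden + torch.tanh(self.gate) * attn_out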

Training

The models were trained using a combination of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Pretraining involved 6 billion image-text pairs, with fine-tuning using over 3 million synthetically generated examples. Training was conducted on Meta's custom-built GPU cluster, using 2.02 million GPU hours of computation. Meta emphasizes responsible training practices, maintaining net-zero greenhouse gas emissions.

Guide: Running Locally

To run Llama 3.2-Vision locally, follow these steps:

  1. Install Prerequisites:

    • Ensure you have Python installed.
    • Install the latest version of transformers using pip:
      pip install --upgrade transformers
      
  2. Download the Model:

    • Use the huggingface-cli to download model checkpoints:
      huggingface-cli download meta-llama/Llama-3.2-11B-Vision-Instruct --include "original/*" --local-dir Llama-3.2-11B-Vision-Instruct
      
  3. Run Inference:

    • Use the transformers library to load the model and processor, then run generation as shown in the example sketch after this list.
  4. Hardware Recommendations:

    • For optimal performance, it is recommended to use cloud GPUs, such as NVIDIA V100 or A100, available from providers like AWS, Google Cloud, or Azure.
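
The snippet below is a minimal inference sketch using the transformers API (MllamaForConditionalGeneration and AutoProcessor). It assumes transformers 4.45 or newer, approved access to the gated repository, and a GPU with enough memory for the 11B model in bfloat16; the image URL and prompt are placeholder values.

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the model in bfloat16 and place it on available GPUs automatically.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; replace with any image you want to ask about.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt that pairs an image with a text question.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Generate a short answer and decode it back to text.
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))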

License

The use of Llama 3.2 is governed by the Llama 3.2 Community License. This agreement grants a non-exclusive, worldwide, non-transferable, and royalty-free license to use, reproduce, distribute, and modify the Llama Materials. Redistribution requires adherence to specific terms, including providing a copy of the license agreement and displaying "Built with Llama" prominently. The license also includes restrictions on use cases, particularly those violating laws or involving prohibited content. For full details, refer to the license documentation provided by Meta.
