Llama 3.2 90B Vision Instruct

meta-llama

Introduction

Llama 3.2-Vision is a collection of multimodal large language models (LLMs) designed for image reasoning and text generation tasks. Developed by Meta, these models are optimized for visual recognition, captioning, and answering questions about images, and they outperform many available open-source and closed multimodal models on common industry benchmarks.

Architecture

Llama 3.2-Vision is built on top of the Llama 3.1 text-only model and uses an optimized transformer architecture. The instruction-tuned versions apply supervised fine-tuning and reinforcement learning from human feedback (RLHF) to align with human preferences. To support image input, a separately trained vision adapter feeds image-encoder representations into the core language model through a series of cross-attention layers.
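
To make the adapter design concrete, here is a conceptual sketch of one cross-attention block in PyTorch. This is not Meta's implementation; the class name, dimensions, and layer layout are illustrative assumptions only.

    import torch
    import torch.nn as nn

    class CrossAttentionAdapterBlock(nn.Module):
        # Illustrative only: lets text hidden states attend to image-encoder features.
        def __init__(self, hidden_size=4096, num_heads=32):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(hidden_size)

        def forward(self, text_hidden, image_features):
            # Text tokens are the queries; image patch embeddings are the keys and values.
            attended, _ = self.cross_attn(text_hidden, image_features, image_features)
            # Residual connection and layer norm around the cross-attention output.
            return self.norm(text_hidden + attended)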

Training

The models were pretrained on 6 billion image-text pairs, with additional instruction tuning on publicly available datasets and synthetically generated examples. Training used a cumulative 2.02 million GPU hours on Meta's custom-built GPU infrastructure; Meta reports matching the associated electricity use with renewable energy, putting market-based training emissions at net zero.

Guide: Running Locally

To run Llama 3.2-Vision locally, you will need the transformers library version 4.45.0 or later. Here's a basic setup:

  1. Install Transformers:

    pip install --upgrade transformers
    
  2. Load the Model:

    import torch
    from transformers import MllamaForConditionalGeneration, AutoProcessor
    
    model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"
    model = MllamaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id)
    
  3. Prepare Input and Run Inference:
    Use the processor to prepare the inputs (an image plus chat-formatted text) and generate output with the model, as in the sketch below.
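
    The following sketch continues from step 2 and reuses the model and processor objects. The image path "example.jpg" and the prompt are placeholders; replace them with your own inputs.

    from PIL import Image

    # Load a local image (placeholder path) and build a chat-style prompt
    # containing an image slot and a text question.
    image = Image.open("example.jpg")
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."}
        ]}
    ]

    # The processor renders the chat messages into the model's prompt format
    # and packs the image and text into tensors.
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

    # Generate and decode the model's reply.
    output = model.generate(**inputs, max_new_tokens=100)
    print(processor.decode(output[0], skip_special_tokens=True))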

Even in bfloat16, the 90B model's weights occupy roughly 180 GB, so it will not fit on a single consumer GPU; for practical inference, consider high-memory accelerators such as NVIDIA's H100, available through cloud providers.
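
To reduce the memory footprint, one common option is 4-bit quantization via the bitsandbytes integration in transformers. The sketch below is an assumption-laden example rather than an official recommendation from the model card; quantization trades some output quality for memory.

    import torch
    from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration

    # 4-bit quantization config (requires the bitsandbytes package to be installed).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = MllamaForConditionalGeneration.from_pretrained(
        "meta-llama/Llama-3.2-90B-Vision-Instruct",
        quantization_config=bnb_config,
        device_map="auto",
    )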

License

Llama 3.2 is governed by the Llama 3.2 Community License. The license grants a non-exclusive, worldwide, non-transferable, and royalty-free limited license to use, reproduce, distribute, and modify the Llama Materials. Redistribution must include the license agreement and proper attribution. Compliance with applicable laws and regulations is mandatory.
