Idefics3-8B-Llama3

HuggingFaceM4

Introduction

Idefics3-8B-Llama3 is a state-of-the-art open multimodal model developed by Hugging Face. This model processes image and text inputs to generate text outputs, excelling in tasks such as image captioning and visual question answering. It significantly improves upon its predecessors, Idefics1 and Idefics2, particularly in areas like OCR, document understanding, and visual reasoning.

Architecture

Idefics3-8B is a multimodal model built from two parent models: google/siglip-so400m-patch14-384 as the vision encoder and meta-llama/Meta-Llama-3.1-8B-Instruct as the language backbone. It is available through the Transformers library and encodes each image of up to 364x364 pixels into 169 visual tokens. By jointly encoding image and text inputs, the architecture supports a wide array of multimodal tasks.
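The 169-token figure can be sanity-checked with simple arithmetic. The sketch below assumes SigLIP's 14-pixel patches and a pixel-shuffle reduction factor of 2; the shuffle factor is an assumption not stated in this section.

```python
# Back-of-envelope check of the 169 visual tokens per 364x364 image.
# Assumptions: 14-pixel SigLIP patches, pixel-shuffle factor of 2.
image_side = 364
patch_size = 14
shuffle_factor = 2

patches_per_side = image_side // patch_size           # 26 patches per side
tokens_per_side = patches_per_side // shuffle_factor  # 13 tokens per side
visual_tokens = tokens_per_side ** 2                  # 13 * 13 = 169

print(visual_tokens)  # 169
```

Under these assumptions, the numbers line up exactly with the stated token budget.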

Training

The training process for Idefics3-8B involves supervised fine-tuning without reinforcement learning from human feedback (RLHF). This results in the model sometimes producing short responses, which may require iterative prompting. It leverages a variety of datasets, including OBELICS, The Cauldron, Docmatix, and WebSight, to enhance its capabilities in different tasks.

Guide: Running Locally

  1. Prerequisites: Ensure you have Python installed along with the necessary libraries, including torch and transformers.

  2. Load the Model:

    import torch
    from transformers import AutoProcessor, AutoModelForVision2Seq
    
    processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/Idefics3-8B-Llama3", torch_dtype=torch.bfloat16
    ).to("cuda:0")
    
  3. Prepare Inputs: Load images and text for processing.

    from transformers.image_utils import load_image
    
    image1 = load_image("image_url_1")
    image2 = load_image("image_url_2")
    
    # Build a chat-formatted prompt; each {"type": "image"} marks an image slot
    messages = [{"role": "user", "content": [
        {"type": "image"}, {"type": "image"},
        {"type": "text", "text": "What do these images have in common?"}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    
  4. Run Inference:

    inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
    inputs = {k: v.to("cuda:0") for k, v in inputs.items()}
    generated_ids = model.generate(**inputs, max_new_tokens=500)
    generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
    print(generated_texts)
    
  5. Optimize Performance: Use half precision (e.g., torch.bfloat16) and adjust image resolution settings if necessary.
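A back-of-envelope estimate shows why half precision matters for an 8B-parameter model. This is an illustrative calculation of weight memory only, not a measured figure; real usage adds activations, KV cache, and framework overhead.

```python
# Rough weight-memory estimate for an 8B-parameter model at two precisions.
params = 8_000_000_000

bytes_fp32 = params * 4  # float32: 4 bytes per parameter
bytes_bf16 = params * 2  # bfloat16: 2 bytes per parameter

print(bytes_fp32 / 1e9)  # 32.0 GB just for weights in float32
print(bytes_bf16 / 1e9)  # 16.0 GB in bfloat16
```

Halving weight memory is often the difference between fitting on a single 24 GB consumer GPU with offloading and not fitting at all.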

Cloud GPUs: For enhanced performance, consider using cloud-based GPU providers such as AWS, Google Cloud, or Azure.

License

Idefics3-8B-Llama3 is released under the Apache 2.0 license, allowing for both personal and commercial usage with proper attribution. It builds upon the pre-trained models provided by Google and Meta.
