Idefics3-8B-Llama3
HuggingFaceM4

Introduction
Idefics3-8B-Llama3 is a state-of-the-art open multimodal model developed by Hugging Face. This model processes image and text inputs to generate text outputs, excelling in tasks such as image captioning and visual question answering. It significantly improves upon its predecessors, Idefics1 and Idefics2, particularly in areas like OCR, document understanding, and visual reasoning.
Architecture
Idefics3-8B is a multimodal model built from two parent models: google/siglip-so400m-patch14-384 (vision encoder) and meta-llama/Meta-Llama-3.1-8B-Instruct (language backbone). It uses the Transformers library and encodes each 364x364-pixel image into up to 169 visual tokens. The architecture supports a wide array of multimodal tasks by jointly encoding image and text inputs.
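As a quick sanity check, the 169-token figure is consistent with SigLIP's 14-pixel patch grid followed by a 2x2 pixel-shuffle reduction; the shuffle factor here is an assumption taken from the Idefics3 technical report, not stated in this card:

```python
# Where the 169 visual tokens come from (assuming a 2x2 pixel shuffle).
image_side = 364
patch_side = 14      # SigLIP so400m patch size
shuffle_factor = 2   # pixel shuffle merges each 2x2 block of patches into one token

patches_per_side = image_side // patch_side                 # 26
visual_tokens = (patches_per_side // shuffle_factor) ** 2   # 13 ** 2 = 169
print(visual_tokens)  # 169
```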
Training
The training process for Idefics3-8B involves supervised fine-tuning without reinforcement learning from human feedback (RLHF). This results in the model sometimes producing short responses, which may require iterative prompting. It leverages a variety of datasets, including OBELICS, The Cauldron, Docmatix, and WebSight, to enhance its capabilities in different tasks.
Guide: Running Locally
- Prerequisites: Ensure you have Python installed along with the necessary libraries, including `torch` and `transformers`.
- Load the Model:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/Idefics3-8B-Llama3",
    torch_dtype=torch.bfloat16,
).to("cuda:0")
```

- Prepare Inputs: Load images and text for processing.

```python
from transformers.image_utils import load_image

image1 = load_image("image_url_1")
image2 = load_image("image_url_2")
# Prepare your input text and images
```

- Run Inference:

```python
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to("cuda:0") for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
```

- Optimize Performance: Use half precision (e.g., `torch.bfloat16`) and adjust image resolution settings if necessary.
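The steps above leave prompt construction open. A minimal sketch of the chat-style message structure the Idefics3 processor expects (the question text is illustrative; rendering the final prompt string requires the loaded processor):

```python
# Sketch: building a chat-style message list for Idefics3.
# Each {"type": "image"} entry marks where an image's tokens will be inserted.

def build_messages(question: str, num_images: int) -> list:
    """Construct the user-turn message structure for the chat template."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

messages = build_messages("What do these two images have in common?", num_images=2)

# With the processor loaded as in step 2, the prompt string would be rendered via:
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
```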
Cloud GPUs: For enhanced performance, consider using cloud-based GPU providers such as AWS, Google Cloud, or Azure.
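When sizing a GPU instance, a rough back-of-the-envelope estimate for the weights alone (activations and the KV cache add more on top) is:

```python
# Rough VRAM needed just to hold 8B parameters in bfloat16.
params = 8e9
bytes_per_param = 2  # bfloat16 is 2 bytes per parameter

weight_gib = params * bytes_per_param / 2**30
print(f"{weight_gib:.1f} GiB")  # ~14.9 GiB, so a 16 GB card is a tight fit
```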
License
Idefics3-8B-Llama3 is released under the Apache 2.0 license, allowing for both personal and commercial usage with proper attribution. It builds upon the pre-trained models provided by Google and Meta.