Ferret-UI-Llama8b
jadechoghari/Ferret-UI-Llama8b
Introduction
Ferret-UI is the first UI-centric multimodal large language model (MLLM) designed for referring, grounding, and reasoning tasks. It comes in two variants, built on Gemma-2B and Llama-3-8B, and can execute complex UI tasks. This model is the Llama-3-8B variant of Ferret-UI, based on a paper by Apple.
Architecture
The model is a multimodal large language model (MLLM) aimed at tasks involving UI interactions. It is built with the Transformers library and performs image-text-to-text processing: it takes an image together with a text prompt and generates text, which supports both text generation and conversational use.
Training
The documentation does not describe the training process. Given the architecture, the model was presumably trained on large image and text datasets, with a focus on grounding and reasoning capabilities.
Guide: Running Locally
To run the model locally, follow these basic steps:
- Download Required Files: Use the wget command to download the necessary Python scripts:

    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/conversation.py
    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/builder.py
    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/inference.py
    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/model_UI.py
    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/mm_utils.py

- Usage Example:

    from inference import inference_and_run

    image_path = "appstore_reminders.png"
    prompt = "Describe the image in details"

    # Call the function without a box
    inference_text = inference_and_run(image_path, prompt)
    print("Inference Text:", inference_text)

    # Task with bounding boxes
    box = [189, 906, 404, 970]
    inference_text = inference_and_run(
        image_path=image_path,
        prompt=prompt,
        conv_mode="ferret_llama_3",
        model_path="jadechoghari/Ferret-UI-Llama8b",
        box=box
    )
    print("Inference Text:", inference_text)

- Grounding Prompts: Use predefined templates to provide or request bounding boxes for objects in images.
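The box passed to inference_and_run above is a list of four integers. The following is a minimal sketch of checking such a box before inference and of appending a grounding instruction to a prompt. It assumes the box is [x1, y1, x2, y2] in pixel coordinates, which the documentation does not state explicitly. The names validate_box, build_grounding_prompt, and GROUNDING_SUFFIX are hypothetical and are not part of the downloaded scripts, and the suffix text is a placeholder for whichever predefined template the model card lists.

```python
from typing import List, Sequence

# Hypothetical placeholder; substitute a template from the model card.
GROUNDING_SUFFIX = "Answer in [x1, y1, x2, y2] format."


def validate_box(box: Sequence[int], width: int, height: int) -> List[int]:
    """Check that box is [x1, y1, x2, y2] and fits inside a width x height image.

    Assumes pixel coordinates with the origin at the top-left corner.
    """
    if len(box) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(box)}")
    x1, y1, x2, y2 = (int(v) for v in box)
    if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
        raise ValueError(f"box {list(box)} does not fit a {width}x{height} image")
    return [x1, y1, x2, y2]


def build_grounding_prompt(question: str, suffix: str = GROUNDING_SUFFIX) -> str:
    """Append a grounding instruction so the reply is asked to include boxes."""
    return f"{question.rstrip()} {suffix}"
```

A box that passes validate_box could then be supplied as the box argument in the usage example, with the image's real width and height read from the file beforehand.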
For optimal performance, consider using cloud GPUs such as those provided by AWS, Azure, or Google Cloud.
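Before renting a cloud instance or starting a long run, it can help to confirm that a GPU is actually visible to the Python environment. The sketch below assumes the model runs on PyTorch, which the documentation does not state; pick_device is a hypothetical helper name, and whether inference_and_run accepts a device setting is not documented here.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so no GPU backend can be queried.
        return "cpu"
    import torch  # imported lazily so the check works without PyTorch

    return "cuda" if torch.cuda.is_available() else "cpu"


print("Selected device:", pick_device())
```

If this prints cpu on a machine that is supposed to have a GPU, the driver or the PyTorch build is the first thing to check.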
License
The documentation does not state the license for Ferret-UI-Llama8b. Verify the licensing terms on the Hugging Face model card page before use.