jadechoghari

Ferret-UI-Llama8b

Introduction

Ferret-UI is the first UI-centric multimodal large language model (MLLM) designed for referring, grounding, and reasoning tasks. It comes in two variants, built on Gemma-2B and Llama-3-8B, and can carry out complex UI tasks. This model is the Llama-3-8B variant, an implementation based on the Ferret-UI paper by Apple.

Architecture

The model is an MLLM aimed at UI interaction tasks. It is built with the Transformers library and is tagged for image-text-to-text processing: it takes an image together with a text prompt and returns text, which covers both text generation and conversational use.

Training

The documentation does not describe the training process. Given the model's purpose, it was presumably trained on large image-and-text datasets with an emphasis on grounding and reasoning, but this is not confirmed.

Guide: Running Locally

To run the model locally, follow these basic steps:

  1. Download Required Files: Use the wget command to download the necessary Python scripts into your working directory:

    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/conversation.py
    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/builder.py
    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/inference.py
    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/model_UI.py
    wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/mm_utils.py
    
  2. Usage Example:

    from inference import inference_and_run

    image_path = "appstore_reminders.png"
    prompt = "Describe the image in detail"

    # Task without a bounding box: the prompt applies to the whole image
    inference_text = inference_and_run(
        image_path=image_path,
        prompt=prompt,
        conv_mode="ferret_llama_3",
        model_path="jadechoghari/Ferret-UI-Llama8b",
    )
    print("Inference Text:", inference_text)

    # Task with a bounding box: the prompt refers to the region given by box
    box = [189, 906, 404, 970]
    inference_text = inference_and_run(
        image_path=image_path,
        prompt=prompt,
        conv_mode="ferret_llama_3",
        model_path="jadechoghari/Ferret-UI-Llama8b",
        box=box,
    )
    print("Inference Text:", inference_text)
    
  3. Grounding Prompts: To pass a region to the model, or to ask it to return bounding boxes for objects in the image, use the predefined grounding prompt templates.
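
The bounding-box call in step 2 can be guarded with a small check before the box reaches the model. The sketch below assumes the box is `[x1, y1, x2, y2]` in pixel coordinates of the input image, with the origin at the top-left corner; the documentation does not state the convention, so treat this as an assumption. The helper name `validate_box` is hypothetical and is not part of the downloaded scripts.

```python
from typing import Sequence, Tuple


def validate_box(box: Sequence[int], image_size: Tuple[int, int]) -> list:
    """Check a [x1, y1, x2, y2] box against an image size of (width, height).

    Hypothetical helper: assumes pixel coordinates with a top-left origin,
    which the model documentation does not confirm.
    """
    if len(box) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(box)}")
    x1, y1, x2, y2 = (int(v) for v in box)
    width, height = image_size
    # The box must have positive area and lie inside the image.
    if not (0 <= x1 < x2 <= width):
        raise ValueError(f"x range {x1}..{x2} is outside 0..{width}")
    if not (0 <= y1 < y2 <= height):
        raise ValueError(f"y range {y1}..{y2} is outside 0..{height}")
    return [x1, y1, x2, y2]
```

If Pillow is installed, the image size can be read with `Image.open(image_path).size`, which returns `(width, height)`. The validated list can then be passed as `box=` to `inference_and_run`.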

A model of this size runs slowly on CPU, so a GPU is recommended. If no suitable local GPU is available, consider a cloud GPU instance from a provider such as AWS, Azure, or Google Cloud.
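
When choosing a GPU, a rough estimate of the memory needed just to hold the weights is a useful starting point. The sketch below assumes about 8 billion parameters for the language model (taken from the "8b" in the model name, ignoring the vision encoder) and counts weights only, so activations, the KV cache, and framework overhead come on top. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed parameter count: ~8e9, inferred from the model name only.
NUM_PARAMS = 8e9

for label, nbytes in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, nbytes):.0f} GB")
```

Under these assumptions, half precision needs about 16 GB for the weights alone, so a GPU with noticeably more memory than that is the safer choice.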

License

The documentation does not state the license for Ferret-UI-Llama8b. Check the licensing terms on the Hugging Face model card before using the model.
