Show U I 2 B

showlab

Introduction

ShowUI is a lightweight 2-billion parameter vision-language-action model designed for graphical user interface (GUI) agents. It integrates visual and textual understanding to perform actions on computer interfaces.

Architecture

ShowUI leverages the Qwen2-VL-2B-Instruct base model, focusing on vision-language integration for interactive tasks. It is designed to interpret and act upon GUI elements using a combination of visual and language inputs, enabling it to perform tasks like clicking, typing, and navigating through interfaces.

Training

The model is trained to understand GUI tasks through datasets such as ShowUI-desktop-8K. It uses a combination of visual and language inputs to generate actions that facilitate navigation and interaction with digital interfaces. The training process includes grounding visual information to coordinates and processing language instructions to perform specific actions.

Guide: Running Locally

Basic Steps

  1. Load Model: Import necessary packages and load the ShowUI model using Hugging Face's transformers library.

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "showlab/ShowUI-2B",
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
    
  2. UI Grounding: Utilize the model to process an image and a text query, generating coordinates for GUI interaction.

    img_url = 'examples/image.png'
    query = "Your query here"
    # Process input and generate output
    
  3. UI Navigation: Define actions for navigation tasks using system prompts and process the output.

    system_prompt = _NAV_SYSTEM.format(_APP='web', _ACTION_SPACE=action_map['web'])
    # Process input and generate output
    

Suggest Cloud GPUs

For optimal performance, especially with large models, consider using cloud-based GPU services such as AWS EC2, Google Cloud Platform, or Azure for efficient computation.

License

ShowUI is released under the MIT License, allowing for broad usage and modification.

More Related APIs