ShowUI-2B
showlab

Introduction
ShowUI is a lightweight 2-billion parameter vision-language-action model designed for graphical user interface (GUI) agents. It integrates visual and textual understanding to perform actions on computer interfaces.
Architecture
ShowUI leverages the Qwen2-VL-2B-Instruct base model, focusing on vision-language integration for interactive tasks. It is designed to interpret and act upon GUI elements using a combination of visual and language inputs, enabling it to perform tasks like clicking, typing, and navigating through interfaces.
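The actions mentioned above (clicking, typing, navigating) can be pictured as structured records. The sketch below is purely illustrative: the field names and action labels are assumptions for exposition, not ShowUI's actual output schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical GUI action record; ShowUI's real action format may differ.
@dataclass
class GUIAction:
    action: str                                      # e.g. "CLICK", "INPUT", "SCROLL"
    value: Optional[str] = None                      # text to type, scroll direction, etc.
    position: Optional[Tuple[float, float]] = None   # normalized (x, y) in [0, 1]

def describe(a: GUIAction) -> str:
    """Render an action as a human-readable string."""
    parts = [a.action]
    if a.position is not None:
        parts.append(f"at ({a.position[0]:.2f}, {a.position[1]:.2f})")
    if a.value is not None:
        parts.append(f"with value {a.value!r}")
    return " ".join(parts)

click = GUIAction("CLICK", position=(0.49, 0.42))
typed = GUIAction("INPUT", value="hello")
```

An agent loop would generate such actions from a screenshot plus an instruction, then dispatch them to the interface.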
Training
The model is trained to understand GUI tasks through datasets such as ShowUI-desktop-8K. It uses a combination of visual and language inputs to generate actions that facilitate navigation and interaction with digital interfaces. The training process includes grounding visual information to coordinates and processing language instructions to perform specific actions.
Guide: Running Locally
Basic Steps
- Load Model: Import the necessary packages and load the ShowUI model using Hugging Face's transformers library.

  ```python
  import torch
  from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

  model = Qwen2VLForConditionalGeneration.from_pretrained(
      "showlab/ShowUI-2B",
      torch_dtype=torch.bfloat16,
      device_map="auto"
  )
  processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
  ```
- UI Grounding: Use the model to process a screenshot and a text query, generating coordinates for GUI interaction.

  ```python
  img_url = 'examples/image.png'
  query = "Your query here"
  # Process input and generate output
  ```
- UI Navigation: Define the action space for navigation tasks using a system prompt, then process the output.

  ```python
  system_prompt = _NAV_SYSTEM.format(_APP='web', _ACTION_SPACE=action_map['web'])
  # Process input and generate output
  ```
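Once the model generates an answer, the grounding result still has to be mapped onto the screenshot. Assuming the output is a normalized `[x, y]` pair in `[0, 1]` relative to the image size (an assumption about the output format; verify against the model card), a small helper can convert it to pixel coordinates:

```python
import ast

def parse_click(output_text: str, width: int, height: int) -> tuple:
    """Parse a model output like "[0.73, 0.21]" (assumed to be normalized
    x, y in [0, 1]) and scale it to pixel coordinates for the screenshot."""
    x, y = ast.literal_eval(output_text.strip())
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"expected normalized coordinates, got {(x, y)}")
    return round(x * width), round(y * height)

# e.g. on a 1920x1080 screenshot:
# parse_click("[0.73, 0.21]", 1920, 1080)
```

The resulting pixel pair can then be passed to whatever automation layer performs the actual click.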
Suggested Cloud GPUs
For optimal performance, consider running inference on cloud GPU services such as AWS EC2, Google Cloud Platform, or Microsoft Azure.
License
ShowUI is released under the MIT License, allowing for broad usage and modification.