UGround-V1-2B

Maintained by osunlp

Introduction

UGround-V1-2B is a GUI visual grounding model built on the Qwen2-VL architecture. Given a screenshot and a natural-language description of a target element, it predicts the element's precise screen coordinates. It is part of a series of models that combine visual and textual understanding to support applications such as mobile and desktop UI navigation.

Architecture

The model is built on the Qwen2-VL framework, which accepts multimodal inputs (text and images). It uses Multimodal Rotary Position Embedding (M-RoPE) and Naive Dynamic Resolution, allowing it to handle images of arbitrary resolution and to jointly process visual and textual information.
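
As a rough illustration of this setup, the sketch below loads the checkpoint with the Qwen2-VL classes from the transformers library. The osunlp/UGround-V1-2B repository ID and the min_pixels/max_pixels bounds are assumptions and should be checked against the model card.

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Hub repository ID (assumed; verify on the osunlp organization page).
MODEL_ID = "osunlp/UGround-V1-2B"

# Load the Qwen2-VL-based checkpoint; device_map="auto" places it on a GPU when available.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)

# Naive Dynamic Resolution: the processor maps each image to a variable number of
# visual tokens; min_pixels/max_pixels bound that token budget (values are illustrative).
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=256 * 28 * 28,
    max_pixels=1344 * 28 * 28,
)
```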

Training

UGround-V1-2B is trained to perform tasks such as GUI grounding and navigation by interpreting visual inputs. Training draws on diverse datasets to improve the model's ability to generalize across different visual contexts.

Guide: Running Locally

  1. Environment Setup: Ensure Python and the required libraries are installed. Run pip install transformers qwen-vl-utils to set up the dependencies.
  2. Model Download: Download UGround-V1-2B from its Hugging Face repository.
  3. Inference Setup: Load the model and processor with the transformers library. A GPU is recommended for efficient inference.
  4. Running Inference: Prepare the inputs, such as a screenshot and an element description, and call the model.generate method to obtain the predicted coordinates, as sketched in the example after this list.
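
The following is a minimal inference sketch that follows the standard Qwen2-VL usage pattern with qwen-vl-utils. The file name screenshot.png and the instruction text are placeholders, and the exact prompt and output conventions should be confirmed against the model card.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "osunlp/UGround-V1-2B"  # assumed Hub repository ID

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A screenshot plus a natural-language description of the target element.
# "screenshot.png" and the instruction below are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": "Click the search button"},
        ],
    }
]

# Standard Qwen2-VL preprocessing: chat template for the text, qwen-vl-utils for the image.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate the grounding prediction and strip the prompt tokens from the output.
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])  # e.g. a coordinate string
```

The decoded string typically contains the predicted location of the described element; consult the model card for the exact coordinate format.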

Cloud GPUs

For large-scale or compute-intensive workloads, consider running the model on cloud GPU services such as AWS, Google Cloud, or Azure.

License

UGround-V1-2B is released under the Apache-2.0 license, which allows for both personal and commercial use while requiring attribution to the original authors.
