UGround-V1-2B
Introduction
UGround-V1-2B is a GUI visual grounding model built on the Qwen2-VL architecture. It maps natural-language descriptions of on-screen elements to precise screen coordinates in a screenshot. It is part of the UGround series of models for grounding language instructions in visual interfaces, supporting applications such as mobile and desktop UI navigation.
Architecture
The model is built on the Qwen2-VL framework, which supports multimodal inputs, including text and images. It employs techniques such as Multimodal Rotary Position Embedding (M-RoPE) and Naive Dynamic Resolution, which allow it to handle images at arbitrary resolutions and to jointly process visual and textual information.
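As a sketch of how dynamic resolution surfaces in practice, the processor's pixel budget can be bounded when loading it with the transformers library. The min_pixels and max_pixels arguments are standard Qwen2-VL processor options; the repository id and the concrete values below are illustrative assumptions, not recommendations from the model authors.

# Bounding Naive Dynamic Resolution via the processor's pixel budget.
# The repo id and the specific values are illustrative assumptions.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "osunlp/UGround-V1-2B",        # assumed Hugging Face repository id
    min_pixels=256 * 28 * 28,      # lower bound on visual tokens per image
    max_pixels=1344 * 28 * 28,     # upper bound; raise for dense, high-resolution UIs
)

Raising the upper bound improves fidelity on crowded screenshots at the cost of more visual tokens and memory.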
Training
UGround-V1-2B is trained to support tasks such as GUI navigation by interpreting visual inputs. Training draws on diverse datasets so the model generalizes across different visual contexts.
Guide: Running Locally
- Environment Setup: Ensure Python and the required libraries are installed. Use pip install transformers qwen-vl-utils to set up the dependencies.
- Model Download: Download the UGround-V1-2B model from the Hugging Face repository.
- Inference Setup: Use the transformers library to load the model and process inputs. GPU support is recommended for efficient processing.
- Running Inference: Prepare input data, such as images or videos, and run inference using the model.generate method to obtain outputs, as shown in the sketch after this list.
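The following Python sketch ties these steps together. It follows the standard Qwen2-VL inference recipe; the repository id, the local file screenshot.png, and the instruction text are illustrative placeholders, and the exact grounding prompt format should be taken from the model card rather than from this example.

# Minimal local-inference sketch with transformers and qwen-vl-utils.
# Screenshot path, instruction text, and prompt format are placeholders.
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "osunlp/UGround-V1-2B"  # assumed Hugging Face repository id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"  # GPU recommended
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},                    # UI screenshot to ground against
        {"type": "text", "text": "The search button in the top bar."},   # element description (example)
    ],
}]

# Build model inputs: chat template for the text, qwen_vl_utils for the image tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate the prediction as text and decode only the newly generated tokens.
output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])

The decoded text encodes the predicted screen coordinates; how they are formatted and scaled is defined by the model card's prompt convention.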
Cloud GPUs
For large-scale or compute-intensive workloads, cloud GPU services such as AWS, Google Cloud, or Azure are advisable for running the model.
License
UGround-V1-2B is released under the Apache-2.0 license, which permits personal and commercial use, modification, and distribution, provided the license and copyright notices are retained.