PTA-1
AskUI
Introduction
PTA-1 is a vision-language model designed for computer and phone automation. It is based on Florence-2 and has 270 million parameters, which keeps GUI text and element localization efficient. Because the model is small enough to run locally, it enables low-latency computer automation.
Architecture
PTA-1 uses a vision-language approach to interpret screenshots and locate elements in graphical user interfaces. Given a screenshot and a text description of the target element, it outputs a bounding box for that element. The model is built on microsoft/Florence-2-base and is tuned to perform well despite its relatively small size.
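To make this input/output contract concrete, the following minimal sketch turns a detection result into a point that an automation script could click. It assumes the parsed output follows the Florence-2 convention of a dictionary keyed by the task prompt, containing a `bboxes` list of `[x1, y1, x2, y2]` pixel coordinates. That layout, the sample values, and the helper name `first_box_center` are assumptions for illustration.

```python
from typing import Optional, Tuple

TASK = "<OPEN_VOCABULARY_DETECTION>"


def first_box_center(parsed_answer: dict, task: str = TASK) -> Optional[Tuple[float, float]]:
    """Return the (x, y) center of the first bounding box, or None if no box was found."""
    # Assumed layout: {task: {"bboxes": [[x1, y1, x2, y2], ...], ...}}
    boxes = parsed_answer.get(task, {}).get("bboxes", [])
    if not boxes:
        return None
    x1, y1, x2, y2 = boxes[0]
    return ((x1 + x2) / 2, (y1 + y2) / 2)


# Hypothetical parsed result, not real model output.
example = {TASK: {"bboxes": [[100.0, 200.0, 300.0, 260.0]], "bboxes_labels": ["submit button"]}}
print(first_box_center(example))  # (200.0, 230.0)
```

Returning `None` when no box is present lets the caller decide whether to retry with a different description instead of clicking an arbitrary location.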
Training
PTA-1 was evaluated on several datasets and achieved competitive results compared with larger models. These high benchmark scores are noted to be influenced by data biases, however, so fine-tuning on data that matches the target distribution may be needed.
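Since benchmark scores may not carry over to a different data distribution, it can help to measure localization quality on your own labeled screenshots before deciding whether to fine-tune. The sketch below scores predictions with intersection over union (IoU). IoU is a common detection metric chosen here for illustration and is not necessarily the metric used in PTA-1's evaluation; the function names and the 0.5 threshold are likewise assumptions.

```python
from typing import Sequence

Box = Sequence[float]  # [x1, y1, x2, y2] in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5) -> float:
    """Fraction of targets whose paired prediction reaches the IoU threshold."""
    if not targets:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)
```

Running `accuracy_at_iou` over a small labeled sample from the target application gives a rough indication of whether the off-the-shelf model is adequate or fine-tuning is worth the effort.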
Guide: Running Locally
To run PTA-1 locally, follow these steps:
- Requirements: Ensure you have torch, timm, einops, Pillow, and transformers installed.
- Setup: Use the following code snippet to load and run the model:
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    # Use the GPU with half precision when available; otherwise fall back to CPU.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    model = AutoModelForCausalLM.from_pretrained("AskUI/PTA-1", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
    processor = AutoProcessor.from_pretrained("AskUI/PTA-1", trust_remote_code=True)

    # The prompt is the task token followed by a description of the target element.
    task_prompt = "<OPEN_VOCABULARY_DETECTION>"
    prompt = task_prompt + "description of the target element"

    image = Image.open("path to screenshot").convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

    # Deterministic beam search (no sampling).
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Convert the generated text into bounding boxes scaled to the image size.
    parsed_answer = processor.post_process_generation(
        generated_text,
        task="<OPEN_VOCABULARY_DETECTION>",
        image_size=(image.width, image.height),
    )
    print(parsed_answer)
- Cloud GPUs: For enhanced performance, consider using cloud-based GPUs such as those offered by AWS, Google Cloud, or Azure.
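Before loading the model, it can be useful to confirm that the required packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical. It checks import names, so Pillow is looked up as `PIL`.

```python
from importlib.util import find_spec

# Import names for the required packages; Pillow is imported as "PIL".
REQUIRED = ["torch", "timm", "einops", "PIL", "transformers"]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

`find_spec` only locates a module without importing it, so the check stays fast even for large packages such as torch.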
License
PTA-1 is distributed under the MIT License, allowing for wide usage and modification with minimal restrictions.