Introduction

PTA-1 is a vision-language model for computer and phone automation. Based on Florence-2, it has only 270 million parameters yet localizes GUI text and elements efficiently; its small size lets it run locally, enabling low-latency automation.

Architecture

PTA-1 uses a vision-language approach to interpret screenshots and identify elements in graphical user interfaces. It takes a screenshot and a description of the target element as input and outputs a bounding box for that element. The model is built on microsoft/Florence-2-base and performs strongly despite its relatively small size.
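
As an illustration of this input/output contract, the prompt pairs a task token with a free-form element description, and the post-processed output contains pixel-space bounding boxes. The schema shown follows Florence-2's open-vocabulary detection format (abbreviated here); the description and coordinates are made-up examples:

    # Input: task token + description of the element to find.
    prompt = "<OPEN_VOCABULARY_DETECTION>" + "blue 'Submit' button"

    # Post-processed output (abbreviated): one bounding box per match, in pixels.
    # {'<OPEN_VOCABULARY_DETECTION>': {'bboxes': [[612.3, 411.9, 716.8, 454.2]],
    #                                  'bboxes_labels': ["blue 'Submit' button"]}}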

Training

PTA-1 was evaluated on a range of datasets and achieves results competitive with much larger models. Its high benchmark scores are partly driven by biases in the underlying data, so fine-tuning on data that matches your target distribution may be needed for reliable results.
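
Since matching your own data distribution may require fine-tuning, here is a minimal single-example sketch. This is not the authors' training recipe: the file name, description, and coordinates are placeholders, and it assumes the Florence-2 convention of encoding target boxes as <loc_*> tokens quantized to a 0-999 grid, with the model returning a loss when labels are supplied:

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    model = AutoModelForCausalLM.from_pretrained("AskUI/PTA-1", trust_remote_code=True).to(device)
    processor = AutoProcessor.from_pretrained("AskUI/PTA-1", trust_remote_code=True)

    # One (screenshot, description, box) example; the path, description, and
    # coordinates below are placeholders for your own annotations.
    image = Image.open("example_screenshot.png").convert("RGB")
    prompt = "<OPEN_VOCABULARY_DETECTION>" + "blue 'Submit' button"
    # Assumed Florence-2-style target: the description followed by the box
    # encoded as <loc_*> tokens on a 0-999 grid (x1, y1, x2, y2).
    target = "blue 'Submit' button<loc_382><loc_571><loc_448><loc_630>"

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    labels = processor.tokenizer(target, return_tensors="pt").input_ids.to(device)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
    model.train()
    outputs = model(input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"], labels=labels)
    outputs.loss.backward()  # one step shown; loop over a dataset in practice
    optimizer.step()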

Guide: Running Locally

To run PTA-1 locally, follow these steps:

  1. Requirements: Ensure you have torch, timm, einops, Pillow, and transformers installed, for example via pip:
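
    pip install torch timm einops pillow transformers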

  2. Setup: Use the following code snippet to load and run the model (a sketch for clicking the detected box follows this list):

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM
    
    # Use GPU with half precision when available; fall back to CPU with float32.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    
    # The model ships custom code, so trust_remote_code=True is required.
    model = AutoModelForCausalLM.from_pretrained("AskUI/PTA-1", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
    processor = AutoProcessor.from_pretrained("AskUI/PTA-1", trust_remote_code=True)
    
    # The task token selects open-vocabulary detection; append a natural-language
    # description of the element to locate, e.g. "blue 'Submit' button".
    task_prompt = "<OPEN_VOCABULARY_DETECTION>"
    prompt = task_prompt + "description of the target element"
    
    image = Image.open("path/to/screenshot.png").convert("RGB")
    
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
    
    # Deterministic beam-search decoding; sampling is not needed for localization.
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    # Keep special tokens: the location tokens are needed for post-processing.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    
    # Convert the raw token output into pixel-space bounding boxes.
    parsed_answer = processor.post_process_generation(generated_text, task="<OPEN_VOCABULARY_DETECTION>", image_size=(image.width, image.height))
    
    print(parsed_answer)
    
  3. Cloud GPUs: For enhanced performance, consider using cloud-based GPUs such as those offered by AWS, Google Cloud, or Azure.
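
After the detection step above, an automation script typically clicks the located element. As a minimal, hypothetical sketch (it assumes the parsed answer follows the Florence-2 open-vocabulary detection schema with a "bboxes" list of [x1, y1, x2, y2] pixel coordinates; pyautogui is one option for issuing the click):

    import pyautogui  # pip install pyautogui
    
    # Take the first detected box and click its center.
    x1, y1, x2, y2 = parsed_answer["<OPEN_VOCABULARY_DETECTION>"]["bboxes"][0]
    center_x, center_y = (x1 + x2) / 2, (y1 + y2) / 2
    # Assumes the screenshot matches the live screen resolution and scaling.
    pyautogui.click(center_x, center_y)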

License

PTA-1 is distributed under the MIT License, allowing for wide usage and modification with minimal restrictions.
