OS-Atlas-Base-7B

OS-Copilot

Introduction

OS-Atlas is a series of models designed specifically for GUI agents, focusing on tasks that require interaction with graphical user interfaces. It provides models for both GUI grounding (locating elements in a screenshot) and single-step action generation in GUI agent tasks.

Architecture

The OS-Atlas-Base-7B model is fine-tuned from Qwen2-VL-7B-Instruct and is part of a suite of models, including OS-Atlas-Pro variants, tailored for GUI tasks. These models accept images of arbitrary resolution, and output coordinates are normalized to the 0-1000 range.
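Because coordinates are reported in the normalized 0-1000 range, they must be rescaled to the screenshot's actual dimensions before use. A minimal sketch (the helper name and sample box are illustrative, not part of the released API):

```python
def denormalize_bbox(bbox, width, height):
    """Map a bounding box from the model's 0-1000 normalized range
    to pixel coordinates for a screenshot of the given size."""
    x1, y1, x2, y2 = bbox
    return (
        round(x1 / 1000 * width),
        round(y1 / 1000 * height),
        round(x2 / 1000 * width),
        round(y2 / 1000 * height),
    )

# Example: a normalized box on a 1920x1080 screenshot
print(denormalize_bbox((120, 250, 480, 310), 1920, 1080))  # → (230, 270, 922, 335)
```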

Training

The OS-Atlas models are fine-tuned from the Qwen2-VL-7B-Instruct base model. They interpret GUI screenshots and generate actions grounded in them, making them suitable for applications involving GUI agent tasks.

Guide: Running Locally

To run the OS-Atlas-Base-7B model locally, follow these steps:

  1. Install Dependencies:

    pip install transformers
    pip install qwen-vl-utils
    
  2. Download Example Image: Save an example image to your current directory for testing.

  3. Inference Code: Use the following code template for inference.

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info
    
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "OS-Copilot/OS-Atlas-Base-7B", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Atlas-Base-7B")
    
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "./web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
                },
                {"type": "text", "text": "In this UI screenshot, what is the position of the element corresponding to the command \"switch language of current page\" (with bbox)?"},
            ],
        }
    ]
    
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
    inputs = inputs.to("cuda")
    
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    # Trim the prompt tokens so only the newly generated response is decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
    )
    print(output_text)
    
  4. Suggested Cloud GPUs: Utilize cloud GPU services like AWS, Google Cloud, or Azure for enhanced performance and resource availability.
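Since the decode above keeps special tokens, the grounding response arrives as raw text with the bounding box embedded in it. A small parsing sketch follows; the `(x1,y1),(x2,y2)` pair format and the surrounding special tokens in the sample string are assumptions about the output layout, so verify against your actual model output:

```python
import re

def parse_bboxes(output_text):
    """Extract (x1, y1, x2, y2) tuples from decoded model output.

    Assumes boxes appear as "(x1,y1),(x2,y2)" pairs in the 0-1000
    normalized range; adapt the pattern if the output format differs.
    """
    pattern = r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)"
    return [tuple(map(int, m)) for m in re.findall(pattern, output_text)]

# Hypothetical decoded output for the "switch language" query above
sample = "<|box_start|>(576,12),(694,43)<|box_end|>"
print(parse_bboxes(sample))  # → [(576, 12, 694, 43)]
```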

License

The OS-Atlas-Base-7B model is licensed under the Apache 2.0 License, allowing for both personal and commercial use.
