OS-Genesis-7B-AC

OS-Copilot

Introduction

OS-Genesis is a system that automates the creation of GUI agent trajectory data through an approach called reverse task synthesis. The synthesized data is used to train GUI agents that perform effectively on dynamic benchmarks such as AndroidWorld and WebArena, without requiring human supervision.
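At a high level, reverse task synthesis works backwards: an agent first explores a GUI and records interactions, and task instructions are then inferred from those recorded steps. The sketch below is purely illustrative; the names (`Interaction`, `Trajectory`, `synthesize_task`) are hypothetical and not taken from the OS-Genesis codebase, and the synthesis function is a stand-in for the actual model call:

```python
from dataclasses import dataclass

# Hypothetical data structures sketching the reverse-task-synthesis flow:
# exploration yields (screenshot, action, screenshot) records, and
# instructions are derived from them afterwards, not written up front.

@dataclass
class Interaction:
    screenshot_before: str  # path to the pre-action screenshot
    action: str             # e.g. "click(x=120, y=340)"
    screenshot_after: str   # path to the post-action screenshot

@dataclass
class Trajectory:
    low_level_instruction: str  # step-level description, e.g. "Tap the search bar"
    high_level_task: str        # goal-level description, e.g. "Search for hotels"
    steps: list

def synthesize_task(interactions: list) -> Trajectory:
    """Stand-in for the model call that infers instructions from interactions."""
    low = f"Perform {len(interactions)} recorded GUI step(s)"
    high = "Complete the goal implied by the recorded steps"
    return Trajectory(low_level_instruction=low, high_level_task=high, steps=interactions)

traj = synthesize_task([
    Interaction("before.png", "click(x=120, y=340)", "after.png"),
])
print(traj.high_level_task)
```

The key point the sketch captures is the data flow: instructions are an output derived from observed interactions, rather than an input authored by humans.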

Architecture

OS-Genesis-7B-AC is built on the Qwen2-VL-7B-Instruct model. It belongs to the OS-Genesis AC family of models, which includes OS-Genesis-4B-AC, OS-Genesis-7B-AC, and OS-Genesis-8B-AC. These models are fine-tuned on datasets designed for mobile action control, enabling text- and image-based inference.

Training

The training data for the OS-Genesis models is sourced from datasets tailored to mobile action tasks. The models are evaluated on the AndroidControl benchmark to assess how well they interpret and generate GUI interactions. The OS-Genesis-7B-AC model, in particular, is fine-tuned from Qwen2-VL-7B-Instruct to strengthen its handling of GUI tasks.

Guide: Running Locally

To run OS-Genesis locally, follow these steps:

  1. Install Dependencies: Ensure that the necessary Python packages are installed:

    pip install transformers
    pip install qwen-vl-utils
    
  2. Load Model: Use the transformers library to load the model and processor:

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info
    
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "OS-Copilot/OS-Genesis-7B-AC", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Genesis-7B-AC")
    
  3. Prepare Input Data: Process your image and text inputs as required:

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "path/to/image.png"},
                {"type": "text", "text": "Your text instructions here."},
            ],
        }
    ]
    
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to("cuda")
    
  4. Inference: Generate output using the model:

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    # Trim the prompt tokens so only the newly generated text is decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
    print(output_text)
    

To run these models efficiently, especially for extensive inference tasks, consider using cloud GPUs like those offered by AWS, Google Cloud, or Azure.
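When no CUDA device is available (for example, when prototyping on a CPU-only machine), the hard-coded `.to("cuda")` call in step 3 will fail. A minimal sketch of a safer pattern, assuming PyTorch is installed (it is required by the model classes above), is to pick the device dynamically and pass it to both `from_pretrained` and the input tensors:

```python
# Select a device string for model/input placement; falls back to CPU
# when no GPU (or no torch installation) is present.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"

print(f"Using device: {device}")
# Then use .to(device) instead of the hard-coded .to("cuda"), e.g.:
# inputs = processor(...).to(device)
```

Note that 7B-parameter inference on CPU is very slow; this fallback is mainly useful for smoke-testing the pipeline before moving to a GPU.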

License

The OS-Genesis models and related resources are licensed under the Apache 2.0 License, allowing for wide usage and modification under specified terms.
