OS-Genesis-7B-AC
Introduction
OS-Genesis is a system that automates the construction of GUI agent trajectory data through an approach called reverse task synthesis: rather than executing pre-defined tasks, agents first explore interactive environments, and high-quality tasks are then derived retroactively from the observed interactions. The resulting trajectories are used to train GUI agents that perform effectively on dynamic benchmarks such as AndroidWorld and WebArena, without requiring human supervision.
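To make the idea concrete, the toy sketch below illustrates the reverse-task-synthesis loop in Python: explore a GUI first, describe the observed interactions as low-level instructions, then retroactively derive a high-level task from them. Every name and string here is an illustrative placeholder; this is a conceptual sketch, not the OS-Genesis pipeline itself.

```python
# Conceptual toy of reverse task synthesis (illustrative placeholders only;
# this is not the OS-Genesis codebase).
import random

def explore(ui_elements, num_steps):
    """Interaction-first exploration: act on the GUI without a task in mind."""
    return [random.choice(ui_elements) for _ in range(num_steps)]

def annotate_low_level(element):
    """Describe one observed interaction as a low-level instruction."""
    return f"tap '{element}'"

def derive_high_level_task(low_level_steps):
    """Retroactively synthesize a plausible high-level task from the steps."""
    return "Task: " + ", then ".join(low_level_steps)

ui = ["Settings", "Wi-Fi", "Network details"]
steps = [annotate_low_level(e) for e in explore(ui, num_steps=3)]
print(derive_high_level_task(steps))  # e.g. "Task: tap 'Wi-Fi', then ..."
```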
Architecture
OS-Genesis-7B-AC is built on the Qwen2-VL-7B-Instruct model. It is part of the OS-Genesis AC family of models, which includes OS-Genesis-4B-AC, OS-Genesis-7B-AC, and OS-Genesis-8B-AC. These models are fine-tuned on datasets designed for mobile action control, enabling text- and image-based inference.
Training
The training data for the OS-Genesis models is sourced from datasets tailored to mobile action tasks. The models are evaluated on the AndroidControl benchmark to assess how well they interpret and generate GUI interactions. OS-Genesis-7B-AC, in particular, is fine-tuned from Qwen2-VL-7B-Instruct to strengthen its handling of GUI tasks.
Guide: Running Locally
To run OS-Genesis locally, follow these steps:
- Install Dependencies: Ensure that the necessary Python packages are installed:

  ```bash
  pip install transformers
  pip install qwen-vl-utils
  ```
- Load Model: Use the `transformers` library to load the model and processor:

  ```python
  from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
  from qwen_vl_utils import process_vision_info

  model = Qwen2VLForConditionalGeneration.from_pretrained(
      "OS-Copilot/OS-Genesis-7B-AC", torch_dtype="auto", device_map="auto"
  )
  # Load the processor from the same checkpoint as the model.
  processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Genesis-7B-AC")
  ```
- Prepare Input Data: Process your image and text inputs as required:

  ```python
  messages = [
      {
          "role": "user",
          "content": [
              {"type": "image", "image": "path/to/image.png"},
              {"type": "text", "text": "Your text instructions here."},
          ],
      }
  ]
  text = processor.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
  inputs = processor(
      text=[text],
      images=image_inputs,
      videos=video_inputs,
      padding=True,
      return_tensors="pt",
  ).to("cuda")
  ```
- Inference: Generate output using the model:

  ```python
  generated_ids = model.generate(**inputs, max_new_tokens=128)
  # Trim the prompt tokens so only the newly generated text is decoded.
  generated_ids_trimmed = [
      out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
  ]
  output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
  print(output_text)
  ```
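For repeated queries, the load/prepare/generate steps above can be wrapped in a small helper. The sketch below is a convenience illustration, not part of the model's documented API: it assumes `model`, `processor`, and `process_vision_info` from the previous snippets are already in scope, and the function name `run_inference` is hypothetical.

```python
# Convenience wrapper around the steps above (illustrative; assumes `model`,
# `processor`, and `process_vision_info` from the previous snippets are in scope).
def run_inference(image_path: str, instruction: str, max_new_tokens: int = 128) -> str:
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Example call (hypothetical screenshot path and instruction):
# print(run_inference("screenshot.png", "Open the Settings app."))
```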
To run these models efficiently, especially for extensive inference tasks, consider using cloud GPUs like those offered by AWS, Google Cloud, or Azure.
License
The OS-Genesis models and related resources are licensed under the Apache 2.0 License, allowing for wide usage and modification under specified terms.