Computer Use Agent

Qwen3-VL operates as a visual agent for desktop and web environments, enabling automated computer control through element recognition, localization, and reasoning about UI components and their functions.

Capability Overview

The computer use agent feature enables you to:

Recognize desktop UI elements and controls
Understand web page structure and components
Locate interactive elements on screen
Automate computer control
Perform web automation tasks
Test and interact with GUIs
Invoke tools and complete complex tasks
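
Locating an element is only useful to an automation layer once the predicted position is expressed in real screen pixels. The sketch below assumes the model reports points on a normalized 0-1000 grid, which you should confirm for your model version and prompt; the helper name to_pixels and the grid parameter are hypothetical and not part of any library.

```python
def to_pixels(point, screen_size, grid=1000):
    """Map a point from an assumed grid x grid normalized space to screen pixels."""
    x, y = point
    width, height = screen_size
    # Scale each axis independently, then clamp so the result is always
    # a valid pixel index even if the prediction touches the grid edge.
    px = min(max(round(x / grid * width), 0), width - 1)
    py = min(max(round(y / grid * height), 0), height - 1)
    return px, py


# Example: the centre of the normalized grid on a 1920x1080 screenshot.
center = to_pixels((500, 500), (1920, 1080))
```

The clamping step is a design choice: it keeps a click inside the screen rather than raising an error when the model points at the very edge.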

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/desktop_screenshot.jpg",
            },
            {"type": "text", "text": "Identify the UI elements and describe how to perform a specific task on this screen."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
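# Move the input tensors to the same device as the model.
inputs = inputs.to(model.device)

# Generate a reply, then drop the prompt tokens so only new tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# output_text is a list with one decoded string per input in the batch.
print(output_text)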
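
The example above returns a free-text description. To drive automation, you could instead prompt the model for a machine-readable action and parse the reply before handing it to your own control layer. The sketch below assumes a hypothetical JSON reply such as {"action": "click", "coordinate": [512, 300]}; that format and the helper name parse_action are illustrations only, not something the model or the library guarantees.

```python
import json
import re


def parse_action(reply):
    """Return the action dict embedded in a model reply, or None if absent."""
    # Take the text between the first '{' and the last '}' so that any
    # surrounding prose in the reply is ignored.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        action = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Require the (assumed) "action" key so unrelated JSON is rejected.
    if not isinstance(action, dict) or "action" not in action:
        return None
    return action


# Example with a hand-written reply in the assumed format.
sample = parse_action('Sure. {"action": "click", "coordinate": [512, 300]}')
```

Returning None instead of raising lets the calling loop retry or ask the model again when a reply does not contain a usable action.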

Try it Yourself

Explore the full computer use agent cookbook with interactive examples:

View on GitHub
