Qwen3-VL operates as a visual agent for desktop and web environments, enabling automated computer control through element recognition, localization, and reasoning about UI components and their functions.

Capability Overview

The computer use agent feature enables you to:
  • Recognize desktop UI elements and controls
  • Understand web page structure and components
  • Locate interactive elements on screen
  • Automate computer control
  • Perform web automation tasks
  • Test and interact with GUIs
  • Invoke tools and complete complex tasks
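
To connect the capabilities above to an automation loop, the model's text reply has to be turned into a structured action. The sketch below assumes the model was prompted to answer with a single JSON object such as {"action": "click", "coordinate": [x, y]}. That schema, the `UIAction` class, and the `parse_action` helper are hypothetical names used for illustration and are not part of Qwen3-VL or transformers.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class UIAction:
    """Hypothetical container for one GUI action proposed by the model."""
    kind: str
    coordinate: Optional[Tuple[int, int]] = None
    text: Optional[str] = None


def parse_action(model_output: str) -> UIAction:
    """Extract a JSON action object from free-form model text.

    Assumes the reply contains exactly one JSON object with an "action" key
    and optional "coordinate" and "text" keys (an illustrative schema).
    """
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", model_output, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    payload = json.loads(match.group(0))
    coord = payload.get("coordinate")
    return UIAction(
        kind=payload["action"],
        coordinate=tuple(coord) if coord is not None else None,
        text=payload.get("text"),
    )


reply = 'I will press the Save button. {"action": "click", "coordinate": [120, 340]}'
action = parse_action(reply)
print(action.kind, action.coordinate)
```

Validating the parsed action before executing it (for example, rejecting unknown action kinds) is a sensible safeguard in any real automation loop.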

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/desktop_screenshot.jpg",
            },
            {"type": "text", "text": "Identify the UI elements and describe how to perform a specific task on this screen."},
        ],
    }
]

# Build model inputs (token IDs and image tensors) from the chat messages.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate a reply, then strip the prompt tokens so only new tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# output_text is a list with one decoded string per input in the batch.
print(output_text)
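
When the prompt asks for an element's location rather than a description, the reply usually has to be mapped back to screen pixels before a click can be issued. The following is a minimal sketch that assumes the model reports points on a normalized 0 to 1000 grid. That convention is an assumption here and should be checked against the cookbook. `to_pixels` is a hypothetical helper, not a library function.

```python
from typing import Tuple


def to_pixels(
    point: Tuple[float, float],
    image_size: Tuple[int, int],
    scale: int = 1000,
) -> Tuple[int, int]:
    """Map a point on an assumed 0..scale grid to pixel coordinates.

    image_size is (width, height) of the screenshot given to the model.
    The result is clamped so it always falls inside the image.
    """
    x, y = point
    width, height = image_size
    px = min(max(round(x / scale * width), 0), width - 1)
    py = min(max(round(y / scale * height), 0), height - 1)
    return px, py


# A point in the horizontal middle, a quarter of the way down a 1920x1080 screen.
print(to_pixels((500, 250), (1920, 1080)))  # (960, 270)
```

If the model instead returns absolute pixel coordinates, this conversion should be skipped. If the screenshot was resized before inference, `image_size` should be the size of the original screen so the click lands in the right place.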

Try it Yourself

Explore the full computer use agent cookbook with interactive examples, available in Colab and on GitHub.