Mobile Agent - Qwen3-VL

Qwen3-VL can act as a visual agent for mobile devices: it recognizes UI elements, understands their functions, and supports automated phone control by combining element localization with reasoning.

Capability Overview

The mobile agent feature enables you to:

Recognize mobile UI elements and controls
Understand element functions and purposes
Locate interactive components on screen
Automate mobile phone control
Perform UI testing and interaction
Build mobile automation workflows
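
Locating a component on screen eventually comes down to turning a predicted bounding box into a tap position in device pixels. The sketch below assumes boxes arrive as `[x1, y1, x2, y2]` on a 0-1000 relative grid. That convention, the function name `box_to_tap_point`, and the sample screen size are assumptions made for illustration, so check the cookbook for the actual output format before relying on it.

```python
def box_to_tap_point(box, screen_width, screen_height, grid=1000):
    """Map a relative [x1, y1, x2, y2] box to the pixel at its center.

    Assumption: coordinates are expressed on a 0..grid scale, not in pixels.
    """
    x1, y1, x2, y2 = box
    if not (0 <= x1 <= x2 <= grid and 0 <= y1 <= y2 <= grid):
        raise ValueError(f"box outside the 0..{grid} grid: {box}")
    # Integer arithmetic avoids floating-point rounding surprises.
    px = int((x1 + x2) * screen_width // (2 * grid))
    py = int((y1 + y2) * screen_height // (2 * grid))
    # Clamp so a box touching the right or bottom edge maps to a valid pixel.
    return min(px, screen_width - 1), min(py, screen_height - 1)


# Hypothetical box on a 1080x2400 screenshot.
print(box_to_tap_point([400, 900, 600, 950], 1080, 2400))  # (540, 2220)
```

Tapping the center of the box, not a corner, leaves the most margin for small localization errors.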

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

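# Load the model and its processor. device_map="auto" spreads the weights
# across the available devices (this requires the accelerate package).
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")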

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/mobile_screenshot.jpg",
            },
            {"type": "text", "text": "Identify all clickable elements and their functions on this mobile screen."},
        ],
    }
]

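# Render the chat template and tokenize; the processor also loads the
# image referenced in the message.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)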

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# batch_decode returns a list with one string per input in the batch.
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
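
The example above only prints the model's reply. To turn a reply into device control, a driver has to parse it and issue an input event. The following is a minimal sketch, assuming the prompt asked the model to answer with a JSON object such as `{"action": "tap", "x": 540, "y": 2220}` in pixel coordinates. That reply schema and the helper names `extract_action` and `to_adb_command` are hypothetical and not part of Qwen3-VL; `adb shell input tap` is the standard Android Debug Bridge command for injecting a tap.

```python
import json


def extract_action(reply: str) -> dict:
    """Parse the first JSON object embedded in the model's text reply.

    Assumption: the reply contains a JSON object (hypothetical schema).
    """
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode stops at the end of the object and ignores trailing text.
    action, _ = json.JSONDecoder().raw_decode(reply[start:])
    return action


def to_adb_command(action: dict) -> list[str]:
    """Build the adb argument list for a tap action."""
    if action.get("action") != "tap":
        raise ValueError(f"unsupported action: {action.get('action')!r}")
    x, y = int(action["x"]), int(action["y"])
    return ["adb", "shell", "input", "tap", str(x), str(y)]


# Hypothetical reply text; a real reply depends on the prompt you use.
reply = 'Tapping the Settings icon.\n{"action": "tap", "x": 540, "y": 2220}'
command = to_adb_command(extract_action(reply))
print(" ".join(command))  # adb shell input tap 540 2220
# With a device connected: subprocess.run(command, check=True)
```

Returning an argument list instead of a shell string keeps model-produced values out of shell parsing, and casting the coordinates to `int` rejects anything that is not a number before it reaches the device.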

Try it Yourself

Explore the full mobile agent cookbook with interactive examples:

View on GitHub