Spatial Understanding

Qwen3-VL provides advanced spatial perception capabilities, enabling the model to see, understand, and reason about spatial information. This includes judging object positions, viewpoints, occlusions, and spatial relationships.

Capability Overview

With spatial understanding, the model can:

Judge object positions and locations
Understand viewpoints and perspectives
Detect and reason about occlusions
Analyze spatial relationships between objects
Estimate depth and distance
Support embodied AI and robotics applications
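
Each capability is typically exercised by changing only the text prompt while the image and message structure stay the same. The following is a minimal sketch of that idea. The helper `build_spatial_messages`, the `SPATIAL_PROMPTS` table, and the prompt wording are all hypothetical names and values chosen for illustration; they are not part of Qwen3-VL or the Transformers library. Only the message layout mirrors the one used in the usage example on this page.

```python
from typing import Any

# Illustrative prompts, one per capability. The wording is an assumption
# made for this sketch, not taken from the Qwen3-VL documentation.
SPATIAL_PROMPTS: dict[str, str] = {
    "position": "Where is each object located in the image?",
    "viewpoint": "From what viewpoint was this image taken?",
    "occlusion": "Which objects are partially hidden, and by what?",
    "relationship": "Describe the spatial relationships between objects in this scene.",
    "depth": "Which object is closest to the camera, and which is farthest?",
}


def build_spatial_messages(image_path: str, task: str) -> list[dict[str, Any]]:
    """Build a single-turn chat message pairing an image with a spatial prompt.

    Hypothetical helper: it only assembles plain dictionaries and does not
    load or call the model.
    """
    if task not in SPATIAL_PROMPTS:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(SPATIAL_PROMPTS)}"
        )
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": SPATIAL_PROMPTS[task]},
            ],
        }
    ]
```

The returned list can be passed as `messages` to `processor.apply_chat_template`, so switching from, say, occlusion reasoning to depth comparison only requires a different `task` key.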

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and its processor; dtype and device placement are chosen automatically
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# A single user turn that pairs an image with a spatial-reasoning prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/scene.jpg",
            },
            {"type": "text", "text": "Describe the spatial relationships between objects in this scene."},
        ],
    }
]

# Apply the chat template and tokenize the result into model-ready tensors
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate a response; the output contains the prompt tokens followed by the new tokens
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# output_text is a list with one decoded string per input sequence
print(output_text)
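
The slicing step in the example above removes the prompt from each generated sequence before decoding. The sketch below isolates that step using plain Python lists in place of tensors. The function name `trim_prompt_tokens` and the token IDs are made up for illustration, and the sketch assumes, as the slicing in the example implies, that each generated sequence starts with its full prompt.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Return only the newly generated tokens for each sequence.

    Each generated sequence is assumed to begin with its prompt, so the
    first len(prompt) tokens are sliced off.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs standing in for real tensors (values are arbitrary).
prompt = [[101, 7, 8], [101, 9, 10]]
generated = [[101, 7, 8, 42, 43], [101, 9, 10, 44]]

new_tokens = trim_prompt_tokens(prompt, generated)
print(new_tokens)  # [[42, 43], [44]]
```

Without this step, the decoded text would repeat the chat-template prompt ahead of the model's answer.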

Try it Yourself

Explore the full spatial understanding cookbook with interactive examples:

View on GitHub
