Qwen3-VL provides advanced spatial perception capabilities, enabling the model to see, understand, and reason about spatial information. This includes judging object positions, viewpoints, occlusions, and spatial relationships.

Capability Overview

With spatial understanding, the model can:
  • Judge object positions and locations
  • Understand viewpoints and perspectives
  • Detect and reason about occlusions
  • Analyze spatial relationships between objects
  • Estimate depth and distance
  • Support embodied AI and robotics applications

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and its processor; device_map="auto" spreads the weights across available devices
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# A single user turn that pairs one image with one text question
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/scene.jpg",
            },
            {"type": "text", "text": "Describe the spatial relationships between objects in this scene."},
        ],
    }
]

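# Apply the chat template and tokenize the text and image into model inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
# Move the inputs to the same device as the model
inputs = inputs.to(model.device)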

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens from each sequence so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
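
The message structure in the example above can be reused for the other capabilities by changing only the text prompt. The following is a minimal sketch of that idea. The helper `build_spatial_messages`, the `SPATIAL_PROMPTS` dictionary, and the prompt wording are illustrative assumptions, not part of the Qwen3-VL API.

```python
from typing import Any


def build_spatial_messages(image: str, question: str) -> list[dict[str, Any]]:
    """Build a single-turn chat pairing one image with one question.

    Hypothetical helper: it only mirrors the message layout used in the
    example above and does not call the model.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ],
        }
    ]


# Illustrative prompts, one per capability; the wording is an assumption.
SPATIAL_PROMPTS = {
    "position": "Where is each object located in this scene?",
    "viewpoint": "From what viewpoint was this image taken?",
    "occlusion": "Which objects are partially hidden, and by what?",
    "depth": "Which object is closest to the camera, and which is farthest?",
}

# One message list per capability, all pointing at the same image.
batch = {
    name: build_spatial_messages("path/to/scene.jpg", prompt)
    for name, prompt in SPATIAL_PROMPTS.items()
}
```

Each value in `batch` has the same shape as `messages` above, so it can be passed to `processor.apply_chat_template` in the same way.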

Try it Yourself

Explore the full spatial understanding cookbook with interactive examples: open it in Colab or view it on GitHub.