2D Object Grounding

Qwen3-VL provides precise 2D object grounding capabilities using relative position coordinates. The model supports both bounding boxes and points, enabling diverse combinations of positioning and labeling tasks.

Capability Overview

The 2D grounding feature enables you to:

Locate objects using bounding boxes
Pinpoint specific locations with points
Use relative position coordinates
Combine multiple grounding formats
Perform object detection and localization
Support diverse positioning tasks

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/image.jpg",
            },
            {"type": "text", "text": "Locate all the objects in this image with bounding boxes."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Try it Yourself

Explore the full 2D grounding cookbook with interactive examples:

View on GitHub

​Capability Overview

​Example Usage

​Try it Yourself

Capability Overview

Example Usage

Try it Yourself