Qwen3-VL provides precise 2D object grounding capabilities using relative position coordinates. The model supports both bounding boxes and points, enabling diverse combinations of positioning and labeling tasks.

Capability Overview

The 2D grounding feature enables you to:
  • Locate objects using bounding boxes
  • Pinpoint specific locations with points
  • Express positions in relative coordinates
  • Combine bounding boxes and points in a single task
  • Perform object detection and localization
  • Handle a range of positioning and labeling tasks
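
Because the coordinates are relative, they need to be rescaled to the image size before you can draw or crop with them. The following is a minimal sketch of that conversion. It assumes the relative coordinates are normalized to a 0-1000 range and that boxes are ordered as (x1, y1, x2, y2); both are assumptions not stated on this page, so check them against the cookbook. The helper names `point_to_pixels` and `box_to_pixels` are hypothetical.

```python
# Assumption: relative coordinates lie in [0, SCALE]; change SCALE if the
# model you use reports a different range.
SCALE = 1000


def point_to_pixels(point, width, height, scale=SCALE):
    """Convert a relative (x, y) point to pixel coordinates."""
    x, y = point
    return (x / scale * width, y / scale * height)


def box_to_pixels(box, width, height, scale=SCALE):
    """Convert a relative (x1, y1, x2, y2) box to pixel coordinates."""
    x1, y1, x2, y2 = box
    left, top = point_to_pixels((x1, y1), width, height, scale)
    right, bottom = point_to_pixels((x2, y2), width, height, scale)
    return (left, top, right, bottom)


# A box covering the lower-middle region of a 1280x720 image.
print(box_to_pixels([250, 500, 750, 1000], width=1280, height=720))
# → (320.0, 360.0, 960.0, 720.0)
```

Keeping the scale as a parameter makes it easy to adapt the helpers if the coordinate range differs.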

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and its processor; device_map="auto" places the weights
# on the available devices.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/image.jpg",
            },
            {"type": "text", "text": "Locate all the objects in this image with bounding boxes."},
        ],
    }
]

# Build the model inputs (token ids and image features) from the chat messages.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only the new tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# output_text is a list with one decoded string per input sequence.
print(output_text)
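
The decoded text usually needs to be parsed before the boxes can be used. Below is a minimal sketch of that step. It assumes the model answers with a JSON list of objects, possibly wrapped in a Markdown code fence, and that each object carries `bbox_2d` and `label` keys. This page does not specify the output format, so treat those key names and the sample string as illustrative assumptions. `parse_grounding_output` is a hypothetical helper, not part of `transformers`.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


def parse_grounding_output(text):
    """Return the list of JSON objects found in the model output, or [] on failure."""
    # If the answer is wrapped in a code fence, keep only the fenced payload.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    payload = match.group(1) if match else text
    try:
        data = json.loads(payload.strip())
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):
        data = [data]
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, dict)]


# Hypothetical model output, assuming "bbox_2d" and "label" keys.
sample = FENCE + 'json\n[{"bbox_2d": [100, 200, 300, 400], "label": "cat"}]\n' + FENCE

for obj in parse_grounding_output(sample):
    print(obj.get("label"), obj.get("bbox_2d"))
# → cat [100, 200, 300, 400]
```

With the example above, you would pass `output_text[0]` to the parser. Returning an empty list on malformed output lets the caller decide whether to retry with a larger `max_new_tokens`, since a truncated generation is a common cause of invalid JSON.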

Try it Yourself

Explore the full 2D grounding cookbook with interactive examples, available on Colab and GitHub.