2D Object Grounding
Qwen3-VL provides precise 2D object grounding capabilities, allowing you to locate and label objects within images using relative position coordinates. The model supports both bounding boxes and point-based grounding for diverse positioning and labeling tasks.
Capabilities
Qwen3-VL’s 2D grounding uses relative position coordinates to support:
- Bounding Boxes: Draw rectangular boxes around objects
- Point Grounding: Mark specific locations with coordinate points
- Flexible Combinations: Mix boxes and points for complex annotation tasks
- Multi-object Detection: Ground multiple objects simultaneously
- Relative Coordinates: Resolution-independent coordinate system (0-1 range)
How It Works
Coordinate System
Positions are expressed as relative coordinates:
- X-axis: 0 (left edge) to 1 (right edge)
- Y-axis: 0 (top edge) to 1 (bottom edge)
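To draw or crop with these coordinates, they usually need to be mapped back to pixels. The following is a minimal sketch, assuming the 0-1 range described above; the helper names `to_pixel_box` and `to_pixel_point` are hypothetical and not part of any Qwen3-VL API. If your deployment emits a different scale, divide by that scale first.

```python
def to_pixel_box(box, width, height):
    """Map a relative [x_min, y_min, x_max, y_max] box to pixel coordinates.

    Assumes each value lies in the 0-1 range, with (0, 0) at the top-left.
    """
    x_min, y_min, x_max, y_max = box
    return (
        round(x_min * width),
        round(y_min * height),
        round(x_max * width),
        round(y_max * height),
    )


def to_pixel_point(point, width, height):
    """Map a relative [x, y] point to pixel coordinates."""
    x, y = point
    return (round(x * width), round(y * height))


# The same relative box applies to images of any size.
print(to_pixel_box([0.25, 0.5, 0.75, 1.0], 640, 480))    # (160, 240, 480, 480)
print(to_pixel_box([0.25, 0.5, 0.75, 1.0], 1280, 960))   # (320, 480, 960, 960)
```

Because the coordinates are relative, resizing the image does not require re-running grounding; only the final multiplication changes.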
Grounding Formats
- Bounding Boxes: [x_min, y_min, x_max, y_max]
- Points: [x, y]
- Combined: Mix both formats in a single response
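When a single response mixes both formats, the length of each coordinate list is enough to tell them apart. The sketch below assumes the model output has already been reduced to a JSON array of objects with `label` and `coords` keys; that schema and the function name `split_groundings` are hypothetical, chosen only for illustration, and the actual output layout depends on your prompt.

```python
import json


def split_groundings(raw):
    """Separate boxes from points in a mixed grounding response.

    Assumes `raw` is a JSON array of {"label": ..., "coords": [...]} objects
    (a hypothetical schema) with coordinates in the 0-1 range.
    """
    boxes, points = [], []
    for item in json.loads(raw):
        label, coords = item["label"], item["coords"]
        if any(not 0 <= value <= 1 for value in coords):
            raise ValueError(f"coordinates out of range for {label!r}: {coords}")
        if len(coords) == 4:
            x_min, y_min, x_max, y_max = coords
            if x_min > x_max or y_min > y_max:
                raise ValueError(f"inverted box for {label!r}: {coords}")
            boxes.append((label, (x_min, y_min, x_max, y_max)))
        elif len(coords) == 2:
            points.append((label, tuple(coords)))
        else:
            raise ValueError(f"expected 2 or 4 values for {label!r}: {coords}")
    return boxes, points


raw = """[
  {"label": "cat", "coords": [0.1, 0.2, 0.5, 0.6]},
  {"label": "nose", "coords": [0.3, 0.35]}
]"""
boxes, points = split_groundings(raw)
print(boxes)   # [('cat', (0.1, 0.2, 0.5, 0.6))]
print(points)  # [('nose', (0.3, 0.35))]
```

Validating the range and ordering up front keeps malformed entries from silently producing empty or inverted regions later.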
Use Cases
- Object Detection: Locate and label objects in images
- Annotation: Generate training data for computer vision models
- Visual Search: Find specific items within images
- Quality Control: Identify defects or anomalies in products
- Spatial Analysis: Analyze object positions and distributions
- Interactive Applications: Enable click-to-identify features
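As an illustration of the click-to-identify use case, a click in pixel space can be converted to relative coordinates and tested against grounded boxes. This is a minimal sketch, assuming boxes are `(label, (x_min, y_min, x_max, y_max))` tuples in the 0-1 range; the function name `identify_click` and the smallest-box tie-break are illustrative choices, not a documented behavior.

```python
def identify_click(click_x, click_y, width, height, boxes):
    """Return the label of the smallest box containing the click, or None.

    `click_x` and `click_y` are pixel coordinates; `boxes` holds
    (label, (x_min, y_min, x_max, y_max)) tuples in relative 0-1 coordinates.
    """
    rel_x, rel_y = click_x / width, click_y / height
    hits = [
        (label, (x_max - x_min) * (y_max - y_min))
        for label, (x_min, y_min, x_max, y_max) in boxes
        if x_min <= rel_x <= x_max and y_min <= rel_y <= y_max
    ]
    if not hits:
        return None
    # Prefer the smallest enclosing box so nested objects win over containers.
    return min(hits, key=lambda hit: hit[1])[0]


boxes = [
    ("table", (0.0, 0.5, 1.0, 1.0)),
    ("cup", (0.4, 0.55, 0.5, 0.7)),
]
print(identify_click(450, 600, 1000, 1000, boxes))  # cup
print(identify_click(100, 900, 1000, 1000, boxes))  # table
print(identify_click(100, 100, 1000, 1000, boxes))  # None
```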
Try It Out
Explore 2D object grounding with our interactive cookbook:
2D Grounding Cookbook
Using relative position coordinates, Qwen3-VL supports both boxes and points, allowing for diverse combinations of positioning and labeling tasks.
Key Features
- High Precision: Accurate object localization
- Format Flexibility: Support for boxes, points, and combinations
- Multi-object Support: Ground multiple items in one pass
- Resolution Independent: Works across different image sizes
- Natural Language Queries: Describe what to ground in plain text
Advanced Capabilities
Referring Expression Comprehension
Ground objects based on natural language descriptions:
- “The red car on the left”
- “The person wearing glasses”
- “The largest apple in the bowl”
Dense Captioning
Generate descriptions for multiple grounded regions in an image.
Visual Question Answering with Grounding
Answer questions while providing visual evidence through grounding.
Related Capabilities
- 3D Grounding - 3D bounding boxes for spatial scenes
- Omni Recognition - Identify what objects to ground
- Spatial Understanding - Understand spatial relationships
- Video Understanding - Grounding in video frames