# Qwen3-VL

## Docs

- [Text Generation](https://mintlify.wiki/QwenLM/Qwen3-VL/api/generation.md): Generate text responses with the model.generate() method
- [Model Loading](https://mintlify.wiki/QwenLM/Qwen3-VL/api/model-loading.md): Load Qwen3-VL models with AutoModelForImageTextToText
- [Processor](https://mintlify.wiki/QwenLM/Qwen3-VL/api/processor.md): Process images, videos, and text with AutoProcessor
- [Training Arguments](https://mintlify.wiki/QwenLM/Qwen3-VL/api/training/arguments.md): Configuration dataclasses for model initialization, data processing, and training
- [Data Processor](https://mintlify.wiki/QwenLM/Qwen3-VL/api/training/data-processor.md): Vision-language data processing, dataset loading, and collation for Qwen-VL fine-tuning
- [QwenTrainer](https://mintlify.wiki/QwenLM/Qwen3-VL/api/training/trainer.md): Custom trainer implementation with optimized attention mechanisms and optimizer configurations for Qwen-VL fine-tuning
- [fetch_image](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/fetch-image.md): Load and process images from various sources
- [fetch_video](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/fetch-video.md): Extract and process video frames for vision-language models
- [qwen-vl-utils Overview](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/overview.md): Python utilities for processing vision and language information with Qwen-VL models
- [process_vision_info](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/process-vision-info.md): Extract and process vision information from conversation messages
- [smart_resize](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/smart-resize.md): Intelligently resize images while maintaining aspect ratio and constraints
- [Computer Use Agent](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/computer-use.md): Computer and web control with GUI interaction
- [Document Parsing](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/document-parsing.md): Advanced document parsing with layout, text, and Qwen HTML format
- [2D Object Grounding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/grounding-2d.md): Precise object grounding with bounding boxes and points
- [3D Grounding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/grounding-3d.md): Accurate 3D bounding boxes for indoor and outdoor objects
- [Mobile Agent](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/mobile-agent.md): Mobile phone control and GUI interaction
- [OCR & Key Information Extraction](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/ocr.md): General OCR with support for 32 languages and key information extraction
- [Omni Recognition](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/omni-recognition.md): Identify animals, plants, people, landmarks, and products with Qwen3-VL
- [Spatial Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/spatial-understanding.md): See, understand, and reason about spatial information
- [Video Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/video-understanding.md): Video OCR, long video understanding, and video grounding
- [Visual Coding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/visual-coding.md): Generate Draw.io, HTML, CSS, and JavaScript from images and videos
- [Key Features](https://mintlify.wiki/QwenLM/Qwen3-VL/concepts/key-features.md): Comprehensive overview of Qwen3-VL's capabilities including visual agents, spatial perception, video understanding, and more
- [Model Architecture](https://mintlify.wiki/QwenLM/Qwen3-VL/concepts/model-architecture.md): Technical details of Qwen3-VL's architecture including Interleaved-MRoPE, DeepStack, and Text-Timestamp Alignment
- [Model Variants](https://mintlify.wiki/QwenLM/Qwen3-VL/concepts/model-variants.md): Overview of Qwen3-VL model sizes, architectures, and editions
- [DashScope API Service](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/api-service.md): Use Qwen3-VL through the DashScope API with OpenAI-compatible client
- [Docker Deployment](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/docker.md): Deploy Qwen3-VL using pre-built Docker images
- [Deployment Overview](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/overview.md): Learn about different deployment options for Qwen3-VL models
- [SGLang Deployment](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/sglang.md): Deploy Qwen3-VL with SGLang for efficient inference and serving
- [vLLM Deployment](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/vllm.md): Deploy Qwen3-VL with vLLM for fast inference and serving
- [2D Object Grounding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/2d-grounding.md): Precise object localization using bounding boxes and points with relative coordinates
- [3D Object Grounding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/3d-grounding.md): Accurate 3D bounding boxes for indoor and outdoor objects
- [Computer Use Agent](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/computer-use.md): Control computers and web interfaces with visual agent capabilities for element localization and reasoning
- [Document Parsing](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/document-parsing.md): Advanced document parsing with layout, position information, and Qwen HTML format
- [Long Document Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/long-documents.md): Rigorous semantic comprehension of ultra-long documents with extended context
- [Mobile Agent](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/mobile-agent.md): Locate UI elements and control mobile phone interfaces with visual agent capabilities
- [Multimodal Coding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/multimodal-coding.md): Generate accurate code from images and videos, including Draw.io, HTML, CSS, and JavaScript
- [OCR & Key Information Extraction](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/ocr-extraction.md): Advanced text recognition in natural scenes with multi-language support and key information extraction
- [Omni Recognition](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/omni-recognition.md): Recognize animals, plants, people, scenic spots, cars, merchandise, and various objects
- [Using OpenAI Client with DashScope API](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/openai-api.md): Learn how to use the OpenAI client to interact with Qwen3-VL models through DashScope API
- [Spatial Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/spatial-understanding.md): See, understand, and reason about spatial information and relationships
- [Streaming Responses](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/streaming.md): Enable streaming responses with vLLM and SGLang for real-time inference
- [Thinking with Images](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/thinking-with-images.md): Use image zoom and search tools for precise comprehension of fine-grained visual details
- [Video Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/video-understanding.md): Advanced video OCR, long video understanding, and temporal video grounding
- [Running Web UI Demo](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/web-ui.md): Set up and run the interactive web-based interface for Qwen3-VL
- [Dataset Preparation](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/dataset-preparation.md): Format your training data for Qwen3-VL fine-tuning
- [LoRA Fine-tuning](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/lora.md): Parameter-efficient fine-tuning with LoRA for Qwen3-VL
- [Training Overview](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/overview.md): Learn how to fine-tune Qwen3-VL models with the official training framework
- [Training Configuration](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/training-configuration.md): Configure datasets and sampling rates for Qwen3-VL training
- [Training Script](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/training-script.md): Complete training script reference with all parameters for Qwen3-VL fine-tuning
- [Basic Usage](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/basic-usage.md): Learn how to perform basic inference with Qwen3-VL using transformers
- [Batch Inference](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/batch-inference.md): Process multiple requests efficiently with batching and padding
- [Generation Parameters](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/generation-parameters.md): Configure temperature, top_p, max_new_tokens, and other sampling parameters for optimal output quality
- [Image Processing](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/image-processing.md): Process single and multiple images with resolution control
- [Pixel Control](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/pixel-control.md): Fine-grained control over image and video resolution using processor settings and qwen-vl-utils
- [Video Processing](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/video-processing.md): Process video inputs with frame sampling and resolution control
- [Installation](https://mintlify.wiki/QwenLM/Qwen3-VL/installation.md): Install transformers, qwen-vl-utils, and dependencies to get started with Qwen3-VL
- [Introduction](https://mintlify.wiki/QwenLM/Qwen3-VL/introduction.md): Meet Qwen3-VL, the most powerful vision-language model in the Qwen series with superior text understanding, visual perception, and agent capabilities
- [Quick Start](https://mintlify.wiki/QwenLM/Qwen3-VL/quickstart.md): Get started with Qwen3-VL in minutes with a simple image inference example
- [Benchmarks](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/benchmarks.md): Performance evaluation of Qwen3-VL models across visual and text-centric tasks
- [Changelog](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/changelog.md): Release history and updates for Qwen3-VL
- [FAQ](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/faq.md): Frequently asked questions about Qwen3-VL models and usage
- [Model Cards](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/model-cards.md): Complete overview of all Qwen3-VL model variants with download links
- [Troubleshooting](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/troubleshooting.md): Common issues and solutions when working with Qwen3-VL