# Qwen3-VL

## Docs

- [Text Generation](https://mintlify.wiki/QwenLM/Qwen3-VL/api/generation.md): Generate text responses with the model.generate() method
- [Model Loading](https://mintlify.wiki/QwenLM/Qwen3-VL/api/model-loading.md): Load Qwen3-VL models with AutoModelForImageTextToText
- [Processor](https://mintlify.wiki/QwenLM/Qwen3-VL/api/processor.md): Process images, videos, and text with AutoProcessor
- [Training Arguments](https://mintlify.wiki/QwenLM/Qwen3-VL/api/training/arguments.md): Configuration dataclasses for model initialization, data processing, and training
- [Data Processor](https://mintlify.wiki/QwenLM/Qwen3-VL/api/training/data-processor.md): Vision-language data processing, dataset loading, and collation for Qwen-VL fine-tuning
- [QwenTrainer](https://mintlify.wiki/QwenLM/Qwen3-VL/api/training/trainer.md): Custom trainer implementation with optimized attention mechanisms and optimizer configurations for Qwen-VL fine-tuning
- [fetch_image](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/fetch-image.md): Load and process images from various sources
- [fetch_video](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/fetch-video.md): Extract and process video frames for vision-language models
- [qwen-vl-utils Overview](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/overview.md): Python utilities for processing vision and language information with Qwen-VL models
- [process_vision_info](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/process-vision-info.md): Extract and process vision information from conversation messages
- [smart_resize](https://mintlify.wiki/QwenLM/Qwen3-VL/api/utils/smart-resize.md): Intelligently resize images while maintaining aspect ratio and constraints
- [Computer Use Agent](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/computer-use.md): Computer and web control with GUI interaction
- [Document Parsing](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/document-parsing.md): Advanced document parsing with layout, text, and Qwen HTML format
- [2D Object Grounding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/grounding-2d.md): Precise object grounding with bounding boxes and points
- [3D Grounding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/grounding-3d.md): Accurate 3D bounding boxes for indoor and outdoor objects
- [Mobile Agent](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/mobile-agent.md): Mobile phone control and GUI interaction
- [OCR & Key Information Extraction](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/ocr.md): General OCR with 32 language support and key information extraction
- [Omni Recognition](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/omni-recognition.md): Identify animals, plants, people, landmarks, and products with Qwen3-VL
- [Spatial Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/spatial-understanding.md): See, understand, and reason about spatial information
- [Video Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/video-understanding.md): Video OCR, long video understanding, and video grounding
- [Visual Coding](https://mintlify.wiki/QwenLM/Qwen3-VL/capabilities/visual-coding.md): Generate Draw.io, HTML, CSS, and JavaScript from images and videos
- [Key Features](https://mintlify.wiki/QwenLM/Qwen3-VL/concepts/key-features.md): Comprehensive overview of Qwen3-VL's capabilities including visual agents, spatial perception, video understanding, and more
- [Model Architecture](https://mintlify.wiki/QwenLM/Qwen3-VL/concepts/model-architecture.md): Technical details of Qwen3-VL's architecture including Interleaved-MRoPE, DeepStack, and Text-Timestamp Alignment
- [Model Variants](https://mintlify.wiki/QwenLM/Qwen3-VL/concepts/model-variants.md): Overview of Qwen3-VL model sizes, architectures, and editions
- [DashScope API Service](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/api-service.md): Use Qwen3-VL through the DashScope API with OpenAI-compatible client
- [Docker Deployment](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/docker.md): Deploy Qwen3-VL using pre-built Docker images
- [Deployment Overview](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/overview.md): Learn about different deployment options for Qwen3-VL models
- [SGLang Deployment](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/sglang.md): Deploy Qwen3-VL with SGLang for efficient inference and serving
- [vLLM Deployment](https://mintlify.wiki/QwenLM/Qwen3-VL/deployment/vllm.md): Deploy Qwen3-VL with vLLM for fast inference and serving
- [2D Object Grounding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/2d-grounding.md): Precise object localization using bounding boxes and points with relative coordinates
- [3D Object Grounding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/3d-grounding.md): Accurate 3D bounding boxes for indoor and outdoor objects
- [Computer Use Agent](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/computer-use.md): Control computers and web interfaces with visual agent capabilities for element localization and reasoning
- [Document Parsing](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/document-parsing.md): Advanced document parsing with layout, position information, and Qwen HTML format
- [Long Document Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/long-documents.md): Rigorous semantic comprehension of ultra-long documents with extended context
- [Mobile Agent](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/mobile-agent.md): Locate UI elements and control mobile phone interfaces with visual agent capabilities
- [Multimodal Coding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/multimodal-coding.md): Generate accurate code from images and videos, including Draw.io, HTML, CSS, and JavaScript
- [OCR & Key Information Extraction](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/ocr-extraction.md): Advanced text recognition in natural scenes with multi-language support and key information extraction
- [Omni Recognition](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/omni-recognition.md): Recognize animals, plants, people, scenic spots, cars, merchandise, and various objects
- [Using OpenAI Client with DashScope API](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/openai-api.md): Learn how to use the OpenAI client to interact with Qwen3-VL models through DashScope API
- [Spatial Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/spatial-understanding.md): See, understand, and reason about spatial information and relationships
- [Streaming Responses](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/streaming.md): Enable streaming responses with vLLM and SGLang for real-time inference
- [Thinking with Images](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/thinking-with-images.md): Use image zoom and search tools for precise comprehension of fine-grained visual details
- [Video Understanding](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/video-understanding.md): Advanced video OCR, long video understanding, and temporal video grounding
- [Running Web UI Demo](https://mintlify.wiki/QwenLM/Qwen3-VL/examples/web-ui.md): Set up and run the interactive web-based interface for Qwen3-VL
- [Dataset Preparation](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/dataset-preparation.md): Format your training data for Qwen3-VL fine-tuning
- [LoRA Fine-tuning](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/lora.md): Parameter-efficient fine-tuning with LoRA for Qwen3-VL
- [Training Overview](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/overview.md): Learn how to fine-tune Qwen3-VL models with the official training framework
- [Training Configuration](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/training-configuration.md): Configure datasets and sampling rates for Qwen3-VL training
- [Training Script](https://mintlify.wiki/QwenLM/Qwen3-VL/fine-tuning/training-script.md): Complete training script reference with all parameters for Qwen3-VL fine-tuning
- [Basic Usage](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/basic-usage.md): Learn how to perform basic inference with Qwen3-VL using transformers
- [Batch Inference](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/batch-inference.md): Process multiple requests efficiently with batching and padding
- [Generation Parameters](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/generation-parameters.md): Configure temperature, top_p, max_new_tokens, and other sampling parameters for optimal output quality
- [Image Processing](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/image-processing.md): Process single and multiple images with resolution control
- [Pixel Control](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/pixel-control.md): Fine-grained control over image and video resolution using processor settings and qwen-vl-utils
- [Video Processing](https://mintlify.wiki/QwenLM/Qwen3-VL/inference/video-processing.md): Process video inputs with frame sampling and resolution control
- [Installation](https://mintlify.wiki/QwenLM/Qwen3-VL/installation.md): Install transformers, qwen-vl-utils, and dependencies to get started with Qwen3-VL.
- [Introduction](https://mintlify.wiki/QwenLM/Qwen3-VL/introduction.md): Meet Qwen3-VL, the most powerful vision-language model in the Qwen series with superior text understanding, visual perception, and agent capabilities.
- [Quick Start](https://mintlify.wiki/QwenLM/Qwen3-VL/quickstart.md): Get started with Qwen3-VL in minutes with a simple image inference example
- [Benchmarks](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/benchmarks.md): Performance evaluation of Qwen3-VL models across visual and text-centric tasks
- [Changelog](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/changelog.md): Release history and updates for Qwen3-VL
- [FAQ](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/faq.md): Frequently asked questions about Qwen3-VL models and usage
- [Model Cards](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/model-cards.md): Complete overview of all Qwen3-VL model variants with download links
- [Troubleshooting](https://mintlify.wiki/QwenLM/Qwen3-VL/resources/troubleshooting.md): Common issues and solutions when working with Qwen3-VL