Introduction - Qwen3-VL

Overview

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

Quick Start

Get started with Qwen3-VL in minutes with a simple inference example

Installation

Install transformers, qwen-vl-utils, and dependencies

Model Variants

Explore different model sizes from 2B to 235B parameters

GitHub Repository

View source code, examples, and contribute

Key Features

Visual Agent

Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.

Visual Coding Boost

Generates Draw.io/HTML/CSS/JS from images/videos.

Advanced Spatial Perception

Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.

Long Context & Video Understanding

Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.

Enhanced Multimodal Reasoning

Excels in STEM/Math—causal analysis and logical, evidence-based answers.

Upgraded Visual Recognition

Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc.

Expanded OCR

Supports 32 languages (up from 10); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.

Text Understanding on par with pure LLMs

Seamless text–vision fusion for lossless, unified comprehension.

Model Variants

Qwen3-VL offers a range of model sizes to suit different deployment scenarios:

2B Instruct/Thinking

Lightweight edge deployment

4B Instruct/Thinking

Balanced performance

8B Instruct/Thinking

Strong capabilities

32B Instruct/Thinking

High performance

30B-A3B Instruct/Thinking

MoE architecture

235B-A22B Instruct/Thinking

Flagship model with MoE

All models are available on Hugging Face and ModelScope.

Architecture Innovations

Core Improvements

Interleaved-MRoPE

Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.

DeepStack

Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.

Text–Timestamp Alignment

Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

Resources

Research Paper

Read the full technical report

Blog Post

Latest announcements and insights

Cookbooks

Explore practical examples and tutorials

Demo

Try the interactive demo

Next Steps

Run Your First Inference

Follow our quickstart guide to run Qwen3-VL with your first image

Deploy to Production

Learn how to deploy Qwen3-VL with vLLM or SGLang

Quick Start

​Overview

Quick Start

Installation

Model Variants

GitHub Repository

​Key Features

​Model Variants

2B Instruct/Thinking

4B Instruct/Thinking

8B Instruct/Thinking

32B Instruct/Thinking

30B-A3B Instruct/Thinking

235B-A22B Instruct/Thinking

​Architecture Innovations

​Core Improvements

​Resources

Research Paper

Blog Post

Cookbooks

Demo

​Next Steps

Run Your First Inference

Deploy to Production

Overview

Key Features

Model Variants

Architecture Innovations

Core Improvements

Resources

Next Steps