Skip to main content
Qwen3-VL Logo

Overview

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

Quick Start

Get started with Qwen3-VL in minutes with a simple inference example

Installation

Install transformers, qwen-vl-utils, and dependencies

Model Variants

Explore different model sizes from 2B to 235B parameters

GitHub Repository

View source code, examples, and contribute

Key Features

Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
Generates Draw.io/HTML/CSS/JS from images/videos.
Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
Excels in STEM/Math—causal analysis and logical, evidence-based answers.
Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc.
Supports 32 languages (up from 10); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
Seamless text–vision fusion for lossless, unified comprehension.

Model Variants

Qwen3-VL offers a range of model sizes to suit different deployment scenarios:

2B Instruct/Thinking

Lightweight edge deployment

4B Instruct/Thinking

Balanced performance

8B Instruct/Thinking

Strong capabilities

32B Instruct/Thinking

High performance

30B-A3B Instruct/Thinking

MoE architecture

235B-A22B Instruct/Thinking

Flagship model with MoE
All models are available on Hugging Face and ModelScope.

Architecture Innovations

Qwen3-VL Architecture

Core Improvements

1

Interleaved-MRoPE

Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2

DeepStack

Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3

Text–Timestamp Alignment

Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

Resources

Research Paper

Read the full technical report

Blog Post

Latest announcements and insights

Cookbooks

Explore practical examples and tutorials

Demo

Try the interactive demo

Next Steps

Run Your First Inference

Follow our quickstart guide to run Qwen3-VL with your first image

Deploy to Production

Learn how to deploy Qwen3-VL with vLLM or SGLang