Qwen3-VL is available in multiple sizes and configurations to meet different deployment scenarios, from edge devices to cloud infrastructure.

Model Sizes

Qwen3-VL comes in six parameter scales: four dense models and two Mixture-of-Experts (MoE) models.

Dense Architecture Models

| Model Size | Parameters | Use Case | Hardware Requirement |
|---|---|---|---|
| 2B | 2 Billion | Edge devices, mobile | Consumer GPUs |
| 4B | 4 Billion | Lightweight deployment | Single GPU |
| 8B | 8 Billion | Balanced performance | Single GPU (16GB+) |
| 32B | 32 Billion | High-performance tasks | Multi-GPU |

Mixture-of-Experts (MoE) Architecture

| Model Size | Total Parameters | Active Parameters | Use Case |
|---|---|---|---|
| 30B-A3B | 30 Billion | 3 Billion | Efficient large-scale |
| 235B-A22B | 235 Billion | 22 Billion | State-of-the-art performance |

MoE models activate only a subset of their parameters for each token, so per-token compute scales with the active parameter count rather than the total. This gives a better efficiency/performance tradeoff than a dense model of similar total parameter count.
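As a rough illustration of what "active parameters" means in practice, the sketch below computes the share of parameters that is active for each MoE model, using the nominal counts from the table. The helper name `active_fraction` is hypothetical and only serves this example.

```python
# Nominal parameter counts (in billions), taken from the MoE table.
MOE_MODELS = {
    "30B-A3B": {"total": 30, "active": 3},
    "235B-A22B": {"total": 235, "active": 22},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters that is active (hypothetical helper)."""
    return active_b / total_b


for name, params in MOE_MODELS.items():
    share = active_fraction(params["total"], params["active"])
    print(f"{name}: {share:.1%} of parameters active")
```

Both models activate roughly a tenth of their weights (10.0% and 9.4%). Note that all weights still have to be resident in memory; only the compute is reduced.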

Model Editions

Each model size is available in two editions:

Instruct Edition

Optimized for direct task execution
  • Fast inference and response generation
  • Suitable for production deployments
  • Optimized for instruction-following
  • Lower computational overhead
Example: Qwen3-VL-8B-Instruct

Thinking Edition

Enhanced reasoning with explicit thought processes
  • Provides step-by-step reasoning
  • Better performance on complex tasks
  • Useful for debugging and interpretability
  • Longer output sequences
Example: Qwen3-VL-8B-Thinking
Thinking editions are particularly effective for:
  • STEM and mathematical reasoning
  • Complex spatial understanding
  • Multi-step problem solving
  • Tasks requiring causal analysis
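The two example names above suggest a naming pattern of `Qwen3-VL-<size>-<edition>`. A minimal sketch of building such a name is shown below; the `model_name` helper is hypothetical, and the pattern is inferred from the examples on this page rather than from an official specification.

```python
# Sizes and editions as listed on this page.
SIZES = ("2B", "4B", "8B", "32B", "30B-A3B", "235B-A22B")
EDITIONS = ("Instruct", "Thinking")


def model_name(size: str, edition: str = "Instruct") -> str:
    """Build a model name from the inferred pattern (hypothetical helper)."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if edition not in EDITIONS:
        raise ValueError(f"unknown edition: {edition!r}")
    return f"Qwen3-VL-{size}-{edition}"


print(model_name("8B"))              # Qwen3-VL-8B-Instruct
print(model_name("8B", "Thinking"))  # Qwen3-VL-8B-Thinking
```

When loading from a model hub, the name may need an organization prefix; check the collection for the exact repository ID.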

Architecture Comparison

Dense vs MoE

Dense Models
  • All parameters active during inference
  • Predictable memory usage
  • Simpler deployment
  • Best for: Resource-constrained environments
MoE Models
  • Sparse activation patterns
  • Higher capacity with lower compute
  • Requires expert parallelism support
  • Best for: Maximum performance scenarios

Available Models

Released Models (HuggingFace)

Model Selection Guide

By Use Case

Edge/Mobile Applications
  • Choose: 2B or 4B Instruct
  • Rationale: Low memory footprint, fast inference
General Purpose Vision-Language Tasks
  • Choose: 8B Instruct
  • Rationale: Best balance of performance and efficiency
Complex Reasoning Tasks
  • Choose: 8B or 32B Thinking
  • Rationale: Enhanced reasoning capabilities
Maximum Performance
  • Choose: 235B-A22B Instruct/Thinking
  • Rationale: State-of-the-art results across benchmarks
Production with Budget Constraints
  • Choose: 30B-A3B Instruct
  • Rationale: MoE efficiency with strong performance
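The use-case recommendations above can be captured as a simple lookup table. This is only a sketch: the dictionary keys and the `recommend` function are hypothetical names chosen for the example, while the model names follow the choices listed above.

```python
# Use-case keys are hypothetical labels; values mirror the guide above.
RECOMMENDATIONS = {
    "edge": ["Qwen3-VL-2B-Instruct", "Qwen3-VL-4B-Instruct"],
    "general": ["Qwen3-VL-8B-Instruct"],
    "reasoning": ["Qwen3-VL-8B-Thinking", "Qwen3-VL-32B-Thinking"],
    "max_performance": [
        "Qwen3-VL-235B-A22B-Instruct",
        "Qwen3-VL-235B-A22B-Thinking",
    ],
    "budget_production": ["Qwen3-VL-30B-A3B-Instruct"],
}


def recommend(use_case: str) -> list[str]:
    """Return candidate models for a use case (hypothetical helper)."""
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}")


print(recommend("reasoning"))
```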

By Hardware

# Single Consumer GPU (8-16GB)
model = "Qwen3-VL-2B-Instruct"

# Single High-End GPU (24GB+)
model = "Qwen3-VL-8B-Instruct"

# Multi-GPU (A100/H100)
model = "Qwen3-VL-32B-Instruct"

# H100+ with FP8 Support
model = "Qwen3-VL-235B-A22B-Instruct-FP8"
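The hardware tiers above can also be expressed as a small selection function. The sketch below is one possible reading of those tiers, not an official rule: `pick_model` and its parameters are hypothetical, GPUs between 16GB and 24GB are assumed to fall back to the 2B model, and the FP8 235B tier is assumed to need more than one GPU.

```python
def pick_model(gpu_memory_gb: float, num_gpus: int = 1, fp8_capable: bool = False) -> str:
    """Map hardware to a model name following the tiers above (hypothetical helper)."""
    if num_gpus > 1 and fp8_capable:
        # "H100+ with FP8 Support" tier; multi-GPU is an assumption here.
        return "Qwen3-VL-235B-A22B-Instruct-FP8"
    if num_gpus > 1:
        # "Multi-GPU (A100/H100)" tier.
        return "Qwen3-VL-32B-Instruct"
    if gpu_memory_gb >= 24:
        # "Single High-End GPU (24GB+)" tier.
        return "Qwen3-VL-8B-Instruct"
    if gpu_memory_gb >= 8:
        # "Single Consumer GPU (8-16GB)" tier; 16-24GB also lands here (assumption).
        return "Qwen3-VL-2B-Instruct"
    raise ValueError("less than 8GB of GPU memory is outside the tiers listed above")


print(pick_model(12))                                  # Qwen3-VL-2B-Instruct
print(pick_model(24))                                  # Qwen3-VL-8B-Instruct
print(pick_model(80, num_gpus=4))                      # Qwen3-VL-32B-Instruct
print(pick_model(80, num_gpus=8, fp8_capable=True))    # Qwen3-VL-235B-A22B-Instruct-FP8
```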

Quantized Versions

For memory-constrained deployments, FP8 quantized versions are available:
  • Requires NVIDIA H100+ and CUDA 12+
  • Minimal performance degradation
  • Significant memory savings
  • Faster inference throughput
Available: All models have FP8 variants in the HuggingFace collection
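To get a feel for the memory savings, the sketch below estimates weight memory from the nominal parameter count. It assumes the unquantized checkpoints store weights in 16-bit precision (2 bytes per parameter) and FP8 uses 1 byte per parameter; it ignores activations, the KV cache, and framework overhead, so real usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes per parameter; the 16-bit baseline is an assumption for this estimate.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return float(params_billion * BYTES_PER_PARAM[dtype])


for size_b in (8, 32, 235):
    bf16 = weight_memory_gb(size_b, "bf16")
    fp8 = weight_memory_gb(size_b, "fp8")
    print(f"{size_b}B: ~{bf16:.0f} GB (16-bit) vs ~{fp8:.0f} GB (FP8)")
```

Under these assumptions, FP8 halves the weight footprint, for example from about 16 GB to about 8 GB for the 8B model.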

Performance Benchmarks

For detailed performance comparisons across different model sizes and editions, see the Benchmarks page.
MoE models require specialized deployment configuration. Make sure your inference framework supports expert parallelism (for example, vLLM with the --enable-expert-parallel flag).
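As a sketch of how the expert-parallel flag might be applied only where it is relevant, the snippet below assembles a `vllm serve` command line and adds `--enable-expert-parallel` for MoE model names. The helpers `is_moe` and `vllm_serve_args` are hypothetical, MoE detection relies on the `-A<n>B` naming seen on this page, and the tensor-parallel size is a placeholder; check the vLLM documentation for the options your version supports.

```python
import re
import shlex


def is_moe(model: str) -> bool:
    """Detect MoE models by the '-A<n>B' active-parameter suffix (naming assumption)."""
    return re.search(r"-A\d+B", model) is not None


def vllm_serve_args(model: str, tensor_parallel_size: int = 1) -> list[str]:
    """Build an argument list for `vllm serve` (hypothetical helper)."""
    args = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]
    if is_moe(model):
        # Flag named in the note above; only meaningful for MoE models.
        args.append("--enable-expert-parallel")
    return args


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(vllm_serve_args("Qwen3-VL-30B-A3B-Instruct", tensor_parallel_size=2)))
print(shlex.join(vllm_serve_args("Qwen3-VL-8B-Instruct")))
```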