
Prerequisites

Before installing Qwen3-VL, ensure you have:

  • Python 3.8 or higher
  • NVIDIA GPU with CUDA support (recommended)
  • PyTorch 2.0 or higher
  • pip package manager

Basic Installation

1. Install Transformers

Qwen3-VL requires transformers version 4.57.0 or higher:
pip install "transformers>=4.57.0"
This is the minimum requirement to run Qwen3-VL with Hugging Face Transformers.

2. Install qwen-vl-utils (Optional but Recommended)

For advanced vision processing capabilities:
pip install qwen-vl-utils==0.0.14
For faster video loading, install with the decord feature:
pip install "qwen-vl-utils[decord]"

3. Verify Installation

Test your installation:
from transformers import AutoModelForImageTextToText, AutoProcessor

# This should run without errors
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
print("Installation successful!")
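
The version requirement from step 1 can also be checked without importing the library itself. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `installed_version`, `meets_minimum`) are illustrative and not part of any Qwen or Hugging Face API. It compares versions numerically, because a plain string comparison would rank "4.100.0" below "4.57.0".

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '4.57.0.dev0' -> (4, 57, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group().split(".")]
    # Pad to three components so that '4.57' compares equal to '4.57.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(package: str, minimum: str) -> bool:
    """True if the package is installed at or above the minimum version."""
    found = installed_version(package)
    return found is not None and version_tuple(found) >= version_tuple(minimum)


if __name__ == "__main__":
    print("transformers >= 4.57.0:", meets_minimum("transformers", "4.57.0"))
```

Pre-release suffixes such as `.dev0` or `rc1` are ignored by this sketch, so a development build of 4.57.0 is treated as 4.57.0.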

Performance Optimizations

Flash Attention 2

For significantly faster inference, especially with multi-image and video scenarios:
pip install -U flash-attn --no-build-isolation
Flash Attention 2 requires:
  • Compatible NVIDIA GPU (Ampere or newer)
  • CUDA 11.6 or higher
  • Models loaded in torch.float16 or torch.bfloat16
Usage example:
import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
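
The requirements listed above can be turned into a small guard that only requests Flash Attention 2 when it is likely to work. This is a sketch under stated assumptions: the helper names are hypothetical, "Ampere or newer" is interpreted as CUDA compute capability 8.0 or higher, and the package check only confirms that `flash_attn` is importable, not that it was built correctly.

```python
import importlib.util
from typing import Dict, Optional, Tuple


def flash_attn_installed() -> bool:
    """Check whether the flash_attn package can be found, without importing it."""
    try:
        return importlib.util.find_spec("flash_attn") is not None
    except (ImportError, ValueError):
        return False


def attention_kwargs(
    compute_capability: Tuple[int, int],
    installed: Optional[bool] = None,
) -> Dict[str, str]:
    """Return extra from_pretrained keyword arguments for attention.

    compute_capability is a (major, minor) pair for the GPU.
    An empty dict means "do not specify attn_implementation".
    """
    if installed is None:
        installed = flash_attn_installed()
    # Assumption: Ampere corresponds to compute capability 8.0.
    if installed and compute_capability >= (8, 0):
        return {"attn_implementation": "flash_attention_2"}
    return {}


if __name__ == "__main__":
    # (8, 0) is an example value; on a real machine it could come from
    # torch.cuda.get_device_capability().
    print(attention_kwargs((8, 0)))
```

The returned dictionary can be unpacked into the `from_pretrained` call, for example `from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto", **attention_kwargs(torch.cuda.get_device_capability()))`.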

Video Processing Backends

Qwen3-VL supports multiple video decoding backends: torchcodec, decord, and torchvision. Switch backends by setting an environment variable:
export FORCE_QWENVL_VIDEO_READER=torchcodec  # or decord, torchvision
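
When the backend has to be chosen from Python rather than from the shell, the same variable can be set through `os.environ`. The sketch below uses a hypothetical helper, `select_video_reader`, and assumes the variable is read when `qwen_vl_utils` is imported or first used, so it should be set before that point.

```python
import os

# Backend names taken from the shell example above.
KNOWN_VIDEO_READERS = ("torchcodec", "decord", "torchvision")


def select_video_reader(name: str) -> str:
    """Set FORCE_QWENVL_VIDEO_READER for the current process and return the value."""
    if name not in KNOWN_VIDEO_READERS:
        raise ValueError(
            f"Unknown video reader {name!r}; expected one of {KNOWN_VIDEO_READERS}"
        )
    os.environ["FORCE_QWENVL_VIDEO_READER"] = name
    return name


# Assumption: call this before importing qwen_vl_utils so the setting is picked up.
select_video_reader("torchcodec")
```

Validating the name up front turns a typo into an immediate error instead of a silently ignored setting.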

Deployment Installation

For production deployment with vLLM or SGLang:
# Install accelerate and qwen-vl-utils
pip install accelerate
pip install qwen-vl-utils==0.0.14

# Install latest vLLM (>= 0.11.0)
uv pip install -U vllm
For detailed deployment instructions, see the Deployment Guide.

China Mainland Users

For users in mainland China, we recommend using ModelScope:
from modelscope import snapshot_download

# Download model checkpoint
model_dir = snapshot_download('qwen/Qwen3-VL-8B-Instruct')

# Load from local directory
from transformers import AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained(
    model_dir,
    dtype="auto",
    device_map="auto"
)

Docker Installation

Use our pre-built Docker images for a simplified setup:
docker run --gpus all --ipc=host --network=host --rm --name qwen3vl \
  -it qwenllm/qwenvl:qwen3vl-cu128 bash
The Docker image includes:
  • Pre-configured environment
  • All dependencies
  • CUDA 12.8 support
You only need to install GPU drivers on the host machine.

Installation Verification

Run this complete test to verify your installation:
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

print("Testing Qwen3-VL installation...")

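# Check transformers version (compare parsed versions, not raw strings)
import transformers
from packaging import version
print(f"Transformers version: {transformers.__version__}")
assert version.parse(transformers.__version__) >= version.parse("4.57.0"), "Please upgrade transformers"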

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")

# Try loading processor
try:
    processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
    print("✓ Processor loaded successfully")
except Exception as e:
    print(f"✗ Error loading processor: {e}")

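# Check qwen-vl-utils (read the version from the installed package metadata)
try:
    import qwen_vl_utils
    from importlib.metadata import version as package_version
    print(f"✓ qwen-vl-utils version: {package_version('qwen-vl-utils')}")
except ImportError:
    print("○ qwen-vl-utils not installed (optional)")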

# Check flash-attn
try:
    import flash_attn
    print("✓ Flash Attention 2 available")
except ImportError:
    print("○ Flash Attention 2 not installed (optional)")

print("\nInstallation verification complete!")

Troubleshooting

If importing AutoModelForImageTextToText fails, make sure a recent enough version of transformers is installed:
pip install "transformers>=4.57.0"
If you run out of GPU memory, try these solutions:
  1. Use a smaller model (e.g., 2B or 4B instead of 235B)
  2. Enable quantization with FP8 models
  3. Use device_map="auto" for automatic device placement
  4. Reduce batch size or max_new_tokens
# Use FP8 quantized model
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct-FP8",
    dtype="auto",
    device_map="auto"
)
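
A back-of-the-envelope estimate illustrates why a smaller or FP8 model helps with memory. The sketch below multiplies a parameter count by the bytes stored per parameter. The parameter counts are nominal values read off the model names, and one byte per parameter for FP8 is an assumption; real checkpoints may keep some layers in higher precision, and activations, the KV cache, and image or video inputs need additional memory.

```python
# Approximate bytes used to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "fp8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate the memory needed for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # Nominal sizes inferred from the model names (assumption).
    nominal_sizes = [("2B", 2e9), ("8B", 8e9), ("235B", 235e9)]
    for name, params in nominal_sizes:
        for dtype in ("bfloat16", "fp8"):
            gib = weight_memory_gib(params, dtype)
            print(f"{name:>5} {dtype:>9}: ~{gib:.1f} GiB for weights")
```

Comparing the estimate with the memory reported for your GPUs gives a quick first check before downloading a checkpoint.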
If Flash Attention 2 fails to install, check that your system meets its requirements:
  • NVIDIA GPU with Ampere architecture or newer
  • CUDA 11.6+
  • Proper build tools
If installation fails, you can skip it and use default attention:
# Don't specify attn_implementation
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    dtype="auto",
    device_map="auto"
)
If you encounter video processing issues:
  1. Try a different backend:
    export FORCE_QWENVL_VIDEO_READER=torchcodec
    
  2. Ensure you have the required dependencies:
    • torchcodec: Requires FFmpeg
    • decord: Linux only, may need to build from source
    • torchvision: Requires version >= 0.19.0 for URL support
  3. Use local video files instead of URLs as a workaround
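
Before switching backends, it can help to see which ones are present in the current environment. The following sketch uses only the standard library; the function names are illustrative. It checks whether each backend module can be found and whether an `ffmpeg` executable is on the PATH. The FFmpeg check is only a rough proxy, since torchcodec may depend on the FFmpeg libraries rather than on the command-line tool.

```python
import importlib.util
import shutil
from typing import Dict

# Backend module names from the dependency list above.
VIDEO_BACKENDS = ("torchcodec", "decord", "torchvision")


def module_available(name: str) -> bool:
    """Return True if the module can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def video_backend_report() -> Dict[str, bool]:
    """Map each backend (plus an FFmpeg check) to whether it was found."""
    report = {name: module_available(name) for name in VIDEO_BACKENDS}
    report["ffmpeg_on_path"] = shutil.which("ffmpeg") is not None
    return report


if __name__ == "__main__":
    for item, found in video_backend_report().items():
        print(f"{'✓' if found else '○'} {item}")
```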

Next Steps

Now that you have Qwen3-VL installed, you can:

  • Quick Start: Run your first inference with an image
  • Advanced Usage: Learn about pixel control, batching, and optimization
  • Deployment: Deploy Qwen3-VL with vLLM or SGLang
  • Cookbooks: Explore practical examples and use cases