Video Processing

Basic Video Inference

Qwen3-VL supports video input through URLs, local paths, or frame sequences.

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", 
    dtype="auto", 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# Messages containing a video URL (or a local path) and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video Input Formats

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
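The same message structure can carry each of the three input kinds mentioned above. The following is a minimal sketch, assuming that a local file is passed as a plain path string and a frame sequence as a list of image paths. The helper name `video_message` and all `/path/to/...` values are hypothetical placeholders, and the exact forms accepted may depend on your transformers or qwen-vl-utils version.

```python
def video_message(video, prompt="Describe this video."):
    """Build a single-turn user message around one video entry (hypothetical helper)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Remote URL (the sample video used throughout this page)
url_messages = video_message(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"
)

# Local file (hypothetical path)
local_messages = video_message("/path/to/video.mp4")

# Pre-extracted frames (hypothetical paths); assumed to be a list of image files
frame_messages = video_message(
    ["/path/to/frame_0001.jpg", "/path/to/frame_0002.jpg", "/path/to/frame_0003.jpg"]
)
```

Each of these lists can then be passed to processor.apply_chat_template in place of the messages shown above.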

Frame Sampling Control

Using FPS

Control the frame sampling rate using the fps parameter:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Set fps = 4 (default is 2)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    fps=4
)
inputs = inputs.to(model.device)
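To get a feel for how fps affects input size, the frame count can be approximated as duration times sampling rate. This is a rough sketch only: `estimate_sampled_frames` is a hypothetical helper, and it ignores any minimum or maximum frame limits the processor may apply.

```python
import math


def estimate_sampled_frames(duration_s: float, fps: float = 2.0) -> int:
    """Approximate frame count for uniform sampling at `fps` over `duration_s` seconds."""
    if duration_s <= 0 or fps <= 0:
        raise ValueError("duration_s and fps must be positive")
    # At least one frame is always needed to represent the clip.
    return max(1, math.floor(duration_s * fps))


# Compare the default rate (2) with the rate used above (4) for a 30-second clip.
for rate in (2, 4):
    print(f"fps={rate}: ~{estimate_sampled_frames(30.0, rate)} frames")
```

Doubling fps roughly doubles the number of frames, and with it the number of visual tokens the model has to process.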

Using Fixed Frame Count

Specify the exact number of frames to sample:

# Sample exactly 128 frames; set fps=None so it does not conflict with num_frames
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=128,
    fps=None,
)
inputs = inputs.to(model.device)

When using num_frames, set fps=None to avoid conflicts between the two parameters.
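One way to keep the two options mutually exclusive is to build the sampling arguments in a single place. The sketch below uses a hypothetical helper, `sampling_kwargs`, which is not part of transformers; it only assembles the keyword arguments shown in the examples above.

```python
def sampling_kwargs(fps=None, num_frames=None):
    """Return keyword arguments for apply_chat_template, allowing only one sampling mode."""
    if fps is not None and num_frames is not None:
        raise ValueError("Pass either fps or num_frames, not both")
    if num_frames is not None:
        if num_frames <= 0:
            raise ValueError("num_frames must be positive")
        # Explicitly clear fps so it cannot conflict with the fixed frame count.
        return {"num_frames": int(num_frames), "fps": None}
    if fps is not None:
        if fps <= 0:
            raise ValueError("fps must be positive")
        return {"fps": fps}
    # Neither given: fall back to the processor's defaults.
    return {}


# Intended use (requires the processor and messages defined earlier):
# inputs = processor.apply_chat_template(
#     messages, tokenize=True, add_generation_prompt=True,
#     return_dict=True, return_tensors="pt", **sampling_kwargs(num_frames=128),
# )
```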

Video Resolution Control

Configure the video resolution budget through the processor:

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# Budget for video processor
# Allow between 256 and 16384 visual tokens per video.
# Each token covers a 32x32 pixel patch across 2 frames
# (32× spatial compression + 2× temporal compression).
processor.video_processor.size = {
    "longest_edge": 16384 * 32 * 32 * 2,
    "shortest_edge": 256 * 32 * 32 * 2,
}

Understanding Video Size Parameters

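longest_edge: Maximum total pixels across all frames (T × H × W ≤ longest_edge)
shortest_edge: Minimum total pixels across all frames (the lower bound of the budget)
For Qwen3-VL: 32× spatial compression + 2× temporal compression, so a budget of N visual tokens corresponds to N × 32 × 32 × 2 pixels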
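The arithmetic behind these values can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, and the token estimate simply rounds each dimension up to the compression factor, whereas the real processor resizes frames and may produce a different count.

```python
import math

SPATIAL_FACTOR = 32   # pixels per visual token along height and along width
TEMPORAL_FACTOR = 2   # frames merged into one token along time


def tokens_to_pixel_budget(num_tokens: int) -> int:
    """Convert a visual-token budget into a total pixel budget (T x H x W)."""
    return num_tokens * SPATIAL_FACTOR * SPATIAL_FACTOR * TEMPORAL_FACTOR


def max_pixels_per_frame(pixel_budget: int, num_frames: int) -> int:
    """Largest H x W per frame that keeps T x H x W within the budget."""
    return pixel_budget // num_frames


def estimate_video_tokens(num_frames: int, height: int, width: int) -> int:
    """Approximate token count, rounding each dimension up to its factor."""
    return (
        math.ceil(num_frames / TEMPORAL_FACTOR)
        * math.ceil(height / SPATIAL_FACTOR)
        * math.ceil(width / SPATIAL_FACTOR)
    )


longest_edge = tokens_to_pixel_budget(16384)
shortest_edge = tokens_to_pixel_budget(256)
print(longest_edge, shortest_edge)
print(max_pixels_per_frame(longest_edge, 128))
print(estimate_video_tokens(16, 480, 640))
```

With 128 sampled frames, the upper budget leaves 262144 pixels per frame, which is the area of a 512 × 512 image. Sampling more frames therefore lowers the resolution each frame can keep.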

Performance Optimization

Recommended for video processing: enable flash_attention_2 for better memory efficiency with videos:

import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

Installation:

pip install -U flash-attn --no-build-isolation
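If flash-attn may not be installed on every machine, the attention implementation can be chosen at runtime. The sketch below is one possible approach, not part of the official instructions: `pick_attn_implementation` is a hypothetical helper, and it falls back to "sdpa", the PyTorch scaled-dot-product attention option in transformers. It only checks that the package is importable, not that the GPU supports it.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(f"Using attention implementation: {attn_implementation}")

# Intended use (requires torch and transformers, as in the example above):
# model = AutoModelForImageTextToText.from_pretrained(
#     "Qwen/Qwen3-VL-235B-A22B-Instruct",
#     dtype=torch.bfloat16,
#     attn_implementation=attn_implementation,
#     device_map="auto",
# )
```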

Video Backends

When using qwen-vl-utils, three video decoding backends are supported:

torchvision: Default backend (slower)
decord: Faster decoding (Linux recommended)
torchcodec: Fastest, recommended (requires FFmpeg)

# Install with decord support (quoted so the brackets survive shells such as zsh)
pip install "qwen-vl-utils[decord]"

# Or select the backend explicitly with an environment variable
export FORCE_QWENVL_VIDEO_READER=torchcodec

Backend Compatibility

Backend	HTTP	HTTPS
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
torchcodec	✅	✅
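The compatibility table can be encoded as a small lookup to choose a backend for a given video source. This is an illustrative sketch: `pick_backend` and the preference order (fastest first) are assumptions, the torchvision entry assumes version 0.19.0 or later, and local paths are assumed to work with every backend. The environment variable likely needs to be set before qwen-vl-utils is imported.

```python
import os
from urllib.parse import urlparse

# URL schemes each backend can read, per the table above.
SUPPORTED_SCHEMES = {
    "torchcodec": {"http", "https"},
    "decord": {"http"},
    "torchvision": {"http", "https"},  # assumes torchvision >= 0.19.0
}
# Fastest first, following the backend descriptions above.
PREFERENCE = ("torchcodec", "decord", "torchvision")


def pick_backend(video: str, installed=PREFERENCE) -> str:
    """Pick the fastest installed backend that can read `video`."""
    scheme = urlparse(video).scheme.lower()
    is_remote = scheme in ("http", "https")
    for name in PREFERENCE:
        if name not in installed:
            continue
        if is_remote and scheme not in SUPPORTED_SCHEMES[name]:
            continue
        return name
    raise RuntimeError(f"No installed backend can read {video!r}")


url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"
# With only decord and torchvision available, HTTPS rules out decord.
backend = pick_backend(url, installed=("decord", "torchvision"))
os.environ["FORCE_QWENVL_VIDEO_READER"] = backend
print(backend)
```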

Next Steps

Pixel Control

Generation Parameters