Basic Video Inference

Qwen3-VL accepts video input as a URL, a local file path, or a sequence of frames. The example below loads the model and asks it to describe a video fetched from a URL:
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", 
    dtype="auto", 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# Messages containing a video url (or a local path) and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video Input Formats

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
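
The local-path and frame-sequence forms mentioned at the top of this page fit the same message structure. The following is a minimal sketch rather than an official recipe: `build_video_message` is a hypothetical helper, the file paths are placeholders, and passing a list of frame image paths as the `video` value is an assumption that should be checked against your installed processor version.

```python
def build_video_message(video, prompt="Describe this video."):
    """Wrap a video source in a single-turn user message.

    `video` may be a URL, a local path, or (assumed) a list of frame paths.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Remote URL (same video as in the example above)
url_messages = build_video_message(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"
)

# Local file (placeholder path)
local_messages = build_video_message("/path/to/video.mp4")

# Pre-extracted frames (placeholder paths; assumed to be accepted as a list)
frame_paths = [f"/path/to/frames/{i:04d}.jpg" for i in range(4)]
frame_messages = build_video_message(frame_paths)
```

Each of these lists can be passed to processor.apply_chat_template in place of messages.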

Frame Sampling Control

Using FPS

Control the frame sampling rate using the fps parameter:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Set fps = 4 (default is 2)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    fps=4
)
inputs = inputs.to(model.device)
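
To get a feel for what changing fps does to the input size, the sketch below estimates how many frames a clip yields at a given sampling rate. `estimate_sampled_frames` is a hypothetical helper, and the simple duration × fps rule is an assumption: the real processor may additionally clamp or round the frame count.

```python
def estimate_sampled_frames(duration_s: float, fps: float = 2.0) -> int:
    """Rough number of frames sampled from a clip lasting `duration_s` seconds."""
    if duration_s <= 0 or fps <= 0:
        raise ValueError("duration_s and fps must be positive")
    # At least one frame is always kept in this simplified model.
    return max(1, int(duration_s * fps))


# A 30-second clip: 60 frames at the default fps=2, 120 frames at fps=4.
for fps in (2, 4):
    print(f"fps={fps}: ~{estimate_sampled_frames(30.0, fps)} frames")
```

Doubling fps roughly doubles the number of frames, and therefore the visual token count, for the same clip.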

Using Fixed Frame Count

Specify the exact number of frames to sample:
# Set num_frames = 128 and override fps with None
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=128,
    fps=None,
)
inputs = inputs.to(model.device)
Note: when num_frames is set, also pass fps=None so that the two sampling parameters do not conflict.
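
For intuition, fixed-count sampling can be pictured as picking evenly spaced frame indices across the clip. The sketch below illustrates that idea only; `uniform_frame_indices` is a hypothetical helper, and the exact indices chosen by the actual video processor are assumed, not confirmed, to follow a similar rule.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list:
    """Pick `num_frames` evenly spaced indices from a clip of `total_frames` frames."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Never request more frames than the clip contains.
    num_frames = min(num_frames, total_frames)
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


# 4 frames from a 10-frame clip: first and last frames are included.
print(uniform_frame_indices(10, 4))  # [0, 3, 6, 9]
```

With a fixed count, long and short videos produce the same number of frames, which keeps the token budget predictable.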

Video Resolution Control

Configure the video resolution budget on the processor:
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# Budget for the video processor.
# Each visual token covers a 32×32 pixel patch across 2 frames
# (32× spatial compression per side + 2× temporal compression),
# so this sets the per-video budget to 256-16384 visual tokens.
processor.video_processor.size = {
    "longest_edge": 16384*32*32*2, 
    "shortest_edge": 256*32*32*2
}

Understanding Video Size Parameters

  • longest_edge: Maximum total pixels across all frames (T × H × W ≤ longest_edge)
  • shortest_edge: Minimum total pixel budget for the video
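  • For Qwen3-VL, each visual token corresponds to 32 × 32 × 2 pixels (32× spatial compression per side, 2× temporal compression), so a token budget is converted to pixels by multiplying by 32 × 32 × 2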
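
The arithmetic behind these values can be worked through in a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, and the token estimate assumes the frame count and frame dimensions are already multiples of the patch sizes, whereas the real processor resizes frames to fit the budget.

```python
SPATIAL_PATCH = 32    # pixels per token along height and along width
TEMPORAL_PATCH = 2    # frames merged into one token
PIXELS_PER_TOKEN = SPATIAL_PATCH * SPATIAL_PATCH * TEMPORAL_PATCH  # 2048


def token_budget_to_pixels(num_tokens: int) -> int:
    """Convert a visual-token budget into a total pixel budget (T × H × W)."""
    return num_tokens * PIXELS_PER_TOKEN


def estimate_video_tokens(frames: int, height: int, width: int) -> int:
    """Approximate visual tokens for a video of `frames` frames at height × width."""
    return (frames * height * width) // PIXELS_PER_TOKEN


size = {
    "longest_edge": token_budget_to_pixels(16384),   # 33,554,432 pixels
    "shortest_edge": token_budget_to_pixels(256),    # 524,288 pixels
}
print(size)

# 16 frames at 448 × 448: (448/32) * (448/32) * (16/2) = 14 * 14 * 8 = 1568 tokens
print(estimate_video_tokens(16, 448, 448))
```

The resulting dictionary has the same values as the size setting shown above and could be assigned to processor.video_processor.size.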

Performance Optimization

Recommended for video processing: enable flash_attention_2 for better memory efficiency with videos:
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
Installation:
pip install -U flash-attn --no-build-isolation
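
Since flash-attn does not build on every platform, it can help to fall back gracefully when it is missing. The sketch below is one possible approach, not part of the official instructions: `pick_attn_implementation` is a hypothetical helper, and falling back to "sdpa" (PyTorch scaled dot-product attention, a value accepted by attn_implementation in transformers) is an assumption about what suits your setup.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is installed, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(f"Using attention implementation: {attn_implementation}")

# The result can then be passed when loading the model, for example:
# model = AutoModelForImageTextToText.from_pretrained(
#     "Qwen/Qwen3-VL-235B-A22B-Instruct",
#     dtype=torch.bfloat16,
#     attn_implementation=attn_implementation,
#     device_map="auto",
# )
```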

Video Backends

When using qwen-vl-utils, three video decoding backends are supported:
  • torchvision: Default backend (slower)
  • decord: Faster decoding (Linux recommended)
  • torchcodec: Fastest, recommended (requires FFmpeg)
# Install with decord support
pip install "qwen-vl-utils[decord]"

# Or set backend manually
export FORCE_QWENVL_VIDEO_READER=torchcodec
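
The same environment variable can also be set from Python. The sketch below picks the first installed backend in the order torchcodec, decord, torchvision; `choose_video_backend` is a hypothetical helper, and the assumption is that the variable must be set before qwen-vl-utils is first used for it to take effect.

```python
import importlib.util
import os


def choose_video_backend(preferred=("torchcodec", "decord", "torchvision")):
    """Return the first backend in `preferred` whose package is installed, or None."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = choose_video_backend()
if backend is not None:
    # Assumed to be read by qwen-vl-utils when it selects a video reader.
    os.environ["FORCE_QWENVL_VIDEO_READER"] = backend
print(f"Selected video backend: {backend}")
```

If none of the three packages is installed, the variable is left untouched and the library's default selection applies.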

Backend Compatibility

Backend                 HTTP    HTTPS
torchvision >= 0.19.0
torchvision < 0.19.0
decord
torchcodec

Next Steps

Pixel Control

Advanced video resolution control with qwen-vl-utils

Generation Parameters

Configure temperature, top_p, and other parameters