Long Document Understanding

Qwen3-VL supports a native context length of 256K tokens, which can be extended to 1M tokens. This lets it process entire books and other long documents in a single pass while recalling details from anywhere in the context.

Capability Overview

The long document understanding feature enables you to:

Process documents up to 256K tokens natively
Extend context to 1M tokens with YaRN
Handle entire books and lengthy reports
Maintain full recall across long contexts
Parse complex document structures
Extract information from multi-page documents
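
To make the context limits above concrete, the sketch below estimates how many page images fit in a given context window. It is a rough planning aid under stated assumptions, not part of the Qwen3-VL API: `pages_that_fit` is a hypothetical helper, 256K is taken to mean 262,144 tokens, and the per-page token count is a placeholder you would replace with a value measured from your own processor output.

```python
NATIVE_CONTEXT_TOKENS = 256 * 1024  # assumption: "256K" means 262,144 tokens


def pages_that_fit(context_tokens: int, tokens_per_page: int, reserved_tokens: int = 2048) -> int:
    """Estimate how many page images fit in the context window.

    reserved_tokens sets aside room for the text prompt and the generated answer.
    """
    if tokens_per_page <= 0:
        raise ValueError("tokens_per_page must be positive")
    available = max(context_tokens - reserved_tokens, 0)
    return available // tokens_per_page


# Assumed figure for illustration only; measure it on your own pages,
# since the real count depends on image resolution.
ASSUMED_TOKENS_PER_PAGE = 1200

print(pages_that_fit(NATIVE_CONTEXT_TOKENS, ASSUMED_TOKENS_PER_PAGE))
```

If the estimate is smaller than the number of pages in your document, you can lower the image resolution or use the extended context.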

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# For multi-page documents, pass one image per page, in reading order
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "page1.jpg"},
            {"type": "image", "image": "page2.jpg"},
            {"type": "image", "image": "page3.jpg"},
            {"type": "text", "text": "Summarize the key points from this document."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate the answer, then strip the prompt tokens so only the new text is decoded
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
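
For documents with many pages, writing each image entry by hand becomes impractical. The sketch below builds the same message structure from a list of page image paths. `build_document_messages` is a hypothetical helper introduced here for illustration, and it assumes the caller supplies the paths already in reading order.

```python
def build_document_messages(page_paths, question):
    """Build a single-turn chat message with one image entry per page.

    page_paths: image paths, already in reading order.
    question: the text instruction placed after the page images.
    """
    if not page_paths:
        raise ValueError("at least one page image is required")
    # One {"type": "image"} entry per page, followed by the text prompt
    content = [{"type": "image", "image": str(path)} for path in page_paths]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_document_messages(
    ["page1.jpg", "page2.jpg", "page3.jpg"],
    "Summarize the key points from this document.",
)
```

The resulting `messages` list has the same shape as the hand-written one in the example, so it can be passed to `processor.apply_chat_template` unchanged.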

Try it Yourself

Explore the full long document understanding cookbook with interactive examples:

View on GitHub
