Qwen3-VL is built for semantic comprehension of ultra-long documents: it supports a native context length of 256K tokens, which can be extended to 1M tokens. This lets it process entire books and other lengthy documents in a single context while retaining recall of the full content.

Capability Overview

The long document understanding feature enables you to:
  • Process documents up to 256K tokens natively
  • Extend context to 1M tokens with YaRN
  • Handle entire books and lengthy reports
  • Maintain full recall across long contexts
  • Parse complex document structures
  • Extract information from multi-page documents
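
To make the two context limits above concrete, the following is a rough budgeting sketch, not part of the Qwen3-VL API. It assumes 256K means 262,144 tokens and 1M means 1,000,000 tokens, and the per-page token cost is a hypothetical figure you would measure for your own pages and image resolution. The names plan_context and max_pages are illustrative.

```python
# Assumed token counts for the two limits named in the overview.
NATIVE_CONTEXT = 256 * 1024      # 256K tokens (assumption: 262,144)
EXTENDED_CONTEXT = 1_000_000     # 1M tokens (assumption: 1,000,000)


def plan_context(num_pages, tokens_per_page, prompt_tokens=0, reserve_for_output=1024):
    """Classify a document by the context tier it would need.

    tokens_per_page is a caller-supplied estimate (hypothetical here);
    reserve_for_output leaves room for the generated answer.
    """
    total = num_pages * tokens_per_page + prompt_tokens + reserve_for_output
    if total <= NATIVE_CONTEXT:
        return "native", total
    if total <= EXTENDED_CONTEXT:
        return "extended", total
    return "too_long", total


def max_pages(context_tokens, tokens_per_page, prompt_tokens=0, reserve_for_output=1024):
    """Largest page count that fits in context_tokens under the same estimate."""
    usable = context_tokens - prompt_tokens - reserve_for_output
    return max(usable // tokens_per_page, 0)


if __name__ == "__main__":
    # Example with an assumed cost of 1,000 tokens per page.
    print(plan_context(num_pages=100, tokens_per_page=1000))
    print(max_pages(NATIVE_CONTEXT, tokens_per_page=1000))
```

A document classified as "extended" would need the extended context configuration, and one classified as "too_long" would have to be split across several requests.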

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# For multi-page documents, provide multiple images
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "page1.jpg"},
            {"type": "image", "image": "page2.jpg"},
            {"type": "image", "image": "page3.jpg"},
            {"type": "text", "text": "Summarize the key points from this document."},
        ],
    }
]

# Apply the chat template and tokenize the text together with the page images
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens so only the newly generated text is decoded
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
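
For documents with many pages, writing one image entry per page by hand becomes tedious. The sketch below builds the same messages structure as the example above from a list of page image paths. The helper name build_document_messages is hypothetical and not part of transformers; the page file names are placeholders.

```python
def build_document_messages(page_paths, question):
    """Build a single-turn chat message with one image entry per page.

    Pages are kept in the order given, followed by the text question,
    mirroring the hand-written messages list in the example above.
    """
    if not page_paths:
        raise ValueError("page_paths must contain at least one page image")
    content = [{"type": "image", "image": str(path)} for path in page_paths]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# Placeholder file names: page1.jpg, page2.jpg, page3.jpg
messages = build_document_messages(
    [f"page{i}.jpg" for i in range(1, 4)],
    "Summarize the key points from this document.",
)
```

The resulting messages list can be passed to processor.apply_chat_template in the same way as the hand-written one.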

Try it Yourself

Explore the full long document understanding cookbook, with interactive examples, on Colab or GitHub.