Overview
Generation parameters control the sampling behavior and output characteristics of Qwen3-VL. Proper configuration can significantly impact response quality, creativity, and coherence.
Basic Generation Parameters
Key Parameters
max_new_tokens
Controls the maximum length of the generated response.
- Default: 128-512 tokens for most use cases
- Document analysis: 1024-2048 tokens
- Long videos: Up to 4096 tokens
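To illustrate how a token budget bounds generation, here is a minimal sketch of a decoding loop that stops at an end-of-sequence token or at `max_new_tokens`, whichever comes first. The `generate` and `count_up` functions and the `EOS` id are hypothetical stand-ins for a real model, not part of any Qwen3-VL API.

```python
from typing import Callable, List

EOS = -1  # hypothetical end-of-sequence token id for this toy example


def generate(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Append tokens until EOS is produced or the budget is exhausted."""
    new_tokens: List[int] = []
    while len(new_tokens) < max_new_tokens:
        token = next_token(prompt + new_tokens)
        if token == EOS:
            break
        new_tokens.append(token)
    return new_tokens


def count_up(context: List[int]) -> int:
    # Toy "model": emits last + 1 and signals EOS once the last token is 5.
    last = context[-1]
    return EOS if last >= 5 else last + 1


short = generate(count_up, [0], max_new_tokens=3)   # budget reached first
full = generate(count_up, [0], max_new_tokens=10)   # EOS reached first
print(short, full)
```

A budget that is too small truncates the answer mid-thought, while an oversized budget costs nothing extra if the model stops on its own, so the limit mainly guards against runaway outputs.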
temperature
Controls randomness in generation. Higher values produce more creative but less focused outputs.
- 0.1-0.3: Factual tasks, OCR, data extraction
- 0.6-0.8: General description, Q&A
- 0.9-1.2: Creative writing, brainstorming
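The effect of temperature can be seen by dividing the logits by the temperature before the softmax. The sketch below uses made-up logits for three candidate tokens; the function name `softmax_with_temperature` is illustrative only.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Scale logits by 1/temperature, then normalize into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, and higher temperatures flatten the distribution so that less likely tokens are sampled more often.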
top_p (Nucleus Sampling)
Limits sampling to the smallest set of tokens whose cumulative probability exceeds p.
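A minimal sketch of nucleus filtering over a toy distribution, assuming probabilities are given as a plain dictionary. The helper `top_p_filter` and the example tokens are hypothetical; real implementations operate on logit tensors.

```python
from typing import Dict


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their cumulative mass reaches p,
    then renormalize the survivors so they sum to 1."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


probs = {"cat": 0.5, "dog": 0.3, "fox": 0.15, "eel": 0.05}
print(top_p_filter(probs, p=0.9))  # the low-probability tail is dropped
```

Because the cutoff follows the shape of the distribution, the number of surviving tokens shrinks when the model is confident and grows when it is uncertain.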
top_k
Limits sampling to the top K most probable tokens.
do_sample
Enables sampling-based generation. Set to False for greedy decoding.
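The interaction between top_k and do_sample can be sketched on a toy distribution: top-k trims the candidate set, and do_sample decides whether the next token is drawn at random or taken greedily. The helpers `top_k_filter` and `pick_token` are hypothetical names used only for illustration.

```python
import random
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize them."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {token: prob / total for token, prob in top}


def pick_token(probs: Dict[str, float], do_sample: bool, rng: random.Random) -> str:
    """Greedy decoding takes the argmax; sampling draws by probability."""
    if not do_sample:
        return max(probs, key=lambda token: probs[token])
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"cat": 0.5, "dog": 0.3, "fox": 0.15, "eel": 0.05}
candidates = top_k_filter(probs, k=2)
rng = random.Random(0)
print(pick_token(candidates, do_sample=False, rng=rng))  # always the argmax
print(pick_token(candidates, do_sample=True, rng=rng))   # one of the top 2
```

With greedy decoding, temperature, top_p, and top_k have no effect on the chosen token, since the most probable token is selected regardless of how the rest of the distribution is shaped.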
Official Evaluation Settings
Instruct Models
Recommended settings for Qwen3-VL-Instruct models:
Thinking Models
Recommended settings for Qwen3-VL-Thinking models:
Advanced Parameters
Repetition Control
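One common form of repetition control is a multiplicative penalty on the logits of tokens that have already appeared. The sketch below follows the rule used by typical `repetition_penalty` implementations (positive scores are divided by the penalty, negative scores are multiplied by it); the function name and the toy logits are assumptions for illustration.

```python
from typing import Dict, Iterable


def apply_repetition_penalty(
    logits: Dict[str, float],
    seen: Iterable[str],
    penalty: float,
) -> Dict[str, float]:
    """Lower the score of every token that was already generated.

    A penalty of 1.0 leaves the logits unchanged; values above 1.0
    discourage repeats.
    """
    adjusted = dict(logits)
    for token in set(seen):
        if token in adjusted:
            score = adjusted[token]
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = {"the": 2.0, "cat": 1.0, "sat": -1.0}  # made-up scores
print(apply_repetition_penalty(logits, seen=["the", "sat"], penalty=2.0))
```

Dividing positive scores and multiplying negative ones ensures the penalty always pushes a repeated token toward lower probability, whatever the sign of its logit.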
Seed for Reproducibility
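Sampling is only repeatable when the random number generator starts from a known state. As a minimal sketch using Python's standard library (the vocabulary and weights are invented), the same seed yields the same sequence of sampled tokens:

```python
import random
from typing import List


def sample_tokens(seed: int, n: int = 5) -> List[str]:
    """Draw n tokens from a fixed toy distribution using a seeded generator."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    vocab = ["red", "green", "blue", "gray"]
    weights = [0.4, 0.3, 0.2, 0.1]
    return rng.choices(vocab, weights=weights, k=n)


first = sample_tokens(seed=42)
second = sample_tokens(seed=42)
print(first == second)  # identical seeds reproduce the same draws
```

In a PyTorch-based workflow the analogous step is usually a call such as `torch.manual_seed(...)` or `transformers.set_seed(...)` before generation. A seed only matters when do_sample is enabled, and results can still differ across hardware or library versions.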
Stop Sequences
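Stop sequences end a response as soon as a given string appears. When the inference stack does not handle them natively, the same effect can be approximated by trimming the decoded text afterwards. The helper below is a hypothetical post-processing sketch, not a Qwen3-VL API.

```python
from typing import Sequence


def truncate_at_stop(text: str, stop_sequences: Sequence[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence.

    The stop sequence itself is not included in the result.
    """
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty string would match at position 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = "Answer: 42\nQuestion: what comes next?"
print(truncate_at_stop(raw, ["\nQuestion:", "###"]))
```

Trimming after the fact still pays for the discarded tokens, so a native stop mechanism is preferable where one is available.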
Task-Specific Configurations
Image Description
OCR and Text Extraction
Video Understanding
Creative Tasks
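The four task categories above can be captured as presets. The sketch below picks values from inside the ranges given under Key Parameters. The exact numbers, the preset names, and the `generation_kwargs` helper are assumptions for illustration, not official Qwen3-VL settings.

```python
from typing import Any, Dict

# Assumed presets: each value sits inside a range quoted under Key Parameters,
# but the specific choice within the range is illustrative.
TASK_PRESETS: Dict[str, Dict[str, Any]] = {
    "image_description": {"do_sample": True, "temperature": 0.7, "max_new_tokens": 512},
    "ocr": {"do_sample": False, "max_new_tokens": 1024},  # greedy for factual output
    "video_understanding": {"do_sample": True, "temperature": 0.7, "max_new_tokens": 4096},
    "creative": {"do_sample": True, "temperature": 1.0, "max_new_tokens": 512},
}


def generation_kwargs(task: str, **overrides: Any) -> Dict[str, Any]:
    """Return a copy of the preset for a task, with optional overrides applied."""
    if task not in TASK_PRESETS:
        raise KeyError(f"unknown task: {task!r}")
    kwargs = dict(TASK_PRESETS[task])  # copy so the preset is never mutated
    kwargs.update(overrides)
    return kwargs


print(generation_kwargs("ocr"))
print(generation_kwargs("creative", max_new_tokens=256))
```

With a Hugging Face style interface, the resulting dictionary would typically be unpacked into the generation call, for example `model.generate(**inputs, **generation_kwargs("ocr"))`.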
Complete Example
Performance Tips
Optimization Recommendations:
- Use flash_attention_2 for faster generation
- Batch similar requests for better throughput
- Adjust max_new_tokens based on expected output length to save computation
- Use greedy decoding (do_sample=False) for deterministic, factual tasks
- Keep the KV cache enabled (it is on by default) for faster multi-turn conversations
Next Steps
Batch Inference
Process multiple requests for better throughput
Basic Usage
Review basic inference patterns