AutoModelForImageTextToText.from_pretrained
Load a Qwen3-VL model from a pretrained checkpoint using the Transformers library.

Parameters
`pretrained_model_name_or_path`
Model identifier from the Hugging Face Hub, or a local path to a model directory.

Examples:
- `"Qwen/Qwen3-VL-235B-A22B-Instruct"`
- `"Qwen/Qwen3-VL-30B-A3B-Instruct"`
- `"Qwen/Qwen3-VL-32B-Instruct"`
- `"Qwen/Qwen3-VL-8B-Instruct"`
- `"Qwen/Qwen3-VL-4B-Instruct"`
- `"Qwen/Qwen3-VL-2B-Instruct"`

`dtype` (`torch_dtype` in older Transformers releases)
Data type for model weights. Controls precision and memory usage.

Options:
- `"auto"` - Automatically select the optimal dtype
- `torch.bfloat16` - Brain floating point, 16-bit (recommended)
- `torch.float16` - Half-precision floating point
- `torch.float32` - Full-precision floating point

`device_map`
Device allocation strategy for model layers. Enables multi-GPU and CPU offloading.

Options:
- `"auto"` - Automatically distribute across available devices
- `"cuda"` - Load the entire model on a single GPU
- `"cpu"` - Load the model on CPU
- Custom dict mapping layers to devices

`attn_implementation`
Attention mechanism implementation. Use Flash Attention 2 for better performance.

Options:
- `"flash_attention_2"` - Fast and memory-efficient attention (recommended)
- `"eager"` - Standard PyTorch attention
- `"sdpa"` - Scaled dot-product attention

Flash Attention 2 requirements:
- Compatible GPU (Ampere, Ada, or Hopper architecture)
- dtype must be `torch.float16` or `torch.bfloat16`
- Install: `pip install -U flash-attn --no-build-isolation`
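
The three requirements above can be checked before requesting Flash Attention 2. The sketch below is one possible way to do that, not an official helper: the function name `choose_attn_implementation` is hypothetical, the fallback to `"sdpa"` is an assumption, and the GPU check assumes that Ampere and Ada report CUDA compute capability 8.x while Hopper reports 9.0.

```python
import importlib.util

HALF_PRECISION = {"float16", "bfloat16"}


def choose_attn_implementation(compute_capability, dtype_name, flash_attn_installed=None):
    """Return "flash_attention_2" only when all listed requirements hold.

    compute_capability: (major, minor) tuple for the GPU.
    dtype_name: dtype as a short string, e.g. "bfloat16".
    flash_attn_installed: override for testing; detected automatically if None.
    """
    if flash_attn_installed is None:
        # True when the flash-attn package can be imported in this environment.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None

    major, _minor = compute_capability
    gpu_ok = major in (8, 9)  # assumption: Ampere/Ada = 8.x, Hopper = 9.0
    dtype_ok = dtype_name in HALF_PRECISION

    if gpu_ok and dtype_ok and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"  # assumed fallback; "eager" is the other listed option


print(choose_attn_implementation((8, 0), "bfloat16", flash_attn_installed=True))
print(choose_attn_implementation((7, 5), "float16", flash_attn_installed=True))
```

With PyTorch and a CUDA device available, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple expected by the first argument.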
Returns
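Returns a `Qwen3VLForConditionalGeneration` model instance ready for inference. For the mixture-of-experts checkpoints (235B-A22B, 30B-A3B), the auto class resolves to `Qwen3VLMoeForConditionalGeneration` instead.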
Example Usage
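
The original example for this section is not available, so the following is a minimal sketch rather than the page's own code. It assumes a recent Transformers release in which the data type keyword is `dtype` (older releases spell it `torch_dtype`). The helper names `load_kwargs` and `load_qwen3_vl` are hypothetical, and the heavy imports are deferred so that defining the helpers does not require a GPU or a download.

```python
MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"  # any identifier from the parameter list


def load_kwargs(use_flash_attention: bool = False) -> dict:
    """Build keyword arguments for from_pretrained.

    Default: let Transformers pick the dtype and spread layers across devices.
    With use_flash_attention=True: pin bfloat16, which Flash Attention 2 requires
    (float16 would also satisfy the requirement).
    """
    kwargs = {"dtype": "auto", "device_map": "auto"}
    if use_flash_attention:
        import torch  # imported lazily; only needed for the explicit dtype

        kwargs["dtype"] = torch.bfloat16
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_qwen3_vl(model_id: str = MODEL_ID, use_flash_attention: bool = False):
    """Download (or read from a local path) and load the model."""
    from transformers import AutoModelForImageTextToText

    return AutoModelForImageTextToText.from_pretrained(
        model_id, **load_kwargs(use_flash_attention)
    )


# Typical call (downloads the checkpoint on first use, so it is left commented out):
# model = load_qwen3_vl(use_flash_attention=True)
```

Swapping `MODEL_ID` for a local directory path loads from disk instead of the Hub.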
Notes
For multi-image and video scenarios, Flash Attention 2 is strongly recommended for better acceleration and memory efficiency.
Memory Requirements
Model memory usage varies by precision:

| Model Size | BF16 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| 2B | ~4 GB | ~4 GB | ~2 GB | ~1 GB |
| 8B | ~16 GB | ~16 GB | ~8 GB | ~4 GB |
| 32B | ~64 GB | ~64 GB | ~32 GB | ~16 GB |
| 235B-A22B | ~470 GB | ~470 GB | ~235 GB | ~118 GB |
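
The dense-model rows above are consistent with a simple rule: parameter count times bytes per parameter. The sketch below reproduces that arithmetic under stated assumptions: it counts weights only (no activations or KV cache), uses nominal parameter counts, and takes 1 GB as 10^9 bytes. The function name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 parameters * bytes_per_param / 1e9 bytes per GB
    simplifies to params_billion * bytes_per_param.
    """
    return params_billion * BYTES_PER_PARAM[precision]


# Print estimates for the dense model sizes in the table.
for size in (2, 8, 32):
    row = {p: weight_memory_gb(size, p) for p in BYTES_PER_PARAM}
    print(f"{size}B:", row)
```

Actual usage at inference time is higher, since activations and the KV cache grow with sequence length, image count, and batch size.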