Training Script - Qwen3-VL

Complete Training Script

Here’s the complete training script with full parameter documentation:

launch_training.sh

#!/bin/bash
# Complete QwenVL Training Launch Script with Full Parameter Documentation

# ======================
# Distributed Configuration
# ======================
# Each value keeps its environment setting if one is exported, otherwise uses the default.
MASTER_ADDR="${MASTER_ADDR:-127.0.0.1}"                                # [Required] Master node IP for multi-GPU training
MASTER_PORT="${MASTER_PORT:-$(shuf -i 20000-29999 -n 1)}"              # Random port to avoid conflicts
NPROC_PER_NODE="${NPROC_PER_NODE:-$(nvidia-smi --list-gpus | wc -l)}"  # Automatically detects available GPUs

# ======================
# Path Configuration
# ======================
MODEL_PATH="/path/to/Qwen2.5-VL-3B-Instruct"  # [ModelArguments] Pretrained model path
OUTPUT_DIR="./checkpoints"                   # Directory for saving checkpoints
CACHE_DIR="./cache"                          # [TrainingArguments] Cache directory for models

# ======================
# Data Configuration
# ======================
DATASETS="your_dataset%100"                  # [DataArguments] Dataset with sampling rate

# ======================
# Training Arguments
# ======================
# Collected in an array so each flag can carry a comment; a comment placed
# after a line-continuation backslash would cut the command short.
TRAIN_ARGS=(
    # Core Arguments
    --model_name_or_path "$MODEL_PATH"   # [ModelArguments] Model identifier
    --tune_mm_llm True                   # [TrainingArguments] Train LLM or not
    --tune_mm_vision False               # [TrainingArguments] Train ViT or not
    --tune_mm_mlp False                  # [TrainingArguments] Train MLP or not
    --dataset_use "$DATASETS"            # [DataArguments] Dataset specification
    --output_dir "$OUTPUT_DIR"           # Output directory for checkpoints
    --cache_dir "$CACHE_DIR"             # [TrainingArguments] Model cache location

    # Precision & Memory
    --bf16                               # Use bfloat16 precision (Ampere+ GPUs)
    --per_device_train_batch_size 4      # Batch size per GPU
    --gradient_accumulation_steps 4      # Effective batch size multiplier

    # Learning Rate Configuration
    --learning_rate 2e-7                 # Base learning rate
    --mm_projector_lr 1e-5               # [TrainingArguments] Projector-specific LR
    --vision_tower_lr 1e-6               # [TrainingArguments] Vision encoder LR
    --optim adamw_torch                  # [TrainingArguments] Optimizer selection

    # Sequence Configuration
    --model_max_length 4096              # [TrainingArguments] Max sequence length
    --data_flatten True                  # [DataArguments] Concatenate batch sequences
    --data_packing True                  # [DataArguments] Use packed data

    # Image Processing (the shell evaluates each product to a plain integer)
    --max_pixels $((576*28*28))          # [DataArguments] Max image pixels (H*W), = 451584
    --min_pixels $((16*28*28))           # [DataArguments] Min image pixels, = 12544

    # Video Processing
    --video_fps 2                        # [DataArguments] Video fps
    --video_max_frames 8                 # [DataArguments] Max frames per video
    --video_min_frames 4                 # [DataArguments] Min frames per video
    --video_max_pixels $((1664*28*28))   # [DataArguments] Max pixels per video, = 1304576
    --video_min_pixels $((256*28*28))    # [DataArguments] Min pixels per video, = 200704

    # Training Schedule
    --num_train_epochs 3                 # Total training epochs
    --warmup_ratio 0.03                  # LR warmup proportion
    --lr_scheduler_type "cosine"         # Learning rate schedule
    --weight_decay 0.01                  # L2 regularization strength

    # Logging & Checkpoints
    --logging_steps 10                   # Log metrics interval
    --save_steps 500                     # Checkpoint save interval
    --save_total_limit 3                 # Max checkpoints to keep

    # LoRA Config
    --lora_enable True                   # [TrainingArguments] Enable LoRA
    --lora_r 8                           # [TrainingArguments] LoRA r
    --lora_alpha 16                      # [TrainingArguments] LoRA alpha
    --lora_dropout 0.0                   # [TrainingArguments] LoRA dropout

    # Advanced Options
    --deepspeed zero3.json               # DeepSpeed configuration
)

# ======================
# Launch
# ======================
torchrun --nproc_per_node="$NPROC_PER_NODE" \
         --master_addr="$MASTER_ADDR" \
         --master_port="$MASTER_PORT" \
         qwenvl/train/train_qwen.py \
         "${TRAIN_ARGS[@]}"

Parameter Categories

The script accepts arguments in three main categories:

Model Arguments

Parameter	Description	Default
`--model_name_or_path`	Path or identifier for pretrained model	Required

Training Arguments

Component Training Flags

Parameter	Description	Recommended
`--tune_mm_llm`	Whether to train the language model	`True`
`--tune_mm_vision`	Whether to train the vision encoder	`False` (for mixed image/video)
`--tune_mm_mlp`	Whether to train the MLP projector	`False`

When training with both image and video data, set --tune_mm_vision False to avoid instability.

Precision & Memory

Parameter	Description	Value
`--bf16`	Use bfloat16 precision (requires Ampere+ GPUs)	Flag
`--per_device_train_batch_size`	Batch size per GPU	`4`
`--gradient_accumulation_steps`	Gradient accumulation steps	`4`
`--cache_dir`	Cache directory for models	`./cache`
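
The per-GPU batch size and the gradient accumulation steps combine with the number of GPUs to give the effective (global) batch size. The sketch below works through that arithmetic; the `effective_batch_size` helper and the 8-GPU count are illustrative assumptions, not part of the training script.

```shell
#!/bin/bash
# Hypothetical helper: effective batch = per-device batch * accumulation steps * GPU count.
effective_batch_size() {
    local per_device=$1 grad_accum=$2 num_gpus=$3
    echo $(( per_device * grad_accum * num_gpus ))
}

# Values from the table above, with an assumed 8-GPU node.
effective_batch_size 4 4 8

# Halving the per-device batch while doubling accumulation keeps the same total.
effective_batch_size 2 8 8
```

Both calls print 128, which is why lowering `--per_device_train_batch_size` and raising `--gradient_accumulation_steps` together can save memory without changing the global batch.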

Learning Rate Configuration

Parameter	Description	Range
`--learning_rate`	Base learning rate for the model	`1e-6` to `2e-7`
`--mm_projector_lr`	Learning rate for multimodal projector	`1e-5`
`--vision_tower_lr`	Learning rate for vision encoder	`1e-6`
`--optim`	Optimizer type	`adamw_torch`

The suggested learning rate range is from 1e-6 to 2e-7. Start with 2e-7 for stable training.

Sequence Configuration

Parameter	Description	Value
`--model_max_length`	Maximum sequence length	`4096`

Data Arguments

Dataset Selection

Parameter	Description	Example
`--dataset_use`	Dataset names with sampling rates	`"my_dataset%100"`
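
To make the `name%rate` form concrete, here is a minimal sketch that splits such a spec into the dataset name and the sampling rate. The `parse_dataset_spec` helper is hypothetical, and treating a spec without `%` as 100 percent is an assumption rather than documented behavior of the training code.

```shell
#!/bin/bash
# Split a "name%rate" dataset spec into its name and sampling rate (percent).
# Assumption: a spec without "%" is treated as 100 percent.
parse_dataset_spec() {
    local spec=$1
    if [[ $spec == *%* ]]; then
        # ${spec%\%*} drops the last "%" and what follows; ${spec##*%} keeps only what follows it.
        echo "${spec%\%*} ${spec##*%}"
    else
        echo "$spec 100"
    fi
}

parse_dataset_spec "my_dataset%100"
parse_dataset_spec "my_dataset%50"
```

The two calls print `my_dataset 100` and `my_dataset 50`.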

Data Processing

Parameter	Description	Default
`--data_flatten`	Concatenate batch sequences into one	`True`
`--data_packing`	Use packed data (requires preprocessing)	`True`

`--data_flatten True` concatenates all samples in a batch into a single sequence.
`--data_packing True` expects data that has already been packed with tools/pack_data.py.

Image Processing

Parameter	Description	Value
`--max_pixels`	Maximum image pixels (H×W)	`576*28*28`
`--min_pixels`	Minimum image pixels	`16*28*28`

Video Processing

Parameter	Description	Value
`--video_fps`	Video frames per second	`2`
`--video_max_frames`	Maximum frames per video	`8`
`--video_min_frames`	Minimum frames per video	`4`
`--video_max_pixels`	Maximum pixels per video	`1664*28*28`
`--video_min_pixels`	Minimum pixels per video	`256*28*28`

Training resolution is critical for model performance. Ensure --max_pixels and --min_pixels are properly set for your use case.
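
The pixel limits in launch_training.sh are written as products of the form N*28*28, that is, N blocks of 28×28 pixels. As a quick sanity check, the sketch below expands each one into the integer it denotes; the `pixel_budget` helper is hypothetical.

```shell
#!/bin/bash
# Expand an "N*28*28" pixel budget into the integer it denotes.
pixel_budget() {
    local blocks=$1
    echo $(( blocks * 28 * 28 ))
}

pixel_budget 576     # --max_pixels
pixel_budget 16      # --min_pixels
pixel_budget 1664    # --video_max_pixels
pixel_budget 256     # --video_min_pixels
```

This prints 451584, 12544, 1304576 and 200704. For scale, 451584 pixels corresponds to a square image of 672×672.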

Training Schedule

Parameter	Description	Value
`--num_train_epochs`	Total training epochs	`3`
`--warmup_ratio`	Learning rate warmup proportion	`0.03`
`--lr_scheduler_type`	Learning rate schedule	`cosine`
`--weight_decay`	L2 regularization strength	`0.01`
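
To see what a warmup ratio of 0.03 means in steps, the sketch below estimates the total number of optimizer steps and the warmup portion. The dataset size (100000 samples), the effective batch size (128) and the `estimate_steps` helper are all hypothetical, and the trainer's exact rounding may differ slightly.

```shell
#!/bin/bash
# Rough estimate of optimizer steps and warmup steps.
# Assumptions: 100000 training samples and an effective batch size of 128 (both hypothetical).
estimate_steps() {
    local samples=$1 eff_batch=$2 epochs=$3 warmup_pct=$4
    local per_epoch=$(( (samples + eff_batch - 1) / eff_batch ))   # ceiling division
    local total=$(( per_epoch * epochs ))
    local warmup=$(( (total * warmup_pct + 99) / 100 ))            # ceiling of total * pct / 100
    echo "$total $warmup"
}

# 3 epochs, 3 percent warmup (the values from the table above).
estimate_steps 100000 128 3 3
```

Under these assumptions the call prints `2346 71`: about 2346 optimizer steps in total, of which the first 71 are warmup. With `--save_steps 500`, that run would write four intermediate checkpoints.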

Logging & Checkpoints

Parameter	Description	Value
`--logging_steps`	Interval for logging metrics	`10`
`--save_steps`	Interval for saving checkpoints	`500`
`--save_total_limit`	Maximum checkpoints to keep	`3`

Advanced Options

DeepSpeed Configuration

--deepspeed zero3.json

Provide a DeepSpeed configuration file for distributed training optimization.

The Qwen3VL MoE model does not support DeepSpeed with ZeRO-3. In addition, Hugging Face’s official implementation does not currently support the load-balancing loss.
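
For dense models, a ZeRO-3 configuration file could look roughly like the one written by the sketch below. The file name `zero3.example.json` is hypothetical, the selection of options is an assumption rather than the project's shipped config, and the `"auto"` values rely on the Hugging Face Trainer integration filling them in from its own arguments.

```shell
#!/bin/bash
# Write an illustrative ZeRO-3 config. The file name is hypothetical;
# "auto" lets the Hugging Face Trainer fill values from its own arguments.
cat > zero3.example.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF
echo "wrote zero3.example.json"
```

Such a file would then be passed as `--deepspeed zero3.example.json`.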

Flash Attention

To enable Flash Attention 2, add the following to your model’s config.json:

config.json

{
  "_attn_implementation": "flash_attention_2",
  ...
}
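
Before launching, it can help to confirm that the setting is really present in the checkpoint's config. The sketch below is one way to check; the `uses_flash_attention_2` helper is hypothetical and only inspects the file, not whether the flash-attn package is installed.

```shell
#!/bin/bash
# Report whether a model config.json requests Flash Attention 2.
# Returns success (0) when the key is set to "flash_attention_2".
uses_flash_attention_2() {
    local config=$1
    grep -Eq '"_attn_implementation"[[:space:]]*:[[:space:]]*"flash_attention_2"' "$config"
}

# Usage with a placeholder model directory:
# uses_flash_attention_2 /path/to/model/config.json && echo "Flash Attention 2 requested"
```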

Hardware Requirements

Training Qwen2.5-VL-3B

Minimum requirements:

4x GPUs with 24GB VRAM (e.g., RTX 3090, RTX 4090)
With DeepSpeed ZeRO-3 and gradient checkpointing

Training Qwen2.5-VL-32B

Recommended configuration:

8x 80GB GPUs (e.g., A100, H100)
Refer to scripts/sft_32b.sh for configuration

Example Usage

Basic Training

bash launch_training.sh

Single GPU Training

NPROC_PER_NODE=1 bash launch_training.sh

Multi-node Training

On the master node:

MASTER_ADDR="192.168.1.1" bash launch_training.sh

On worker nodes:

MASTER_ADDR="192.168.1.1" bash launch_training.sh
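
With torchrun's static rendezvous, every node must use the same master address and port, and each must pass `--nnodes` together with its own `--node_rank`. The launch script draws its port at random by default and passes neither flag, so the two commands above need those additions in practice. As a sketch, assuming two nodes with 8 GPUs each and a fixed port of 29500 (all assumptions, as is the `torchrun_prefix` helper), the launcher prefix for each node could be built like this:

```shell
#!/bin/bash
# Build the torchrun prefix for one node of a multi-node job (dry run: prints only).
# The node count, node rank and fixed port are assumptions, not values launch_training.sh reads.
torchrun_prefix() {
    local nnodes=$1 node_rank=$2 master_addr=$3 master_port=$4 nproc=$5
    echo "torchrun --nnodes=$nnodes --node_rank=$node_rank" \
         "--nproc_per_node=$nproc" \
         "--master_addr=$master_addr --master_port=$master_port"
}

torchrun_prefix 2 0 192.168.1.1 29500 8   # master node
torchrun_prefix 2 1 192.168.1.1 29500 8   # worker node
```

The two printed prefixes differ only in `--node_rank`; the training script path and its arguments follow unchanged on every node.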

Monitoring Training

Monitor your training progress:

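# View logs (assumes the launch output was saved to this file, e.g. with
# bash launch_training.sh 2>&1 | tee checkpoints/training.log)
tail -f checkpoints/training.log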

# Monitor GPU usage
watch -n 1 nvidia-smi

Troubleshooting

Out of Memory (OOM) Errors

Try these solutions:

Reduce --per_device_train_batch_size
Increase --gradient_accumulation_steps
Reduce --model_max_length
Enable gradient checkpointing in DeepSpeed config
Use DeepSpeed ZeRO-3 for larger models

Slow Training Speed

Optimize performance:

Enable Flash Attention 2 in config.json
Use --data_packing True with preprocessed data
Ensure --bf16 is enabled on Ampere+ GPUs
Check if GPU utilization is at 100%
Increase batch size if memory allows

Training Instability

If you see NaN losses or diverging training:

Lower the learning rate (try 1e-7 or 5e-8)
Set --tune_mm_vision False when using mixed image/video data
Increase warmup ratio to 0.05 or 0.1
Check data for corrupted images or invalid annotations
Reduce --max_pixels if processing very large images
