Complete Training Script
Here’s the complete training script with full parameter documentation:
launch_training.sh
Parameter Categories
The script accepts arguments in three main categories:
Model Arguments
| Parameter | Description | Default |
|---|---|---|
| `--model_name_or_path` | Path or identifier for the pretrained model | Required |
Training Arguments
Component Training Flags
| Parameter | Description | Recommended |
|---|---|---|
| `--tune_mm_llm` | Whether to train the language model | True |
| `--tune_mm_vision` | Whether to train the vision encoder | False (for mixed image/video) |
| `--tune_mm_mlp` | Whether to train the MLP projector | False |
Precision & Memory
| Parameter | Description | Value |
|---|---|---|
| `--bf16` | Use bfloat16 precision (requires Ampere+ GPUs) | Flag |
| `--per_device_train_batch_size` | Batch size per GPU | 4 |
| `--gradient_accumulation_steps` | Gradient accumulation steps | 4 |
| `--cache_dir` | Cache directory for models | `./cache` |
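The per-GPU batch size and the accumulation steps combine with the GPU count to give the number of samples behind each optimizer step. A minimal sketch of that arithmetic follows; the function name and the 4-GPU count are illustrative assumptions, not part of the training script.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Number of samples that contribute to one optimizer step across all GPUs."""
    return per_device * grad_accum * num_gpus


# Table values (4 and 4) on a hypothetical 4-GPU machine.
print(effective_batch_size(per_device=4, grad_accum=4, num_gpus=4))  # 64
```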
Learning Rate Configuration
| Parameter | Description | Range |
|---|---|---|
| `--learning_rate` | Base learning rate for the model | 2e-7 to 1e-6 |
| `--mm_projector_lr` | Learning rate for multimodal projector | 1e-5 |
| `--vision_tower_lr` | Learning rate for vision encoder | 1e-6 |
| `--optim` | Optimizer type | `adamw_torch` |
Sequence Configuration
| Parameter | Description | Value |
|---|---|---|
| `--model_max_length` | Maximum sequence length | 4096 |
Data Arguments
Dataset Selection
| Parameter | Description | Example |
|---|---|---|
| `--dataset_use` | Dataset names with sampling rates | `"my_dataset%100"` |
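The example value pairs a dataset name with a sampling rate after the `%` sign. The sketch below shows one plausible way to read such a string. It assumes the number is a percentage, that several datasets are separated by commas, and that a missing suffix means 100%. The function names are hypothetical and this is not the script's actual parser.

```python
import re


def parse_dataset_spec(spec: str) -> tuple[str, float]:
    """Split 'name%N' into (name, N / 100); a bare name is treated as 100%."""
    match = re.fullmatch(r"(.+)%(\d+)", spec)
    if match is None:
        return spec, 1.0
    return match.group(1), int(match.group(2)) / 100.0


def parse_dataset_use(value: str) -> list[tuple[str, float]]:
    """Parse a comma-separated list of dataset specs (separator is an assumption)."""
    return [parse_dataset_spec(part.strip()) for part in value.split(",") if part.strip()]


print(parse_dataset_use("my_dataset%100"))  # [('my_dataset', 1.0)]
```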
Data Processing
| Parameter | Description | Default |
|---|---|---|
| `--data_flatten` | Concatenate batch sequences into one | True |
| `--data_packing` | Use packed data (requires preprocessing) | True |
- `data_flatten=True` means the samples in a batch are concatenated into one sequence.
- `data_packing=True` requires preprocessing with `tools/pack_data.py`.
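To make the flattening idea concrete, here is a minimal sketch that concatenates a batch of token lists into one sequence and records where each sample starts and ends. Tracking boundaries as cumulative offsets is a common convention for variable-length attention kernels. Whether the training code stores them this way is an assumption, and `flatten_batch` is a hypothetical helper.

```python
from itertools import accumulate


def flatten_batch(sequences):
    """Concatenate token lists and return (flat_tokens, cumulative_boundaries)."""
    flat = [token for seq in sequences for token in seq]
    # Boundaries start at 0; entry i+1 is the end offset of sample i.
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return flat, cu_seqlens


batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
flat, cu_seqlens = flatten_batch(batch)
print(flat)        # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(cu_seqlens)  # [0, 3, 5, 9]
```

No padding tokens are needed in the flattened form, which is why it can save memory when sample lengths vary widely.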
Image Processing
| Parameter | Description | Value |
|---|---|---|
| `--max_pixels` | Maximum image pixels (H×W) | `576*28*28` |
| `--min_pixels` | Minimum image pixels | `16*28*28` |
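The pixel limits are written as multiples of `28*28`. Assuming each 28×28 pixel block becomes one visual token (this matches my understanding of Qwen2.5-VL preprocessing, but treat it as an assumption here), the limits translate directly into a per-image token budget:

```python
PIXELS_PER_TOKEN = 28 * 28  # assumed: one visual token per 28x28 pixel block


def token_budget(pixel_limit: int) -> int:
    """Approximate number of visual tokens allowed by a pixel limit."""
    return pixel_limit // PIXELS_PER_TOKEN


max_pixels = 576 * 28 * 28
min_pixels = 16 * 28 * 28
print(max_pixels, token_budget(max_pixels))  # 451584 576
print(min_pixels, token_budget(min_pixels))  # 12544 16
```

Under this assumption, lowering `--max_pixels` shortens the visual part of each sequence, which matters when `--model_max_length` is tight.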
Video Processing
| Parameter | Description | Value |
|---|---|---|
| `--video_fps` | Video frames per second | 2 |
| `--video_max_frames` | Maximum frames per video | 8 |
| `--video_min_frames` | Minimum frames per video | 4 |
| `--video_max_pixels` | Maximum pixels per video | `1664*28*28` |
| `--video_min_pixels` | Minimum pixels per video | `256*28*28` |
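The frame-rate and frame-count limits interact. The sketch below assumes frames are sampled at `--video_fps` and the resulting count is then clamped to the minimum and maximum. The actual sampler may round or space frames differently, and `frames_to_sample` is a hypothetical name.

```python
def frames_to_sample(duration_s: float, fps: float = 2.0,
                     min_frames: int = 4, max_frames: int = 8) -> int:
    """Frame count for a clip: fps-based target clamped to [min_frames, max_frames]."""
    target = int(duration_s * fps)
    return max(min_frames, min(target, max_frames))


print(frames_to_sample(1.0))   # 4  (2 frames at 2 fps, raised to the minimum)
print(frames_to_sample(3.0))   # 6
print(frames_to_sample(60.0))  # 8  (120 frames at 2 fps, capped at the maximum)
```

With the table defaults, any clip longer than four seconds would hit the 8-frame cap under this model.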
Training Schedule
| Parameter | Description | Value |
|---|---|---|
| `--num_train_epochs` | Total training epochs | 3 |
| `--warmup_ratio` | Learning rate warmup proportion | 0.03 |
| `--lr_scheduler_type` | Learning rate schedule | `cosine` |
| `--weight_decay` | L2 regularization strength | 0.01 |
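The warmup ratio and the cosine schedule together determine the learning rate at every step. The following is a simplified sketch of linear warmup followed by cosine decay to zero, assuming the warmup length is the ratio times the total step count, rounded up. The trainer's own scheduler may differ in details.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float,
               warmup_ratio: float = 0.03) -> float:
    """Learning rate with linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 1000 optimizer steps with a base learning rate of 1e-6.
for step in (0, 10, 30, 500, 1000):
    print(step, lr_at_step(step, total_steps=1000, base_lr=1e-6))
```

Raising `--warmup_ratio` lengthens the ramp, so the peak learning rate is reached later in training.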
Logging & Checkpoints
| Parameter | Description | Value |
|---|---|---|
| `--logging_steps` | Interval for logging metrics | 10 |
| `--save_steps` | Interval for saving checkpoints | 500 |
| `--save_total_limit` | Maximum checkpoints to keep | 3 |
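To see how the save interval and the checkpoint limit interact, the sketch below lists which checkpoint steps would remain on disk at a given point in training. It assumes the oldest checkpoints are deleted first, and `retained_checkpoints` is a hypothetical helper.

```python
def retained_checkpoints(current_step: int, save_steps: int = 500,
                         save_total_limit: int = 3) -> list[int]:
    """Steps of the checkpoints still on disk, assuming oldest-first deletion."""
    saved = list(range(save_steps, current_step + 1, save_steps))
    return saved[-save_total_limit:]


print(retained_checkpoints(2600))  # [1500, 2000, 2500]
```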
Advanced Options
DeepSpeed Configuration
Flash Attention
To enable Flash Attention 2, add the following to your model’s `config.json`:
config.json
Hardware Requirements
Training Qwen2.5-VL-3B
Minimum requirements:
- 4x GPUs with 24GB VRAM (e.g., RTX 3090, RTX 4090)
- With DeepSpeed ZeRO-3 and gradient checkpointing
Training Qwen2.5-VL-32B
Recommended configuration:
- 8x 80GB GPUs (e.g., A100, H100)
- Refer to `scripts/sft_32b.sh` for configuration
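Both hardware configurations above can be sanity-checked with a rough estimate of model-state memory. The sketch assumes mixed-precision AdamW at about 16 bytes per parameter (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights and the two optimizer moments), fully sharded across GPUs by ZeRO-3. It uses nominal parameter counts of 3e9 and 32e9 and ignores activations, buffers, and fragmentation.

```python
BYTES_PER_PARAM = 2 + 2 + 12  # bf16 weights + bf16 grads + fp32 master/Adam states (assumed)


def zero3_state_gb_per_gpu(num_params: float, num_gpus: int) -> float:
    """Approximate model-state memory per GPU in GB when fully sharded."""
    return num_params * BYTES_PER_PARAM / num_gpus / 1e9


print(zero3_state_gb_per_gpu(3e9, 4))   # 12.0
print(zero3_state_gb_per_gpu(32e9, 8))  # 64.0
```

Under these assumptions, the 3B model leaves roughly half of a 24GB card for activations, which is consistent with the need for gradient checkpointing.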
Example Usage
Basic Training
Single GPU Training
Multi-node Training
On the master node:
Monitoring Training
Monitor your training progress:
Troubleshooting
Out of Memory (OOM) Errors
Try these solutions:
- Reduce `--per_device_train_batch_size`
- Increase `--gradient_accumulation_steps` to keep the effective batch size unchanged after reducing the per-device batch size
- Reduce `--model_max_length`
- Enable gradient checkpointing in DeepSpeed config
- Use DeepSpeed ZeRO-3 for larger models
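The first two items in the list above are usually applied together. This small sketch picks the accumulation steps that preserve the original samples-per-step count per GPU after the per-device batch size is lowered. `rebalance` is a hypothetical helper, not a script option.

```python
def rebalance(per_device: int, grad_accum: int, new_per_device: int) -> tuple[int, int]:
    """Return (new_per_device, new_grad_accum) with the same product as before."""
    total = per_device * grad_accum
    if new_per_device <= 0 or total % new_per_device != 0:
        raise ValueError("new per-device batch size must evenly divide the original product")
    return new_per_device, total // new_per_device


print(rebalance(4, 4, 2))  # (2, 8)
print(rebalance(4, 4, 1))  # (1, 16)
```

Peak activation memory follows the per-device batch size, so the rebalanced settings reduce memory while keeping the optimization behavior roughly the same, at the cost of more forward/backward passes per step.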
Slow Training Speed
Optimize performance:
- Enable Flash Attention 2 in `config.json`
- Use `--data_packing True` with preprocessed data
- Ensure `--bf16` is enabled on Ampere+ GPUs
- Check whether GPU utilization is close to 100%
- Increase batch size if memory allows
Training Instability
If you see NaN losses or diverging training:
- Lower the learning rate (try `1e-7` or `5e-8`)
- Set `--tune_mm_vision False` when using mixed image/video data
- Increase warmup ratio to `0.05` or `0.1`
- Check data for corrupted images or invalid annotations
- Reduce `--max_pixels` if processing very large images
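For the data check in the list above, a first pass can be done with the standard library alone. The sketch assumes each annotation is a dict with an optional `image` path (relative to an image root) and a non-empty `conversations` list. Those key names are assumptions about the annotation format, and `find_bad_samples` is a hypothetical helper. It only catches missing or zero-byte files; decoding each file with an image library would be a stricter test for corruption.

```python
import os


def find_bad_samples(samples, image_root):
    """Return (index, reason) pairs for samples that fail basic sanity checks."""
    problems = []
    for idx, sample in enumerate(samples):
        if not sample.get("conversations"):
            problems.append((idx, "missing or empty conversations"))
        image = sample.get("image")
        if image is None:
            continue  # treated as a text-only sample
        path = os.path.join(image_root, image)
        if not os.path.isfile(path):
            problems.append((idx, "image file not found"))
        elif os.path.getsize(path) == 0:
            problems.append((idx, "image file is empty"))
    return problems
```

Running this over the annotation list before training gives a short report of indices to inspect or drop.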