Training Arguments

ModelArguments

Configuration for model initialization and component fine-tuning.

model_name_or_path

str

default:"Qwen/Qwen2.5-VL-3B-Instruct"

Path to pretrained model or model identifier from Hugging Face Hub. Supports:

Qwen/Qwen2-VL-*
Qwen/Qwen2.5-VL-*
Qwen/Qwen3-VL-*
Qwen/Qwen3-VL-MoE-*
Local paths to saved models

tune_mm_llm

bool

default:"False"

Whether to fine-tune the language model (LLM) component. When False, the LLM parameters are frozen during training.

tune_mm_mlp

bool

default:"False"

Whether to fine-tune the multimodal projector (MLP/merger) component. This module projects vision features into the language model space.

tune_mm_vision

bool

default:"False"

Whether to fine-tune the vision tower component. When False, the vision encoder parameters are frozen.
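
As a rough illustration of how the three flags combine, the sketch below decides per parameter whether it would be trained. The helper `is_trainable` and the name prefixes `visual.merger.` and `visual.` are assumptions made for this example, not taken from the training code.

```python
def is_trainable(param_name, tune_mm_llm=False, tune_mm_mlp=False, tune_mm_vision=False):
    """Return True if a parameter would be trained under the given flags.

    The prefixes used here are hypothetical and only illustrate the idea.
    """
    # The projector is checked first because its names are nested under "visual.".
    if param_name.startswith("visual.merger."):
        return tune_mm_mlp
    if param_name.startswith("visual."):
        return tune_mm_vision
    # Everything else is treated as part of the language model.
    return tune_mm_llm


names = [
    "visual.blocks.0.attn.qkv.weight",      # vision tower
    "visual.merger.mlp.0.weight",           # projector
    "model.layers.0.mlp.up_proj.weight",    # language model
]
flags = dict(tune_mm_llm=True, tune_mm_mlp=True, tune_mm_vision=False)
trainable = [n for n in names if is_trainable(n, **flags)]
print(trainable)
```

With a real model, the same decision would typically be applied by setting `requires_grad` on each parameter.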

Usage Example

from qwenvl.train.argument import ModelArguments

model_args = ModelArguments(
    model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",
    tune_mm_llm=True,        # Fine-tune language model
    tune_mm_mlp=True,        # Fine-tune projector
    tune_mm_vision=False     # Keep vision tower frozen
)

DataArguments

Configuration for data processing and vision-language inputs.

dataset_use

str

default:""

Comma-separated list of dataset names to use for training. Each name must be registered in the data list configuration. Example: "vqa,caption,ocr"
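
A minimal sketch of how such a comma-separated value could be split into individual names. The helper `parse_dataset_use` is hypothetical; the actual loader may handle this differently.

```python
def parse_dataset_use(dataset_use):
    """Split a comma-separated string into clean dataset names."""
    # Strip surrounding whitespace and drop empty entries (e.g. from "a,,b" or "").
    return [name.strip() for name in dataset_use.split(",") if name.strip()]


print(parse_dataset_use("vqa, caption,ocr"))
print(parse_dataset_use(""))
```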

data_flatten

bool

default:"False"

Whether to flatten data sequences for packed training. When enabled, multiple sequences can be packed into a single training example for efficiency.

data_packing

bool

default:"False"

Whether to enable data packing. Packs multiple examples together to minimize padding and improve GPU utilization.
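
To make the padding argument concrete, here is a small greedy first-fit packer that works on sequence lengths only. It is a sketch of the general idea, not the packing strategy used by the training code; `pack_lengths` and the example lengths are made up for illustration.

```python
def pack_lengths(lengths, max_length):
    """Greedy first-fit: put each sequence into the first pack that has room."""
    packs = []  # each pack is a list of sequence lengths
    for length in lengths:
        if length > max_length:
            raise ValueError(f"sequence of length {length} exceeds {max_length}")
        for pack in packs:
            if sum(pack) + length <= max_length:
                pack.append(length)
                break
        else:
            # No existing pack had room, so start a new one.
            packs.append([length])
    return packs


lengths = [300, 200, 500, 100, 400]
packs = pack_lengths(lengths, max_length=512)

# Compare padded slots: one padded row per sequence vs. one padded row per pack.
unpacked_slots = len(lengths) * 512
packed_slots = len(packs) * 512
print(packs, unpacked_slots, packed_slots)
```

In this example five padded rows shrink to three, while the number of real tokens stays the same.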

base_interval

int

default:"2"

Base interval for vision processing grid calculations.

Image Processing Parameters

max_pixels

int

default:"451584"

Maximum number of pixels for image inputs. Default is 28 * 28 * 576 = 451,584 pixels. Controls the maximum resolution after dynamic resolution processing.

min_pixels

int

default:"12544"

Minimum number of pixels for image inputs. Default is 28 * 28 * 16 = 12,544 pixels. Controls the minimum resolution for image processing.
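
The following sketch shows how the two bounds constrain an image's pixel count, expressed in the same 28 x 28 units as the defaults. It only clamps the total count; any rounding of width and height that the real image processor performs is ignored here, and `clamp_pixel_budget` is a hypothetical helper.

```python
BLOCK_AREA = 28 * 28  # the unit in which the defaults above are expressed


def clamp_pixel_budget(width, height, min_pixels=28 * 28 * 16, max_pixels=28 * 28 * 576):
    """Clamp an image's pixel count into [min_pixels, max_pixels]."""
    return max(min_pixels, min(width * height, max_pixels))


for w, h in [(64, 64), (640, 480), (1920, 1080)]:
    budget = clamp_pixel_budget(w, h)
    print(f"{w}x{h}: {budget} pixels, about {budget // BLOCK_AREA} blocks of 28x28")
```

Small images are raised to the minimum, large images are capped at the maximum, and images in between keep their own pixel count.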

Video Processing Parameters

video_max_frames

int

default:"8"

Maximum number of frames to extract from video inputs. Videos that would yield more frames than this are subsampled.

video_min_frames

int

default:"4"

Minimum number of frames to extract from video inputs.

video_max_pixels

int

default:"802816"

Maximum number of pixels per frame for video inputs. Default is 1024 * 28 * 28 = 802,816 pixels.

video_min_pixels

int

default:"200704"

Minimum number of pixels per frame for video inputs. Default is 256 * 28 * 28 = 200,704 pixels.

video_fps

float

default:"2.0"

Target frames per second for video sampling. Videos will be resampled to this FPS before frame extraction.
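
One plausible way the frame-count parameters interact is sketched below: the target FPS proposes a frame count, which is then clamped between the minimum and maximum. This is an assumption about the interaction, not a description of the actual sampler, and `num_sampled_frames` is a hypothetical helper.

```python
def num_sampled_frames(duration_s, video_fps=2.0, video_min_frames=4, video_max_frames=8):
    """Estimate how many frames would be kept for a video of the given duration."""
    # Frame count suggested by sampling at the target FPS.
    target = int(duration_s * video_fps)
    # Clamp into the configured range.
    return max(video_min_frames, min(target, video_max_frames))


for duration in (1, 3, 60):
    print(duration, "s ->", num_sampled_frames(duration), "frames")
```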

Usage Example

from qwenvl.train.argument import DataArguments

data_args = DataArguments(
    dataset_use="vqa,caption",
    data_packing=True,
    data_flatten=True,
    max_pixels=28 * 28 * 768,      # Higher resolution
    min_pixels=28 * 28 * 16,
    video_max_frames=16,            # More frames
    video_fps=2.0
)

TrainingArguments

Extends transformers.TrainingArguments with additional parameters for vision-language model training.

Base Parameters

cache_dir

str

default:"None"

Directory to store downloaded models and datasets cache.

optim

str

default:"adamw_torch"

Optimizer to use. Options include:

adamw_torch - PyTorch AdamW
adamw_hf - Hugging Face AdamW
sgd - Stochastic Gradient Descent
adafactor - Memory-efficient Adafactor

model_max_length

int

default:"512"

Maximum sequence length. Sequences will be right-padded and truncated to this length. Consider increasing for long-form VQA or detailed image descriptions.
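
The effect of right-padding and truncation on a single sequence can be sketched on plain token-id lists. The helper `pad_or_truncate` and the pad id `0` are assumptions for illustration; a real tokenizer supplies its own pad token.

```python
def pad_or_truncate(token_ids, model_max_length, pad_id=0):
    """Cut a sequence to model_max_length, then right-pad it to that length."""
    clipped = token_ids[:model_max_length]
    return clipped + [pad_id] * (model_max_length - len(clipped))


print(pad_or_truncate([11, 12, 13], model_max_length=5))
print(pad_or_truncate([11, 12, 13, 14, 15, 16], model_max_length=4))
```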

Component-Specific Learning Rates

mm_projector_lr

float

default:"None"

Learning rate for the multimodal projector (merger) module. When set, overrides the base learning rate for projector parameters. Typical values: 1e-4 to 5e-4 (often higher than base LR).

vision_tower_lr

float

default:"None"

Learning rate for the vision tower. When set, overrides the base learning rate for vision encoder parameters. Typical values: 1e-6 to 1e-5 (often lower than base LR to preserve pretrained features).
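
The override behaviour described above can be sketched as a grouping of parameter names by learning rate. The helper `build_lr_groups` and the prefixes `visual.merger.` and `visual.` are assumptions for this example; the real trainer builds optimizer parameter groups from the model itself.

```python
def build_lr_groups(param_names, learning_rate, mm_projector_lr=None, vision_tower_lr=None):
    """Map each learning rate to the parameter names that would use it."""
    groups = {}
    for name in param_names:
        is_projector = name.startswith("visual.merger.")
        is_vision = name.startswith("visual.") and not is_projector
        if is_projector and mm_projector_lr is not None:
            lr = mm_projector_lr
        elif is_vision and vision_tower_lr is not None:
            lr = vision_tower_lr
        else:
            # No override set for this component: fall back to the base rate.
            lr = learning_rate
        groups.setdefault(lr, []).append(name)
    return groups


names = [
    "visual.blocks.0.attn.qkv.weight",
    "visual.merger.mlp.0.weight",
    "model.layers.0.mlp.up_proj.weight",
]
groups = build_lr_groups(names, learning_rate=2e-5, mm_projector_lr=2e-4, vision_tower_lr=1e-5)
print(groups)
```

When an override is left at `None`, the corresponding parameters simply land in the base learning-rate group.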

LoRA Configuration

lora_enable

bool

default:"False"

Whether to use LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

lora_r

int

default:"64"

LoRA rank. Higher values provide more capacity but increase trainable parameters. Typical values: 8, 16, 32, 64, 128

lora_alpha

int

default:"128"

LoRA scaling parameter. Controls the magnitude of LoRA updates. Often set to 2x the LoRA rank (e.g., lora_alpha = 2 * lora_r).

lora_dropout

float

default:"0.0"

Dropout probability for LoRA layers. Can help with regularization. Typical values: 0.0 to 0.1
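
To see how rank and alpha trade off, the sketch below uses the standard LoRA formulation, in which the low-rank update is scaled by `lora_alpha / lora_r` and a linear layer of shape `d_in x d_out` gains `lora_r * (d_in + d_out)` trainable parameters. The layer size 2048 x 2048 is an arbitrary example, and both helpers are hypothetical.

```python
def lora_scaling(lora_r, lora_alpha):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_extra_params(d_in, d_out, lora_r):
    """Trainable parameters added by the two low-rank matrices A and B."""
    return lora_r * (d_in + d_out)


d_in = d_out = 2048  # arbitrary example layer size
full = d_in * d_out
for r in (8, 64):
    extra = lora_extra_params(d_in, d_out, r)
    print(f"r={r}: scaling={lora_scaling(r, 2 * r)}, extra params={extra} ({extra / full:.2%} of the layer)")
```

Keeping `lora_alpha = 2 * lora_r` holds the scaling at 2.0 while the rank, and with it the parameter count, changes.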

Usage Example

from qwenvl.train.argument import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    
    # Learning rates
    learning_rate=2e-5,           # Base LLM learning rate
    mm_projector_lr=2e-4,         # Higher for projector
    vision_tower_lr=1e-5,         # Lower for vision tower
    
    # Sequence length
    model_max_length=2048,
    
    # Optimizer
    optim="adamw_torch",
    weight_decay=0.01,
    
    # LoRA configuration
    lora_enable=True,
    lora_r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    
    # Standard training arguments
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    warmup_steps=100,
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
)
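
For the batch settings in the example above, the number of samples per optimizer step can be worked out as follows. The device count is not part of the arguments shown and is a hypothetical input here.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_devices):
    """Samples contributing to one optimizer step across all devices."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Values from the usage example, on one device and on a hypothetical 8-device node.
print(effective_batch_size(4, 8, num_devices=1))
print(effective_batch_size(4, 8, num_devices=8))
```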

Complete Training Configuration

from qwenvl.train.argument import (
    ModelArguments,
    DataArguments, 
    TrainingArguments
)

# Model configuration
model_args = ModelArguments(
    model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",
    tune_mm_llm=True,
    tune_mm_mlp=True,
    tune_mm_vision=False
)

# Data configuration
data_args = DataArguments(
    dataset_use="custom_vqa,coco_caption",
    data_packing=True,
    max_pixels=28 * 28 * 768,
    video_max_frames=8,
    video_fps=2.0
)

# Training configuration
training_args = TrainingArguments(
    output_dir="./qwen-vl-finetuned",
    learning_rate=2e-5,
    mm_projector_lr=2e-4,
    vision_tower_lr=1e-5,
    model_max_length=2048,
    optim="adamw_torch",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    lora_enable=True,
    lora_r=64,
    lora_alpha=128,
)
