
ModelArguments

Configuration for model initialization and component fine-tuning.
model_name_or_path
str
default:"Qwen/Qwen2.5-VL-3B-Instruct"
Path to pretrained model or model identifier from Hugging Face Hub. Supports:
  • Qwen/Qwen2-VL-*
  • Qwen/Qwen2.5-VL-*
  • Qwen/Qwen3-VL-*
  • Qwen/Qwen3-VL-MoE-*
  • Local paths to saved models
tune_mm_llm
bool
default:"False"
Whether to fine-tune the language model (LLM) component. When False, the LLM parameters are frozen during training.
tune_mm_mlp
bool
default:"False"
Whether to fine-tune the multimodal projector (MLP/merger) component. This module projects vision features into the language model space.
tune_mm_vision
bool
default:"False"
Whether to fine-tune the vision tower component. When False, the vision encoder parameters are frozen.

Usage Example

from qwenvl.train.argument import ModelArguments

model_args = ModelArguments(
    model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",
    tune_mm_llm=True,        # Fine-tune language model
    tune_mm_mlp=True,        # Fine-tune projector
    tune_mm_vision=False     # Keep vision tower frozen
)

DataArguments

Configuration for data processing and vision-language inputs.
dataset_use
str
default:""
Comma-separated list of dataset names to use for training. Dataset names should be registered in the data list configuration. Example: "vqa,caption,ocr"
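
As a minimal sketch of how a comma-separated value like this can be split into individual dataset names: the helper `parse_dataset_use` below is hypothetical and is not part of the `qwenvl` package, whose own parsing may differ.

```python
def parse_dataset_use(dataset_use: str) -> list[str]:
    """Split a comma-separated dataset string into clean names (hypothetical helper)."""
    # Strip surrounding whitespace and drop empty entries such as a trailing comma.
    return [name.strip() for name in dataset_use.split(",") if name.strip()]


names = parse_dataset_use("vqa, caption,ocr")
print(names)
```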
data_flatten
bool
default:"False"
Whether to flatten data sequences for packed training. When enabled, multiple sequences can be packed into a single training example for efficiency.
data_packing
bool
default:"False"
Whether to enable data packing. Packs multiple examples together to minimize padding and improve GPU utilization.
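
To illustrate why packing reduces padding, here is a rough sketch of greedy first-fit packing over sequence lengths. The function `pack_sequences` is a hypothetical illustration only; the packing strategy actually used by the training code is not described here and may differ.

```python
def pack_sequences(lengths: list[int], max_length: int) -> list[list[int]]:
    """Greedy first-fit packing: returns groups of example indices (hypothetical sketch)."""
    bins: list[list[int]] = []   # indices of examples placed in each packed sequence
    free: list[int] = []         # remaining token capacity of each packed sequence
    for idx, n in enumerate(lengths):
        if n > max_length:
            raise ValueError(f"example {idx} has {n} tokens, more than {max_length}")
        for b, capacity in enumerate(free):
            if n <= capacity:
                # Place the example in the first packed sequence with enough room.
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            # No existing packed sequence has room, so start a new one.
            bins.append([idx])
            free.append(max_length - n)
    return bins


packed = pack_sequences([300, 200, 400, 100], max_length=512)
print(packed)
```

In this sketch, four examples fit into two packed sequences instead of occupying four padded ones.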
base_interval
int
default:"2"
Base interval for vision processing grid calculations.

Image Processing Parameters

max_pixels
int
default:"451584"
Maximum number of pixels for image inputs. Default is 28 * 28 * 576 = 451,584 pixels. Controls the maximum resolution after dynamic resolution processing.
min_pixels
int
default:"12544"
Minimum number of pixels for image inputs. Default is 28 * 28 * 16 = 12,544 pixels. Controls the minimum resolution for image processing.
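
The following sketch shows one way a pixel budget like this can be enforced: the image is rescaled so that its area falls between `min_pixels` and `max_pixels`, with both sides kept at multiples of 28. The function `fit_to_pixel_budget` is a hypothetical, simplified illustration and not necessarily the resizing routine the processor uses.

```python
import math


def fit_to_pixel_budget(height: int, width: int,
                        min_pixels: int = 28 * 28 * 16,
                        max_pixels: int = 28 * 28 * 576,
                        factor: int = 28) -> tuple[int, int]:
    """Return (height, width) inside the pixel budget, both multiples of `factor`."""
    # Start by snapping each side to the nearest multiple of `factor`.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio and round down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same ratio and round up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


h, w = fit_to_pixel_budget(1080, 1920)
print(h, w, h * w)
```

Raising `max_pixels` lets large images keep more detail at the cost of more vision tokens per image.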

Video Processing Parameters

video_max_frames
int
default:"8"
Maximum number of frames to extract from video inputs. Videos that would yield more frames than this are subsampled down to this limit.
video_min_frames
int
default:"4"
Minimum number of frames to extract from video inputs.
video_max_pixels
int
default:"802816"
Maximum number of pixels per frame for video inputs. Default is 1024 * 28 * 28 = 802,816 pixels.
video_min_pixels
int
default:"200704"
Minimum number of pixels per frame for video inputs. Default is 256 * 28 * 28 = 200,704 pixels.
video_fps
float
default:"2.0"
Target frames per second for video sampling. Videos will be resampled to this FPS before frame extraction.
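
A rough sketch of how `video_fps`, `video_min_frames`, and `video_max_frames` can interact: the target FPS gives a desired frame count, which is then clamped to the min/max range and spread evenly over the clip. The function `sample_frame_indices` and its `native_fps` argument are hypothetical; the actual sampling logic may differ.

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         video_fps: float = 2.0,
                         video_min_frames: int = 4,
                         video_max_frames: int = 8) -> list[int]:
    """Pick evenly spaced frame indices under an FPS target (hypothetical sketch)."""
    if total_frames <= 0:
        return []
    duration_s = total_frames / native_fps
    wanted = int(duration_s * video_fps)            # frames implied by the target FPS
    count = max(video_min_frames, min(video_max_frames, wanted))
    count = min(count, total_frames)                # cannot sample more frames than exist
    if count == 1:
        return [0]
    step = (total_frames - 1) / (count - 1)
    return [round(i * step) for i in range(count)]


# A 10-second clip at 30 FPS would yield 20 frames at 2 FPS, so it is capped at 8.
indices = sample_frame_indices(total_frames=300, native_fps=30.0)
print(indices)
```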

Usage Example

from qwenvl.train.argument import DataArguments

data_args = DataArguments(
    dataset_use="vqa,caption",
    data_packing=True,
    data_flatten=True,
    max_pixels=28 * 28 * 768,      # Higher resolution
    min_pixels=28 * 28 * 16,
    video_max_frames=16,            # More frames
    video_fps=2.0
)

TrainingArguments

Extends transformers.TrainingArguments with additional parameters for vision-language model training.

Base Parameters

cache_dir
str
default:"None"
Directory to store downloaded models and datasets cache.
optim
str
default:"adamw_torch"
Optimizer to use. Options include:
  • adamw_torch - PyTorch AdamW
  • adamw_hf - Hugging Face AdamW
  • sgd - Stochastic Gradient Descent
  • adafactor - Memory-efficient Adafactor
model_max_length
int
default:"512"
Maximum sequence length. Sequences are right-padded or truncated to this length. Consider increasing for long-form VQA or detailed image descriptions.
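
As a small illustration of right-padding and truncation to a fixed length: the helper `pad_or_truncate` is hypothetical, and the `pad_id` value of 0 is a placeholder, since a real tokenizer defines its own padding token ID.

```python
def pad_or_truncate(token_ids: list[int], max_length: int, pad_id: int = 0) -> list[int]:
    """Truncate to max_length, then right-pad with pad_id (hypothetical helper)."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


print(pad_or_truncate([11, 12, 13], max_length=5))
print(pad_or_truncate([11, 12, 13, 14, 15, 16], max_length=4))
```

Vision tokens count toward this limit, so a value that is too small can truncate text that follows high-resolution image or video inputs.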

Component-Specific Learning Rates

mm_projector_lr
float
default:"None"
Learning rate for the multimodal projector (merger) module. When set, overrides the base learning rate for projector parameters. Typical values: 1e-4 to 5e-4 (often higher than base LR).
vision_tower_lr
float
default:"None"
Learning rate for the vision tower. When set, overrides the base learning rate for vision encoder parameters. Typical values: 1e-6 to 1e-5 (often lower than base LR to preserve pretrained features).
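
Component-specific learning rates are usually implemented as optimizer parameter groups. The sketch below groups parameters by name and assigns each group its learning rate. The function `build_param_groups` is hypothetical, and the name patterns `"merger"` and `"visual"` are assumptions about how the model names its parameters; a real optimizer would receive the parameter tensors rather than their names.

```python
from typing import Optional


def build_param_groups(param_names: list[str], base_lr: float,
                       mm_projector_lr: Optional[float] = None,
                       vision_tower_lr: Optional[float] = None) -> list[dict]:
    """Assign a learning rate to each parameter by name (hypothetical sketch)."""
    groups: dict[str, list[str]] = {"projector": [], "vision": [], "base": []}
    for name in param_names:
        if "merger" in name and mm_projector_lr is not None:
            groups["projector"].append(name)    # assumed projector name pattern
        elif "visual" in name and "merger" not in name and vision_tower_lr is not None:
            groups["vision"].append(name)       # assumed vision tower name pattern
        else:
            groups["base"].append(name)         # everything else uses the base LR
    lrs = {"projector": mm_projector_lr, "vision": vision_tower_lr, "base": base_lr}
    # Skip empty groups so the result only lists learning rates that are in use.
    return [{"params": names, "lr": lrs[key]} for key, names in groups.items() if names]


example_names = [
    "visual.blocks.0.attn.qkv.weight",
    "visual.merger.mlp.0.weight",
    "model.layers.0.mlp.up_proj.weight",
]
for group in build_param_groups(example_names, 2e-5, mm_projector_lr=2e-4, vision_tower_lr=1e-5):
    print(group["lr"], group["params"])
```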

LoRA Configuration

lora_enable
bool
default:"False"
Whether to use LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.
lora_r
int
default:"64"
LoRA rank. Higher values provide more capacity but increase trainable parameters. Typical values: 8, 16, 32, 64, 128
lora_alpha
int
default:"128"
LoRA scaling parameter. Controls the magnitude of LoRA updates. Often set to 2x the LoRA rank (e.g., lora_alpha = 2 * lora_r).
lora_dropout
float
default:"0.0"
Dropout probability for LoRA layers. Can help with regularization. Typical values: 0.0 to 0.1
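
To make the relationship between `lora_r` and `lora_alpha` concrete: in the standard LoRA formulation the low-rank update is scaled by `lora_alpha / lora_r`, and each adapted linear layer adds `lora_r * (d_in + d_out)` trainable parameters. The helpers below are hypothetical, and the 4096-dimensional layer is an illustrative size, not a property of any specific model.

```python
def lora_scale(lora_alpha: int, lora_r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_param_count(d_in: int, d_out: int, lora_r: int) -> int:
    """Trainable parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return lora_r * (d_in + d_out)


print(lora_scale(lora_alpha=128, lora_r=64))
print(lora_param_count(d_in=4096, d_out=4096, lora_r=64))
```

With the defaults, the scale is 2.0, which matches the "2x the rank" convention for `lora_alpha`.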

Usage Example

from qwenvl.train.argument import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    
    # Learning rates
    learning_rate=2e-5,           # Base LLM learning rate
    mm_projector_lr=2e-4,         # Higher for projector
    vision_tower_lr=1e-5,         # Lower for vision tower
    
    # Sequence length
    model_max_length=2048,
    
    # Optimizer
    optim="adamw_torch",
    weight_decay=0.01,
    
    # LoRA configuration
    lora_enable=True,
    lora_r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    
    # Standard training arguments
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    warmup_steps=100,
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
)
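
When choosing `per_device_train_batch_size` and `gradient_accumulation_steps`, it helps to compute the effective batch size per optimizer step. The sketch below does that arithmetic; the helper name and the device count of 8 are assumptions for illustration.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Values from the example above, on an assumed 8-device setup.
print(effective_batch_size(4, 8, num_devices=8))
```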

Complete Training Configuration

from qwenvl.train.argument import (
    ModelArguments,
    DataArguments,
    TrainingArguments
)

# Model configuration
model_args = ModelArguments(
    model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",
    tune_mm_llm=True,
    tune_mm_mlp=True,
    tune_mm_vision=False
)

# Data configuration
data_args = DataArguments(
    dataset_use="custom_vqa,coco_caption",
    data_packing=True,
    max_pixels=28 * 28 * 768,
    video_max_frames=8,
    video_fps=2.0
)

# Training configuration
training_args = TrainingArguments(
    output_dir="./qwen-vl-finetuned",
    learning_rate=2e-5,
    mm_projector_lr=2e-4,
    vision_tower_lr=1e-5,
    model_max_length=2048,
    optim="adamw_torch",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    lora_enable=True,
    lora_r=64,
    lora_alpha=128,
)