Overview
The data processor module provides classes and functions for handling vision-language training data, including dataset loading, preprocessing, packing, and batching for Qwen-VL models.Constants
Label value used to ignore tokens during loss calculation.
Token ID representing image placeholders in the vocabulary.
Token ID representing video placeholders in the vocabulary.
Text placeholder for images in conversation data.
Text placeholder for videos in conversation data.
LazySupervisedDataset
Dataset class for lazy loading of vision-language training data.Constructor
Qwen-VL processor containing tokenizer, image processor, and video processor.
Data configuration including dataset names, pixel limits, and video parameters.
Properties
Approximate token length for each sample (text tokens + 128 per image).
Token lengths with modality information. Positive for multimodal samples, negative for text-only.
Pre-calculated token lengths if available in dataset metadata, otherwise returns array of ones.
Methods
__getitem__(i)
Retrieves a single training sample with retry logic.
Sample index.
input_ids- Tokenized input sequencelabels- Target labels with IGNORE_INDEX for non-answer tokensposition_ids- 3D position IDs for RoPEattention_mask- Sequence length for Flash Attentionpixel_values- Image pixel tensors (if images present)image_grid_thw- Image grid dimensions (time, height, width)pixel_values_videos- Video pixel tensors (if videos present)video_grid_thw- Video grid dimensions
- 3 attempts on the current sample
- 3 attempts on the next sample if current fails
- Raises exception if all retries fail
Utility Functions
update_processor_pixels()
Updates processor image and video resolution parameters.
Processor to update.
Data arguments containing pixel and frame limits.
- Image processor:
min_pixels,max_pixels - Video processor:
min_pixels,max_pixels,min_frames,max_frames,fps
preprocess_qwen_visual()
Preprocesses vision-language data into model inputs.
List containing a single data sample with:
conversations- List of conversation turnsimage- Image path(s) or list of pathsvideo- Video path(s) or list of pathsdata_path- Base directory path
Qwen-VL processor for tokenization and vision processing.
input_ids- Tokenized sequenceslabels- Labels with assistant responses markedpixel_values,image_grid_thw- Image datapixel_values_videos,video_grid_thw- Video data
read_jsonl()
Reads JSONL format annotation files.
Path to JSONL file.
Data Collators
DataCollatorForSupervisedDataset
Collates batches for standard training.Tokenizer for padding configuration.
- Pads
input_idsandlabelsto batch max length - Truncates sequences to
model_max_length - Concatenates all images and videos across batch
- Concatenates grid dimension tensors
- Creates attention mask from padding
input_ids- Padded input sequenceslabels- Padded labelsattention_mask- Padding maskposition_ids- 3D position IDspixel_values- Concatenated imagesimage_grid_thw- Image grid dimensionspixel_values_videos- Concatenated videosvideo_grid_thw- Video grid dimensions
FlattenedDataCollatorForSupervisedDataset
Collates batches for packed/flattened training.- Concatenates sequences without padding between samples
- Uses cumulative sequence length tensor for attention mask
- Optimized for Flash Attention variable-length inputs
- Maximizes GPU utilization by eliminating padding
attention_mask- Cumulative sequence length indices (for Flash Attention)- Other fields same as standard collator
Dataset Factory
make_supervised_data_module()
Creates dataset and collator for training.
Qwen-VL processor instance.
Data configuration arguments.
train_dataset-LazySupervisedDatasetinstanceeval_dataset-None(evaluation not implemented)data_collator-FlattenedDataCollatorForSupervisedDatasetif packing/flatten enabled, otherwiseDataCollatorForSupervisedDataset
Data Format
Conversation Format
Dataset annotations should follow this structure:Multi-Image Example
Video Example
Usage Example
Advanced Features
Data Packing
Whendata_packing=True, multiple training examples are packed into single sequences:
- Reduces padding overhead
- Improves GPU utilization
- Increases effective batch size
- Requires Flash Attention for variable-length support
Custom Dataset Registration
Register custom datasets indata_list.py:
Position IDs for RoPE
The dataset generates 3D rotary position embeddings using model-specific functions:get_rope_index_2()- Qwen2-VL modelsget_rope_index_25()- Qwen2.5-VL modelsget_rope_index_3()- Qwen3-VL models