VL-Thinking: An R1-Derived Visual Instruction Tuning Dataset for Thinkable LVLMs


1UC Santa Cruz, 2Amazon Research, 3UT Dallas

Dataset Card 📚


| Name | # Original Samples | # Rewritten | # Verified |
|---|---:|---:|---:|
| CLEVR_Math | 15,000 | 14,748 | 9,771 |
| GeoQA170K | 14,019 | 11,745 | 7,794 |
| Synthesis | 29,998 | 29,998 | 26,672 |
| ArxivQA | 14,992 | 14,810 | 14,109 |
| ALLaVA-LAION | 36,977 | 30,191 | 18,123 |
| **Total** | **110,986** | **101,492** | **76,469** |

Dataset Statistics. These datasets cover questions from different domains (math, general) and of different types (closed-ended, open-ended). For datasets that contain duplicate images, we keep only unique images for higher diversity. More data is on the way.
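As a concrete illustration of the image deduplication mentioned above, here is a minimal sketch that keeps one sample per unique image by hashing raw image bytes. The field name `image_path` is a hypothetical placeholder, not the released schema, and this is not the project's actual deduplication code:

```python
import hashlib

def dedup_by_image(samples):
    """Keep only the first sample seen for each unique image.

    `samples` is assumed to be an iterable of dicts with a hypothetical
    `image_path` field pointing at the image file on disk.
    """
    seen = set()
    unique = []
    for sample in samples:
        # Hash the raw bytes so byte-identical images collapse to one key.
        with open(sample["image_path"], "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique
```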

Generation Pipeline

Pipeline Overview. We propose a four-step procedure for data generation: Captioning, Visual-Language CoT Generation, Answer Rewriting, and Answer Verification, as sketched below.
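The following is a minimal sketch of how the four steps could be wired together for one sample. All four callables (`caption_image`, `generate_cot`, `rewrite_answer`, `verify_answer`) are hypothetical placeholders for the underlying model calls, not the released implementation:

```python
from typing import Callable, Optional

def build_sample(
    image_path: str,
    question: str,
    reference_answer: str,
    caption_image: Callable[[str], str],
    generate_cot: Callable[[str, str], str],
    rewrite_answer: Callable[[str, str], str],
    verify_answer: Callable[[str, str], bool],
) -> Optional[dict]:
    """Run one (image, question) pair through the four-step pipeline."""
    # Step 1: Captioning -- turn the image into text so a text-only
    # reasoning model can consume it.
    caption = caption_image(image_path)

    # Step 2: Visual-Language CoT Generation -- produce a long-form
    # reasoning trace conditioned on the caption and the question.
    cot = generate_cot(caption, question)

    # Step 3: Answer Rewriting -- distill the trace into a clean final answer.
    answer = rewrite_answer(cot, question)

    # Step 4: Answer Verification -- keep the sample only if the rewritten
    # answer agrees with the reference; otherwise discard it.
    if verify_answer(answer, reference_answer):
        return {"question": question, "thinking": cot, "answer": answer}
    return None
```

Samples that survive Step 4 correspond to the "# Verified" column in the dataset card above.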

Examples

Acknowledgement

We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.

BibTeX


```bibtex
@misc{vl-thinking2025,
    title        = {VL-Thinking: An R1-Derived Visual Instruction Tuning Dataset for Thinkable LVLMs},
    author       = {Hardy Chen and Haoqin Tu and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
    year         = {2025},
    publisher    = {GitHub},
    journal      = {GitHub repository},
    howpublished = {\url{https://github.com/UCSC-VLAA/VL-Thinking}},
}
```