This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs) and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL-trained models, they often consist of prolonged, hesitant, less informative steps and incorrect reasoning. To study this effect systematically, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewriting, and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL, and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL-3B, achieves top-1 performance on the Open LMM Reasoning Leaderboard among 4B-scale LVLMs, surpassing the previous state of the art by 1.8%. We hope our findings provide valuable insights into developing reasoning-capable LVLMs and inform future research in this area.
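For reference, the core of GRPO is a group-relative advantage: for each prompt, a group of responses is sampled, and each response's reward is normalized by the group's mean and standard deviation. The snippet below is a minimal, illustrative sketch of that computation in PyTorch, not the training code used in this repo.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only,
# not this repo's training code): each sampled response is scored by the
# reward module, then normalized against the other responses in its group.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: tensor of shape (num_prompts, group_size), one scalar reward per response."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt with a group of four sampled responses.
print(group_relative_advantages(torch.tensor([[1.0, 0.0, 0.5, 1.0]])))
```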
Data generation pipeline. We first generate initial reasoning traces by feeding detailed captions and visual questions into DeepSeek-R1. These outputs are then rewritten for improved fluency and verified for correctness with a GPT-based verifier. The resulting data is split into VLAA-Thinking-SFT and VLAA-Thinking-RL.
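As a rough illustration of this flow, the sketch below wires the stages together; every helper passed in (`caption_image`, `distill_reasoning`, `rewrite_for_fluency`, `verify_answer`) is a hypothetical placeholder for the corresponding captioner, DeepSeek-R1 call, rewriter, and GPT-based verifier, not part of this repo's code.

```python
# Hypothetical sketch of the data generation flow described above; the callables
# passed in are placeholders, not functions provided by this repository.
from typing import Callable, Optional

def build_example(
    image_path: str,
    question: str,
    caption_image: Callable[[str], str],           # image -> detailed caption
    distill_reasoning: Callable[[str, str], str],  # (caption, question) -> R1 reasoning trace
    rewrite_for_fluency: Callable[[str], str],     # trace -> rewritten trace
    verify_answer: Callable[[str, str], bool],     # (trace, question) -> correctness check
) -> Optional[dict]:
    caption = caption_image(image_path)
    trace = distill_reasoning(caption, question)   # DeepSeek-R1 sees text only, via the caption
    trace = rewrite_for_fluency(trace)
    if not verify_answer(trace, question):
        return None                                # unverified traces are discarded
    return {"image": image_path, "question": question, "reasoning": trace}
```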
| Name | Data Type | # Original | # Pipeline | # Final SFT | # Final RL |
|---|---|---|---|---|---|
| CLEVR_Math | Closed-end | 35,000 | 28,018 | 5,923 | 2,000 |
| GeoQA170K | Closed-end | - | - | - | 6,499 |
| Math PUMA | Closed-end | 30,000 | 26,672 | 19,258 | 6,696 |
| ArxivQA | Closed-end | 54,399 | 51,348 | 34,604 | 1,000 |
| DocVQA | Closed-end | 10,194 | 8,206 | 4,897 | 1,000 |
| VizWiz | Closed-end | 20,523 | 6,528 | 4,266 | 1,000 |
| ALLaVA-LAION | Open-end | 47,066 | 18,123 | 10,496 | 3,000 |
| LLaVA-CoT-COCO | Closed-end | 3,000 | 3,000 | 8,727 | 2,000 |
| LLaVA-CoT-VisualGenome | Closed-end | 3,000 | 3,000 | 38,242 | 2,000 |
| Total | Closed- & Open-end | 203,182 | 144,895 | 126,413 | 25,195 |
Data statistics of VLAA-Thinking. We report the original volume of the metadata (# Original), the data size after the distillation pipeline (# Pipeline), and the number of sampled examples for SFT (# Final SFT) and RL (# Final RL), respectively. Note that we use GeoQA170K, which has verifiable answers, only for the RL split.
Example responses from LVLMs trained for reasoning with different strategies.
Left: response from a model trained with SFT, showing pseudo reasoning traces and a number of pseudo self-reflective cues (i.e., aha-moments) imitated from R1.
Right: response from a model trained with RL (GRPO), showing native reasoning ability and authentic aha-moments that emerge from RL training.
Incorrect reasoning steps are colored red, and aha-moments are highlighted.
Performance change (Δ%) of different models trained with supervised fine-tuning (SFT) only.
Impact of SFT with 5K and 10K samples before GRPO. SFT jeopardizes GRPO performance.
Mixed reward module. The proposed framework comprises two reward formats (rule-based and open-ended) and five types of verifiable rewards (digit, MCQ, math, IoU, and general reasoning).
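To make the routing concrete, here is a hedged sketch of how a mixed reward module like this could dispatch each sample to one of the five verifiers; the function names and matching rules are illustrative assumptions, not the reward implementation shipped in this repo.

```python
# Illustrative mixed-reward sketch: rule-based checks for digit, MCQ, math, and
# IoU answers, plus an open-ended judge for general reasoning. All names and
# matching rules here are assumptions, not this repo's actual reward code.
import re

def box_iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def mixed_reward(sample_type, prediction, target, judge=None):
    if sample_type == "digit":                       # exact numeric string match
        return float(prediction.strip() == str(target))
    if sample_type == "mcq":                         # first standalone choice letter
        match = re.search(r"\b([A-D])\b", prediction)
        return float(bool(match) and match.group(1) == target)
    if sample_type == "math":                        # placeholder for a symbolic answer checker
        return float(prediction.strip() == str(target))
    if sample_type == "iou":                         # perception reward: box overlap
        return box_iou(prediction, target)
    # open-ended general reasoning: score with a judge model (e.g., GPT-based)
    return judge(prediction, target) if judge is not None else 0.0
```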
Benchmark performance. Evaluation on six math reasoning benchmarks from the Open LMM Reasoning Leaderboard. VLAA-Thinker models significantly outperform the baselines and other models.
We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.
@misc{vl-thinking2025,
  title        = {SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},
  author       = {Hardy Chen and Haoqin Tu and Fali Wang and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/UCSC-VLAA/VLAA-Thinking}},
}