SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models


1UC Santa Cruz, 2UT Dallas, 3Amazon Research, 4The Pennsylvania State University

Abstract

This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, and less informative steps, as well as incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewriting, and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL, and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5-VL 3B, achieves top-1 performance on the Open LMM Reasoning Leaderboard among 4B-scale LVLMs, surpassing the previous state of the art by 1.8%. We hope our findings provide valuable insights into developing reasoning-capable LVLMs and can inform future research in this area.

🏭 VLAA-Thinking Data Generation

Data generation pipeline. We first generate initial reasoning traces by feeding detailed captions and visual questions into DeepSeek-R1. These outputs are then rewritten for improved fluency and verified for correctness using a GPT-based verifier. The resulting data is split into VLAA-Thinking-SFT and VLAA-Thinking-RL.
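For concreteness, below is a minimal sketch of how such a caption-then-distill-then-rewrite-then-verify flow could be orchestrated. The helper callables (captioner, reasoner, rewriter, verifier) and their prompts are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of the caption -> distill -> rewrite -> verify flow; the
# helper callables and their prompts are assumptions, not the actual code.

def build_example(image, question, captioner, reasoner, rewriter, verifier):
    """Run one (image, question) pair through the generation pipeline."""
    # 1) Caption the image so a text-only reasoner (e.g., DeepSeek-R1) can "see" it.
    caption = captioner(image)

    # 2) Distill a step-by-step reasoning trace from the caption and the question.
    trace = reasoner(f"Caption: {caption}\nQuestion: {question}")

    # 3) Rewrite the trace for fluency, e.g., removing references to the caption
    #    so the reasoning reads as if grounded directly in the image.
    trace = rewriter(trace)

    # 4) Verify the final answer with a GPT-based checker; drop the sample on failure.
    if not verifier(question, trace):
        return None
    return {"image": image, "question": question, "reasoning": trace}
```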

📚 VLAA-Thinking Dataset Card


| Name | Data Type | # Original | # Pipeline | # Final SFT | # Final RL |
|---|---|---|---|---|---|
| CLEVR_Math | Closed-end | 35,000 | 28,018 | 5,923 | 2,000 |
| GeoQA170K | Closed-end | - | - | - | 6,499 |
| Math PUMA | Closed-end | 30,000 | 26,672 | 19,258 | 6,696 |
| ArxivQA | Closed-end | 54,399 | 51,348 | 34,604 | 1,000 |
| DocVQA | Closed-end | 10,194 | 8,206 | 4,897 | 1,000 |
| VizWiz | Closed-end | 20,523 | 6,528 | 4,266 | 1,000 |
| ALLaVA-LAION | Open-end | 47,066 | 18,123 | 10,496 | 3,000 |
| LLaVA-CoT-COCO | Closed-end | 3,000 | 3,000 | 8,727 | 2,000 |
| LLaVA-CoT-VisualGenome | Closed-end | 3,000 | 3,000 | 38,242 | 2,000 |
| Total | Closed- & Open-end | 203,182 | 144,895 | 126,413 | 25,195 |

Data statistics of VLAA-Thinking. We report the original volume of metadata (#Original), the data size after the distillation pipeline (#Pipeline), and the sizes of the sampled examples for SFT (#Final SFT) and RL (#Final RL), respectively. Note that GeoQA170K, whose answers are verifiable, is used only for the RL split.
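Both splits can be consumed as plain JSON records; the file names below are placeholders for illustration, so check the repository for the actual data paths.

```python
# Hypothetical loader for the released splits; the file names below are
# placeholders, not the repository's actual paths.
import json

def load_split(path):
    """Load a JSON list of {image, question, reasoning, answer} records."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

sft_data = load_split("vlaa_thinking_sft.json")  # expected ~126,413 traces
rl_data = load_split("vlaa_thinking_rl.json")    # expected ~25,195 examples
print(len(sft_data), len(rl_data))
```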

Examples

⚔️ SFT vs RL for Reasoning

Pseudo vs Native Reasoning

Examples from LVLMs trained with different strategies for reasoning.

Left: response from a model trained with SFT, showing pseudo reasoning traces and a number of pseudo self-reflective cues (i.e., aha-moments) imitated from R1.

Right: response from a model trained with RL (GRPO), showing native reasoning ability and authentic aha-moments that emerged from RL training.

Wrong reasoning steps are colored red and aha-moments are highlighted.

SFT Performance

Relative performance change (in %) of different models trained with supervised fine-tuning (SFT) only.

SFT+GRPO vs GRPO Performance

Impact of applying SFT with 5K and 10K samples before GRPO: SFT jeopardizes subsequent GRPO performance.

💡 GRPO with Mixed Reward

Mixed Reward Module

Mixed reward module. The proposed framework comprises two reward formats (rule-based and open-ended) and five types of verifiable rewards (digit, MCQ, math, IoU, and general reasoning).
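To make the reward routing concrete, here is a minimal sketch of a dispatcher over the five verifiable reward types, together with the group-relative advantage normalization used by GRPO. The answer-type keys, thresholds, and helper functions are our assumptions for illustration, not the paper's exact reward implementation.

```python
# Sketch of a mixed reward dispatcher and GRPO-style advantage normalization.
# The routing keys, thresholds, and helpers below are illustrative assumptions.
import re
import statistics

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def normalize_math(expr):
    """Crude normalization for math answers (strip whitespace and LaTeX wrappers)."""
    return re.sub(r"\s+|\\left|\\right|\$", "", str(expr))

def parse_box(text):
    """Pull the first four numbers out of a predicted bounding-box string."""
    return [float(x) for x in re.findall(r"-?\d+\.?\d*", text)][:4]

def mixed_reward(answer_type, prediction, reference, judge=None):
    """Route a completion to one of the five verifiable rewards."""
    if answer_type == "digit":   # rule-based: exact numeric match
        return float(prediction.strip() == str(reference))
    if answer_type == "mcq":     # rule-based: option letter match
        return float(prediction.strip().upper().startswith(str(reference).upper()))
    if answer_type == "math":    # rule-based: normalized expression match
        return float(normalize_math(prediction) == normalize_math(reference))
    if answer_type == "iou":     # rule-based: predicted box overlaps ground truth
        return float(iou(parse_box(prediction), reference) > 0.5)
    # open-ended general reasoning: score with an external judge model
    return judge(prediction, reference) if judge is not None else 0.0

def group_relative_advantages(rewards):
    """GRPO advantages: standardize rewards within a group of rollouts."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

In GRPO, each question's group of sampled responses would be scored with a reward function like `mixed_reward`, and the standardized scores from `group_relative_advantages` would then serve as the advantages for the policy update.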

Performance

Benchmark performance. Evaluation on six math reasoning benchmarks hosted by the Open LMM Reasoning Leaderboard. VLAA-Thinker models significantly outperform baselines and other competing models.

Acknowledgement

We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.

BibTeX


@misc{vl-thinking2025,
    title={SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},
    author={Hardy Chen and Haoqin Tu and Fali Wang and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
    year={2025},
    publisher={GitHub},
    journal={GitHub repository},
    howpublished={\url{https://github.com/UCSC-VLAA/VLAA-Thinking}},
}