SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models


1UC Santa Cruz, 2UT Dallas, 3Amazon Research, 4The Pennsylvania State University

Abstract

This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, and less informative steps, as well as incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewriting, and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL, and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5-VL 3B, achieves top-1 performance on the Open LMM Reasoning Leaderboard among 4B-scale LVLMs, surpassing the previous state of the art by 1.8%. We hope our findings provide valuable insights into developing reasoning-capable LVLMs and can inform future research in this area.

🏭 VLAA-Thinking Data Generation

Data generation pipeline. We first generate initial reasoning traces by feeding detailed captions and visual questions into DeepSeek-R1. These outputs are then rewritten for improved fluency and verified for correctness using a GPT-based verifier. The resulting data is split into VLAA-Thinking-SFT and VLAA-Thinking-RL.
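For concreteness, below is a minimal sketch of how such a caption-then-distill-then-rewrite-then-verify flow could be orchestrated. The helper callables (captioner, reasoner, rewriter, verifier) and their prompts are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of the caption -> distill -> rewrite -> verify flow; the
# helper callables and their prompts are assumptions, not the actual code.

def build_example(image, question, captioner, reasoner, rewriter, verifier):
    """Run one (image, question) pair through the generation pipeline."""
    # 1) Caption the image so a text-only reasoner (e.g., DeepSeek-R1) can "see" it.
    caption = captioner(image)

    # 2) Distill a step-by-step reasoning trace from the caption and the question.
    trace = reasoner(f"Caption: {caption}\nQuestion: {question}")

    # 3) Rewrite the trace for fluency, e.g., removing references to the caption
    #    so the reasoning reads as if grounded directly in the image.
    trace = rewriter(trace)

    # 4) Verify the final answer with a GPT-based checker; drop the sample on failure.
    if not verifier(question, trace):
        return None
    return {"image": image, "question": question, "reasoning": trace}
```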

📚 VLAA-Thinking Dataset Card


| Name | Data Type | # Original | # Pipeline | # Final SFT | # Final RL |
|---|---|---|---|---|---|
| CLEVR_Math | Closed-end | 35,000 | 28,018 | 5,923 | 2,000 |
| GeoQA170K | Closed-end | - | - | - | 6,499 |
| Math PUMA | Closed-end | 30,000 | 26,672 | 19,258 | 6,696 |
| ArxivQA | Closed-end | 54,399 | 51,348 | 34,604 | 1,000 |
| DocVQA | Closed-end | 10,194 | 8,206 | 4,897 | 1,000 |
| VizWiz | Closed-end | 20,523 | 6,528 | 4,266 | 1,000 |
| ALLaVA-LAION | Open-end | 47,066 | 18,123 | 10,496 | 3,000 |
| LLaVA-CoT-COCO | Closed-end | 3,000 | 3,000 | 8,727 | 2,000 |
| LLaVA-CoT-VisualGenome | Closed-end | 3,000 | 3,000 | 38,242 | 2,000 |
| Total | Closed- & Open-end | 203,182 | 144,895 | 126,413 | 25,195 |

Data statistics of VLAA-Thinking. We report the original volume of metadata (#Original), the data size after the distillation pipeline (#Pipeline), and the sizes of the sampled examples for SFT (#Final SFT) and RL (#Final RL), respectively. Note that GeoQA170K, whose answers are verifiable, is used only for the RL split.
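Both splits can be consumed as plain JSON records; the file names below are placeholders for illustration, so check the repository for the actual data paths.

```python
# Hypothetical loader for the released splits; the file names below are
# placeholders, not the repository's actual paths.
import json

def load_split(path):
    """Load a JSON list of {image, question, reasoning, answer} records."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

sft_data = load_split("vlaa_thinking_sft.json")  # expected ~126,413 traces
rl_data = load_split("vlaa_thinking_rl.json")    # expected ~25,195 examples
print(len(sft_data), len(rl_data))
```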

Examples

⚔️ SFT vs RL for Reasoning

Pseudo vs Native Reasoning

Examples from LVLMs trained with different strategies for reasoning.

Left: response from a model trained with SFT, showing pseudo reasoning traces and a number of pseudo self-reflective cues (i.e., aha-moments) imitated from R1.

Right: response from a model trained with RL (GRPO), showing native reasoning ability and authentic aha-moments that emerged from RL training.

Wrong reasoning steps are colored red and aha-moments are highlighted.

SFT Performance

Relative performance change (in %) of different models trained with supervised fine-tuning (SFT) only.

SFT+GRPO vs GRPO Performance

Impact of applying SFT with 5K and 10K samples before GRPO: SFT jeopardizes subsequent GRPO performance.

💡 GRPO with Mixed Reward

Mixed Reward Module

Mixed reward module. The proposed framework comprises two reward formats (rule-based and open-ended) and five types of verifiable rewards (digit, MCQ, math, IoU, and general reasoning).
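To make the reward routing concrete, here is a minimal sketch of a dispatcher over the five verifiable reward types, together with the group-relative advantage normalization used by GRPO. The answer-type keys, thresholds, and helper functions are our assumptions for illustration, not the paper's exact reward implementation.

```python
# Sketch of a mixed reward dispatcher and GRPO-style advantage normalization.
# The routing keys, thresholds, and helpers below are illustrative assumptions.
import re
import statistics

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def normalize_math(expr):
    """Crude normalization for math answers (strip whitespace and LaTeX wrappers)."""
    return re.sub(r"\s+|\\left|\\right|\$", "", str(expr))

def parse_box(text):
    """Pull the first four numbers out of a predicted bounding-box string."""
    return [float(x) for x in re.findall(r"-?\d+\.?\d*", text)][:4]

def mixed_reward(answer_type, prediction, reference, judge=None):
    """Route a completion to one of the five verifiable rewards."""
    if answer_type == "digit":   # rule-based: exact numeric match
        return float(prediction.strip() == str(reference))
    if answer_type == "mcq":     # rule-based: option letter match
        return float(prediction.strip().upper().startswith(str(reference).upper()))
    if answer_type == "math":    # rule-based: normalized expression match
        return float(normalize_math(prediction) == normalize_math(reference))
    if answer_type == "iou":     # rule-based: predicted box overlaps ground truth
        return float(iou(parse_box(prediction), reference) > 0.5)
    # open-ended general reasoning: score with an external judge model
    return judge(prediction, reference) if judge is not None else 0.0

def group_relative_advantages(rewards):
    """GRPO advantages: standardize rewards within a group of rollouts."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

In GRPO, each question's group of sampled responses would be scored with a reward function like `mixed_reward`, and the standardized scores from `group_relative_advantages` would then serve as the advantages for the policy update.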

Performance

Benchmark performance. Evaluation on six math reasoning benchmarks hosted by the Open LMM Reasoning Leaderboard. VLAA-Thinker models significantly outperform baselines and other competing models.

Acknowledgement

We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.

BibTeX


@misc{vl-thinking2025,
    title={SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},
    author={Hardy Chen and Haoqin Tu and Fali Wang and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
    year={2025},
    publisher={GitHub},
    journal={GitHub repository},
    howpublished={\url{https://github.com/UCSC-VLAA/VLAA-Thinking}},
}