ICML 2026

From Seeing to Thinking

Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

See first, then think. Visual perception — not reasoning length — is the dominant bottleneck for VLMs. We post-train along a new capability axis, orthogonal to the classic difficulty curriculum.

Juncheng Wu1,2, Hardy Chen2, Haoqin Tu2, Xianfeng Tang, Freda Shi3,4, Hui Liu, Hanqing Lu1, Cihang Xie2, Yuyin Zhou2

1Amazon  ·  2UC Santa Cruz  ·  3University of Waterloo  ·  4Vector Institute  ·  Corresp. jwu418@ucsc.edu

86.9%
of VLM errors stem from perception, not reasoning
+1.46 pts
Math accuracy over merged baseline
−20.8%
Shorter reasoning traces
4 / 4
Backbones improved over merged

Abstract

Recent advances in VLMs emphasize long chain-of-thought reasoning; yet we find that performance on visual tasks is limited primarily by weak visual perception, not by reasoning. We systematically decompose VLM post-training into three capability stages (visual perception, textual reasoning, visual reasoning) and report three findings that change how VLMs should be post-trained:

  • (a) Perception needs its own data. Reasoning data alone introduces a "perceptual tax"; dedicated perception samples lift visual benchmarks without sacrificing reasoning gains.
  • (b) Capability is a new curriculum axis. Beyond difficulty, training data can be ordered by the capability it targets. The two axes are orthogonal and additive: combining them lifts Qwen3-VL-8B from 58.56 → 62.99.
  • (c) Stage order matters. Perception → textual reasoning → visual reasoning works; reversing it (visual → textual → perception) drops visual-math accuracy by 4.6 pts. Perception is the scaffold that has to come first.

Across four backbones the staged recipe lifts accuracy while producing 20.8% shorter reasoning traces — better seeing means less thinking.

Teaser: longer thinking cannot fix incorrect perception
Longer thinking cannot fix incorrect perception. Re-checking the image during reasoning leads to the same perception error (Case A). When perception is correct, reasoning is concise and converges quickly (Case B).

Three findings

What changed our minds about VLM post-training.

(a)

Perception needs its own data

Reasoning data alone introduces a "perceptual tax"; adding dedicated perception samples lifts RealWorldQA by +3.7% and stabilizes MMStar.

(b)

Capability — a new curriculum axis

Beyond difficulty, training data can be ordered by the capability it targets. The two axes are orthogonal and additive: combining them lifts Qwen3-VL-8B from 58.56 → 62.99.

(c)

Stage order matters

Perception → textual reasoning → visual reasoning is the right order. Flipping it to visual reasoning → textual reasoning → perception drops the visual-math average by 4.6 pts on Qwen2.5-VL-7B — perception is the scaffold that has to come first.

Method

Three stages, one unified prompt, GRPO end-to-end.

Pipeline: Stage 1 perception → Stage 2 textual reasoning → Stage 3 visual reasoning
We synthesize perception data from DOCCI captions, keeping only samples that a strong VLM fails to answer from the image but solves from the caption (sketched below), then run GRPO across the three stages with a single shared system prompt.
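
A minimal sketch of the filtering criterion, in Python. The answer_fn interface and the exact-match check are illustrative assumptions, not the repo's actual API; the real pipeline and prompts live in the GitHub repo.

from typing import Any, Callable

# Hypothetical interface: answer_fn(question, context) -> model answer,
# where context is either the raw image or the DOCCI caption text.
AnswerFn = Callable[[str, Any], str]

def normalize(text: str) -> str:
    """Crude answer normalization; the real pipeline may use a judge model instead."""
    return text.strip().lower()

def is_perception_failure(answer_fn: AnswerFn, question: str, gold: str,
                          image: Any, caption: str) -> bool:
    """Keep a sample only if the VLM fails from the pixels but succeeds from the words,
    so the failure is attributable to perception rather than reasoning or knowledge."""
    wrong_from_image = normalize(answer_fn(question, image)) != normalize(gold)
    right_from_caption = normalize(answer_fn(question, caption)) == normalize(gold)
    return wrong_from_image and right_from_caption

# perception_data = [
#     s for s in candidates
#     if is_perception_failure(strong_vlm, s["question"], s["answer"], s["image"], s["caption"])
# ]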

Staged > Merged across four backbones

Same data budget, same total steps. The only thing that changes is the schedule.

Merged baseline — common practice

All three capability datasets (perception ∪ textual reasoning ∪ visual reasoning) are pooled and shuffled, then trained jointly with GRPO for the same total number of steps as the staged recipe.

Staged (ours) · capability-axis curriculum

The same data, run sequentially as perception → textual reasoning → visual reasoning, with each stage trained for the same number of epochs.
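
For concreteness, a minimal sketch of the two schedules. The train callable stands in for one GRPO training run and is a placeholder, not the repo's actual API.

import random
from typing import Callable, Sequence

TrainFn = Callable[[Sequence, int], None]   # placeholder for one GRPO run: (data, epochs)

def merged_schedule(train: TrainFn, perception, text_reasoning, visual_reasoning, epochs: int):
    """Common practice: pool and shuffle all three capability datasets, train jointly."""
    pooled = list(perception) + list(text_reasoning) + list(visual_reasoning)
    random.shuffle(pooled)
    train(pooled, epochs)                    # same total number of steps as the staged run

def staged_schedule(train: TrainFn, perception, text_reasoning, visual_reasoning, epochs: int):
    """Ours: identical data and budget, run sequentially along the capability axis."""
    for stage in (perception, text_reasoning, visual_reasoning):
        train(stage, epochs)                 # perception -> textual reasoning -> visual reasoning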


Setting · Visual Math AVG · Perception AVG · Overall AVG

Visual math = MathVista / MathVision / MathVerse(VI) / WeMath. Perception = A-OKVQA / RealWorldQA / MMStar / POPE. Best per column in bold.

A new curriculum dimension

Capability axis × Difficulty axis — orthogonal, additive.

Capability ✗ · Difficulty ✗: 58.56 (merged baseline)
Capability ✓ · Difficulty ✗: 60.53 (our staged recipe)
Capability ✗ · Difficulty ✓: 60.36 (prior curriculum work)
Capability ✓ · Difficulty ✓: 62.99 (additive sweet spot)
Average over 7 benchmarks on Qwen3-VL-8B (paper §4.5).

Curriculum learning has historically meant ordering training samples by difficulty. Our staged recipe surfaces a second, orthogonal axis: what capability each batch trains.

Empirically the two axes stack additively — combining them lifts Qwen3-VL-8B from 58.56 to 62.99, beating either axis alone by > 2 pts. This reframes post-training as choosing a trajectory through a 2D space rather than along a single difficulty line.

Launch scripts → training/examples/curriculum/
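
A sketch of how the two axes could compose: the outer loop walks the capability stages, and each stage is ordered easy to hard. The per-sample difficulty score is an assumption for illustration; the actual criterion is in the launch scripts above.

from typing import Any, Callable, Sequence

def two_axis_curriculum(stages: Sequence[tuple[str, Sequence[Any]]],
                        difficulty: Callable[[Any], float]):
    """Outer order = capability axis; inner order = difficulty axis (easy -> hard)."""
    return [(name, sorted(samples, key=difficulty)) for name, samples in stages]

# schedule = two_axis_curriculum(
#     [("perception", perception_data),
#      ("textual_reasoning", text_data),
#      ("visual_reasoning", visual_data)],
#     difficulty=lambda s: s["difficulty"],   # e.g. 1 - pass rate of a reference model (assumption)
# )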

Better perception ⇒ shorter reasoning

Same problem, two trajectories. Staged sees correctly the first time and stops; merged loops on a wrong perception.

Side-by-side case study: merged training keeps re-checking the image with the same perception error; staged training perceives correctly and answers concisely.
Merged training assigns the side of length 73 to the wrong angle and re-checks the image repeatedly with the same error. Staged training perceives correctly and the chain of thought terminates quickly.
Validation response length over training steps: staged training drops below merged in Stage 3
Average validation response length on Qwen3-VL-8B. Staged matches merged in Stage 2 and diverges in Stage 3, ending −20.8% shorter while accuracy is +1.46 pts higher.

Resources

Everything needed to reproduce the paper.

GitHub
UCSC-VLAA/VLM-CapCurriculum
Data pipeline, training scripts, evaluation patches, quickstart for Qwen3-VL-8B-Staged.
🤗 Collection
VLM-CapCurriculum (ICML 2026)
All four staged-training models and three capability-stage datasets in one hub.
model
Qwen3-VL-8B-Staged
model
Qwen2.5-VL-7B-Staged
model
InternVL3-8B-Staged
model
InternVL3.5-8B-Staged
dataset
Perception-Data
dataset
TextReasoning-Data
dataset
VisualReasoning-Data
roadmap
What's still pending

Cite

@inproceedings{vlmcapcurriculum2026,
  title     = {From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models},
  author    = {Juncheng Wu and Hardy Chen and Haoqin Tu and Xianfeng Tang and Freda Shi and Hui Liu and Hanqing Lu and Cihang Xie and Yuyin Zhou},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}

Final bibliographic details will be updated once the camera-ready version is available; see the Roadmap.