ICML 2026

From Seeing to Thinking

Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

See first, then think. Visual perception — not reasoning length — is the dominant bottleneck for VLMs. We post-train along a new capability axis, orthogonal to the classic difficulty curriculum.

Juncheng Wu1,2, Hardy Chen2, Haoqin Tu2, Xianfeng Tang, Freda Shi3,4, Hui Liu, Hanqing Lu1, Cihang Xie2, Yuyin Zhou2

1Amazon  ·  2UC Santa Cruz  ·  3University of Waterloo  ·  4Vector Institute  ·  Corresp. jwu418@ucsc.edu

86.9%
of VLM errors stem from perception, not reasoning
+1.46 pts
Math accuracy over merged baseline
−20.8%
Shorter reasoning traces
4 / 4
Backbones improved over merged

Abstract

Recent advances in VLMs emphasize long chain-of-thought reasoning; yet we find that performance on visual tasks is limited primarily by weak visual perception, not by reasoning. We systematically decompose VLM post-training into three capability stages (visual perception, textual reasoning, visual reasoning) and report three findings that change how VLMs should be post-trained:

  • (a) Perception needs its own data. Reasoning data alone introduces a "perceptual tax"; dedicated perception samples lift visual benchmarks without sacrificing reasoning gains.
  • (b) Capability is a new curriculum axis. Beyond difficulty, training data can be ordered by the capability it targets. The two axes are orthogonal and additive: combining them lifts Qwen3-VL-8B from 58.56 → 62.99.
  • (c) Stage order matters. Perception → textual reasoning → visual reasoning works; reversing it (visual → textual → perception) drops visual-math accuracy by 4.6 pts. Perception is the scaffold that has to come first.

Across four backbones the staged recipe lifts accuracy while producing 20.8% shorter reasoning traces — better seeing means less thinking.

Teaser: longer thinking cannot fix incorrect perception
Longer thinking cannot fix incorrect perception. Re-checking the image during reasoning leads to the same perception error (Case A). When perception is correct, reasoning is concise and converges quickly (Case B).

Three findings

What changed our minds about VLM post-training.

(a)

Perception needs its own data

Reasoning data alone introduces a "perceptual tax"; adding dedicated perception samples lifts RealWorldQA by +3.7% and stabilizes MMStar.

(b)

Capability — a new curriculum axis

Beyond difficulty, training data can be ordered by the capability it targets. The two axes are orthogonal and additive: combining them lifts Qwen3-VL-8B from 58.56 → 62.99.

(c)

Stage order matters

Perception → textual reasoning → visual reasoning is the right order. Flipping it to visual reasoning → textual reasoning → perception drops the visual-math average by 4.6 pts on Qwen2.5-VL-7B — perception is the scaffold that has to come first.

Method

Three stages, one unified prompt, GRPO end-to-end.

Pipeline: Stage 1 perception → Stage 2 textual reasoning → Stage 3 visual reasoning
We synthesize perception data from DOCCI captions, keeping only samples that a strong VLM fails to answer from the image but solves from the caption (sketched below), then run GRPO across the three stages with a single shared system prompt.
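
A minimal sketch of the filtering criterion, in Python. The answer_fn interface and the exact-match check are illustrative assumptions, not the repo's actual API; the real pipeline and prompts live in the GitHub repo.

from typing import Any, Callable

# Hypothetical interface: answer_fn(question, context) -> model answer,
# where context is either the raw image or the DOCCI caption text.
AnswerFn = Callable[[str, Any], str]

def normalize(text: str) -> str:
    """Crude answer normalization; the real pipeline may use a judge model instead."""
    return text.strip().lower()

def is_perception_failure(answer_fn: AnswerFn, question: str, gold: str,
                          image: Any, caption: str) -> bool:
    """Keep a sample only if the VLM fails from the pixels but succeeds from the words,
    so the failure is attributable to perception rather than reasoning or knowledge."""
    wrong_from_image = normalize(answer_fn(question, image)) != normalize(gold)
    right_from_caption = normalize(answer_fn(question, caption)) == normalize(gold)
    return wrong_from_image and right_from_caption

# perception_data = [
#     s for s in candidates
#     if is_perception_failure(strong_vlm, s["question"], s["answer"], s["image"], s["caption"])
# ]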

Staged > Merged across four backbones

Same data budget, same total steps. The only thing that changes is the schedule.

Merged baseline — common practice

All three capability datasets (perception ∪ textual reasoning ∪ visual reasoning) are pooled and shuffled, then trained jointly with GRPO for the same total number of steps as the staged recipe.

Staged (ours) · capability-axis curriculum

The same data, run sequentially as perception → textual reasoning → visual reasoning, with each stage trained for the same number of epochs.
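
For concreteness, a minimal sketch of the two schedules. The train callable stands in for one GRPO training run and is a placeholder, not the repo's actual API.

import random
from typing import Callable, Sequence

TrainFn = Callable[[Sequence, int], None]   # placeholder for one GRPO run: (data, epochs)

def merged_schedule(train: TrainFn, perception, text_reasoning, visual_reasoning, epochs: int):
    """Common practice: pool and shuffle all three capability datasets, train jointly."""
    pooled = list(perception) + list(text_reasoning) + list(visual_reasoning)
    random.shuffle(pooled)
    train(pooled, epochs)                    # same total number of steps as the staged run

def staged_schedule(train: TrainFn, perception, text_reasoning, visual_reasoning, epochs: int):
    """Ours: identical data and budget, run sequentially along the capability axis."""
    for stage in (perception, text_reasoning, visual_reasoning):
        train(stage, epochs)                 # perception -> textual reasoning -> visual reasoning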


Setting · Visual Math AVG · Perception AVG · Overall AVG

Visual math = MathVista / MathVision / MathVerse(VI) / WeMath. Perception = A-OKVQA / RealWorldQA / MMStar / POPE. Best per column in bold.

A new curriculum dimension

Capability axis × Difficulty axis — orthogonal, additive.

Capability ✗ · Difficulty ✗: 58.56 (merged baseline)
Capability ✓ · Difficulty ✗: 60.53 (our staged recipe)
Capability ✗ · Difficulty ✓: 60.36 (prior curriculum work)
Capability ✓ · Difficulty ✓: 62.99 (additive sweet spot)
Average over 7 benchmarks on Qwen3-VL-8B (paper §4.5).

Curriculum learning has historically meant ordering training samples by difficulty. Our staged recipe surfaces a second, orthogonal axis: what capability each batch trains.

Empirically the two axes stack additively — combining them lifts Qwen3-VL-8B from 58.56 to 62.99, beating either axis alone by > 2 pts. This reframes post-training as choosing a trajectory through a 2D space rather than along a single difficulty line.

Launch scripts → training/examples/curriculum/
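
A sketch of how the two axes could compose: the outer loop walks the capability stages, and each stage is ordered easy to hard. The per-sample difficulty score is an assumption for illustration; the actual criterion is in the launch scripts above.

from typing import Any, Callable, Sequence

def two_axis_curriculum(stages: Sequence[tuple[str, Sequence[Any]]],
                        difficulty: Callable[[Any], float]):
    """Outer order = capability axis; inner order = difficulty axis (easy -> hard)."""
    return [(name, sorted(samples, key=difficulty)) for name, samples in stages]

# schedule = two_axis_curriculum(
#     [("perception", perception_data),
#      ("textual_reasoning", text_data),
#      ("visual_reasoning", visual_data)],
#     difficulty=lambda s: s["difficulty"],   # e.g. 1 - pass rate of a reference model (assumption)
# )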

Better perception ⇒ shorter reasoning

Same problem, two trajectories. Staged sees correctly the first time and stops; merged loops on a wrong perception.

Side-by-side case study: merged training keeps re-checking the image with the same perception error; staged training perceives correctly and answers concisely.
Merged training assigns the side of length 73 to the wrong angle and re-checks the image repeatedly with the same error. Staged training perceives correctly and the chain of thought terminates quickly.
Validation response length over training steps: staged training drops below merged in Stage 3
Average validation response length on Qwen3-VL-8B. Staged matches merged in Stage 2 and diverges in Stage 3, ending −20.8% shorter while accuracy is +1.46 pts higher.

Resources

Everything needed to reproduce the paper.

GitHub
UCSC-VLAA/VLM-CapCurriculum
Data pipeline, training scripts, evaluation patches, quickstart for Qwen3-VL-8B-Staged.
🤗 Collection
VLM-CapCurriculum (ICML 2026)
All four staged-training models and three capability-stage datasets in one hub.
model
Qwen3-VL-8B-Staged
model
Qwen2.5-VL-7B-Staged
model
InternVL3-8B-Staged
model
InternVL3.5-8B-Staged
dataset
Perception-Data
dataset
TextReasoning-Data
dataset
VisualReasoning-Data
roadmap
What's still pending

Cite

@inproceedings{vlmcapcurriculum2026,
  title     = {From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models},
  author    = {Juncheng Wu and Hardy Chen and Haoqin Tu and Xianfeng Tang and Freda Shi and Hui Liu and Hanqing Lu and Cihang Xie and Yuyin Zhou},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}

Final bibliographic details will be updated once the camera-ready version is available; see the Roadmap.