Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains


Juncheng Wu, Sheng Liu, Haoqin Tu, Hang Yu, Xiaoke Huang, James Zou, Cihang Xie, Yuyin Zhou

UC Santa Cruz, Stanford University, Tongji University

Abstract

Recent advances in reasoning-enhanced Large Language Models (LLMs) such as OpenAI-o1/o3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing thinking trajectories into knowledge and reasoning parts. We propose a fine-grained evaluation framework that judges: (1) the correctness of the knowledge used, and (2) the quality of the reasoning. To quantify these, we introduce two novel metrics: Knowledge Index (KI) for knowledge accuracy and Information Gain (InfoGain) for reasoning informativeness. We conduct a case study on R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains, which uncovers several key insights: (1) the general reasoning abilities of R1-distilled models do not transfer effectively to the medical domain through either SFT or RL; (2) while SFT improves final accuracy in both domains, it often compromises reasoning, as reflected by InfoGain scores that average 38.9% lower than those of untrained models; nevertheless, SFT remains essential in the medical domain, where domain knowledge is critical for accuracy; and (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness. We hope this work encourages future research toward more reliable LLM reasoning.

Evaluation Pipeline

Knowledge and Reasoning Are Two Distinct Evaluation Aspects

Figure 1. A reasoning step may effectively reduce uncertainty toward the final answer despite relying on incorrect knowledge (e.g., Step 3), or it may present factually correct but irrelevant/redundant knowledge that hinders reasoning efficiency (e.g., Step 4). Accuracy alone fails to capture these nuances. We introduce two complementary metrics that separately evaluate knowledge correctness and reasoning informativeness.

Step-by-Step Reasoning Evaluation

Figure 2. Our evaluation pipeline: (a) we decompose the model’s reasoning into discrete steps; (b) Information Gain measures how much each reasoning step reduces uncertainty toward the final answer, calculated as the probability gap between adjacent response steps; and (c) Knowledge Index measures the factual correctness of each step by verifying extracted knowledge against external ground-truth sources.
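
To make the two metrics concrete, here is a minimal Python sketch of how they could be computed from a decomposed trajectory. This is an illustration under stated assumptions, not the paper's implementation: the per-prefix answer probabilities and the verify_claim fact-checker (stand-ins for the model-based scorer and the external ground-truth verification described above) are hypothetical.

from typing import Callable, List

def info_gain(answer_probs: List[float]) -> List[float]:
    """Per-step Information Gain: the probability gap between adjacent
    reasoning prefixes. answer_probs[t] is the probability assigned to
    the ground-truth answer after steps 1..t (answer_probs[0] conditions
    on the question alone)."""
    return [answer_probs[t] - answer_probs[t - 1]
            for t in range(1, len(answer_probs))]

def knowledge_index(claims_per_step: List[List[str]],
                    verify_claim: Callable[[str], bool]) -> float:
    """Knowledge Index: fraction of extracted knowledge claims judged
    correct against external ground-truth sources."""
    claims = [c for step in claims_per_step for c in step]
    return sum(verify_claim(c) for c in claims) / len(claims) if claims else 0.0

# Toy run mirroring Figure 1: step 3 has the largest InfoGain even though
# its claim fails verification, while step 4 is factually fine but adds
# almost nothing to the answer probability.
probs = [0.20, 0.35, 0.40, 0.70, 0.71]                   # hypothetical scores
print(info_gain(probs))                                  # ~[0.15, 0.05, 0.30, 0.01]
print(knowledge_index([["a"], ["b"], ["c"], ["d"]],
                      verify_claim=lambda c: c != "c"))  # 0.75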

Knowledge and Reasoning in Medical Reasoning

Knowledge Index and Information Gain exhibit distinct trends

Table 1. While the reasoning abilities of the two Qwen-Base variants remain comparable, the RL-ed Qwen-Base demonstrates slightly better medical knowledge according to our KI metric (64.2 vs. 63.4). Conversely, although the RL-enhanced Qwen-R1 shows a minor improvement over its SFT-only version in general performance, it underperforms in knowledge evaluation, trailing its SFT-only counterpart by 2.2 points on the KI metric (54.3).

The difficulty of different benchmarks may stem from different aspects

Figure 3. Correlations between the two proposed metrics and accuracy. Different tasks demand different levels of knowledge and/or reasoning capability from LLMs.
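
As a rough illustration of how to read these correlations, below is a small sketch that computes Pearson correlations between per-benchmark accuracy and each metric; every number is a placeholder, not the paper's data.

from statistics import correlation  # Python 3.10+

# Hypothetical per-benchmark scores (placeholders, not the paper's data).
accuracy = [0.62, 0.71, 0.55, 0.80]
ki       = [0.60, 0.66, 0.58, 0.74]   # Knowledge Index per benchmark
infogain = [0.12, 0.09, 0.15, 0.08]   # mean per-step Information Gain

# A benchmark whose accuracy tracks KI more tightly than InfoGain is
# knowledge-bound; the reverse pattern suggests it is reasoning-bound.
print(correlation(accuracy, ki))        # Pearson r, accuracy vs. knowledge
print(correlation(accuracy, infogain))  # Pearson r, accuracy vs. reasoning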

⚔️ SFT vs RL for Reasoning in Different Domains

SFT improves final accuracy while compromising reasoning efficiency; RL improves both

Figure 4. Both SFT and RL improve final accuracy: the RL-ed model achieves higher accuracy on math tasks, while the SFT-ed model achieves higher accuracy on medical tasks. However, SFT reduces reasoning efficiency (lower Information Gain) in both domains, whereas RL yields consistent improvements.

SFT remains crucial in the medical domain by providing domain knowledge

Figure 5. In knowledge-intensive tasks like medical reasoning, SFT is essential: it supplies the necessary domain knowledge and leads to higher Knowledge Index (KI) scores.

Challenge of Reasoning Across Domains

Trained Qwen-Base outperforms its R1-distilled counterpart

Table 2. Qwen-Base consistently outperforms the R1-distilled variant across the evaluated benchmarks, whether using SFT alone or in combination with subsequent RL.

Examples

Acknowledgement

This work was partially funded by an unrestricted gift from Google. We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.

BibTeX


@misc{wu2025knowledgereasoningcloselook,
      title={Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains}, 
      author={Juncheng Wu and Sheng Liu and Haoqin Tu and Hang Yu and Xiaoke Huang and James Zou and Cihang Xie and Yuyin Zhou},
      year={2025},
      eprint={2506.02126},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.02126}, 
}