MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning


1University of Pennsylvania, 2University of Illinois Urbana-Champaign, 3UC Santa Cruz

MedFrameQA introduces multi-image, clinically grounded questions that require comprehensive reasoning across all images. Unlike prior benchmarks such as SLAKE and MedXpertQA, it emphasizes diagnostic complexity, expert-level knowledge, and explicit reasoning chains.

Abstract

Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA, the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multi-stage filtering strategy, including model-based and manual review, that preserves data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced Multimodal LLMs, both proprietary and open source, with and without explicit reasoning modules, on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50%, and that accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work catalyzes research on clinically grounded, multi-image reasoning and accelerates progress toward more capable diagnostic AI systems.

Comparison of MedFrameQA with Existing Benchmarks

Table 1. MedFrameQA supports multi-image reasoning within real-world clinical video scenarios and paired reasoning across frames. The paired reasoning in MedFrameQA is derived from the transcripts of the original video clips.

MedFrameQA Pipeline

Figure 1. The MedFrameQA generation pipeline consists of four stages: (a) Medical Video Collection: collecting 3,420 medical videos via clinical search queries; (b) Frame-Caption Pairing: extracting keyframes and aligning them with transcribed captions; (c) Multi-Frame Merging: merging clinically related frame-caption pairs into multi-frame clips; (d) Question-Answer Generation: generating multi-image VQA pairs from the multi-frame clips.
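
The authors' implementation is not reproduced here; the sketch below illustrates stages (b) and (c) under our own assumptions: OpenCV for video decoding, a crude frame-difference heuristic for keyframe selection, and a transcript supplied as timestamped segments. The function names and the segment format are illustrative, not the paper's code.

import cv2
import numpy as np

def extract_keyframes(video_path, diff_threshold=30.0, min_gap_s=2.0):
    # Keep a frame when it differs enough from the previously kept frame
    # (a crude scene-change heuristic) and at least min_gap_s has elapsed.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    keyframes, prev_gray, last_t, idx = [], None, float("-inf"), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = idx / fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or (t - last_t >= min_gap_s
                                 and np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold):
            keyframes.append((t, frame))
            prev_gray, last_t = gray, t
        idx += 1
    cap.release()
    return keyframes

def pair_frames_with_captions(keyframes, segments):
    # segments: e.g. [{"start": 12.0, "end": 19.5, "text": "axial T2 shows ..."}]
    pairs = []
    for t, frame in keyframes:
        seg = next((s for s in segments if s["start"] <= t <= s["end"]), None)
        if seg is not None:
            pairs.append({"time": t, "frame": frame, "caption": seg["text"]})
    return pairs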

Data Distribution of MedFrameQA

Figure 2. The data distribution of MedFrameQA. Panel (a) shows the distribution across body systems; (b) presents the distribution across organs; (c) shows the distribution across imaging modalities; (d) provides a word cloud of keywords in MedFrameQA; and (e) reports the distribution of frame counts per question.

Accuracy by Human Body System on MedFrameQA

Table 2. Accuracy of ten advanced Multimodal LLMs on MedFrameQA, broken down by body system. Overall, all assessed models demonstrate persistently low accuracy on MedFrameQA, and the system-wise results reveal substantial variability in task difficulty. We report results for nine systems: Central Nervous System (CNS), Respiratory System (RES), Circulatory System (CIR), Digestive System (DIG), Urinary System (URI), Reproductive System (REP), Endocrine System (END), Musculoskeletal System (MSK), and Auxiliary (AUX), along with their average accuracy (%).
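
For context on how such a multi-image question can be posed to a proprietary Multimodal LLM, the sketch below sends one question with its frames through an OpenAI-compatible chat API and extracts the chosen option letter with a regex. The prompt format, model name, and answer parsing are our own assumptions, not the official evaluation harness.

import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_multi_image(question, options, image_urls, model="gpt-4o"):
    # Build one user message containing the question text, the lettered
    # options, and every frame of the question as an image attachment.
    prompt = (question + "\n"
              + "\n".join(f"{k}. {v}" for k, v in options.items())
              + "\nAnswer with a single option letter.")
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    reply = resp.choices[0].message.content
    match = re.search(r"\b([A-E])\b", reply)  # first standalone option letter
    return match.group(1) if match else None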

Accuracy by Modality and Frame Count on MedFrameQA

Table 3. We report model accuracy on MedFrameQA questions grouped by frame count (with standard deviation, SD) and by imaging modality. We empirically observe that accuracy fluctuates with increasing frame count and varies significantly across common imaging modalities.
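
As a concrete illustration of this grouping, the snippet below computes per-frame-count accuracy with a standard deviation taken across models (our interpretation of the SD column), using a hypothetical results table; the column names and toy values are illustrative only, not the released data schema.

import pandas as pd

# Hypothetical per-question results: one row per (model, question).
results = pd.DataFrame({
    "model":       ["o1", "o1", "GPT-4o", "GPT-4o"],
    "frame_count": [2, 3, 2, 3],
    "modality":    ["MRI", "CT", "MRI", "CT"],
    "correct":     [1, 0, 0, 1],
})

# Accuracy (%) of each model at each frame count.
per_model = (results.groupby(["frame_count", "model"])["correct"]
                    .mean().mul(100).unstack("model"))

# Mean accuracy and SD across models for each frame count.
summary = pd.DataFrame({"mean_acc": per_model.mean(axis=1),
                        "std_acc": per_model.std(axis=1)})
print(summary)

# Accuracy (%) grouped by imaging modality.
print(results.groupby(["modality", "model"])["correct"].mean().mul(100))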

Failure Case Study of o1 on MedFrameQA

Figure 3. Failure case study of o1 on MedFrameQA: neglect of important information across multiple frames. In this case, o1 overlooked critical features in the second and third frames, which ultimately led to the selection of an incorrect answer.

Figure 4. Failure case study of o1 on MedFrameQA. A mistake originating from a single image can result in significant errors in subsequent reasoning. In this case, o1 made a directional error when interpreting the first frame, which propagated through its reasoning process and ultimately led to an incorrect answer.

Examples of Questions with 2-5 Input Images in MedFrameQA

Acknowledgments

We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.

BibTeX


@article{yu2025medframeqamultiimagemedicalvqa,
  title={MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning},
  author={Yu, Suhao and Wang, Haojin and Wu, Juncheng and Xie, Cihang and Zhou, Yuyin},
  journal={arXiv preprint arXiv:2505.16964},
  year={2025}
}