ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Abstract

LLM-based clinical agents have largely been evaluated on pre-curated patient context — a setting that hides what is arguably the hardest part of clinical reasoning: deciding where to look, what to retrieve, and how to integrate evidence across modalities. We introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking in clinical decision support, and ship it with three concrete artifacts:

(a) ClinSeekAgent (the pipeline). 20 tools spanning raw ehr.* tables, browser.* search, and image.* CXR analysis. The agent decides which to invoke and when to stop.
(b) ClinSeek-Bench. Each example comes in two paired forms: Curated Input (the source benchmark's pre-selected evidence) and Automated Evidence-Seeking (only the patient ID, the prediction-time cutoff, and the tools). Same task, same label; only the access pattern changes.
(c) ClinSeek-35B-A3B. Qwen3.5-35B-A3B fine-tuned (SFT) on Claude Opus 4.6 trajectories collected from ClinSeekAgent. Open-source state of the art on AgentEHR-Bench, reaching 94.4% of the teacher's average F1.

On ClinSeek-Bench, ClinSeekAgent lifts Claude Opus 4.6 multimodal F1 from 47.5 → 62.6 (+15.1), improves 5/6 evaluated multimodal agents overall, and improves 7/9 agents on text-only risk prediction. The student inherits not just answers but tool-use behaviour: post-SFT, ehr.run_sql_query usage jumps from 2.0% → 12.5%.

Three findings

What changes once the agent has to find its own evidence.

(a)

Active search > pre-curated context

When the agentic model has tools, it finds signals the benchmark's hand-picked context missed. Claude Opus 4.6: +15.1 F1 on multimodal, +3.2 on text-only overall. MiniMax M2.5: +4.2. Risk prediction improves for 7 / 9 evaluated agents.

(b)

Effectiveness scales with agentic capability

The gains concentrate in agents that already plan and use tools well — Opus 4.6 and Sonnet 4.6 improve on both text and multimodal; several smaller OSS agents (Kimi K2.5, GLM-4.7) regress when handed the raw tool space. The pipeline works through the agent's own skill rather than as a substitute for it — which raises the natural question: can we transfer that skill into a smaller open model?

(c)

Yes — distillable into an open student

We collect ClinSeekAgent trajectories from a strong teacher (Opus 4.6) and SFT them into Qwen3.5-35B-A3B. The resulting ClinSeek-35B-A3B reaches 34.0 avg F1 on AgentEHR-Bench — +11.9 over its base, ahead of Kimi K2.5 (29.9), MiniMax M2.5 (27.7), GLM-4.7 (27.6), and at 94.4% of its teacher.

Method: a 20-tool multi-source space

Each ClinSeekAgent run is an open-ended trajectory τ = (x, (a₁,o₁), …, (a_K,o_K), ŷ): a patient-level task, alternating tool actions and observations, terminating with a final answer. The agent decides ordering, depth, and termination; we do not prescribe a retrieval procedure. EHR queries are restricted to records strictly before the prediction-time cutoff, so the agent never peeks at the future.
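
As a rough illustration of the trajectory structure above, the loop below is a minimal Python sketch, not the released agent driver. The names `Step`, `Trajectory`, `run_episode`, the `"finish"` action name, and the `max_steps` safety cap are all assumptions made for this example; the policy is any callable that maps the trajectory so far to the next tool call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

FINISH = "finish"  # hypothetical name for the terminating action


@dataclass
class Step:
    action: dict[str, Any]   # a_k: {"name": tool name, "arguments": kwargs}
    observation: Any         # o_k: whatever the tool returned


@dataclass
class Trajectory:
    task: str                                        # x: patient-level task
    steps: list[Step] = field(default_factory=list)  # (a_1, o_1), ..., (a_K, o_K)
    answer: Any = None                               # y-hat: final answer


def run_episode(
    task: str,
    policy: Callable[[Trajectory], dict[str, Any]],
    tools: dict[str, Callable[..., Any]],
    max_steps: int = 50,  # assumed safety cap, not a prescribed retrieval budget
) -> Trajectory:
    """Alternate tool actions and observations until the policy finishes."""
    traj = Trajectory(task=task)
    for _ in range(max_steps):
        action = policy(traj)  # the policy picks ordering, depth, and stopping
        if action["name"] == FINISH:
            traj.answer = action["arguments"]["answer"]
            break
        observation = tools[action["name"]](**action["arguments"])
        traj.steps.append(Step(action=action, observation=observation))
    return traj
```

In this sketch nothing outside the policy decides which tool comes next; the only external constraint is the step cap, which merely guards against a policy that never finishes.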

Three complementary evidence sources, one OpenAI-tool-call interface:

ehr.* 11 tools

Raw EHR retrieval

Schema inspection, temporal record retrieval, free-form SQL, candidate-term grounding (BioLORD semantic search over reference vocabularies), and a finish action. The MCP server enforces the prediction-time cutoff per patient.

browser.* 3 tools

External medical knowledge

search · open · find. Backed by either a local browser or the Serper API. Used for clinical definitions, drug references, and benchmark-specific taxonomies (e.g. the 25-phenotype Harutyunyan 2019 labels).

image.* 6 tools

Medical-image analysis

DICOM preprocessing, CXR classification (TorchXRayVision), report generation, phrase grounding and anatomical segmentation (MAIRA-2), plus an image visualiser. Provides structured findings beyond the agent's native vision.
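
The cutoff enforcement mentioned for the ehr.* tools can be pictured with the following sketch. It uses Python's standard `sqlite3` module and a hypothetical `events` table with `patient_id`, `charttime`, `label`, and `value` columns; the real server, schema, and query path are not described here, so every name below is an assumption. The point is the strict `<` comparison applied on the server side, which excludes a record stamped exactly at the cutoff.

```python
import sqlite3


def fetch_events_before_cutoff(conn, patient_id, cutoff):
    """Return a patient's rows charted strictly before the cutoff.

    Timestamps are assumed to be strings in one fixed format
    ("YYYY-MM-DD HH:MM:SS"), so string comparison matches time order.
    """
    cursor = conn.execute(
        "SELECT charttime, label, value FROM events "
        "WHERE patient_id = ? AND charttime < ? "
        "ORDER BY charttime",
        (patient_id, cutoff),
    )
    return cursor.fetchall()


def make_demo_db():
    """Build a tiny in-memory table with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events "
        "(patient_id INTEGER, charttime TEXT, label TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        [
            (1, "2020-01-01 08:00:00", "heart_rate", 82.0),   # before the cutoff
            (1, "2020-01-01 12:00:00", "heart_rate", 95.0),   # exactly at the cutoff
            (1, "2020-01-02 09:00:00", "heart_rate", 110.0),  # after the cutoff
            (2, "2020-01-01 07:00:00", "heart_rate", 70.0),   # a different patient
        ],
    )
    return conn
```

Binding the cutoff as a query parameter inside the server, rather than trusting a value supplied by the model, is one plausible way to guarantee that no tool call can reach past the prediction time.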

Inference-time results: paired Curated vs. ClinSeekAgent

Same task definition, same labels — only the evidence access pattern changes. We evaluate twelve agents on ClinSeek-Bench.

Curated Input baseline — common practice

The source benchmark's hand-selected EHR context (up to 100 events from the last 24 h) is rendered into the prompt; the model answers directly with no tools.

ClinSeekAgent (ours) automated evidence-seeking

The curated context is stripped. The model receives only the patient ID, prediction-time cutoff, and access to the 20-tool space, and must retrieve evidence over multiple turns.
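
To make the pairing concrete, here is a small sketch of how one example could be rendered in both modes. The dictionary layout, the function names, and the choice to keep the most recent events when more than 100 fall inside the 24 h window are assumptions for illustration; only the window length, the event cap, and the shared task and label come from the descriptions above.

```python
from datetime import datetime, timedelta


def curated_context(events, cutoff, window_hours=24, max_events=100):
    """Pick up to max_events events from the window ending at the cutoff."""
    start = cutoff - timedelta(hours=window_hours)
    recent = sorted(
        (e for e in events if start <= e["time"] < cutoff),
        key=lambda e: e["time"],
    )
    # Assumption: when the window holds too many events, keep the latest ones.
    return recent[max(0, len(recent) - max_events):]


def build_pair(patient_id, cutoff, events, question, label):
    """Return the two inputs for one example; question and label are shared."""
    curated = {
        "mode": "curated_input",
        "question": question,
        "context": curated_context(events, cutoff),  # rendered into the prompt
        "tools": False,                              # model answers directly
        "label": label,
    }
    seeking = {
        "mode": "evidence_seeking",
        "question": question,
        "patient_id": patient_id,       # no curated context is included
        "cutoff": cutoff.isoformat(),
        "tools": True,                  # model must retrieve evidence itself
        "label": label,
    }
    return curated, seeking
```

Because both records carry the same question and label, any score difference between the two modes can be attributed to how the evidence was accessed.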


Per-subtask gains of ClinSeekAgent over Curated Input across nine agents on text-only EHR tasks (green = ClinSeekAgent wins, red = loses). Gains concentrate in the risk-prediction tasks: Mortality, LengthOfStay, ED Hospitalization.
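
The quantity behind such a heatmap is simply a per-subtask difference between the two modes. The helper names and the dictionary layout below are hypothetical, chosen only to show the computation for one agent.

```python
def subtask_gains(curated_f1, seeking_f1):
    """Per-subtask gain: evidence-seeking F1 minus curated-input F1."""
    return {task: seeking_f1[task] - curated_f1[task] for task in curated_f1}


def count_wins(gains):
    """Number of subtasks where evidence seeking beats the curated input."""
    return sum(1 for gain in gains.values() if gain > 0)
```

A positive gain corresponds to a green cell and a negative gain to a red one.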

Where it works: a multimodal phenotype case

Side-by-side trajectory comparison on a phenotype question. ClinSeekAgent calls a CXR image expert, runs SQL over the EHR's ICU events for vitals, and browses the 25-phenotype (CCS) taxonomy. These three complementary signals take it to **F1 = 83.3**, while the Curated-Input baseline fails for lack of imaging evidence.

Where it still loses: next-row decision making

On *next-row* decision-making tasks, broad retrieval can hurt: the agent collects abundant but tangential evidence and overlooks the row-local pattern that determines the next event. The Curated-Input baseline sees the same recent rows directly and gets it right. Closing this gap is open work.

Distillation: ClinSeek-35B-A3B

The inference-time results above lean on strong agentic capability in the model. To transfer that capability into an open student, we run Claude Opus 4.6 through ClinSeekAgent on the text-based training split, collect the resulting trajectories in their native tool-call format, and fine-tune (SFT) Qwen3.5-35B-A3B on them (8× H200, TP=2 / EP=8, 52K sequence length, cosine learning-rate schedule at 2e-5).
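
For readers unfamiliar with the schedule, the sketch below reads "cosine 2e-5" as a cosine decay starting from a peak learning rate of 2e-5. The warmup length, the floor value, and the total number of steps are not stated here, so `min_lr` and `total_steps` are placeholders and warmup is omitted entirely.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-5, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate is halved at the midpoint of training and reaches the floor at the final step.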

Open-source state-of-the-art on AgentEHR-Bench

Parameter-efficiency scatter on AgentEHR-Bench: ClinSeek-35B-A3B vs. open-source baselines and proprietary frontier models. At 34.0 F1, ClinSeek-35B-A3B is the strongest open-source model in our evaluation, surpassing the larger Kimi K2.5, MiniMax M2.5, and GLM-4.7, and it approaches Claude Opus 4.6 (36.0). The distilled student matches the trade-off curve of much larger MoE models.
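
The 94.4% figure quoted for the student follows directly from the two AgentEHR-Bench averages reported on this page, and the base model's score can be backed out from the +11.9 gain (the 22.1 below is derived here, not quoted):

```latex
\frac{\mathrm{F1}_{\text{student}}}{\mathrm{F1}_{\text{teacher}}}
  = \frac{34.0}{36.0} \approx 0.944,
\qquad
\mathrm{F1}_{\text{base}} = 34.0 - 11.9 = 22.1
```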

What the student actually learned

Tool-call distribution on AgentEHR-Bench, before vs. after SFT on ClinSeekAgent trajectories. Total tool calls barely changed (33k → 31k on the same 500 questions), but the *allocation* did: ehr.run_sql_query rose from **2.0% → 12.5%** of all calls. The student inherits not just final-answer patterns but procedural evidence-seeking behaviour: it learns to treat the EHR as a programmable database.
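
A share such as the one reported for ehr.run_sql_query is a count of calls to one tool divided by all tool calls across the evaluated questions. The sketch below assumes each trajectory has already been reduced to a list of tool-name strings; the function name and that input format are illustrative, not the released scoring utilities.

```python
from collections import Counter


def tool_call_shares(trajectories):
    """Map each tool name to its fraction of all calls across trajectories.

    Each trajectory is assumed to be a list of tool-name strings.
    """
    counts = Counter(name for traj in trajectories for name in traj)
    total = sum(counts.values())
    if total == 0:
        return {}  # no calls at all, so no shares to report
    return {name: n / total for name, n in counts.items()}
```

Comparing the output for the base model and the fine-tuned student on the same questions separates a change in allocation from a change in total call volume.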

Resources

Everything needed to reproduce the paper, plus the artifacts we release.

GitHub

UCSC-VLAA/ClinSeekAgent

Agent driver, EHR MCP server, image MCP server, scoring utilities, and vendored VERL SFT recipes (Megatron + mbridge on 8× H200).

🤗 Hugging Face

UCSC-VLAA collection

Benchmark inputs, evaluation results, the distilled student model, and training trajectories. Some assets are gated under MIMIC-IV / MIMIC-CXR access terms.

dataset

ClinSeek-Bench

1,800 text-only + 989 multimodal paired examples

dataset

ClinSeek-Evaluation-Results

Scored predictions for every model × mode × task

model

ClinSeek-35B-A3B

SFT of Qwen3.5-35B-A3B on Opus-4.6 trajectories

collection

ClinSeekAgent on Hugging Face

Model, benchmark, and evaluation results in one hub

how-to

Per-role install

Four venvs: agent · EHR MCP · image MCP · SFT

how-to

SFT recipe

8× H200 · TP=2 · EP=8 · 52K seq · cosine 2e-5

Cite

@article{wu2026clinseekagent,
  title   = {ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning},
  author  = {Wu, Juncheng and Zhang, Letian and Wang, Yuhan and Tu, Haoqin and
             Chen, Hardy and Wang, Zijun and Xie, Cihang and Zhou, Yuyin},
  journal = {arXiv preprint arXiv:2605.20176},
  year    = {2026},
}

Preprint: arXiv:2605.20176.