Table 1: Accuracy (Acc.) or F1 results on 4 tasks across 2 aspects. Model performances marked with * are taken from Wu et al. (2024) for reference. We also report the average score (Average) of each metric in the table.
Table 2: BLEU-1 (B-1) and ROUGE-1 (R-1) results on 3 tasks across 2 aspects. We use a gray background to highlight o1 results. We also report the average score (Average) of each metric.
Table 3: Accuracy of models on the multilingual task, XMedBench (Wang et al., 2024).
Table 4: Accuracy of LLMs on two agentic benchmarks.
Table 5: Accuracy of models with/without CoT prompting on 5 knowledge QA datasets.