A Preliminary Study of o1 in Medicine:
Are We Closer to an AI Doctor?

* Equal technical contribution
1 UC Santa Cruz, 2 University of Edinburgh, 3 National Institutes of Health
Figure 1: Overall results of o1 and 4 other strong LLMs. We show performance on 12 medical datasets spanning diverse domains. o1 demonstrates a clear performance advantage over closed- and open-source models.
Figure 2: Average accuracy of o1 and 4 other strong LLMs. o1 achieves the highest average accuracy of 73.3% across 19 medical datasets.

Our Pipeline

Figure 3: Our evaluation pipeline covers different (a) aspects, each containing various tasks. We collect multiple (b) datasets for each task and combine them with various (c) prompt strategies to evaluate the latest (d) language models. We leverage a comprehensive set of (e) evaluations to present a holistic view of model progress in the medical domain.

Performance Overview of o1

Image 1

Table 1: Accuracy (Acc.) or F1 results on 4 tasks across 2 aspects. Model performances marked with * are taken from Wu et al. (2024) as the reference. We also present the average score (Average) of each metric in the table.

Image 2

Table 2: BLEU-1 (B-1) and ROUGE-1 (R-1) results on 3 tasks across 2 aspects. We use a gray background to highlight o1 results. We also present the average score (Average) of each metric.

Image 3

Table 3: Accuracy of models on the multilingual task, XmedBench (Wang et al., 2024).

Image 4

Table 4: Accuracy of LLMs on two agentic benchmarks.

Image 5

Table 5: Accuracy of models with/without CoT prompting on 5 knowledge QA datasets.

Case Study

Image 1

Figure 4: Comparison of the answers from o1 and GPT-4 for a question from NEJM. o1 provides a more concise and accurate reasoning process compared to GPT-4.

Image 2

Figure 5: Comparison of the answers from o1 and GPT-4 for a case from the Chinese dataset AI Hospital, along with its English translation. o1 offers a more precise diagnosis and more practical treatment suggestions compared to GPT-4.

Acknowledgement

This work is partially supported by the OpenAI Researcher Access Program and Microsoft Accelerate Foundation Models Research Program. Q.J. is supported by the NIH Intramural Research Program, National Library of Medicine. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

BibTeX


@misc{xie2024preliminarystudyo1medicine,
  title={A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?},
  author={Yunfei Xie and Juncheng Wu and Haoqin Tu and Siwei Yang and Bingchen Zhao and Yongshuo Zong and Qiao Jin and Cihang Xie and Yuyin Zhou},
  year={2024},
  eprint={2409.15277},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.15277},
}