Table 1: Accuracy (Acc.) or F1 results on 4 tasks across 2 aspects. Model performances marked with * are taken from Wu et al. (2024) for reference. We also report the average score (Average) of each metric in the table.
Table 2: BLEU-1 (B-1) and ROUGE-1 (R-1) results on 3 tasks across 2 aspects. We use a gray background to highlight o1 results. We also report the average score (Average) of each metric.
Table 3: Accuracy of models on the multilingual task, XMedBench (Wang et al., 2024).
Table 4: Accuracy of LLMs on two agentic benchmarks.
Table 5: Accuracy of models with/without CoT prompting on 5 knowledge QA datasets.