
AgentMed: Evaluating GPT-5-Based AI Agents in Medicine

Shaohui Zhang, Jiarong Qian, Zhiling Yan, Kai Zhang, Wei Liu, Quanzheng Li, Xiang Li, Lifang He, Jing Huang, Lichao Sun
Lehigh University, University of Pennsylvania, Mayo Clinic, Harvard Medical School, Massachusetts General Hospital

Figure 1: Overview of the Diagnosis Dataset and model evaluation. The left panel shows 198 cases across 11 disease categories and 9 image modalities. The middle panel illustrates increasing cognitive complexity, progressing from ChatGPT reasoning to Web Agent execution. The right panel compares the MVA% performance of different GPT-5 and ChatGPT configurations to assess the impact of reasoning and retrieval capabilities on accuracy.

Introduction

Large language models (LLMs) are increasingly augmented with reasoning and tool-use capabilities, creating a spectrum of AI systems that ranges from simple chatbots to autonomous agents. This rapid evolution has fueled visions of Artificial Super Intelligence (ASI) revolutionizing medicine, yet the true gap between current systems and that future remains unquantified. It is unclear how incremental capabilities, from web search to agentic planning, affect performance on the path toward superhuman clinical intelligence, which necessitates a comprehensive, ecosystem-wide evaluation. Here, we systematically evaluate a hierarchy of OpenAI models and systems on medical diagnosis tasks, spanning the base large language model (e.g., GPT-5), models enhanced with reasoning, web search, and a deep research function, and a fully autonomous agent. We show that while augmenting models with external information tools improves data retrieval, it does not consistently translate into superior diagnostic accuracy and can introduce new error vectors. This work provides a crucial benchmark of OpenAI's model ecosystem and establishes a rigorous methodology for evaluating the true clinical readiness of increasingly complex AI systems, tracking progress, and guiding the responsible integration of AI into healthcare on the path toward safe and reliable super-intelligence in medicine.
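
To make this ecosystem-wide comparison concrete, the sketch below shows one way a shared set of diagnosis cases could be scored across the configuration hierarchy. It is a minimal sketch only: the configuration names and both helpers (query_model, is_correct_diagnosis) are illustrative assumptions, not the harness actually used in this work.

CONFIGURATIONS = [
    "gpt-5-auto",          # base LLM
    "gpt-5-thinking",      # + reasoning
    "gpt-5-thinking-web",  # + web search
    "deep-research",       # + deep research function
    "chatgpt-agent",       # fully autonomous agent
]

def query_model(config: str, question: str, images: list[str]) -> str:
    """Hypothetical stand-in for calling one model configuration."""
    raise NotImplementedError("replace with an actual API call")

def is_correct_diagnosis(prediction: str, ground_truth: str) -> bool:
    """Hypothetical grader; here, a simple case-insensitive string match."""
    return prediction.strip().lower() == ground_truth.strip().lower()

def evaluate(cases: list[dict]) -> dict[str, float]:
    """Return per-configuration accuracy over a shared list of diagnosis cases."""
    results = {}
    for config in CONFIGURATIONS:
        correct = sum(
            is_correct_diagnosis(
                query_model(config, case["question"], case["images"]),
                case["ground_truth"],
            )
            for case in cases
        )
        results[config] = correct / len(cases)  # accuracy in [0, 1]
    return results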

Diagnosis Dataset

Our dataset is derived from MedXpertQA, a publicly available benchmark for expert-level medical reasoning. The dataset contains 198 diagnosis cases distributed across 11 body systems: Cardiovascular (40 cases), Digestive (35), Respiratory (34), Skeletal (31), Nervous (22), Reproductive (9), Endocrine (9), Integumentary (7), Lymphatic (5), Muscular (4), and Urinary (2). Each case has a unique identifier and an open-ended, reasoning-style diagnostic question paired with medical images spanning CT, MRI, X-ray, PET, pathology images, EEG/ECG recordings, charts, and real-world photographs, together with a clinically validated ground-truth diagnosis.
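
The description above implies a simple per-case record. The field names below are a hypothetical rendering of that structure, not the released MedXpertQA schema, and the example values are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class DiagnosisCase:
    """One of the 198 diagnosis cases, as implied by the dataset description."""
    case_id: str                  # unique identifier
    question: str                 # open-ended, reasoning-style diagnostic question
    image_paths: list[str] = field(default_factory=list)  # CT, MRI, X-ray, PET, ...
    modalities: list[str] = field(default_factory=list)   # one modality per image
    body_system: str = ""         # e.g., "Cardiovascular", "Digestive"
    ground_truth: str = ""        # clinically validated diagnosis

# Illustrative record (values are invented, not drawn from the dataset):
example = DiagnosisCase(
    case_id="case_0001",
    question="Given the history and imaging, what is the most likely diagnosis?",
    image_paths=["images/case_0001_ct.png"],
    modalities=["CT"],
    body_system="Cardiovascular",
    ground_truth="Aortic dissection",
)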

Leaderboard

Table 1: Performance of Different GPT-5 Models on the Diagnosis Dataset (Accuracy %)

Category     Model                             Result    Accuracy (%)
LLM          GPT-5 Auto                        55/198    27.78
             GPT-5 Fast                        50/198    25.25
             GPT-5 Thinking mini               41/198    20.71
             GPT-5 Thinking mini (quick)       57/198    28.79
             GPT-5 Thinking                    55/198    27.78
             GPT-5 Thinking (quick)            52/198    26.26
             GPT-5 Pro                         62/198    31.31
Web-Search   GPT-5 Auto Web Search             40/198    20.20
             GPT-5 Fast Web Search             43/198    21.72
             GPT-5 Thinking mini Web Search    55/198    27.78
             GPT-5 Thinking Web Search         65/198    32.83
             GPT-5 Pro Web Search              78/198    39.39
Agent        ChatGPT Agent                     61/198    30.81
             Deep Research                     19/79     24.05
Table 2: Performance of Different 4o/o3 Models on the Diagnosis Dataset (Accuracy %)

Category     Model                Result    Accuracy (%)
LLM          4o                   53/198    26.77
Web-Search   4o Web Search        51/198    25.76
             o3-Pro Web Search    66/198    33.33
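
Each Accuracy (%) entry in Tables 1 and 2 is the raw tally rounded to two decimals (note that Deep Research is scored over 79 cases rather than 198). The snippet below reproduces a few entries as a sanity check:

# Reproduce the Accuracy (%) column from the raw tallies above.
tallies = {
    "GPT-5 Pro": (62, 198),
    "GPT-5 Pro Web Search": (78, 198),
    "ChatGPT Agent": (61, 198),
    "Deep Research": (19, 79),
    "o3-Pro Web Search": (66, 198),
}

for model, (correct, total) in tallies.items():
    print(f"{model}: {100 * correct / total:.2f}%")
# GPT-5 Pro: 31.31%
# GPT-5 Pro Web Search: 39.39%
# ChatGPT Agent: 30.81%
# Deep Research: 24.05%
# o3-Pro Web Search: 33.33%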

Examples

BibTeX

@inproceedings{2025_AgentMed,
  title={AgentMed: Evaluating GPT-5-Based AI Agents in Medicine},
  author={Shaohui Zhang and Jiarong Qian and Zhiling Yan and Kai Zhang and Wei Liu and Quanzheng Li and Xiang Li and Lifang He and Jing Huang and Lichao Sun},
  booktitle={},
  year={2025},
}