AgentMed: Evaluating GPT-5-Based AI Agents in Medicine

Shaohui Zhang, Jiarong Qian, Zhiling Yan, Kai Zhang, Wei Liu, Quangzheng Li, Xiang Li, Lifang He, Jing Huang, Lichao Sun

Lehigh University, University of Pennsylvania, Mayo Clinic, Harvard Medical School, Massachusetts General Hospital

Introduction

Large language models (LLMs) are increasingly augmented with reasoning and tool-use capabilities, creating a spectrum of AI systems that ranges from simple chatbots to autonomous agents. This rapid evolution has fueled visions of Artificial Super Intelligence (ASI) revolutionizing medicine, yet the gap between current systems and that future remains unquantified. It is unclear how each incremental capability, from web search to agentic planning, affects performance on the path toward superhuman clinical intelligence, which calls for a comprehensive, ecosystem-wide evaluation. Here, we systematically evaluate a hierarchy of OpenAI models and systems on medical diagnosis tasks. Our analysis spans the base large language model (e.g., GPT-5), models enhanced with reasoning, web search, and a deep research function, and a fully autonomous agent. We show that while augmenting models with external information tools improves data retrieval, it does not consistently translate into superior diagnostic accuracy and can introduce new sources of error. This work provides a benchmark of OpenAI's model ecosystem and an evaluation platform with a rigorous methodology for assessing the clinical readiness of increasingly complex AI systems, tracking progress, and guiding their responsible development and integration into healthcare, toward safe and reliable super-intelligence in medicine.

Category	Model	Result (correct/total)	Accuracy (%)
LLM	GPT-5 Auto	55/198	27.78
LLM	GPT-5 Fast	50/198	25.25
LLM	GPT-5 Thinking mini	41/198	20.71
LLM	GPT-5 Thinking mini (quick)	57/198	28.79
LLM	GPT-5 Thinking	55/198	27.78
LLM	GPT-5 Thinking (quick)	52/198	26.26
LLM	GPT-5 Pro	62/198	31.31
Web-Search	GPT-5 Auto Web Search	40/198	20.20
Web-Search	GPT-5 Fast Web Search	43/198	21.72
Web-Search	GPT-5 Thinking mini Web Search	55/198	27.78
Web-Search	GPT-5 Thinking Web Search	65/198	32.83
Web-Search	GPT-5 Pro Web Search	78/198	39.39
Agent	ChatGPT Agent	61/198	30.81
Agent	Deep Research	19/79	24.05

Category	Model	Result (correct/total)	Accuracy (%)
LLM	4o	53/198	26.77
Web-Search	4o Web Search	51/198	25.76
Web-Search	o3-Pro Web Search	66/198	33.33
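
The Accuracy column appears to be the Result fraction expressed as a percentage and rounded to two decimals. The sketch below recomputes it and measures how much web search changes accuracy for each base model that has a web-search counterpart. The counts are copied from the tables; the names `RESULTS`, `accuracy_pct`, and `web_search_delta` are hypothetical helpers introduced for illustration and are not part of the authors' evaluation platform.

```python
# Hypothetical helper script: counts are copied from the Result column of the
# tables; nothing here comes from the authors' actual evaluation code.
RESULTS = {
    "GPT-5 Auto": (55, 198),
    "GPT-5 Fast": (50, 198),
    "GPT-5 Thinking mini": (41, 198),
    "GPT-5 Thinking": (55, 198),
    "GPT-5 Pro": (62, 198),
    "4o": (53, 198),
    "GPT-5 Auto Web Search": (40, 198),
    "GPT-5 Fast Web Search": (43, 198),
    "GPT-5 Thinking mini Web Search": (55, 198),
    "GPT-5 Thinking Web Search": (65, 198),
    "GPT-5 Pro Web Search": (78, 198),
    "4o Web Search": (51, 198),
}


def accuracy_pct(correct: int, total: int) -> float:
    """Return correct/total as a percentage rounded to two decimals."""
    return round(100 * correct / total, 2)


def web_search_delta(base: str) -> float:
    """Percentage-point change from adding web search to a base model."""
    base_correct, base_total = RESULTS[base]
    web_correct, web_total = RESULTS[base + " Web Search"]
    delta = 100 * (web_correct / web_total - base_correct / base_total)
    return round(delta, 2)


if __name__ == "__main__":
    # Only base models with a matching "<name> Web Search" entry are compared.
    for name in RESULTS:
        if name + " Web Search" in RESULTS:
            print(f"{name}: {web_search_delta(name):+.2f} points with web search")
```

Under these assumptions the sign of the change differs across models, which is consistent with the statement that external information tools do not consistently improve diagnostic accuracy.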

Figure 1: MM-94

Figure 2: MM-94

BibTeX

@inproceedings{2025_AgentMed,
  title={AgentMed: Evaluating GPT-5-Based AI Agents in Medicine},
  author={Shaohui Zhang and Jiarong Qian and Zhiling Yan and Kai Zhang and Lifang He and Jing Huang and Lichao Sun},
  booktitle={},
  year={2025},
}