AI Cracks Medical Exams—But Fails Where It Matters Most?
A new study reveals limitations in the clinical reasoning capabilities of state-of-the-art Large Language Models (LLMs). While these models perform well on traditional multiple-choice questions, they fall short of human experts on script concordance testing (SCT), an assessment method centered on revising medical judgments as new information arrives.
Increasingly integrated into clinical settings for decision support, LLMs like those developed by OpenAI have displayed promising capabilities. However, traditional evaluations may misrepresent their reasoning abilities. This study sought to address that gap by employing SCT, which assesses how these models handle clinical uncertainty, a crucial aspect of real-world medical practice. Researchers constructed a public benchmark comprising 750 SCT questions drawn from a wide variety of specialties and international sources. Ten advanced LLMs were tested alongside medical students, residents, and physicians to measure performance across varying expertise levels.
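For readers unfamiliar with the format, a typical SCT item pairs a short vignette with an initial hypothesis and a single new finding, then asks how the finding should change the hypothesis on a five-point Likert scale. The sketch below illustrates that conventional structure in Python; the field names, the example wording, and the -2 to +2 scale reflect SCT in general, not the specific items in this benchmark.

```python
from dataclasses import dataclass

# Conventional SCT Likert scale: -2 (much less likely) ... +2 (much more likely)
LIKERT_SCALE = (-2, -1, 0, +1, +2)

@dataclass
class SCTItem:
    """One script concordance test question (illustrative structure)."""
    vignette: str          # short clinical scenario establishing context
    hypothesis: str        # diagnosis, investigation, or treatment under consideration
    new_information: str   # finding that may strengthen or weaken the hypothesis

# A hypothetical item in the conventional SCT format:
example_item = SCTItem(
    vignette="A 62-year-old presents with acute pleuritic chest pain and dyspnea.",
    hypothesis="You are considering pulmonary embolism.",
    new_information="The D-dimer result returns within the normal range.",
)
# The respondent answers: given the new information, the hypothesis becomes
# much less likely (-2), less likely (-1), unchanged (0), more likely (+1),
# or much more likely (+2).
```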
The methodology evaluated the LLMs both without any preparatory examples (zero-shot) and with a small number of worked examples included in the prompt (few-shot). Evaluated models included OpenAI’s latest o3 and GPT-4o, as well as DeepSeek R1 and Google’s Gemini models. The models were tasked with adjusting diagnoses in light of new information, and their responses were scored against panels of expert clinicians. The most advanced model, OpenAI’s o3, achieved a score of 67.8%, still well below the performance of senior medical professionals. Notably, models generally displayed a tendency toward extreme confidence, often misjudging how much new information should shift their conclusions.
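SCT responses are conventionally scored against an expert reference panel using aggregate scoring: each Likert option earns partial credit proportional to the number of panelists who chose it, normalized so that the modal panel answer is worth full credit. The sketch below shows that standard scheme in Python; the exact scoring variant, panel sizes, and the way per-item credits are averaged into the reported percentages are assumptions here rather than details taken from the paper.

```python
from collections import Counter
from typing import Sequence

def sct_credit(response: int, panel_responses: Sequence[int]) -> float:
    """Aggregate SCT scoring: credit = (# panelists choosing this option) / (# choosing the modal option)."""
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    return counts.get(response, 0) / modal_count

def sct_score(model_responses: Sequence[int], panels: Sequence[Sequence[int]]) -> float:
    """Mean per-item credit across the benchmark, expressed as a percentage."""
    credits = [sct_credit(r, panel) for r, panel in zip(model_responses, panels)]
    return 100.0 * sum(credits) / len(credits)

# Hypothetical example: a 10-expert panel for each of two items.
panels = [
    [-1, -1, -2, -1, 0, -1, -1, -2, -1, -1],   # most panelists answered -1
    [+1, +2, +1, +1, 0, +1, +2, +1, +1, +1],   # most panelists answered +1
]
model_answers = [-2, +1]   # first answer earns partial credit, second earns full credit
print(f"SCT score: {sct_score(model_answers, panels):.1f}%")
```

Under this scheme, extreme answers (-2 or +2) earn little or no credit whenever the expert panel clusters around a moderate response, which is one way overconfident models lose points.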
The findings underscore that even the most reasoning-optimized LLMs have significant room for improvement in flexible, uncertainty-driven reasoning tasks. As the researchers from MIT report, while LLMs outpace medical students in many scenarios, their performance on SCT reveals critical gaps in clinical reasoning, primarily in managing uncertainty and reasoning probabilistically.
This study’s implications extend beyond benchmarking. The SCT benchmark, now publicly available, is a valuable resource for assessing medical AI systems as models evolve, emphasizing the need for nuanced reasoning beyond rigid knowledge-based testing. This work is a stepping stone toward more comprehensive, trial-based evaluations of AI models in clinical settings, contributing to safer and more effective deployment in healthcare.
While LLMs have advanced AI’s capabilities in clinical decision support, the nuances of real-world reasoning remain a challenging frontier. The study’s findings invite future research aimed at refining these models for deeper integration into clinical practice, in line with the complexities of human decision-making.
McCoy, L.G., Swamy, R., Sagar, N., Wang, M., Bacchi, S., Fong, J.M.N., Tan, N.C.K., Tan, K., Buckley, T.A., Brodeur, P., Celi, L.A., Manrai, A.K., Humbert, A., & Rodman, A. (2025). Assessment of Large Language Models in Clinical Reasoning: A Novel Benchmarking Study. NEJM AI, 2(10). https://tnyp.me/YctyIeBC