Rebecca Payne is a Clinical Senior Lecturer at the North Wales Medical School, Bangor University, and a Clarendon Reuben doctoral scholar at the Nuffield Department of Primary Care Health Sciences, University of Oxford
Andrew Bean is a DPhil student in Social Data Science at the Oxford Internet Institute
Adam Mahdi is an Associate Professor and Senior Research Fellow at the Oxford Internet Institute
Alasdair Purves is a 4th year medical student at North Wales Medical School, Bangor University
A recent Nature Medicine paper by Bean et al.1 will come as a relief to GPs worried about their employment prospects: Artificial Intelligence (AI) chatbots are not yet ready to replace doctors. Bean and colleagues presented case vignettes summarising common medical conditions to members of the public, randomly assigning them to interact with commonly used large language model (LLM) chatbots (ChatGPT-4o, Llama 3 or Command R+) or to use whatever resources they would normally consult at home.
The results were striking. Users of LLMs were less likely than the control group to correctly identify relevant medical conditions underlying the scenario. When it came to choosing the correct level of care (for example, self-care, GP, urgent care or ambulance), LLM users performed no better than those using conventional methods.
But was the AI itself to blame?
When the researchers bypassed human interaction by presenting the same cases directly to the LLMs, performance improved dramatically. The models alone correctly identified relevant conditions in the vast majority of cases and performed substantially better than human–LLM teams. When the researchers tested simulated LLM-to-LLM interactions (a common evaluation technique), performance again appeared strong. However, when real people interacted with the models, accuracy dropped sharply.
Importantly, analysis of conversation transcripts showed that LLMs often did suggest a relevant diagnosis at some point in the exchange. Yet by the end of the interaction, participants frequently failed to include that diagnosis in their final answer. In other cases, users provided incomplete information, or the LLM misinterpreted key details. The breakdown was not simply a failure of knowledge; it was a failure of communication between human and machine.
This study carries important implications.
First, standard testing methods, including medical licensing exam benchmarks and LLM-to-LLM simulations, do not reliably predict real-world performance when humans are involved. Models that perform near-perfectly on structured exam questions may still fail in messy, interactive settings. Where tools are intended for use by the public, they must be evaluated with real users, not just other models or technically fluent testers.
Second, passing medical exams does not equate to delivering safe patient care. Clinical practice is not simply about recalling knowledge; it involves eliciting relevant information, communicating clearly, managing uncertainty, and guiding decision-making. Knowledge alone is insufficient. Passing a driving theory test does not make someone safe on the road.
Third, while policymakers increasingly describe AI as a “front door” to healthcare, this study suggests that LLMs are not yet ready for unsupervised deployment in direct patient-facing roles. The models tested here did not improve the public’s ability to assess clinical urgency or to reach the right diagnosis. That does not mean such systems should never be used — but it does mean that claims of readiness are premature.
The findings are particularly important given that both LLM users and control participants tended to underestimate the severity of scenarios. Introducing AI did not reliably correct this tendency. If deployed widely without careful design and oversight, such systems could risk reinforcing rather than mitigating errors in judgment.
Despite these disappointing findings, it is important not to “throw the baby out with the bath water”.
There are many promising applications of LLMs within healthcare. Transcription tools are already moving into mainstream practice to produce medical notes, referral letters and patient summaries. Productivity tools that summarise meetings or synthesise dense patient records can save time and reduce cognitive burden. In these cases, LLMs are repackaging, organising and clarifying existing information — acting as highly capable assistants rather than autonomous clinical decision-makers.
The difference is crucial. When LLMs are used to structure and summarise known information, they augment human expertise. When they are used to generate new clinical interpretations in high-stakes settings, the limitations of human–AI interaction become more apparent.
And herein lies the mismatch between policy ambition and technological reality. AI is unlikely to substitute for shortages in healthcare staffing. Even highly capable systems require thoughtful integration, careful testing, and sustained oversight. Embedding new technologies into overstretched healthcare systems often increases short-term strain rather than alleviating it.
The lesson from this study is not that AI has no place in medicine. It is that its place is narrower than current rhetoric suggests. For now, LLMs function best as secretaries and assistants: tools that support documentation, synthesis and information retrieval. Before they can credibly serve as the first point of contact for patients, far more work is needed to understand and design around the complexities of human–AI interaction. AI may be able to pass medical exams. But medicine is not just a science — it is an art grounded in judgment, empathy and human connection.
Reference
1. Bean AM, Payne RE, Parsons G, Kirk HR, Ciro J, Mosquera-Gómez R, et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nature Medicine. 2026.