Richard Armitage is a GP and Honorary Assistant Professor at the University of Nottingham’s Academic Unit of Population and Lifespan Sciences. He is on twitter: @drricharmitage
Artificial intelligence (AI) is developing at an astounding pace. If you’re not up to speed with large language models (LLMs) such as ChatGPT – what they are, how they’re impacting industries, and why they promise to revolutionise general practice – see some of my BJGP Life writings.1,2,3
Just one of the stunning capabilities of LLMs is their performance in publicly-available and professional examinations. For example, without any specific training, GPT-4 (the latest iteration of ChatGPT) scored in the 88th percentile of the LSAT (the US Law School Admission Test), and in the 90th percentile of the Uniform Bar Examination (the US legal licensing exam). The same model scored in the 99th–100th percentile of the USA Biology Olympiad Semifinal Exam, in the 93rd percentile of the SAT Evidence-Based Reading & Writing (one element of the US college admissions test), and in the 99th percentile of the GRE Verbal exam (one part of the US graduate school admissions test).4 It also performed “at or near the passing threshold” of the United States Medical Licensing Exam,5 and passed three of four papers in the Faculty of Public Health Diplomate Examination, surpassing the current pass rate.6 In this article, I’ll explain how GPT-4 fared as it prepared to sit an MRCGP examination.
The Applied Knowledge Test (AKT) assesses the base of knowledge that is required for independent general practice in the UK within the context of the NHS. It is a computer-based examination that forms part of the MRCGP. Candidates must answer 200 questions within 190 minutes on clinical knowledge (about 80% of questions), evidence-based practice (10%), and primary care organisation and management (10%), thereby covering the RCGP GP Curriculum.7 The questions assess higher order problem-solving rather than the simple recall of basic facts. Candidates who pass the AKT show they are able to apply knowledge to a standard that is sufficient for independent general practice in the UK.8 While no complete past AKT papers are available online, the RCGP makes available 53 AKT practice questions – followed by their answers – to assist potential candidates in preparing for the examination.9 Forty-five of these questions contain only textual information, while eight contain images (of skin disease, for example) and statistical information presented as graphs. It is these text-only AKT practice questions that I subjected GPT-4 to as if it were preparing to sit the AKT.
I firstly prompted GPT-4 to “Answer the following as if you were a GP trainee in the UK.”
I then asked the LLM each of the 45 text-only questions by copying and pasting the questions one-by-one directly from the RCGP AKT practice paper (four of these questions contained tables of results, which I converted into text so that the model could interpret them). No further prompting was used at any time. I then marked the LLM’s answers against the answers included on the practice paper.
GPT-4 did rather well. It answered 43.5 out of 45 questions correctly, a score of 96.7%. It dropped half a mark in a multiple-choice question regarding developmental delay (question 16), and one mark regarding choice of inhaler (question 47).*
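The score reported above is a simple percentage of marks awarded. A minimal sketch of the arithmetic (the variable names are mine, not from the original marking scheme):

```python
# Marks GPT-4 was awarded on the text-only practice questions
# (half marks were possible on multi-part items, hence 43.5).
marks_awarded = 43.5
questions_attempted = 45

# Percentage score, rounded to one decimal place
score = round(marks_awarded / questions_attempted * 100, 1)
print(score)  # → 96.7
```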
The webpage on which this AKT practice paper is available came into existence in August 2022.10 GPT-4 was trained on data up to September 2021, suggesting the question paper was not incorporated into the model’s training set, and that it was ‘seeing’ these questions for the first time (the model’s training sets are not publicly-available, however, so I cannot be certain of this). Even if the practice paper did constitute part of the training set, the model also provided detailed explanations for its answers that were not included in the practice paper. These explanations were highly compelling, suggesting genuine ‘understanding’ of the answers provided rather than simple retrieval and copying from a previously seen paper. Both interestingly and concerningly, the confidence with which the model explained its correct answers matched that with which it explained its incorrect answers (‘hallucinations’). In addition, only a single additional prompt was required to extract the correct answer (and an explanation that implies understanding) after each of its two incorrect answers, suggesting the model ‘knew’ the correct answers all along but, for some unclear reason, did not provide them on the first attempt.
What does all this mean? While GPT-4 was unable to attempt the questions containing images and graphs, other AIs are able to interpret information presented in these modalities to a high degree of accuracy, such as those trained to interpret skin lesions,11 blood chemistry,12 and radiological imaging including CT,13 chest x-ray,14 and retinal photography,15 and are meeting or exceeding expert clinician performance in these domains. In light of the AKT questions that it was able to attempt, GPT-4’s score of 96.7% is highly likely to exceed the examination’s pass mark, thereby suggesting that the model is able to apply knowledge to the standard that is required for independent general practice in the UK (the full AKT consists of 200 questions, however, and requires the interpretation of multiple images and graphs). The model is also limited by its tendency to hallucinate, which it does with no discernible difference in confidence from that with which it delivers answers that are correct. This seriously limits the extent to which its answers can be trusted, although this fallibility is becoming less debilitating with each iteration of GPT.4
Of course, the AKT is only designed to test the foundational knowledge base that is required for independent practice. In the messy reality of general practice, information is not provided in the uncontaminated, accurate, and concise manner of the AKT questions. Rather, the relevant information is hidden amongst the obscuring complexities of superfluous material and multi-factorial contexts, while patient rapport, language barriers, and ethical issues often make the retrieval and interpretation of that useful information highly challenging. Furthermore, the AKT only tests the knowledge base required of a GP, and is not designed to assess the clinical skills and professional attitudes that are needed to practise safely and effectively in the role (these attributes are assessed through the MRCGP’s other components – the Recorded Consultation Assessment, Workplace Based Assessments, and the Trainee Portfolio).16 Accordingly, the apparent competence demonstrated by its performance in the AKT practice paper does not imply that GPT-4 can replace the general practitioner, but might suggest that there is a role for the model in supporting GPs in the knowledge-based element of their practice.
- R Armitage. Using AI in the GP consultation: present and future. BJGP Life 29 May 2023. https://bjgplife.com/using-ai-in-the-gp-consultation-present-and-future/ [accessed 27 July 2023]
- R Armitage. ChatGPT: a threat to medical education? BJGP Life 11 May 2023. https://bjgplife.com/chatgpt-a-threat-to-medical-education/
- R Armitage. The utilitarian case for AI-mediated clinical decision-making. BJGP Life 16 July 2023. https://bjgplife.com/the-utilitarian-case-for-ai-mediated-clinical-decision-making/
- OpenAI. GPT-4. https://openai.com/research/gpt-4 [accessed 27 July 2023]
- TH Kung, M Cheatham, A Medenilla, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health 2023; 2(2): e0000198. DOI: 10.1371/journal.pdig.0000198
- NP Davies, R Wilson, MS Winder, et al. ChatGPT sits the DFPH exam: large language model performance and potential to support public health learning. medRxiv 04 July 2023; 23291894. DOI: 10.1101/2023.07.04.23291894
- RCGP. MRCGP exams: GP Curriculum. https://www.rcgp.org.uk/mrcgp-exams/gp-curriculum [accessed 27 July 2023]
- Royal College of General Practitioners. MRCGP: Applied Knowledge Test (AKT). https://www.rcgp.org.uk/mrcgp-exams/applied-knowledge-test [accessed 27 July 2023]
- Royal College of General Practitioners. MRCGP: Applied Knowledge Test (AKT). AKT examples questions with answers. https://www.rcgp.org.uk/getmedia/8f99c1ac-7b31-44f8-8eaf-eb0c4174bf07/AKT-Example-Questions-with-answers2021.pdf [accessed 27 July 2023]
- Wayback Machine (Internet Archive). https://web.archive.org/web/20220601000000*/https://www.rcgp.org.uk/mrcgp-exams/ [accessed 27 July 2023]
- TC Pham, CM Luong, VD Hoang, et al. AI outperformed every dermatologist in dermoscopic melanoma diagnosis, using an optimized deep-CNN architecture with custom mini-batch logic and loss function. Nature Scientific Reports 2021; 11: 17485. DOI: 10.1038/s41598-021-96707-8
- W Walter, C Haferlach, N Nadarajah, et al. How artificial intelligence might disrupt diagnostics in hematology in the near future. Oncogene 2021; 40: 4271-4280. DOI: 10.1038/s41388-021-01861-y
- J Chen, L Wu, J Zhang, et al. Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography. Nature Scientific Reports 2020; 10:19196. DOI: 10.1038/s41598-020-76282-0
- LL Plesner, FC Müller, JD Nybing, et al. Autonomous Chest Radiograph Reporting Using AI: Estimation of Clinical Impact. Radiology March 2023; 307(3). DOI: 10.1148/radiol.222268
- V Gulshan, L Peng, M Coram, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA 2016; 316(22): 2402–2410. DOI:10.1001/jama.2016.17216
- RCGP. MRCGP exams. https://www.rcgp.org.uk/mrcgp-exams [accessed 27 July 2023]
*The author has provided a Word document containing screenshots of all the questions and answers, and a table to summarise the results.