Three AIs sit the GP SelfTest

Richard Armitage is a GP and Honorary Assistant Professor at the University of Nottingham’s Academic Unit of Population and Lifespan Sciences. He is on Twitter: @drricharmitage

In August 2023 I subjected GPT-4 — the latest publicly available version of OpenAI’s large language model (LLM) in the form of ChatGPT — to a Royal College of General Practitioners (RCGP) Applied Knowledge Test (AKT) practice assessment.1 Of the 45 text-only practice questions made available on the RCGP website, GPT-4 answered 43.5 out of 45 questions correctly, a score of 96.7%. Not bad for an agent that lacks a medical degree, patient contact, and sentience.

Only 2 months later, the landscape of user-friendly LLMs that are available to the public has evolved substantially. ChatGPT, which was released to the public less than 11 months ago, is no longer the only competitive product in this space. In October 2023, a significant and growing number of competing LLMs are available for public use, including Meta’s Llama 2, Microsoft’s Bing AI, and Quora’s Poe. However, in addition to ChatGPT, Anthropic’s Claude-2 and Google’s Bard constitute the industry’s trio of front-running LLMs.

To mark the arrival of these challengers, I pitted the three LLMs against each other in a head-to-head challenge to see which would perform the best when subjected to MRCGP-style examination questions.

Introducing ChatGPT, Claude, and Bard

Readers of my previous writings on AI will be familiar with OpenAI’s LLM, ChatGPT. This LLM, which is powered by the GPT-3.5 model, is freely available to the public, but usage is restricted to limited conversations. ChatGPT Plus, which is powered by the GPT-4 model, costs $20/month, is more powerful than GPT-3.5, allows for more conversations, and provides access during peak traffic times. GPT-4 accepts input information as both text and speech, but does not have live access to the open internet, and its context window (the amount of data it can receive and process in a single prompt) is limited to 2048 tokens per prompt (one token is approximately four characters in English text, and 100 tokens is approximately 75 words).
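To make the token arithmetic above concrete, here is a minimal Python sketch of the rule of thumb (one token is roughly four characters, and 100 tokens is roughly 75 words) applied to the 2048-token limit quoted above. The function names are hypothetical and chosen only for illustration. Real tokenisers split text differently, so these figures are rough estimates rather than exact counts.

```python
# Rules of thumb quoted in the text; approximations, not exact tokeniser behaviour.
CHARS_PER_TOKEN = 4           # one token is approximately four English characters
WORDS_PER_100_TOKENS = 75     # 100 tokens is approximately 75 words
CONTEXT_WINDOW_TOKENS = 2048  # context window figure quoted in the text


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def approx_words(tokens: int) -> int:
    """Convert a token count to an approximate word count."""
    return tokens * WORDS_PER_100_TOKENS // 100


def fits_in_window(text: str, window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """Check whether a prompt is likely to fit within the context window."""
    return estimate_tokens(text) <= window


if __name__ == "__main__":
    prompt = "Answer the following questions as if you are a GP working in the UK"
    print(estimate_tokens(prompt), "tokens (estimated)")
    print(approx_words(CONTEXT_WINDOW_TOKENS), "words fit in the window (approx.)")
    print(fits_in_window(prompt))
```

By this estimate, a 2048-token window holds roughly 1500 words, which is ample for a single text-only examination question.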

Claude was developed by Anthropic, which describes itself as ‘an AI safety and research company [that builds] reliable, interpretable, and steerable AI systems.’2 The current version of Claude, Claude-2, was released to the public in July 2023 and is free, but limits user conversations. Claude Pro, which uses the same Claude-2 model, costs £18/month and allows much greater usage and priority access to the model during high-traffic periods. While it does not have live access to the open internet and cannot accept speech inputs, Claude-2’s context window accepts up to 100 000 tokens, making it around 50 times larger than that of ChatGPT and Bard. In addition, Claude-2 can extract text from PDF and TXT files that are attached to the user interface (neither ChatGPT nor Bard has this functionality).

Bard is currently entirely free to public users, accepts both text and speech input data, and has live access to the open internet (unlike ChatGPT and Claude). Its context window is the same as ChatGPT’s: limited to 2048 tokens.

Sitting the GP SelfTest

The RCGP’s GP SelfTest is a ‘tool for GPs at all career stages, from those preparing for the MRCGP AKT to those preparing for their annual appraisals’, and supports MRCGP exam revision and learning assessment.3 I prompted GPT-4, Claude-2, and Bard with the instruction ‘Answer the following questions as if you are a GP working in the UK’, and subjected each LLM to 15 GP SelfTest questions that were randomly generated through the lucky dip function. Each question contained textual information only (meaning questions that involved images were skipped), and was copied and pasted into the user interface of each LLM.

GPT-4 scored the highest mark, 13/15 or 86.7%. In one of its incorrect answers, the LLM stated the correct answer in its justification, suggesting it ‘knew’ the right answer but did not provide it, thereby implying it did not ‘understand’ the response it provided.

Bard scored the second highest mark, 12/15 or 80.0%. In one of its incorrect responses, it explained the correct ‘thinking process’ needed to reach the right answer but failed to apply it, making a simple logical error involving numbers. It also refused to answer one particular question, despite that question consisting of only textual information in exactly the same style as the other 14 questions that it did answer.

Claude-2 scored the lowest mark, 11/15 or 73.3%. The model justified each of its incorrect answers with confidence. Table 1 presents the responses of each LLM to the 15 GP SelfTest questions.

Table 1. Responses of each large language model to the 15 GP SelfTest questions

Question   GPT-4   Claude-2   Bard
1          A       A          A
2          C       E          D
3          D       E          E
4          D       D          D
5          D       D          D
6          B       B          B
7          B       B          B
8          D       C          D
9          B       B          B
10         B       B          B
11         A       A          A
12         E       E          E
13         D       B          A
14         B       B          (no answer)
15         B       E          D
Total      13      11         12
%          86.7    73.3       80.0
Green = correct response. Red = incorrect response.
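
The percentages in Table 1 follow directly from the number of correct answers out of 15. As a minimal sketch of that arithmetic, the Python below recomputes each percentage and ranks the three models. The variable and function names are hypothetical, and the counts are simply those reported in the table.

```python
TOTAL_QUESTIONS = 15

# Number of correct answers per model, as reported in Table 1.
correct = {"GPT-4": 13, "Claude-2": 11, "Bard": 12}


def percentage(n_correct: float, total: int = TOTAL_QUESTIONS) -> float:
    """Return the score as a percentage, rounded to one decimal place."""
    return round(100 * n_correct / total, 1)


# Rank the models from highest to lowest score.
ranking = sorted(correct, key=correct.get, reverse=True)

if __name__ == "__main__":
    for model in ranking:
        score = correct[model]
        print(f"{model}: {score}/{TOTAL_QUESTIONS} = {percentage(score)}%")
```

The same function reproduces the earlier AKT practice result when called as percentage(43.5, 45).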

Generally, Claude-2 and Bard tended to give longer responses than GPT-4, and provided lengthy explanations that justified their answers. GPT-4 did provide this amount of detail when further prompted to explain the reasoning behind its responses, but this was generally not provided in answers to the initial prompt.

Bard concluded all its answers with a form of disclaimer or caveat, such as ‘It is important to note that this is just a general overview of the management of X. The specific management plan for each patient will vary depending on their individual circumstances’. In other answers, the LLM advised the user to consult their doctor if they were experiencing the symptoms that featured in the question. Neither GPT-4 nor Claude-2 tended to do this.

Concerningly, Claude-2 confidently provided justification for its incorrect answers, and showed no ‘awareness’ that those answers could possibly be wrong.

Bard was the only LLM that failed to answer a question. It continued to refuse even when repeatedly prompted with the same question.

Taking these aspects of the performances into account, I award the LLMs the following grades:

• GPT-4 A-
• Claude-2 C+
• Bard B+

As such, although it is no longer the only user-friendly LLM available, GPT-4 maintains its leading position in answering MRCGP-style text-based questions. I think it is likely, however, that Claude-2, Bard, and other emerging LLMs will become as capable as GPT-4, or even more so, in the coming weeks, as all models are subjected to further training, become able to process additional data modalities (such as images), and gain live and usable access to the open internet.

All of these LLMs are enormously impressive, and the speed with which their abilities are compounding is remarkable, if not terrifying. However, as I have discussed in previous articles on AI, the apparent competence of these LLMs is not sufficient to replace the GP, who operates in an environment of incomplete information, complex human relationships, and ethical conundrums. Despite this, the utility of these technologies is becoming increasingly hard to ignore, and the case for using LLMs to support GPs in various domains of their practice, such as the knowledge-based element, is becoming increasingly robust.

1. Armitage R. ChatGPT prepares for the AKT, and does rather well. BJGP Life 2023; 7 Aug: (accessed 13 Oct 2023).
2. Anthropic. Company. (accessed 13 Oct 2023).
3. Royal College of General Practitioners Learning. GP SelfTest. (accessed 13 Oct 2023).

Featured photo by Mohamed Nohassi on Unsplash.
