ChatGPT’s Diagnostic Accuracy is Comparable to that of ‘Dr. Google’

According to a new study, ChatGPT is mediocre at diagnosing medical conditions, with an accuracy rate of only 49%. Researchers emphasize that their findings demonstrate AI should not be the sole source of medical information, underscoring the necessity of keeping the human element in healthcare.

The ease of accessing online technology has led some people to skip seeing a medical professional and instead search their symptoms on Google. Although being proactive about one’s health is beneficial, ‘Dr. Google’ is not very accurate. A 2020 Australian study that examined 36 international mobile and web-based symptom checkers found that the correct diagnosis was listed first only 36% of the time.

Advancements in AI and Its Diagnostic Accuracy

AI has certainly advanced since 2020. For instance, OpenAI’s ChatGPT has made significant progress and can even pass the US Medical Licensing Exam. However, this raises a question: is it a more accurate diagnostician than ‘Dr. Google’? Researchers from Western University in Canada aimed to answer that question in a new study.

Using ChatGPT 3.5, a large language model trained on a vast dataset of over 400 billion words from diverse sources such as books, articles, and websites, the researchers conducted a qualitative analysis of the medical information provided by the chatbot. They assessed its responses to Medscape Case Challenges.

Medscape Case Challenges are intricate clinical scenarios designed to test a medical professional’s knowledge and diagnostic abilities. Participants must diagnose a case or select an appropriate treatment from four multiple-choice options.

The researchers selected these challenges because they are open-source and freely available. To prevent ChatGPT from having prior knowledge of the cases, the researchers included only those published after the model’s training cut-off in August 2021.
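
In code terms, that filtering step might look like the minimal sketch below. The record fields, titles, and dates are hypothetical illustrations, since the study’s dataset is not published in this form:

```python
from datetime import date

# Hypothetical case records; field names and dates are illustrative, not from the study.
cases = [
    {"title": "Gastro Case Challenge", "published": date(2021, 11, 3)},
    {"title": "Pediatric Case Challenge", "published": date(2021, 5, 14)},
]

# Training cut-off cited in the study (August 2021).
CUTOFF = date(2021, 8, 31)

# Keep only cases the model could not have seen during training.
eligible = [c for c in cases if c["published"] > CUTOFF]
```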

A Range of Medical Issues and Exclusions

A total of 150 Medscape cases were reviewed. With four possible answers per case, there were 600 potential responses, but only one correct answer for each case. The cases spanned a variety of medical issues, with titles such as “Beer and Aspirin Worsen Nasal Issues in a 35-Year-Old with Asthma,” “Gastro Case Challenge: A 33-Year-Old Man Who Can’t Swallow His Own Saliva,” “A 27-Year-Old Woman with Constant Headache Too Tired to Party,” “Pediatric Case Challenge: A 7-Year-Old Boy with a Limp and Obesity Who Fell in the Street,” and “An Accountant Who Loves Aerobics with Hiccups and Incoordination.” The researchers excluded cases that included visual elements, such as clinical images, medical photographs, and graphs.

An example of a standardized prompt fed to ChatGPT (Image: Hadi et al.)

To ensure consistent input, the researchers converted each Medscape case challenge into a standardized prompt with a specified expected response. At least two independent medical trainees, blinded to each other’s assessments, reviewed ChatGPT’s responses for diagnostic accuracy, cognitive load, and quality of information.
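
To make that standardization concrete, here is a minimal sketch of how such a prompt might be assembled. The template wording and the `build_prompt` helper are assumptions for illustration; the study’s actual prompt appears in the figure above:

```python
# Hypothetical template; the study's exact prompt wording is not reproduced here.
PROMPT_TEMPLATE = """You are given a clinical case challenge.

Case: {vignette}

Choose the single best answer:
A) {a}
B) {b}
C) {c}
D) {d}

State your chosen option and briefly justify it."""


def build_prompt(vignette: str, options: list[str]) -> str:
    """Format one Medscape-style case into a standardized prompt string."""
    a, b, c, d = options  # exactly four multiple-choice options per case
    return PROMPT_TEMPLATE.format(vignette=vignette, a=a, b=b, c=c, d=d)
```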

ChatGPT gave the correct answer in 49% of the 150 cases analyzed. Its overall accuracy, measured across all 600 answer options, was 74%, a figure driven largely by its ability to identify and reject incorrect choices. In other words, the model was better at ruling out wrong answers than at selecting the right one, pointing to shortcomings in precision and sensitivity.

Accuracy and Quality of ChatGPT Responses

ChatGPT produced false positives and false negatives in 13% of cases each. Over half (52%) of its answers were complete and relevant, while 43% were incomplete but still relevant. The responses carried a low to moderate cognitive load, making them fairly easy to understand, though that very readability, combined with occasional inaccuracy, could foster misconceptions if the output were used for medical education.
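
For readers unfamiliar with these metrics, the sketch below shows how overall accuracy, precision, and sensitivity fall out of a confusion matrix. The counts are hypothetical stand-ins chosen only to illustrate the formulas, not the study’s raw data:

```python
# Hypothetical accept/reject counts over 600 answer options (not the study's raw data).
tp = 74   # correct diagnoses the model accepted
tn = 370  # incorrect options the model correctly rejected
fp = 78   # incorrect options the model accepted
fn = 78   # correct diagnoses the model rejected

total = tp + tn + fp + fn  # 600 decisions in this toy example

accuracy = (tp + tn) / total   # share of all decisions that were right
precision = tp / (tp + fp)     # of the answers it accepted, how many were correct
sensitivity = tp / (tp + fn)   # of the correct answers, how many it accepted

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, sensitivity={sensitivity:.2f}")
```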

The model struggled with distinguishing between subtly different diseases and occasionally produced incorrect or implausible information, highlighting the need for human expertise in the diagnostic process.

The researchers say that AI should be used as a tool to enhance, not replace, medicine’s human element (Image: Depositphotos)

ChatGPT 3.5 and Differential Diagnosis

The researchers note that ChatGPT 3.5 is just one AI model and may not represent others, with improvements expected in future versions. The study focused on differential diagnosis cases, where distinguishing between similar symptoms is crucial.

Future research should assess various AI models across different case types. Despite this, the study offers valuable insights.

“The combination of high relevance and relatively low accuracy suggests ChatGPT should not be relied upon for medical advice, as it may provide important but misleading information,” the researchers said. “Although ChatGPT consistently delivers the same information to different users, showing good inter-rater reliability, its low diagnostic accuracy highlights its limitations in providing accurate medical information.”


Read the original article on: New Atlas

Read more: ChatGPT’s Humor Challenges Professional Writers
