NIH study shows AI scores high in diagnostic quiz, but still can't show its work

A study by the National Institutes of Health (NIH) found that artificial intelligence could help accurately diagnose patients—but the large language model still stumbled when it came to clearly explaining its answers.

Researchers quizzed an AI using four years’ worth of online image challenges from The New England Journal of Medicine, a long-running column that tests the reader’s ability to diagnose a patient based on a series of submitted pictures and some basic clinical background information.

GPT-4V, developed by OpenAI, was prompted with 207 multiple-choice questions and told to show its work. Its answers were compared with those of nine human physicians from different specialties, who answered the same questions under both closed-book and open-book conditions.

While the multimodal AI answered more questions correctly than the physicians did in the closed-book setting, reflecting stronger recall of medical knowledge, the human physicians ultimately outperformed the LLM in the open-book setting, especially on the challenges ranked most difficult.

Researchers also found mistakes in the rationales GPT-4V provided, even when it got the final answer correct, with error rates as high as 27% in image comprehension.

For example, the researchers said that while the program was able to identify a case of malignant syphilis, it did not recognize that images of two skin lesions, presented from different angles, were actually caused by the same condition.

“Integration of AI into health care holds great promise as a tool to help medical professionals diagnose patients faster, allowing them to start treatment sooner,” said Stephen Sherry, acting director of the NIH’s National Library of Medicine, in a statement. “However, as this study shows, AI is not advanced enough yet to replace human experience, which is crucial for accurate diagnosis.”

Overall, only three of the 207 questions were answered incorrectly by both the AI and the physicians, which the researchers described as “indicating a promising synergy between the current tools and GPT-4V.”

Of course, there are no multiple-choice answers in day-to-day care, and multiple diagnoses are often possible, which places greater importance on the ability to correctly describe the rationale and evidence behind a differential diagnosis, the researchers said, while calling for more comprehensive evaluations of medical AI.

“This technology has the potential to help clinicians augment their capabilities with data-driven insights that may lead to improved clinical decision-making,” said the study’s corresponding author Zhiyong Lu, an NLM senior investigator. “Understanding the risks and limitations of this technology is essential to harnessing its potential in medicine.”