ChatGPT excels at differential diagnostics in hard cases
Today’s go-to generative AI—namely ChatGPT-4—is pretty darned good at parsing out probable diseases in difficult-to-diagnose patient cases.
Harvard researchers at Beth Israel Deaconess Medical Center came to that conclusion by challenging a ChatGPT bot to give differential diagnoses for 70 particularly problematic cases. The cases had previously been selected by the New England Journal of Medicine for educating physicians at 2023 conferences. (These teaching resources are best known by the shorthand clinicopathologic conferences.)
Here are answers to five questions raised by the present research.
- What’s the deal with differential? Differential diagnoses generally use lists to name conditions that might be causing the puzzling set of signs and symptoms. The lists are typically arranged in order of most to least likely final diagnosis. The approach takes into account medical histories, lab results and imaging findings.
- How good is ChatGPT at this? In the present study, published June 15 in the Journal of the American Medical Association (JAMA), ChatGPT-4 came back with the same final diagnosis as expert physicians at a 39% clip (27 of 70 cases). Meanwhile, the technology included the final diagnosis in 64% of its differential lists (45 of 70 cases).
- What benchmarks exist to place the new findings in context? ChatGPT’s performance compares favorably with that of earlier differential-diagnosis tools based on natural language processing (NLP). The authors cite a 2022 study showing an impressive rate of correct final diagnoses, 58% to 68%, while noting that the forerunner’s only measure of quality was a “useful” vs. “not useful” binary. By comparison, in the present study, ChatGPT earned a “numerically superior mean differential quality score,” the Beth Israel Deaconess researchers report.
- How solid is the new evidence? The authors—Zahir Kanjee, MD, MPH, Byron Crowe, MD, and Adam Rodman, MD, MPH—acknowledge some limitations in their study design. These included a touch of subjectivity in the outcome metrics and, owing to protocol limitations, a lack of some important diagnostic information in the patient cases. If anything, they suggest, this deficiency probably led to an underestimation of the model’s capabilities.
- What is the upshot? Generative AI is “a promising adjunct to human cognition in diagnosis,” the authors conclude.
Additional author commentary:
“The model evaluated in this study, similar to some other modern differential diagnosis generators, is a diagnostic ‘black box.’ Future research should investigate potential biases and diagnostic blind spots of generative AI models.”
Clinicopathologic conferences like those from NEJM “are best understood as diagnostic puzzles,” the authors add. “Once privacy and confidentiality concerns are addressed, studies should assess performance with data from real-world patient encounters.”