Large language model AI put to the test for potential adopters in primary care

ChatGPT is only so-so at letting physicians know whether a given clinical study is relevant to their patient rosters and, as such, deserving of a full, time-consuming read. On the other hand, the popular chatbot’s study summaries are an impressive 70% shorter than human-authored study abstracts, and ChatGPT pulls this off without sacrificing quality or accuracy while keeping bias low.

These are the findings of researchers in family medicine and community health at the University of Kansas. Corresponding author Daniel Parente, MD, PhD, and colleagues tested the large language model’s summarization chops on 140 study abstracts published in 14 peer-reviewed journals.

While they were at it, the researchers also developed software, “pyJournalWatch,” to help primary care providers quickly but thoughtfully review new scientific articles that might be germane to their respective practices.
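The article doesn’t reproduce the tool’s internals, but the core summarization step can be sketched in a few lines of Python. The snippet below is a minimal illustration assuming the openai client library; the prompt wording and the summarize_abstract helper are assumptions for illustration, not the authors’ actual pyJournalWatch code.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def summarize_abstract(abstract: str) -> str:
    """Ask the model for a compact summary of one study abstract.

    The prompt wording here is illustrative; the published study used
    ChatGPT-3.5 and reported summaries roughly 70% shorter than the
    source abstracts.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Summarize the following clinical study abstract for a "
                "busy primary care physician, in far fewer words than "
                "the original:\n\n" + abstract
            ),
        }],
    )
    return response.choices[0].message.content
```

A batch driver could loop a call like this over each journal issue and collect the outputs into a daily digest, which appears to be the workflow pyJournalWatch targets.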

The research appears in the March edition of Annals of Family Medicine. Noting that they used ChatGPT-3.5 because ChatGPT-4 was only available in beta at the time of the study, the authors offer several useful observations that apply regardless of version. Here are five.

1. Life-critical medical decisions should, for obvious reasons, remain based on critical and thoughtful evaluation of the full text of articles in context with available evidence from meta-analyses and professional guidelines.

‘We had hoped to build a digital agent with the goal of consistently surveilling the medical literature, identifying relevant articles of interest to a given specialty, and forwarding them to a user. ChatGPT’s inability to reliably classify the relevance of specific articles limits our ability to construct such an agent. We hope that in future iterations of LLMs, these tools will become more capable of relevance classification.’
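To make the quoted limitation concrete, the relevance check such an agent would need might look like the hedged Python sketch below. The yes/no prompt and the is_relevant name are assumptions for illustration only; per the authors, ChatGPT-3.5 handled exactly this step unreliably.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def is_relevant(abstract: str, specialty: str = "primary care") -> bool:
    """Crude relevance gate: ask for a YES/NO verdict and parse it.

    This is the classification step the authors found too unreliable
    in ChatGPT-3.5 to build a literature-surveillance agent on.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Is the following study clinically relevant to {specialty}? "
                "Answer YES or NO only.\n\n" + abstract
            ),
        }],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("YES")
```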

2. The present study’s findings support previous evaluations showing ChatGPT performs reasonably well for summarizing general-interest news and other samples of nonscientific literature.

‘Contrary to our expectations that hallucinations would limit the utility of ChatGPT for abstract summarization, this occurred in only 2 of 140 abstracts and was mainly limited to small (but important) methodologic or result details. Serious inaccuracies were likewise uncommon, occurring only in a further 2 of 140 articles.’

3. ChatGPT summaries have rare but important inaccuracies that preclude them from being considered a definitive source of truth.

‘Clinicians are strongly cautioned against relying solely on ChatGPT-based summaries to understand study methods and study results, especially in high-risk situations. Likewise, we noted at least one example in which the summary introduced bias by omitting gender as a significant risk factor in a logistic regression model, whereas all other significant risk factors were reported.’

4. Large language models will continue to improve in quality.

‘We suspect that, as these models improve, summarization performance will be preserved and continue to improve. In addition, because [our] ChatGPT model was trained on pre-2022 data, it is possible that its slightly out-of-date medical knowledge decreased its ability to produce summaries or to self-assess the accuracy of its own summaries.’

5. As large language models evolve, future analyses should determine whether later iterations of the GPT models perform better at classifying the relevance of individual articles to various domains of medicine.

‘In our analyses, we did not provide the LLMs with any article metadata such as the journal title or author list. Future analyses might investigate how performance varies when these metadata are provided.’
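One way to run that future comparison is simply to toggle metadata in the prompt. The short Python sketch below (build_prompt is a hypothetical helper, not from the study) shows how journal and author fields could be prepended for an A/B evaluation.

```python
def build_prompt(abstract: str,
                 journal: str | None = None,
                 authors: list[str] | None = None) -> str:
    """Assemble a summarization prompt, optionally prepending metadata.

    Running the same abstracts with and without the header would test
    the authors' open question of whether metadata changes performance.
    """
    header = ""
    if journal:
        header += f"Journal: {journal}\n"
    if authors:
        header += "Authors: " + ", ".join(authors) + "\n"
    return header + "\nAbstract:\n" + abstract
```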

Parente and co-authors conclude: “We encourage robust discussion within the family medicine research and clinical community on the responsible use of AI large language models in family medicine research and primary care practice.”

The study is available in full for free.


Dave Pearson

Dave P. has worked in journalism, marketing and public relations for more than 30 years, frequently concentrating on hospitals, healthcare technology and Catholic communications. He has also specialized in fundraising communications, ghostwriting for CEOs of local, national and global charities, nonprofits and foundations.
