3 steps toward setting and sustaining standards for medical AI

If AI for medical diagnostics is to lift the health status of populations—and thus fulfill its implicit global promise—it’s going to need stronger regulatory guidance than it’s gotten to date.

The need is especially keen for standardization of testing and performance measures.

What’s more, the responsibility for standardizing, maintaining and updating algorithms must fall squarely on the shoulders of physicians and their respective medical societies. AI developers should be required to apply the resulting standards to every algorithm they create and to update their code as standards change over time.

Three Stanford physician-professors, all with clinical expertise in diagnostic radiology, make the case for this commonsense if humanly complex scenario in a paper posted in Brookings TechStream.

Among the pillars supporting their detailed argument are three pointers that all stakeholders should know and that policymakers should act on.

1. Distinguish the algorithm from the definition of the diagnostic task.

The definitions should flow from standards developed by medical societies, suggest Drs. David Larson, Daniel Rubin and Curtis Langlotz.

The standardizing societies would do well to include four components in the definitions:

  • a background review of relevant information and medical objectives;
  • a thorough description of the task that includes clinical assessment criteria, measurement definitions and descriptions, and the full universe of potential classification categories;
  • detailed image labeling instructions for the task in question; and
  • illustrated examples and relevant counterexamples for developers building these systems.

“It should be left to medical experts to specify additional companion references,” Larson, Rubin and Langlotz write. “In some cases, developers may need to propose and publish their own task definitions. True standardization will require the cooperative management of the ecosystem of related task definitions from medical professional societies rather than piecemeal evaluation or specification of definitions.”
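For developers, the four components listed above could be captured in a simple record that flags an incomplete definition before any model work begins. The following is a minimal sketch only: the class name `TaskDefinition`, its field names and the example values are hypothetical illustrations, not anything specified by the authors or by a medical society.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskDefinition:
    """Hypothetical record mirroring the four suggested components."""
    background: str = ""                      # review of relevant information and objectives
    task_description: str = ""                # assessment criteria and measurement definitions
    categories: List[str] = field(default_factory=list)   # potential classification categories
    labeling_instructions: str = ""           # image labeling instructions for the task
    examples: List[str] = field(default_factory=list)         # illustrated examples
    counterexamples: List[str] = field(default_factory=list)  # relevant counterexamples

    def missing_components(self) -> List[str]:
        """Return the names of components that are still empty."""
        checks = {
            "background": bool(self.background),
            "task description": bool(self.task_description and self.categories),
            "labeling instructions": bool(self.labeling_instructions),
            "examples and counterexamples": bool(self.examples and self.counterexamples),
        }
        return [name for name, present in checks.items() if not present]


# Illustrative draft with only the first two components filled in (made-up values).
draft = TaskDefinition(
    background="Placeholder background review",
    task_description="Placeholder task description",
    categories=["finding present", "finding absent", "indeterminate"],
)
missing = draft.missing_components()
```

Here `missing` lists the components a society or developer would still need to supply before the definition could be treated as complete.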

2. Expand assessments of algorithmic performance beyond mere accuracy and into, for starters, reliability, applicability and self-awareness of limitations.

Current systems in the category of software as a medical device (SaMD) “have an alarming tendency to miss problems,” the authors comment. “Adding performance metrics beyond accuracy, such as transparency, use of fail-safes, and auditability will help those using and managing these systems to objectively assess the reliability of the algorithms and identify problems when they arise.”

3. Divide the evaluation of medical AI systems into five discrete steps.

These would be:

  • diagnostic task definition,
  • capacity of the algorithm to perform in a controlled environment,
  • evaluation of effectiveness in the real world compared to performance in a controlled environment,
  • validation of effectiveness in the local setting at each installed site, and
  • durability testing and monitoring to ensure the algorithm performs well over time.

“After identifying the diagnostic task, assessing how capable a system is in performing its defined task in a controlled environment and comparing it with other competitors is a natural next step,” the authors write.
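The comparison between controlled-environment performance and performance at a local site can be illustrated with a short calculation. This is a rough sketch under stated assumptions: the function names, the 0.05 tolerance and the toy labels are all hypothetical, and the authors do not prescribe sensitivity and specificity as the metrics to use.

```python
from typing import Dict, Sequence


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute sensitivity and specificity for binary labels (1 = finding present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


def flag_performance_drop(controlled: Dict[str, float],
                          local: Dict[str, float],
                          tolerance: float = 0.05) -> Dict[str, bool]:
    """Flag each metric whose local value falls more than `tolerance` below the controlled value.

    The tolerance is an assumed, illustrative threshold, not a recommended one.
    """
    return {name: (controlled[name] - local[name]) > tolerance for name in controlled}


# Toy data: same ground truth, different predictions in each setting.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
controlled_metrics = sensitivity_specificity(truth, [1, 1, 1, 1, 0, 0, 0, 1])
local_metrics = sensitivity_specificity(truth, [1, 1, 0, 0, 0, 0, 0, 1])
flags = flag_performance_drop(controlled_metrics, local_metrics)
```

Rerunning the same check on fresh local data at intervals is one simple way to picture the durability monitoring named in the final step.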

Elsewhere in the piece, Larson et al. cite a policy brief the same trio wrote for Stanford University’s Institute for Human-Centered Artificial Intelligence. In the brief, they discuss how to build a framework for testing medical AI and encourage medical societies to do more to build trust in the technology.

Both publications point to the COVID-19 pandemic as a wake-up call to medical AI stakeholders in diagnostic radiology and across U.S. healthcare.

In the policy brief, Larson and co-authors write:

“We are beginning to see how AI can enhance quality of life and promote human health. Ensuring that diagnostic algorithms perform effectively both in controlled environments and in real-world settings could improve health outcomes for millions, not just in the United States but around the world. Now is the time to shape these systems’ future with more thoughtful and inclusive regulatory guidance.”

TechStream article here, policy brief here.
