Machine learning (ML) and artificial intelligence (AI) algorithms have the potential to derive insights from clinical data and improve patient outcomes. However, these highly complex systems are sensitive to changes in the environment and liable to performance decay. Even after their successful integration into clinical practice, ML/AI algorithms should be continuously monitored and updated to ensure their long-term safety and effectiveness. To bring AI to maturity in clinical care, we advocate for the creation of hospital units responsible for quality assurance and improvement of these algorithms, which we refer to as “AI-QI” units. We discuss how tools that have long been used in hospital quality assurance and quality improvement can be adapted to monitor static ML algorithms. In contrast, procedures for continual model updating are still nascent. We highlight key considerations when choosing between existing methods and opportunities for methodological innovation.
The use of artificial intelligence (AI) and machine learning (ML) in the clinical arena has developed tremendously over the past decades, with numerous examples in medical imaging, cardiology, and acute care1,2,3,4,5,6. Indeed, the list of AI/ML-based algorithms approved for clinical use by the United States Food and Drug Administration (FDA) continues to grow at a rapid rate7. Despite the accelerated development of these medical algorithms, adoption into the clinic has been limited. Because ML algorithms are highly data-dependent, a major concern is that their performance hinges on how the data were generated in a specific context, at a specific time, and the challenges encountered on the way to successful integration go far beyond the initial development and evaluation phase. It can be difficult to anticipate how these models will behave in real-world settings over time, as their complexity can obscure potential failure modes8. Currently, the FDA requires that algorithms not be modified after approval, which we describe as “locked”. Although this policy prevents the introduction of deleterious model updates, locked models are liable to decay in performance over time in highly dynamic environments like healthcare. Indeed, many have documented ML performance decay due to shifts in patient case mix, clinical practice patterns, treatment options, and more9,10,11.
To ensure the long-term reliability and effectiveness of AI/ML-based clinical algorithms, it is crucial that we establish systems for regular monitoring and maintenance12,13,14. Although the importance of continual monitoring and updating has been acknowledged in a number of recent papers15,16,17, most articles provide limited details on how to implement such systems. In fact, the most similar work may be recent papers documenting the creation of production-ready ML systems at internet companies18,19. Nevertheless, the healthcare setting differs in that errors have more serious repercussions, the number of samples is smaller, and the data tend to be noisier.
In this work, we look to existing hospital quality assurance (QA) and quality improvement (QI) efforts20,21,22 as a template for designing similar initiatives for clinical AI algorithms, which we refer to as AI-QI. By drawing parallels with standard clinical QI practices, we show how well-established tools from statistical process control (SPC) may be applied to monitoring clinical AI-based algorithms. In addition, we describe a number of challenges unique to monitoring AI algorithms, including a lack of ground truth data, AI-induced treatment-related censoring, and the high dimensionality of clinical data. Model updating is a new task altogether, with many opportunities for technical innovation. We outline key considerations and tradeoffs when selecting between model updating procedures. Effective implementation of AI-QI will require close collaboration between clinicians, hospital administrators, information technology (IT) professionals, biostatisticians, model developers, and regulatory agencies (Fig. 1). Finally, to ground our discussion, we will use the example of a hypothetical AI-based early warning system for acute hypotensive episodes (AHEs), inspired by the FDA-approved Edwards’ Acumen Hypotension Prediction Index23.
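To make the SPC idea concrete, a minimal sketch of one classical tool, a one-sided CUSUM chart, applied to a deployed model's stream of per-patient errors is shown below. The in-control error rate, allowance `k`, and threshold `h` are illustrative values chosen for this simulation, not quantities prescribed by any monitoring framework; in practice they would be calibrated to the algorithm's validated performance and an acceptable false-alarm rate.

```python
import numpy as np

def cusum_monitor(errors, target, k=0.05, h=2.0):
    """One-sided CUSUM chart: flag sustained upward drift in model error.

    errors : per-patient losses (e.g., 0/1 misclassification indicators)
    target : in-control mean error, estimated at deployment
    k      : allowance, roughly half the smallest shift worth detecting
    h      : decision threshold; larger h means fewer false alarms
    """
    s, alarms = 0.0, []
    for t, e in enumerate(errors):
        # Accumulate evidence of error exceeding target + k; floor at zero.
        s = max(0.0, s + (e - target - k))
        if s > h:
            alarms.append(t)
            s = 0.0  # reset the statistic after signaling
    return alarms

# Simulated error stream: the misclassification rate drifts
# from 10% to 30% halfway through, mimicking performance decay.
rng = np.random.default_rng(0)
stream = np.concatenate([rng.binomial(1, 0.10, 200),
                         rng.binomial(1, 0.30, 200)])
alarms = cusum_monitor(stream, target=0.10)
```

Unlike a simple moving-average alert, the CUSUM statistic aggregates small, persistent deviations, which suits the gradual decay described above; the later sections' caveats (delayed or missing ground truth, treatment-induced censoring) determine how quickly such a chart can actually react.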