Separating Hope from Hype





Although recent scientific studies suggest that artificial intelligence (AI) could provide value in many radiology applications, much of the hard engineering work required to consistently realize this value in practice remains to be done. In this article, we summarize the various ways in which AI can benefit radiology practice, identify key challenges that must be overcome for those benefits to be delivered, and discuss promising avenues by which these challenges can be addressed.


Key points








  • AI systems can provide value to radiologists in several ways, ranging from reduced time on task to discovery of new knowledge.



  • Potential challenges in deploying AI systems for radiology include myriad technical issues, difficulties mitigating algorithmic bias, and poor alignment between measured performance and clinical value.



  • Promising directions to address these challenges include improved software engineering practices, close clinician involvement in model development, and robust postdeployment monitoring.






How can artificial intelligence provide value to radiologists?


Although headlines often gravitate toward AI systems that claim to perform as well as or better than humans on a particular task, AI can provide value to radiologists in several specific ways. These include automated information extraction from imaging examinations, increased diagnostic certainty, decreased time on task, faster availability of results, reduced cost of care, better clinical outcomes, discovery of new knowledge, and improved patient access to radiological expertise. While other chapters in this volume describe such applications in detail, we provide a brief overview here.


Leveraging information contained within an image to make prognostic and diagnostic decisions is a core component of radiology practice; AI systems can provide value to radiologists by enabling them to do so more effectively. For instance, although a clinician’s diagnostic ability is defined by a combination of first-principles knowledge and experience with specific cases, an AI system can leverage information contained in millions or billions of data points to refine how image features are mapped to prognostic or diagnostic outputs. Recent analyses of AI models trained on large radiology data sets demonstrate the potential not only to improve diagnostic sensitivity or specificity, but also to yield novel image features that correspond more directly to the outcome of interest than those that comprise existing standards. Furthermore, the fact that AI systems can perform such analysis with high levels of standardization across patients—and without being vulnerable to fatigue or cognitive biases—can yield substantial value in the real world. AI-based approaches can augment human analysis both by surfacing information that is not readily apparent and by improving the utility of reconstructed images for human readers.


AI systems can also provide value to radiologists by increasing the speed with which imaging results are processed and by reducing required clinician effort. Automated optimization of worklists, for instance, can reduce time-to-treatment for life-threatening and severe conditions while still ensuring human review of all cases. With appropriate algorithmic design and human factors engineering, the integration of AI-based triage and second read systems into clinical workflows holds the potential to decrease the time required per case. This would simultaneously increase patient access, lower costs, and improve outcomes by enabling radiologists to spend more of their time on cases that require substantial analysis. Decreased time requirements would also help to alleviate the workforce shortage that radiology is expected to experience in the coming years as demand for services continues to increase.
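To make the worklist idea concrete, the following Python sketch reorders a reading queue so that studies the model flags as likely critical are read first while every study still receives human review; the `Study` fields, the `ai_urgency` score, and the 0.8 threshold are illustrative assumptions rather than elements of any particular product.

```python
# A minimal sketch of AI-assisted worklist prioritization: every study is still
# read by a radiologist, but studies the model flags as likely critical are
# moved to the front of the queue. Field names and thresholds are illustrative.
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Study:
    accession: str
    arrival_time: datetime
    ai_urgency: float  # hypothetical model-predicted probability of a critical finding

def prioritize(worklist: List[Study], urgency_threshold: float = 0.8) -> List[Study]:
    """Place AI-flagged studies first (oldest first), then the rest in arrival order."""
    flagged = [s for s in worklist if s.ai_urgency >= urgency_threshold]
    routine = [s for s in worklist if s.ai_urgency < urgency_threshold]
    flagged.sort(key=lambda s: s.arrival_time)
    routine.sort(key=lambda s: s.arrival_time)
    return flagged + routine  # no case is removed; only the reading order changes
```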


Finally, the consistent use of AI systems in radiology practice can yield new knowledge that improves patient care. The development of “radiomic” features that are not discernible by the human eye, but may nonetheless be predictive of outputs ranging from diagnosis to prognosis to treatment response, represents a particularly promising area of research. AI can also play a supporting role in such tasks as patient selection, tumor tracking, and adverse event detection that can inform the clinical trials necessary to create new forms of diagnosis and treatment.


AI systems that provide value in each of these ways have been conceptualized—and in some cases evaluated for clinical use—across a wide variety of applications, many of which have been detailed in this volume. The balance of this article will describe important pitfalls in the development and deployment of these systems that must be addressed for AI systems to provide widespread value for radiologists.


What challenges must be overcome for artificial intelligence to provide value to radiologists?


Translating the potential that academic studies and early clinical trials have shown into concrete improvements in radiology practice will require that researchers and practitioners alike be aware of the challenges that can accompany the development and deployment of AI systems in radiology applications. This section provides an overview of the major pitfalls that AI systems face in radiology, and the subsequent section will outline compelling approaches for addressing these challenges.


Meaningful Performance Measurement


The first, and perhaps most important, challenge in developing an AI system for radiology is ensuring that the task of interest is sufficiently well-posed that performance can be meaningfully measured. Defining a suitable clinically relevant task is not always as easy as it might seem. Consider the example of chest x-ray (CXR) classification, a commonly studied application of AI. Much work in this area has focused on developing deep learning models that classify CXRs into 1 of the 14 different classes used in Rajpurkar and colleagues, but it is clear that several of these classes (eg, atelectasis, consolidation, infiltration) can be inconsistently understood across different clinicians. As a result, models trained for this particular task may confuse these 3 classes, or may provide outputs with which certain clinicians would agree more than others. Such ambiguity in task definition reduces our ability to effectively measure performance. Furthermore, it is critical to ensure that the measure of AI system performance is directly related to the outcome of interest. It is not immediately clear, for instance, that high levels of performance on a 14-class CXR abnormality classification task will translate into one of the types of value described previously (eg, reduced radiologist time, improved diagnostic certainty). In fact, one could argue that framing this task in a slightly different way—binary normal versus abnormal CXR triage for worklist prioritization—could provide more direct clinical value because its place in the clinical workflow is clear, and metrics like turnaround time for high-priority cases can be immediately computed. Collaboration between radiology domain experts and AI developers will remain key to ensuring that AI systems are developed for tasks that are meaningful, and that performance is measured in ways that directly correlate with clinical value.


Even when a clinically useful task has been defined, inappropriately chosen performance metrics can hinder model development (see Jayashree Kalpathy-Cramer and colleagues’ article, “Basic AI Techniques: Evaluation of AI Performance,” in this issue for more detail). Although sensitivity and specificity may be familiar to many clinicians, multiclass classification, segmentation, and reconstruction tasks are evaluated quite differently from binary classification, and metrics suitable to the task must be used. Equity considerations are also important in designing suitable metrics. For instance, deep learning models for classification often perform well on classes that make up most of the training set but poorly on classes that are small. In some situations, this could be acceptable—in which case unweighted metrics are commonly used—but in others, it would not, meaning that class-weighted metrics should be reported. Furthermore, the common use of area under the receiver operating characteristic curve (AUROC) or area under the precision-recall curve as figures of merit should be viewed with caution; while useful in describing overall classification performance, these metrics can be misleading because they do not indicate how a model will perform at the specific operating points that must be chosen in practice.
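The distinction between threshold-free summaries and operating-point metrics, and between unweighted and weighted averaging, can be illustrated with a brief scikit-learn sketch on synthetic data; the labels, scores, and the 0.7 threshold below are arbitrary stand-ins rather than values from any real study.

```python
# A minimal sketch (scikit-learn, synthetic data) contrasting unweighted and
# class-weighted summaries, and AUROC versus metrics at a fixed operating point.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                               # hypothetical labels
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, size=1000), 0, 1)

# Threshold-free summary: useful, but says nothing about a specific operating point.
print("AUROC:", roc_auc_score(y_true, y_score))

# Metrics at the operating point that would actually be used in practice.
threshold = 0.7
y_pred = (y_score >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Sensitivity:", tp / (tp + fn), "Specificity:", tn / (tn + fp))

# Unweighted (macro) and support-weighted averages can tell different stories
# when classes are imbalanced.
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```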


In addition to computing an appropriate metric, evaluation procedures must be designed in a way that yields meaningful results. A common error is assuming that models that are internally validated—that is, that perform well on the same population on which they were developed—will continue to perform well when applied externally (ie, to a different population). Evaluation data sets must represent the population on which a model is intended to be used; otherwise, performance computed on them will be misleading. When comparing multiple algorithms, performance should also be evaluated on a common data set to provide meaningful information. Finally, a frequent pitfall in AI performance measurement occurs when the task schema is insufficiently granular to capture important variations in performance. A common example of this phenomenon—which has been termed “hidden stratification”—occurs in classification problems when performance varies with a variable that the original data set curators did not consider. As shown in Fig. 1, for instance, Oakden-Rayner and colleagues recently demonstrated that while a common CXR classification model yields an overall AUROC value of 0.87 for detecting pneumothoraces, that performance increases to 0.95 on images that display a chest drain and drops to 0.77 on images that do not. Thus, if this model had been deployed in practice, it would have performed much worse on the very population—pneumothoraces without a chest drain—that would be of clinical interest. Similar issues can cause models to be biased and perform poorly on a given subclass (eg, non-Caucasian patients) simply because (a) that subclass makes up a minority of a data set and (b) the data set was not labeled with subclass information.
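A simple subgroup audit of the kind suggested by the pneumothorax example can be sketched as follows; the subgroup names ("drain", "no_drain") and the synthetic scores are hypothetical and merely stand in for real model outputs and chest-drain annotations.

```python
# A minimal sketch of a subgroup audit: overall AUROC can mask very different
# performance on clinically distinct subclasses. Subgroup labels are hypothetical.
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc(y_true, y_score, subgroup):
    """Report overall AUROC and AUROC within each annotated subgroup."""
    results = {"overall": roc_auc_score(y_true, y_score)}
    for g in np.unique(subgroup):
        mask = subgroup == g
        if len(np.unique(y_true[mask])) == 2:      # AUROC needs both classes present
            results[str(g)] = roc_auc_score(y_true[mask], y_score[mask])
    return results

# Example usage with synthetic data standing in for model outputs.
rng = np.random.default_rng(1)
n = 2000
subgroup = rng.choice(["drain", "no_drain"], size=n)
y_true = rng.integers(0, 2, size=n)
separation = np.where(subgroup == "drain", 0.8, 0.3)   # easier versus harder subgroup
y_score = np.clip(y_true * separation + rng.normal(0, 0.3, size=n), 0, 1)
print(subgroup_auroc(y_true, y_score, subgroup))
```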




Fig. 1


ROC curves for subclasses of models trained on multiple data sets. (A) Model performance on different subclasses of the “abnormal” class for a model designed to detect abnormalities on radiographs from the Adelaide Hip Fracture data set. (B) Model performance on different subclasses of the “abnormal” class for a model designed to detect abnormalities in musculoskeletal radiographs from the MURA data set. (C) Model performance on different subclasses of the “pneumothorax” class for a multiclass CXR classification model designed to detect 14 different pathologies on the CXR-14 data set.

(From Oakden-Rayner L, Dunnmon J, Carneiro G, Ré C. Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging. Published online September 26, 2019. Accessed November 4, 2019. http://arxiv.org/abs/1909.12475.)


Creating Training Data Sets


Once an appropriate task and measurement metric have been defined, creating an AI system to perform that task generally requires constructing a data set on which a model will be trained. In supervised learning, which dominates current applications in radiology, this requires curating labeled training data. Unfortunately, the cost of creating these labeled data sets can limit the application of AI systems in clinical practice. In the work of Gulshan and colleagues, for example, each of 128,175 retinal fundus photographs was reported to have been graded by 3 to 7 physicians, most of whom were licensed ophthalmologists. Conservatively assuming 3 labelers per image, 15 seconds per image, and a $100 per hour rate, this comes out to a cost estimate in excess of $150,000 and roughly 180 clinician-days for a single iteration of data labeling; in practice, multiple data labeling efforts are often necessary.
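The arithmetic behind this estimate is easy to reproduce; the per-label time, hourly rate, and working-day length below are the stated assumptions, not measured values.

```python
# Back-of-the-envelope reproduction of the labeling-cost estimate above.
# The per-image time, hourly rate, and working-day length are assumptions.
n_images = 128_175
labels_per_image = 3          # conservative lower bound from the text
seconds_per_label = 15        # assumed grading time per image
hourly_rate = 100             # assumed clinician cost in USD per hour
hours_per_day = 9             # assumed length of a labeling day

total_hours = n_images * labels_per_image * seconds_per_label / 3600
print(f"~{total_hours:,.0f} clinician-hours")                  # about 1,600 hours
print(f"~${total_hours * hourly_rate:,.0f} labeling cost")     # in excess of $150,000
print(f"~{total_hours / hours_per_day:,.0f} clinician-days")   # roughly 180 days
```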


Importantly, even meticulously labeled training sets are not guaranteed to support models that generalize across different diseases, modalities, imaging systems, classification ontologies, clinical protocols, and medical guidelines, all of which change over time and with different application contexts. This concept is known as distribution shift, and it often causes model performance to degrade when a model is used outside of the exact population on which the training set was constructed. This behavior has been observed in a variety of medical applications including pneumonia detection on CXR, diabetic retinopathy detection on retinal fundus photographs, and dermatology image classification, and remains arguably the dominant challenge in applying AI systems in practice. Although various mechanisms for handling distribution shift exist, this problem cannot be considered solved, and mitigating it can remain a major cost driver for AI systems in radiology.


A final reason that the burden of creating training data sets can be problematic for AI systems in radiology is that it can lock in outdated standards of care or treatment protocols. For instance, if an AI system for image triage were trained on a data set that did not contain cases from a newly discovered disease such as COVID-19, it could spuriously deprioritize individuals with those infections. Furthermore, continued use of models trained with expensive data sets that may someday reflect outmoded practice (eg, x-ray scoring systems that disadvantage marginalized patient subpopulations) would result in patients receiving medical recommendations that are below the modern standard of clinical care. For radiologists, this issue may be particularly apparent for imaging protocols, which evolve over time and may be inconsistently implemented. As an example, widespread use of AI systems for computed tomographic (CT) analysis developed using a particular protocol for contrast timing may result in the continued use of that protocol even though it may be suboptimal for other reasons.


In summary, creating appropriately representative labeled data sets is likely to remain a challenge for widespread use of AI systems in radiology, both because of the associated cost and because of the inherent difficulty of ensuring that a data set represents all important axes of variation, including variations caused by future changes in radiology technology and practice.


Mitigating Algorithmic Bias


A major challenge for both users and developers of clinical AI systems is ensuring that they do not create or amplify biases in the provision of care that would disadvantage particular groups of patients. In technical parlance, this involves building models that are “robust” to important variations in the patient population such as gender, ethnicity, socioeconomic status, and other protected factors. As described earlier, creating representative data sets for training and evaluation of AI models is an important component of mitigating model bias, and it is worth further discussing specific error modes that can lead to biased data sets. First, data from electronic health records (EHRs) are often not collected with algorithm development in mind, meaning that models developed using cohorts and labels drawn from EHRs may contain a variety of inherent biases, such as those resulting from the use of billing codes rather than pathologic descriptions for diagnoses. Second, because it can be difficult to access patient data (even for patients themselves), standard strategies for enrolling diverse populations in clinical development efforts can be difficult to apply. Some health systems also suffer from selection bias, where information that would be useful for data labeling is recorded only for cases of particular academic or clinical interest. Furthermore, even with an appropriate cohort design, data may be missing or available only in certain segments of the population. A particularly striking example of this situation was highlighted recently by the work of Kaushal and colleagues, which showed that most of the AI studies in imaging performed in the United States leveraged data from only 3 states. Prospective users of AI systems must be constantly vigilant for these types of data set curation issues that can result in biased algorithms.


Common training approaches that do not account for such issues as hidden stratification or class imbalance can also result in biased models. For instance, models are often trained to optimize average performance; such procedures result in models that perform well on majority classes (or subclasses) at the expense of less common groups in the population.


Finally, it is worth pointing out that unintended bias can also occur in algorithms aimed at improving elements of the image reconstruction process in volumetric imaging. For example, although both tomographic and MR imaging protocols could benefit from AI-based steps in the calibration, signal conditioning, denoising, and reconstruction processes, it is not always clear that mathematical transformations learned on a particular data set or population will provide similar utility on other data sets. Common axes of variation that should be considered in data set curation and algorithm design for such applications include scanner or hardware type, examination protocol, tracer type, patient characteristics, and other parameters that could affect image acquisition and reconstruction.


Measuring Correlation Instead of Causation


A particularly concerning pitfall of deep learning systems has been their ability to make accurate predictions based on features that are correlated with the outcome but noncausal. In radiology, examples include algorithms that predict severe disease when they recognize that a portable scanner was used instead of a fixed x-ray machine (which would require the patient to be well enough to travel to the radiology department for the image), and those that rely on the presence of chest drains to predict pneumothorax. In dermatology, a prime example is a recent algorithm that used the presence of surgical markings to recognize melanoma in dermoscopic images. Because deep learning systems are usually optimized to maximize a specific performance metric without considering causality, they are prone to mistakes such as these, predicting outcomes based on confounding, noncausal features.


Technical and Engineering Issues


Even if the risks described to this point are appropriately mitigated, a variety of common technical issues can result in AI systems that do not perform as designed. One such problem is overfitting, which occurs when models perform well on a training set but poorly on a held-out evaluation set; this is often the result of insufficient regularization during training or distribution shift between training and evaluation sets. Data leakage between training and evaluation sets occurs when samples that are in the evaluation set also appear in the training set, and leads to overly optimistic performance metrics on the evaluation set because the model was exposed to very similar examples during training (and it can memorize them rather than learn generally useful image features). Although the exact same examples can be included in both sets by accident, a more subtle version of this same error can occur when examples from the same patient are included in both training and evaluation sets.
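Patient-level leakage of this kind can be avoided by splitting on patient identifiers rather than on individual images, as in the brief scikit-learn sketch below; the patient IDs and labels shown are synthetic placeholders.

```python
# A minimal sketch of a patient-level split: grouping by patient ID guarantees
# that no patient contributes images to both training and evaluation sets.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_images = 10
X = np.arange(n_images).reshape(-1, 1)                   # stand-in for image features
y = np.random.default_rng(0).integers(0, 2, n_images)    # stand-in for labels
patient_ids = np.array(["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# No patient appears on both sides of the split.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
print("train patients:", sorted(set(patient_ids[train_idx])))
print("test patients:", sorted(set(patient_ids[test_idx])))
```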


Poorly calibrated models can also be problematic. A “calibrated” model is one in which the quantitative values output from the model reflect true probabilities; for example, if a well-calibrated diagnostic algorithm predicts that each of 4 patients has a disease with 75% probability, one should expect that 3 of those 4 patients would actually have the disease. If a model is not calibrated, clinicians could erroneously interpret model outputs in a manner that would negatively affect patient care.
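Calibration can be checked by binning predicted probabilities and comparing each bin's mean prediction with the observed event rate, as in the following sketch; the scores and outcomes are simulated, and the deliberate miscalibration is only for illustration.

```python
# A minimal sketch of a calibration check on synthetic data: bin predicted
# probabilities and compare each bin's mean prediction with the observed rate.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(2)
y_prob = rng.uniform(0, 1, size=5000)          # hypothetical model outputs
y_true = rng.binomial(1, y_prob**2)            # deliberately miscalibrated outcomes

frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positive):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
# Large gaps between the two columns indicate that the raw scores should not be
# read as probabilities without recalibration (eg, Platt scaling or isotonic regression).
```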


AI systems can also simply fail; because AI systems are a type of software, bugs are unfortunately a fact of life. In radiology applications, particularly important types of engineering errors include images that are corrupted in transmission/storage and cause erroneous predictions; preprocessing differences between data sets or institutions that result in distribution shift; or simple coding errors that cause model weights to be incorrectly loaded or output to be incorrectly computed. These errors can have real-world consequences, like a critically ill patient being deprioritized or benefits being withheld from needy individuals.
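Many such failures can be caught by defensive checks around inference, as in the sketch below; the expected image shape, intensity range, and the notion of a `model` object are assumptions chosen for illustration rather than requirements of any specific system.

```python
# A minimal sketch of defensive checks before and after inference, assuming a
# NumPy pixel array; the expected shape and intensity range are illustrative
# and would be set per deployment.
import numpy as np

EXPECTED_SHAPE = (1024, 1024)   # assumed input size for a hypothetical model
INTENSITY_RANGE = (0.0, 1.0)    # assumed range after preprocessing

def validate_input(pixels: np.ndarray) -> None:
    """Raise rather than silently producing a prediction on malformed input."""
    if pixels.shape != EXPECTED_SHAPE:
        raise ValueError(f"unexpected image shape {pixels.shape}")
    if not np.isfinite(pixels).all():
        raise ValueError("image contains NaN or infinite values (possible corruption)")
    lo, hi = INTENSITY_RANGE
    if pixels.min() < lo or pixels.max() > hi:
        raise ValueError("pixel intensities outside the expected preprocessing range")

def validate_output(probabilities: np.ndarray) -> None:
    """Sanity-check model output before it reaches the worklist or the report."""
    in_range = ((0 <= probabilities) & (probabilities <= 1)).all()
    if not np.isfinite(probabilities).all() or not in_range:
        raise ValueError("model output is not a valid probability vector")
```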


Finally, for image enhancement and reconstruction applications, a major technical challenge involves ensuring that as AI algorithms generate images that are more suitable for human interpretation, they do not insert spurious information that was not in the original image. The difference between imputation (the recovery of lost or imperfectly acquired information), enhancement (making better use of existing information), and hallucination (the creation of information that was not in the original image) is often subtle, and it can be difficult even for domain experts to evaluate. As this area of the field—sometimes referred to as “upstream AI”—matures further, it will be critical to develop robust metrics and evaluation procedures to ensure that AI-enabled image processing techniques can provide value by improving image analysis without inserting spurious information.


Postdeployment Monitoring


Postdeployment monitoring represents an additional challenge for the deployment of AI systems. To mitigate issues related to distribution shift and model bias—as well as to continuously evaluate whether a model is providing the anticipated operational benefit—it is critical that models be constantly under assessment while deployed. Various strategies for postdeployment monitoring exist, including manual human audits of model output, automated algorithmic evaluation of distribution shift or hidden stratification, out-of-distribution (OOD) sample detection, and continued evaluation protocols, but many academic studies that demonstrate the initial viability of an AI system do not consider how postdeployment monitoring should be implemented. Furthermore, when considering whether to deploy a given AI system, the cost of continuous monitoring—which includes subject matter expert time, additional data curation, and even the expense of taking a model out of service if it begins performing poorly—must be considered.
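One simple form of automated postdeployment monitoring is to compare the distribution of recent model outputs against a reference window from validation, as sketched below; the two-sample Kolmogorov-Smirnov test, the significance threshold, and the simulated score distributions are illustrative choices, and real monitoring programs typically combine several such signals with human audits.

```python
# A minimal sketch of automated drift monitoring on model output scores using a
# two-sample Kolmogorov-Smirnov test. Thresholds and window sizes are assumptions.
import numpy as np
from scipy.stats import ks_2samp

def check_score_drift(reference_scores, recent_scores, alpha=0.01):
    """Flag the model for human review if recent outputs look unlike the reference."""
    statistic, p_value = ks_2samp(reference_scores, recent_scores)
    return {"ks_statistic": statistic, "p_value": p_value, "drift_flag": p_value < alpha}

# Example: simulated reference scores versus a shifted post-deployment window.
rng = np.random.default_rng(3)
reference = rng.beta(2, 5, size=5000)     # scores observed during validation
recent = rng.beta(2, 3, size=1000)        # hypothetical shift after deployment
print(check_score_drift(reference, recent))
```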


Deployment Details


In addition to technical and functional issues, deploying AI algorithms in radiology practice raises several ethical, medicolegal, economic, and logistical questions that have not yet been convincingly resolved.


First, if an outside developer creates a model, it must be decided how liability for mistakes that occur in the course of practice should be divided among the radiologist, the algorithm developer, the device manufacturer, and other relevant parties. Furthermore, it is often not clear how model output should be explained to a patient, whether patients should be informed that AI algorithms were used in their care, and what recourse might be available for disputing treatment decisions made based on model output. These issues become even more fraught if models have been fine-tuned for a particular site or deployment environment, and may depend on whether a given model has been developed internally on custom or open-source tooling, has been developed internally using a commercial platform, or is provided via a software-as-a-service or model-as-a-service agreement.


Second, AI models and deployment hardware must be co-optimized to ensure that model execution time is sufficiently rapid to provide the anticipated value. In particular, if users of models deployed to edge devices (eg, laptops or phones), on extremely large images (eg, volumetric scans), or in time-critical contexts like interventional radiology do not ensure that sufficient compute capability and network bandwidth are available to support the proposed use cases, the resulting slowdown in computing model outputs could have negative clinical consequences. The alternative, however, may be the deployment of expensive new hardware at clinical sites or the use of cloud processing, each of which involves its own risks and benefits.
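Whether a proposed deployment meets its latency budget can be checked with a short benchmark on the target hardware, as in the sketch below; the placeholder network, input size, and 1-second budget are assumptions for illustration.

```python
# A minimal sketch of measuring inference latency on the target hardware before
# deployment; the model, input size, and latency budget are illustrative.
import time
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()    # untrained stand-in for the deployed model
example = torch.rand(1, 3, 224, 224)
LATENCY_BUDGET_S = 1.0                          # assumed per-study budget

with torch.no_grad():
    model(example)                              # warm-up run
    times = []
    for _ in range(20):
        start = time.perf_counter()
        model(example)
        times.append(time.perf_counter() - start)

p95 = sorted(times)[int(0.95 * len(times)) - 1]
print(f"95th percentile latency: {p95 * 1000:.1f} ms (budget {LATENCY_BUDGET_S * 1000:.0f} ms)")
```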


Third, for models to be used ethically, policies regarding the use of and access to patient data by the patients, the treatment center, and any external parties must be explicitly delineated. Unfortunately, in many contexts, public policy has not yet provided sufficient guidance for users to know exactly what procedures should be observed on this front.


Fourth, security considerations in deployment must be appropriately addressed. Were bad actors to gain access to a model or its training data, various attacks can be envisioned that could reveal patients’ identities, interfere with treatment, or exfiltrate valuable data to which various parties (including the patient) have exclusive rights as well as expectations of privacy. Proposed AI deployments in radiology often do not fully consider the scope of potential attack vectors on both data and models, and do not explicitly guard against such attacks as data poisoning (affecting model performance by altering training data) or model inversion (reconstructing training data from model parameters). Remaining robust to these sorts of attacks is heavily related to the postdeployment monitoring described earlier, and may benefit from specific approaches to model training and evaluation.


User Trust


For AI to provide value in radiology practice, these systems must gain the confidence of both patients and clinicians. Concerns about the deleterious effects of automated assistance on radiologist performance, lack of interpretability in clinical decisions, and the potential for reinforcement of existing biases or outmoded practice must be overcome. Automation bias is a serious problem wherein the very fact that human readers have algorithmic support causes them to trust the automated result even when it is flawed. Deep neural networks have well-documented difficulties establishing exactly what reasoning led to a given model output. The danger of introducing models that disadvantage particular patient groups is ever-present. As a result, to make effective and equitable use of AI in radiology, it is critical to design workflows that incorporate not only algorithmic input and broad clinical domain expertise, but also the individualized expertise that doctors have about the situation of each patient and the intimate knowledge that each patient has of their own body.


Regulatory Approval


Deployment of AI algorithms for clinical use cases will rarely occur outside the bounds of governmentally stipulated regulatory structures. As a result, regulations for AI systems in radiology must be designed to balance potential improvements in patient care with the risks that such systems can pose if deployed incorrectly. Though both governmental agencies and independent bodies have recently made substantial progress toward defining constructive paths forward, the evolving regulatory environment will likely mean that certain applications will move faster than others (eg, computer-assisted detection vs computer-assisted diagnosis), and that it will be particularly important for clinicians to understand exactly what models can and cannot do before using them in practice. Although substantial discussion of regulation for clinical AI models is handled in a separate chapter, it suffices to say that clinicians intending to use AI in practice should remain up to date on regulations governing system use, processes for approval, and associated reporting requirements.


How can these challenges be overcome?


Although the challenges described earlier are substantial, technical and operational approaches to mitigate nearly all of them either exist or are in development. The degree to which AI algorithms can provide meaningful value in radiology practice will likely be determined by the effectiveness with which these techniques are implemented in practice and rigorously analyzed in the context of real-world operational data.


Meaningful Performance Measurement


Several concrete steps could help to improve the performance measurement of AI models in radiology.


First, common, widely available data sets suitable for evaluating performance on tasks of clinical interest should be constructed and continuously updated by objective bodies such as professional societies, academic consortia, or government agencies. Importantly, these evaluation data sets should be labeled in a way that closely reflects the intended workflow into which the model will be deployed, as opposed to using arbitrary academic schema. Existing efforts like data sets released by the Radiological Society of North America (RSNA), The Cancer Imaging Archive (TCIA), and others should be expanded. Furthermore, each task of clinical interest should have evaluation data sets that are frequently updated so that models can be evaluated on the latest imaging technologies and not be allowed to overfit to a particular evaluation set.


Second, data sets should be labeled with important subclasses to enable analysis of potential model bias and reduce the impact of hidden stratification. Recent unsupervised methods can also be used to algorithmically identify subclasses of interest.
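One simple version of such an unsupervised approach is to cluster a model's learned representations within a labeled class and then audit performance cluster by cluster, as sketched below; the embeddings and error indicators are random stand-ins for real model features and predictions.

```python
# A minimal sketch of algorithmic subclass discovery: cluster penultimate-layer
# embeddings of examples from a single labeled class and audit performance per
# cluster. The embeddings here are random stand-ins for real model features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
embeddings = rng.normal(size=(500, 128))   # hypothetical features for one class
errors = rng.integers(0, 2, size=500)      # 1 if the model misclassified the example

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
for c in range(5):
    mask = clusters == c
    print(f"cluster {c}: n={mask.sum()}, error rate={errors[mask].mean():.2f}")
# Clusters with unusually high error rates are candidates for unlabeled subclasses
# (eg, pneumothorax without a chest drain) that merit explicit annotation.
```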


Third, it may sometimes be beneficial to define the scope of model functionality more narrowly to enable sharper measurements of performance. Instead of aiming for a single model that can generalize across data from different institutions (ie, multiple distributions), for instance, modelers could consider developing multiple different single-institution models and avoid having to constantly measure relative performance across potentially different populations. Conceptually, this idea resembles recent approaches from precision medicine. If applied carefully, such a strategy could improve the utility of performance measurements for AI models in radiology.


Finally, assessing model performance on downstream clinical tasks—rather than on intermediate performance metrics like accuracy—will help to ensure that performance is measured in a way that is clinically meaningful. Ideally, direct comparison to existing baseline systems should be performed via randomized controlled trials wherein the AI system is directly integrated into a clinician workflow and the downstream clinical outcome is measured. The more realistic the setting, and the closer we can come to measuring clinical value rather than algorithmic performance, the more likely we are to arrive at a useful assessment of AI system utility.


Creating Training Data Sets


Recent technical progress on methods that can relieve the burden of creating and updating data sets has been promising. First, methods from weak supervision have enabled large data sets with weaker, noisier labels to support AI models that perform similarly to those trained on hand-labeled data sets of similar size. Many of these methods directly leverage human expertise in a way that enables rapid relabeling and retraining to combat degradation in model performance and distribution shift. Automated labelers based on natural language processing (NLP) have also shown promise in building labeled data sets, though adapting them to new domains can be labor-intensive.
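A stripped-down illustration of the weak supervision idea appears below: several noisy, programmatic labeling functions vote on each report, and their (here, simple majority) vote becomes a training label; the heuristics are invented for illustration, and published frameworks estimate labeling-function accuracies rather than taking an unweighted vote.

```python
# A minimal sketch of weak supervision for report-based labeling: noisy heuristics
# vote on each radiology report, and the majority vote becomes a (noisy) label.
import numpy as np

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_mentions_pneumothorax(report: str) -> int:
    return POSITIVE if "pneumothorax" in report.lower() else ABSTAIN

def lf_negated(report: str) -> int:
    return NEGATIVE if "no pneumothorax" in report.lower() else ABSTAIN

def lf_normal_study(report: str) -> int:
    return NEGATIVE if "no acute abnormality" in report.lower() else ABSTAIN

def weak_label(report: str) -> int:
    votes = [lf(report) for lf in (lf_mentions_pneumothorax, lf_negated, lf_normal_study)]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return int(np.round(np.mean(votes)))   # simple majority vote; real frameworks learn weights

print(weak_label("Large right pneumothorax with mediastinal shift."))   # 1
print(weak_label("No pneumothorax. No acute abnormality."))             # 0
```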


Other technical approaches have focused on leveraging additional sources of signal within the model training process. Modern data augmentation techniques enable users to increase the effective size of training data sets by applying transformations to existing images without disrupting the meaningful features within those images. Common examples include applying rotations to labeled images or synonym swaps to labeled text in language modeling tasks. Multitask learning—building models that learn to perform multiple, related tasks simultaneously—can also help to decrease the number of labeled examples required by leveraging additional information from the data set. Transfer learning applies a similar approach, but usually involves 2 steps: (1) pretraining a model on a task that is related to the final task of interest and (2) fine-tuning that pretrained model by continuing to train it on the task of interest. In medical computer vision applications, for instance, it is particularly common to use models that are pretrained on the ImageNet database as a starting point on which to train models for clinical use cases. Recent approaches from self-supervision and contrastive learning that leverage large, unlabeled data sets for model pretraining have also shown promise in reducing the required size of labeled data sets.
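The augmentation and transfer-learning recipe described above can be sketched in a few lines of PyTorch/torchvision; the backbone choice, augmentation parameters, and 2-class head are illustrative assumptions, and the `weights=` argument assumes a recent torchvision release.

```python
# A minimal sketch of the transfer-learning recipe: start from an ImageNet-pretrained
# backbone, replace the final layer for a binary task, and define simple
# label-preserving augmentations. Choices here are illustrative, not recommendations.
import torch.nn as nn
from torchvision import models, transforms

# Label-preserving augmentations that effectively enlarge the training set.
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Pretraining step: weights learned on ImageNet serve as the starting point.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Fine-tuning step: swap the 1000-class head for a 2-class head and continue training.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)
```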


In clinical applications, another way that the data curation burden can be reduced is by standardizing protocols. Instead of having to train models over images acquired via a wide variety of protocols—for example, tube currents, voltages, and reconstruction settings in CT—it can be advantageous to train models on images obtained using a standard protocol and then ensure that such models are applied only to images obtained using that standard protocol. Similar to the precision medicine perspective presented earlier, this approach trades off generalizability for a narrow task definition.


Mitigating Algorithmic Bias


Combating algorithmic bias is one of the most important tasks required to deploy AI models ethically and equitably within radiology practice. In addition to constructing training data in as unbiased a way as possible, several additional approaches can help to mitigate this problem.


First, a variety of training algorithms that aim to improve worst-case subgroup performance—that is, to ensure that there is no subgroup of the data on which a model performs substantially worse than on others—have been the focus of recent research. As these and additional approaches for improving algorithmic fairness are developed, they should be considered for clinical translation.
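A simplified stand-in for such group-robust training is to compute the mean loss within each annotated group and optimize the worst of them, as sketched below; this is not the published formulation of any specific method, and the group labels and tensors are synthetic.

```python
# A minimal sketch of training against the worst-performing group rather than the
# average: compute the mean loss within each annotated group and optimize the maximum.
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, group_ids):
    """Return the largest per-group mean cross-entropy loss in the batch."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    group_losses = []
    for g in torch.unique(group_ids):
        mask = group_ids == g
        group_losses.append(per_example[mask].mean())
    return torch.stack(group_losses).max()

# Example with random tensors standing in for a model's batch outputs.
logits = torch.randn(16, 2, requires_grad=True)
labels = torch.randint(0, 2, (16,))
groups = torch.randint(0, 3, (16,))     # eg, scanner type or patient subgroup
loss = worst_group_loss(logits, labels, groups)
loss.backward()                         # gradients flow only through the worst group
```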


Second, because these training algorithms are often used during model development rather than model deployment, clinical users may rarely interact with them. However, clinical users will routinely be exposed to model output, and as a result, tooling designed to clearly and dynamically evaluate model robustness will become an increasingly important part of successful AI deployments in radiology. Research and development studies focused on enabling clinical users to reliably determine which model features are most responsible for a given output, to quickly assess model performance on a wide variety of subclasses or subgroups, and to rapidly evaluate the effect of such variations on clinical outcomes would improve our ability to deploy models equitably.


Finally, direct participation from physician and patient communities in model development and deployment can help to ensure that individuals are best served by these models in practice. Indeed, as pointed out by Esteva and colleagues in their recent review article, community participation recently enabled the discovery of data set bias and identified demographics underserved by a model for population health management. A similar case occurred when evaluating models for detecting diabetic retinopathy in Southeast Asia, where socioeconomic factors heavily impacted model efficacy. If radiologists are able to deploy AI models in cooperation with their clinical communities—while ensuring that non-AI backups are used when appropriate—these capabilities stand a much better chance of having a clinical impact that is both positive and equitable.


Measuring Correlation Instead of Causation


Ensuring that models do not rely on confounding variables in making their predictions requires many of the same strategies described earlier. Model auditing by human actors can help to discover cases where models make the right prediction for the wrong reason. External validation can be a particularly helpful tool in ensuring that data set artifacts are not responsible for model performance. Encouraging models to respect important invariances via data augmentation strategies can further reduce the possibility of noncausal features driving model predictions. Finally, interpretability analyses such as heatmaps that identify which structures informed the algorithmic decision, along with other visualization methods, can help radiologists to identify such behavior before it becomes a problem (Fig. 2).
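One elementary form of such a heatmap is a gradient-based saliency map, sketched below in PyTorch; the untrained placeholder network and random input merely stand in for a real model and image, and published tools offer more refined attributions.

```python
# A minimal sketch of a gradient-based saliency map: the magnitude of the input
# gradient indicates which pixels most influenced the predicted class.
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()     # untrained placeholder network
image = torch.rand(1, 3, 224, 224, requires_grad=True)

scores = model(image)
predicted_class = scores.argmax(dim=1).item()
scores[0, predicted_class].backward()

saliency = image.grad.abs().max(dim=1).values    # (1, 224, 224) heatmap
print(saliency.shape)
# If the brightest regions sit over a chest drain or a scanner label rather than
# the suspected pathology, the model may be exploiting a confounder.
```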

