Key points
- With the rapidly expanding use cases for AI, there is a growing need for proper evaluation of developed algorithms.
- Although the external test set is the gold standard for evaluation of model performance, algorithms ideally should be continuously evaluated prospectively and updated appropriately.
- The choice of metrics used when evaluating AI models varies depending on the application/task.
- The quality of the ground truth needs to be considered in assessing performance metrics.
Automated approaches in health care have been transformed by machine learning. Although the use of pre-engineered features combined with traditional machine learning approaches has been popular, their implementation has been limited by the need to manually extract and design features. Interest in machine learning has been catalyzed by advancements in deep learning. Breakthrough performance across various tasks within computer science spurred interest in expanding the technique to the medical domain. The need for proper evaluation of these algorithms is an active area of research. The focus of this article is on current issues, techniques, and challenges of evaluating medical artificial intelligence.
The expanding use cases of artificial intelligence in health care
Automated approaches in health care have been fundamentally transformed by the advent of machine learning. Although the use of pre-engineered features combined with traditional machine learning approaches has been popular, their implementation across many domains has been limited by the need to manually extract and design features. Interest in machine learning has been catalyzed in recent years by advancements in deep learning, a technique in which neural networks are used to automatically learn pertinent patterns and features directly from the input data. In particular, breakthrough performance across various tasks within computer science spurred interest in expanding the technique to the medical domain.
Recent studies have reported high performance in a variety of medical tasks in domains like cardiology, ophthalmology, pathology, and dermatology. The applications of deep learning within radiology also have been promising, with numerous algorithms developed within all subspecialties, such as neuro-, musculoskeletal, cardiac, chest, breast, abdominal radiology, and nuclear medicine. Given that much of medicine relies on the interpretation of laboratory tests, physiologic signals, and medical imaging, the use cases for medical artificial intelligence (AI) continue to expand at a remarkable pace. The growth of medical AI studies has not come without peril, however: the need for proper evaluation of these algorithms is an active area of research, discussion, and evolution. The focus of this article is on discussing current issues, techniques, and challenges of evaluating medical AI.
Generalizability of artificial intelligence algorithms
A machine learning study typically utilizes training, validation, internal testing, and external testing sets, all of which should be nonoverlapping. The training set is a partition of the data that is used to directly train the model. The validation set is used during training to tune hyperparameters of the model. The internal testing set is derived from the same distribution as the training and validation sets (ie, data from the same institution). The external testing set is a completely independent set of data (ie, data from a different institution). Both the internal and external testing sets are used for evaluation only after the model is fully trained.
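As a concrete illustration, the following is a minimal Python sketch of such a split using scikit-learn; the synthetic arrays, institution names, and split proportions are illustrative assumptions rather than a prescribed methodology.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins for data from 2 hypothetical institutions (illustrative only).
X_a, y_a = rng.normal(size=(1000, 16)), rng.integers(0, 2, size=1000)  # "institution A"
X_b, y_b = rng.normal(size=(300, 16)), rng.integers(0, 2, size=300)    # "institution B"

# Hold out 20% of institution A as the internal testing set.
X_dev, X_int_test, y_dev, y_int_test = train_test_split(
    X_a, y_a, test_size=0.20, stratify=y_a, random_state=0)

# Split the remainder into training (~70% of A) and validation (~10% of A) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.125, stratify=y_dev, random_state=0)

# Institution B is never used for training or hyperparameter tuning; it serves
# only as the external testing set once the model is fully trained.
X_ext_test, y_ext_test = X_b, y_b
```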
Algorithmic performance is measured via a task-specific metric, discussed later. Algorithmic generalizability is assessed by measuring a model’s performance on new, unseen data. A model with high performance on the training set is not guaranteed to be clinically useful if the model is overfit or does not generalize well. Modern neural networks are extremely large, with many millions of trainable parameters. With further advancements in hardware, models continue to grow wider, deeper, and more complex. In contrast, medical data sets contain several orders of magnitude fewer data inputs. This discrepancy results in an ease of overfitting, a phenomenon known as the curse of dimensionality. Thus, perfect or near-perfect performance on the training set has no prognostic value about the generalizability of a model because a model may memorize the training data. Similarly, performance on the validation set is of little value because it is continuously being used during model training to tune model hyperparameters. This results in indirect overfitting due to leakage of information. On the other hand, the internal testing set does reveal some information about model generalizability, but it is limited in scope because the data usually are within the same distribution as the training and validation sets. Moreover, although some algorithms may perform satisfactorily on the internal testing set, they still can fail to generalize on the external testing set due to differences (minute or macro) in the data distribution. Ideally, multiple external testing sets are collected, but this can be challenging from a logistical perspective. For these reasons, the external testing set is the gold standard for evaluation of model performance.
When a model is evaluated on external testing sets, drops in performance often can be attributed to shifts in data distribution. To start, the patient populations of the training and external sets could differ in terms of demographics (such as age, gender, disease prevalence, and race). For example, a model trained on data with a high prevalence of disease is not calibrated to a testing set with a low prevalence of disease. Second, the image acquisition settings may differ across sites. For example, different mammography systems may have differing x-ray tube targets, filters, digital detector technology, and control of automatic exposures that can affect model generalizability. Similarly, different magnetic resonance (MR) scanners likely have different field strengths, slice thicknesses, repetition times, and echo times. Lastly, the neural network may have learned a confounder in the training data set that is not present in the external data set. For example, a neural network that learns that pneumonia more likely is seen on a portable chest radiograph (which more likely is used in an acute setting like the intensive care unit or emergency department) does not perform well when tested on a data set of outpatient radiographs. Along the same lines, another study demonstrated that neural networks can learn to associate disease status with data set–specific laterality markers, arrows, and other forms of annotation, which is a troubling sign if the goal is to train an algorithm to detect disease pathology. Algorithms are not explicitly trained to learn disease pathology; rather, they are optimized to correlate input data with the label of interest. Using an external testing set can elucidate these generalizability pitfalls.
Evaluation does not end with an external validation set. Just as bacteria continue to evolve, requiring continual refinement of antibiotic regimens, data too evolve over time with changing technologies and patient populations. As such, the performance of algorithms needs to be continuously evaluated prospectively and updated appropriately. To facilitate this, research on algorithm development ideally is fully transparent, with a clear description of methodology for ease of reproducibility by other research groups. Even more preferable would be release of the source code and trained models themselves, a trend that is steadily becoming more prevalent within the literature but challenging and potentially infeasible in cases of commercial algorithms.
The issue of bias in machine learning
In the evaluation of an algorithm, a critical component that should be addressed is whether the algorithm is biased. Specifically, it is important to address whether there are discrepancies in performance across different subgroups, such as age, gender, and insurance status. Recent studies on deep learning algorithms for chest radiograph classification and acute kidney injury prediction have shown that there is differential performance across different subgroups. The reasons for this discrepancy can include representation in the training set (ie, certain groups are less represented in the training set) and differences in label noise (ie, certain groups are more likely to have incorrect diagnoses). In the deployment of algorithms, it is the developer's responsibility to ensure that the algorithm is ethical; that is, that its use helps ameliorate rather than perpetuate health care disparities. Thus, assessment of algorithm fairness is a key component of algorithm evaluation. Addressing this issue, whether through the curation of fairer training sets or debiasing of algorithms, is an area of active research.
Tasks in medical image analysis
A majority of supervised image analysis algorithms can be categorized broadly into 4 bins: classification, detection, segmentation, and regression. Classification is defined as the image-level assignment of a label from a finite set of categories. Detection involves coarsely identifying the region of the image where the label(s) is present. Segmentation involves classifying each pixel in the image to a unique label. And regression is defined as assigning each image a continuous-valued response. For example, there may only be interest in whether a patient's radiograph is positive for pneumonia, in which case a classification algorithm should be used to differentiate normal from pathologic radiographs. If there is further interest in drawing a bounding box around the predicted pneumonia, a detection algorithm should be applied. If accurately delineating the pneumonia from healthy lung tissue is needed, a segmentation algorithm is chosen. And, finally, if assessing the severity of the pneumonia is desired, a regression algorithm to output higher values for more severe cases is used.
Task-specific metrics for evaluation of artificial intelligence models
The choice of metrics used when evaluating AI models varies depending on the application/task. This section presents the most common methods used in the literature for binary classification, multiclass classification, object detection, segmentation, retrieval, and regression.
Binary Classification
In binary classification, each sample has 2 possible values for its true label (positive and negative) and 2 possible values for its predicted value (also positive and negative). Each sample consequently falls into 1 of the 4 following groups, depending on the combination of its true and predicted label: (1) true positive (TP)—the sample is positive and is labeled correctly as positive by the classifier; (2) false positive (FP)—the sample is negative but is labeled incorrectly as positive by the classifier (also known as a type I error or false alarm); (3) true negative (TN)—the sample is negative and is labeled correctly as negative by the classifier; and (4) false negative (FN)—the sample is positive but is labeled incorrectly as negative by the classifier (also known as a type II error or miss). The number of samples from a test set that fall into each of these categories is denoted as $N_{TP}$, $N_{FP}$, $N_{TN}$, and $N_{FN}$, respectively; the total number of samples in the test set as $N_{TOTAL}$.
. There are several metrics that use the counts of samples in these 4 categories to summarize the performance of the model.
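The following minimal Python sketch illustrates how these 4 counts can be tallied from arrays of true and predicted labels; the label arrays are synthetic and purely illustrative.

```python
import numpy as np

# Illustrative true labels and binarized predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

n_tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
n_fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
n_tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
n_fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
n_total = y_true.size

print(n_tp, n_fp, n_tn, n_fn, n_total)  # 3 1 3 1 8
```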
Accuracy
Perhaps the simplest metric is accuracy. Accuracy quantifies the proportion of samples for which the predicted label matches the true label. It, therefore, is equal to the proportion of all samples that are either TPs or TNs: $\text{Accuracy} = \frac{N_{TP} + N_{TN}}{N_{TOTAL}}$. Accuracy lies in the range 0 to 1 and is expressed commonly as a percentage. A perfect classifier has an accuracy of 1 (100%), whereas a random classifier (one that randomly chooses a class label with uniform probability) has an accuracy of 0.5 (50%) because it chooses the correct predicted label by chance half of the time.
Although simple and intuitive, accuracy commonly is considered insufficiently informative for most applications because it does not differentiate between the 2 types of errors (FPs and FNs), which commonly have very different significance in practice. Furthermore, if a test data set is highly imbalanced, a classifier may achieve a high accuracy simply by predicting the most common class label for all samples.
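The following short sketch, using a synthetic and deliberately imbalanced label set, illustrates this pitfall: a degenerate classifier that predicts the majority class for every sample still attains high accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic, highly imbalanced test set: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
# Degenerate "classifier" that always predicts the majority (negative) class.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.95, despite missing every positive sample
```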
Sensitivity and Specificity
Sensitivity and specificity together give a more detailed view of a classifier by considering the performance on positive samples and negative samples separately. Sensitivity (also referred to as the TP rate or recall) quantifies the proportion of positive samples that are labeled correctly as positive by the classifier, whereas specificity (also referred to as the TN rate) quantifies the proportion of negative samples that are labeled correctly as negative. These quantities may be expressed as follows: $\text{Sensitivity} = \frac{N_{TP}}{N_{TP} + N_{FN}}$ and $\text{Specificity} = \frac{N_{TN}}{N_{TN} + N_{FP}}$. Note that the denominators of these expressions, ($N_{TP} + N_{FN}$) and ($N_{TN} + N_{FP}$), are equal to the total number of positive and negative samples in the data set, respectively. Like accuracy, sensitivity and specificity lie in the range 0 to 1 or often are expressed as percentages. A perfect classifier has both sensitivity and specificity of 1 (or 100%), whereas a random classifier has sensitivity and specificity scores of 0.5 (50%). In practice, improving sensitivity often comes at the cost of reducing specificity and vice versa.
Because sensitivity and specificity apply, respectively, to just the positive samples and just the negative samples of a data set, they may be expected to stay approximately constant irrespective of the balance of positive and negative samples in the data set. In other words, they are properties of the classifier itself rather than its utility in the environment in which it is used. This is an important advantage if the test data set may not have a class balance that is representative of the true population. They do not capture some aspects of a classifier’s performance, however, that are important when considering using a classifier for a situation with a highly imbalanced pretest probability (discussed later).
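Continuing the counting sketch shown earlier, sensitivity and specificity follow directly from the 4 counts; the label arrays again are synthetic and illustrative.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

n_tp = np.sum((y_true == 1) & (y_pred == 1))
n_fn = np.sum((y_true == 1) & (y_pred == 0))
n_tn = np.sum((y_true == 0) & (y_pred == 0))
n_fp = np.sum((y_true == 0) & (y_pred == 1))

sensitivity = n_tp / (n_tp + n_fn)  # proportion of positives labeled positive
specificity = n_tn / (n_tn + n_fp)  # proportion of negatives labeled negative
print(sensitivity, specificity)     # 0.75 0.75 for this synthetic example
```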
Receiver Operating Characteristics and Area Under the Receiver Operating Characteristic Metric
Many classification models do not directly output a binary classification but rather a classification score that then is compared with a threshold to give a binarized prediction. When using such models, it is possible to choose the threshold value after the model has been trained to give the desired trade-off between sensitivity and specificity for a certain application. A receiver operating characteristic (ROC) aids in this process by plotting the sensitivity (TP rate) on the Y axis and the FP rate (which is equal to 1 minus the TN rate or, equivalently, 1 minus the specificity) on the X axis as the classification threshold is varied. By inspecting an ROC plot, it is possible to see how sensitivity and specificity jointly vary as the threshold is changed.
The area under the ROC (AUROC) summarizes the performance of the model at all possible thresholds. This important metric often is used to give a single number that captures a more complete picture of a classifier's performance than sensitivity and specificity at 1 particular threshold. This is important particularly when evaluating models that later may be used with different thresholds. The resulting metric is known as the AUROC score. It also commonly is referred to simply as the area under the curve (AUC), but the authors favor the more specific term, AUROC.
Every ROC passes through the lower left corner of the axes (0, 0) corresponding to no positive predictions and the upper right corner (1, 1) corresponding to no negative predictions and increases monotonically between these 2 points. A random classifier (one that predicts random labels for each sample) has an ROC that is a straight diagonal line between the lower left and upper right corners ( Fig. 1 ) and accordingly has an AUROC of 0.5. A perfect classifier (one for which there exists a threshold that splits the positive and negative samples perfectly) passes through the point of perfect separation of the positive and negative samples at the upper left corner of the axes (0, 1), where sensitivity and specificity are both 1. It therefore consists of a vertical line from the lower left to the upper left corner of the axes and then a horizontal line from the upper left to the upper right corner. This results in an AUROC score of 1. A good classifier has an ROC that comes close to the upper left corner of the axes and an AUROC of between 0.5 and 1.0. If a classifier is performing worse than random chance, it has an AUROC of below 0.5 and an ROC that approaches the lower right corner of the axes.
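The following minimal sketch computes an ROC and the AUROC from continuous classification scores using scikit-learn; the labels and scores are synthetic and illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic true labels and continuous classification scores (higher = more positive).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.9, 0.7])

# FP rate and TP rate (sensitivity) at every classification threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

auroc = roc_auc_score(y_true, y_score)
print(auroc)  # 0.8125 here; 1.0 for perfect separation, 0.5 for a random classifier
```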
Precision and Recall
Sensitivity, specificity, and AUROC scores are the set of metrics used most commonly for binary classification performance in medical AI. As alluded to previously, however, they do not capture important aspects of a classifier's utility in imbalanced data sets. Precision (also known as positive predictive value) quantifies the proportion of the classifier's positive predictions that are in fact positive. Recall quantifies the proportion of positive samples that are labeled correctly as positive by the classifier. Specifically, the 2 metrics are defined as follows: $\text{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}}$ and $\text{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}}$. Recall is the same as sensitivity, and it is purely a matter of convention that when presented alongside precision it typically is given this alternative name.
Like sensitivity and specificity, precision and recall lie in the range 0 to 1 and alternatively may be expressed as percentages. Unlike sensitivity/recall and specificity, precision depends on the class balance in the test data set. This has 2 important implications. First, the precision of a classifier measured on a test set is not generalizable if the proportion of positive and negative samples in that test set does not match the proportion in the population in which the classifier is intended to be used. Second, the precision of a random classifier on a given data set (the baseline precision) generally is not 0.5 but rather the proportion of positive samples in that data set. This can complicate the interpretation of the precision metric.
The importance of precision as a metric arises from the fact that in many medical applications, the balance of positive and negative samples is heavily in favor of the negative class (ie, there are many more negatives than positives). In such situations, a classifier can have both high sensitivity and high specificity but low precision (ie, most of its positive predictions are incorrect), meaning that its predictions are of limited use in practice. These situations may not be identified if sensitivity and specificity alone are considered. Although it may be an unusual situation in medical applications, it also is important to note that if instead the data set is heavily imbalanced in favor of the positive class, a classifier may simultaneously achieve high precision and recall even if its specificity is low (ie, it classifies most negative samples as positive). Consequently, a complete description of a classifier's performance ideally should include all 3 of these metrics (sensitivity/recall, specificity, and precision).
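The following sketch illustrates this point with a synthetic, heavily imbalanced test set in which a hypothetical classifier has a sensitivity of 0.9 and a specificity of 0.95 yet a precision of only approximately 0.15.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic, imbalanced test set: 1000 negatives and 10 positives.
y_true = np.array([0] * 1000 + [1] * 10)

# Hypothetical predictions: 50 of the 1000 negatives are false positives
# (specificity = 0.95) and 9 of the 10 positives are detected (sensitivity = 0.9).
y_pred = np.array([1] * 50 + [0] * 950 + [1] * 9 + [0] * 1)

print(recall_score(y_true, y_pred))     # 0.9  (high sensitivity/recall)
print(precision_score(y_true, y_pred))  # ~0.15 (most positive predictions are wrong)
```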
Describing a classifier purely by its precision and recall does not consider the number of TN samples, because $N_{TN}$ does not appear in the definition of either metric, as listed previously. This may be useful in certain situations in which the number of TNs is unimportant, difficult to determine, or even undefined (discussed later).
Precision-recall Curves and the Average Precision Metric
When working with models that output a classification score, the classification threshold can be adjusted to give the desired balance of precision and recall. The relationship between a classifier’s precision and recall as the classification threshold is varied may be visualized using a precision-recall curve (PRC) ( Fig. 2 ). A PRC plots the precision of the classifier on the Y axis and the recall of the classifier on the X axis.
At a low classification threshold, the recall tends to 1 and the precision tends to the proportion of positive samples in the data set (the class ratio). Therefore, on the right-hand side of the PRC, the curve tends toward the class ratio. A random classifier has a flat PRC fixed at a constant precision equal to the class ratio. A perfect classifier's PRC passes through the top right corner of the plot, representing a precision and recall of 1. A good classifier has a high precision across a range of recall values. Unlike ROCs, which increase monotonically, PRCs do not necessarily vary monotonically from left to right.
The area under the PRC gives a metric that quantifies the performance of the classifier over all possible thresholds. This metric may be referred to as the area under the PRC or, more commonly, as the average precision (AP), because it measures the precision of the classifier averaged over a range of recall values. The AP score lies between 0 and 1, where a perfect classifier has an AP of 1 and a random classifier has an AP equal to the class ratio. In practice, there are several different ways to calculate the AP metric, which are not discussed here.
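The following minimal sketch computes a PRC and the AP score with scikit-learn; as before, the labels and scores are synthetic and illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Synthetic true labels and continuous classification scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.9, 0.7])

# Precision and recall at every classification threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

ap = average_precision_score(y_true, y_score)
print(ap)  # 1.0 for a perfect classifier; the class ratio (here 0.5) for a random one
```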
F1 Score
The F1 score is a metric that combines precision and recall into a single value. It is defined as the harmonic mean of precision and recall and, therefore, always lies between the 2. Specifically, it is defined as $F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = \frac{2N_{TP}}{2N_{TP} + N_{FP} + N_{FN}}$. If a single metric to measure the performance of a classifier is required, the F1 score often is preferred to the simple accuracy metric. Like precision and recall, the F1 score does not consider the number of TNs. It may not be an appropriate metric when the precision and recall of the classifier are not equally important.
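The following short sketch computes the F1 score both from its definition and with scikit-learn, reusing the synthetic labels from the earlier counting example.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

p = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
r = recall_score(y_true, y_pred)     # 3 / (3 + 1) = 0.75
f1_manual = 2 * p * r / (p + r)      # harmonic mean of precision and recall

print(f1_manual, f1_score(y_true, y_pred))  # both 0.75 for this synthetic example
```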
Multiclass Classification
Multiclass classification problems are those in which each sample belongs to 1 of several finite, mutually exclusive classes, and the classifier predicts a single label for each sample.
Confusion Matrices
A confusion matrix ( Figs. 3–5 ) is a convenient way to visualize the performance of the classifier for all classes in a single figure. A confusion matrix is a table with 1 row and 1 column for each class. Each entry in the table represents the number of cases in which a sample with the row’s label was given the column’s label. Often, the elements of the matrix may be shaded according to the number of cases to aid interpretation.
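The following minimal sketch builds a confusion matrix for a hypothetical 3-class problem with scikit-learn; the class names and labels are illustrative only.

```python
from sklearn.metrics import confusion_matrix

# Illustrative 3-class example; rows = true class, columns = predicted class.
classes = ["normal", "pneumonia", "other"]
y_true = ["normal", "normal", "pneumonia", "pneumonia", "other", "other", "normal"]
y_pred = ["normal", "pneumonia", "pneumonia", "pneumonia", "other", "normal", "normal"]

cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)
# [[2 1 0]
#  [0 2 0]
#  [1 0 1]]
# sklearn.metrics.ConfusionMatrixDisplay(cm, display_labels=classes).plot()
# renders the shaded visualization described above.
```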