3

Diagnostic Tests


George Tomlinson, Gerald Lebovic, Connie Marras, and Andrea S. Doria


Learning Objectives


• To apply methods for estimation and hypothesis testing for proportions to the binary diagnostic test characteristics of sensitivity, specificity, and positive and negative predictive values.


• To apply methods for estimation of ratios of proportions to the estimation of positive and negative likelihood ratios.


• To construct receiver operating characteristic (ROC) curves for a test that has more than two diagnostic levels and estimate the area under the curve (AUC).


This chapter introduces key concepts involved in a study of a diagnostic test: sensitivity, specificity, and positive and negative predictive values of binary diagnostic tests, likelihood ratios, and the ROC curve. As we introduce these concepts, we demonstrate statistical methods for estimation of these quantities and their confidence intervals, and for hypothesis testing.


Summary Measures of Diagnostic Accuracy


To streamline the presentation of materials here, we need to introduce some terminology. First, define D as a binary variable representing the true disease status, which takes the value 1 if someone has the disease and 0 if not. This requires the existence of a reference standard that correctly identifies patients as having the disease or not.1,2,3 A reference standard is the test (or group of tests) best representing the true disease state. A classic example of a reference standard is the result of a biopsy with histologic examination of a pulmonary nodule seen on a computed tomography (CT) scan of the chest, which classifies it as benign or malignant. Another example of a reference standard is sufficiently long clinical follow-up of patients with clinical suspicion of appendicitis to determine the eventual development of appendicitis or freedom from this condition. In many cases, however, a reference standard is not available as it is either not ethical or not feasible to carry out the often invasive procedures needed to obtain certainty about the presence of disease. This leads to the situation where the best currently available diagnostic test (or combination of tests) is used as a de facto reference standard. For example, magnetic resonance imaging (MRI) or CT imaging may be used as a reference standard for a proposed diagnostic technique based on ultrasound.


Next, we define T as a binary variable that is the result of a diagnostic test and that takes the value 1 for a positive test and 0 for a negative test. In the assessment of medical images, the classification of an image as T = 1 versus T = 0 relies on the synthesis of information across the image, or even across several images or different imaging modalities. This information can often be thought of as taking on some value X, where high values are more suggestive of disease and lower values are more indicative of no disease. The variable X may be a measurable quantity, such as washout time or a quantification of the degree of enhancement. But in many settings, X is not measured directly from the image; it represents, for example, a degree of suspicion, a qualitative assessment of the level of enhancement, or a combination of characteristics associated with disease (opacity, size, and calcification, for example, in the case of MRI for detection of breast cancer). Whether X is quantified or X is a general measure of the amount of evidence for disease, if X is used to produce a binary diagnosis of T = 1 or T = 0, it must be compared to a threshold, which we call k. This threshold is the minimum level of evidence that the reader of the images requires to make a diagnosis of disease. This contrasts with many medical diagnostic tests based on blood levels, where a measured value is compared to a strict numerical threshold; for example, a serum creatinine above 1.5 mg/dL (133 µmol/L) may classify a patient having an MRI examination as being at high risk for gadolinium-induced nephropathy.4 Values of X at or above the threshold result in a diagnosis of disease, T = 1; and values of X below the threshold result in a patient being diagnosed as disease free, T = 0. That is, if X ≥ k then T = 1, and if X < k then T = 0.
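As a small illustration of this thresholding rule, the R sketch below applies a threshold k to evidence scores X; both the scores and the threshold value are invented for illustration:

# Hypothetical evidence-of-disease scores for six patients (scale of 0 to 100)
X <- c(12, 85, 47, 90, 33, 61)
# Hypothetical threshold: the minimum evidence required for a positive diagnosis
k <- 50
# Apply the decision rule: T = 1 if X >= k, T = 0 otherwise
test_result <- as.integer(X >= k)
test_result
# [1] 0 1 0 1 0 1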


A person who has undergone both a reference standard assessment and the diagnostic test can be classified in one of four ways. Table 3.1 summarizes data from a binary diagnostic test study and can be used to estimate all the relevant quantities.


In this study design, each subject with disease is considered to have the same probability of being diagnosed positive. This probability of a positive diagnostic test, given that the subject has disease, is called the sensitivity (se) or true-positive fraction:


sensitivity = P(T = 1 | D = 1)


This is estimated from Table 3.1 as ŝe = a/(a + c), the proportion of all diseased cases that have a positive test. The circumflex, or “hat,” over se indicates that it is not the actual sensitivity, but an estimate of sensitivity.


Table 3.1 Classification of subjects by diagnostic test result (T) and true disease status (D)

                     D = 1 (disease)       D = 0 (no disease)     Total
T = 1 (positive)     a (true positive)     b (false positive)     a + b
T = 0 (negative)     c (false negative)    d (true negative)      c + d
Total                a + c                 b + d                  a + b + c + d


All subjects without disease on the reference standard are considered to have the same probability of obtaining a negative diagnostic test. This probability is called the specificity (sp), or true-negative fraction.


specificity = P(T = 0 | D = 0)


and is estimated from Table 3.1 as ŝp = d/(b + d).


The proportions of patients wrongly classified by the diagnostic test do not have special names, but are known simply as the false-positive fraction (for those healthy patients with a positive test) and the false-negative fraction (for the diseased patients with a negative test). These proportions are estimated from Table 3.1 as FP = b/(b + d) and FN = c/(a + c). Alternatively, they are found as the complement of specificity (FP = 1 − specificity) and sensitivity (FN = 1 − sensitivity).


The sensitivity of a test answers the question: Given that a subject has the disease (D = 1), what is the probability that the test will pick up the disease? Similarly, the specificity of a test answers the question: Given that a subject does not have the disease (D = 0), what is the probability that the test will be negative? These quantities examine probabilities by doing what is called “conditioning” on the true state of the disease. Whereas the true disease state is often (but not always) known in a study of a diagnostic test, in the clinical setting where the diagnostic test will be used, the true disease state will not be known. In fact, the reason for administering a diagnostic test is to obtain an improved estimate of the probability that a patient has the disease. In the clinical setting, the questions of interest are conditional on the result of the diagnostic test. Given that a diagnostic test is positive, the probability that a patient has disease is called the positive predictive value (PPV) and can be written as P (D = 1 | T = 1). When the diagnostic test is negative, the probability that a patient is disease-free is called the negative predictive value (NPV) and can be written as P (D = 0 | T = 0). In Table 3.1, these quantities are estimated simply as the proportion of patients with a positive test who have the disease, PPV = a/(a + b), and the proportion of patients with a negative test who do not have disease, NPV = d/(c + d).
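All six proportions can be computed from the four cell counts of Table 3.1 in a few lines of R. The sketch below is our own; the function name and the example counts are invented for illustration:

# Estimate all summary proportions from the cells of Table 3.1
# tp = a, fp = b, fn = c, tn = d
diagnostic_summary <- function(tp, fp, fn, tn) {
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (fp + tn),
    FP = fp / (fp + tn),
    FN = fn / (tp + fn),
    PPV = tp / (tp + fp),
    NPV = tn / (fn + tn))
}
# Example with illustrative counts
round(diagnostic_summary(tp = 86, fp = 9, fn = 13, tn = 155), 3)
# sensitivity 0.869, specificity 0.945, FP 0.055, FN 0.131, PPV 0.905, NPV 0.923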


Chapter 13 shows how to test a hypothesis and how to calculate a confidence interval—a range of plausible values—for a proportion, an approach that can be used for all of the summary proportions from Table 3.1. Keep in mind that all these equations refer to proportions, not percentages. Once all the calculations have been done on proportions, the resulting values may be presented as percentages.


Example: Ultrasound for Assessment of Appendicitis


Schuh et al.5 examined the use of ultrasound scanning as a diagnostic tool to be used in the emergency department in children who have suspected appendicitis. The paper classifies the results of the ultrasounds into three broad categories: positive (95 children), negative (76 children), and equivocal (92 children). The positive and negative categories refer to cases where the sonographer was able to visualize the appendix and classified the test result as either positive or negative, respectively, for appendicitis. Those examinations that were unable to properly visualize the appendix or had nondiagnostic features were classified as equivocal ultrasounds. Table 3.2 summarizes the results for the positive, equivocal, and negative groups.


Table 3.2 Ultrasound results and final appendicitis status (Schuh et al.5)

Ultrasound result    Appendicitis    No appendicitis    Total
Positive             86              9                  95
Equivocal            13              79                 92
Negative             0               76                 76
Total                99              164                263


Table 3.3 Equivocal ultrasounds grouped with the positive results

                                 Appendicitis    No appendicitis    Total
T = 1 (positive or equivocal)    99              88                 187
T = 0 (negative)                 0               76                 76
Total                            99              164                263


If we try to apply the definitions developed in Table 3.1, we are quickly faced with a decision: what to do with the equivocal cases? We need a binary classification of T = 1 or T = 0 for each case. In the Schuh et al. paper,5 equivocal scans were classified two ways, as either false-positive (the 79 scans where the child did not have appendicitis) or false-negative (the 13 scans where the child did have appendicitis), but here, we will take a different approach. If equivocal patients are going to be treated as if the ultrasound finding was negative and the patient was sent home, then they should be classified with the T = 0 row of Table 3.2. If the patients are going to form a “possible or probable appendicitis” cohort that was sent for further treatment or follow-up, then they are better classified with the T = 1 patients. The latter situation is more appropriate here, so we collapse Table 3.2 to form the 2 × 2 table shown in Table 3.3.


We can estimate specificity as 76/164 = 0.463. Sensitivity is estimated as 99/99 = 1.0.


Below, we use the prop.test function in R to compute 95% confidence intervals for sensitivity and specificity.

> prop.test(99, 99, correct=F)

[output snipped]

95 percent confidence interval:
0.963 1.000

sample estimates:
p
1


Sensitivity values as low as 0.963 lie in the confidence interval, even though we observed perfect sensitivity. Notice that the confidence interval is not symmetric.
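The prop.test function uses a large-sample (Wilson score) method. When the estimate sits at a boundary, as it does here, an exact binomial interval is a common alternative; as a sketch, R's built-in binom.test gives a very similar result in this case:

> binom.test(99, 99)
# The exact (Clopper-Pearson) 95% confidence interval also has an upper
# limit of 1.000, with a lower limit close to the Wilson value of 0.963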

> prop.test(76, 164, correct=F)

[output snipped]

data: 76 out of 164, null probability 0.5
X-squared = 0.878, df = 1, p-value = 0.3487
alternative hypothesis: true p is not equal to 0.5

95 percent confidence interval:
0.389 0.540

sample estimates:
p
0.463


A 95% confidence interval is 0.389 to 0.540, which is approximately symmetrical around the estimate.


The function prop.test also gives a p value for the test of the hypothesis that the proportion is equal to 0.5; in this case, the hypothesis cannot be rejected, since p = 0.349 is higher than the usual significance level of 0.05 (Chapter 13). Since the proportion 0.5 lies within the 95% confidence interval (0.389 to 0.540), we already know that we will not reject the hypothesis that specificity is 50% at an α of 0.05, but the confidence interval does not give us a p value. For measures of diagnostic performance such as sensitivity and specificity, it is more useful to test whether the value is different from some level that would be useful in clinical practice than to test this default value of 50%; specificity or sensitivity of 50% is generally below what is acceptable. We can use the following R code to test whether we can reject the hypothesis that the true value of specificity is 60%, an acceptably high value.

> prop.test(76, 164, p=0.6, correct=F)

[output snipped]

data: 76 out of 164, null probability 0.6
X-squared = 12.7, df = 1, p-value = 0.0003564
alternative hypothesis: true p is not equal to 0.6

95 percent confidence interval:
0.389 0.540

sample estimates:
p
0.463


The estimate and confidence interval are the same as in the analysis above because we are using the same observed data. But we are now testing a different hypothesis, so we get a different p value; this result tells us that such a low observed specificity would be unusual (p = 0.0003564) if the true specificity were as high as 60%, so we reject that hypothesis. Estimation and construction of confidence intervals for the remaining proportions from the 2 × 2 table (FP, FN, PPV, NPV) can all use prop.test in R, whether or not there are zero counts.


The occurrence of a zero count, as we saw for the number of false-positives in Table 3.3, is common enough that it deserves the attention we gave it above. However, it does somewhat complicate the presentation of some additional important concepts in the analysis of a binary diagnostic test, so we continue this section with the data in Table 3.4, formed by grouping the equivocals and negatives in Table 3.2 into an “ultrasound negative” group. This grouping is appropriate if those with a positive ultrasound undergo some invasive procedure that should not be used on equivocal cases.


Sensitivity and specificity in Table 3.4 are estimated to be 86.9% and 94.5%. In this sample, 86 of the 95 patients with a positive ultrasound are cases of appendicitis, so the PPV is 90.5%. Among those with a negative ultrasound, 155/168 do not have appendicitis, so the NPV is 92.3%. If a positive ultrasound was used to decide on further treatment, 90.5% of the treated patients in this sample would have appendicitis and 92.3% of untreated patients would be free of appendicitis. These numbers suggest that ultrasound can be a useful diagnostic tool to direct potential cases of appendicitis to the appropriate clinical management path. Can the readers of the Schuh et al. study5 apply these measures of diagnostic accuracy to their own emergency rooms (ERs)? This study took place in the ER of a pediatric hospital downtown in a major metropolitan area; it is possible that the prevalence of appendicitis there is not typical of other settings where ultrasound might be used to diagnose appendicitis. Would ultrasound be as useful in a setting with a lower prevalence? Let’s imagine a study where the number of children without appendicitis is 10 times as high—they have some other cause for their symptoms. These hypothetical numbers are presented in Table 3.5.


Table 3.4 Equivocal ultrasounds grouped with the negative results

                                 Appendicitis    No appendicitis    Total
T = 1 (positive)                 86              9                  95
T = 0 (negative or equivocal)    13              155                168
Total                            99              164                263


Clearly in Table 3.5, sensitivity is still 86.9% and it is easily verified that specificity is still 94.5%; the numerator and denominator of the estimated specificity are both 10 times as large as in Table 3.4. But now, only 86 of the 176 patients who are positive on ultrasound are cases of appendicitis, so PPV = 48.9%. Someone who applied the PPV from this study to a low-prevalence setting would be sending more than half of the children with a positive ultrasound for unnecessary treatment. By contrast, 1550 of 1563 patients with a negative ultrasound are free of appendicitis, so NPV = 99.2%, higher still than the value in Schuh et al.5


Table 3.5 Hypothetical results with 10 times as many children without appendicitis

                                 Appendicitis    No appendicitis    Total
T = 1 (positive)                 86              90                 176
T = 0 (negative or equivocal)    13              1550               1563
Total                            99              1640               1739
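These numbers can be verified with a few lines of R (a quick sketch; the variable names are ours):

# Table 3.5 keeps the diseased column of Table 3.4 and multiplies the
# nondiseased column by 10
tp <- 86; fp <- 9 * 10; fn <- 13; tn <- 155 * 10
tp / (tp + fn)   # sensitivity unchanged: 86/99 = 0.869
tn / (fp + tn)   # specificity unchanged: 1550/1640 = 0.945
tp / (tp + fp)   # PPV falls to 86/176 = 0.489
tn / (fn + tn)   # NPV rises to 1550/1563 = 0.992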


While it is generally felt that the sensitivity and specificity of a diagnostic test are properties of the test (which includes contributions from the ultrasound device itself, the sonographer, and the reader of the diagnostic test in the case of ultrasound), the positive and negative predictive values are also dependent on the prevalence of disease. How can the results from a study carried out in a setting with one value for prevalence be applied to a setting with a different value for prevalence? This situation exists in primary care: the prevalence of most diseases is low, but diagnostic tests that will be applied there may have been developed and evaluated in a setting with a high disease prevalence.


The PPV in a new setting with prevalence P can be related to sensitivity (se) and specificity (sp) through the following formula:



PPV = (se × P) / (se × P + FP × (1 − P))

The numerator represents the probability of a diseased patient testing positive: it is the proportion of patients with disease (P) times the proportion of those who test positive (se). The denominator is the probability of a positive test, irrespective of true disease state: the probability of a diseased patient testing positive (se × P) plus the probability of a nondiseased patient testing positive (FP × (1 − P)). Similarly, the negative predictive value can be calculated as:



NPV = (sp × (1 − P)) / (sp × (1 − P) + FN × P)
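These two formulas translate directly into R. Below is a minimal sketch of how a function like the chapter's PredictiveValues (introduced next) might be written; the official implementation in the online materials may differ:

# Sketch: PPV and NPV from sensitivity (se), specificity (sp), and prevalence (P)
# P may be a single value or a vector of prevalences
PredictiveValues <- function(se, sp, P) {
  PPV <- (se * P) / (se * P + (1 - sp) * (1 - P))
  NPV <- (sp * (1 - P)) / (sp * (1 - P) + (1 - se) * P)
  data.frame(Pretest = P, PPV = round(PPV, 3), NPV = round(NPV, 3))
}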

Our online materials supply an R function PredictiveValues that computes the PPV and NPV based on prevalence, sensitivity, and specificity. To apply the diagnostic accuracy findings from Schuh et al.5 to settings with prevalences ranging from 10% to 60%, use this function as shown below:

> PredictiveValues(se=0.869, sp=0.945, P=c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

  Pretest   PPV   NPV
1     0.1 0.637 0.985
2     0.2 0.798 0.967
3     0.3 0.871 0.944
4     0.4 0.913 0.915
5     0.5 0.940 0.878
6     0.6 0.960 0.828


The output above illustrates an important concept: for any fixed values of sensitivity and specificity, higher prevalence results in a higher PPV and in a lower NPV. We suggest that the value of a diagnostic test in a new setting be evaluated by computing predictive values for a plausible range of prevalences.


Knowledge of the sensitivity and specificity of a test will help clinicians determine whether the test is most useful in ruling in or ruling out a disease.3,6 For example, a test that is very sensitive will rarely miss people with the disease and a negative result obtained from a test with high sensitivity will therefore be useful in ruling out disease (although a positive result will not necessarily rule in the disease). Conversely, a highly specific test will rarely misclassify people without the disease as diseased and a positive result will be useful in ruling in a disease (although a negative result in this situation does not necessarily rule out the disease).


Likelihood Ratios


The previous section shows how to use the two summary statistics of sensitivity and specificity from one setting to calculate PPV and NPV in a setting with a different prevalence. It is also possible to summarize the value of a diagnostic test using two different summary statistics, the positive and negative likelihood ratios.7 The likelihood ratio of a positive test is defined as:



LR+ = P(T = 1 | D = 1) / P(T = 1 | D = 0) = se / (1 − sp)

This ratio measures how much more likely a positive test is in a diseased patient than it is in a healthy patient. As we show here, it is also the factor by which the pretest odds of disease increase in a patient who has a positive result on the diagnostic test.


In a similar fashion, we can define the likelihood ratio for a negative test as the ratio of the probabilities of a negative test in diseased and in nondiseased patients. This ratio will generally be less than one; it is the factor by which the pretest odds of disease decrease in a patient who has a negative result on the diagnostic test.



LR− = P(T = 0 | D = 1) / P(T = 0 | D = 0) = (1 − se) / sp

If a diagnostic test has a positive likelihood ratio ≥ 10, meaning that a positive test increases a person’s odds of disease by a factor of 10 or more compared to the pretest odds, this is considered quite an informative test. Such a positive test can be very good at ruling the disease in. The higher the positive likelihood ratio, the larger the increase in the probability that the person has the disease with a positive test. Likewise, if a test has a negative likelihood ratio ≤ 0.1, it is also informative and can be very good at ruling the disease out. The lower the negative LR, the larger the increase in the probability that the person does not have the disease with a negative test.


In Table 3.4, the positive likelihood ratio is LR+ = 0.869/(1 − 0.945) = 15.8 and the negative likelihood ratio is LR− = (1 − 0.869)/0.945 = 0.139. The odds of disease increase almost 16-fold after a positive test and decrease by a factor of around 7 after a negative test. As with sensitivity and specificity, we do not learn anything about the absolute odds of disease from the LR+ and LR−, only how much the odds change.
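These values can be checked directly from the Table 3.4 counts (a quick sketch; the variable names are ours):

se <- 86 / 99    # sensitivity from Table 3.4
sp <- 155 / 164  # specificity from Table 3.4
se / (1 - sp)    # LR+ = 15.8
(1 - se) / sp    # LR- = 0.139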


To better understand the use of likelihood ratios, it is necessary to be comfortable working with odds. The odds are the ratio of the probability that an event occurs to the probability that it does not.8 If P is the probability that a patient has appendicitis, then the odds of appendicitis are odds = P/(1 − P). The equation that converts an odds into a probability is P = odds/(1 + odds). Odds can be as low as zero but have no upper bound. For example, if the probability of an event is 0.10 (10%), then the odds are 0.10/(1 − 0.10) = 0.10/0.90 = 1/9 = 0.111. If the probability of an event is 99%, then the odds of the event are 0.99/0.01 = 99, sometimes expressed as 99 to 1. It is important to keep in mind that doubling the odds does not mean doubling the probability. In the case where the odds are 99 to 1, doubling the odds means they go from 99 to 198, and the resultant probability is 198/(198 + 1) = 0.995; we doubled the odds but increased the probability by only one half of one percent.
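Because these conversions come up repeatedly, it may help to see them as code; a minimal sketch (function names are ours):

# Convert between probability and odds
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds(0.10)     # 0.111
odds_to_prob(2 * 99)   # 0.995: doubling odds of 99 raises the probability
                       # from 0.99 by only half of one percent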


Before a test is performed, one can estimate a pretest probability, P, of the disease using all available information on the patient (e.g., age, gender, prevalence of the disease, previous diagnostic tests, etc.). The pretest odds, Oddsprior, can be found as Oddsprior = P/(1 − P).


The odds of disease after a positive test, Odds+, can be calculated as:



Odds+ = Oddsprior × LR+

Likewise, the odds of disease after a negative test are equal to the product of the pretest odds of disease and the negative likelihood ratio:



Odds− = Oddsprior × LR−

For clinical use, where probability is a more familiar expression of risk than odds, these are usually converted back to posttest probabilities, PPV and NPV. We shall check that the use of the LR+ and LR− from Schuh et al.5 gives us the same PPV and NPV as the direct use of sensitivity and specificity.


Using LR+ = 15.8 and a pretest probability of 0.2, the pretest odds are 0.2/(1 − 0.2) = 0.25 and the posttest odds are 0.25 × 15.8 = 3.95; we can then calculate PPV = odds/(1 + odds) = 3.95/(1 + 3.95) = 0.798, the same value we calculated above. Below, we use our R function PredictiveValuesLR to calculate PPV and NPV from prevalence and the LR+ and LR− from the Schuh et al.5 data:

> PredictiveValuesLR(LRp = 15.8, LRn = 0.139, P = 0.2)

  Pretest   PPV   NPV
      0.2 0.798 0.967
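The source of PredictiveValuesLR is not shown in the chapter; a minimal sketch consistent with the posttest-odds formulas above might look like this (our own illustration, not the official implementation from the online materials):

PredictiveValuesLR <- function(LRp, LRn, P) {
  pre_odds <- P / (1 - P)                        # pretest odds from pretest probability
  post_odds_pos <- pre_odds * LRp                # posttest odds after a positive test
  post_odds_neg <- pre_odds * LRn                # posttest odds after a negative test
  PPV <- post_odds_pos / (1 + post_odds_pos)     # back to a probability
  NPV <- 1 - post_odds_neg / (1 + post_odds_neg) # probability of no disease
  data.frame(Pretest = P, PPV = round(PPV, 3), NPV = round(NPV, 3))
}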


Nomograms


The formulas above allow accurate calculation of posttest probabilities from the relevant quantities. But in many cases, the assessment of pretest probability gives only an approximate value, and a similarly approximate value for the posttest probability will suffice for clinical decision making. The relationship between the pretest probability, the likelihood ratio, and the posttest probability can be represented in a nomogram, a graphical aid that is frequently used by physicians to obtain approximate posttest probabilities from pretest probabilities and likelihood ratios (Fig. 3.1). To use the nomogram, draw a straight line from the pretest probability of disease on the left axis through the value corresponding to the likelihood ratio of the diagnostic test and over to the right axis. The value where the line crosses the right axis is the posttest probability of disease. The left panel of Fig. 3.1 shows a nomogram being used to calculate posttest probabilities for a patient who, after a clinical examination, had an 85% pretest probability of appendicitis. Suppose that the likelihood ratio is 15.8 for a positive test and 0.139 for a negative test. A line drawn from 85% on the pretest axis through a likelihood ratio of 15.8 shows that the posttest probability is almost 99% after a positive ultrasound and around 45% after a negative ultrasound. The right panel of Fig. 3.1 shows the use of the nomogram for the same LRs when the pretest probability is 20%. Notice that the numerical value on the LR axis can usually be located only approximately and that the posttest probabilities can be read off only approximately; in many cases, this degree of approximation is good enough in a clinical setting. If more precision is desired, the formulas should be used.
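For example, the exact posttest probabilities read approximately from the left panel of Fig. 3.1 can be computed in a few lines of R (a sketch using the formulas from the previous section; the variable names are ours):

pre_odds <- 0.85 / (1 - 0.85)   # pretest odds for an 85% pretest probability
pos <- pre_odds * 15.8          # posttest odds after a positive ultrasound
pos / (1 + pos)                 # 0.989, close to the 99% read from the nomogram
neg <- pre_odds * 0.139         # posttest odds after a negative ultrasound
neg / (1 + neg)                 # 0.441, close to the 45% read from the nomogram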

