14.1 Introduction
Perception experiments are a vital part of medical imaging research. Such experiments involve a group of observers, often clinicians, who provide interpretations of a set of medical images with known characteristics. The results are then analyzed to infer information about the performance of the observers, the utility of the clinical system, and/or the accuracy with which a clinical outcome is rendered. Designing such experiments involves a multitude of planning and logistical decisions about how to conduct the experiment, including defining the objective of the experiment, choosing appropriate methodology, deciding on target samples, and planning for experimental implementation, data collection, and data processing. Proper design ensures that the desired answers are obtained, statistical power and precision are achieved, potential biases and confounding factors are avoided, cost is minimized, and the experimental results are accurate, reproducible, definitive, and generalizable.
The value of devoted effort at the outset for proper design of a perception experiment cannot be overstated; even powerful statistical analysis cannot salvage a poorly designed study. For specific design issues, there are numerous resources that one may turn to; however, these resources are scattered throughout the literature. The purpose of this chapter is to detail major design considerations for planning a perception experiment and to discuss basic issues associated with such designs.
14.2 Objective
14.2.1 Objective of the Experiment
Most perception experiments aim to assess and compare the performance of images in rendering a clinical decision. Some do so in the context of technology assessment of imaging modalities, machines, algorithms, or displays. Some perception experiments aim to compare different types of observers in performing imaging tasks on a single or multiple imaging systems. In doing so, a perception experiment incorporates the observer as part of the imaging system, recognizing that the observer is an indispensable element of the “system” as a whole. In other words, in the context of this chapter, an imaging system includes the whole chain of imaging hardware, acquisition parameters, display hardware and software, and observers.
Aiming toward its goal, a perception experiment should be anchored to a hypothesis. Example hypotheses might fall into broad categories:
1. an imaging system or technique provides adequate diagnostic value;
2. an imaging system or technique offers an improvement over another one (e.g., comparing the detection of lung nodules with and without computer-aided diagnosis input; comparing the efficiency of radiology residents against experienced radiologists for a given diagnostic task);
3. an imaging system or technique offers performance equivalent to that of another one (e.g., comparing digital and analogue mammography in terms of diagnostic accuracy; determining whether the dose for a particular imaging exam can be reduced without affecting diagnosis).
A hypothesis of an experiment may not fall into these categories, but regardless, a hypothesis needs to be explicitly articulated at the outset if one is to achieve a definitive outcome from the effort. In targeting its hypothesis, an experiment pursues specific endpoints that may fall into one of these categories:
1. studying imaging systems in terms of detection accuracy (e.g., the detection of lung nodules in chest computed tomography (CT) exams);
2. studying imaging systems in terms of classification accuracy (e.g., distinguishing benign masses from malignant masses, assessing the volume of a tumor);
3. assessing the subjective preference of observers for imaging systems (e.g., comparing the perceptual acceptability of the appearance of images processed by two different image postprocessing algorithms);
4. studying imaging systems in terms of the efficiency or time required to perform an observer task (e.g., comparing the speed of reading images on two different display devices);
5. assessing observer behaviors for a given imaging system (e.g., assessing variability for and among observers, satisfaction of search, memory effect, visual search patterns, and observer fatigue in reading a particular type of image).
Studies in the first and the second endpoint categories above provide objective results pertaining to diagnostic accuracy. Therefore, the majority of perception experiments fall into those categories.
Studies in the third category fall short of providing fully objective results in terms of diagnostic outcome. They are often an easy choice when it comes to “quick and dirty” decisions about what imaging methods are preferred, using a limited number of cases and observers and preferential aesthetic criteria. However, provided that they are designed with meaningful hypotheses and statistical rigor, they offer immense value in sorting through many competing image quality factors and perceptual preferences (Samei et al., 2014).
Studies in the fourth and fifth categories focus on the psychophysical and cognitive attributes of image interpretation, and not so much on the clinical evaluation of imaging methods. They encompass the majority of the research that is highlighted in this book. Interested readers are referred to relevant chapters in parts I and II of the book for in-depth discussions on observer behaviors. This chapter aims to focus on the more commonly performed studies of categories 1 and 2.
Though it may seem too obvious to state, we emphasize that it is paramount that the objective of the study be explicitly defined and documented at the outset of a perception experiment. That step alone provides the most important guideline for answering study design questions.
14.2.2 Phased Approach
For general diagnostic procedures, including medical imaging, it is often most efficient to adopt a phased approach in which the overall objective of the study is achieved in multiple sequential phases (Zhou et al., 2002). In that way, it is possible to refine the study design along the way and avoid “costly” mistakes of conducting large studies with premature designs.
A study may start with a pilot or exploratory phase in which a limited number of images and observers are employed. Such a study, while lacking significant statistical power, may provide the data needed to assess the magnitude of influencing effects and to “test-drive” the study design. The study may then proceed to a comprehensive “laboratory” phase in which a larger number of images and observers are used but the conditions are still kept fairly tractable to minimize variability. In the third and most comprehensive (or sometimes clinical) phase, the imaging systems are tested under a more realistic clinical scenario with added images and observers to fully test the hypothesis in the most clinically relevant condition. Different phases of a perception experiment may have slightly different objectives, all aiming toward a broader overall objective.
14.3 Choice of Methodology
The methodology is an important choice to be made at the outset of an experiment. For studies focused on assessing diagnostic accuracy (categories 1 and 2 defined above), two types of methods are generally used: receiver operating characteristic (ROC) and alternative forced choice (AFC). It is possible for a study design not to conform to these two methods, but such approaches are rare and would require more “customized” statistical analyses.
Both types of method require a sample of normal (negative) cases and a sample of abnormal (positive) cases. A case can be composed of a single image of a patient (e.g., a patient’s mammogram) or a set of images of a patient (e.g., a patient’s chest CT examination). The positive or negative (i.e., “abnormal” or “normal”) condition is defined relative to the clinical task. For example, if the clinical task is the detection of lung nodules, a case with no nodule present is a negative case, whereas a case with a nodule present is a positive case. If the task is the identification of malignant lesions from benign ones, a case with a malignant lesion is positive, while one with a benign lesion is negative. Both ROC and AFC methods aim to measure the ability of an imaging system to discriminate between positive and negative conditions.
In an ROC experiment, an observer reviews each case and provides a confidence rating regarding “normal” or “abnormal” condition of the case for a specific clinical task. The confidence rating is given on either a categorical scale (e.g., 1: definitely normal; 2: probably normal; 3: unsure; 4: probably abnormal; 5: definitely abnormal) or a continuous probability scale (e.g., between 0% and 100%, with 0% being definitely normal and 100% being definitely abnormal). The rating data are assumed to come from two underlying distributions: one for normal cases and the other for abnormal cases (Figure 14.1).
Figure 14.1 In a receiver operating characteristic experiment, the rating data are assumed to come from two underlying distributions: one for normal cases and the other for abnormal cases.
A series of decision thresholds are then placed on the rating scale (Figure 14.1). Each decision threshold represents a possible critical level at which the observer (e.g., a radiologist) operates in his/her daily practice: if the rating assigned to a case is above the critical level, the case is diagnosed as an abnormal case, whereas a case with a rating below the critical level is diagnosed as a normal case. In other words, the decision threshold reflects how strict or lax the observer is when deciding whether a case is normal or abnormal.
The application of multiple decision thresholds in ROC analysis recognizes the fact that the strictness of an observer may vary from case to case and from day to day. Changing the decision threshold changes both an imaging system’s sensitivity (i.e., the probability that an actually abnormal case will be correctly classified as “abnormal” with regard to the given clinical task) and the imaging system’s specificity (i.e., the probability that an actually normal case will be correctly classified as “normal”). By measuring sensitivity and specificity at all possible decision thresholds of an observer, one can obtain an ROC curve, which is a plot of sensitivity versus one minus specificity (Figure 14.2). The diagnostic accuracy of the imaging system is then characterized by the area under the ROC curve (AUC), an aggregation of sensitivity over all specificities. As such, AUC describes the intrinsic accuracy of the imaging system, free from the influence of decision threshold.
Figure 14.2 A receiver operating characteristic curve is a plot of sensitivity versus one minus specificity.
In addition to the total area under the ROC curve, the partial area under the ROC curve (pAUC), defined as the area between two specificities or two sensitivities, has been used to describe diagnostic accuracy over a range of specificities (sensitivities) relevant to a particular clinical application. Furthermore, the ROC method has been extended to multiple-choice paradigms, in which the observer’s rating reflects his/her confidence regarding not just the usual two-way classification of “normal” versus “abnormal” but a more sophisticated multiclass assignment, such as different classes of lesion categories (e.g., LUNG-RADS 1, 2, 3, …). For a comprehensive review of the ROC method, the statistical tests for evaluating difference/equivalence between ROC results, and software available for ROC analysis, the reader is referred to Chapter 15. As we will discuss below, most perception experiments require more than one observer/reader. Special methods have been developed for multireader multicase ROC data and they are the focus of Chapter 16.
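As a concrete illustration of these quantities, the following minimal Python sketch simulates rating data from hypothetical binormal distributions, sweeps decision thresholds to trace an empirical ROC curve, and computes the AUC and a pAUC by trapezoidal integration; the distribution parameters and sample sizes are assumed values for demonstration only.

```python
# A minimal sketch of ROC analysis on rating data, assuming hypothetical
# binormal rating distributions (all parameter values are illustrative).
import numpy as np

rng = np.random.default_rng(42)

# Simulated confidence ratings on a continuous 0-100% scale:
# normal cases ~ N(40, 12), abnormal cases ~ N(60, 12) (assumed values).
normal = rng.normal(40, 12, size=500)
abnormal = rng.normal(60, 12, size=500)

# Sweep decision thresholds over the full rating range.
thresholds = np.linspace(0, 100, 201)
sens = np.array([(abnormal >= t).mean() for t in thresholds])  # sensitivity
fpf = np.array([(normal >= t).mean() for t in thresholds])     # 1 - specificity

# Empirical AUC by trapezoidal integration of the ROC curve
# (sorted by false-positive fraction).
order = np.argsort(fpf)
auc = np.trapz(sens[order], fpf[order])

# Partial AUC over a high-specificity range, e.g., specificity 0.8-1.0
# (i.e., false-positive fraction 0.0-0.2).
mask = fpf[order] <= 0.2
pauc = np.trapz(sens[order][mask], fpf[order][mask])

print(f"AUC = {auc:.3f}")
print(f"pAUC (FPF 0.0-0.2) = {pauc:.3f}")
```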
In conventional ROC, the location of the abnormality within the image is not of primary relevance, an assumption often not reflecting clinical reality. To address this limitation, a number of ROC variants have been devised, including region of interest-based ROC (ROIROC), localization ROC (LROC), free-response ROC (FROC), and alternative free-response ROC (AFROC). ROIROC divides each image into multiple zones, each of which is rated as a separate image. In LROC, one rating–location pair is given to each image, while FROC and AFROC allow multiple rating–location pairs per image. Each ROC variant has its own particularities and limitations; some require statistical analyses that are less standardized.
As an alternative to ROC analysis, some studies have used the AFC methodology, most commonly in the form of the two-alternative forced choice (2AFC). In an AFC experiment, the observer reviews two or more independent images simultaneously (Figure 14.3). Exactly one of the images is actually abnormal with respect to a given clinical task. The observer is asked to indicate which image is abnormal. The data are analyzed to assess the percentage of correct decisions. It has been shown that the percentage correct in 2AFC experiments equals the AUC in the ROC method (Bamber, 1975; Green and Swets, 1966; Hanley and McNeil, 1982).
Figure 14.3 An example image panel used in a four-alternative forced-choice experiment. Exactly one of the images contains a lung nodule (arrow). The observer is asked to identify which image has a nodule.
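The equivalence between 2AFC percent correct and AUC can be checked with a short Monte Carlo sketch, reusing the same hypothetical binormal rating distributions as in the earlier ROC sketch; the observer model here (pick the image with the higher rating) is an idealization.

```python
# A small Monte Carlo sketch of the 2AFC/AUC equivalence noted above,
# using the same hypothetical binormal ratings as before.
import numpy as np

rng = np.random.default_rng(7)
n_trials = 200_000

# Each 2AFC trial pairs one normal and one abnormal image; the observer
# (idealized here) picks the image with the higher internal rating.
normal = rng.normal(40, 12, size=n_trials)
abnormal = rng.normal(60, 12, size=n_trials)
pct_correct = (abnormal > normal).mean()

print(f"2AFC percent correct = {pct_correct:.3f}")  # approximates the AUC
```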
AFC experiments are faster to conduct (i.e., more cases can be read per unit time compared to ROC). When the alternatives are four or more, they further provide a greater range of useful clinical task difficulties while still maintaining the AUC in the intended 0.7–0.8 range, a reduced coefficient of variation in the measured AUC for repeated trials, and less susceptibility to training and categorical reporting biases (Burgess, 2011). However, contrary to ROC, the AFC results do not provide information about the underlying distribution functions, and the tradeoff between sensitivity and specificity is not determined. Furthermore, the setting of an AFC experiment does not resemble an actual clinical paradigm; clinicians do not usually compare unrelated images with each other. For in-depth discussions on AFC methods, the reader is referred to excellent articles on that subject (Burgess, 1995; Metz, 2000).
Both ROC and AFC differ from the clinical paradigm in one way or another. The choice of which one to use for a particular study depends on time/resource constraints, the granularity with which the study aims to assess accuracy (i.e., obtaining only the AUC in AFC versus the full ROC curve in ROC), and whether the implication of the method’s deviation (as implemented) from clinical practice is adequately considered. Once a choice of methodology is made, it generally informs the other aspects of study design, discussed below.
14.4 Sample Selection
14.4.1 Case and Image Selection
A perception experiment may employ three types of images: real images from actual patients, simulated images from computer or physical models of patient anatomy or function, and hybrid images which contain computer-simulated abnormalities added to real image backgrounds (Figure 14.4).
Figure 14.4 Examples of different types of images employed in perception experiments: (a) real image (real mammographic background and lesion); (b) hybrid image (real mammographic background with a simulated lesion); (c) simulated image from computer model of human anatomy (a simulated mammogram from a computerized breast model); and (d) simulated image from physical model of human anatomy (an image of a breast phantom (Model 011A, CIRS, Norfolk, VA)).
Among the three, real images are usually the preferred choice as they fully represent the clinical paradigm. However, they have major drawbacks. Collection of clinical cases is a time-consuming and costly process. In detection studies, the abnormal features within the image often need to be subtle enough to be appropriate for the investigation at hand. Further, clinical cases frequently contain more than one type and location of abnormality. Adherence to regulatory standards and Health Insurance Portability and Accountability Act (HIPAA) regulations further adds to the complications associated with this particular option.
An important consideration in using actual patient cases is that the cases need to have associated “truth,” in that either the normalcy or the abnormality of the case needs to be established in reference to the true disease status or a gold standard. Ideally, the gold standard should be based on the results of surgery or pathology (biopsy). However, patients who undergo biopsy represent only a small subgroup of the patient population. When biopsy results are unavailable, results from clinical workup have been used to serve as the gold standard (Dobbins et al., 2008; Thornbury et al., 1993). In some cases, the truth cannot ever be established, a case in point being the morphology of a lesion depicted in an image; neither biopsy nor surgery can extract the lesion in a state that is reflective of its native presence during the imaging exam.
In the absence of ground truth, “relative truth” can be established by expert opinion or expert consensus. However, such a choice has major drawbacks in terms of added uncertainty and potential bias in the study outcome. Reviews of the potential biases associated with using suboptimal gold standards and the approaches to minimize these biases can be found in Chapter 15 and in Zhou et al. (2002).
In any phase of a study, the selected image sample for the perception experiment must reasonably represent the patient population of interest; a lack of important subgroups of patients can bias study results. As an example, bias is introduced into the selection process when the image sample contains only images of patients who are likely to participate in the study, elect a diagnostic test, or meet a certain gender or racial distribution. Proper study design requires careful sampling of the patient population.
Disease prevalence (i.e., the proportion of abnormal cases in the sample) is another important aspect of case selection. In general, the observer results are assumed to be free from the influence of disease prevalence as long as the samples of abnormal and normal cases are large enough. However, studies have shown that disease prevalence can influence the observer’s psychology, and that observer accuracy may increase with disease prevalence (Egglin and Feinstein, 1996). In the design of perception experiments, the potential implication of disease prevalence on the study outcome should be taken into consideration.
A second alternative to the use of actual patient data is to rely on fully simulated images. Those include simulations using computational models (Segars et al., 2008) or physical phantom models (Ko et al., 2003). The simulated images can represent simple objects (e.g., a uniform cylinder representing the abdomen for CT imaging) (Schindera et al., 2008) or an anthropomorphic model (Segars et al., 2008). Realistic simulation of clinical cases is extremely challenging, in terms of both realistically simulating the human body and simulating the image acquisition and presentation conditions. Both types of simulation can present different difficulties in creating image data that are useful for an observer study. Until very recently, few studies have provided simulated images that can approach equivalency to real images (Abadi et al., 2017; Pokrajac et al., 2012). Depending on their sophistication and realism, these images can be very effective in perception experiments. They can offer a distinct advantage in that the full truth of the case (the abnormality and the native patient morphology) is known. Thus, the influence of imaging acquisitions or techniques can be most objectively ascertained. Realism can be suspect, but, at a minimum, this approach can be invaluable in pilot phases of subsequent studies using more clinically relevant images.
As an effective alternative to the two choices above, hybrid images combine actual clinical case backgrounds with realistic simulated abnormalities. Simulation of abnormalities alone can be easier to achieve, as showcased by a number of studies modeling mammographic, pulmonary, and hepatic lesions (Hoe et al., 2006; Samei et al., 1997; Saunders et al., 2006; Solomon and Samei, 2014). Computer-simulated abnormalities offer several advantages over real abnormalities. First, the characteristics of the abnormalities, such as size, shape, contrast, and location, can be well controlled, permitting the study of abnormalities with desired characteristics. Second, the prevalence of the abnormalities can be controlled. Images containing isolated abnormalities, which may be rare in actual patients, can be readily created. Moreover, the truth is known absolutely: the presence and location of a simulated abnormality are certain by definition, since it is explicitly inserted within the image. Lastly, compared with the difficult and time-consuming process of collecting patient data with abnormalities of desired characteristics, a large database of cases can be quickly generated to enable a large-scale perception experiment.
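To make the notion of controlled abnormality characteristics concrete, the following sketch inserts a two-dimensional Gaussian “lesion” of specified size, contrast, and location into a background image; the Gaussian profile and all parameter values are illustrative assumptions, not a validated lesion model.

```python
# A minimal sketch of hybrid-image creation: a simulated lesion with
# controlled size, contrast, and location added to a background image.
import numpy as np

def insert_lesion(background, center, fwhm_px, contrast):
    """Add a 2D Gaussian 'lesion' to a background image (float array)."""
    y, x = np.indices(background.shape)
    sigma = fwhm_px / 2.355  # convert FWHM to standard deviation
    r2 = (x - center[0]) ** 2 + (y - center[1]) ** 2
    lesion = contrast * np.exp(-r2 / (2 * sigma**2))
    return background + lesion

# Example: a flat noisy background standing in for real image data.
rng = np.random.default_rng(0)
background = rng.normal(100.0, 5.0, size=(256, 256))
hybrid = insert_lesion(background, center=(128, 128), fwhm_px=12, contrast=15.0)
```

In practice the lesion model, its blending with the background, and the imaging-system blur and noise would all need validation against real cases before use in a perception experiment.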
As alluded to earlier, the level of difficulty or subtlety of the abnormalities within the image set is an important consideration for a perception experiment. For example, in comparing two different modalities in detecting certain lesions, if the lesions are too apparent, they could easily be identifiable by either modality, thus masking possible differences in performance. Similarly, if the abnormalities are too subtle, they will be missed by both modalities. In ROC experiments, it is generally recommended that a target AUC of 0.7–0.8 be sought, so that the differences would be most discernable. The appropriate level of subtlety of abnormalities for an experiment can be determined or confirmed in the pilot or exploratory phase of a study, when the observers are presented with cases representing a wide range of subtleties. Once the appropriate level of subtlety is ascertained, the diagnostic accuracy of the imaging condition(s) is assessed in a laboratory phase of the study with appropriately difficult or subtle cases. In the clinical phase, if such a phase is desired, all levels of difficulty will then be included. The levels of difficulty would then represent typical disease conditions, while the proportions of cases at different difficulty levels would ideally match those of an actual patient population.
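Under an equal-variance binormal model, in which AUC = Φ(d′/√2), the recommended AUC range can be translated into a target detectability index d′, as the following short sketch illustrates; the binormal assumption is for illustration only.

```python
# A quick sketch of translating the recommended AUC range into a
# detectability index d' under an equal-variance binormal assumption,
# where AUC = Phi(d' / sqrt(2)).
from math import sqrt
from statistics import NormalDist

for auc in (0.70, 0.80):
    d_prime = sqrt(2) * NormalDist().inv_cdf(auc)
    print(f"AUC {auc:.2f} -> d' = {d_prime:.2f}")
# Prints roughly d' = 0.74 and 1.19; lesion subtlety in a pilot phase can
# be tuned (e.g., via contrast) until the observed AUC lands in this band.
```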
14.4.2 Observer Selection
The observer is a quintessential element of a perception experiment. The objective of the experiment is tied to the kind of observer used in the study. For example, if a study aims to assess an image quality factor affecting cardiac images, the observers should be selected from the cardiac clinicians who frequently use those images. Depending on the type of images used, expert observers are usually necessary.
The spectrum of selected observers should emulate that in an actual clinical situation. If the users of the imaging system under investigation are general radiologists, observer selection should not be limited to specialists, and vice versa. Clinicians can also come from a broad range of experience, in terms of both years of practice and volume of images read. Again, the objective of the experiment would determine that choice. The source of observers is another important consideration. The broader the source of observers, the more generalizable the conclusions can be; if only observers from the same institution or hospital are accessible, the conclusions can only be generalized to comparable-level care units.
In certain types of perception experiments, the focus is on a psychophysical aspect of the human visual system. The objectives of such studies do not involve a clinical condition. As such, the experiment can use nonclinician observers such as imaging scientists or students (Christianson et al., 2015; Pollard et al., 2008). In certain other types of studies, the imaging system under investigation might involve other members of allied health personnel such as radiological technologists or nurses. Needless to say, such perception experiments need to employ those individuals as observers.
In addition to human observers, model (mathematical) observers have been widely used in perception experiments. Model observers are mathematical formulations that combine the characteristics of the human visual system with the statistical properties of an image. The need for model observers stems from the fact that the diagnostic accuracy of any modern imaging system is often determined by a myriad of parameters. Optimization of these parameters requires the testing of a large number of conditions in combinations of multiple parameters, which is often prohibitively expensive and time consuming to perform using human observers. Model observers are designed to emulate the performance of human observers. They play an essential role in the preliminary evaluation of new imaging systems (Chawla et al., 2008) and are important in subsequent optimizations of imaging protocols with respect to image quality and radiation dose (Boedeker and McNitt-Gray, 2007; Solomon and Samei, 2016). Chapter 18 provides a thorough review of model observers. Application of model observers in perception experiments is discussed in Chapter 19.
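As a rough illustration of the concept, the following sketch implements a basic channelized Hotelling observer for a location-known detection task; the difference-of-Gaussians channels and all parameter values are illustrative assumptions, and practical studies often use Gabor or Laguerre-Gauss channels instead.

```python
# A sketch of a channelized Hotelling observer (CHO) for a location-known
# detection task; channel design and parameters are illustrative only.
import numpy as np

def dog_channels(shape, n_channels=5):
    """Radially symmetric difference-of-Gaussians channel profiles."""
    y, x = np.indices(shape)
    r2 = (x - shape[1] // 2) ** 2 + (y - shape[0] // 2) ** 2
    chans = []
    for k in range(n_channels):
        s1, s2 = 2.0 * 1.67**k, 2.0 * 1.67 ** (k + 1)
        c = np.exp(-r2 / (2 * s2**2)) - np.exp(-r2 / (2 * s1**2))
        chans.append(c.ravel() / np.linalg.norm(c))
    return np.array(chans)  # shape: (n_channels, n_pixels)

def cho_auc(signal_imgs, noise_imgs):
    """Train a CHO on channelized images and return its AUC."""
    U = dog_channels(signal_imgs.shape[1:])
    vs = signal_imgs.reshape(len(signal_imgs), -1) @ U.T  # channel outputs
    vn = noise_imgs.reshape(len(noise_imgs), -1) @ U.T
    S = 0.5 * (np.cov(vs.T) + np.cov(vn.T))               # pooled covariance
    w = np.linalg.solve(S, vs.mean(0) - vn.mean(0))       # Hotelling template
    ts, tn = vs @ w, vn @ w                               # decision variables
    # AUC as the probability that a signal-present statistic exceeds a
    # signal-absent one (the same 2AFC interpretation discussed earlier).
    return (ts[:, None] > tn[None, :]).mean()

# Illustrative use: white-noise backgrounds with/without a faint Gaussian bump.
rng = np.random.default_rng(3)
y, x = np.indices((64, 64))
bump = 0.4 * np.exp(-((x - 32) ** 2 + (y - 32) ** 2) / (2 * 4.0**2))
noise_imgs = rng.normal(0, 1, size=(200, 64, 64))
signal_imgs = rng.normal(0, 1, size=(200, 64, 64)) + bump
print(f"CHO AUC = {cho_auc(signal_imgs, noise_imgs):.3f}")
```

Note that training and scoring on the same images, as done here for brevity, yields an optimistic AUC; in practice a held-out test set is used.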
14.4.3 Case and Observer Crossing/Matching
When comparing two or more imaging systems, it is often desirable to obtain or present images from each sample case/patient on all imaging systems, a study design known as case crossing. This approach rules out the potential contribution of case variation to the difference in diagnostic accuracy and improves statistical power. However, as much as it is statistically desirable, case crossing has two major problems that might make it infeasible for many experiments. First, it is often not possible to image a patient more than once, for radiation dose or logistical reasons. Second, even if that constraint can be overcome, an observer might remember a patient from his/her first reading, so that the second reading of the case with a different imaging system will be affected. This phenomenon is known as the “memory effect.” It is possible to design studies with randomization and time lags such that the memory effect is minimized or averaged across observers (see sections 14.5.5 and 14.5.6), but it cannot be fully eliminated.
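One illustrative scheduling scheme (an assumption for demonstration, not a prescription) is sketched below: the case set is split into two blocks, modality order is counterbalanced across readers, and the two readings of each case are separated by a washout session.

```python
# A minimal sketch of a counterbalanced crossover reading schedule that
# dilutes the memory effect; the session structure is an illustrative
# assumption, and "X"/"Y" stand for the two imaging systems.
import random

def reading_schedule(case_ids, readers, seed=1):
    rng = random.Random(seed)
    half = len(case_ids) // 2
    schedule = {}
    for i, reader in enumerate(readers):
        cases = case_ids[:]
        rng.shuffle(cases)                    # fresh case order per reader
        a, b = cases[:half], cases[half:]
        # Alternate which modality each reader sees first (counterbalancing).
        first, second = ("X", "Y") if i % 2 == 0 else ("Y", "X")
        schedule[reader] = {
            "session 1": [(c, first) for c in a] + [(c, second) for c in b],
            # After a washout period (e.g., several weeks), modalities swap,
            # so each case is read once in each modality per reader.
            "session 2": [(c, second) for c in a] + [(c, first) for c in b],
        }
    return schedule

sched = reading_schedule([f"case{i:03d}" for i in range(100)],
                         ["R1", "R2", "R3", "R4"])
```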
When case crossing is not feasible, case matching can be used, provided that the cases and patients representing different imaging systems are drawn from similar populations to avoid the introduction of bias. For example, in a study aiming to assess the effectiveness of two imaging systems for lung cancer screening, the two imaging systems should not represent populations with differing smoking habits.
As with image selection, it is also desirable to have the same sample of observers read images representing all imaging systems, a study design known as observer crossing. This design eliminates the contribution of observer variation to the measured difference in diagnostic accuracy and improves statistical power. But again, observer crossing might not always be possible. For example, observers from one institution may have access to only one set of images. In those situations, observer matching may be employed in that observers reading different imaging systems have somewhat similar attributes (e.g., similar experience or expertise). That does not guarantee the absence of bias, but at least minimizes the likelihood.
A study design that employs both case crossing and observer crossing is often referred to as a fully crossed design. Because such a design is the most statistically powerful (Dorfman et al., 1992), it is adopted by many multiobserver ROC studies and is the target of many ROC software tools.
14.4.4 Sample Size
The sample size is of prime importance in the design of a perception experiment. In that regard, the investigator needs to decide on the number of cases or images, the number of observers, and the number of repeated readings per observer. Integral to these decisions is the methodology that a study will employ. For example, the computation of the numbers depends on whether the study will employ an ROC or an AFC method. In general, the needed numbers depend on the expected magnitude of the difference between different imaging systems in relationship to the expected magnitudes of variability. If there were no sources of variability in an experiment, a single case and a single observer could provide sufficient information regarding whether a particular system is better or worse than another one. However, in any perception experiment, there are three major sources of variability:
1. The first source of variability is that associated with the case sample, brought about by the fact that no two cases presenting the same abnormality would invoke the same response from the observer. Case sample variability necessitates that more than one case be used in the experiment. Typical numbers range from tens of cases for pilot and feasibility experiments to hundreds or even thousands for laboratory and clinical studies.
2. The second source of variability is the interobserver variability. It is brought about by the fact that the same case presented to different observers will not generate the same response from the observers. The interobserver variability similarly would require the participation of more than one observer in an experiment. In most perception studies, a minimum of three (typically four to six) observers is deployed with larger-scale clinical trials and observer-focused studies using tens or even hundreds of observers (Obuchowski, 2004).
3. The third source of variability is the intraobserver variability, brought about by the fact that the same case presented to an observer twice will not bring about the same response from that individual observer. It can be influenced by observer fatigue as well as a host of other factors involved in human decisions. Intraobserver variability is commonly not isolated; it is assumed to be part of the overall encompassing “observer” variability. However, if it is measured specifically, it can provide added statistical power to the study. To do so, the observers would need to provide repeated readings of some or all of the images. Integral to that design, however, the compounding influence of the memory effect should be taken into consideration.
Here we provide a simple example1 to illustrate the dependence of sample size on the difference between imaging systems in relationship to the magnitudes of the three sources of variability.
Consider a perception experiment that compares the diagnostic accuracy (AUC) of two imaging systems X and Y for a specific clinical task. For each imaging system, the researcher collects n cases; the cases from X are independent of (unmatched to) the cases from Y. The researcher plans to recruit a group of l observers, each of whom will review all cases from both X and Y (i.e., observer crossing). Provided that the number of cases cannot be further increased, the question for the researcher is then how many observers he/she will need in order to demonstrate a statistically significant difference in AUC between the two imaging systems if such a difference does exist.
Let us start by analyzing the different variance terms of AUC. We denote $\sigma_c^2$ as the variance in AUC due to case sample variability, $\sigma_{br}^2$ as the variance in AUC due to interobserver variability, and $\sigma_{wr}^2$ as the variance in AUC due to intraobserver variability. Because $\sigma_c^2$ and $\sigma_{br}^2$ cannot be measured directly without including $\sigma_{wr}^2$, case sample variability is measured indirectly as $\sigma_{c+wr}^2 \,(= \sigma_c^2 + \sigma_{wr}^2)$, the variance in AUC that would be obtained by having one observer read, once each, a set of different case samples, and interobserver variability is measured indirectly as $\sigma_{br+wr}^2 \,(= \sigma_{br}^2 + \sigma_{wr}^2)$, the variance in AUC that would be obtained by having one case sample read once by each of the observers. Assuming $\sigma_{c+wr}^2$, $\sigma_{br+wr}^2$, and $\sigma_{wr}^2$ are the same for the two imaging systems, the standard error for the difference between X and Y can be calculated as (Swets and Pickett, 1982):

$$SE\left(\mathrm{AUC}_X - \mathrm{AUC}_Y\right) = \sqrt{2\left[\left(\sigma_{c+wr}^2 - \sigma_{wr}^2\right) + \frac{\left(\sigma_{br+wr}^2 - \sigma_{wr}^2\right)\left(1 - r_{br-wr}\right) + \sigma_{wr}^2}{l}\right]}$$
where $r_{br-wr}$ is the correlation between the AUC values in imaging system X ($\mathrm{AUC}_{\mathrm{observer}\,1}^{X}, \mathrm{AUC}_{\mathrm{observer}\,2}^{X}, \ldots, \mathrm{AUC}_{\mathrm{observer}\,l}^{X}$) and those in imaging system Y ($\mathrm{AUC}_{\mathrm{observer}\,1}^{Y}, \mathrm{AUC}_{\mathrm{observer}\,2}^{Y}, \ldots, \mathrm{AUC}_{\mathrm{observer}\,l}^{Y}$) and is a result of observer crossing in this experiment. In the ideal situation of $r_{br-wr} = 1$, the contribution to the standard error from interobserver variability vanishes, demonstrating the advantage of crossing the observers. Assuming the case sample is large, the test statistic follows a normal distribution, i.e.,

$$z = \frac{\mathrm{AUC}_X - \mathrm{AUC}_Y}{SE\left(\mathrm{AUC}_X - \mathrm{AUC}_Y\right)} \sim N(0, 1)$$

under the null hypothesis of no difference between the two imaging systems.
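To illustrate, the following sketch evaluates this expression for hypothetical variance components, of the kind that might be estimated in a pilot phase, and scans the number of observers l for the smallest value reaching significance; all numeric values are assumptions for demonstration.

```python
# A numeric sketch of the standard-error expression above, with
# hypothetical variance components (as might be estimated from a pilot
# phase) to gauge the required number of observers.
from math import sqrt

d_auc   = 0.05      # expected AUC difference between systems X and Y
v_c_wr  = 0.0004    # sigma^2_{c+wr}: case (plus within-reader) variance
v_br_wr = 0.0006    # sigma^2_{br+wr}: between- (plus within-) reader variance
v_wr    = 0.0002    # sigma^2_{wr}: within-reader variance
r       = 0.6       # r_{br-wr}: between-system correlation from observer crossing

for l in range(1, 11):
    se = sqrt(2 * ((v_c_wr - v_wr)
                   + ((v_br_wr - v_wr) * (1 - r) + v_wr) / l))
    z = d_auc / se
    print(f"l = {l:2d}: SE = {se:.4f}, z = {z:.2f}")
# The smallest l with z >= 1.96 (here, l = 3) suggests the number of observers
# needed for significance at alpha = 0.05, ignoring power considerations for
# brevity; note that the case-variance term does not shrink with added readers.
```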