



21 Perception of Volumetric Data


Geoffrey D. Rubin, Trafton Drew, and Lauren H. Williams



21.1 Introduction


Given the increasing proportion of medical images in clinical practice that are volumetric, there is growing interest in understanding how volumetric images are perceived. A major component of that effort is understanding how search is performed in these types of images. Much of what we have learned about search in clinical practice comes from studies of two-dimensional (2D) images, such as chest radiographs and mammograms. However, it is unclear how generalizable these principles are across imaging modalities. There is an enormous basic science literature devoted to laboratory visual search, where naïve observers search artificial stimuli (e.g., the letters T and L) for prespecified targets (e.g., a T of a certain orientation) (Duncan and Humphreys, 1989; Wolfe, 1994) (Figure 21.1). This literature is a useful starting point for understanding how search is accomplished in diagnostic radiology. In fact, Wolfe and colleagues (2016) recently posited that there is great value in understanding these sorts of laboratory visual search tasks because all humans use the same search engine for all different flavors of search (see Chapter 12).





Figure 21.1 Typical laboratory visual search task: search for the T among the Ls.


To take one example, basic science research suggests that humans tend to miss targets more often when targets occur less frequently: the so-called prevalence effect (Horowitz, 2017) (Figure 21.2A). This simple observation has profound implications for screening radiology, where signs of cancer are often present on less than 1% of all cases. However, searching for Ts amongst Ls in a boring laboratory task is a far cry from the life-and-death stakes of cancer screening, so until recently it was not clear whether the basic science findings have any bearing on the actual experience of radiologists. In fact, Evans and colleagues (2013) found that cancer prevalence plays an important role in the rate of false negatives in clinical practice (Figure 21.2B). Positive cases were much more likely to be missed when viewed in regular work flow, where cancer prevalence was below 1%, than when the same cases were a part of a laboratory experiment where prevalence was 50%.






Figure 21.2 (a) Low-prevalence effect in an artificial baggage-screening task (Wolfe et al., 2007). (b) Low-prevalence effect in expert radiologists (Evans et al., 2013).


Nonetheless, it is important to acknowledge that, although valuable information can be gleaned from tasks that are related but fundamentally different from the tasks we are interested in studying, recommendations for specific tasks are most persuasive when they are based on the task in question. For example, thanks to pioneering work from Harold Kundel’s group, there is strong evidence that approximately one-third of the lung nodules that are missed in chest radiographs are missed because they are never fixated (Kundel et al., 1978). This suggests that performance could be improved if radiologists took additional time to ensure that they analyze the exams more comprehensively. However, to highlight the challenge of generalizing findings across imaging modalities, the same group found that fewer than 10% of errors when searching for a bone fracture occur because the fracture is never fixated (Hu et al., 1994).



21.2 How Did We Get Here?



21.2.1 Development of Volumetric Imaging


Introduced into medical practice just 35–40 years ago, both computed tomography (CT) and magnetic resonance imaging (MRI) have evolved substantially (Edelman, 2014; Rubin, 2014). Referred to as “cross-sectional” imaging techniques, both modalities in their early years sequentially acquired discrete cross-sections of the body, which, in the case of CT, were always oriented in the transverse plane. For both modalities, the thickness of these sections most commonly ranged between 7 and 10 mm, resulting in highly anisotropic voxels with in-plane dimensions between 0.7 and 1.5 mm. For display purposes, images were printed on 14 × 17” sheets of film in a tiled format with 6–20 sections per sheet, and hung on back-lit surfaces for viewing.


For radiologists reading these images, tracking structures longitudinally was challenging because serial images were not aligned along the imaging axis. Other than accommodation for their smaller size, individual cross-sections were reviewed in a manner similar to projection radiographs. However, to integrate information through the plane of acquisition, interpreters would scan back and forth across the pages. When a seemingly disconnected finding was observed on a single cross-section, a blank 3 × 5” card placed over the back-lit image with edges aligned to the corner of the cross-section allowed a pen mark to be placed over the finding. The card was then repositioned over adjacent cross-sections and the pen mark referenced to the structures that were now visible below it. The process was cumbersome and ineffective at conveying the three-dimensionality of the underlying anatomy. Radiologists often claimed special powers of mental integration to see the third dimension in their mind’s eye. However, these capabilities have never been formally established.


This construct for cross-sectional image review was disrupted in the 1990s with the introduction of picture archive and communication systems (PACS) (Huang, 2011). While initially configured to present CT and MRI images in the familiar tiled format, image review on a PACS-based computer display screen soon transitioned to stacked image paging (Straub et al., 1991). In this manner, a single cross-section is displayed to fill a quadrant, half, or all of a standard approximately 14 × 17” display screen. A computer input device, typically either a mouse with scroll wheel or a trackball, is then used to sequentially display serial images in the same spatial location on the display as the preceding cross-sections. When unfettered by processor-limited slowing, this display method transforms the interpretative experience to one that allows the reader to integrate longitudinal relationships with transverse scanning, providing a greater sense of 3D relationships (Seltzer et al., 1995).


In addition to forcing tiled image display, film further limited the ability to leverage volumetric image acquisition methods by rendering thin-section and overlapping reconstruction costly and impractical. With the introduction and rapid advancement of spiral or helical CT in the 1990s, augmented by multidetector-row CT in 1998, thousands of submillimeter-thick CT sections were available for reconstruction following a single scan requiring less than 10 seconds to complete (Rubin, 2000; Rubin et al., 1999). While PACS eliminated the cost and hassle of handling film, radiologists were often slow to embrace the reconstruction and review of so many cross-sections, favoring reconstruction of fewer thick sections, effectively sacrificing longitudinal spatial resolution and accepting greater volume averaging in favor of expedient interpretation.


The advances in CT technology that have resulted in acquisitions with isotropic spatial resolution have been paralleled by advances in MRI technology. However, the greater number of tissue contrast mechanisms that can be revealed through MRI, and the diversity of image acquisition approaches required to display those contrast mechanisms, result in interpretative tasks that diverge from those of CT. While CT image review is largely focused on a single high-resolution volumetric data set, clinical MRI exams continue to be represented by multiple series of images with varying voxel dimensions and planes of acquisition. Because of the greater variability with which images are displayed and interpreted in an MRI exam, CT has emerged as the primary modality for studying reader perception using eye tracking.


Another important development in CT and MRI interpretation was the evolution of image processing workstations and associated algorithms for alternative visualization through the creation of multiplanar reformations, maximum-intensity projections, shaded surface displays, and volume rendering (Rubin et al., 2009). Optimized through the use of isotropic input voxels made possible by advanced CT and MRI methods, these images augment primary cross-sectional review by presenting alternative perspectives on the underlying anatomy, allowing greater appreciation of complex spatial relationships or accentuating pathological features against background tissues.



21.2.2 Visualization of Volumetric Imaging


As discussed above, the transition from tiled cross-section display printed on sheets of film to PACS-based stacked scrolling had a profound effect on the image interpretation process. This change provided additional perceptual cues that facilitated lesion recognition, most notably the emergence of the pop-out phenomenon and motion-related effects. When considering that each volumetric scan represents a portion of the human body with its various organs and tissues aligned by anatomic and pathological relationships that bear little direct relationship to the coordinate system of the volumetric image, a number of alternative image presentation formats emerge as considerations for interpretation.


Two broad categories of image presentation formats include those that seek to create cross-sections that are aligned to a coordinate system that matches the structure(s) of interest, and those that seek to selectively display structures of interest volumetrically while rendering surrounding regions of anatomy transparently. While a detailed discussion of alternative volumetric display strategies is beyond the scope of this chapter, we provide a general framework for consideration.


Perhaps the simplest form of alternative display is the reformation of volumetric imaging data into planes that are orthogonal to the primary reconstructions; in the case of CT, these are coronal or sagittal planes. Routine use of coronal and sagittal reformations in abdominal and musculoskeletal CT has become increasingly commonplace (Dreizin and Munera, 2012; Ghekiere et al., 2007). These reformations facilitate understanding of spatial relationships when tracking the gastrointestinal tract, long bones, joints, and the spine in particular.


Although these and other longitudinally oriented structures may approximate sagittal and coronal orientations, it is important to remember that these reformations are created relative to the CT table on which the patient lies during image acquisition. While this coordinate system approximates key planes through the body, a variety of conditions, both intrinsic to the human anatomy and behavioral, result in imperfect alignment relative to the structures of interest. Consequently, oblique reformations may be required to align with the principal axes of the target anatomy. When evaluating curved structures such as blood vessels or a scoliotic spine, curved planar reformations have been shown to convey key anatomic relationships and unveil important characteristics of the underlying structures that are not easily appreciated on standard flat planar reformations.


The second class of visualization is volume rendering. Volume rendering may be applied to slabs representing some number of adjacent cross-sectional reconstructions or multiplanar reformations; it may be applied to the entire imaging volume; or it may be applied to a subvolume defined by isolating contiguous structures using region-growing or sculpting techniques. Maximum- and minimum-intensity projections represent the simplest forms of volume rendering (Napel et al., 1993). By projecting the highest- or lowest-intensity voxel on to the output image, they emulate primary reconstructions but help differentiate tubular structures from nodular lesions and provide greater context for the course of those tubular structures (Rubin, 2015). Increasing in sophistication are fuller implementations of volume rendering that assign each voxel an opacity as a function of its intensity value and possibly the intensities of neighboring voxels. Color maps are applied to help discriminate relative intensity values amongst rendered voxels, and grayscale conveys the effects of lighting models that reveal surface detail (Fishman et al., 2006; Rubin et al., 2009) (Figure 21.3). These varying methods of image display have become critical both to primary image interpretation and to the communication of key anatomic attributes to nonradiologist physicians who use the data to plan therapeutic interventions but may be less comfortable integrating information from cross-sectional display formats.
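To make the simplest of these operations concrete, the following sketch computes maximum- and minimum-intensity projections from a CT volume with NumPy. It is a minimal illustration, not any workstation’s actual implementation; the array shape, axis convention, and slab thickness are assumptions chosen for the example.

```python
import numpy as np

# Stand-in CT volume stored as (slices, rows, cols) in Hounsfield units.
# A real study would load reconstructed sections here instead.
volume = np.random.randint(-1000, 1000, size=(350, 512, 512)).astype(np.int16)

# Maximum-intensity projection (MIP): each output pixel takes the brightest
# voxel along the projection axis, accentuating vessels and calcifications.
mip = volume.max(axis=0)

# Minimum-intensity projection (MinIP): each output pixel takes the darkest
# voxel, accentuating air-filled structures such as the airways.
minip = volume.min(axis=0)

# Sliding thin-slab MIPs over 10-section slabs, as commonly used to separate
# lung nodules from vessels while limiting superimposition.
slab = 10
slab_mips = np.stack([volume[i:i + slab].max(axis=0)
                      for i in range(volume.shape[0] - slab + 1)])
```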





Figure 21.3 Frontal volume rendering of an abdominal computed tomography scan. The opacity transfer functions have been selected to render the arteries, bones, and kidneys opaque, while the remainder of the structures are transparent and thus invisible. Depth through the image is conveyed by modeling surface reflections from simulated lighting.


Most volume rendering is created from a viewing position that is external to the imaging data set. Unlike the single-axis scrolling used to navigate cross-sectional stacks, examination of volume renderings is accomplished through real-time three-axis rotation, opacity transfer manipulation, and subvolume editing. The latter is most commonly performed using clip planes of varying thickness and orientation, but can also involve free sculpting. Volume renderings use orthographic or parallel projection, where individual rays are cast parallel to one another and perpendicular to the plane of display. The resultant image is free of the perspective with which humans visually perceive the world, where objects closer to the viewer appear larger than those further away. Consequently, orthographically rendered objects are represented at the same size regardless of their distance from the viewpoint (Rubin et al., 2009).


In the special case of virtual endoscopy (Rubin et al., 1996), used most commonly for CT colonography, visualization is immersive, meaning it is rendered from a viewpoint within the imaging volume (Rubin, 2003). In this setting, volume rendering is performed using perspective projections where projectional “rays” converge at the viewpoint. When rendering immersive scenes with perspective, depth cueing and shadowing facilitate orientation and positional context.
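The perceptual difference between the two projection models reduces to simple geometry: under orthographic projection, rendered size is independent of distance from the viewpoint, while under a perspective (pinhole) projection it falls off inversely with depth. A minimal sketch of this relationship, with an arbitrary focal length and illustrative polyp sizes:

```python
def orthographic_size(true_size: float, depth: float) -> float:
    # Parallel rays: the rendered size never depends on depth.
    return true_size

def perspective_size(true_size: float, depth: float,
                     focal_length: float = 1.0) -> float:
    # Pinhole model: rendered size scales as focal_length / depth.
    return true_size * focal_length / depth

# In a virtual endoscopic view, a 5-mm polyp near the viewpoint can render
# larger than a 10-mm polyp farther down the lumen.
near = perspective_size(5, depth=10)    # 0.50
far = perspective_size(10, depth=40)    # 0.25
```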


Volume rendering presents several unexplored challenges to the interpretation of eye-tracking data. Standard planar reformation using a linear grayscale provides a straightforward basis for quantifying metrics of lesion conspicuity based upon background-to-target differences (Krupinski et al., 2003). Because volume rendering uses opacity transfer functions, nonlinear color look-up tables, and lighting model-enabled surface shading and shadow casting, lesion conspicuity can vary substantially for a given projection and viewpoint. When the projection is made with perspective, an added complication is that the apparent size of a structure depends both upon its true size and its depth within the image. When further considering perspective volume rendering as a real-time immersive exploration driven by an observer’s dynamic viewpoint, with perspective and rendering parameters at their discretion, standard constructs of useful field of view (UFOV), saccade length, and even fixation upon semitransparent objects are unmeasurable using currently accepted methods. Thus, when considering the breadth of possibilities for visualizing volumetric imaging data, it is evident that our ability to assess perception with consistent metrology is in its infancy.



21.3 What Are the Benefits and Costs of Having More Images?


The increasing speed with which CT is capable of imaging the body has resulted in a veritable explosion in the size of clinical imaging exams (Rubin, 2000). In the early 1990s, most CT scans were composed of fewer than 50 images. Today, most exams contain at minimum 300–500 images and, for many applications, image counts rise into the thousands depending upon the length of the body to be covered and the use of time-resolved acquisitions (e.g., electrocardiographically gated examinations). Simultaneously, the use of CT in clinical medicine has grown at the expense of projectional radiographic examinations, resulting in a shift in radiologist workload toward the interpretation of larger and more information-dense data sets.


Radiologists have been forced to confront this growth in cross-sectional imaging without increases in their labor force or tools beyond PACS to make interpretation more efficient. Consequently, and perhaps paradoxically, radiologists are reading more cases per day than they were just a few years ago, despite the fact that today’s cases contain many more images (McDonald et al., 2015). Whether this means that radiologists are interpreting imaging exams in less time, or that they are working longer and eliminating professional activities that do not involve image interpretation, is unknown. However, even if overall exam interpretation times are approximately the same as in the past, the greater number of images in present-day exams implies that radiologists are spending far less time examining individual images than they have in the past (McDonald et al., 2015).


In spite of these practical drivers of CT practice, there are no reports to suggest that CT interpretation is less effective than it was in the past. While the reconstruction of less than 1-mm transverse CT sections provides higher longitudinal spatial resolution and less volume averaging, whether these image features alter radiology reports in ways that affect patient health is largely unknown. What is known is that thinner sections allow resolution of smaller structures and detection of smaller lesions. Lung nodules are a prototypical example of lesions that are found in greater number and at smaller sizes when thin rather than thick sections are reviewed (Rubin, 2015). While this would seem to be an unconditionally good attribute of thin-section reconstruction, smaller lung nodules are more likely than larger nodules to represent benign, clinically insignificant findings. If the goal of performing a CT scan of the chest is to detect the presence of lung cancer, then finding more small nodules, while potentially detecting more small cancers, will result in a greater fraction of benign nodules that must be scrutinized and either ignored or referred for further testing to ultimately establish their benignity. In either case, it might be argued that imaging in a manner that does not detect small benign nodules would be better for the patient and the radiologist.


Another characteristic of thin sections is that they yield noisier images than thick sections. For structures that have a high degree of contrast relative to background tissues, the added noise may not be relevant, but for low-contrast structures such as some brain, liver, and renal lesions, the added noise of thin sections can reduce their conspicuity to the point where they are less detectable. This characteristic has been a major motivator for many radiologists to prefer 5-mm thick sections for brain, abdominal, and pelvic CT scans in spite of the greater volume averaging and lower longitudinal spatial resolution. The recent introduction and refinement of iterative reconstruction methods to replace traditional filtered back-projection has been an important advance that provides substantial reduction in noise levels in thin reconstructions (Solomon et al., 2017). Lesion detectability using iterative reconstruction in low-contrast environments remains an area of active investigation.
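The noise penalty of thin sections follows from photon statistics: for a fixed acquisition technique, each reconstructed section averages a number of detected photons roughly proportional to its thickness, so the noise standard deviation scales approximately as the inverse square root of thickness. As a back-of-the-envelope estimate (a rough approximation that ignores reconstruction kernel and scanner-specific effects):

$$\frac{\sigma_{1\,\mathrm{mm}}}{\sigma_{5\,\mathrm{mm}}} \approx \sqrt{\frac{5\,\mathrm{mm}}{1\,\mathrm{mm}}} \approx 2.2$$

That is, a 1-mm reconstruction can be expected to be roughly twice as noisy as a 5-mm reconstruction of the same acquisition, consistent with the preference for thicker sections in low-contrast tasks.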


One benefit of reconstructing a greater number of thin sections to achieve isotropic spatial resolution is that reformations and renderings have consistent spatial resolution regardless of their orientation. This feature has been tremendously impactful for assessing complex anatomy that is not clearly visualized in transverse sections. Notable examples include virtually all cardiovascular examinations (Rubin et al., 2009) and many examinations of the bowel (Ghekiere et al., 2007; Pickhardt, 2003) and urinary tract. While these targeted renderings are optimized to evaluate the specific anatomic structures, the cross-sections created to comprise the rendering represent additional data to be scrutinized. When isotropic data are generated for the purpose of creating 3D renderings or for quantifying tissue dimensions and orientations, then radiologists may choose to review the source images directly or a reduced set of images comprised of thicker sections.



21.4 Challenges in Experimental Design



21.4.1 Different Presentation Modes Across Studies


While it would be convenient to unify the approach to the interpretation of volumetric medical images, clinically important detection tasks vary substantially depending upon the nature of the target and background.




  • Early lung cancer detection is dependent upon the identification of spheroid lung nodules as small as 4 mm in diameter within a high-contrast but complex background composed of normal lung tissue with its many varied branching blood vessels and airways.



  • Early colon cancer manifests within polyps that represent surface irregularities within the convoluted tubular lumen of the colon. Positioned amongst normal folds and turns that are arbitrarily oriented relative to the body’s principal axes, polyps are very challenging to recognize on standard transverse reconstructions, and most readers use specialized volume renderings to visualize surface irregularities in the colon and separate polyps from normal folds.



  • Many common cancers tend to metastasize to the liver. Even when a centimeter or more in size, they tend to have low contrast relative to the background of liver parenchyma. Changes in the display settings for transverse CT sections can accentuate subtle liver lesions, but the decision to access and establish alternative display settings is at the discretion of the interpreter.



  • Atherosclerotic plaques in coronary arteries are the cause of most heart attacks. Present on the inner surface of the coronary arteries, they are often composed of highly heterogeneous materials that may appear very bright in the case of calcifications or dark in the case of lipid-rich plaques. Their impact on blood flow to the heart muscle is dependent upon the residual coronary artery lumen adjacent to the plaque and the size of undiseased segments of the coronary artery. Effective grading of coronary artery lesions often depends upon the use of curved planar reformations to create cross-sections both along and perpendicular to the center line of the arteries.


As each of these detection tasks involves unique display paradigms, it should be evident that studies focused upon reader perception of these varied abnormalities need to be uniquely customized to each diagnostic task. Moreover, to the extent that individuals complete a diagnostic task with varying approaches using the tools available to them, investigators can struggle to establish a study model that is sufficiently realistic to measure real-world performance while being sufficiently flexible to accommodate varying approaches amongst interpreters.


Diving a bit deeper into one of the above examples, lung nodule detection in CT scans has been an area of particular interest for perception studies. A CT scan of the chest acquired with modern multidetector-row technology can be expected to result in an imaging volume containing 512 × 512 × 350 12-bit voxels with voxel dimensions of approximately 0.7 × 0.7 × 1.0 mm. Amongst the approximately 90,000,000 voxels comprising this CT scan, a 10-mm diameter lung cancer would occupy approximately 1200 voxels and a 4-mm lung cancer would occupy approximately 75 voxels, corresponding to 0.0013% and 0.00008% of the voxels in the volume, respectively (Rubin, 2015). Perhaps a more accessible construct of the challenge of lung nodule detection is to consider that a 5-mm lung cancer will be visible on 4–6 contiguous transverse reconstructions, depending upon the character of the nodule. Unless one of those sections is displayed, the possibility of lung cancer detection is zero. Across 13 readers interpreting 40 lung CT scans containing 5-mm diameter lung nodules and using standard PACS-based paging of stacked transverse sections, nodules were displayed during 6% of the search period with 94% of the search activity being devoid of detection opportunities (Rubin et al., 2015b).
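The voxel counts quoted above are easy to verify by modeling the nodule as a sphere on the stated 512 × 512 × 350 grid of 0.7 × 0.7 × 1.0-mm voxels. The sketch below is illustrative arithmetic only; real counts depend on nodule shape, margin definition, and partial-volume effects.

```python
import math

voxel_volume = 0.7 * 0.7 * 1.0       # mm^3 per voxel
total_voxels = 512 * 512 * 350        # ~91.8 million voxels

def nodule_voxels(diameter_mm: float) -> float:
    sphere_mm3 = (4 / 3) * math.pi * (diameter_mm / 2) ** 3
    return sphere_mm3 / voxel_volume

for d in (10, 4):
    n = nodule_voxels(d)
    print(f"{d}-mm nodule: ~{n:.0f} voxels "
          f"({100 * n / total_voxels:.5f}% of the volume)")
# 10-mm nodule: ~1069 voxels (0.00116% of the volume)
# 4-mm nodule: ~68 voxels (0.00007% of the volume)
```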


When considering the design of a study to assess radiologists’ perception of lung nodules using eye tracking, investigators might consider several image presentation possibilities, including isolated CT sections in static presentation (Ellis et al., 2006), a fixed frame rate video segment scrolling from one end of the scan to the other (Ebner et al., 2017), or a free-search paradigm where readers mimic the behavior that they follow in the clinical environment (Drew et al., 2013b, 2013c; Rubin et al., 2015b). As will be discussed subsequently, these choices lie on a spectrum from greater standardization to greater clinical realism. When considering that many radiologists facilitate their lung nodule detection using sliding thin-slab maximum-intensity projections (Rubin, 2015), which help to separate nodules from background lung using a substantially smaller number of transverse cross-sections than the source transverse reconstructions, then the potential divergence between controlled study of image perception and assessments of clinical performance is evident.



21.5 Using Eye Tracking in Volumetric Data


Following the seminal work in the 1970s, eye tracking has been utilized extensively in the field of medical image perception. For example, eye tracking has provided invaluable insight on expert search behavior, the use of global processing in image interpretation, and the classification of omission errors (Kundel and La Follette, 1972; Kundel et al., 1978, 2007). This research has broad implications for diagnostic error prevention as well as training the next generation of physicians.


Although it has been several decades since eye tracking was first used in medical image perception, there are relatively few studies that have utilized eye tracking in volumetric images. Among these studies, there is little consensus on how to translate classic eye-tracking metrics to 3D images. The most common measures include fixations, saccades, time to first fixation, dwell time, and coverage. If the field of medical image perception is to keep pace with the rapid shift to volumetric imaging in medicine, these measures will need to be adapted. In this section, we will discuss the challenges in eye-tracking research that are unique to volumetric images and weigh the merits of various approaches to resolve these issues (Table 21.1). Despite these challenges, we believe that many, if not all, of these issues can be satisfactorily addressed.




Table 21.1 Comparison of two-dimensional and volumetric imaging from an experimental design perspective


Stimulus material
Similarities: Issues of standardization.
Challenges: Multitude of display options, ranging from time-varying display of two-dimensional cross-sections to three-dimensional volume visualization with anatomy-specific opacity transfer functions; multitude of navigation strategies specific to display type; user control of navigation and display factors to mimic the clinical environment.
Actions to overcome challenges: Recognition of fundamental differences across display paradigms and management of key control variables relevant to display type; provision of navigational tools with attention to variation from clinical reality; weighing of experimental rigor against ecological validity.


Event detection
Similarities: Event detection based on physiological properties of eye movements.
Challenges: Smooth-pursuit eye movements; events cover multiple cross-sections.
Actions to overcome challenges: Velocity-based event detection or use of raw data after elimination of saccades; customized software that maps events to cross-sections.


Parameter calculation
Similarities: Similar parameters can be used to study visual search.
Challenges: Apparent lesion size varies through depth and display method; the meaning of parameters may change; navigation through scrolling, model rotation, or other dynamic input device-based manipulation is not covered by conventional parameters.
Actions to overcome challenges: Calculation of parameters directly related to the event; more research is needed; new parameters are being developed and require further validation.


Reporting abnormalities
Similarities: JAF ROC studies are suitable.
Challenges: Orientation within a case can be tricky; reporting interferes with eye tracking.
Actions to overcome challenges: Definitive marking of reported lesions; tracking of reporting to separate perceptual and decision-making processes.


Adapted from Venjakob and Mello-Thoms (2016).


JAF, jackknife free-response; ROC, receiver operating characteristic.



21.6 Stimulus Presentation Methods


When considering how to best advance our understanding of volumetric medical image search, it is important to find an appropriate balance between experimental control and the strength of the connection to the task in question. This has been a thorny issue since researchers first became interested in studying medical image perception: displaying high-resolution medical images is a great deal more complicated than displaying the simple stimuli typically favored in the basic science literature.


One of the first questions that researchers must ask themselves when designing a new experiment is whether they will employ an image delivery system that emulates the real conditions that radiologists use in the clinic or a constrained system that provides greater control over experimental conditions. Realistic image delivery systems should lead to conclusions that are more readily applicable to performance in actual practice, but there are costs to this approach as well. Adopting this sort of design typically forces the experimenter to relinquish control over what is on the screen at what time. Although most commercial PACS allow researchers to determine how long a given case was examined, it is difficult to gain access to more fine-grained information, such as how long a given image, or a given slice within it, was viewed.


Furthermore, most desktop-mounted eye-tracking systems cannot be used with commercial image delivery systems because of the difficulty of co-registering the displayed images with the gaze track at high temporal precision. Moreover, multiscreen presentations are increasingly the norm in clinical practice, yet eye-tracking calibration typically cannot span more than one screen, and spatial resolution and tracking accuracy decrease significantly when more than one screen is employed.


An emerging option for introducing eye tracking into the multiscreen clinical environment is the use of mobile eye-tracking systems, which do not require a data stream from the image delivery system. The downside of this approach is that the researcher must separately determine where the radiologist was looking, which can be accomplished by viewing a video with an overlaid fixation point that indicates where the eye tracker estimates the radiologist is looking (Figure 21.4).





Figure 21.4 Fixation mapping using a mobile eye-tracking system. Top panel shows the eye-tracking pattern (large circle); lower panel shows the type of area of interest being fixated.


At present, translating the eye-tracking video into quantifiable data requires a great deal of time and effort in mapping each sample (typically in the neighborhood of 60 Hz) or fixation on to prespecified regions of interest (ROIs). There is a great deal of active work in the computer vision field to try to automatically parse these data. Although this work is extremely promising for mobile eye tracking generally, medical image perception may be a particularly difficult application for these sorts of algorithms. The algorithms are quite good at determining whether the eye is on a car or a cloud in the sky, but not yet very good at determining whether a radiologist is looking at the liver or the spleen.


The mobile eye-tracking approach also makes it difficult to confidently quantify movements through depth. Most PACS indicate which slice is being displayed in a given case, so it might be possible in principle to reconstruct movement through depth, but radiologists do not always look at a given image while scrolling through the volume, and even when they do, they may not be looking at a place where the slice depth is visible in the mobile eye-tracking video. In sum, mobile eye tracking provides a very promising venue for the use of realistic image delivery systems, but it comes with considerable challenges, particularly if researchers are interested in movement through depth.


Perhaps as a result of the challenges involved in using realistic image delivery systems, to our knowledge all of the existing studies of volumetric medical search use artificial stimulus delivery systems where the eye-tracking system is integrated with the image-presentation software. Even under these circumstances, investigators must make additional choices with respect to balancing ecological validity and experimental control. The nature of artificial stimuli used to study medical image perception varies greatly. At one extreme, experimenters can elect to examine how single cross-sections of a volumetric image are examined (Ellis et al., 2006; Matsumoto et al., 2011; Suwa et al., 2001). This approach has the advantage of allowing the experimenter to know precisely when the image is visible (thereby allowing simple computation of metrics like time to first fixation), and avoids the inherent complications involved in interpreting eye movements through volumetric space. The cost is that this approach assumes a single slice of a volume will be treated the same as if it were viewed within its original volume. As an analogy, it is intuitively clear that we would treat a single frame from a movie much differently from the same frame embedded in the film. Although this approach may be valuable in assessing how a particular image from within a volume might be treated, it certainly misses much of what makes examining volumetric images interesting and important.


A classic example of the additional information provided by scrolling through stacks of images rather than treating each image independently is the so-called popping of lung nodules in chest CT scans. In the mid-1990s, radiologists had embraced the use of dynamic viewing methods for evaluating moving structures like coronary arteriograms, but were hesitant to use this viewing method for most image interpretation tasks (Coughlin et al., 1992). Based on basic science findings suggesting that perceived motion is an important feature for picking structures out of 3D representations (Cavanagh, 1987), Seltzer and colleagues (1995) reasoned that lung nodule detection in chest CT would be improved by viewing stacks of images in sequence rather than the more traditional method of viewing single images one at a time. This is largely due to structural differences between lung nodules and lung vessels that become clear when viewed in sequence but are extremely difficult to piece together over multiple independent slices: nodules are generally spherical while vessels are more tubular. In stack viewing, one quickly finds that an important signal to look for is the rate of change of the diameter of these structures. This source of information is completely absent in 2D viewing of these images. Thus, it is reasonable to expect that the eye movements would be fundamentally different as well.


Another approach to examining search through volumetric space is to display a series of images at a prespecified rate (Bertram et al., 2016; Helbren et al., 2014; Mallett et al., 2014). This allows the researcher fine experimental control of which aspects of the case are displayed at what time and eliminates a large source of variance that occurs when observers are given the ability to freely navigate through the volume. This approach facilitates examination of eye movements in a well-controlled, volumetric environment and ensures that all observers see the same visual stimulus. Thus, the experimenter can be more confident in identifying any observed differences in behavior or eye movement patterns.


In contrast to scripted and pre-rendered video presentations, free-viewing experiments allow radiologists to navigate imaging volumes in a manner that is familiar and congruent with clinical practice (Rubin and Krupinski, 2017). Although allowing the observer to decide how to move through the volume complicates the analysis, it conveys important information: the volumetric search process through stacked cross-sections is ultimately an integration of gaze navigation in two dimensions and scrolling through the third dimension.


For example, one of the foundational findings from the medical image perception literature is that experts tend to make fewer fixations and longer saccades than novices when viewing the same image (Krupinski, 1996; Krupinski et al., 2006; Manning et al., 2006). On some level (whether it be conscious or not), experts know that they do not need to fixate each rib on a radiograph in order to determine whether one of them might be broken. Thus, the pattern of how radiologists move their eyes through an image seems to denote important information about how they are processing the information in that image. Following this logic, we argue that researchers may be able to extract important information by examining how radiologists decide to move through volumetric space. As we will see later in this chapter, it appears that scrolling behavior is a promising method for differentiating between the various strategies employed while moving through stacks of cross-sections.


Although conducting studies using free-viewing conditions has the dual appeal of allowing researchers to examine scrolling behavior and more closely approximate how observers examine these images in clinical practice, these studies are significantly more complex from an analysis perspective. There are many software packages that easily integrate with modern eye-tracking hardware when displaying 2D images. At present, doing so with volumetric images requires customized solutions. All modern eye trackers can be set to export x, y, and t (time) information. The most important factor necessary to enable eye tracking of volumetric images is a precise record of which image cross-section (z) is on the screen at each moment in time throughout the experiment. Accurate registration of both gaze and subject-driven scrolling through cross-sections is necessary for recording the resulting 4D (x, y, z, t) gaze traces (Figure 21.5) (Rubin et al., 2015b).
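In practice, this co-registration amounts to keeping a timestamped log of slice changes alongside the eye tracker’s sample stream, so each gaze sample can later be assigned a z coordinate. A minimal sketch of the merge step; the sample format and field layout are illustrative assumptions rather than any vendor’s API:

```python
import bisect

# Eye-tracker output: (t, x, y) samples.
gaze_samples = [(0.000, 512, 300), (0.004, 514, 302), (0.008, 515, 305)]

# Viewer log: (t, slice_index), appended whenever the displayed slice changes.
scroll_log = [(0.000, 120), (0.006, 121)]
scroll_times = [ts for ts, _ in scroll_log]

def slice_at(t: float) -> int:
    """Return the slice on screen at time t (last change at or before t)."""
    i = bisect.bisect_right(scroll_times, t) - 1
    return scroll_log[max(i, 0)][1]

# Merge into the 4D (x, y, z, t) gaze trace used for offline analysis.
traces = [(x, y, slice_at(t), t) for t, x, y in gaze_samples]
```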





Figure 21.5 Schematic representation of real-time gaze volumetric tracking system used to study search and lesion detection during free navigation through volumetric imaging data sets. CT, computed tomography. (Rubin et al. (2015b).)



21.7 Translating Eye-Tracking Metrics to Volumetric Data



21.7.1 Samples, Fixations, and Smooth-Pursuit Eye Movements


Following preliminary calibration (Tall et al., 2012), eye-tracking systems record gaze position as a set of x, y, and t “samples” at a rate that is typically between 60 and 250 Hz (though rates can range from 30 to more than 1000 Hz). Most modern eye-tracking software packages include automatic methods for aggregating and parsing eye movement samples into fixations and saccades. Fixations are prolonged periods of relatively stable eye position. As such, fixations are used to make inferences about which regions of an image have been viewed or attended. In contrast, saccades are rapid eye movements from one fixation to the next.
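The velocity-threshold (I-VT) scheme that underlies most of these automatic parsers can be sketched in a few lines: compute point-to-point angular velocity and label samples exceeding a threshold as saccadic, leaving the remaining runs as fixation candidates. The 30°/second threshold and the degrees-per-pixel conversion below are illustrative assumptions; production systems tune both to the hardware and viewing geometry.

```python
import numpy as np

def classify_ivt(x, y, t, deg_per_px=0.03, threshold_deg_s=30.0):
    """Label each gaze sample as saccadic (True) or non-saccadic (False)."""
    x, y, t = np.asarray(x), np.asarray(y), np.asarray(t)
    dist_px = np.hypot(np.diff(x), np.diff(y))
    velocity = dist_px * deg_per_px / np.diff(t)   # degrees / second
    # The first sample has no preceding motion; treat it as non-saccadic.
    return np.concatenate([[False], velocity > threshold_deg_s])
```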


Should researchers use these well-established settings in volumetric images? As outlined by Venjakob and Mello-Thoms (2016), this is a tricky question, and the field does not appear to have settled on an answer yet. In volumetric imaging, there are no established analogues or definitions for these basic eye-tracking measures. In a volume, the eyes can remain in a stable position along the 2D plane while the observer’s entire visual input changes through the depth of the image. Eye trackers will classify this event as a fixation, but it is not a fixation in the typical sense of the word.


Two approaches have been reported to resolve this problem. One approach is to envision eye tracker-defined fixations as cylinders that extend into the depth of the volume (Figure 21.6). This can be achieved by co-registering the position in depth at each sampling point and recreating the cylinders offline. This approach can be used to calculate refixations or determine whether an ROI was viewed. However, it does not take into account the length of time spent on a given slice of the image. Therefore, accurate dwell time measures are difficult to calculate.





Figure 21.6 Fixations in volumetric images illustrated as cylinders that extend through the depth of the image.


The second approach is to use the raw sampling data to recalculate fixations offline based on the number of gaze points clustered together in time and space, separately for each slice of the image. This requires co-registering the position in depth at each time point and removing saccades from the raw data prior to analysis. Dwell time on a region of interest can then be calculated by summing across the relevant layers. This method comes closest to the current definition of a fixation, but information about travel through depth might be lost (e.g., fixation duration through depth).
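Once saccades have been removed and each sample carries a z coordinate, dwell time on a volumetric ROI reduces to summing sample durations over every slice on which gaze fell within that slice’s ROI outline. A sketch under the assumption of a fixed sampling interval and a hypothetical per-slice circular ROI:

```python
import math

# Hypothetical lesion ROI: per-slice circles {slice_index: (cx, cy, radius_px)}.
roi = {118: (250, 300, 14), 119: (251, 300, 16), 120: (251, 301, 15)}

def dwell_time(traces, roi, sample_dt=1 / 60):
    """Sum viewing time over 4D (x, y, z, t) samples landing inside the ROI."""
    total = 0.0
    for x, y, z, _t in traces:
        if z in roi:
            cx, cy, r = roi[z]
            if math.hypot(x - cx, y - cy) <= r:
                total += sample_dt
    return total
```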


Rather than isolating fixations from amongst continuous gaze traces, some volumetric imaging studies have used the raw sampling data to calculate their outcome variables (Rubin et al., 2015b). However, raw samples do not isolate saccades, which are typically eliminated from analysis due to saccadic suppression. Saccadic suppression is the finding that visual perception appears to be inhibited (or suppressed) during saccadic eye movements, such that observers do not notice large changes (e.g., words changing positions) that occur during saccades (McConkie and Rayner, 1975). If the raw sampling data are used, periods of time with suppressed visual input might be misinterpreted and the level of noise in the analysis will increase. The distinction between fixations and saccades might not be as important for comparing outcome variables across groups or conditions with equal levels of noise. However, when considering the phenomenon of saccadic suppression, using raw sampling data might be misleading if the researcher aims to determine when the observer views a particular ROI.


Smooth-pursuit eye movements are another important consideration when utilizing eye tracking in volumetric images. Smooth-pursuit eye movements occur when the eye slowly (<100°/second) follows a moving stimulus. Unlike saccades, visual information is not suppressed during a smooth pursuit. Most saccade detection algorithms are based on the velocity and acceleration of the eye position, and these packages assume that if the eye is not moving quickly enough to be considered a saccade, it must be a fixation.


In the vast eye-tracking literature, very few researchers examine smooth-pursuit eye movements because we are incapable of generating them on static stimuli. You may think you can slowly move your eye along the top of this book, but if you ask a friend to watch your eyes as you do so, the friend will see that you are making small saccadic jumps. On the other hand, if your friend moves the book slowly in front of you, your eyes will smoothly track that movement. Volumetric images present a unique opportunity to elicit smooth-pursuit eye movements as discrete structures move through the volumetric space. So, while slowly tracing a rib on a single slice of a CT scan will elicit saccadic eye movements, following a lung vessel through many slices of the same image may elicit a smooth-pursuit eye movement.


Eye trackers are designed to handle static images, and smooth pursuits are often misclassified as long phantom fixations interspersed with phantom saccades (Mital et al., 2011). This is a particularly important consideration when researchers are interested in measures such as overall coverage of an image or categorizing errors based on how long a lesion was fixated. Notably, the fixation coordinates calculated by the eye tracker are slightly inaccurate during a pursuit because the software calculates the average x and y coordinates assuming a relatively stationary eye position.


The simplest approach to solve this issue is to use the raw sampling data to ensure that all periods of visual input are included. However, as previously mentioned, these data will also include periods of saccadic suppression. Advanced algorithms have been developed to correctly identify smooth pursuits based on their velocity and acceleration, but they have not yet been utilized in volumetric imaging research (Larsson et al., 2015, 2016; Mital et al., 2011). To our knowledge, the only method used to measure smooth pursuits in volumetric imaging has been to calculate consecutive gaze points with a specified duration and distance from the ROI (Helbren et al., 2014). The beginning and end of the pursuit are determined based on whether the gaze point is within the UFOV of the ROI boundary. However, it is unclear if this method properly accounts for periods of saccadic suppression and it has not yet been utilized using free-scroll methodology.
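A sketch of that ROI-based pursuit measure: score a pursuit whenever a run of consecutive gaze samples stays within a UFOV-sized radius of the (moving) target position for at least a minimum duration. The radius, run length, and data layout are illustrative assumptions and, as noted above, this measure does not by itself account for saccadic suppression.

```python
def detect_pursuits(traces, target_xy_at, radius_px=80, min_samples=15):
    """Return (start, end) index runs where gaze tracked the moving target.

    traces:       list of (x, y, z, t) gaze samples
    target_xy_at: hypothetical function z -> (tx, ty) giving the target's
                  position on that slice or video frame
    """
    runs, start = [], None
    for i, (x, y, z, _t) in enumerate(traces):
        tx, ty = target_xy_at(z)
        on_target = (x - tx) ** 2 + (y - ty) ** 2 <= radius_px ** 2
        if on_target and start is None:
            start = i                      # run begins
        elif not on_target and start is not None:
            if i - start >= min_samples:   # run long enough to count
                runs.append((start, i))
            start = None
    if start is not None and len(traces) - start >= min_samples:
        runs.append((start, len(traces)))
    return runs
```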


Although identifying smooth pursuits in volumetric images is in its infancy, novel eye-tracking metrics and analogues to classic eye-tracking metrics are already being developed. For example, time to first hit is a common measure used in 2D imaging and has provided important insight on expert search behavior. An analogue, time to first pursuit, proposes a link between volumetric imaging and 2D methods (Helbren et al., 2014). As in 2D images, time to first pursuit decreases with experience.



21.7.2 Apparent Lesion Size


One challenge of volumetric imaging is the changing size of the target. There are two independent phenomena that manifest this challenge. The first and most common occurs when examining a cross-sectional stack. The classical scenario involves a spheroid lesion that, while visible over multiple cross-sections, will exhibit its greatest dimension equatorially with gradually diminishing cross-sectional area for sections further from the lesion center. Many lesions encountered in vivo are not spheroid and possess irregular, complex margins. Depending on the nature of the experiment and the data to be collected, accommodation for these varying dimensions and lesion conspicuity may be necessary.


The second scenario where apparent lesion size will vary is when examining imaging volumes using perspective volume rendering (Rubin et al., 1996), most commonly used for CT colonography. In this setting, the use of perspective to render 3D scenes from an immersive viewpoint within the CT data represents structures closer to the viewpoint as being larger than those further from the viewpoint (Figure 21.7). One study using perspective rendering suggests that experts might be able to identify lesions when displayed with a smaller size (greater distance from the rendered viewpoint) when compared to novice observers (Helbren et al., 2014). In this setting, novel measures such as size at first pursuit might be a promising method for assessing expertise in volumetric imaging.





Figure 21.7 The upper row depicts three approaches to visualizing a three-dimensional scene which is composed of seven spheres of varying sizes as viewed from above. The left upper image depicts the orthographic rendering condition where parallel rays are cast from the scene toward the viewer (arrows), essentially modeling a viewer at very far distance with a powerful telescopic lens. In this setting, the rendered scene (bottom left) represents the spheres with sizes that are equally scaled to their native size. The central images depict perspective volume rendering, where rendering rays converge on a viewpoint represented by the gray camera. The viewpoint is positioned in front of a smaller sphere (intermediate gray), but the scene rendering below depicts the intermediate gray sphere as larger than the other spheres, because of its proximity to the viewpoint. In the scene on the right, the viewpoint is advanced into the data volume, around the intermediate gray sphere (dotted line) and exposing the other six spheres. In this scene, the black-and-white ball appears largest when in reality the ball behind it is the largest. The effects of perspective should be considered when studying medical imaging volumes rendered from immersive viewpoints such as with computed tomography colonography.



21.7.3 Useful Field of View


The UFOV or functional visual field refers to the area around a fixation that can be attended without moving the eyes (Ball et al., 1988). Historically, a UFOV with a 5° diameter has been utilized in 2D medical images (Kundel et al., 1989; Nodine et al., 2002). Cognitive research has shown that the UFOV decreases in size as the complexity of the stimuli increases, but it is unclear how this applies to the depth dimension of a volumetric image (Young and Hulleman, 2013). In the volumetric imaging literature, assumed UFOV values have varied in size, with diameters of 2°, 2.5°, and 5° (Drew et al., 2013b; Ellis et al., 2006; Helbren et al., 2014; Williams and Drew, 2017) (Figure 21.8). In the most detailed analysis to date, Ebner and colleagues (2017) modeled the UFOV as a distance-dependent probability function that varies by reader, lesion size, and background complexity. Extrapolating from this analysis, we might expect that the UFOV will differ across medical image types and may need to be calculated separately for different types of volumetric images. Once the UFOV has been calculated for a given stimulus, it can be used to calculate a number of eye-tracking metrics (e.g., coverage).
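Given a UFOV radius (or, per the best-practice note later in this chapter, a range of radii), coverage can be computed by stamping a disk of that radius at each fixation and measuring the fraction of organ voxels touched. A per-slice NumPy sketch; the organ mask, radius conversion, and the choice to credit only the displayed slice (rather than a cylinder through depth) are assumptions of the example:

```python
import numpy as np

def coverage(fixations, organ_mask, ufov_radius_px):
    """Fraction of organ pixels falling within the UFOV of any fixation.

    fixations:  iterable of (x, y, z) fixation positions
    organ_mask: boolean array (slices, rows, cols), True inside the organ
    """
    covered = np.zeros_like(organ_mask, dtype=bool)
    rows, cols = organ_mask.shape[1:]
    yy, xx = np.mgrid[0:rows, 0:cols]
    for x, y, z in fixations:
        disk = (xx - x) ** 2 + (yy - y) ** 2 <= ufov_radius_px ** 2
        covered[int(z)] |= disk          # credit the slice on display
    return (covered & organ_mask).sum() / organ_mask.sum()
```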





Figure 21.8 Variation in useful field of view for two of six radiologists studied in Ebner et al. (2017). Predicted probabilities of detection are displayed as a function of: (1) distance of the secondary nodule from the primary (on the x-axis); (2) size of the target (black: 5–8 mm; gray: 3–4 mm; colors visible online only); (3) background complexity, defined by the number of particles 13 mm² or greater within a 100-pixel (68-mm) region of interest centered on the target (solid line: median particle count = 20; dashed line: first quartile of particle count = 13; dotted line: third quartile of particle count = 26).



21.7.4 Error Classification


In 2D images, omission errors have been classified into three categories (Kundel et al., 1978): search errors (the observer fails to fixate on the abnormality), recognition errors (the observer briefly fixates on the abnormality but does not mark it), and decision errors (the observer dwells on the abnormality but misclassifies it as normal). In volumetric images, these classifications are made more difficult by the fact that abnormalities change position and size in the x–y plane as the observer scrolls through depth. This requires defining a 3D ROI, co-registering the position in depth at each sampling point, and recalculating fixations or pursuits in the volume offline. Using these techniques, omission errors can be classified in volumetric images using the same definitions used in static images. For example, these types of errors have been identified using similar methods in 3D colonoscopy (Phillips et al., 2013) and chest CT scans (Drew et al., 2013b; Rubin et al., 2015b).
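With total dwell on the 3D ROI in hand (see the dwell-time sketch earlier in this chapter), the Kundel-style classification reduces to thresholding. The 1000-ms cutoff separating recognition from decision errors is a convention carried over from 2D studies and varies across the literature; it should be treated as a tunable assumption.

```python
def classify_omission(total_dwell_ms: float,
                      decision_threshold_ms: float = 1000.0) -> str:
    """Classify a missed lesion from total gaze dwell on its 3D ROI."""
    if total_dwell_ms == 0:
        return "search error"        # abnormality never fixated
    if total_dwell_ms < decision_threshold_ms:
        return "recognition error"   # fixated briefly but not reported
    return "decision error"          # dwelled upon but judged normal
```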



21.7.5 Gaze Paths


As discussed earlier in the chapter, the depth dimension of volumetric images is a particularly challenging issue in terms of experimental control. Researchers frequently attempt to control this additional variable by imposing a single, unidirectional, fixed-velocity video presentation through depth, represented as either a stack of cross-sections (Diaz et al., 2015) or volume-rendered scenes (Bertram et al., 2016; Helbren et al., 2014; Mallett et al., 2014). However, this approach eliminates the ability to study realistic scan paths through the volume.


This is particularly problematic when you consider the vast amount of information that has been gained about expertise by studying scan paths in 2D images. Notably, volumetric studies that allowed the observer to freely navigate through the image have already discovered important differences in how observers scroll through volume. In some cases, certain search strategies have been linked to better outcomes (Diaz et al., 2015; Drew et al., 2013b). Although some experimental questions might benefit from restricting movement through depth, this variable can be effectively utilized for many questions by co-registering the position in depth with each sampling point and recreating the scan paths offline.



21.7.6 Summary


Eye tracking during search through volumetric images comes with a unique set of challenges. However, many of these issues are fundamentally solvable problems. The majority of existing eye-tracking metrics can be adapted to volumetric images. In addition, volumetric imaging has introduced a number of new metrics that may prove to be useful measures of how search occurs in a volume. Currently, the biggest issue in the field is a lack of consistency in presentation mode and eye-tracking measures across studies. This makes it difficult to make comparisons between volumetric studies or to connect findings to the 2D literature. Although the nature of the research question is the determining factor in experimental design, the field will benefit from communicating in the same terms and measures wherever possible. Using eye tracking in volume is further complicated by the need for customized software to calculate eye-tracking metrics offline. However, this issue can be minimized using open science practices and sharing resources between researchers. Eye tracking in volumetric images is still in the early stages, and there are many promising avenues of research to pursue.



21.8 Suggestions for Best Practices in Experimental Design


Although there are a growing number of studies that employ eye tracking to analyze volumetric medical images, there is a huge amount of variability with respect to how the images are displayed and how the resultant data are analyzed. Of course, best practices will always depend on the question being addressed, but we attempt to provide some recommendations below.




  1. More realistic is better … to a point. There is a fine balance between creating a study that manages to capture important aspects of what a clinician sees while maintaining enough experimental control to be able to draw meaningful conclusions from the resultant data. A good example of this dynamic is the question of whether to conduct a study using a real PACS and mobile eye tracking, or an artificial stimulus delivery system and desktop-mounted eye tracker. The first option is a better representation of what a radiologist actually sees in practice, but will yield data that are a great deal more complicated to analyze. Given current mobile eye-tracking systems, the data will also have lower spatial and temporal resolution. In general, we would recommend using the most realistic display possible that yields data that researchers feel confident they can interpret.



  2. Think carefully about samples vs. fixations. There is currently no consensus on the correct answer to this challenge. Using samples erroneously includes saccadic eye movements. However, using simple fixation algorithms treats smooth-pursuit eye movements as fixations and may oversimplify the location of these fixations. The best choice will often depend on which outcome variables are most important to the researcher.



  3. Coverage is more complicated than it appears. Most measures of UFOV in the literature are based on estimates from other investigators using different stimulus materials and different tasks. We know that UFOV varies with both of these factors: UFOV for a strawberry hidden in a bush will certainly be much smaller than for the same target on a white screen. Thus, without preliminary experiments devoted to quantifying UFOV for a particular experiment, UFOV should be treated as a rough estimate. If coverage is an important outcome variable, best practice may be to report results from a range of UFOV estimates (see the coverage sketch after this list).



  4. Do not ignore the volumetric aspects of volumetric search. Although it is tempting to compare volumetric search to better-understood aspects of 2D search, doing so can lead researchers into the trap of missing the unique aspects of volumetric images that make them so interesting and challenging. In our work, we have found that some of the most interesting individual differences relate to how individuals traverse the volume (Drew et al., 2013b). Although eliminating the ability to navigate freely through depth can be appealing from an experimental design perspective, it may obscure an important potential source for understanding how experts search through these sorts of images.



  5. Image presentation methods may present fundamentally different detection tasks from the same source data. In current medical practice, the majority of volumetric imaging data sets are interpreted by paging through a stack of cross-sections. Alternative visualization paradigms such as volume rendering are common in niche applications such as CT colonography. Volume renderings may be presented as static views, pre-rendered movies along defined “flight paths” (Paik et al., 1998), or free explorations driven in real time by the observer. Hybrid methods using stacked slabs of maximum- or minimum-intensity projections represent another niche application popular for lung nodule detection (Napel et al., 1993; Rubin, 2015). Assessments of search strategy within a given imaging data set must be targeted to the format of image presentation.
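To make point 2 concrete, the following is a minimal sketch of a dispersion-based (I-DT-style) fixation detector. The dispersion and duration thresholds are illustrative assumptions rather than recommendations, and the inputs are assumed to be NumPy arrays; note that slow-pursuit movements made while scrolling through depth would still pass the dispersion test, which is exactly the caveat raised above.

```python
import numpy as np

def idt_fixations(t_ms, x, y, max_disp_px=35, min_dur_ms=100):
    """Dispersion-threshold (I-DT-style) fixation detection on raw gaze
    samples. The 35-pixel dispersion and 100-ms duration thresholds are
    illustrative assumptions. Slow-pursuit eye movements made during
    scrolling will pass the dispersion test and be returned as fixations."""
    fixations = []
    n = len(t_ms)
    start = 0
    while start < n:
        end = start
        # Grow the window while the combined x + y dispersion stays small.
        while end + 1 < n:
            xs, ys = x[start:end + 2], y[start:end + 2]
            if (xs.max() - xs.min()) + (ys.max() - ys.min()) > max_disp_px:
                break
            end += 1
        if t_ms[end] - t_ms[start] >= min_dur_ms:
            fixations.append((t_ms[start], t_ms[end],
                              float(x[start:end + 1].mean()),
                              float(y[start:end + 1].mean())))
            start = end + 1
        else:
            start += 1
    return fixations
```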
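Similarly, for point 3, once fixations are available, reporting coverage under a range of UFOV assumptions is straightforward. The helper below is a hypothetical sketch: the lung mask, fixation array, and radii are invented inputs, and the brute-force distance computation is written for clarity rather than efficiency.

```python
import numpy as np

def coverage_by_ufov(fix_xy, lung_mask, radii_px=(25, 50, 75, 100)):
    """Fraction of lung-mask pixels lying within each assumed UFOV radius
    of any fixation. lung_mask is a 2D boolean array for one section and
    fix_xy an (N, 2) float array of fixation centers in pixel coordinates.
    For realistic image sizes, downsample the mask or use a KD-tree
    (e.g., scipy.spatial.cKDTree) instead of brute-force distances."""
    ys, xs = np.nonzero(lung_mask)
    pts = np.column_stack([xs, ys]).astype(float)
    # Distance from every lung pixel to its nearest fixation.
    d = np.linalg.norm(pts[:, None, :] - fix_xy[None, :, :], axis=2).min(axis=1)
    return {r: float((d <= r).mean()) for r in radii_px}
```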



21.9 What We Have Learned



21.9.1 Search Strategies



21.9.1.1 Drillers and Scanners

When faced with a volumetric stack of cross-sectional images, the searcher must decide how to traverse through depth. Hypothetically, if the stack of images were not highly correlated across depth, one might be warranted in searching each level of the stack independently. In practice, due to the high correlation from one level to the next, even naïve observers quickly learn to scroll through depth at much higher rates than they would if the levels were truly independent. However, there is substantial variation in how different radiologists move through depth while examining chest CT scans (Drew et al., 2013b).


Radiologists fall into two broad strategies with respect to how they search the chest CT volumes: scanners and drillers (Figure 21.9; for color figures see Drew et al. (2013b)). Scanners tended to move slowly through depth while making fewer consecutive eye movements within a given quadrant in the lung. Drillers tended to move more quickly through depth with more sequential eye movements within a single quadrant. In our relatively small sample, roughly one-third of our radiologists were categorized as scanners and two-thirds as drillers. Interestingly, drillers find more nodules and appear to examine a higher proportion of the lung tissue in roughly the same amount of time as scanners (Figure 21.10).
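Although the published classification used its own criteria, a rough sketch of how such a strategy summary might be computed from a reconstructed scan path is shown below; the quadrant labels, run-length heuristic, and depth-speed measure are illustrative assumptions, not the rule from Drew et al. (2013b).

```python
import numpy as np

def strategy_summary(quadrant, slice_idx, t_ms):
    """Summarize a reader's traversal: mean consecutive fixation-run length
    within a lung quadrant and mean unsigned scrolling speed through depth.
    Long runs with fast depth movement suggest drilling; short runs with
    slow depth movement suggest scanning. A heuristic sketch only."""
    # Mean length of consecutive fixation runs within the same quadrant.
    runs, current = [], 1
    for prev, cur in zip(quadrant[:-1], quadrant[1:]):
        if cur == prev:
            current += 1
        else:
            runs.append(current)
            current = 1
    runs.append(current)
    # Mean unsigned scrolling speed through depth (slices per second).
    dz = np.abs(np.diff(slice_idx))
    dt_s = np.diff(t_ms) / 1000.0
    return {"mean_quadrant_run": float(np.mean(runs)),
            "slices_per_s": float(dz.sum() / dt_s.sum())}
```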





Figure 21.9 Illustration of the two search strategies, scanning and drilling, observed in volumetric image search.


(Reproduced from Drew et al. (2013b).)




Figure 21.10 (A) Comparison of the types of errors made by scanners and drillers in a nodule detection task. (B) Comparison of lung coverage across time by scanners and drillers.


(Reproduced from Drew et al. (2013b).)

This is a tantalizing finding, but it needs to be replicated with a larger, more diverse sample of radiologists. The current data suggest that drilling may be a superior strategy for a lung nodule detection task in chest CT images. However, it remains to be seen whether this finding generalizes to different images or even to different tasks when examining chest CT images. One could imagine that drilling is a good strategy for lung nodule detection, but a poor one for the detection of incidental findings. In fact, when a large gorilla image was digitally inserted into the final chest CT scan amongst a set of lung cancer screening studies (Figure 21.11), 20 of the 24 radiologists who took part in the study missed this large, unexpected abnormality (Drew et al., 2013c). This is a replication and extension of a famous finding in the cognitive psychology literature known as inattentional blindness (Simons and Chabris, 1999). Inattentional blindness rates did not differ for drillers and scanners, but the study was not designed to examine whether search strategy influences the likelihood of missing incidental findings. Given the ever-growing number of volumetric images radiologists need to search, there is certain to be variability in terms of which strategies are best for a given modality. Unfortunately, we currently lack any sense of the rules governing which strategies are best for a given image type.


Figure 21.11 (A) In the original inattentional blindness study, participants failed to notice a person in a gorilla suit when they were asked to count the number of basketball passes.


(Reproduced from Simons and Chabris (1999).)


(B) Expert observers failed to find a large gorilla inserted into chest computed tomography scans, which demonstrates that experts are not immune to inattentional blindness.


(Reproduced from Drew et al. (2013c).)


21.10 The Role of Peripheral Vision


Through study of 2D projection radiographs, a theory of detection has been advanced that schematizes detection into three sequential activities – search, recognition, and decision (Kundel et al., 1978). In this paradigm, search directs the gaze to regions where a structure such as a lung nodule might be exposed to the visual system for eventual detection. Once within a critical region of the visual field, the next step is recognition that a suspicious finding is present, which then initiates a decision-making process to determine if a lesion is present or not. Kundel (2015) has advanced the hypothesis that the requirement for search is satisfied when a lesion falls within the central field, which is assumed to be within a 5° gaze angle, and recognition is then satisfied if the lesion is exposed to central gaze for at least 100 ms. It is assumed that an undetected lung nodule meeting these criteria is due to an error in decision making.


Using this model, in our analysis of 13 radiologists seeking to detect 157 5-mm lung nodules in 40 CT scans (Rubin et al., 2015b), false-negative nodules were attributable to errors of search, recognition, and decision making in 31.3%, 8.5%, and 60.2% of cases, respectively. From these results, it is intriguing that amongst recognized nodules (those falling within central gaze for at least 100 ms), nodules were seven times more likely to pass undetected as “decision-making errors” than they were to be detected. Given the relatively high conspicuity of 5-mm lung nodules on thin CT sections, we would have expected a large fraction of “recognized” nodules to be detected.


Fundamental to these outcomes are strict definitions of gaze distance and duration used to characterize error in missed nodules. Holding the criteria for search error fixed, Figure 21.12 illustrates how shifting the threshold for confirmed “recognition” from 100 to 600 ms in central vision alters the distribution of error types for the 13 radiologists described above. Given the occurrence of phenomena such as inattentional blindness and the varying conditions that influence them, the application of strict exposure thresholds to classify detection errors in CT may warrant reconsideration. While the distinction between lesions that did or did not receive central vision is relatively straightforward, the line between recognition and decision-making errors is much less clear.
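The sensitivity of this taxonomy to the recognition threshold can be made concrete with a small sketch. The helper below is illustrative only, not the analysis code from Rubin et al. (2015b), and the dwell times in the example are invented.

```python
def classify_miss(dwell_in_central_ms, recog_threshold_ms=100):
    """Classify a missed nodule under the Kundel-style taxonomy: a search
    error if it never entered central (5 degree) vision, a recognition
    error if central dwell fell short of the threshold, and a decision
    error otherwise. Raising recog_threshold_ms from 100 to 600 ms moves
    misses from the decision bin into the recognition bin."""
    if dwell_in_central_ms == 0:
        return "search"
    if dwell_in_central_ms < recog_threshold_ms:
        return "recognition"
    return "decision"

# Hypothetical central-vision dwell times (ms) for five missed nodules,
# re-binned under two thresholds to show the shift in error types.
dwells = [0, 40, 150, 320, 700]
for thr in (100, 600):
    labels = [classify_miss(d, thr) for d in dwells]
    print(thr, {lab: labels.count(lab) for lab in set(labels)})
```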





Figure 21.12 The prevalence of error type in lung nodule detection on computed tomography scans is heavily influenced by the threshold for “recognition.” With 31% of nodules never exposed to central vision (search error), the remaining 69% of missed nodules are classified as errors of either recognition or decision making. When the minimum amount of time required for a nodule to be considered “recognized” is varied from 100 to 600 ms, the frequency of these two error types shifts from 8.5% and 60.2% to 48.8% and 19.9% for recognition and decision-making errors, respectively.


(Data adapted from Rubin et al. (2015b).)

The definition of recognition error is not the only contestable aspect of the proposed model of detection error based upon strict exposure metrics. A definition of “search” error based upon a 5° central gaze angle does not acknowledge the fundamental role of peripheral vision in search. Two observations refute the strict requirement that adequate search necessitates exposure to a fixed definition of central vision:




  1. UFOV is not fixed: it varies by reader, lesion size, and background complexity (Ebner et al., 2017).



  2. The fraction of nodules exposed to central gaze during a reader’s free search through the lungs is substantially greater than the fraction of the lung parenchyma exposed to central gaze (Rubin et al., 2015b).


The second point suggests that the gist of a scene, realized through global processing of information across the whole of the visual field, is key to directing search and depends largely on peripheral vision (Kundel et al., 1987). Global processing of information may be one of the factors underlying the well-known differences between novice and expert medical image perception.


One of the most remarkable feats of expert radiologists is the vast amount of information that can be obtained in a single glance (see Chapter 12 for a more in-depth discussion of gist). Kundel and Nodine discovered that radiologists can achieve 70% accuracy after glimpsing chest radiographs for just 200 ms (Kundel and Nodine, 1975). Similarly, Evans and colleagues (2013) found that experts can identify the presence, but not the location, of abnormalities in mammograms at a level above chance in as little as 250 ms. Furthermore, half of all breast cancers are fixated within just 1 second (Kundel et al., 2007). This growing body of research suggests a two-stage model of visual search (Drew et al., 2013b) (Figure 21.13). In the first stage, experienced individuals rapidly form a global impression of the image that serves to guide the more focused search of stage 2. This ability is thought to be a specialized version of gist processing, which allows observers to categorize a scene in a mere fraction of a second (Potter, 1975). Experience with different scene categories trains the observer to recognize statistical irregularities in the image, and with extensive experience this ability is thought to extend to medical images. In volumetric imaging, it is unclear what role, if any, global processing has in determining which regions of the image to further scrutinize. However, evidence does suggest that expert observers make an initial pass through the volume before making their decision (Diaz et al., 2015).





Figure 21.13 Two-stage model of visual search.


(Reproduced from Drew et al. (2013b).)


21.11 The Moment of Recognition


To explore the role of peripheral vision in the detection of lung nodules during stacked scrolling through transverse CT reconstructions, we analyzed the gaze tracks immediately preceding 997 lung nodule detections and identified a dominant pattern in which an identifiable saccade converged on the nodule and was followed by a period of scrolling and relatively stable eye movement prior to the reader’s confirmation of detection (Rubin et al., 2015a). We classified this latter period as decision making and the moment immediately preceding the saccade as the moment of recognition (Figure 21.14). Upon analyzing the gaze position at the moment of recognition, we found tremendous diversity across readers and nodules. Overall, half of recognition events occurred within a 5° gaze angle, while the other half were beyond the 5° gaze angle and thus classified as occurring within peripheral vision, with the furthest recognition event occurring at a gaze angle of 36°.
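Converting an on-screen gaze-to-nodule distance into a gaze angle requires only the viewing geometry. In the sketch below, the pixel pitch and viewing distance are assumed values for illustration and do not reproduce the calibration used in the study.

```python
import math

def gaze_angle_deg(dist_px, px_pitch_mm=0.25, view_dist_mm=650):
    """Visual angle subtended by an on-screen distance, given an assumed
    display pixel pitch (mm per pixel) and viewing distance (mm)."""
    return math.degrees(math.atan((dist_px * px_pitch_mm) / view_dist_mm))

# Under these assumed values, a 5-degree central field corresponds to
# view_dist_mm * tan(5 deg) / px_pitch_mm, roughly 227 pixels; the pixel
# equivalents in the study differ because its geometry differed.
threshold_px = 650 * math.tan(math.radians(5)) / 0.25
print(round(threshold_px))
```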






Figure 21.14 (A) Free longitudinal (z) search path from a single reader examining a chest computed tomography (CT) scan with three 5-mm lung nodules centered on the lines and visible over the faded bands. Arrows indicate periods when nodules were displayed on visualized cross-sections but were not detected. The boxed region indicates the period when one of the three nodules was detected by the reader. The three nodules were visible for 3.1, 3.3, and 3.4% of the search time and were exposed to central gaze for 0.2, 0.0, and 0.1%, respectively, across the 357-second search duration. (B) The 3.5-second region contained within the boxed zone in (A) is magnified and displayed with the corresponding gaze point samples. Vertical gray zones indicate regions where the target lung nodule is not visible at the extremes of the slab through which the subject scrolls. Selected time points (1–6) are illustrated with the corresponding CT section, gaze point (black circle with 50-pixel radius), target (white circle), and acceptance of the detection (dashed white circle). The subject is positioned such that central gaze (5° gaze angle) is within 90 pixels of the gaze point. At the beginning of the trace, the nodule is not visible, but the subject scrolls down and the nodule is revealed when the gaze is 353 pixels away (1). The gaze then deviates closer to the x, y position of the nodule (2), but moves back to the posterior lung (3). Following a saccade, the gaze shifts anteriorly to within 164 pixels of the nodule (4). A saccade ensues, bringing the gaze within 50 pixels of the nodule, just as the viewer scrolls beyond the nodule, reverses scroll direction, and lands on the nodule (5). After 1 second of scrutinizing the nodule, it is accepted (6). Based upon the location of the final saccade converging on the nodule, the moment of recognition is classified as occurring at the dotted black line; the preceding time period is assigned as search and the subsequent time period as decision making.


Easier-to-detect nodules (measured as a high fraction of readers successfully detecting them) tended to be recognized at greater distances than were nodules that were harder to detect, supporting the observation that context contributes to a narrowing of the UFOV. Additionally, readers varied significantly in their tendency to rely on central versus peripheral vision, with interreader median recognition distance varying by over 100%. However, central versus peripheral dominance in nodule recognition was not correlated with overall detection performance (Figure 21.15). Whether central versus peripheral gaze dominance in nodule recognition is a reflection of innate differences in readers’ visual systems or is learned through training is a topic for future investigation.





Figure 21.15 Box-and-whisker plots show the distribution of the gaze distance from the nodule center at the moment of recognition (MoR) for 13 viewers. The dotted region (≤ 32 mm from the nodule center) represents distances within a 5° gaze angle and corresponds to central gaze. Detections recognized within the lighter zone represent peripheral recognition events. Dark gray, light gray, and white boxes represent readers who primarily detected nodules with central, balanced, or peripheral vision, respectively. The tendency to detect with peripheral vision varies greatly across readers, but is not associated with diagnostic performance (sensitivity values to the right of the plot).


The implications of these observations are intriguing and may support differentiating search strategies based upon a reader’s tendency or ability to recognize lesions with peripheral gaze. For example, search strategies intended to enhance diagnostic performance might not influence readers equally, depending upon their ability to rely on peripheral vision. There is some evidence that readers might subconsciously accommodate their search paths to their reliance on peripheral vision, as readers with greater recognition distances tend to construct searches that expose a smaller fraction of the lungs to their central vision.



21.12 Expertise and Training



21.12.1 Search Patterns


There are a number of measures associated with expertise in medical image perception, but these analyses have largely been restricted to 2D images. As the observer gains experience, accuracy improves, the time before the first detection is reduced, a smaller region of the image is covered, there are fewer fixations, and search time decreases (Krupinski, 1996; Krupinski et al., 2006; Manning et al., 2006).


There have been a few studies using volumetric imaging that have compared search patterns between experts and novices. In addition to increases in accuracy, Mallett and colleagues (2014) found that the time to first pursuit, the volumetric analogue to time to first hit, decreased with expertise. However, no other differences in their eye-tracking metrics were found.


Bertram and colleagues (2016) also found an increase in accuracy with increasing levels of expertise, which was largely due to the fact that expert radiologists found a greater number of low-contrast lesions than residents. They also found a decrease in saccade length on slices containing a lesion, and this reduction was greater for expert radiologists than for residents. Notably, both of these studies used scripted, pre-rendered video segments, which limits the ability to generalize these results to more ecologically valid settings.


Examining transverse brain MR images in patients with stroke, Cooper and colleagues (2009) found that experts fixated on lesions more quickly, made more fixations overall, and fixated for longer periods within an ROI around primary lesions. Observers were able to scroll through depth in this study, but could do so in only one direction. In the only assessment of expertise-based performance to date in which observers were free to scroll throughout a stack of primary CT cross-sections at will, experience was not associated with the sensitivity of lung nodule detection, the overall fraction of the lungs viewed by central vision, or the total search duration (Rubin et al., 2015b).



21.13 Scrolling Through Depth


Studies that have allowed observers to scroll freely through depth have found important differences in movement patterns. For example, as discussed earlier in the chapter, Drew and colleagues (2013b) characterized two methods of searching through a volume: scanning and drilling. Drillers covered more of the lung and found a greater number of nodules than scanners. However, drillers also had more years of experience on average, making it difficult to determine the direction of the effect. In addition, Diaz and colleagues (2015) found that experts tend to make a few passes through the layers of the lung before making a decision. In contrast, novice observers scrolled rapidly back and forth through a few layers at a time and made more direction changes overall.
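One such measure, the number of scroll-direction reversals, is simple to compute from the slice trace; a minimal sketch with a hypothetical slice-index input follows.

```python
import numpy as np

def direction_changes(slice_idx):
    """Count reversals in scroll direction along depth. slice_idx is the
    sequence of displayed slice indices over time (hypothetical input)."""
    steps = np.diff(slice_idx)
    signs = np.sign(steps[steps != 0])  # ignore pauses on a single slice
    return int(np.count_nonzero(np.diff(signs)))
```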
