2 A Short History of Image Perception in Medical Radiology


Harold Kundel and Calvin F. Nodine



Offering an account of the past, in disciplinary histories as in ethnic and national ones, is in part a way of justifying a contemporary practice. And once we have a stake in a practice, we shall be tempted to invent a past that supports it.


(Appiah, 2008)


2.1 Foreword


Medical radiology is a practical field in which images are produced primarily for the purpose of making inferences about the state of health of people. Research in radiology is also practical. Historically, imaging physicists have concentrated on developing new ways to visualize disease and on improving image quality. They have worked on the development of psychophysical models that express mathematically how observers respond to basic properties of displayed images such as sharpness, contrast, and noise. Radiologists have concentrated on image interpretation, which is using images for diagnosis, follow-up, staging, and classification of disease. Image perception, which is the process of acquiring, selecting, and organizing visual information, has generally been neglected, perhaps because radiologists take for granted their ability to make sense of the patterns in images. Research in image perception has been motivated by two factors: first, the realization that human factors are a major limitation on the performance of imaging systems and second, the appreciation of the extent of human error and variation in image interpretation. Radiologists certainly are surprised when they discover that they either missed a lesion or saw one that really wasn’t there.


This chapter will trace the study of perception and psychophysics as it has unfolded in books and journal articles. We will concentrate on observer error and variation, which have been major stimuli for the development of a body of statistical methodology known as receiver operating characteristic (ROC) analysis and for attempts at understanding the perceptual basis for image interpretation and reader error. The chapter is based in part on material used by one of us (HK) for a talk given in 2003 at the Medical Image Perception Society in Durham, NC, USA. It reiterates material already published in the Journal of the American College of Radiology (Kundel, 2006). Manning et al. (2005) and Eckstein (2001) have also published histories of medical image perception. A review of research and development in diagnostic imaging by Doi (2006) contains observations about image perception, and Metz (2007) has written a tutorial review of ROC analysis that is considerably more detailed and authoritative than the material presented here.



2.2 Fluoroscopes and Fluoroscopy: A Lesson in Optimizing Image System Performance


One of the earliest articles about visual perception in radiology was a discussion of visual physiology and dark adaptation in fluoroscopy by Béclère (1964). Béclère’s article was originally published in 1899, but it was not until 1941 that the impact of dark adaptation on the visibility of details at fluoroscopy was seriously studied. A radiologist, W. Edward Chamberlain, working with the medical physicist George Henny, used the phantom developed by Burger and Van Dijk (1936) to measure contrast-detail curves for fluoroscopic screens. They came to the conclusion that, although the fluoroscopic screens in use at the time were technically almost equal in sharpness and contrast to images on X-ray films, the decrease in visual acuity and intensity discrimination of the retina at low brightness levels “render the available sharpness and contrast more or less invisible” (Chamberlain, 1942). Chamberlain presented the results in the Carman Lecture at the Radiological Society of North America (RSNA) (Chamberlain, 1942) and suggested that a device called an image intensifier, recently patented by Irving Langmuir of the General Electric Research Laboratories, could provide a technological solution to the visibility problem. The subsequent development of the image intensifier (Coltman, 1948) vastly improved fluoroscopy and facilitated the development of cineradiography, cardiac catheterization, and interventional radiology.



2.3 The Personal Equation: Objectively Evaluating Image System Performance


Chamberlain was also involved in the first extensive study of error and variation in radiology. In 1946 the US Public Health Service (USPHS) and the Veterans Administration (VA) initiated an investigation of tuberculosis case finding by the newly developed technique of photofluorography. The VA had the responsibility of evaluating induction and discharge chest radiographs on millions of World War II veterans and wanted to use the best of four imaging techniques available for chest screening: 14 × 17-inch celluloid films, 14 × 17-inch paper negatives, 35-mm photofluorograms, and 4 × 10-inch stereophotofluorograms. A “Board of Roentgenology” chaired by Chamberlain and consisting of two radiologists, three chest specialists, and a statistician was convened to address the issue. They designed a study in which five readers independently interpreted four sets of 1256 cases radiographed using each of the techniques. After a lapse of at least 2 months, the 14 × 17-inch films were interpreted a second time. The results, published in 1947 (Birkelo et al., 1947) in the Journal of the American Medical Association (JAMA), were inconclusive because the variation among readers was greater than the differences among the techniques. The disagreement between readers averaged about 30% and within readers about 20%. Tables 2.1 and 2.2 contain brief extracts of the extensive results.




Table 2.1 Between-observer disagreement. The number of cases read as negative for tuberculosis (neg) by the first reader that were read as positive (pos) by the second reader. The percentage of between-observer disagreement is calculated as 100 × neg/pos

Readers    Neg/pos readings    Percentage interobserver disagreement
N/M        21/62               34
O/M        19/62               31
P/M        27/62               43
Q/M        11/62               18
Average    19/62               31



Table 2.2 Within-observer disagreement. The number of cases read as negative for tuberculosis (neg) on a second reading after receiving an initial positive reading (pos). The percentage within-observer disagreement is calculated as 100 × neg/pos

Readers    Neg/pos readings    Percentage intraobserver disagreement
M          18/118              10
N          4/59                7
O          14/83               17
P          39/96               41
Q          22/106              21
Average    19/92               21

Tables 2.1 and 2.2 illustrate not only the extent of reader disagreement but also the awkward method for summarizing the results. The investigators lacked statistical tools to characterize these data. The JAMA article was accompanied by an editorial (Editorial, 1947) with the title “The ‘personal equation’ in the interpretation of a chest roentgenogram,” which expressed astonishment at the magnitude of observer disagreement and stated: “These discrepancies demand serious consideration.” Indeed, the publication of the USPHS–VA study led to a flurry of activity that is succinctly described by two of the major participants, the radiologist L. Henry Garland (1949) and the project statistician Jacob Yerushalmy (1969).


The phrase “the personal equation” goes back to 1796, when the British Astronomer Royal, Nevil Maskelyne, found that his observations of the time that a certain star crossed the meridian were different from those of his assistant (Stigler, 1968). The transit time was used to set ship navigational clocks and, although the error of eight-tenths of a second only translated into one-quarter of a mile at the equator, it was important to an astronomer. Maskelyne and his assistant tried to get their measurements to agree but after repeated attempts they failed. He fired the assistant! Twenty years later, while writing his Fundamenta Astronomiae, Friedrich Bessel found Maskelyne’s account and did some experiments that also showed observational variation among astronomers. He tried, unsuccessfully, to develop “personal equations” to adjust for the differences between observers.


John’s value = Jane’s value + bias correction        (2.1)

It is ironic that in 1994 an editorial (Editorial, 1994) in the New England Journal of Medicine accompanying an article with the title “Variability in radiologists’ interpretation of mammograms” (Elmore et al., 1994) expressed similar sentiments to those in the JAMA editorial. It is distressing that the authors either ignored or were unaware of 50 years of research on observer variation in radiology.


The results of the VA chest screening study eventually were expressed as underreading and overreading. A number of follow-up studies were designed “with the hope of discovering the components responsible for this variability” (Yerushalmy, 1969). Two groups of the radiologists who participated in the studies were designated CRN for Chamberlain, Rigler, and Newell and GMZ for Garland, Miller, and Zwerling. The CRN results (Newell et al., 1954) were published in a paper titled “Descriptive classification of pulmonary shadows: a revelation of unreliability in the roentgen diagnosis of tuberculosis.” The GMZ results (Garland, 1949) were summarized by L. Henry Garland in his presidential address to the RSNA in 1948 titled “On the scientific evaluation of diagnostic procedures.” His 1959 update of error in radiology and medicine in general is often quoted to support a statement that radiologists disagree with each other 30% of the time (Garland, 1959). Diagnostic unreliability has not gone away. Similar observations about disagreement (Felson et al., 1973; Gitlin et al., 2004; Goddard et al., 2001) are made whenever it is specifically studied.


Garland could not explain the observed variability. He classified the errors using a taxonomic approach that was later elaborated by Smith (1967) and updated by Renfrew et al. (1992; see Table 2.3 in section 2.5). The GMZ group also studied reading strategies, which included dual reading and trying to control the attitude of the reader. The use of dual reading as an error-compensating mechanism was the major practical suggestion that resulted from the USPHS–VA study (Yerushalmy et al., 1950). Garland wrote the following about the attitude studies: “In studying the problem, the group was very conscious of the penalty in the form of over-reading which must be paid for the advantage of a reduction in under-reading” (Garland, 1949). They had hit upon attitude or bias toward a particular outcome as a source of variability and recognized that it influenced the ebb and flow of true and false positives but they could not deal with it because they lacked an adequate model. That model was found in signal detection theory and a radiologist, Lee Lusted (1968), was largely responsible for its introduction into both radiology and the entire medical community.



2.4 Receiver Operating Characteristic Analysis



2.4.1 The Introduction of Signal Detection Theory Into Radiology


The theory of signal detectability was developed by mathematicians and engineers at the University of Michigan, Harvard University, and the Massachusetts Institute of Technology partially as a tool for describing the performance of radar operators. Lee Lusted was exposed to the concepts of signal and noise in 1944 and 1945 while working in the radio research laboratory at Harvard University (Lusted, 1984). In 1954, as a radiology resident at the University of California in San Francisco, he was introduced to the problem of observer error when he participated in one of the film-reading studies supervised by Yerushalmy and Garland. Apparently his mind was prepared for a logical leap when in 1956 and 1957 he encountered a plot of percentage false negatives against percentage false positives in the laboratory of W.J. Horvath, who was responsible for optimizing the performance of the cytoanalyzer, which was a device for automatically analyzing Papanicolaou smears (Horvath et al., 1956). At that time Lusted plotted a “performance curve” for chest X-ray interpretation. In 1959 he showed the curve reproduced in Figure 2.1 in the Memorial Lecture at the RSNA and published it in 1960 (Lusted, 1960). This was the first published example of an ROC curve for performance data from radiology.





Figure 2.1 The operating characteristic curve showing the reciprocal relationship between percentage false negatives and percentage false positives. Most of the data were from studies of the interpretation of photofluorograms for tuberculosis.


(Reproduced from Lusted (1960), with permission.)

Although Figure 2.1 shows a plot of false negatives against false positives, the usual convention, shown in Figure 2.2, is to plot true positives against false positives.





Figure 2.2 A conventional receiver operating characteristic curve showing the reciprocal relationship between the fraction of true positives and the fraction of false positives. The data points are the same as those in Figure 2.1. The curve is a binormal curve with an area under the curve (AUC or Az) of 0.87 that was fit by inspection to the data points.


Lusted (1969) saw the ROC curve as a useful tool to accomplish two things: first, to use a parameter such as the area under the curve (AUC) as a single figure of merit for an imaging system and second, to decrease the observed variability in reports about images by separating the intrinsic detectability of the signal, which is a sensory variable, from the decision criteria, which are a matter of judgment. He stated:



It is very difficult for a human observer to maintain a constant decision attitude over a long period of time. This is a possible explanation for the finding that a radiologist will disagree with his own film interpretation about one out of five times on a second reading of the same films (Lusted, 1969).


He wrote a very influential book, Introduction to Medical Decision Making (Lusted, 1968), and went on to become a founding member of the Society for Medical Decision Making and the first editor of the journal, Medical Decision Making.


Signal detection theory is a psychophysical model that describes an observer’s response in terms of some known or estimated distribution of signal and noise in the stimulus. The theoretical foundations of signal detection theory were laid out in a book by Green and Swets, originally published in 1966 and reprinted with revisions in 1974. ROC analysis, which is derived from the theory of signal detectability, has become a powerful tool in visual systems evaluation. It turns out that some of the variability among observers can be reduced by applying a signal detection theory model. Perhaps the personal equation should be written in terms of the linear equation that describes an ROC curve in normal deviate space.


ROC index of detectability = z(true positive fraction) − z(false positive fraction)        (2.2)

where z is the normal deviate.
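Equation (2.2) can be checked numerically. The minimal Python sketch below uses the inverse standard-normal CDF as the normal deviate z; the function name and the operating point (80% true positives at 20% false positives) are purely illustrative, not from the original studies.

```python
from statistics import NormalDist

def roc_detectability(tpf: float, fpf: float) -> float:
    """Equation 2.2: index of detectability as the difference between the
    normal deviates (z-scores) of the true- and false-positive fractions."""
    z = NormalDist().inv_cdf  # the normal deviate: inverse standard-normal CDF
    return z(tpf) - z(fpf)

# Illustrative operating point: 80% true positives at 20% false positives
print(round(roc_detectability(0.80, 0.20), 3))  # 1.683
```

Because both fractions are transformed to normal deviates, an ROC curve that is a straight line in this space is summarized by its intercept and slope, which is what the binormal model assumes.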



2.4.2 Early Studies of ROC Analysis in Radiology


ROC analysis was not embraced immediately by the image technology evaluation community. The method was unfamiliar and practical examples illustrating experimental design, data collection using rating scales, and ROC parameter calculation were not readily available. Some early studies used an ROC parameter, the index of detectability d′, read from tables published in a book about signal detection theory (Elliott, 1964), to obtain a single estimate of performance from true-positive and false-positive pairs (Kuhl et al., 1972; Kundel et al., 1968). The data lacked estimates of variance, and the use of d′ in the absence of information about the complete ROC curve made assumptions about the ROC parameters that may not have been justified (Metz and Goodenough, 1973).


The situation was improved when Dorfman and Alf (1969) at the University of Iowa published a method using maximum likelihood for estimating the parameters of the ROC curve. Much of the subsequent development of statistical methodology was based on this work. In the 1970s David Goodenough, Kurt Rossmann, and Charles Metz (Goodenough et al., 1972, 1974; Metz et al., 1973) at the University of Chicago demonstrated the use of ROC analysis in the evaluation of film–screen combinations for standard radiography. A number of articles describing the value and the use of ROC technique were published in the 1970s (Lusted et al., 1978; McNeil et al., 1975; Metz, 1978).


In 1979 Swets et al. reported the results of a multi-institutional study, supported by the National Cancer Institute, comparing the accuracy of radionuclide scanning and computed tomography for detecting and classifying brain tumors. The study was more important as a demonstration of the potential power of the ROC methodology for technology evaluation in a clinical environment than as a comparison of two imaging methods. The methodology for evaluating “diagnostic systems” using ROC analysis was described in a book by John Swets and Ronald Pickett (1982) that laid the groundwork for future developments in statistical methodology. A FORTRAN version of the Dorfman–Alf computer program was published in the book as an appendix. This became a prototype for the subsequent development in the 1980s by the group at the University of Chicago of a very influential ROC analysis software package called ROCFIT. It was superseded in the 1990s by a new package called ROCKIT.



2.4.3 Developing Methods for the Statistical Analysis of ROC Data


A test of whether the difference between two values of the area under the ROC curve is due to a real difference in the imaging techniques that yielded the values or just due to chance can be done by calculating a critical ratio (CR), denoted z, which is the ratio of the observed difference (AUC1 − AUC2) to the standard error of the difference, SE(diff). The CR is then used to estimate the probability that the difference is real (or statistically significant) (Hanley and McNeil, 1983).


z = (AUC1 − AUC2) / SE(diff)        (2.3)

The AUC can be calculated using the procedure of Dorfman and Alf (1969). Calculating SE(diff) is not as straightforward. Swets and Pickett (1982) proposed a model that included estimates of the variability due to case sampling, reader sampling (between-reader variability), reader inconsistency (within-reader variability), and the multiple correlations between cases and readers. They also presented a methodology for approximating the estimates. In 1992 Dorfman and Berbaum from the University of Iowa and Metz from the University of Chicago published a method for analyzing rating scale data using a combination of the Dorfman–Alf method for calculating ROC parameters and the classical analysis of variance (Dorfman et al., 1992). The so-called multireader multicase (MRMC) or Dorfman, Berbaum, and Metz (DBM) method separates case variance from within- and between-reader variance, providing a method for deciding whether any observed differences are due to the readers or to the cases.
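As a numerical illustration of the critical ratio of equation (2.3), the Python sketch below computes z and the corresponding two-sided p-value from the standard normal distribution. The AUC values and standard error are hypothetical, chosen only to show the arithmetic; they are not from any of the studies cited here.

```python
from statistics import NormalDist

def critical_ratio(auc1: float, auc2: float, se_diff: float) -> tuple:
    """Equation 2.3: critical ratio z = (AUC1 - AUC2) / SE(diff),
    together with the two-sided p-value from the standard normal."""
    z = (auc1 - auc2) / se_diff
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical example: AUCs of 0.87 and 0.82 with SE(diff) = 0.02
z, p = critical_ratio(0.87, 0.82, 0.02)
print(round(z, 2), round(p, 4))  # 2.5 0.0124
```

The hard part in practice, as the text notes, is estimating SE(diff) itself, since it must account for correlated case, reader, and within-reader variance components.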


Work on methodology has more recently focused on improving techniques to account for variance (Beiden et al., 2001) and on accurate estimates of statistical power (Chakraborty and Berbaum, 2004; Hillis and Berbaum, 2004; Hillis et al., 2005; Obuchowski, 2000).



2.4.4 ROC Analysis Becomes a Standard Method for Technology Evaluation


In 1989 four articles that reviewed the state of the art of ROC analysis were published by the groups that were most active in methodological development: Berbaum et al. (1989) from the University of Iowa, Gur et al. (1989) from the University of Pittsburgh, Hanley (1989) from McGill University, and Metz (1989) of the University of Chicago. The fact that four reviews were published indicates the growing interest in ROC analysis as a methodology for imaging technology evaluation.


A count of the articles in six radiology journals indexed by PubMed that use the phrase “ROC” in the title, the abstract, or the keywords is shown in Figure 2.3.





Figure 2.3 Results of a survey of the indexing in PubMed of six radiology journals: Academic Radiology, Acta Radiologica, American Journal of Roentgenology, British Journal of Radiology, Investigative Radiology, and Radiology. The journal title, abstract, and keywords were queried for “ROC” (receiver operating characteristic). The sudden jump in 1988 may be due to a change in the indexing but the general upward trend is evident.


There has been a steady increase in the number of citations since 1974. Note that the jump in 1988 may be due to an increase in publications but may also be an artifact caused by the addition to the database of abstracts and the keyword “ROC.”


There has been increased use of ROC analysis for technology evaluation and a steady development of the methodology. There is now a new generation of review articles (Metz, 2007; Obuchowski, 2005) and ROC analysis is even beginning to appear in statistics textbooks (Zhou et al., 2002).



2.4.5 Problems with ROC Analysis that Have Not Been Solved


When ROC analysis was first being studied, a few problems were identified that at the time proved to be intractable. They included: (1) analyzing the results from cases with multiple lesions on each abnormal image; (2) accounting for responses given in the wrong location on an abnormal case; and (3) dealing with situations where the diagnostic truth was unknown.


Egan et al. (1961) addressed the issue of multiple signals and called it “the method of free response.” In 1978, Bunch et al. applied the free-response method, now called “free-response ROC” (FROC), to images and proposed a method for analyzing the data that was not entirely satisfactory. In 1989 Chakraborty tackled the problem of data analysis and proposed a model for the FROC curve and a more satisfactory method for analyzing the data. A number of other groups have worked on the problem (Edwards et al., 2002; Obuchowski et al., 2000) and the methodology may now be sufficiently mature to be useful for practical applications (Chakraborty and Berbaum, 2004).


A response in the wrong location on an abnormal image results in two possible decision outcomes: a false positive for the incorrectly located response and a false negative for the lesion that was missed. Which one should be used in the ROC analysis? The so-called location ROC was first tackled by Starr et al. in 1975 and picked up by Swensson (1993, 1996, 2000), who made considerable progress. Chakraborty (2002) has proposed a model that accounts for location errors.


The diagnostic truth remains a problem, especially for those who wish to assemble large verified image databases for use by the research community (Dodd et al., 2004). The original expert panel method used by the GMZ group for studying performance is still in general use despite serious limitations (Revesz et al., 1983). In order to characterize detection, the GMZ group needed to be able to specify whether an image was truly positive or negative. They decided to use a “roentgenographic criterion” and to define positive and negative “not in terms of disease, but in terms of the presence or absence of significant shadows on the roentgenogram” (Yerushalmy, 1969). The procedure was to obtain a large number of interpretations on each image and to call back any individual with a suspicious lesion. The repeat exams were then interpreted again and determined to be either roentgenographically positive or roentgenographically negative. Once the images were dichotomized in this way, reports could be grouped into underreading or missing a positive film (false negatives) and overreading, or calling a negative film positive (false positives) (Garland, 1949). We have not come very far since then. Consensus or agreement of an expert panel is still one of the major methods for determining ground truth. A few investigators have suggested approaches either for establishing an ROC curve without knowledge of the truth (Henkelman et al., 1990) or for estimating reliability rather than accuracy (Kundel and Polansky, 1997; Kundel et al., 2001).



2.5 Classification of Error


The development of methods for describing variability in statistical terms provided a powerful analytical tool but did not explain why the variability existed in the first place. Apparently readers are unable to maintain consistent decision criteria. Lusted (1969) stated:



It is very difficult for a human observer to maintain a constant decision attitude over a long period of time. This is a possible explanation for the finding that a radiologist will disagree with his own film interpretation about one out of five times on a second reading of the same films.


Inconsistency plays only a small part in the generation of error. There are other sources.


From 1964 to 1966 Marcus Smith, a radiologist in New Mexico, with the cooperation of other radiologists and physicians in the area, collected and classified 437 errors using the categories shown in Table 2.3 (Smith, 1967). He did a thorough literature review and related the classification of errors back to the seventeenth-century work of Francis Bacon and Thomas Browne. Renfrew et al. (1992) reviewed and classified errors in 182 cases that were presented at problem case conferences from 1986 to 1990. They used a classification that was similar to that of Smith and found about the same percentage of cases in each category.




Table 2.3 Classification of medical errors by Smith (1967) and Renfrew et al. (1992)

Cause of error (Smith, 1967)   No.   %     Cause of error (Renfrew et al., 1992)   No.   %
Underreading                   209   48    False negatives                          64   35
Complacency¹                    60   14    False positives                          15    8
Lack of knowledge               14    3    Classification                           47   26
Faulty reasoning                43   10    Communication                            18   10
Communication                   66   15    Complications                            38   21
Unknown                         45   10
Total                          437  100                                            182  100

¹ A mixture of false positives and misinterpretations.


It is of interest that in his description of errors of underreading, Smith explicitly includes satisfaction of search (SOS) as a cause.



2.6 Perception of the Medical Image: Understanding the Observer




The process of roentgen diagnosis comprises three basic steps: the recording, the perception, and the interpretation of critical roentgen shadows. Volumes have been written on the recording and interpretation of these shadows but their perception is so spontaneous that radiologists have largely taken it for granted.


(Tuddenham, 1962)


2.6.1 Studies of Visual Search


In 1961 a radiologist, William Tuddenham, gave the Memorial Lecture at the RSNA (Tuddenham, 1962) with the title: “Visual search, image organization and reader error in roentgen diagnosis.” He discussed visual physiology, visual search, and some of the psychological principles that govern what we see in images. He suggested that errors of perception might arise from “incomplete coverage” or “unpatterned search” and that “our quest for meaning may lead us to abandon search prematurely.” He also suggested that improvement in teaching methods could result in decreasing errors. He was not the first radiologist to discuss visual search but he was the first radiologist to actually measure the visual scanning behavior of radiologists (Tuddenham and Calvert, 1961).


Until the development of eye-tracking apparatus, studies of visual search generally used search time as the end point. Tuddenham and Calvert (1961) performed an ingenious visual search experiment in which observers scanned radiographs with a spotlight controlled by a joystick. They reported that the observer who used the most systematic scan path had the worst performance. The subsequent availability of devices for recording eye position made it possible to determine exactly when and where visual information was being collected in the image, and the results were equally surprising. Thomas and Lansdown (1963) used a head-mounted apparatus to record the eye position of radiology residents searching chest images and found that their scanning was not exhaustive but that visual fixations were concentrated on boundaries in the images. Kundel and Wright (1969) recorded the eye position of radiologists and radiology residents viewing chest radiographs that were either normal or contained a solitary pulmonary nodule. They found that some of the scan paths were like surveys with the eyes moving circumferentially around the lungs, some were concentrated on suspicious regions, but most were too complicated to characterize by inspection. The fixations tended to be concentrated on the lungs when the task was lung-specific, “search for nodules,” as opposed to more distributed over the chest image when the task was general, “search for any abnormality.” A study by Kundel and LaFollette (1972) recorded the eye position of medical students, residents, and radiologists. One of the chest radiographs in the test set showed a large right upper-lobe opacity and a small left lower-lobe opacity. The scan paths of six observers at different experience levels are shown in Figure 2.4. By the time students were in the third year of medical school they recognized the lesions and fixated them within 4 seconds.





Figure 2.4 Left, chest radiograph and right, the scan paths recorded during the first 4 seconds of viewing by four medical students (MED 1, MED 2, MED 3, and MED 4), a radiology resident (RES), and a radiologist (RAD). Notice how the pattern of inspection of the lesions changes as well as the prompt fixation of the lesions starting with the third-year medical student.


(Redrawn from Kundel and LaFollette (1972), with permission.)

Figure 2.5 shows the complete scan of the observers shown in Figure 2.4, which averaged about 20 seconds. Note that the fixations were concentrated on the abnormalities, leaving parts of the image without coverage.





Figure 2.5 The complete scan path of first- to fourth-year medical students (MED1–4), radiology resident (RES), and radiologist (RAD) viewing a chest film with a large right upper lobe and a small left lower-lobe opacity. Note the concentration of fixations on the lesions and the more efficient appearance of the pattern of the radiologist.


(Redrawn from Kundel and LaFollette (1972), with permission.)

The two notable findings that were followed up and verified by this and other laboratories were the concentration of fixations on the lesions to the exclusion of other areas of the image and the speed with which an obvious and subtle opacity on a chest radiograph was fixated by observers with appropriate training.



2.6.2 Studies of Gaze Dwell Time on Lung Nodules, Breast Cancers, and Extremity Fractures


Much of the work on analyzing error concentrated on lung nodule detection. It quickly became apparent that sorting through clinical cases to find lung nodules was labor-intensive and yielded nodules with a variety of characteristics that were difficult to quantify. Nodules could be simulated optically without much difficulty, providing an endless supply of chest images with nodules in known locations having uniform, mathematically definable characteristics (Kundel et al., 1968, 1969). Kundel et al. (1978) studied the eye position of observers searching for lung nodules. They measured the location of the axis of gaze and the time (gaze dwell time) during which a 5° useful visual field centered on the axis of gaze included a nodule. They found that, of 20 missed nodules, 30% were never fixated by the useful visual field, 25% were fixated only briefly, and 45% received prolonged visual attention but were not reported. These misses were classified as scanning, recognition, and covert decision errors, respectively.


Further studies using chest radiographs (Kundel et al., 1989), mammograms (Krupinski, 1996), and skeletal radiographs (Hu et al., 1994; Krupinski and Lund, 1997) correlated gaze dwell time with lesion location or suspected lesion location and with decision outcome: true positive, false positive, false negative, and true negative. Fixation survival curves, examples of which are shown in Figure 2.6, show the percentage of fixations that remain after the elapsed gaze duration. The true-positive and false-positive outcomes had the longest fixation dwell times. The false negatives on the lung nodules and the mammograms also showed significant fixation dwell times. Kundel et al. (1990) used fixation dwell time information to provide perceptual feedback about the location of potential lesions, much in the manner of computer-aided diagnosis (CAD). The procedure required gaze tracking while the observer viewed a chest radiograph. Immediately after viewing, the observer pointed out positive locations. All positive locations eventually would be scored as true positive or false positive. Then regions with clusters of fixations that had long dwell times and were not pointed out could be identified as potential false negatives and shown to the observer with an appropriate prompt. The work on perceptual feedback was reviewed by Krupinski et al. in 1998.





Figure 2.6 Survival function curves associated with true-positive, false-positive, true-negative, and false-negative decision outcomes for (a) nodules in chest radiographs, (b) tumors in mammograms, (c) traumatic bone injuries, and (d) fractures of the extremities. The survival function is a plot of the percentage of the total fixations located on the lesion that remained on the lesion after the elapsed gaze time.


(Reproduced from Krupinski et al. (1998), with permission.)

The curves are taken from Figure 3 in Krupinski et al. (1998). The survival functions indicate the probability that a cumulative fixation cluster survives as a function of gaze duration. The vertical lines indicate the percentage of fixations remaining after 1000 ms.
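A survival curve of this kind is straightforward to compute from a list of gaze dwell times: at each elapsed time one counts the fraction of fixation clusters whose dwell has not yet ended. The dwell times below are invented for illustration and are not data from Krupinski et al. (1998).

```python
def survival_curve(dwell_times_ms, grid_ms):
    """Percentage of fixation clusters whose dwell time exceeds each
    elapsed gaze duration on the grid."""
    n = len(dwell_times_ms)
    return [100.0 * sum(1 for d in dwell_times_ms if d > t) / n
            for t in grid_ms]

# Hypothetical dwell times (ms) for clusters on true-positive lesions.
tp_dwell = [450, 800, 1200, 1500, 2100, 2600, 3200, 4000]
print(survival_curve(tp_dwell, grid_ms=[0, 500, 1000, 2000, 4000]))
# → [100.0, 87.5, 75.0, 50.0, 0.0]
```

Reading off the curve at 1000 ms, as the vertical lines in Figure 2.6 do, gives the percentage of fixations that outlast one second of gaze.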



2.6.3 Satisfaction of Search


To our knowledge, Marcus Smith (1967) first used the term “satisfaction of search” (SOS) to describe one possible mechanism for missing lesions, and Tuddenham (1962) used the more general term “satisfaction of meaning.” Neither presented objective evidence for a phenomenon in which the presence of one abnormality on an image blocks the perception of a second abnormality. In a series of elegant experiments, Kevin Berbaum and the group at the University of Iowa showed that the SOS phenomenon does indeed exist (Berbaum et al., 1990, 1991, 1994), and it was verified independently by Samuel et al. (1995). Most of the missed lesions are fixated but not recognized (Berbaum et al., 2000, 2001; Samuel et al., 1995), suggesting that SOS is not strictly a search or scanning problem but rather a suppression of recognition. The final chapter on SOS has yet to be written.



2.6.4 Development of the Concepts of Global Analysis and Holistic Perception


In 1975, Kundel and Nodine, following up the observation of the speed with which experienced observers fixated abnormalities, showed radiologists chest images with a variety of straightforward lung and cardiac abnormalities for 200 ms, equivalent to the duration of a single visual fixation, and found that average performance as measured by ROC analysis was surprisingly good. The readers achieved an average AUC of 0.76 under flash viewing on a set of chest images on which they had achieved an average of 0.96 with unrestricted viewing time. Oestmann et al. (1988) did a flash viewing study using chest radiographs: 40 normal, 40 with subtle nodules, and 40 with obvious nodules. They found that, at a false-positive fraction of 20%, the true-positive fractions for subtle and obvious nodules were 30% and 70% at 0.25 seconds and 74% and 98% at unlimited viewing time, respectively. These experiments reinforced the notion that medical image perception has a major “global” component (Kundel et al., 2007). It seems that most visual pattern recognition occurs at the very onset of viewing and that much of the visual activity that follows is largely confirmatory, although there is an element of discovery search. Current research is focusing on identifying image properties that attract visual attention (Mello-Thoms et al., 2003; Perconti and Loew, 2007).



2.7 Psychophysical Modeling: The Observer–Image Interaction


Psychophysics is the scientific study of the relationship between a stimulus, characterized in physical terms, and an observer’s response specified by either sensitivity or discrimination. It is a vast subject encompassing diverse fields like psychology, neurophysiology, engineering, computer vision, and radiology. This section is intended to show how seminal ideas from psychophysics have influenced studies of medical image perception and to indicate where medical imaging scientists have contributed to the broader field of psychophysics.


Threshold contrast and sharpness were among the earliest psychophysical metrics used by radiological physicists. G.C.E. Burger of the Philips Company was interested in optimizing X-ray imaging systems for examining the lungs by determining which system imaged small details with the most contrast (Burger, 1949). He devised a phantom that produced an image consisting of discs of varying size and contrast and determined the smallest disc that could just be detected at each contrast (Burger, 1950). This produced a unique contrast–detail curve, which is usually plotted as threshold contrast against size and was the prototype for many modern test phantoms.


Contrast–detail curves are a relatively simple way to determine the sensitivity of the observer but they are not adequate descriptors of image system performance. In the early 1960s radiological physicists began to adopt methods for characterizing imaging systems based on the work of a Radio Corporation of America (RCA) engineer, Otto Schade (1964), who introduced the concept of the modulation transfer function (MTF). Simplistically, the MTF is a plot of contrast transfer, expressed as a percentage of a reference value, against spatial frequency expressed in cycles/mm. The advantage of using the MTF is that the total system MTF can be expressed as the product of the MTFs of all of the system components, including the eye. Although necessary, the MTF is still not sufficient to define image quality; image noise also had to be considered and expressed in terms of spatial frequency. Kurt Rossmann, a radiological physicist at the University of Chicago, pointed out that, even after accounting for noise, the diagnostic quality of an image also depended on the task that the observer was asked to perform (Rossmann and Wiley, 1970). Rossmann (1969) stated: “Parameters of the optimal system will depend on the object being radiographed and on the image detail that needs to be detected by the radiologist.” This led to the now popular catchphrase of “task-dependent image quality.”
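The multiplicative property of Schade's formulation can be illustrated numerically. The component curves below are invented values at a handful of spatial frequencies, chosen only to show the cascade; they do not describe any real screen–film system.

```python
# Spatial frequencies (cycles/mm) and invented component MTF values.
frequencies    = [0.5, 1.0, 2.0, 4.0]
mtf_focal_spot = [0.98, 0.95, 0.85, 0.60]
mtf_screen     = [0.95, 0.88, 0.70, 0.40]
mtf_film       = [0.99, 0.98, 0.95, 0.90]

# The total system MTF is the point-wise product of the component MTFs,
# so the system can never outperform its weakest component.
mtf_system = [a * b * c
              for a, b, c in zip(mtf_focal_spot, mtf_screen, mtf_film)]
for f, m in zip(frequencies, mtf_system):
    print(f"{f} cycles/mm: MTF = {m:.3f}")
```

Because each factor is at most 1, the product falls off faster than any single component, which is why improving one weak link (here, the hypothetical screen) pays off across the whole chain.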


In 1966 a radiologist, Russell Morgan of Johns Hopkins University, delivered the annual oration to the RSNA, “Visual perception in fluoroscopy and radiography,” about the analysis of imaging systems. He included the frequency response of the human visual system in his analysis of the imaging chain and introduced a psychophysical model, the “Rose model,” into the analysis. Rose (1948), working in the discipline of optical engineering on television systems, showed how fluctuations in the photons that produce the image determine performance limits for both human vision and electronic imaging systems. This had been pointed out independently by De Vries (1943), who worked in the discipline of psychology and whose work was apparently unknown to Rose. In the computer vision and psychology literature the model is frequently called the De Vries–Rose model (or simply the De Vries model by psychologists).


Rose asserted that the visibility of a target that is brighter than the surround depends upon random fluctuations in the number of photons that arrive at the sensor. He further assumed that the arrival of photons with time followed a Poisson distribution and that the standard deviation of the distribution was equivalent to noise. He asserted that the noise limited the ability to detect contrast and indicated that a minimal signal-to-noise ratio was required for signal detection. He also showed that imaging systems can be evaluated using an absolute scale based on quantum efficiency, which involves counting the number of incoming photons per unit area in a given time. Wagner and Brown (1985) showed how the application of signal-to-noise ratio models could be used to evaluate and compare imaging systems. The Rose model also has been used to determine the statistical efficiency of human contrast discrimination in the presence of noisy backgrounds (Burgess et al., 1981). Burgess (1999) has described the implications and limitations of the Rose model for modern signal detection theory.
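Under these assumptions (Poisson photon statistics, a uniform disc target on a uniform background) the Rose criterion reduces to a one-line calculation: the signal-to-noise ratio is the contrast times the square root of the expected photon count in the target area, and reliable detection requires that this exceed a threshold of roughly 5. The sketch below uses invented numbers.

```python
import math

def rose_snr(contrast, photons_per_mm2, target_area_mm2):
    """Rose-model SNR for a disc target: SNR = C * sqrt(N), where N is the
    expected photon count in the target area and the Poisson standard
    deviation sqrt(N) plays the role of noise."""
    n = photons_per_mm2 * target_area_mm2
    return contrast * math.sqrt(n)

# A 4 mm^2 disc at 5% contrast with 10,000 photons/mm^2 (invented values).
snr = rose_snr(contrast=0.05, photons_per_mm2=10_000, target_area_mm2=4.0)
print(f"SNR = {snr:.1f} -> {'detectable' if snr >= 5 else 'below threshold'}")
```

The square-root dependence is the key point: halving the contrast must be compensated by quadrupling the photon count (i.e., the dose) to keep the same detectability.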


In the 1979 RSNA New Horizons lecture, Kundel (1979) pointed out that existing psychophysical models relating physical image properties to observer responses were inadequate for use in images as structurally complicated as anatomical radiographs. He suggested that “surround complexity” had to be included as a noise component in the psychophysical equation. Detection experiments performed with George Revesz of Temple University had shown that ribs and vascular shadows interfered with the detection of lung nodules (Kundel and Revesz, 1976). They used the term conspicuity, a concept developed by the psychologist Engel (1971) to express the visibility of a target embedded in a structured surround. They attempted unsuccessfully to develop a psychophysical equation to quantify conspicuity (Revesz et al., 1974; Revesz, 1985).


The effect of anatomical structures on nodule detection has been verified by Ehsan Samei and his colleagues (Samei et al., 1999, 2003) and imaging scientists are beginning to incorporate terms that express the effect of structured noise into ideal observer models (Eckstein and Whiting, 1995). An ideal observer is one who can utilize all of the available information and perform a task with minimal cost or error. Actual performance can be compared with ideal observer performance to gain insight into the efficiency of an imaging system. Kyle Myers (2000) has reviewed progress in the development of ideal observer models, and Harry Barrett and Kyle Myers (2003) put the models into the wider context of imaging science. Reviews of the contributions of medical imaging scientists have been written by Chesters (1992), Eckstein (2001), and Burgess (1995). In 2007, a special issue of the Journal of the Optical Society of America featured papers about image quality and observer models from authors in a number of different fields, including radiology (Kupinski et al., 2007). Psychophysical modeling has improved our understanding of perception and still seems to be the most fruitful avenue to the development of criteria and metrics for image quality that reflect human performance.



2.8 Summary and Speculation


Research in medical image perception and psychophysics has been driven by awareness of the extent of human error and variation in the imaging process and by the need to use human beings to evaluate imaging technology. The displayed image is a meaningless grayscale pattern unless viewed and analyzed by an intelligent observer and, luckily for radiologists, computers are not as intelligent as people. Reviews show that the error rate is about the same as when it was first systematically measured over 60 years ago (Goddard et al., 2001; Robinson, 1997). The need for objective comparison of imaging modalities is especially important with the development and commercialization of CAD.


In this historical survey, we have tried to show how the need for technology evaluation led to the development and refinement of ROC analysis by medical imaging scientists and how the physics community used the concepts of statistical decision theory to help develop psychophysical models for image system performance that could be used to evaluate image quality independently of human judgment. We have also shown how a few stalwart workers have tried to improve our understanding of perception itself, the mysterious process by which the eye–brain converts the patterns in light into meaningful representations of the world around us. Understanding the workings of the brain is surely a daunting task, but it leads us to understand how people learn to recognize patterns and holds the promise of showing us how to improve the teaching of image interpretation.


In 1949 Garland enumerated three objectives for future research that are still very relevant:




  1. Determine reliable methods for measuring the relative number of lesions missed by a reader.



  2. Study the probable reasons for missing lesions and their characteristics.



  3. Investigate methods of interpretation that might lead to a reduction in the number of lesions missed.


We have come very far in the area of performance measurement and, although still in their infancy, studies of psychophysical models offer the hope of eliminating, or at least minimizing, the need for observer performance studies. We understand the sources of error a little better, but much work remains in that area. Finally, interpretation methods for improving accuracy are still only dreams. As a field steeped in technology, we have turned to technology in the form of CAD to aid interpretation, but that is not enough. There has been an increase of interest in defining the nature of expertise and the methods for attaining it (Lesgold et al., 1988; Nodine et al., 1996; Norman et al., 1992; Proctor and Dutta, 1995; Wood, 1999). This may be the path to improving our performance, and we believe that further study of expertise should be encouraged because it promises better diagnosis and better patient care.




References


Appiah, K.A. (2008). Experiments in Ethics. Cambridge, MA: Harvard University Press.

Barrett, H.H., Myers, K.J. (2003). Foundations of Image Science. Hoboken, NJ: John Wiley.

Béclère, A. (1964). A physiologic study of vision in fluoroscopic examinations. In: Bruwer, A. (ed.) Classic Descriptions in Diagnostic Roentgenology. Springfield, IL: Charles C. Thomas.

Beiden, S.V., Wagner, R.F., Campbell, G., et al. (2001). Components-of-variance models for random-effects ROC analysis: the case of unequal variance structures across modalities. Acad Radiol, 8, 605–615.

Berbaum, K.S., Dorfman, D.D., Franken, E.A., Jr. (1989). Measuring observer performance by ROC analysis: indications and complications. Invest Radiol, 24, 228–233.

Berbaum, K.S., Franken, E.A., Jr., Dorfman, D.D., et al. (1990). Satisfaction of search in diagnostic radiology. Invest Radiol, 25, 133–140.

Berbaum, K.S., Franken, E.A., Jr., Dorfman, D.D., et al. (1991). Time course of satisfaction of search. Invest Radiol, 26, 640–648.

Berbaum, K.S., El-Khoury, G.Y., Franken, E.A., Jr. (1994). Missed fractures resulting from satisfaction of search effect. Emerg Radiol, 1, 242–249.

Berbaum, K.S., Franken, E.A., Jr., Dorfman, D.D., et al. (2000). Role of faulty decision making in the satisfaction of search effect in chest radiography. Acad Radiol, 7, 1098–1106.

Berbaum, K.S., Brandser, E.A., Franken, E.A., et al. (2001). Gaze dwell times on acute trauma injuries missed because of satisfaction of search. Acad Radiol, 8, 304–314.

Birkelo, C.C., Chamberlain, W.E., Phelps, P.S., et al. (1947). Tuberculosis case finding. A comparison of the effectiveness of various roentgenographic and photofluorographic methods. JAMA, 133, 359–366.

Bunch, P.C., Hamilton, J.F., Sanderson, G.K., et al. (1978). A free-response approach to the measurement and characterization of radiographic observer performance. J Appl Photogr Eng, 4, 166–171.

Burger, G.C.E. (1949). The perceptibility of details in roentgen examinations of the lung. Acta Radiol Diag, 31, 193–222.

Burger, G.C.E. (1950). Phantom tests with X-ray. Philips Technical Review, 11, 291–298.

Burger, G.C.E., Van Dijk, B. (1936). Über die physiologischen Grundlagen der Durchleuchtung [On the physiological foundations of fluoroscopy]. Fortschr Rontg, 54, 492–496.

Burgess, A.E. (1995). Image quality, the ideal observer, and human performance of radiologic detection tasks. Acad Radiol, 2, 522–526.
