



20 Value and Limitations of Observer Models



Lucretiu M. Popescu



20.1 Introduction


In this chapter, we discuss aspects of the utility and limitations of using observer models for task-based image quality evaluation. We can distinguish three main components in what is generically referred to as an observer model.


The first component of the task-based evaluation is the task used in the performance assessment. For example, the task can be the detection of a small signal or the measurement of the contrast of a small region compared with the surrounding background. The second component is the observer model itself: the mathematical model, comprising the sequence of numerical procedures used to perform the task. The third component is the mathematical measure used to convey the task performance of the observer model on a given set of images. For example, in the case of a signal detection task, this can be a signal-to-noise ratio (SNR) or the area under the relative (or receiver) operating characteristic (ROC) curve (Swets, 1973).
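
For concreteness, the area under the ROC curve can be estimated nonparametrically from observer-model outputs as the Wilcoxon/Mann-Whitney statistic. The following is a minimal sketch; the scores and their Gaussian distributions are hypothetical, chosen only for illustration.

```python
import numpy as np

def empirical_auc(signal_scores, background_scores):
    """Wilcoxon/Mann-Whitney estimate of the area under the ROC curve: the
    probability that a random signal-present score exceeds a random
    signal-absent score, with ties counted as one half."""
    s = np.asarray(signal_scores, dtype=float)
    b = np.asarray(background_scores, dtype=float)
    greater = (s[:, None] > b[None, :]).mean()
    ties = (s[:, None] == b[None, :]).mean()
    return greater + 0.5 * ties

# Hypothetical observer-model outputs for 200 signal-present and
# 200 signal-absent images (Gaussian scores chosen only for illustration)
rng = np.random.default_rng(0)
auc = empirical_auc(rng.normal(1.5, 1.0, 200), rng.normal(0.0, 1.0, 200))
print(f"AUC ~ {auc:.3f}")
```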


By observer models, we mainly refer to parametric models, derived by analytic means from statistical considerations and generally of moderate complexity, in the spirit of Barrett et al. (1993) and Wagner and Brown (1985). We spend less time on the emerging artificial intelligence applications for image analysis, which have progressed rapidly in recent years and may, in the not too distant future, overcome many of the limitations discussed in this chapter.


The use of observer models is motivated by the need for convenient means of evaluating image quality that avoid costly human observer experiments while retaining a level of simplicity that allows for an analytic understanding of the performance metric as a function of the basic properties of the original data used for image formation, ideally as a closed-form expression.



20.2 Uses of Observer Models


Any discussion of the utility and limitations of using observer models is dependent on the context in which they are applied and the objectives pursued. In each case, the question to be answered is whether the design of the experiment involving the observer model fits the experiment objectives. Here we list some of the most generic situations for task-based evaluation using observer models.




  • Imaging device optimization. Models used in the design process of new imaging devices for the selection of various device parameters (device design optimization). We can further distinguish between hardware and software (data processing, image reconstruction) designs.



  • Imaging protocol optimization. Models used to set up an imaging protocol for a given diagnostic task. The imaging device is fixed, and the acquisition parameters are to be determined.



  • Comparative testing of different imaging modalities. For a given diagnostic task, models used to select the most effective imaging modality, or combination thereof (e.g., digital mammography vs. digital breast tomosynthesis vs. breast ultrasound).



  • Observer models may also be used in additional applications in response to quality assurance or regulatory requirements (e.g., supporting computed tomography (CT) low-dose claims).



20.3 Desirable Properties of Observer Models


The experimental design and associated performance measure should lead to an evaluation metric satisfying all necessary conditions for a measurement metric as prescribed by measurement theory (Hand, 1996; Suppes and Zinnes, 1963). Without elaborating on measurement theory considerations that are beyond the scope of this chapter, we expect the surrogate metric to provide a representation of the true performance. With M(A) being the surrogate measurement for image A, and I(A) the ideally measured quality of image A (i.e., an agreed-upon gold-standard metric, such as a human reader test), if I(A) > I(B) then we should have M(A) > M(B). From this a transitivity property should also follow: if M(A) > M(B) and M(B) > M(C), then M(A) > M(C). This establishes at least a correct ordinal scale for the performance measurements.
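
One simple empirical check of this ordinal requirement, assuming paired gold-standard and surrogate measurements are available for a set of imaging configurations, is a rank-correlation statistic; the values below are hypothetical and serve only to illustrate the check.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical paired measurements for the same set of imaging configurations:
# I_gold -- gold-standard quality (e.g., human-reader AUC)
# M_surr -- surrogate metric produced by the observer model
I_gold = np.array([0.71, 0.75, 0.80, 0.84, 0.90])
M_surr = np.array([1.10, 1.35, 1.60, 1.95, 2.40])

# Kendall's tau = 1 means every pair of configurations is ordered identically
# by both metrics, i.e., I(A) > I(B) implies M(A) > M(B) for all sampled pairs;
# transitivity then follows from the ordering of the real numbers.
tau, _ = kendalltau(I_gold, M_surr)
print(f"Kendall tau = {tau:.2f}")
```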


However, the surrogate metric provided by the observer model should fulfill additional requirements in order to be of real practical utility. Those include the following.



20.3.1 Sensitivity


Sensitivity is determined by the metric response to differences between imaged objects, or imaging device parameters, relative to the usual variations due to data acquisition noise or sample variability.



20.3.2 Efficiency


The model should provide the desired sensitivity with a minimum number of image samples, and should also require a minimal number of image samples for calibration/training.



20.3.3 Proportionality


The differences between measured performance values should be proportional to the real changes in performance (as measured by a gold standard, usually human observer tests) in order to be able to balance tradeoffs. If strict proportionality is not realized, at least the nature of the relationship (e.g., logarithmic, power law) to the real performance should be known.



20.3.4 Neutrality


When comparing images obtained with different techniques, the observer model should not favor one technique over the other; the model output should be independent of the technique. In that way, the differences can be attributed to differences in inherent image quality and not to differences in the model’s ability to use the information in the images as rendered by a given technique.



20.3.5 Predictive Value


If the model is calibrated/trained for a given set of parameter points, it should be able to provide performance estimates for a range of conditions, in between or beyond the calibration/training points. Basically, the model is useful as a model if it is able to represent the behavior of the system it models with an economical use of training samples and calibration tests, avoiding the need for recalibration/retraining at every new measurement point.


The sensitivity and efficiency properties are related to the significance and power of the statistical hypothesis testing devised for comparing measurements (e.g., noninferiority, equivalence). The proportionality and predictive value properties are related to the interpretation of the results and their relation to reality. It should be noted that the predictive value property typically relies on the proportionality property, since some sort of interpolation or extrapolation is involved.
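
As a toy illustration of such a comparison, a paired-bootstrap noninferiority check might look as follows; the figures of merit, margin, and function name are hypothetical, and this is a sketch rather than a prescribed procedure.

```python
import numpy as np

def noninferiority_bootstrap(fom_new, fom_ref, margin=0.05, n_boot=5000, rng=None):
    """Toy paired-bootstrap noninferiority check on paired figures of merit
    (e.g., per-reader AUCs): the new technique is declared noninferior to the
    reference if the one-sided 95% lower bound of the mean paired difference
    exceeds -margin."""
    rng = rng or np.random.default_rng()
    diff = np.asarray(fom_new, float) - np.asarray(fom_ref, float)
    boot_means = np.array([rng.choice(diff, size=diff.size, replace=True).mean()
                           for _ in range(n_boot)])
    lower = np.percentile(boot_means, 5)
    return lower, bool(lower > -margin)

# Hypothetical paired figures of merit for two techniques
lower, ok = noninferiority_bootstrap([0.86, 0.84, 0.88, 0.85, 0.87],
                                     [0.85, 0.86, 0.87, 0.84, 0.88],
                                     rng=np.random.default_rng(2))
print(f"lower bound of paired difference = {lower:.3f}, noninferior: {ok}")
```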



20.4 Challenges Due to Image Characteristics


In order to further advance our understanding of the conditions in which observer models are applied and the resulting limitations on their use, we have to understand the general properties of the images.


Images are fairly large multidimensional arrays (typically three-dimensional) resulting from applying image reconstruction algorithms to the originally acquired physical data, which are subject to quantum noise. Thus, the images can be seen as random fields with correlated noise.
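
A minimal numerical illustration of this point, not tied to any particular scanner or reconstruction algorithm, is to imprint a correlation length on white noise with a smoothing kernel:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy illustration (not any particular scanner or reconstruction model):
# smoothing white noise with a kernel imprints a correlation length on the
# field, mimicking the correlated noise of reconstructed images.
rng = np.random.default_rng(3)
white = rng.normal(size=(256, 256))              # uncorrelated, quantum-like noise
correlated = gaussian_filter(white, sigma=2.0)   # correlation from system/reconstruction response

# Neighbouring pixels now covary; the noise autocovariance is no longer a delta function.
r = np.corrcoef(correlated[:, :-1].ravel(), correlated[:, 1:].ravel())[0, 1]
print(f"neighbouring-pixel correlation ~ {r:.2f}")
```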


In many, if not all, cases, the original projection data are not uniform due to the presence of the imaged object itself and other geometrical or physical factors, leading to large variations in the physical and statistical properties of the acquired data field.


In most cases the images are obtained from such data fields via image reconstruction algorithms that may have different noise propagation properties, as in the case of analytic and statistical iterative reconstruction algorithms. The resulting images, although originating from the same data, may have markedly different noise patterns that need to be accounted for accordingly.


Due to various particularities of the hardware implementations (e.g., uneven pixel sensitivities and cross-talk, pixel block arrangements, gaps), or because of features in the object to be imaged, discontinuities and other effects may be present with the potential for producing artifacts (e.g., rings, streaks). Therefore, in most cases the image is subject to ad hoc procedures to avoid or mitigate these artifacts, which is one more source of uneven image properties.


In clinical applications another source of variability is anatomical noise. While for a given patient the anatomy is fixed at the time of scanning, for a statistical population of patients some anatomical features vary significantly from case to case, giving overall the appearance of a noisy pattern background (Hoeschen et al., 2005). In certain applications these anatomical variations are the dominant source of uncertainty (Bochud et al., 1999). While this effect can be controlled in experiments using phantoms, it has to be accounted for in tests attempting to get closer to clinical reality.
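
A common way to emulate such structured background variability in simulation studies is a "lumpy" background of randomly placed blobs; the sketch below is a toy version with hypothetical parameter values.

```python
import numpy as np

def lumpy_background(shape=(128, 128), mean_lumps=100, amplitude=1.0,
                     lump_sigma=5.0, rng=None):
    """Toy 'lumpy' background: Gaussian blobs at Poisson-random positions, a
    simple stand-in for anatomical (structured) background variability."""
    rng = rng or np.random.default_rng()
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    img = np.zeros(shape)
    for _ in range(rng.poisson(mean_lumps)):
        cy, cx = rng.uniform(0, shape[0]), rng.uniform(0, shape[1])
        img += amplitude * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2)
                                  / (2.0 * lump_sigma ** 2))
    return img

background = lumpy_background(rng=np.random.default_rng(4))
```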


Summarizing the above considerations, the image can be represented as a random field with correlated noise having varying statistical properties (e.g., noise pattern). Stationarity properties, required by observer models, can be assumed to hold only as local approximations, which may necessitate case-by-case parameter calibrations. Moreover, the images sometimes present random features at medium or full image scale that do not fit well within the observer model framework.



20.5 Challenges Due to Task Definition


Generally, diagnostic tasks involve a combination of local and wider-scale image information, as well as other parameters concerning the patient (e.g., history, results of other tests, demographics). For the evaluation of an imaging system, we can restrict ourselves strictly to the image information available to the observer, but we should be aware of the other factors when interpreting the results.


This leaves us with the local and wider-scale image information. As we have discussed above, however, the real-life situations in which one can formulate, by analytic means, a model valid at wider image scales are very limited. This further restricts us to cases where mainly local information is used. Therefore, typical observer models are limited to small signal detection tasks where practically only local image information is used.



20.5.1 Known Location Small Signal Detection


A large part of the observer model literature involves models performing the detection of signals at a known location, such as the nonprewhitening matched filter (NPWMF) (Sharp et al., 1996) or the channelized Hotelling observer (CHO) (Barrett et al., 1993), which apply a signal template at a fixed position in the image. Even if background variability is included (Rolland and Barrett, 1992), these applications have limited correspondence to real clinical tasks, since, with few conceivable exceptions, there is always some degree of uncertainty about signal location. However, these models are able to produce a ranking and are thus useful, especially in situations where only a few aspects of the compared images are changed. Their application is helped by the existence of a comprehensive body of tools and theoretical formulations.
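
A minimal sketch of these two linear observers is given below, assuming white-noise backgrounds, a Gaussian signal profile, and a small set of hypothetical concentric Gaussian channels (not calibrated visual-system or Laguerre-Gauss channels); all parameter values are illustrative.

```python
import numpy as np

def gaussian_blob(shape, sigma, amplitude=1.0):
    """Rotationally symmetric Gaussian profile centred in the image."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    cy, cx = (shape[0] - 1) / 2.0, (shape[1] - 1) / 2.0
    return amplitude * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))

def npwmf_statistic(images, signal):
    """Nonprewhitening matched filter at a known location: the inner product of
    each image (rows of `images`) with the expected signal profile."""
    return images @ signal.ravel()

def cho_template(signal, channels, background_images):
    """Channelized Hotelling template: estimate the channel covariance from
    signal-absent images, then solve for the Hotelling weights in channel space."""
    v_b = background_images @ channels.T          # channel outputs, (n_images, n_channels)
    cov = np.cov(v_b, rowvar=False)
    s_ch = channels @ signal.ravel()              # channelized signal
    return np.linalg.solve(cov, s_ch)

# --- hypothetical usage on simulated white-noise backgrounds ---------------
rng = np.random.default_rng(0)
shape, n = (64, 64), 500
signal = gaussian_blob(shape, sigma=3.0, amplitude=0.4)

# hypothetical channels: a few concentric Gaussian profiles of increasing width
channels = np.stack([gaussian_blob(shape, s).ravel() for s in (2.0, 4.0, 8.0, 16.0)])

absent = rng.normal(size=(n, signal.size))                    # signal-absent images
present = rng.normal(size=(n, signal.size)) + signal.ravel()  # signal-present images

w = cho_template(signal, channels, absent)
t_sp, t_sa = (present @ channels.T) @ w, (absent @ channels.T) @ w
snr = (t_sp.mean() - t_sa.mean()) / np.sqrt(0.5 * (t_sp.var() + t_sa.var()))
print(f"CHO SNR ~ {snr:.2f}")
```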


This type of observer model produces results in the form of an SNR. It is not uncommon to encounter reports showing high SNR values (> 4) that correspond to quasi-certain detection probabilities. Although some ranking of results is produced in such cases, it is difficult to assess the meaning of their differences. This is an example of lack of metric proportionality, as discussed in section 20.3. Under some theoretical assumptions, the SNR results can be converted to the area under the ROC curve (AUC), AROC, arriving at probabilistic interpretations. In this case, in order to see meaningful differences, the AROC values should be well below unity, which requires tests to be designed with signals that are moderately difficult to detect.
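
Under the equal-variance Gaussian assumption, the conversion is AROC = Φ(SNR/√2); the short sketch below shows how quickly the AUC saturates as the SNR grows, which is why large SNR differences are hard to interpret.

```python
from scipy.stats import norm

def auc_from_snr(snr):
    """Equal-variance Gaussian assumption: AUC = Phi(SNR / sqrt(2))."""
    return norm.cdf(snr / 2 ** 0.5)

for d in (1.0, 2.0, 4.0, 6.0):
    print(f"SNR = {d}: AUC = {auc_from_snr(d):.4f}")
# SNR = 4 already gives AUC ~ 0.998, so differences above that level
# correspond to quasi-certain detection and are hard to interpret.
```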



20.5.2 Unknown Location Small Signal Detection


Another category of small signal detection tasks is the detection of signals at unknown locations or signals requiring search. This, basically, consists of finding the best match of a signal template applied successively to all points in the given image area.


As opposed to the known location case, where the signal realization is compared to the average background at a fixed location, in the unknown location signal detection task the signal realizations (i.e., template-matching values at signal locations) are compared to the extreme background values (i.e., signal template-matching values at background locations). The direct estimation of the distribution of extreme background values is a nontrivial mathematical problem (Adler, 2000); however, it can be estimated empirically by scanning background regions and transparently integrated into the analysis of the scan returns (Popescu, 2011).
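
A minimal sketch of such a scanning observer, assuming a known template and white-noise backgrounds (all values hypothetical), illustrates how the empirical distribution of background maxima can be collected, and how it grows with the scanned area:

```python
import numpy as np
from scipy.ndimage import correlate

def scan_maximum(image, template):
    """Apply the signal template at every position (cross-correlation) and
    return the maximum response and its location."""
    response = correlate(image, template, mode='constant')
    idx = np.unravel_index(np.argmax(response), response.shape)
    return response[idx], idx

# Hypothetical illustration: collect the scan maxima over signal-absent images
# to form an empirical distribution of background extremes; responses at true
# signal locations are then compared against it (ROC/LROC/FROC, by design).
rng = np.random.default_rng(1)
yy, xx = np.mgrid[-8:9, -8:9]
template = np.exp(-(yy ** 2 + xx ** 2) / (2 * 3.0 ** 2))

for area in (64, 128, 256):   # larger search areas push the background maxima up
    maxima = [scan_maximum(rng.normal(size=(area, area)), template)[0]
              for _ in range(50)]
    print(f"{area}x{area} search area: mean background maximum ~ {np.mean(maxima):.2f}")
```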


From this process, instead of an SNR-type expression, an AUC-type measure is obtained directly – for ROC, localization ROC, or free-response ROC, depending on design – which has a direct probabilistic interpretation. Also, the results will depend on the scanned area size, because the probability of extreme background values increases with area size. In this respect, the known location signal detection case can be seen as a particular case corresponding to a zero search area (with δ-function location uncertainty). A more in-depth discussion of the differences between the known and unknown location cases is presented in Popescu and Myers (2013).


Variations in the search mechanism implementation and returns analysis are possible (Popescu, 2011), with the free-response approaches (Chakraborty, 2013) proving more efficient and versatile.



20.5.3 Generalizations


As mentioned in section 20.1, because of their analytic derivations and assumptions, which require calculations of covariance matrices or likelihood ratio look-up tables, the observer model framework seems ill suited for further generalizations of the test task definitions, with, probably, a few exceptions. Fortunately, in the past few years the field of machine learning, or artificial intelligence, has progressed considerably, including applications in medical imaging (Kohli et al., 2017; Wang and Summers, 2012). We expect some of these generalizations to become available shortly as tools for task-based image quality evaluation:




  • Detection of signals at unknown location and imprecisely known size, shape, or orientation. Theoretically, the framework used for signal searching could be generalized to search in a multidimensional parameter space. However, as the model observer framework requires building tables of likelihood ratio estimates, it rapidly becomes impractical as the number of search dimensions increases. Novel deep-learning methods should be more suitable for this.



  • Detection of signals on backgrounds with artifacts. As the applications get closer to the more challenging aspects of image formation, the presence of large-scale features (e.g., streak artifacts, incomplete-data artifacts) has to be accounted for.



  • Morphology and texture classification. As medicine advances toward not just detection but characterization of disease, describing a pathology in terms of morphology and texture allows for addressing a wider range of diagnostic tasks than signal detection alone.



20.6 Matching the Human Observer


One important aspect raised in connection with the observer models is their ability to match human observer performance, which came with the inclusion of filters derived from human visual system response into CHO and NPWMF models (Myers and Barrett, 1987). However, there are a few critiques that can be made about such claims, in particular regarding the attempts to generalize them beyond the original conditions of the tests that prompted them.


First, we can notice that the task of small signal detection at a known location is simple enough to be tractable by mathematical means, but may be somewhat unfamiliar and difficult for human observers. This is the opposite of other situations, such as shape recognition, at which human observers excel but which are ill suited for analytic modeling. Thus, one can take the ideal mathematical model and handicap it through some parameterizations so that, with the variation of a few parameters, it returns results that vary continuously from very good to poor, with the humans sitting somewhere in between. Then, by virtue of function continuity, it would be guaranteed that some parameter values exist for which the model provides results matching the average of the human observers’ results. However, that is not sufficient to claim one is truly modeling the human observer; such a claim needs to be supported by a more systematic match over a range of imaging conditions. Finding a matching parameter point has retrospective value; the predictive value of the model still needs to be demonstrated.
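
As a toy illustration of this handicapping argument (with hypothetical numbers throughout), an internal-noise parameter can always be tuned so that a degraded ideal-like observer reproduces a target human AUC:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Toy illustration (hypothetical numbers): an ideal-like observer with a given
# SNR can be degraded with an internal-noise factor k so that its AUC matches
# any target below the noise-free value; the existence of such a k is
# guaranteed by continuity and, by itself, proves nothing about prediction.
snr_model = 3.0     # hypothetical model SNR without internal noise
auc_human = 0.85    # hypothetical average human-reader AUC

def auc_with_internal_noise(k):
    # internal noise scales the decision-variable variance by (1 + k**2)
    return norm.cdf(snr_model / np.sqrt(2.0 * (1.0 + k ** 2)))

k_match = brentq(lambda k: auc_with_internal_noise(k) - auc_human, 0.0, 100.0)
print(f"internal-noise factor matching the human AUC: k = {k_match:.2f}")
```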


Second, the mechanistic argument of incorporating visual system response filters would apply only to cases where human observers read images from a fixed, predefined distance, which differs from the actual practice of image reading. Moreover, the visual system response function stops short of explaining the decision mechanism employed by the human brain, a mechanism that is not yet fully understood. Also, the CHO model is often applied with other types of channel parameterizations, the driving motivation being dimensionality reduction in order to make the computation of the covariance matrix more tractable.


The generalizations of observer models for signal searching show more promise for achieving a systematic match with human experiment results (Gifford, 2014), as the task is closer to the human image reading experience (Kundel et al., 1989), and they explicitly allow for incorporating decision mechanisms (Chakraborty, 2006; Popescu, 2008). However, aspects like search area, signal size, and localization tolerance radius introduce extra degrees of freedom that may leave some ambiguities in the interpretation of the results.
