Part-Based Object Detection and Segmentation

and Dorin Comaniciu¹

(1)

Imaging and Computer Vision, Siemens Corporate Technology, Princeton, NJ, USA

Abstract

The computational framework of Marginal Space Learning (MSL) formulates the detection of an anatomical structure as a whole object. Nevertheless, an object might exhibit different degrees of variations due to many factors, e.g., nonrigid deformation, anatomical variations, or diversity in scanning protocols. When variations are too large for the object to maintain a globally consistent shape or appearance, the MSL cannot be applied directly or the detection robustness may be degraded. In such scenarios the use of part-based models is recommended. Different applications may demand different methods to split the object into parts, enforce constraints during part detection, and aggregate the results of part detectors. In this chapter, we demonstrate the robustness and accuracy of part-based object detection and segmentation on three applications, namely, left atrium segmentation in 3D C-arm CT, left ventricle detection in 2D MRI, and aorta segmentation in 3D C-arm CT.

5.1 Introduction

The computational framework of Marginal Space Learning (MSL) formulates the detection of an anatomical structure as a whole object. Nevertheless, an object might exhibit different degrees of variations due to many factors, e.g., nonrigid deformation, anatomical variations, or diversity in scanning protocols. When variations are too large for the object to maintain a globally consistent shape or appearance, the MSL cannot be applied directly or the detection robustness may be degraded. To deal with this problem we observe that complex anatomical structures can be naturally split into different parts. For example, the aorta can be decomposed into the aortic root, ascending aorta, aortic arch, and descending aorta. Although the global object might exhibit large variations, an object part is subject to less distortion due to its smaller size and better specification, therefore it can be detected more reliably [50, 51].

In some cases, in spite of the fact that the global object can still be detected, the shape initialization accuracy for the subsequent boundary delineation may be degraded due to a large shape variation. The MSL based detection and segmentation uses a mean shape aligned to the estimated pose to provide initial segmentation. If the shape variation is too big, the mean shape cannot represent the whole shape population well; therefore, the shape initialization accuracy may be poor. If the object partitioning is properly implemented, the object parts have a more consistent shape and the MSL can be applied to achieve accurate segmentation of the parts. The part segmentation can then be consolidated into a complete segmentation of the global object. For objects with a large shape variation, a part-based segmentation approach may be more accurate than the holistic approach that treats the anatomical structure as a single object [52, 53].

Anatomical variations are a major challenge in many automatic detection/segmentation problems, and they are typical in the case of the Left Atrium (LA) and Pulmonary Veins (PVs). Accurate segmentation of the whole LA, including the chamber, appendage, and PVs, has important applications in planning and visual guidance for catheter based ablation to treat atrial fibrillation. An automatic segmentation algorithm has to handle all the common anatomical variations. To achieve this, we note that each LA part has a more consistent shape and the anatomical variation is mainly in the connections among the parts. Therefore we developed a part-based detection method to achieve robust and accurate segmentation of the LA. The whole LA structure is split into six parts, namely, the LA chamber, Left Atrial Appendage (LAA), and four major PVs. We train an MSL pose detector for each part. However, if each part is detected independently, detection outliers may occur. For example, due to the similar shape and proximity of neighboring PVs, one PV may be detected as another PV. The solution is to enforce statistical shape constraints among different parts during the detection process, to reduce outliers.

The automatic and precise detection of the Left Ventricle (LV) in Magnetic Resonance Imaging (MRI) is important not only for the analysis of the LV function, but also during the image acquisition process. In the following, we consider the problem of detecting the LV in a long axis view, when the image plane passes through the LV axis, defined by the apex and two basal points on the mitral valve annulus. As a result of the weak constraint, the LV shape variations and changes in the background surrounding the LV are large. In addition, when the four chambers of the heart are visible in the image plane, the LV and Right Ventricle (RV) have a similar appearance. A physician needs to rely on more subtle differences to distinguish the LV and RV (e.g., myocardium thickness). The cardiac motion is another cause of large shape changes for this application. A normal LV pumps at least 50 % of blood out of its cavity; therefore, the size and shape of the LV change significantly from the end-diastolic phase (when the LV size is the largest) to the end-systolic phase (when the LV is the smallest).

Due to all these variations, our approach to the LV detection problem is to use multiple part detectors and an intelligent aggregation scheme that exploits the redundancy among detectors and improves the overall robustness. We call this method ranking based multi-detector aggregation for LV detection in 2D MRI. In addition to the LV detector of the whole chamber, we train two part detectors, one for the LV apex and the other for the base around the mitral valve. During detection, first, each detector is independently run on an input image to generate a few top candidate detections. A ranking based approach is developed to aggregate all information to pick the best detection result.

Another source of variability comes from an object being imaged with different fields of view across scans. For example, in a C-arm Computed Tomography (C-arm CT) scan for Transcatheter Aortic Valve Implantation (TAVI), the aortic arch and descending aorta may be captured in some volumes, but missing in others. To address this challenge, we developed a part-based aorta model. The whole aorta is represented with four parts: aortic root, ascending aorta, aortic arch, and descending aorta. Since the aortic root and aortic arch are more consistent, we train two MSL pose detectors, one for each part. The lengths of the ascending aorta and descending aorta vary considerably; hence, it is difficult to treat them as a consistent object. Instead, they are detected through a tracking approach, by noting that their 2D transaxial view has a shape close to a circle. Starting from the detected aortic root or arch, we track the intersection circle using a trained circle detector. As a result, using a part-based approach, we achieve superior performance even for cases when the aorta is not fully captured by the acquisition. Depending on the parts that are detected, different workflows can be exploited; therefore, a large structural variation can be elegantly handled.

The remainder of this chapter is organized as follows. In Sect. 5.2, we present a part-based detection scheme to handle anatomical variations of the PVs. The ranking based multi-detector aggregation approach is presented in Sect. 5.3 to improve the detection robustness of the LV in 2D MRI. In Sect. 5.4, we describe a part-based aorta detection method to handle variations in the scanning field of view. Conclusions are presented in Sect. 5.5.

5.2 Part-Based Left Atrium Detection and Segmentation in C-arm CT

Affecting more than three million people in the USA [28], Atrial Fibrillation (AF) is the most common cardiac arrhythmia, involving irregular heart beats. AF is associated with an increased risk of stroke, heart failure, cognitive dysfunction, and reduced quality of life. For example, AF patients have a fivefold increased risk of stroke compared to those without AF and about 15–20 % of strokes can be attributed to AF [41]. A widely used minimally invasive surgery to treat AF, the catheter based ablation procedure uses high radio-frequency energy to eliminate the sources of ectopic foci. With the improvement in the ablation technology, this procedure was adopted quickly with 15 % annual increase rate from 1990 to 2005 [22] and the latest estimate of the number of ablations is approximately 50,000/year in the USA and 60,000/year in Europe [9]. The ablation is mainly performed inside the Left Atrium (LA), especially around the ostia of the Pulmonary Veins (PV). Sometimes, other regions may also be ablated, e.g., the roof of the LA chamber, mitral isthmus, and the Left Atrial Appendage (LAA), to treat persistent AF.

Automatic segmentation of the LA has important applications in pre-operative assessment and intra-operative guidance for the ablation procedure [19, 21, 26]. However, there are large variations in the PV drainage patterns [27]. The majority of the population has two separate PVs on each side of the LA chamber, namely the Left Inferior PV (LIPV) and Left Superior PV (LSPV) on the left side, and the Right Inferior PV (RIPV) and Right Superior PV (RSPV) on the right side (Fig. 5.1). A significant proportion (about 20–30 %) of the population has anatomical variations. The most common variations are extra right PVs, where, besides the RIPV and RSPV, one or more extra PVs emerge separately from the right side of the LA chamber, and the left common PV, where the LIPV and LSPV merge into one before joining the chamber. A personalized LA model can help to translate a generic ablation strategy to the patient’s specific anatomy, thus making the ablation strategy more effective for the patient. Fusing the patient-specific LA model together with electro-anatomical maps or overlaying the result onto 2D real-time fluoroscopic images also provides valuable visual guidance during the intervention (Fig. 5.1c).

Most of the existing LA segmentation methods work on Computed Tomography (CT) or MRI data with electrocardiography-gated (i.e., ECG-gated or gated for short) acquisitions, where the boundary between the LA and the surrounding tissues is sufficiently clear to facilitate the segmentation. There is one reported work on LA segmentation on non-gated C-arm Computed Tomography (C-arm CT) [26], which is the most closely related work to ours. To handle the severe imaging artifacts of C-arm CT (e.g., cardiac motion blur), it uses model-based approaches to exploit the prior shape constraints to improve the segmentation robustness. However, because it uses a single holistic shape model to initialize the LA segmentation, Manzke et al.’s method [26] has difficulty handling anatomical variations of the PVs, e.g., the left common PV and extra right middle PVs. The right middle PVs are missing in their LA model. Furthermore, their model does not include the LA appendage. Although the LA appendage itself is less important than the PVs in the catheter based ablations, the ridge between the LA appendage and the LSPV is an important ablation region. Including the LA appendage into the shape model provides better visual guidance for physicians to ablate the ridge. Furthermore, the LA appendage is important for other catheter based interventions, e.g., the occlusion of the LA appendage in AF patients to reduce the risk of stroke [17].

Fig. 5.1

The part-based Left Atrium (LA) mesh model. (a) Meshes of the separate LA parts. (b) Final consolidated mesh model. (c) Overlay of the model onto a fluoroscopic image to provide visual guidance during catheter based ablation

In the following we present a fully automatic LA segmentation system on C-arm CT data. Compared to conventional CT or MRI, one advantage of C-arm CT is that the overlay of the 3D patient-specific LA model onto a 2D fluoroscopic image is straightforward since both the 3D and 2D images are captured on the same device within a short time interval. Normally, a non-gated acquisition is performed for C-arm CT; therefore, it may contain cardiac motion artifacts. On a C-arm system with a small X-ray detector panel (20 × 20 cm²), part of the body may be missing in some 2D X-ray projections due to the limited field of view, resulting in significant artifacts around the margin of a reconstructed volume. In addition, there may be streak artifacts caused by various catheters inserted in the heart, which are less common in CT/MRI. All these present challenges to non-model based segmentation approaches [19, 21, 24], which assume no or little prior knowledge of the LA shape, although they may work well on highly-contrasted CT/MRI data. In our work, these challenges are addressed using a model based approach that also takes advantage of the machine learning based object pose detector and boundary detector.

Instead of using one mean shape model as in [26], the PV anatomical variations are addressed using a part-based model, where the whole LA is split into the chamber, appendage, and four major PVs. Each part is a much simpler anatomical structure compared to the holistic one, and can therefore be detected and segmented using a model based approach. In order to increase robustness, we detect the most reliable structure (the LA chamber in this case) first and use it to constrain the detection of other parts (the appendage and PVs). Experiments show that it is better to treat the LA chamber and appendage as a single object to improve the detection robustness of the appendage. Due to the anatomical variations (e.g., the presence of the left common PV), the relative position of the major PVs to the LA chamber varies. A statistical shape model is used to enforce a proper constraint during the detection of the PVs, i.e., estimating their pose parameters, including position, orientation, and size.

After segmenting the LA parts, a consolidated mesh is generated by resolving the gaps and overlaps among the parts [49]. Extra right middle PVs are then extracted using a graph cuts based approach [4, 5]. To avoid segmentation leakage, the right middle PV extraction is constrained in a region of interest defined by the already segmented RIPV and RSPV [43].

Our segmentation accuracy of the LA chamber and major PVs is comparable to [26], although our validation involves hundreds of datasets, versus only 33 datasets in [26]. Beyond the structures segmented in [26], we also extract the LA appendage and right middle PVs. Taking about 2.6 s to process a volume with 256 × 256 × 250 voxels, the proposed method is much faster than [26]. Our method also compares favorably with the other methods in computation speed, e.g., 5 s in [20] and 5–45 s in [19]. Although the atlas-based methods have the potential to handle anatomical variations by using multiple atlases, their computational efficiency is low, since it may take more than two hours to process one dataset [8].

5.2.1 Part-Based Left Atrium Model

Fig. 5.2

Pulmonary Vein (PV) segmentation results on two datasets. (a) and (b) show a patient with separate left inferior and superior PVs. (c) and (d) show a patient with a left common PV

As shown in Fig. 5.1a, our part-based LA model includes the LA chamber body, appendage, and four major PVs. We reuse the LA chamber model from our four-chamber heart model [45, 47]. The LA chamber surface mesh is composed of 545 mesh points and 1,056 triangles with an opening at the mitral valve.

For AF ablation, physicians only care about a short PV trunk connected to the LA chamber; therefore, we only detect a trunk of 20 mm in length, originating from its ostium. In the case of a left common PV, the PVs after the bifurcation at the distal end of the common PV are modeled, as shown in Fig. 5.2c and d. Each PV is represented as an open-ended tubular structure with a proximal opening on the LA chamber side and a distal opening away from the LA chamber. On the PV mesh, the two openings are represented as two closed contours, namely the proximal ring and the distal ring, respectively. The PV mesh is uniformly resampled to nine rings (including the proximal and distal rings) perpendicular to its centerline and each ring is uniformly resampled to 24 points; therefore, the PV mesh is composed of a total of 216 points and 384 triangles.
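The point and triangle counts follow directly from the ring layout. The sketch below is an illustration only (the function name and the vertex ordering are assumptions, not taken from the original system): it connects each pair of adjacent rings with a band of quadrilaterals, splits every quadrilateral into two triangles, and leaves both ends of the tube open.

```python
def build_tube_topology(num_rings=9, points_per_ring=24):
    """Return (num_points, triangles) for an open-ended tube of stacked rings.

    Vertex index = ring_index * points_per_ring + index_within_ring.
    This indexing scheme is an assumption made for illustration.
    """
    triangles = []
    for r in range(num_rings - 1):              # one band between adjacent rings
        for k in range(points_per_ring):
            k_next = (k + 1) % points_per_ring  # wrap around the closed ring
            a = r * points_per_ring + k
            b = r * points_per_ring + k_next
            c = (r + 1) * points_per_ring + k
            d = (r + 1) * points_per_ring + k_next
            triangles.append((a, b, c))         # split each quad into
            triangles.append((b, d, c))         # two triangles
    return num_rings * points_per_ring, triangles


num_points, triangles = build_tube_topology()
print(num_points, len(triangles))  # 216 384
```

With nine rings there are eight bands, each contributing 24 × 2 triangles, which gives the 384 triangles quoted in the text.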

The LA appendage has a complicated shape, which is composed of many small cavities. On C-arm CT, the boundary between cavities is often blurred due to the cardiac motion artifacts. In our application, it is accurate enough to use a smooth mesh tightly enclosing all the appendage cavities. The shape of the appendage mesh is close to a tilted cone with an opening (called a proximal ring) at the connection to the LA chamber. The centerline from the proximal ring center to the appendage tip defines the orientation of the tilted cone. Similar to the PVs, the appendage mesh is also represented as a set of uniformly distributed circular rings perpendicular to its centerline. Since the appendage is larger and has a more complicated shape than the PVs, it is represented as a denser mesh with 18 rings and each ring with 33 points. The most distal ring is represented as a single point to close the mesh at the appendage tip.

The right middle PVs are an optional component of our part-based LA model as they are only present in a relatively small proportion of patients. The right middle PVs originate on the LA chamber around the area between two major right PVs. The majority of the population (70–80 %) has no middle PVs. However, some patients may have up to three middle PVs [27]. If the origin of a PV is too close to a major PV, it is often difficult to identify if this PV is an independent middle PV or just a side branch of a major PV. Due to these difficulties, we do not have a consistent mesh representation of the right middle PVs. They are extracted using a non-model based graph cuts approach [5].

Our LA model does not include the side branches of a PV since the AF ablation is normally performed around the PV ostia on the LA chamber surface. A side branch may improve the aesthetic effect of the 3D visualization, but is not clinically relevant. The variation of the left PVs is dominated by the left common PV (Fig. 5.2) and it is extremely rare to have extra middle PVs on the left side [27]; therefore, the left middle PVs are not included in our model. Note that the part-based LA model is an internal representation to facilitate the segmentation process in handling anatomical variations. The final LA model presented to physicians is a consolidated mesh with different parts labeled with different colors, as shown in Fig. 5.1b.

5.2.2 Constrained Detection of Left Atrium Parts

Compared to the holistic approach [26], the part-based approach can handle large anatomical variations. The MSL based detection/segmentation method works well for the LA chamber. However, independent detection of other parts is not robust, either due to low contrast at the appendage or small size of PVs. In C-arm CT, the appendage is particularly difficult to detect since the appendage is a pouch without an outlet and the blood flow is slow inside the appendage, preventing complete filling of contrast agent. In many datasets, the appendage is only barely visible. The MSL detector may pick the neighboring LSPV, which often touches the appendage and has higher contrast. However, the relative position of the appendage to the chamber is quite consistent. Experiments show that the best performance is achieved by treating the appendage and chamber as a consolidated object. One MSL based pose detector is trained to detect the combined object.

Through comparison experiments, we found that neither a holistic approach nor independent detection worked well for the PVs (refer to Sect. 5.2.3.2). Therefore, we selected a method to enforce a statistical shape constraint in PV detection. The Point Distribution Model (PDM) [7] is often used to enforce the statistical shape constraint among a set of landmarks in an Active Shape Model (ASM) based segmentation. The shape variation is decomposed into orthogonal deformation modes through Principal Component Analysis (PCA). A deformed shape is projected into a low dimensional deformation subspace to enforce a statistical shape constraint. For each PV, an MSL pose detector can estimate nine pose parameters, i.e., three position parameters ($T_x$, $T_y$, $T_z$), three orientation Euler angles ($O_x$, $O_y$, $O_z$), and three anisotropic scaling parameters ($S_x$, $S_y$, $S_z$). Unlike the conventional PDM, we also want to enforce constraints among the estimated orientation and size of PVs. One solution is to stack all PV pose parameters into a long vector to perform PCA. However, the position and orientation parameters are measured in different units. If not weighted properly, the extracted deformation modes may be dominated by one part of the transformation. Furthermore, the Euler angles are periodic (with a period of 2π), which prevents the application of PCA. Boisvert et al. [3] proposed to build a shape model on a Riemannian manifold that has an intrinsic measurement of the orientation distance. However, they still need to heuristically assign proper weights to the distances in the translation and orientation spaces.
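The periodicity problem can be seen with a two-sample example. The sketch below is an illustration only and is not part of the described system (the helper names are hypothetical): two orientations that lie only 2° apart, on either side of the ±180° wrap-around, have an arithmetic mean that points in the opposite direction. Linear statistics of this kind are what PCA relies on.

```python
import math


def linear_mean_deg(angles_deg):
    """Plain arithmetic mean, as a linear method such as PCA would compute."""
    return sum(angles_deg) / len(angles_deg)


def circular_mean_deg(angles_deg):
    """Mean direction computed on the unit circle, which respects periodicity."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))


angles = [179.0, -179.0]                 # two orientations only 2 degrees apart
print(linear_mean_deg(angles))           # 0.0, i.e. the opposite direction
print(abs(circular_mean_deg(angles)))    # approximately 180
```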

In our solution we use a new representation of the pose parameters to avoid the above problems. The object pose can be fully represented by the object center $\mathbf{T}$ together with three scaled orthogonal axes. As an alternative to the Euler angles, the object orientation can be represented as a rotation matrix $\mathbf{R} = (\mathbf{R_{x}}, \mathbf{R_{y}}, \mathbf{R_{z}})$ where each column of $\mathbf{R}$ defines an axis. The object pose parameters are then given by a four-point set $(\mathbf{T}, \mathbf{V_{x}}, \mathbf{V_{y}}, \mathbf{V_{z}})$, where

$\displaystyle\begin{array}{rcl} \mathbf{V_{x}}& =& \mathbf{T} + S_{x}\mathbf{R_{x}}, \\ \mathbf{V_{y}}& =& \mathbf{T} + S_{y}\mathbf{R_{y}}, \\ \mathbf{V_{z}}& =& \mathbf{T} + S_{z}\mathbf{R_{z}}.{}\end{array}$

(5.1)

The pose of each PV is represented by four points. Besides the constraint among the PVs, we also add the already detected LA chamber center and appendage center to stabilize the detection. In total, we get a set of 18 points and the point distribution shape subspace is learned on the training set.
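As a minimal sketch of Eq. (5.1) and of the 18-point stacking, the code below converts a pose into its four-point set, inverts the conversion, and assembles the points of four PVs plus the two centers into one vector. The function names, the `rot_z` helper, and all numeric pose values are assumptions made for illustration; only the formula and the point count come from the text.

```python
import numpy as np


def pose_to_points(T, R, S):
    """Eq. (5.1): return a 4x3 array with rows (T, V_x, V_y, V_z)."""
    T = np.asarray(T, dtype=float)
    R = np.asarray(R, dtype=float)
    S = np.asarray(S, dtype=float)
    # Column i of R is axis i; scale it by S[i] and offset it by T.
    V = T[None, :] + (R * S[None, :]).T
    return np.vstack([T, V])


def points_to_pose(P):
    """Invert Eq. (5.1): recover (T, R_hat, S) from the four points."""
    T = P[0]
    axes = (P[1:] - T).T               # columns are S_i * R_i
    S = np.linalg.norm(axes, axis=0)
    return T, axes / S, S


def rot_z(angle):
    """Rotation about the z axis, used only to create example poses."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Hypothetical poses for the four major PVs (all values are made up).
pv_poses = [
    (np.array([10.0 * i, 5.0, -3.0]), rot_z(0.2 * i), np.array([8.0, 8.0, 20.0]))
    for i in range(4)
]
chamber_center = np.array([0.0, 0.0, 0.0])       # assumed value
appendage_center = np.array([-25.0, 10.0, 5.0])  # assumed value

points = np.vstack(
    [pose_to_points(*p) for p in pv_poses]
    + [chamber_center[None, :], appendage_center[None, :]]
)
shape_vector = points.ravel()
print(points.shape, shape_vector.shape)  # (18, 3) (54,)
```

Because every entry of the stacked vector is a 3D coordinate in the same unit, no heuristic weighting between translation and orientation is needed before applying PCA.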

After independent detection of the position, orientation, and size of the PVs, we project their poses into a subspace with eight dimensions, which explains about 75 % of the total variation, to enforce a statistical shape constraint. Note that the percentage of preserved variations in the subspace is lower than the typical 95 % used for object segmentation [7]. In a typical setting, the ASM is used for object boundary segmentation; therefore, the subspace needs to be large enough to achieve a flexible and precise segmentation. In our case, the ASM is used to constrain the PV pose estimation. Since the robustness is more important than precision, we found a subspace with a lower dimension is better. The additional variations can be compensated by the following boundary segmentation process.
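The projection step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the training shapes are random stand-ins for real 54-dimensional point vectors, shape alignment and any limits on the mode coefficients are omitted, and the function names are hypothetical.

```python
import numpy as np


def fit_shape_subspace(X, num_modes=8):
    """PCA on training shapes X (num_samples x dim).

    Returns the mean shape, the leading modes (dim x num_modes), and the
    fraction of the total variance explained by those modes.
    """
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    variances = s ** 2
    modes = Vt[:num_modes].T
    explained = variances[:num_modes].sum() / variances.sum()
    return mean, modes, explained


def constrain_shape(x, mean, modes):
    """Project a detected shape onto the subspace spanned by the modes."""
    b = modes.T @ (x - mean)   # coefficients in the deformation subspace
    return mean + modes @ b    # back to point coordinates


rng = np.random.default_rng(0)
train = rng.normal(size=(200, 54))   # synthetic stand-in for training shapes
mean, modes, explained = fit_shape_subspace(train, num_modes=8)

detected = rng.normal(size=54)       # synthetic stand-in for detected poses
constrained = constrain_shape(detected, mean, modes)
print(constrained.shape, round(float(explained), 3))
```

On real training data the explained fraction would be inspected to choose the number of modes; the text reports that eight modes cover about 75 % of the variation.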

After enforcing a statistical shape constraint, the new PV center is given by the point $\mathbf{\hat{T}}$. We can recover the orientation ($\mathbf{\hat{R}}$) and scale ($\mathbf{\hat{S}}$) from points $\mathbf{V_{x}}$, $\mathbf{V_{y}}$, and $\mathbf{V_{z}}$ by simple inversion of Eq. (5.1). However, the estimate $\mathbf{\hat{R}}$ is generally not a true rotation matrix (${\mathbf{\hat{R}}}^{T}\mathbf{\hat{R}}\neq \mathbf{I}$). We want to find the nearest rotation matrix $\mathbf{R_{o}}$ to minimize the sum of squares of elements in the difference matrix $\mathbf{R_{o}} -\mathbf{\hat{R}}$, which is equivalent to

$\displaystyle{ \mathbf{R_{o}} =\arg\min _{\mathbf{R}}\operatorname{Trace}\left({(\mathbf{R} -\mathbf{\hat{R}})}^{T}(\mathbf{R} -\mathbf{\hat{R}})\right), }$

(5.2)

subject to ${\mathbf{R_{o}}}^{T}\mathbf{R_{o}} = \mathbf{I}$. Here, Trace(.) is the sum of the diagonal elements. The optimal solution [18] is given by

$\displaystyle{ \mathbf{R_{o}} = \mathbf{\hat{R}}{({\mathbf{\hat{R}}}^{T}\mathbf{\hat{R}})}^{-1/2}. }$

(5.3)
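Equation (5.3) can be evaluated numerically in a few lines. In the sketch below, the inverse square root of the symmetric matrix ${\mathbf{\hat{R}}}^{T}\mathbf{\hat{R}}$ is computed through an eigendecomposition, which is one possible choice and not necessarily the one used by the authors. The function name and the example matrices are assumptions. The result is orthogonal, and it is a proper rotation whenever the determinant of $\mathbf{\hat{R}}$ is positive.

```python
import numpy as np


def nearest_orthogonal(R_hat):
    """Eq. (5.3): R_o = R_hat (R_hat^T R_hat)^(-1/2).

    Assumes R_hat has full rank so that R_hat^T R_hat is positive definite.
    """
    M = R_hat.T @ R_hat
    w, Q = np.linalg.eigh(M)                     # M = Q diag(w) Q^T
    M_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
    return R_hat @ M_inv_sqrt


# Example: a rotation about z, perturbed so that it is no longer orthogonal.
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
perturbation = 0.05 * np.array([[1.0, -2.0, 0.5],
                                [0.3,  1.0, -1.0],
                                [2.0,  0.1, -0.4]])
R_hat = R_true + perturbation

R_o = nearest_orthogonal(R_hat)
print(np.allclose(R_o.T @ R_o, np.eye(3)))  # True
```

The same matrix can be obtained from the singular value decomposition $\mathbf{\hat{R}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}$ as $\mathbf{U}\mathbf{V}^{T}$, which offers a convenient cross-check.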

Using the statistical shape constraint, a proper configuration of the different LA parts is preserved by the segmentation results. As mentioned before, for image data acquired with a C-arm system equipped with a small X-ray detector panel, a significant portion of a PV may be outside of the field of view. Using the proposed method, the partially missing PV can still be detected correctly.

After segmenting the LA parts, a consolidated mesh is generated by resolving the gaps and overlaps among the parts. Please refer to [49, 53] for more details.

5.2.3 Experiments on Left Atrium Segmentation in C-arm CT

5.2.3.1 Data Sets

We collected 687 C-arm CT datasets, scanned by Siemens Axiom Artis zee C-arm systems at 18 clinical sites in Europe and the USA from June 2006 to April 2011. Among them 253 datasets were scanned with large X-ray detector panels (30 × 40 cm²) and reconstructed to volumes composed of 85–254 slices, each slice containing 256 × 256 pixels. The resolution varies from 0.61 to 1.00 mm. A typical large volume has 256 × 256 × 190 voxels with an isotropic resolution of 0.90 mm. The other 434 datasets were scanned with small X-ray detectors (20 × 20 cm²). Each volume contains 164–251 slices and each slice has 256 × 256 pixels. The resulting volume resolution also varies. A typical small volume has 256 × 256 × 245 voxels with an isotropic resolution of 0.44 mm. Because of the limited field of view of small X-ray detectors, the reconstructed volumes may contain artifacts, especially around the volume margin.

The contrast agent is injected via a pigtail catheter inside the pulmonary artery trunk. A single sweep of the C-arm involving a rotation of 200° in 5 s is performed to capture 2D projections and a 3D volume is reconstructed from all 2D projections belonging to various cardiac phases (the so-called non-gated reconstruction). Such non-gated acquisition often results in a significant amount of motion blur, especially around the septum wall of the LA. In most cases, the LA has sufficient contrast, but inhomogeneous contrast filling is often observed, especially between the left and right PVs because of the different transition time of the contrast agent from the pulmonary artery trunk to the PVs. The LA appendage often lacks contrast because of the slow blood flow inside the appendage. Unlike [26, 35], our imaging protocol uses a single scan to reduce the amount of contrast agent and radiation dose, even for a C-arm system with a small X-ray detector.

To train the proposed segmentation system and perform quantitative evaluation, the LA needs to be annotated on all datasets. The annotation is generated sequentially. First, the LA chamber, appendage, and four major PVs are annotated using the part model presented in Sect. 5.2.1 and are then used to train the part detectors (Sect. 5.2.2). A consolidated mesh is generated from the LA parts using the methods presented in [49]. The mesh is then double checked and segmentation errors are manually corrected if necessary.

5.2.3.2 Quantitative Evaluation of Left Atrium Segmentation

A fourfold cross-validation is performed to evaluate the LA segmentation accuracy. The whole dataset is randomly split into four roughly equal sets. Three sets are used to train the system and the remaining set is reserved for testing. The configuration is rotated, until each set has been tested once. Due to the heterogeneity of the datasets, we train two separate systems, one for the large volumes and one for the small ones, respectively. The evaluation is performed separately for each system.
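The rotation scheme can be sketched as below. The function name, the fixed random seed, and the interleaved fold assignment are assumptions made for illustration; the text only states that the data are randomly split into four roughly equal sets and that each set is tested once.

```python
import random


def k_fold_splits(num_items, k=4, seed=0):
    """Yield (train_indices, test_indices) for each of the k rotations."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)           # random split
    folds = [indices[i::k] for i in range(k)]      # k roughly equal sets
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# The large-volume and small-volume systems are evaluated separately.
for name, count in [("large", 253), ("small", 434)]:
    sizes = [len(test) for _, test in k_fold_splits(count)]
    print(name, sizes)
```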

Table 5.1

Left atrium segmentation errors (based on a fourfold cross-validation) on 253 large C-arm CT volumes. The symmetric point-to-mesh errors, measured in millimeters (mm), are reported

	Holistic		Independent		Proposed
	Mean	Median	Mean	Median	Mean	Median
LA Chamber	1.52	1.38	1.45	1.26	1.41	1.26
Appendage	3.16	2.51	3.39	2.52	2.87	2.20
Left Inf. PV	2.68	2.10	1.67	1.32	1.60	1.30
Left Sup. PV	2.45	1.74	1.72	1.38	1.55	1.15
Right Inf. PV	3.24	2.39	1.85	1.30	1.67	1.34
Right Sup. PV	2.31	1.86	1.42	1.15	1.36	1.21
Whole Mesh Average	2.08	1.94	1.85	1.59	1.70	1.50
Whole Mesh (No Part Label)	1.72	1.61	1.51	1.37	1.47	1.30
Whole Mesh (No Part Label + No LAA)	1.55	1.45	1.30	1.15	1.27	1.14

The LA segmentation accuracy is measured using the symmetric point-to-mesh distance [46]. For each point on a mesh, we search for the closest point on the other mesh to calculate the minimum distance. Different anatomical parts are labeled differently on our consolidated mesh. To include the mesh part labeling errors in the evaluation, when we search for the closest corresponding point, the search is constrained to the region with the same part label. The minimum distances of all mesh points are averaged to calculate the mesh-level error. We calculate the distance from the detected mesh to the ground truth and vice versa to make the measurement symmetric. Our consolidated mesh has an opening around the mitral valve so that the physicians can have an endocardial view inside the LA. The mesh is closed around the distal PVs, although there is no image boundary around that region. Similar to [26], the artificial closing around the distal PVs is excluded from the evaluation (which corresponds to about 5.5 % of the mesh triangles being excluded).
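A simplified version of this label-constrained error can be sketched as follows. Two points are assumptions rather than statements from the text: the closest point on the other mesh is approximated by the closest vertex (a true point-to-mesh distance would also search inside the triangles), and the two directed errors are combined by averaging. Function names and example coordinates are hypothetical.

```python
import numpy as np


def directed_error(src_pts, src_labels, dst_pts, dst_labels):
    """Mean distance from each source point to the closest target vertex
    carrying the same part label (vertex-based approximation)."""
    dists = []
    for label in np.unique(src_labels):
        src = src_pts[src_labels == label]
        dst = dst_pts[dst_labels == label]
        if dst.shape[0] == 0:
            raise ValueError(f"label {label} is missing on the target mesh")
        d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=2)
        dists.append(d.min(axis=1))
    return float(np.concatenate(dists).mean())


def symmetric_error(a_pts, a_labels, b_pts, b_labels):
    """Average of the two directed errors (the averaging is an assumption)."""
    return 0.5 * (directed_error(a_pts, a_labels, b_pts, b_labels)
                  + directed_error(b_pts, b_labels, a_pts, a_labels))


# Toy example: the second "mesh" is the first one shifted by 2 mm along z.
a_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
a_labels = np.array([0, 0, 1])
b_pts = a_pts + np.array([0.0, 0.0, 2.0])
b_labels = a_labels.copy()

print(symmetric_error(a_pts, a_labels, b_pts, b_labels))  # 2.0
```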

Table 5.2

Left atrium segmentation errors (based on a fourfold cross-validation) on 434 small C-arm CT volumes. The symmetric point-to-mesh errors, measured in millimeters (mm), are reported