Conditional Random Fields for Image Synthesis

-weighted images from -weighted images, showing improved synthesis quality as compared to current image synthesis approaches. We also synthesized Fluid Attenuated Inversion Recovery (FLAIR) images, showing similar segmentations to those obtained from real FLAIRs. Additionally, we generated super-resolution FLAIRs showing improved segmentation.

Keywords

Magnetic resonanceImage synthesisConditional random field

1 Introduction

Image synthesis in MRI is a process in which the intensities of acquired MRI data are transformed in order to enhance the data quality or render them more suitable as input for further image processing. Image synthesis has been gaining traction in the medical image processing community in recent years [6, 15], as a useful pre-processing tool for segmentation and registration. It is especially useful in MR brain imaging, where a staggering variety of pulse sequences like Magnetization Prepared Gradient Echo (MPRAGE), Dual Spin Echo (DSE), FLAIR etc. are used to interrogate the various aspects of neuroanatomy. Versatility of MRI is a boon for diagnosticians but can sometimes prove to be a handicap when performing analysis using image processing. Automated image processing algorithms are not always robust to variations in their input [13]. In large datasets, sometimes images are missing or corrupted during acquisition and cannot be used for further processing. Image synthesis can be used as a tool to supplement these datasets by creating artificial facsimiles of the missing images by using the available ones. An additional source of variability is the differing image quality between different pulse sequences for the same subject. An MPRAGE sequence can be quickly acquired at a resolution higher than 1 mm $$^3$$

, which is not possible for FLAIR. Image synthesis can be used to enhance the resolution of existing low resolution FLAIRs using the corresponding high resolution MPRAGE images, thus leading to improved tissue segmentation.

Previous work on image synthesis has proceeded along two lines, (1) registration-based, and (2) example-based. Registration-based [3, 12] approaches register the training/atlas images to the given subject image and perform intensity fusion (in the case of multiple training/atlas pairs) to produce the final synthesis. These approaches are heavily dependent on the quality of registration, which is generally not accurate enough in the cortex and abnormal tissue regions. Example-based approaches involve learning an intensity transformation from known training data pairs/atlas images. A variety of example-based approaches [5, 8, 15] have been proposed. These methods treat the problem as a regression problem and estimate the synthetic image voxel-by-voxel from the given, available images. The voxel intensities in the synthetic image are assumed to be independent of each other, which is not entirely valid as intensities in a typical MR image are spatially correlated and vary smoothly from voxel-to-voxel.

In this work, we frame image synthesis as an inference problem in a probabilistic discriminative framework. Specifically, we model the posterior distribution $\mathrm {p}(\mathbf{y}|\mathbf{x})$ , where $\mathbf{x}$ is the collection of known images and $\mathbf{y}$ is the synthetic image we want to estimate, as a Gaussian CRF [10]. Markov random field (MRF) approaches lend themselves as a robust, popular way to model images. However in a typical MRF, the observed data $\mathbf{x}$ are assumed to be independent given the underlying latent variable $\mathbf{y}$ , which is a limiting assumption for typical images. A CRF, by directly modeling the posterior distribution allows us to side-step this problem. CRFs have been used in discrete labeling and segmentation problems [9]. A continuous-valued CRF, modeled as a Gaussian CRF, was first described in [17]. Efficient parameter learning and inference procedures of a Gaussian CRF were explored in the Regression Tree Fields concept in [7]. We also model the posterior distribution as a Gaussian CRF, the parameters of which are stored in the leaves of a single regression tree. We learn these parameters by maximizing a pseudo-likelihood objective function given training data. Given a subject image, we build the Gaussian distribution parameters from the learned tree and parameters. The prediction of the synthetic subject image is a maximum a posteriori (MAP) estimate of this distribution and is estimated efficiently using conjugate gradient descent.

We refer to our method as Synthesis with Conditional Random Field Tree or SyCRAFT. We applied SyCRAFT to synthesize $$T_2$$

-weighted ( $$T_2$$

-w) images from $$T_1$$

-w (

-w) images and showed a superior quality of synthesis compared to state-of-the-art methods. We also applied our method to synthesize FLAIRs from corresponding $$T_1$$

-w,

-w, and

-weighted ( $$P_D$$

-w) images and showed that tissue segmentation on synthetic images is comparable to that achieved using real images. Finally, we used our method in an example-based super-resolution framework to estimate a super-resolution FLAIR image and showed improved tissue segmentation. In Sect. 2, we describe our method in detail, followed by experiments and results in Sect. 3 and discussion in Sect. 4.

2 Method

2.1 Model

We start with the definition of a CRF, initially proposed in [10]. A CRF is defined over a graph $$G = (V, E)$$

, V and E are the sets of vertices and edges respectively, of G. In an image synthesis context, the set of all voxels i in the image domain form the vertices of V. A pair of voxels $(i,j), i , j \in V$ , that are neighbors according to a predefined neighborhood, form an edge in E. Let $\mathbf{x}= \{\mathbf{x}_1,\ldots ,\mathbf{x}_m\}$ be the observed data. Specifically, $\mathbf{x}$ represents the collection of available images from m pulse sequences from which we want to synthesize a new image. Let $\mathbf{y}$ be the continuous-valued random variable over V, representing the synthetic image we want to predict. In a CRF framework, $\mathrm {p}(\mathbf{y}| \mathbf{x})$ is modeled and learned from training data of known pairs of $(\mathbf{x}, \mathbf{y})$ . Let $\mathbf{y}= \{y_i, i \in V\}$ . Then $(\mathbf{y}, \mathbf{x})$ is a CRF if, conditioned on $\mathbf{x}$ , $$y_i$$

exhibit the Markov property, i.e. $\mathrm {p}(y_i | \mathbf{x}, \mathbf{y}_{V\setminus i} ) = \mathrm {p}(y_i | \mathbf{x}, \mathbf{y}_{\mathcal {N}_i} )$ , where $\mathcal {N}_i = \{ j \mid (i,j) \in E\}$ , is the neighborhood of i.

Assuming $$$\mathrm {p}(\mathbf{y}| \mathbf{x}) > 0, \forall \mathbf{y}$$” src=”/wp-content/uploads/2016/09/A339424_1_En_58_Chapter_IEq30.gif”></SPAN>, from the Hammersley-Clifford theorem, we can express the conditional probability as a Gibbs distribution. The factorization of <SPAN id=IEq31 class=InlineEquation><IMG alt=$ in terms of association potentials and interaction potentials is given as,

$\begin{aligned} \mathrm {p}(\mathbf{y}| \mathbf{x}) = \frac{1}{Z}\exp [-\{\sum _{i\in V} E_{\mathcal {A}}(y_i, \mathbf{x}; \theta ) + \lambda \sum _{i \in V} \sum _{j \in \mathcal {N}_i} E_{\mathcal {I}}(y_i, y_j,\mathbf{x};\theta )\}]. \end{aligned}$

(1)

$E_{\mathcal {A}}(y_i, \mathbf{x}; \theta )$ is called an association potential, defined using the parameter set $\theta$ , $E_{\mathcal {I}}(y_i, y_j,\mathbf{x};\theta )$ is called an interaction potential, $\lambda$ is a weighting factor, and Z is the partition function. If $E_{A}$ and $E_{I}$ are defined as quadratic functions of $\mathbf{y}$ , we can express this distribution as a multivariate Gaussian as below,

$\begin{aligned} \mathrm {p}(\mathbf{y}| \mathbf{x})&= \frac{1}{ (2\pi )^{\frac{|V|}{2}} |\varvec{\varSigma }|^{\frac{1}{2}} }\exp (-\frac{1}{2}(\mathbf{y}-\varvec{\mu }(\mathbf{x}))^T\varvec{\varSigma }(\mathbf{x})^{-1}(\mathbf{y}-\varvec{\mu }(\mathbf{x})))\nonumber \\&= \frac{1}{Z}\exp (-\frac{1}{2}(\mathbf{y}^T\mathbf{A}(\mathbf{x})\mathbf{y}) - \mathbf{b}(\mathbf{x})^{T}\mathbf{y}). \end{aligned}$

(2)

The parameters $\mathbf{A}(\mathbf{x})$ and $\mathbf{b}(\mathbf{x})$ are dependent on the association and interaction potential definitions. In most classification tasks involving CRF’s the association potential is defined as the local class probability as provided by a generic classifier or a regressor [9]. Image synthesis being a regression task, we chose to model and extract both association and interaction potentials from a single regressor, in our case a regression tree. We define a quadratic association potential as

$\begin{aligned} E_{\mathcal {A}}(y_i, \mathbf{x}; \theta ) = \frac{1}{2}(a_{L(i)}y_i^2) - b_{L(i)}y_i, \end{aligned}$

(3)

where $\{a_{L(i)}, b_{L(i)}\} \in \theta$ are the parameters defined at the leaf L(i). L(i) is the leaf where the feature vector $\mathbf{f}_i(\mathbf{x})$ extracted for voxel i from the observed data $\mathbf{x}$ , lands after having been passed through successive nodes of a learned regression tree, $\varPsi$ . The features and regression tree construction are described in Sect. 2.2.

The interaction potential usually acts as a smoothing term, but can also be designed in a more general manner. We define interaction potentials for each type of neighbor. A ‘neighbor type’ $r \in \{1,\ldots ,|\mathcal {N}_i|\}$ is given by a voxel i and one of its $n~(= |\mathcal {N}_i|$ ) neighbors. For example, a neighborhood system with four neighbors (up, down, left, right) has four types of neighbors, and hence four types of edges. The complete set of edges E can be divided into non-intersecting subsets $\{E_1,\ldots , E_r, \ldots ,E_n\}$ of edges of different types. Let the voxel j be such that $(i,j) \in E$ is a neighbor of type r, that is $(i,j) \in E_r$ . Let the corresponding feature vectors $\mathbf{f}_i(\mathbf{x})$ and $\mathbf{f}_j(\mathbf{x})$ land in leaves L(i) and L(j) of the trained tree $\varPsi$ , respectively. The interaction potential is modeled as

$\begin{aligned} E_{\mathcal {I}}(y_i, y_j, \mathbf{x};\theta ) = \frac{1}{2}({\alpha _{L(i)}}_r y_i^2 + {\beta _{L(i)}}_r y_i y_j + {\gamma _{L(i)}}_r y_j^2) - {\omega _{L(i)}}_{1r} y_i - {\omega _{L(i)}}_{2r} y_j . \end{aligned}$

(4)

Let the set of leaves of the regression tree $\varPsi$ be $\mathcal {L}_{\varPsi }$ . Each leaf $l \in \mathcal {L}_{\varPsi }$ stores the set of parameters $\theta _{l} = \{a_l, b_l, {\alpha _{l}}_1,{\beta _{l}}_1, {\gamma _{l}}_{1},{\omega _{l}}_{11}, {\omega _{l}}_{21}, \ldots , {\alpha _{l}}_n,{\beta _{l}}_n,{\gamma _{l}}_{n},{\omega _{l}}_{1n}, {\omega _{l}}_{2n} \}$ . The complete set of parameters is thus, $\theta = \{\theta _{l} | l \in \mathcal {L}_{\varPsi }\}$ . Our approach bears similarity to the regression tree fields concept introduced in [7], where the authors create a separate regression tree for each neighbor type. Thus with a single association potential and a typical 3D neighborhood of 26 neighbors, they would need 27 separate trees to learn the model parameters. Training a large number of trees with large training sets makes the regression tree fields approach computationally expensive. It was especially not feasible in our application with large 3D images, more neighbors, and high dimensional feature vectors. We can however train multiple trees using bagging to create an ensemble of models to create an average, improved prediction. The training of a single regression tree is described in the next section.

2.2 Learning a Regression Tree

As mentioned before, let $\mathbf{x}= \{\mathbf{x}_1, \mathbf{x}_2,\ldots , \mathbf{x}_m\}$ be a collection of co-registered images, generated by modalities $\varPhi _1, \ldots , \varPhi _m$ , respectively. The image synthesis task entails predicting the image $\mathbf{y}$ of a target modality $\varPhi _t$ . The training data thus consists of known co-registered pair of $\{\mathbf{x}, \mathbf{y}\}$ . At each voxel location i, we extract features $\mathbf{f}_i(\mathbf{x})$ , derived from $\mathbf{x}$ . For our experiments we use two types of features, (1) small, local patches, (2) context descriptors. A small 3D patch, denoted by $\mathbf{p}_i(\mathbf{x}) = [\mathbf{p}_i(\mathbf{x}_1), \ldots , \mathbf{p}_i(\mathbf{x}_m)]$ , where the size of the patch is typically $3\times 3\times 3$ and provides us with local intensity information.

We construct the context descriptors as follows. The brain images are rigidly aligned to the MNI coordinate system [4] with the center of the brain approximately at the center of the image. Thus for each voxel we can find out the unit vector $\mathbf{u}$ from the voxel i to the origin. We can define 8 directions by rotating the component of $\mathbf{u}$ in the axial plane by angles $\{0,\frac{\pi }{4},\ldots ,\frac{7\pi }{4}\}$ . In each of these directions, we select average intensities of cubic regions of cube-widths $\{w_1, w_2,w_3,w_4\}$ at four different radii $\{r_1, r_2, r_3, r_4\}$ respectively. This becomes a 32-dimensional descriptor of the spatial context surrounding voxel i. In our experiments we used $$w_1 = 3, w_2 = 5, w_3 = 7, w_4 = 9$$

and

$$r_1 = 4, r_2 = 8, r_3 = 16, r_4 = 32$$

. These values were chosen empirically. We denote this context descriptor by $\mathbf{c}_i(\mathbf{x})$ . The final feature vector is thus $\mathbf{f}_i(\mathbf{x}) = [\mathbf{p}_i(\mathbf{x}),\mathbf{c}_i(\mathbf{x})]$ . $\mathbf{f}_i(\mathbf{x})$ is paired with the voxel intensity $$y_i$$

at i in the target modality image $\mathbf{y}$ to create training data pairs $(\mathbf{f}_i(\mathbf{x}), y_i)$ . We train the regression tree $\varPsi$ on this training data using the algorithm described in [2]. Once the tree is constructed, we initialize $\theta _l$ at each of the leaves $l \in \mathcal {L}_\varPsi$ . $\theta _l$ is estimated by a pseudo-likelihood maximization approach.