Multimodal Joint Generative Model of Brain Data

given an observed image $\tilde{x}$ from a different imaging modality and a database of N previously observed image pairs $\mathcal {X}=\{y_n,x_n\}$ , indexed by n. Note that the technique can be easily extended to more than 2 modalities. The process of collapsing all the available information from $\mathcal {X}$ into one single estimate $E[\tilde{y}]$ ignores both the fact that generating $\tilde{y}$ is an uncertain process an that $p(\tilde{y})$ can be multimodal. Thus, contrary to [2, 4], where only the expected image $E[\tilde{y}]$ is generated, we characterise $p(\tilde{y},\tilde{x}|\mathcal {X},\mathcal {\theta })$ , i.e. the joint probability of observing the image pair $\{\tilde{y},\tilde{x}\}$ given a set of N previously observed templates $\mathcal {X}$ and the model parameters $\mathcal {\theta }=\{\sigma ^2_{x},\sigma ^2_{y},\varSigma _{j_n},\mu _{j_n},G,w\}$ .

From an intuitive point of view, the main difference between the proposed method and previous approaches pertains with the idea that the process of image synthesis is uncertain. Thus, rather than trying to find the expected observation y from an observation x, we estimate what is the probability of observing different values of y when observing x.

2.1 The Observation Model

In order to estimate $p(\tilde{y},\tilde{x}|\mathcal {X},\mathcal {\theta })$ one has to first note that, due to the presence of pathology or imaging artefacts, not all intensity pairs $\{\tilde{y},\tilde{x}\}$ will be likely given a certain set of previously observed images. Thus, it is important to model the possible presence of outliers – intensity pairs that deviate from previous templates. This does not mean that one would be able to predict unseen pathological intensity pairs. Instead, the model assumes that imaging data can fall outside the predictable intensity patterns.

With this aim in mind, the proposed generative model assumes that the observed data is generated from a mixture of

classes labeled by

, i.e. an inlier class, derived from previous observations, and an outlier class, modelled by a uniform distribution. The probability $p(\tilde{y}_i,\tilde{x}_i,j,l_{i}|,\mathcal {X},\mathcal {\theta })$ of observing an intensity pair $\{\tilde{y}_i,\tilde{x}_i\}$ , a transformation j and label l at location i, is then defined as

$\begin{aligned} p(\tilde{y}_i,\tilde{x}_i,j|l_{i},\mathcal {X},\mathcal {\theta })p(l^k_{i})= w\underbrace{p(\tilde{y}_i,\tilde{x}_i,j|l^{I}_{i},\mathcal {X},\mathcal {\theta })p(l^I_{i})}_\text {inlier model}+(1-w)\underbrace{p(\tilde{y}_i,\tilde{x}_i|l^O_{i})p(l^O_{i})}_\text {outlier model} \end{aligned}$

In this model, the distribution of the pair $\{\tilde{y},\tilde{x}\}$ for the outlier class (

) is given by an uniform distribution $p(\tilde{y}_i,\tilde{x}_i|l^O_{i})=\mathcal {U}$ , an inlier class (

) describing the similarity between the observed pair $\{\tilde{y}_i,\tilde{x}_i\}$ and the previously observed pairs $\mathcal {X}=\{y_n,x_n\}$ , a coordinate mapping

, a prior distribution $p(l^I_{i})$ and $p(l^O_{i})$ over the labelling $l^k_{i}$ , and a global mixing weight w.

In this work, given N previously observed images, the inlier model is defined as a mixture model given by

$\begin{aligned} p(\tilde{y}_i,\tilde{x}_i,j|l^I_{i},\mathcal {X},\mathcal {\theta })=\frac{1}{N}\sum _{n=1}^Np(\tilde{y}_i,\tilde{x}_i|l^I_{i},j_n,\mathcal {X}_n, \mathcal {\theta })p(j_n|l^I_{i},\mathcal {\theta }). \end{aligned}$

i.e., the probability of observing a certain pair of intensities $\{\tilde{y}_i,\tilde{x}_i\}$ is a equally-weighted mixture of the N probabilities of observing the pair given a previously observed pair $\mathcal {X}_n$ .

As in the Non-Local STAPLE algorithm [6], we assume that

is unknown, or at least uncertain, as the coordinate mapping problem is ill posed [7]. As in Simpson et al. [7], $p(j_n|l^I_{i},\mathcal {\theta })$ is a multivariate Gaussian distribution with parameters $\mathcal {\theta }=\{\mu ^n_j,\varSigma _j\}$ . Here, the expectation of the mapping

, i.e. $\mu ^n_j$ , is estimated using a multimodal pairwise b-spline parameterised registration between the observed image pair and the n-th image pair. A multichannel locally normalised cross correlation is used as an image similarity for registration purposes. In addition, the precision matrix $\varSigma _{ji}^{-1}$ at location i, represents the inverse of the local directional estimate of registration uncertainty as described in [7]. In this work, $\varSigma _{ji}^{-1}$ is approximated using the local second-moment matrix, also known as the structure tensor [8], of the observed image $\tilde{x}$ in a cubic convolution region of size $s\times s\times s$ (empirically set to

voxels). This covariance uncertainty approximation assumes that the registration in more uncertain along the edges of the image than across them. Thus,

is less likely if it deviates from $\mu ^n_j$ in a direction orthogonal to the image edges. In future work, this approximation will be replaced by a local covariance estimate as provided by Simpson et al. [7].

Similarly to [6], as

is unknown, $p(\tilde{y}_i,\tilde{x}_i|l^I_{i},j_n,\mathcal {X}, \mathcal {\theta })p(j_n|l^I_{i},\mathcal {\theta })$ is approximated by its expected value given a multivariate Gaussian distribution on the patch

norm and a multivariate Gaussian distribution over the mapping

$\begin{aligned} p(\tilde{y}_i,\tilde{x}_i|l^I_{i},j_n,\mathcal {X}, \mathcal {\theta })p(j_n|l^I_{i},\mathcal {\theta })&\approx E[p(\tilde{y}_i,\tilde{x}_i|l^I_{i},j_n,\mathcal {X}, \mathcal {\theta })p(j_n|l^I_{i},\mathcal {\theta })]\nonumber \\&=\sum _{j^*\in \mathcal {N}_s}p(\tilde{y}_i,\tilde{x}_i|l^I_{i},j^*_n,\mathcal {X}, \mathcal {\theta })p(j^*_n|l^I_{i},\mathcal {\theta }). \end{aligned}$

(1)

Under this approximation and under the assumption of conditional independence between $\tilde{y}$ and $\tilde{x}$ , then $p(\tilde{y}_i,\tilde{x}_i|l^I_{i},j^*_n,\mathcal {X}, \mathcal {\theta })$ is defined as

$\begin{aligned} p(\tilde{y}_i,\tilde{x}_i|l^I_{i},j^*_n,\mathcal {X}, \mathcal {\theta })&=p(\tilde{x}_i|l^I_{i},j^*_n,\mathcal {X}, \mathcal {\theta })p(\tilde{y}_i|l^I_{i},j^*_n,\mathcal {X}, \mathcal {\theta }) \nonumber \\&=e^{-\frac{||\mathcal {N}_{p}(y_i)-\mathcal {N}_{p}(y_{nj_n^*})||^2_2}{\sigma ^2_{y_{in}}}-\frac{||\mathcal {N}_{p}(x_i)-\mathcal {N}_{p}(x_{nj_n^*})||^2_2}{\sigma ^2_{x_{in}}}} \end{aligned}$

(2)

under the assumption of conditional independence between x and y, and

$\begin{aligned} p(j^*_n|l^I_{i},\mathcal {\theta })=\frac{1}{Z_{j^*}}e^{-D\varSigma _j^{-1}D^T} \end{aligned}$

(3)

In Eq. (1), $\mathcal {N}_s$ is an integration neighbourhood of size $s\times s\times s$ (again with

voxels) of an image similarity component (Eq. 2) and the registration uncertainty distance component (Eq. 3). In Eq. 2, $\mathcal {N}_{p}(\varkappa )$ is a patch of size $p\times p\times p$ (with

voxels) centred at location $\varkappa$ , and in Eq. 3, $D=j^*_n-\mu _n$ is a $1\times 3$ vector characterising the 3-dimensional components of a displacement from $\mu _n$ . Both the parameters $\sigma ^2_{y_n}$ and $\sigma ^2_{x_n}$ denote the sum of a local and a global normally distributed noise model (iid) between the observation $\tilde{y}$ and the template

, defined as

$\begin{aligned} \sigma ^2_{y_{in}}={\sigma (y-y_{n\mu _n})^2+\sigma (\mathcal {N}_s(y_i)-\mathcal {N}_s(y_{n\mu _n}))^2} \end{aligned}$

and equivalently for $\sigma ^2_{x_{in}}$ , with $\sigma (\varkappa )$ representing the standard deviation of $\varkappa$ . Note that the

of the patch can here be used as all images $\mathcal {X}$ have been histogram matched to $\{\tilde{y}_i,\tilde{x}_i\}$ using a $3^{rd}$ order polynomial fit after non-rigid registration. Finally, $Z_{xj^*}$ is a partition function enforcing

$\begin{aligned} \sum _{j^*\in \mathcal {N}_s}p(\tilde{x}_i,j_n^*|l^2_{i},x_n,\mathcal {\theta })=1. \end{aligned}$