Method to Discover Genetically Driven Image Biomarkers

Model Variables

$I_{sn}$: image descriptor of supervoxel n in subject s
$G_{sm}$: genetic location of minor allele m in subject s
$z_{sn}^I$: subject-specific topic that generates supervoxel n in subject s, $1 \le z_{sn}^I \le T$
$z_{sm}^G$: subject-specific topic that generates minor allele m in subject s, $1 \le z_{sm}^G \le T$
$c_{st}$: population-level topic that serves as subject-specific topic t in subject s, $1 \le c_{st} \le K$
$v$: parameter vector that determines the stick-breaking proportions of topics in a population template
$\pi _s$: parameter vector that determines the stick-breaking proportions of topics in subject s
$(\mu _k, \varSigma _k)$: mean and covariance matrix of image descriptors for population-level topic k
$\beta _k$: frequency of different locations in genetic signatures for population-level topic k
$\omega$: hyper-parameters of the Beta prior for $v$
$\alpha$: hyper-parameters of the Beta prior for $\pi _s$
$\eta ^I$: hyper-parameters of the Normal-Inverse-Wishart prior for $(\mu _k, \varSigma _k)$
$\eta ^G$: hyper-parameters of the Dirichlet prior for $\beta _k$

VB Estimates

$(\hat{\mu }_k, \hat{\varSigma }_k)$: mean and covariance of image descriptors for population-level topic k
$\hat{\beta }_k$: frequency of different locations in genetic signatures for population-level topic k

2 Model

In this section, we describe the generative model for image and genetic data, which is based on population-wide common patterns that are instantiated in each subject. Our notation is summarized in Table 1 and the generative process is illustrated in Fig. 1.

Fig. 1. Subject s draws a subset of T topics from K population-level topics. Indices of the subject-level topics are stored in $c_{s1},\ldots ,c_{sT}$, drawn from a categorical distribution. At the subject level, the topic indices of the supervoxels $\{z_{sn}^I\}$ and of the minor alleles $\{z_{sm}^G \}$ are drawn from the subject-specific categorical distribution. Vector $c_{s}$ acts as a map from subject-specific topics to population-level topics (i.e., $c_s[z_{sm}^G]$ or $c_s[z_{sn}^I]$).

Image and Genetic Data. We assume each subject in a study is characterized by an image and a genetic signature for the loci in the genome previously implicated in the disease. Based on the analogy to the “bag-of-words” representation [14], we assume that the image domain of each subject is divided into relatively homogeneous, spatially contiguous regions (i.e., “supervoxels”). We let $I_{sn}\in \mathbb {R}^D$ denote the D-dimensional descriptor of supervoxel n in subject s that summarizes the intensity and texture properties of the supervoxel. The genetic data in our problem comes in the form of minor allele counts (0, 1 or 2) for a set of L loci. Our representation for genetic data is inspired by the commonly used additive model in GWAS analysis [4]. In particular, we assume that the risk of the disease increases monotonically with the minor allele count. We let $G_{sm} \in \{1, \cdots , L \}$ denote the location of minor allele m in the genetic signature of subject s. For example, suppose subject s has one and two minor alleles at locations $\ell _1$ and $\ell _2$, respectively. This subject is represented by a list of 3 elements, $G_{s} = \{\ell _1, \ell _2, \ell _2\}$.
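The list representation of a genetic signature can be sketched in a few lines of Python. This is only an illustration: the function name `allele_counts_to_signature` and the use of 1-based integer locus indices are assumptions made here, not part of the method description.

```python
from typing import List, Sequence


def allele_counts_to_signature(counts: Sequence[int]) -> List[int]:
    """Expand per-locus minor allele counts (0, 1 or 2) into a list of loci.

    Locus l (1-based, an assumption of this sketch) appears counts[l - 1]
    times in the returned list, mirroring G_s in the text.
    """
    signature: List[int] = []
    for locus, count in enumerate(counts, start=1):
        if count not in (0, 1, 2):
            raise ValueError(f"minor allele count must be 0, 1 or 2, got {count}")
        signature.extend([locus] * count)
    return signature


# One minor allele at locus 1 and two minor alleles at locus 2.
print(allele_counts_to_signature([1, 2]))  # [1, 2, 2]
```

Loci with a count of zero contribute no elements, so the length of the list equals the subject's total number of minor alleles.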

Population Model. Our population model is based on the Hierarchical Dirichlet Process (HDP) [17]. The model assumes a collection of K “topics” that are shared across subjects in the population. We let $p_{k}^I$ and $p_{k}^G$ denote the distributions for the image and genetic signatures, respectively, associated with topic k. Each $p_{k}^I = \mathcal {N}(\mu _k,\varSigma _k)$ is a Gaussian distribution that generates supervoxel descriptors $I_{sn}$; it is parameterized by its mean vector $\mu _k\in \mathbb {R}^D$ and covariance matrix $\varSigma _k \in \mathbb {R}^{D \times D}$. Each $p_{k}^G = \text {Cat}(\beta _k)$ is a categorical distribution that generates minor allele locations $G_{sm}$; it is parameterized by its weight vector $\beta _k\in (0,1)^L$.

When sampling a new subject s, at most $T$ topics are drawn from the population-wide pool to determine the image and genetic signature of this subject. We let $c_{st}$ denote the population topic selected to serve as subject-specific topic t ( $1 \le t \le T$ ) in subject s. We also use $c_s=[c_{s1},\ldots ,c_{sT}]$ to refer to the entire vector of topics selected for subject s, so that $c_{s}[t] = k$ indicates that population-level topic k was selected to serve as subject-specific topic t. The subject-specific topics inherit their signature distributions from the population prototypes, but each subject is characterized by a different subset of the population-level topics and by different proportions of those topics in the subject-specific data.

As $T, K \rightarrow \infty$ , this model converges to a non-parametric Hierarchical Dirichlet Process (HDP) [17]. Rather than choose specific values for T and K, HDP enables us to estimate them from the data. As part of this model, we employ the “stick-breaking” construction [17] to parameterize the categorical distribution for $c_{st}$ :

$\begin{aligned} c_{st} \sim \text {Cat-SB}(v), \end{aligned}$

(1)

where $\text {Cat-SB}(v)$ is a categorical distribution whose weights are generated through the stick-breaking process from the (potentially infinite) parameter vector v whose components are in the interval (0, 1). Formally, if we define a random variable $x\sim \text {Cat-SB}(v)$ , then

$\begin{aligned} p(x) \triangleq v_x \prod _{i=1}^{x-1}{ (1 - v_i ) } \quad \text {for } x = 1, 2, \ldots \end{aligned}$

(2)

This parameterization accepts infinite alphabets. The stick-breaking construction penalizes a large number of topics, thereby encouraging a parsimonious representation of the data. A similar construction enables automatic selection of the number of topics at both the population level and the subject level. We employ a truncated HDP variant that uses finite values for T and K [9]. In this setup, $v\in (0,1)^{K-1}$. In contrast to finite (fixed) models, we set K to a sufficiently high value, and the estimation procedure uses as many topics as needed, but not necessarily all K, to explain the observations.
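To make the truncated stick-breaking weights concrete, the following minimal Python sketch computes the categorical weights implied by Eq. (2) for a finite vector $v$ of length $K-1$. The function name `stick_breaking_weights` is hypothetical, and assigning the leftover stick to the K-th outcome is the usual truncation convention, assumed here rather than stated in the text.

```python
from typing import List, Sequence


def stick_breaking_weights(v: Sequence[float]) -> List[float]:
    """Categorical weights of Cat-SB(v) for a truncated vector v of length K - 1.

    Outcome x (1-based) gets v_x * prod_{i < x} (1 - v_i) for x = 1..K-1.
    Assumption of this sketch: outcome K receives the remaining stick,
    so the K weights sum to one.
    """
    weights: List[float] = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for v_i in v:
        if not 0.0 < v_i < 1.0:
            raise ValueError("components of v must lie in the interval (0, 1)")
        weights.append(v_i * remaining)
        remaining *= 1.0 - v_i
    weights.append(remaining)
    return weights


# K = 4 topics, so v has K - 1 = 3 components.
print(stick_breaking_weights([0.5, 0.5, 0.5]))  # [0.5, 0.25, 0.125, 0.125]
```

Because each weight is a fraction of whatever stick remains, later topics tend to receive less mass unless their $v$ components are large, which is how the construction discourages using many topics.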

Subject-Specific Data. To generate an image descriptor for supervoxel n in subject s, we sample a random variable $z_{sn}^I \sim \text {Cat-SB}(\pi _s)$ from a categorical distribution parameterized by the vector of stick-breaking proportions $\pi _s \in (0,1)^{T-1}$. The value $z_{sn}^I=t$ indicates that subject-specific topic t generates image descriptor $I_{sn}$:

$\begin{aligned} I_{sn} | z^I_{sn}, c_s \sim \mathcal {N} \left( \mu _{c_s[z^I_{sn}]}, \varSigma _{c_s[z^I_{sn}]} \right) \!. \end{aligned}$
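The generative steps for the image descriptors of one subject can be sketched as follows, using only the Python standard library. Several choices are assumptions made for illustration: the function names are hypothetical, the truncated Cat-SB distribution is sampled by breaking the stick sequentially (the last outcome absorbs the remainder), and each covariance $\varSigma _k$ is taken to be diagonal so that coordinates can be drawn independently; the model itself uses full covariance matrices.

```python
import random
from typing import List, Sequence, Tuple


def sample_cat_sb(v: Sequence[float], rng: random.Random) -> int:
    """Draw a 1-based index from the truncated Cat-SB(v) distribution.

    Index x is returned with probability v_x * prod_{i < x} (1 - v_i);
    index len(v) + 1 absorbs the remaining mass (truncation assumption).
    """
    for index, v_i in enumerate(v, start=1):
        if rng.random() < v_i:
            return index
    return len(v) + 1


def sample_subject_images(
    v: Sequence[float],
    pi_s: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
    num_supervoxels: int,
    rng: random.Random,
) -> Tuple[List[int], List[int], List[List[float]]]:
    """Sample c_s, the topic indices z^I_sn, and the descriptors I_sn."""
    num_subject_topics = len(pi_s) + 1  # T
    # c_s[t - 1] is the population-level topic serving as subject topic t.
    c_s = [sample_cat_sb(v, rng) for _ in range(num_subject_topics)]
    z, descriptors = [], []
    for _ in range(num_supervoxels):
        z_sn = sample_cat_sb(pi_s, rng)  # subject-specific topic in 1..T
        k = c_s[z_sn - 1]                # mapped population-level topic in 1..K
        mu, var = means[k - 1], variances[k - 1]
        # Diagonal-covariance simplification: independent Gaussian coordinates.
        descriptors.append([rng.gauss(m, s ** 0.5) for m, s in zip(mu, var)])
        z.append(z_sn)
    return c_s, z, descriptors


# Toy setting with K = 3 population topics, T = 2 subject topics, D = 2.
rng = random.Random(0)
means = [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
c_s, z, descriptors = sample_subject_images([0.5, 0.5], [0.6], means, variances, 4, rng)
print(c_s, z, descriptors)
```

The lookup `c_s[z_sn - 1]` is the code counterpart of the subscript $c_s[z^I_{sn}]$ in the sampling equation: the subject-level index selects which population-level Gaussian generates the descriptor.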