Method to Discover Genetically Driven Image Biomarkers


Table 1. Notation used in the model.

Model Variables

$$I_{sn}$$: image descriptor of supervoxel n in subject s

$$G_{sm}$$: genetic location of minor allele m in subject s

$$z_{sn}^I$$: subject-specific topic that generates supervoxel n in subject s, $$ 1 \le z_{sn}^I \le T $$

$$z_{sm}^G$$: subject-specific topic that generates minor allele m in subject s, $$ 1 \le z_{sm}^G \le T $$

$$c_{st}$$: population-level topic that serves as subject-specific topic t in subject s, $$ 1 \le c_{st} \le K $$

$$v$$: parameter vector that determines the stick-breaking proportions of topics in a population template

$$\pi _s$$: parameter vector that determines the stick-breaking proportions of topics in subject s

$$(\mu _k, \varSigma _k)$$: mean and covariance matrix of image descriptors for population-level topic k

$$\beta _k$$: frequency of different locations in genetic signatures for population-level topic k

$$\omega $$: hyper-parameters of the Beta prior for $$v$$

$$\alpha $$: hyper-parameters of the Beta prior for $$\pi _s$$

$$\eta ^I$$: hyper-parameters of the Normal-Inverse-Wishart prior for $$(\mu _k, \varSigma _k)$$

$$\eta ^G$$: hyper-parameters of the Dirichlet prior for $$\beta _k$$

VB Estimates

$$(\hat{\mu }_k, \hat{\varSigma }_k)$$: mean and covariance of image descriptors for population-level topic k

$$\hat{\beta }_k$$: frequency of different locations in genetic signatures for population-level topic k





2 Model


In this section, we describe the generative model for image and genetic data, which is based on population-wide common patterns that are instantiated in each subject. Our notation is summarized in Table 1, and the generative process is illustrated in Fig. 1.

Fig. 1.
Subject s draws a subset of T topics from the K population-level topics. The indices of the subject-level topics, $$c_{s1},\ldots ,c_{sT}$$, are drawn from a categorical distribution. At the subject level, the topic indices of the supervoxels $$\{z_{sn}^I\}$$ and of the minor alleles $$\{z_{sm}^G \}$$ are drawn from the subject-specific categorical distribution. The vector $$c_{s}$$ acts as a map from subject-specific topics to population-level topics (i.e., $$c_s(z_{sm}^G)$$ or $$c_s(z_{sn}^I)$$).

Image and Genetic Data. We assume each subject in a study is characterized by an image and a genetic signature for the loci in the genome previously implicated in the disease. By analogy to the “bag-of-words” representation [14], we assume that the image domain of each subject is divided into relatively homogeneous, spatially contiguous regions (i.e., “supervoxels”). We let $$I_{sn}\in \mathbb {R}^D$$ denote the D-dimensional descriptor of supervoxel n in subject s that summarizes the intensity and texture properties of the supervoxel. The genetic data in our problem comes in the form of minor allele counts (0, 1, or 2) for a set of L loci. Our representation of genetic data is inspired by the additive model commonly used in GWAS analysis [4]. In particular, we assume that the risk of the disease increases monotonically with the minor allele count. We let $$G_{sm} \in \{1, \ldots , L \}$$ denote the location of minor allele m in the genetic signature of subject s. For example, suppose $$L=2$$, and subject s has one and two minor alleles at locations $$\ell _1$$ and $$\ell _2$$, respectively. This subject is represented by a list of 3 elements, $$G_{s} = \{\ell _1, \ell _2, \ell _2\}$$.
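The list representation of the genetic signature can be illustrated with a short Python sketch. The function name `counts_to_allele_list` and the convention that the input is a per-locus vector of minor allele counts are assumptions made here for illustration; they are not taken from the authors' implementation.

```python
def counts_to_allele_list(counts):
    """Expand per-locus minor allele counts into a list of locus indices.

    counts[l - 1] is the minor allele count (0, 1, or 2) at locus l.
    The returned list repeats each 1-based locus index once per minor
    allele, mirroring the G_s representation described in the text.
    """
    alleles = []
    for locus, count in enumerate(counts, start=1):
        if count not in (0, 1, 2):
            raise ValueError(f"invalid minor allele count {count} at locus {locus}")
        alleles.extend([locus] * count)
    return alleles


# The example from the text: L = 2, one minor allele at locus 1, two at locus 2.
G_s = counts_to_allele_list([1, 2])
print(G_s)  # [1, 2, 2]
```

Under this encoding, a subject with more minor alleles contributes more entries to the list, which is how the additive assumption enters the representation.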

Population Model. Our population model is based on the Hierarchical Dirichlet Process (HDP) [17]. The model assumes a collection of K “topics” that are shared across subjects in the population. We let $$p_{k}^I$$ and $$p_{k}^G$$ denote the distributions for the image and genetic signatures, respectively, associated with topic k. Each $$p_{k}^I = \mathcal {N}(\mu _k,\varSigma _k)$$ is a Gaussian distribution that generates supervoxel descriptors $$I_{sn}$$; it is parameterized by its mean vector $$\mu _k\in \mathbb {R}^D$$ and covariance matrix $$\varSigma _k \in \mathbb {R}^{D \times D}$$. Each $$p_{k}^G = \text {Cat}(\beta _k)$$ is a categorical distribution that generates minor allele locations $$G_{sm}$$; it is parameterized by its weight vector $$\beta _k\in (0,1)^L$$, whose entries sum to one.

When sampling a new subject s, at most $$T<K$$ topics are drawn from the population-wide pool to determine the image and genetic signature of this subject. We let $$c_{st}$$ denote the population topic selected to serve as subject-specific topic t ($$1 \le t \le T$$) in subject s. We also use $$c_s=[c_{s1},\ldots ,c_{sT}]$$ to refer to the entire vector of topics selected for subject s. Here, $$c_{s}[t] = k$$ indicates that population-level topic k was selected to serve as subject-specific topic t. The subject-specific topics inherit their signature distributions from the population prototypes, but each subject is characterized by a different subset of the population-level topics and by different proportions of those topics in the subject-specific data.
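The way subject-specific topics inherit their distributions through the map $$c_s$$ can be sketched as follows. This is a minimal illustration rather than the authors' code: the function `sample_subject`, its argument names, and the uniform topic weights in the example are all assumptions (in the model, the weights come from the stick-breaking parameters $$v$$ and $$\pi _s$$). Indices are 0-based here, whereas the text uses 1-based indices.

```python
import numpy as np


def sample_subject(rng, pop_weights, subj_weights, mu, Sigma, beta,
                   n_supervoxels, n_alleles):
    """Draw one synthetic subject from a finite (truncated) topic model.

    pop_weights  : (K,) weights over population-level topics (assumed given)
    subj_weights : (T,) weights over subject-specific topics (assumed given)
    mu, Sigma    : (K, D) means and (K, D, D) covariances of the Gaussians
    beta         : (K, L) categorical weights over the L loci
    """
    K, T = len(pop_weights), len(subj_weights)
    L = beta.shape[1]
    # c_s[t] = k: population-level topic k serves as subject-specific topic t.
    c_s = rng.choice(K, size=T, p=pop_weights)
    # Subject-specific topic of every supervoxel and every minor allele.
    z_I = rng.choice(T, size=n_supervoxels, p=subj_weights)
    z_G = rng.choice(T, size=n_alleles, p=subj_weights)
    # Observations inherit their distribution through the map c_s(z).
    I_s = np.stack([rng.multivariate_normal(mu[c_s[t]], Sigma[c_s[t]])
                    for t in z_I])
    G_s = np.array([rng.choice(L, p=beta[c_s[t]]) for t in z_G])
    return c_s, z_I, z_G, I_s, G_s


# Hypothetical sizes and parameters, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
K, T, D, L = 5, 3, 2, 4
pop_weights = np.full(K, 1.0 / K)
subj_weights = np.full(T, 1.0 / T)
mu = rng.normal(size=(K, D))
Sigma = np.stack([np.eye(D)] * K)
beta = rng.dirichlet(np.ones(L), size=K)

c_s, z_I, z_G, I_s, G_s = sample_subject(
    rng, pop_weights, subj_weights, mu, Sigma, beta,
    n_supervoxels=6, n_alleles=4)
```

Because `c_s` is drawn with replacement, two subject-specific topics may point to the same population-level topic, which is consistent with a subject using at most T distinct population topics.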

As $$T, K \rightarrow \infty $$, this model converges to the non-parametric HDP [17]. Rather than requiring specific values for T and K, the HDP enables us to estimate them from the data. As part of this model, we employ the “stick-breaking” construction [17] to parameterize the categorical distribution for $$c_{st}$$:


$$\begin{aligned} c_{st} \sim \text {Cat-SB}(v), \end{aligned}$$

(1)
where $$\text {Cat-SB}(v)$$ is a categorical distribution whose weights are generated through the stick-breaking process from the (potentially infinite) parameter vector v whose components are in the interval (0, 1). Formally, if we define a random variable $$x\sim \text {Cat-SB}(v)$$, then


$$\begin{aligned} p(x) \triangleq v_x \prod _{i=1}^{x-1}{ (1 - v_i ) } \quad \text {for } x = 1, 2, \ldots \end{aligned}$$

(2)
This parameterization accommodates infinite alphabets. The stick-breaking construction penalizes a large number of topics, thereby encouraging a parsimonious representation of the data. A similar construction enables automatic selection of the number of topics at both the population level and the subject level. We employ a truncated HDP variant that uses finite values for T and K [9]. In this setup, $$v\in (0,1)^{K-1}$$. In contrast to finite (fixed) models, we set K to a sufficiently high value, and the estimation procedure uses as many topics as needed, but not necessarily all K, to explain the observations.
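The truncated stick-breaking weights can be computed with a few lines of Python. The sketch below follows Eq. (2) for the first K-1 topics and assumes the common truncation convention in which the last topic receives all remaining stick mass (equivalent to setting $$v_K = 1$$); that convention is consistent with $$v\in (0,1)^{K-1}$$ but is an assumption here, and the function name `stick_breaking_weights` is hypothetical.

```python
import numpy as np


def stick_breaking_weights(v):
    """Map v in (0, 1)^(K-1) to K categorical weights.

    weights[x] = v[x] * prod_{i < x} (1 - v[i]) for the first K-1 topics
    (Eq. (2), with 0-based indices); the last topic takes the leftover mass.
    """
    v = np.asarray(v, dtype=float)
    # remaining[x] = prod_{i < x} (1 - v[i]): stick length left before topic x.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)))
    weights = np.empty(len(v) + 1)
    weights[:-1] = v * remaining[:-1]
    weights[-1] = remaining[-1]  # assumed truncation: leftover stick mass
    return weights


# Breaking off half of the remaining stick each time, with K = 4.
w = stick_breaking_weights([0.5, 0.5, 0.5])
print(w)  # [0.5   0.25  0.125 0.125]
```

When the components of v are not close to zero, the weights decay quickly with the topic index, so topics with high indices receive little mass unless the data support them. This is the sense in which setting K to a large value does not force the model to use all K topics.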

Sep 16, 2016 | Posted by in GENERAL RADIOLOGY | Comments Off on Method to Discover Genetically Driven Image Biomarkers
