Orderings of Events in Disease Progression

and an ordering $\sigma = ( \sigma (1) , \dots , \sigma (N) )$ , where $\sigma (k) = i$ means that event $e_{i}$ occurs in position $k$. In practice we only observe a snapshot of the event sequence for each subject, taken at an unknown stage $k$. If a subject is at stage $k$ in the sequence $\sigma$ , the events $e_{\sigma (1)} , \dots , e_{\sigma (k)}$ have occurred and the events $e_{\sigma (k+1)} , \dots , e_{\sigma (N)}$ have yet to occur. This induces a partition of the event set, or partial ranking, $\gamma _{k} = e_{\sigma (1)}, \dots , e_{\sigma (k) } | e_{\sigma (k+1)} , \dots , e_{\sigma (N) }$ , where the vertical bar indicates that the first set of events precedes the second. The occurrence of event $e_{i}$ in subject $j$ is informed by the biomarker measurement $x_{ij}$ . The generative model of the biomarker data is

$\begin{aligned} k_{j} \sim P(k) , \end{aligned}$

$\begin{aligned} x_{\sigma (i) , j} \sim p(x_{\sigma (i) , j} | e_{\sigma (i)}) \text { if } i \le k_{j} , \end{aligned}$

$\begin{aligned} x_{\sigma (i) , j} \sim p(x_{\sigma (i) , j} | \lnot e_{\sigma (i)}) \text { otherwise}. \end{aligned}$

Here $p(x | e)$ and $p(x | \lnot e)$ are probability density functions for the biomarker measurement $x$ given that event $e$ has or has not occurred, respectively. $P(k)$ is a prior on the disease stage $k$.
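The generative process above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `sample_biomarkers` and the two sampler callables are hypothetical, events are indexed from 0, and the form of $p(x | e)$ and $p(x | \lnot e)$ is left to the caller.

```python
import numpy as np

def sample_biomarkers(sigma, stage_prior, sample_event, sample_no_event,
                      n_subjects, rng):
    """Draw a stage k_j ~ P(k) and biomarker values for each subject.

    sigma           : event indices (0-based); sigma[pos] is the event at position pos.
    stage_prior     : array of length N + 1 holding P(k) for k = 0..N.
    sample_event    : hypothetical callable (event, rng) -> float, a draw from p(x | e).
    sample_no_event : hypothetical callable (event, rng) -> float, a draw from p(x | not e).
    """
    n_events = len(sigma)
    stages = rng.choice(n_events + 1, size=n_subjects, p=stage_prior)
    X = np.empty((n_subjects, n_events))
    for j, k in enumerate(stages):
        for pos, event in enumerate(sigma):
            # The first k positions of sigma hold the events that have occurred.
            if pos < k:
                X[j, event] = sample_event(event, rng)
            else:
                X[j, event] = sample_no_event(event, rng)
    return stages, X
```

For example, passing `lambda e, rng: rng.normal(1.0, 0.5)` and `lambda e, rng: rng.normal(0.0, 0.5)` as the two samplers would simulate Gaussian biomarkers; those particular distributions are an assumption made only for illustration.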

2.2 The Generalised Mallows Event-Based Model

We formulate the generalised Mallows event-based model by using a generalised Mallows model to parameterise the variability of subject orderings around a central event sequence $\pi$ through the spread parameters $\varvec{\theta } = (\theta _{1}, \dots , \theta _{N-1})$ . Each subject then has their own latent ordering $\sigma _{j}$ , which is assumed to be a sample from a generalised Mallows model. The generative model of the biomarker data in the event-based model is therefore preceded by

$\begin{aligned} \pi , \varvec{\theta } \sim P (\pi , \varvec{\theta } | \nu , \varvec{r}), \end{aligned}$

$\begin{aligned} \sigma _{j} \sim GM(\pi , \varvec{\theta } ). \end{aligned}$

Here $GM(\pi , \varvec{\theta }) = \frac{1}{\psi (\varvec{\theta })} \exp \left[ -d_{\varvec{\theta }} (\pi , \sigma ) \right]$ is a generalised Mallows distribution over orderings $\sigma$ , with normalising constant $\psi (\varvec{\theta }) = \prod _{j=1}^{n-1} \psi _{n-j} (\theta _{j} ) = \prod _{j=1}^{n-1} \frac{1-e^{-(n-j+1) \theta _{j} }}{ 1-e^{-\theta _{j}} }$ , where $n = N$ is the number of events. $d_{\varvec{\theta }} (\pi , \sigma )$ is the generalised Kendall's tau distance [8], which penalises the number of pairwise disagreements between sequences. $P (\pi , \varvec{\theta } | \nu , \varvec{r})$ is a conjugate prior over the generalised Mallows distribution parameters of the form $P (\pi , \varvec{\theta } | \nu , \varvec{r}) \propto \exp \left( -\nu \sum _{j} [\theta _{j} r_{j} + \ln \psi _{n-j} ( \theta _{j} ) ] \right)$ [12].
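As a rough sketch of how the generalised Mallows density can be evaluated, the code below assumes one common convention for the generalised Kendall's tau distance: $d_{\varvec{\theta }} (\pi , \sigma ) = \sum _{j=1}^{n-1} \theta _{j} V_{j}$ , where $V_{j}$ counts the items that follow the $j$-th item of $\pi$ in $\pi$ but precede it in $\sigma$ . The function names are hypothetical, items are indexed from 0, and every $\theta _{j}$ is assumed to be strictly positive.

```python
import numpy as np

def inversion_counts(pi, sigma):
    """V_j for j = 1..n-1: items after pi[j] in pi that come before it in sigma."""
    n = len(pi)
    pos_in_sigma = np.empty(n, dtype=int)
    pos_in_sigma[np.asarray(sigma)] = np.arange(n)
    v = np.zeros(n - 1, dtype=int)
    for j in range(n - 1):
        later_items = np.asarray(pi[j + 1:])
        v[j] = np.sum(pos_in_sigma[later_items] < pos_in_sigma[pi[j]])
    return v

def log_normaliser(theta):
    """ln psi(theta) = sum_j ln[(1 - exp(-(n-j+1) theta_j)) / (1 - exp(-theta_j))]."""
    theta = np.asarray(theta, dtype=float)
    n = len(theta) + 1
    j = np.arange(1, n)  # j = 1..n-1, matching the product in the text
    return np.sum(np.log1p(-np.exp(-(n - j + 1) * theta))
                  - np.log1p(-np.exp(-theta)))

def gm_log_prob(sigma, pi, theta):
    """Log-density of ordering sigma under GM(pi, theta)."""
    v = inversion_counts(pi, sigma)
    return -np.dot(np.asarray(theta, dtype=float), v) - log_normaliser(theta)
```

Under this convention each $V_{j}$ ranges over $0, \dots , n-j$ , which is why the normalising constant factorises into the product of geometric sums given in the text.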

2.3 Dirichlet Process Mixtures of Generalised Mallows Event-Based Models

Dirichlet process mixtures of generalised Mallows models assume that each subject has their own central ordering $\pi _{j}$ and spread parameters $\varvec{\theta }_{j}$ , which are sampled from a discrete distribution $G$ that is drawn from a Dirichlet process [9]. A Dirichlet process mixture is a generative clustering model in which the number of clusters is a random variable, so the number of clusters is inferred from the data, with the concentration parameter $\alpha$ controlling the tendency to form new clusters. The generative model of the biomarker data in the event-based model is now preceded by the process

$\begin{aligned} G \sim DP( \alpha ,P(\pi , \varvec{\theta } | \nu , \varvec{r}) ), \end{aligned}$

$\begin{aligned} \pi _{j}, \varvec{\theta }_{j} \sim G , \end{aligned}$

$\begin{aligned} \sigma _{j} \sim GM(\pi _{j}, \varvec{\theta }_{j} ) , \end{aligned}$

where $DP( \alpha ,P (\pi , \varvec{\theta } | \nu , \varvec{r}) )$ is a Dirichlet process [9]. Each subject $j$ can be characterised by an association with a cluster label $c_{j} \in \{ 1 , \dots , C \}$ and each cluster $c$ with a set of generalised Mallows parameters $\pi _{c}$ and $\varvec{\theta }_{c}$ .
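The cluster-label view of a Dirichlet process can be illustrated with the Chinese restaurant process, a standard equivalent representation in which a subject joins an existing cluster with probability proportional to its size and opens a new one with probability proportional to $\alpha$ . The sketch below only draws labels from this prior; it is not the inference scheme of the paper, and the function name is hypothetical.

```python
import numpy as np

def sample_crp_labels(n_subjects, alpha, rng):
    """Draw cluster labels c_1..c_J from a Chinese restaurant process prior."""
    labels = np.zeros(n_subjects, dtype=int)
    counts = [1]  # the first subject always opens cluster 0
    for j in range(1, n_subjects):
        # Existing clusters are weighted by size, a new cluster by alpha.
        weights = np.array(counts + [alpha], dtype=float)
        c = rng.choice(len(weights), p=weights / weights.sum())
        if c == len(counts):
            counts.append(1)
        else:
            counts[c] += 1
        labels[j] = c
    return labels
```

Larger values of `alpha` tend to produce more clusters for the same number of subjects, which is the sense in which the concentration parameter governs the number of clusters.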

3 Inference

3.1 The Event-Based Model

Inference in the event-based model can be performed by taking Markov chain Monte Carlo (MCMC) samples from the posterior $P(\sigma |X) = \frac{P(X|\sigma )P(\sigma )}{P(X)}$ , where

$\begin{aligned} P(X | \sigma ) = \prod _{j=1}^{J} \left[ \sum _{k=0}^{N} P(k) \left( \prod _{i=1}^{k} p(x_{\sigma (i) , j} | e_{\sigma (i)}) \prod _{i=k+1}^{N} p(x_{\sigma (i) , j} | \lnot e_{\sigma (i)} ) \right) \right] \!. \end{aligned}$

(1)
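Equation (1) can be evaluated in log space with cumulative sums, which avoids recomputing the two inner products for every stage. The sketch below is one possible arrangement, assuming the stage sum runs over $k = 0, \dots , N$ , that the per-event log-densities have been precomputed into two hypothetical $J \times N$ arrays, that events are indexed from 0, and that every entry of the stage prior is positive.

```python
import numpy as np

def log_likelihood(sigma, log_p_event, log_p_no_event, stage_prior):
    """log P(X | sigma) following equation (1).

    log_p_event[j, i]    : log p(x_ij | e_i), shape (J, N).
    log_p_no_event[j, i] : log p(x_ij | not e_i), shape (J, N).
    stage_prior          : P(k) for k = 0..N, shape (N + 1,).
    """
    sigma = np.asarray(sigma)
    n_subjects = log_p_event.shape[0]
    ev = log_p_event[:, sigma]       # columns reordered by sequence position
    nev = log_p_no_event[:, sigma]
    zeros = np.zeros((n_subjects, 1))
    # cum_ev[:, k]: sum of the first k event terms (positions 1..k).
    cum_ev = np.hstack([zeros, np.cumsum(ev, axis=1)])
    # cum_nev[:, k]: sum of the no-event terms for positions k+1..N.
    cum_nev = np.hstack([np.cumsum(nev[:, ::-1], axis=1)[:, ::-1], zeros])
    log_terms = np.log(stage_prior)[None, :] + cum_ev + cum_nev
    # Log-sum-exp over stages, then sum over subjects.
    m = log_terms.max(axis=1, keepdims=True)
    per_subject = m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))
    return per_subject.sum()
```

An MCMC sampler over $\sigma$ would call such a function once per proposed ordering; the proposal mechanism itself is not shown here.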

3.2 The Generalised Mallows Event-Based Model

We use Gibbs sampling to infer the parameters of the generalised Mallows event-based model. This consists of two stages. First, we generate a set of sample event sequences $\sigma _{1:J}$ by sampling from an augmented model [10]: we alternate between sampling a subject’s ordering $\sigma _{j}$ and disease stage $k_{j}$ , which together deterministically determine their partial ranking $\gamma _{j}$ . The Gibbs sampling updates are therefore

$\begin{aligned} \sigma ^{(j)} \sim P(\sigma | \varvec{\gamma } = \gamma _{j} , \pi , \varvec{\theta }) , \end{aligned}$

$\begin{aligned} k^{(j)} \sim P(k | \varvec{\sigma } = \sigma _{j} , X_{j} ) . \end{aligned}$

Second, we sample the model parameters given the set of sample orderings $\sigma _{1:J}$ using the updates

$\begin{aligned} \pi \sim P(\pi | \varvec{\theta }, \nu , \varvec{r} , \sigma _{1:J} ), \end{aligned}$

$\begin{aligned} \theta _{k} \sim P(\theta _{k} | \pi , \nu , \varvec{r} , \sigma _{1:J} ) .\end{aligned}$
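Of the updates in this section, the stage update is the simplest to illustrate. Assuming it takes the form implied by the generative model, $P(k | \sigma _{j} , X_{j}) \propto P(k) \prod _{i=1}^{k} p(x_{\sigma _{j} (i) , j} | e_{\sigma _{j} (i)}) \prod _{i=k+1}^{N} p(x_{\sigma _{j} (i) , j} | \lnot e_{\sigma _{j} (i)})$ , a single subject's stage could be drawn as sketched below. The function names and the precomputed log-density vectors are hypothetical, events are indexed from 0, and the stage prior is assumed to be strictly positive.

```python
import numpy as np

def stage_posterior(sigma, log_p_event_j, log_p_no_event_j, stage_prior):
    """P(k | sigma, X_j) for k = 0..N, for one subject.

    log_p_event_j[i]    : log p(x_ij | e_i) for this subject, shape (N,).
    log_p_no_event_j[i] : log p(x_ij | not e_i) for this subject, shape (N,).
    """
    sigma = np.asarray(sigma)
    ev = np.asarray(log_p_event_j)[sigma]
    nev = np.asarray(log_p_no_event_j)[sigma]
    # Entry k: log-density of the first k events having occurred.
    cum_ev = np.concatenate([[0.0], np.cumsum(ev)])
    # Entry k: log-density of the remaining N - k events not having occurred.
    cum_nev = np.concatenate([np.cumsum(nev[::-1])[::-1], [0.0]])
    log_post = np.log(stage_prior) + cum_ev + cum_nev
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def sample_stage(sigma, log_p_event_j, log_p_no_event_j, stage_prior, rng):
    """One Gibbs draw of the stage k_j given the subject's ordering and data."""
    post = stage_posterior(sigma, log_p_event_j, log_p_no_event_j, stage_prior)
    return rng.choice(len(post), p=post)
```

The updates for $\sigma _{j}$ , $\pi$ and $\theta _{k}$ depend on details of the generalised Mallows conditionals that are not spelled out here, so they are not sketched.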

3.3 Dirichlet Process Mixtures of Generalised Mallows Event-Based Models

We formulate another Gibbs sampler to infer the parameters of Dirichlet process mixtures of generalised Mallows event-based models. We generate a set of candidate sample orderings $\sigma _{1:J,1:C}$ , disease stages $k_{1:J,1:C}$ , and partial rankings $\gamma _{1:J,1:C}$ , which are conditioned on the parameters for each cluster via the updates

$\begin{aligned} \sigma ^{(j,c)} \sim P(\sigma | \varvec{\gamma } = \gamma _{jc} , \pi _{c}, \varvec{\theta }_{c}) , \end{aligned}$

$\begin{aligned} k^{(j,c)} \sim P(k | \varvec{\sigma } = \sigma _{jc} , X_{j} ) . \end{aligned}$

From these samples we sample the cluster assignment $c_{j}$ of each subject, conditioned on the cluster assignments of all other subjects $c_{-j}$ , the subject’s sample ordering for each cluster $\sigma _{j,1:C}$ , their disease stage for each cluster $k_{j,1:C}$ , and their biomarker data $X_{j}$ . We then update the generalised Mallows model parameters for each cluster, $\pi _{c}$ and $\varvec{\theta }_{c}$ , from the set of subject orderings assigned to that cluster, $\varvec{\sigma }_{c}$ . The updates are therefore

$\begin{aligned} c^{(j)} \sim P(c| c_{-{j}}, \sigma _{j,1:C}, \varvec{\theta }, \alpha , \nu , \varvec{r}, X_{j}, k_{j,1:C} ) , \end{aligned}$