Testing for Longitudinal Data



Fig. 1.
A toy example in Euclidean space. Top: (a) cross-sectional data of two groups, illustrated as red circles and blue squares; (b) the same data with longitudinal information, where points on the same line are observations from one subject; (c) the trajectory space, represented by a slope and an intercept. Every point in this space corresponds to a straight line in (b). Bottom: (d) trajectories generated by points along the 1st principal component (PC) of standard PCA in trajectory space at $$\{0, \pm 1, \pm 2\}$$ standard deviations (SD); (e) trajectories generated along the 2nd PC (Color figure online).





2 Distribution of Trajectories in Euclidean Space


We first illustrate the concept of analyzing populations of trajectories in Euclidean space, which is a trivial case of a Riemannian manifold.

Consider the case of two groups of subjects such that each subject is measured at multiple points in time. Such a data configuration is also referred to as a staggered longitudinal design, see Fig. 1(b). If we ignore the within-subject correlations and model the data with a cross-sectional design, illustrated in Fig. 1(a), the two groups cannot be separated using statistical tests that rely on a comparison of means only (cf. Table 1). Hence, to leverage the longitudinal information, we first estimate a linear regression model for each subject to summarize its trend. The regression line, a smooth trajectory approximating a subject’s data points, is parameterized by the tuple of slope and intercept, which can be represented as a point in the space of trajectories. As shown in Fig. 1(c), representing the data in this trajectory space separates the populations (at least visually) in this example. In fact, Table 1 indicates that including longitudinal information allows us to identify differences between the two groups statistically.
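To make this step concrete, the following minimal sketch (an illustration, not the authors' code) fits a least-squares line to each subject's measurements and collects the resulting (slope, intercept) pairs as points in trajectory space; the helper name `to_trajectory_space` and the input layout are our assumptions.

```python
# Sketch: map each subject's longitudinal measurements to one point
# (slope, intercept) in trajectory space via per-subject least squares.
import numpy as np

def to_trajectory_space(subjects):
    """subjects: list of (times, values) array pairs, one per subject.
    Returns an (n_subjects, 2) array of [slope, intercept] rows."""
    points = []
    for t, y in subjects:
        slope, intercept = np.polyfit(t, y, deg=1)  # 1st-order LS fit
        points.append([slope, intercept])
    return np.asarray(points)
```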

To further analyze the group differences, we explore the distribution of trajectories within the (slope, intercept) space, i.e., the trajectory space. Under a Gaussian assumption, principal component analysis (PCA) is a standard tool to estimate the variance and principal directions of a sample. By applying PCA to the (slope, intercept) data, we obtain a representation of each population of trajectories, namely its variances and its principal components. For example, the solid lines with different colors in Fig. 1(c) show the principal components of the two groups, respectively. By moving along these principal components, we generate new points in the trajectory space such that each point represents a straight line in the original space of the data points. Figure 1(d) and (e) visualize the trajectories along the principal components for different standard deviations. The five trajectories in Fig. 1(d), for instance, correspond to the five points at $$\{0, \pm 1, \pm 2\}$$ SD along the first principal component in the trajectory space of each group. This Euclidean case illustrates that the proposed approach is a potentially useful tool for the analysis of longitudinal data.
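A minimal sketch of this step, under the same assumptions as above: standard PCA on the (slope, intercept) points of one group, followed by generating trajectories at $$\{0, \pm 1, \pm 2\}$$ SD along a chosen principal component, mirroring Fig. 1(d) and (e). Here `X` is the output of `to_trajectory_space` for one group.

```python
# Sketch: PCA in trajectory space and trajectory generation along a PC.
import numpy as np

def pca(X):
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # returned in ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    return mu, eigvals[order], eigvecs[:, order]

def trajectories_along_pc(X, pc_index=0, sds=(-2, -1, 0, 1, 2)):
    mu, lam, V = pca(X)
    sd = np.sqrt(lam[pc_index])
    # each generated point is a (slope, intercept) pair, i.e., a line
    return [mu + s * sd * V[:, pc_index] for s in sds]
```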


Table 1.
Distances and estimated p-values (10000 random permutations) on toy data using (1) the mean difference in Euclidean space ($$\bar{D}_E$$), (2) the Mahalanobis distance ($$\bar{D}_M$$), and (3) the Bhattacharyya distance ($$D_B$$) as a test-statistic.








 



Bhattacharyya Distance. Visualizing trajectories along principal directions can qualitatively demonstrate differences between groups. To quantitatively assess the differences, however, we need a suitable distance measure that can serve as a test-statistic. An appropriate candidate is the Bhattacharyya distance [1], which quantifies the dissimilarity of two probability distributions. Given two multivariate Gaussians with means $$(\mu _1, \mu _2)$$ and covariance matrices $$(\varSigma _1, \varSigma _2)$$, the Bhattacharyya distance $$D_B$$ has the closed-form expression


$$\begin{aligned} D_B((\mu _1, \varSigma _1), (\mu _2, \varSigma _2)) = \frac{1}{8} (\mu _1 - \mu _2)^\top \varSigma ^{-1} (\mu _1 - \mu _2) + \frac{1}{2} \ln \left( \frac{|\varSigma |}{\sqrt{|\varSigma _1| \cdot |\varSigma _2|}}\right) , \end{aligned}$$

(1)
where $$\varSigma = (\varSigma _1 + \varSigma _2)/2$$, and $$|\cdot |$$ denotes the matrix determinant. The first term in Eq. (1) measures the separability of the distributions w.r.t. their means; it is proportional to the squared Mahalanobis distance [10] (computed with the averaged covariance $$\varSigma $$), which can be considered a special case of Eq. (1) in which the difference between the covariances (as measured by the second term) is ignored. This additional term makes $$D_B$$ more suitable than the Mahalanobis distance in cases where the distributions differ in their covariances: the Mahalanobis distance is zero whenever two distributions have equal means, whereas $$D_B$$ remains positive if the covariances differ. Note, however, that $$D_B$$ only satisfies three conditions of a distance metric (non-negativity, identity of indiscernibles, and symmetry) but not the triangle inequality; it is therefore only a semi-metric.
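For reference, a direct transcription of Eq. (1) as a sketch (the function and variable names are ours):

```python
# Sketch of Eq. (1): closed-form Bhattacharyya distance between two
# multivariate Gaussians (mu1, S1) and (mu2, S2).
import numpy as np

def bhattacharyya(mu1, S1, mu2, S2):
    S = 0.5 * (S1 + S2)                    # averaged covariance
    d = mu1 - mu2
    maha = d @ np.linalg.solve(S, d)       # squared Mahalanobis part
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_1 = np.linalg.slogdet(S1)
    _, logdet_2 = np.linalg.slogdet(S2)
    # ln(|S| / sqrt(|S1| |S2|)) computed via log-determinants for stability
    return 0.125 * maha + 0.5 * (logdet_S - 0.5 * (logdet_1 + logdet_2))
```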

In fact, Eq. (1) allows us to compute a distance between the two distributions (assuming Gaussianity) in Fig. 1(c), and thereby to define a test-statistic for group differences in a permutation testing setup. The null-hypothesis $$H_0$$ of the permutation test is that the two distributions to be tested (say P, Q) are the same, i.e., $$H_0:P=Q$$. We estimate the empirical distribution of the test-statistic under $$H_0$$ by repeatedly permuting the group labels of the points in Fig. 1(c) and re-computing $$D_B$$ between the two groups that result from the permuted labels. The p-value under $$H_0$$ is then estimated as the proportion of permutation samples for which the distance is greater than or equal to the one computed for the original (unpermuted) label assignment. In Table 1, $$D_B$$, tested on the longitudinal data, exhibits the best performance in separating the groups, with an estimated p-value of $$<$$1e-4 under 10000 permutations.
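A sketch of this permutation test, reusing the `bhattacharyya` function from above; the binary `labels` array and the +1 small-sample correction in the p-value estimate are our assumptions, not taken from the source.

```python
# Sketch: permutation test with D_B as test-statistic on trajectory-space
# points X (n_subjects, 2) and 0/1 group labels.
import numpy as np

def gaussian_fit(X):
    return X.mean(axis=0), np.cov(X, rowvar=False)

def perm_test(X, labels, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)

    def stat(lab):
        return bhattacharyya(*gaussian_fit(X[lab == 0]),
                             *gaussian_fit(X[lab == 1]))

    observed = stat(labels)
    count = sum(stat(rng.permutation(labels)) >= observed
                for _ in range(n_perm))
    return observed, (count + 1) / (n_perm + 1)  # small-sample correction
```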


3 Distribution of Trajectories on Manifolds


To explore the distribution of trajectories for manifold-valued data, e.g., images or shapes, we need to generalize the statistical test of the previous section from Euclidean space to manifolds. Specifically, let $$\{P_{i, j, k}\}$$ be a population of longitudinal data on the same manifold, where i is the group identifier, j is the subject identifier, and k identifies the time point. Further assume we have N groups: group i has $$S_i$$ subjects ($$i=1,\ldots ,N$$), and each subject has multiple time points, $$\{t_{i,j,k}\}, k = 1,\ldots ,T_{i,j}$$. Our objective is to characterize the distribution of trajectories for each group, $$\{D_i\}$$, i.e., to estimate its variance and principal directions, and to assess whether two groups are significantly different.

Individual Trajectories for Longitudinal Data. To perform statistical tests on subjects with associated longitudinal data, our first step is to summarize the variations within a subject as a smooth trajectory. The parametric geodesic regression approaches for data in Kendall’s shape space [4], or images [7, 13], which generalize linear regression in Euclidean space, provide a compact representation of the continuous trajectory for each subject. The trajectory of subject j from group i is parametrized by the initial point $$\hat{p}_{i,j}$$ and the initial velocity $$\hat{u}_{i,j}$$. This trajectory minimizes the sum-of-squared geodesic distances between the observations and their corresponding points on the trajectory, i.e.,


$$\begin{aligned} (\hat{p}_{i,j}, \hat{u}_{i,j}) = {\mathop {\mathrm{arg ~ min}}\limits _{(p_{i,j}, u_{i,j})}} \sum _{k=1}^{T_{i,j}} d_g^2({{\mathrm{Exp}}}(p_{i,j}, t_{i,j,k}\cdot u_{i,j}), P_{i,j,k}), \end{aligned}$$

(2)
where $$d_g(\cdot , \cdot )$$ is the geodesic distance and $${{\mathrm{Exp}}}(\cdot ,\cdot )$$ denotes the exponential map on some manifold $$\mathcal {M}$$ [4]. This compact representation, $$(\hat{p}_{i,j}, \hat{u}_{i,j})$$, is a point in the tangent bundle $$\mathcal {TM}$$ of $$\mathcal {M}$$. $$\mathcal {TM}$$ is also a smooth manifold, which can be equipped with a Riemannian metric, such as the Sasaki metric [15]. Since each subject’s longitudinal data is represented as a point on $$\mathcal {TM}$$, we work in this space, instead of the space of the data points, to perform group testing.
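As an illustration of Eq. (2) on a concrete manifold, the sketch below performs geodesic regression on the unit sphere $$S^2$$, where the exponential map and the geodesic distance have closed forms; a generic `scipy` optimizer stands in for the dedicated solvers of [4, 7, 13], and all names are ours.

```python
# Sketch of Eq. (2) on S^2: minimize the sum of squared geodesic distances
# between Exp(p, t_k * u) and the observations P_k over (p, u).
import numpy as np
from scipy.optimize import minimize

def sphere_exp(p, v):
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p
    return np.cos(nv) * p + np.sin(nv) * v / nv

def sphere_dist(p, q):
    return np.arccos(np.clip(p @ q, -1.0, 1.0))

def geodesic_regression(times, points):
    """times: (T,), points: (T, 3) unit vectors. Returns (p_hat, u_hat)."""
    def energy(x):
        p = x[:3] / np.linalg.norm(x[:3])    # project base point onto S^2
        u = x[3:] - (x[3:] @ p) * p          # project velocity onto T_p S^2
        return sum(sphere_dist(sphere_exp(p, t * u), q) ** 2
                   for t, q in zip(times, points))

    x0 = np.concatenate([points[0], np.zeros(3)])
    res = minimize(energy, x0, method="Nelder-Mead")
    p = res.x[:3] / np.linalg.norm(res.x[:3])
    u = res.x[3:] - (res.x[3:] @ p) * p
    return p, u
```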

Principal Geodesic Analysis (PGA) for Trajectories. We generalize principal geodesic analysis to estimate the variance and the principal directions of trajectories on the tangent bundle for each group. We follow the definitions of the exponential- and the log-map on $$\mathcal {TM}$$ in [12] and use the Sasaki metric. Specifically, given two points $$(p_1, u_1), (p_2, u_2) \in \mathcal {TM}$$, the log-map outputs the tangent vector such that $$(v, w) = {{\mathrm{Log}}}_{(p_1, u_1)} (p_2, u_2)$$. The exponential map enables us to shoot forward with a given base point and a tangent vector, i.e., $$(p_2, u_2) = {{\mathrm{Exp}}}_{\mathcal {TM}}( (p_1, u_1), (v, w))$$. Furthermore, using the log-map, the geodesic distance on $$\mathcal {TM}$$ can be computed as $$d_{\mathcal {TM}}((p_1, u_1), (p_2, u_2)) = \Vert {{\mathrm{Log}}}_{(p_1, u_1)} (p_2, u_2)\Vert $$.
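As a sanity check, for $$\mathcal {M} = \mathbb {R}^n$$ the Sasaki metric on $$\mathcal {TM} \cong \mathbb {R}^{2n}$$ is flat, so the exp- and log-maps reduce to vector addition and subtraction. The sketch below encodes this special case, which recovers the Euclidean setting of Sect. 2; the general Sasaki maps of [12] require numerical integration and are not shown.

```python
# Sketch: Sasaki exp-/log-maps and TM distance in the flat special case
# M = R^n, where a point on TM is a concatenated vector (p, u) in R^{2n}.
import numpy as np

def log_tm_euclidean(x, y):
    return y - x                       # Log_x(y) is plain subtraction

def exp_tm_euclidean(x, v):
    return x + v                       # Exp_x(v) is plain addition

def dist_tm(x, y, log_tm=log_tm_euclidean):
    return np.linalg.norm(log_tm(x, y))   # geodesic distance on TM
```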

Before computing the variance and the principal directions, we first need to estimate the mean of the trajectories for each group. This is done by minimizing the sum-of-squared geodesic distances, for each group, on $$\mathcal {TM}$$ as


$$\begin{aligned} \forall i: (\bar{p}_i, \bar{u}_i) = {\mathop {\mathrm{arg~min}}\limits _{(p_i, u_i)}} \sum _{j = 1}^{S_i} d_{\mathcal {TM}}^2((p_i, u_i), (\hat{p}_{i,j}, \hat{u}_{i,j})). \end{aligned}$$

(3)
Then, following the PGA algorithm of [5], we compute the variance and principal directions w.r.t. the estimated mean of the trajectories. Specifically, we first compute the tangent vector from the mean of group i to the trajectory of its subject j, $$(v_{i,j}, w_{i,j}) = {{\mathrm{Log}}}_{(\bar{p}_i, \bar{u}_i)} (\hat{p}_{i,j}, \hat{u}_{i,j})$$, and then calculate the covariance matrix $$\varSigma _i = \frac{1}{S_i-1} \sum _{j=1}^{S_i} (v_{i,j}, w_{i,j}) (v_{i,j}, w_{i,j})^\top $$. The eigendecomposition of $$\varSigma _i$$ results in the eigenvalues $$\lambda _{i,q} \in {\mathbb {R}^+_0}$$ and eigenvectors $$(v_{i,q}, w_{i,q}) \in \mathcal {T}_{(\bar{p}_i, \bar{u}_i)}\mathcal {TM}$$ with $$q=1,\ldots ,Q_i$$ for group i. As a result, we can identify the distribution of trajectories of each group by $$D_i = \{(\bar{p}_i, \bar{u}_i), \varSigma _i\}$$ with $$i=1,\ldots ,N$$. By moving along a principal direction, we can generate points on $$\mathcal {TM}$$, each of which corresponds to a trajectory on the manifold of the data points.
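The following sketch assembles Eq. (3) and the PGA step, treating the Sasaki exp-/log-maps as black boxes (e.g., the flat versions above, or numerical implementations following [12, 15]); the fixed-step gradient descent for the Fréchet mean is a simplification of the solvers used in practice.

```python
# Sketch: Frechet mean on TM (Eq. (3)) via gradient descent, then PGA of
# the per-subject trajectory parameters around that mean.
import numpy as np

def frechet_mean(points, exp_tm, log_tm, n_iter=50, step=1.0):
    mean = points[0]
    for _ in range(n_iter):
        # Riemannian gradient step: average the log-map vectors and shoot
        grad = np.mean([log_tm(mean, x) for x in points], axis=0)
        mean = exp_tm(mean, step * grad)
    return mean

def pga(points, exp_tm, log_tm):
    mean = frechet_mean(points, exp_tm, log_tm)
    V = np.stack([log_tm(mean, x) for x in points])  # (S_i, dim) tangents
    cov = V.T @ V / (len(points) - 1)                # Sigma_i
    lam, W = np.linalg.eigh(cov)
    order = np.argsort(lam)[::-1]
    return mean, lam[order], W[:, order]             # mean, variances, PCs
```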

Generalized Bhattacharyya Distance. Since we can characterize the distribution of trajectories on $$\mathcal {TM}$$ for each group, we generalize the Bhattacharyya distance from Euclidean space to $$\mathcal {TM}$$ to measure the distance between these distributions. Again, the distribution $$D_i$$ on $$\mathcal {TM}$$ is identified by a mean $$\mu _i = (\bar{p}_i, \bar{u}_i) \in \mathcal {TM}$$ and a covariance matrix $$\varSigma _i$$ with respect to the mean $$\mu _i$$.
