\min_{\beta} \| Y - X\beta \|_F^2 \qquad (1)
where β is the d × q regression coefficient matrix. Solving (1) is equivalent to regressing each column of Y on X independently; thus, the relations between the columns of Y are ignored. To incorporate this information when estimating β, a state-of-the-art approach is to employ the SMR model [2]:
\min_{\beta} \| Y - X\beta \|_F^2 + \lambda \| \beta \|_{2,1} \qquad (2)
where ‖β‖_{2,1} = Σ_{i=1}^{d} ‖β_{i,:}‖_2 is the group LASSO penalty and each row of β, denoted β_{i,:}, corresponds to a feature. With the elements of each β_{i,:} taken as a group, only features associated with all q response variables are selected, i.e. with the corresponding β_{i,j} ≠ 0 for all j. To set λ, we search over 100 values in [λ_max, λ_min], where λ_max is the smallest λ for which all rows of β shrink to zero and λ_min is a fixed fraction of λ_max. The optimal λ is defined as the one that minimizes the prediction error over 1000 subsamples, with the data randomly split into 80 % for model training and 20 % for error evaluation. Problem (2) can be solved efficiently with GLMNET [13].
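For concreteness, the sketch below fits the SMR model in (2) over a λ grid and selects λ by repeated random 80/20 splits as described above. It uses scikit-learn's MultiTaskLasso (an ℓ2,1-penalized multi-response regression) as a stand-in for the GLMNET solver cited above; the grid endpoints, the number of subsamples, and the helper names (fit_smr, select_lambda) are illustrative choices rather than the exact settings used in this work.

```python
# Minimal sketch of the SMR model in (2) with subsample-based lambda selection.
# MultiTaskLasso stands in for GLMNET; grid endpoints and subsample count are
# illustrative.
import numpy as np
from sklearn.linear_model import MultiTaskLasso
from sklearn.model_selection import train_test_split

def fit_smr(X, Y, lam):
    """Solve (2) for a single lambda; returns beta with shape (d, q)."""
    return MultiTaskLasso(alpha=lam, max_iter=5000).fit(X, Y).coef_.T

def select_lambda(X, Y, lams, n_subsamples=100, seed=0):
    """Pick the lambda minimizing prediction error over random 80/20 splits."""
    rng = np.random.RandomState(seed)
    errors = np.zeros(len(lams))
    for _ in range(n_subsamples):
        Xtr, Xte, Ytr, Yte = train_test_split(
            X, Y, test_size=0.2, random_state=rng.randint(2**31 - 1))
        for i, lam in enumerate(lams):
            beta = fit_smr(Xtr, Ytr, lam)
            errors[i] += np.mean((Yte - Xte @ beta) ** 2)
    return lams[np.argmin(errors)]

# Usage: 100 lambdas spaced logarithmically between assumed endpoints.
# X, Y = ...                               # n x d predictors, n x q responses
# lams = np.logspace(0, -2, 100)           # stand-in for [lambda_max, lambda_min]
# beta_hat = fit_smr(X, Y, select_lambda(X, Y, lams))
```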
2.2 Stability Selection
A problem with assuming that all features with nonzero β_{i,:} in SMR are relevant is that this guarantee holds only under very restrictive conditions, which are largely violated in most real applications [3]. In particular, correct feature selection is not guaranteed when features are highly correlated, which is often the case for real data [3]. With correlated features, perturbations of the data can result in drastic changes in the set of selected features. Based on this observation, an intuitive approach for dealing with correlated features is to perturb the data and declare features that are consistently selected across perturbations as relevant, which is the basis of SS. We describe SS here for the case of SMR, but SS can be applied to any model equipped with a feature selection mechanism. Given X, Y, and [λ_max, λ_min], SS combined with Randomized LASSO proceeds as follows [9]:
1. Multiply each column of X by 0.5 or 1 selected at random.
2. Randomly subsample X and Y by half to generate X_s and Y_s.
3. For each λ in [λ_max, λ_min], apply SMR to generate β_s(λ). Let S_s(λ) be a d × 1 vector with elements corresponding to selected features, i.e. nonzero rows of β_s(λ), set to 1.
4. Repeat steps 2 and 3 S times, e.g. with S = 1000.
5. Compute the proportion of subsamples, π_i(λ), in which each feature i is selected for each λ in [λ_max, λ_min].
6. Declare feature i as significant if max_λ π_i(λ) > π_th.
The choice of π_th is linked to the expected number of falsely selected features through the bound [9]:

E[V] \le \frac{\gamma^2}{(2\pi_{th} - 1)\, d}, \qquad (3)

where V is the number of false positives and γ is the expected number of selected features, which can be approximated by the average number of features selected per subsample over the whole λ range, γ ≈ (1/S) Σ_{s=1}^{S} |∪_λ S_s(λ)|; ∪_λ denotes the union over λ. We highlight two insights on (3) that have major implications for applying SS in practice. First, (3) is a conservative bound on the family-wise error rate (FWER) = P(V ≥ 1), since P(V ≥ 1) ≤ E[V]. However, the π_th required by (3) to keep E[V] below a given level exceeds 1 whenever γ² is large relative to d; since π_i(λ) ∈ [0,1], π_th must then be clipped at 1, and whether FWER is still controlled in that case is unclear. Second, a key property of SS is that it does not require choosing a specific λ. However, for n/2 > d, a “small enough” λ_min could lead to all features being selected in all subsamples, resulting in max_λ π_i(λ) = 1. Hence, all features would be declared significant. λ selection is thus translated into λ_min selection, which warrants caution. An example from real data (Sect. 3.2) illustrating the impact of λ_min and π_th is shown in Fig. 1(a). Even with λ_min set to 0.1, i.e. a λ range that strongly encourages sparsity, a π_th of 0.9 (the strictest value in the range [0.6, 0.9] suggested in [9]) declares >40 % of the features as significant, i.e. fails to control for false positives.
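A compact sketch of the above procedure is given below, again with scikit-learn's MultiTaskLasso as a stand-in solver for (2); the values of S, the λ grid, and π_th are illustrative.

```python
# Minimal sketch of SS with Randomized LASSO (steps 1-6 above).
import numpy as np
from sklearn.linear_model import MultiTaskLasso

def smr_support(X, Y, lam):
    """Indicator of features with a nonzero row of beta for one lambda."""
    beta = MultiTaskLasso(alpha=lam, max_iter=5000).fit(X, Y).coef_.T
    return np.any(beta != 0, axis=1)               # length-d boolean vector

def stability_selection(X, Y, lams, S=1000, pi_th=0.9, seed=0):
    n, d = X.shape
    rng = np.random.RandomState(seed)
    Xw = X * rng.choice([0.5, 1.0], size=d)        # step 1: reweight columns
    counts = np.zeros((len(lams), d))
    for _ in range(S):                             # step 4: repeat steps 2-3
        idx = rng.choice(n, n // 2, replace=False) # step 2: subsample by half
        for i, lam in enumerate(lams):             # step 3: record selections
            counts[i] += smr_support(Xw[idx], Y[idx], lam)
    pi = counts / S                                # step 5: selection proportions
    return pi.max(axis=0) > pi_th                  # step 6: threshold the maxima
```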
Fig. 1. Behavior of SS and BPT on real data. (a) π_i(λ) at λ = 0.1 (for SS). (b) Gaussian fit on Studentized statistics (for BPT). (c) Gumbel fit on maxima of Studentized statistics (for BPT).
2.3 Bootstrapped Permutation Testing
For models with unknown parameter distributions, including those with no intrinsic feature selection mechanism, PT is often used to perform statistical inference. PT involves permuting the responses a large number of times (e.g. 10000 in this work) and relearning the model parameters for each permutation to generate null distributions of the parameters. Features with original parameter values greater than (or less than) a certain percentile of the null, e.g. >100·(1 − 0.025/d)th percentile (or <100·(0.025/d)th percentile), are declared significant. Equivalently, one can count the number of permutations with parameter values exceeding/below the original parameter values to generate approximate p-values. A key attribute of PT is that it imposes no distributional assumptions on the parameters, but the cost of this flexibility is the need for a large number of permutations to ensure that the resolution of the approximate p-values is fine enough for proper statistical testing, i.e. the smallest p-value attainable from N permutations is 1/N. Also, if the underlying parameter distribution is known, the associated parametric test is statistically more powerful [12].
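The counting-based p-values described above can be sketched as follows for a generic estimator; fit_params is a hypothetical placeholder for any parameter-fitting routine (e.g. an SMR fit), and the two-sided comparison on absolute values and the add-one convention are illustrative choices rather than settings specified in the text.

```python
# Sketch of approximate permutation p-values obtained by counting exceedances.
import numpy as np

def permutation_pvalues(X, Y, fit_params, N=10000, seed=0):
    rng = np.random.RandomState(seed)
    theta = fit_params(X, Y)                       # original parameter estimates
    exceed = np.zeros_like(theta, dtype=float)
    for _ in range(N):
        Yp = Y[rng.permutation(len(Y))]            # permute the responses
        exceed += np.abs(fit_params(X, Yp)) >= np.abs(theta)
    # Resolution is limited by N: the smallest attainable p-value is ~1/N.
    return (exceed + 1.0) / (N + 1.0)
```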
The central idea behind BPT is to generate Studentized statistics via bootstrapping, exploiting the fact that the variability of the parameter estimates associated with relevant features is likely to be higher when the responses are permuted. Similar to PT, BPT can be applied to any model. We describe here BPT in the context of SMR, which proceeds as follows (a code sketch is given after the procedure).
Estimation of the Studentized statistics:
1. Bootstrap X and Y with replacement B = 1000 times, and denote the bootstrap samples as X_b and Y_b.
2. Multiply each column of X_b by 0.5 or 1 selected at random.
3. Select the optimal λ for SMR by repeated random subsampling on X and Y, and apply SMR to X_b and Y_b for each bootstrap b with this λ to estimate β_b.
4. Compute the Studentized statistics as the mean of β_b over the bootstrap samples divided by σ_b, where σ_b is the standard deviation over the bootstrap samples.
Estimation of the null distribution of the Studentized statistics:
5. Permute Y N = 500 times.
6. For each permutation k, compute the Studentized statistics with the same λ, samples, and feature weighting used in each bootstrap b as in the non-permuted case.
7. Estimate the null distribution of the Studentized statistics from the permuted data by fitting a Gaussian (Fig. 1(b)) and a Gumbel distribution to their maxima over features (Fig. 1(c)), and compute the corresponding p-values, where Φ(·) is the normal cumulative distribution function.
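The sketch below strings steps 1-7 together under several assumptions not fixed by the text: scikit-learn's MultiTaskLasso again stands in for GLMNET, each feature is summarized by the ℓ2 norm of its row of β before Studentizing, and p-values are obtained from a Gaussian fit to the permuted statistics plus a Gumbel fit to their per-permutation maxima (cf. Fig. 1(b) and (c)); B and N follow the values quoted above.

```python
# Sketch of BPT for SMR (steps 1-7 above).
import numpy as np
from scipy.stats import norm, gumbel_r
from sklearn.linear_model import MultiTaskLasso

def fit_beta(X, Y, lam):
    return MultiTaskLasso(alpha=lam, max_iter=5000).fit(X, Y).coef_.T  # (d, q)

def studentize(X, Y, lam, draws):
    """Per-feature mean/std of the bootstrapped row norms of beta (steps 3-4)."""
    stats = np.array([np.linalg.norm(fit_beta(X[idx] * w, Y[idx], lam), axis=1)
                      for idx, w in draws])
    return stats.mean(axis=0) / (stats.std(axis=0) + 1e-12)

def bpt(X, Y, lam, B=1000, N=500, seed=0):
    n, d = X.shape
    rng = np.random.RandomState(seed)
    # Steps 1-2: bootstrap samples and random 0.5/1 column weights, kept so the
    # permuted runs reuse exactly the same samples and weights (step 6).
    draws = [(rng.choice(n, n, replace=True), rng.choice([0.5, 1.0], size=d))
             for _ in range(B)]
    z = studentize(X, Y, lam, draws)               # steps 3-4
    # Steps 5-6: recompute the statistics with permuted responses.
    z_null = np.array([studentize(X, Y[rng.permutation(n)], lam, draws)
                       for _ in range(N)])
    # Step 7: Gaussian fit for per-feature p-values; Gumbel fit on the maxima
    # over features for an FWER-corrected variant.
    mu, sd = norm.fit(z_null.ravel())
    loc, scale = gumbel_r.fit(z_null.max(axis=1))
    return norm.sf(z, loc=mu, scale=sd), gumbel_r.sf(z, loc=loc, scale=scale)
```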