Weighted Principal Component Regression for High-Dimensional Prediction
1 Introduction
We consider the problem of predicting a set of response variables by using high-dimensional data measured on a graph G = (V, E), where E is the edge set of G and V is a set of vertexes, in which m is the total number of vertexes in V. The response may include cognitive outcome, disease status, and the early onset of disease, among others. Standard graphs, including both directed and undirected graphs, have been widely used to build complex patterns [10]. Examples of graphs are linear graphs, tree graphs, triangulated graphs, and 2-dimensional (2D) or 3-dimensional (3D) lattices, among many others (Fig. 1). Examples of data measured on a graph include time series and genetic data measured on linear graphs and imaging data measured on triangulated graphs (or lattices). In particular, various structural and functional neuroimaging data are frequently measured in a 3D lattice for the understanding of brain structure and function and their association with neuropsychiatric and neurodegenerative disorders [9].
Fig. 1.
Illustration of graph data structures: (a) two-dimensional lattice; (b) acyclic directed graph; (c) tree; (d) undirected graph.
The aim of this paper is to develop a new framework of spatially weighted principal component regression (SWPCR) that uses the high-dimensional data on the graph G to predict the responses. Four major challenges arising from such development include ultra-high dimensionality, low sample size, spatial correlation, and spatial smoothness. SWPCR is developed to address these four challenges when high-dimensional data on graphs share two important features: spatial smoothness and an intrinsically low-dimensional structure. Compared with the existing literature, we make several major contributions as follows:
(i) SWPCR is designed to efficiently capture the two important features by using some recent advances in smoothing methods, dimension reduction methods, and sparse methods.
(ii) SWPCR provides a powerful dimension reduction framework for integrating feature selection, smoothing, and feature extraction.
(iii) SWPCR significantly outperforms competing methods in both simulation studies and real data analysis.
2 Spatially Weighted Principal Component Regression
In this section, we first describe the graph data considered in this paper and then formally describe the general framework of SWPCR.
2.1 Graph Data
Consider data from n independent subjects. For the i-th subject, we observe a vector of discrete or continuous responses, denoted by y_i ∈ R^q, and a vector of high-dimensional data x_i = {x_i(g) : g ∈ V} ∈ R^m, for i = 1, ..., n. In many cases, q is relatively small compared with n, whereas m is much larger than n. For instance, in many neuroimaging studies, it is common to use ultra-high dimensional imaging data to classify a binary class variable. In this case, q = 1, whereas m can be several million features. In many applications, V is a set of prefixed vertexes, such as voxels in 2D or 3D lattices, whereas the edge set E may be either prefixed or determined by the x_i (or other data).
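As a concrete illustration of this data layout (a toy example of our own, not taken from the paper), consider n subjects observed on a small 2D lattice of voxels; the sketch below sets up the assumed array shapes and a prefixed edge set built from 4-neighbor adjacency.

```python
import numpy as np

n, q = 50, 1                   # subjects and response dimension
nx, ny = 32, 32                # 2D lattice of voxels
m = nx * ny                    # total number of vertexes

y = np.random.binomial(1, 0.5, size=(n, q))   # e.g., binary disease status
X = np.random.randn(n, m)                     # imaging data, one row per subject

# Prefixed edge set E: 4-neighbor adjacency on the 2D lattice.
edges = []
for i in range(nx):
    for j in range(ny):
        g = i * ny + j
        if i + 1 < nx:
            edges.append((g, (i + 1) * ny + j))
        if j + 1 < ny:
            edges.append((g, i * ny + j + 1))
```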
2.2 SWPCR
We introduce a three-stage algorithm for SWPCR that uses the high-dimensional data x_i to predict the response variables y_i. The key stages of SWPCR can be described as follows.
Stage 1. Build an importance score vector (or function) and a spatial weight matrix (or function).
Stage 2. Build a sequence of scale vectors ranging from the smallest scale vector to the largest one. At each scale vector, use generalized principal component analysis (GPCA) to compute the first few principal components of the n × m data matrix X = [x_1, ..., x_n]^T, based on the importance score vector and the spatial weight matrix at that scale.
Stage 3. Select the optimal scale vector and build a prediction model (e.g., a high-dimensional linear model) based on the extracted principal components and the responses y_i.
We now elaborate on these stages. In Stage 1, the importance scores play a key feature-screening role in SWPCR. In the literature, such scores are typically generated from statistics (e.g., Pearson correlation or distance correlation) between y_i and x_i(g) at each vertex g. For instance, let p(g) be the Pearson correlation at vertex g and then define
(1)
In Stage 1, without loss of generality, we focus on a symmetric spatial weight matrix throughout the paper. Its (g, g') element is usually calculated by using various similarity criteria, such as Gaussian similarity based on Euclidean distance, local neighborhood relationships, correlation, and prior information obtained from other data [21]. In Sect. 2.3, we will discuss how to determine the importance scores and the spatial weights while explicitly accounting for the complex spatial structure among different vertexes.
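As a minimal sketch of Stage 1, assuming importance scores based on absolute Pearson correlations with a scalar response and Gaussian similarity weights computed from vertex coordinates (the function and variable names below are our own, not the paper's notation):

```python
import numpy as np

def importance_scores(X, y):
    """Absolute Pearson correlation between each vertex's data and a
    scalar response; X is n x m, y has length n."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(num / den)

def gaussian_spatial_weights(coords, bandwidth=1.0):
    """Gaussian similarity w(g, g') = exp(-||coord_g - coord_g'||^2 / bandwidth),
    computed densely from vertex coordinates (m x d array); only practical
    for small graphs as written."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / bandwidth)
```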
In Stage 2, at each scale vector, we construct two scale-specific weight matrices from the importance score vector and the spatial weight matrix as follows:
(2)
where the two transformations are known functions. For instance, letting the transformation of the importance scores be an indicator function, we may set
(3)
to extract 'significant' vertexes. There are various ways of constructing the scale-specific spatial weight matrix. For instance, one may truncate the spatial weights by using a graph-based distance d(g, g') between vertexes g and g': one component of the scale vector controls the number of vertexes in the patch set at vertex g [18], whereas another is used to shrink small weights to zero.
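Since the exact forms of Eqs. (2)-(3) are not reproduced here, the following is only a plausible sketch of these truncation rules: keep vertexes whose importance score exceeds a screening threshold, and zero out spatial weights outside a distance-defined patch or below a shrinkage threshold. The scale components s_imp, s_dist, and s_shrink are hypothetical names.

```python
import numpy as np

def scale_weight_matrices(imp_score, spatial_weight, dist, s_imp, s_dist, s_shrink):
    """Build scale-specific weight matrices (assumed thresholding forms).

    imp_score      : length-m importance scores from Stage 1
    spatial_weight : m x m spatial weight matrix from Stage 1
    dist           : m x m graph-based distances between vertexes
    """
    # Indicator-type screening: keep vertexes with large importance scores.
    w_imp = (imp_score > s_imp).astype(float)

    # Restrict spatial weights to a patch around each vertex and shrink
    # small weights to zero.
    w_sp = np.where(dist <= s_dist, spatial_weight, 0.0)
    w_sp = np.where(w_sp >= s_shrink, w_sp, 0.0)
    return w_imp, w_sp
```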
After determining these two weight matrices, we form the corresponding weighting operators, taking the subject-level weighting matrix to be the identity for independent subjects. Let X_C be the column-centered version of X. Then we can extract K principal components by minimizing the following objective function:
(4)
If we consider correlated observations from multiple subjects, we may replace this identity matrix with one that explicitly models their correlation structure. The solution of the objective function (4) is given by the singular value decomposition (SVD) of the weighted, centered data matrix. We can then use a GPCA algorithm to simultaneously calculate all components for a fixed K. In practice, a simple criterion for determining K is to include all components up to some arbitrary proportion of the total variance, say 85%.
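A minimal sketch of this GPCA step, under the assumption (consistent with the statement above) that the weighted decomposition reduces to an SVD of the centered data matrix after square-root weighting of the columns; weighted_pca and its arguments are our own names, and the exact weighting in Eq. (4) may differ.

```python
import numpy as np

def weighted_pca(X, w_imp, w_sp, K, var_prop=None):
    """Principal components of the weighted, centered data matrix.

    X     : n x m data matrix
    w_imp : length-m scaled importance weights (used as a diagonal)
    w_sp  : m x m scaled spatial weight matrix (symmetric PSD assumed)
    """
    Xc = X - X.mean(axis=0)                       # center each column
    # Assumed column weighting: right-multiply by w_sp^{1/2} diag(w_imp)^{1/2}.
    evals, evecs = np.linalg.eigh(w_sp)
    W_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
    Xw = Xc @ W_half @ np.diag(np.sqrt(w_imp))
    U, d, Vt = np.linalg.svd(Xw, full_matrices=False)
    if var_prop is not None:                      # e.g., keep 85% of variance
        frac = np.cumsum(d ** 2) / np.sum(d ** 2)
        K = int(np.searchsorted(frac, var_prop) + 1)
    return U[:, :K] * d[:K], Vt[:K].T             # scores, loadings
```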
For ultra-high dimensional data, we consider a regularized GPCA that generates sparse components by minimizing the following objective function:
(5)
subject to normalization constraints for all k, where the constrained vectors are the k-th columns of the left and right component matrices. We use adaptive Lasso penalties for these columns and then iteratively solve (5) [1]. For each k, we define the corresponding rank-one subproblem and minimize
(6)
subject to the same normalization constraints. By using the sparse method in [12], we can calculate the solution of (6). In this way, we can sequentially compute all K components.
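The regularized objective (5)-(6) is not reproduced above, so the sketch below uses a standard stand-in rather than the authors' adaptive-Lasso GPCA: sparse rank-one factors obtained by alternating power iterations with soft-thresholding, followed by deflation. It illustrates the sequential, penalized extraction described in this stage.

```python
import numpy as np

def soft_threshold(a, lam):
    """Element-wise soft-thresholding operator."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_rank_one_pca(X, K, lam=0.1, n_iter=100):
    """Sequentially extract K sparse rank-one factors of X (n x m) by
    alternating updates with an L1 shrinkage on the loadings."""
    R = X.copy()
    loadings = []
    for _ in range(K):
        v = R[np.argmax(np.linalg.norm(R, axis=1))]   # initial loading vector
        v = v / (np.linalg.norm(v) + 1e-12)
        for _ in range(n_iter):
            u = R @ v
            u = u / (np.linalg.norm(u) + 1e-12)
            v_new = soft_threshold(R.T @ u, lam)
            nv = np.linalg.norm(v_new)
            if nv == 0:
                break                                  # penalty too strong
            v = v_new / nv
        d = u @ R @ v
        R = R - d * np.outer(u, v)                     # deflate and continue
        loadings.append(v)
    return np.array(loadings).T                        # m x K sparse loadings
```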
In Stage 3, we select the optimal scale vector as the minimum point of the objective function (5) or (6) and collect the corresponding K principal components. Moreover, K is usually much smaller than m. Then, we build a regression model with y_i as responses and the i-th row of the component score matrix as covariates, where the model involves a vector of unknown (finite-dimensional or nonparametric) parameters. Specifically, based on the data, we use an estimation method to estimate these parameters as follows:
where the first term is a loss function, which depends on both the regression model and the data, and the second term is a penalty function, such as the Lasso. This leads to a prediction model for a new subject. For instance, for a binary response y_i = 1 or 0, we may consider a sparse logistic model for the probability that y_i = 1 given the extracted components, for i = 1, ..., n.
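As a sketch of this last step, the snippet below fits an L1-penalized (sparse) logistic regression of a binary response on the extracted component scores and uses it for prediction; scikit-learn's LogisticRegression is a stand-in for whatever penalized estimator the loss-plus-penalty formulation above actually employs.

```python
from sklearn.linear_model import LogisticRegression

def fit_stage3(scores, y, C=1.0):
    """Sparse logistic prediction model on the K extracted components.

    scores : n x K matrix of component scores (one row per subject)
    y      : length-n binary response (0/1), e.g., disease status
    """
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(scores, y)
    return model

# Usage sketch: predict disease status for new subjects whose data have been
# projected onto the same K components.
# model = fit_stage3(train_scores, y_train)
# y_hat = model.predict(test_scores)
```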