for n points. Since this algorithm has the same complexity as the spectral analysis, it cannot be directly used as a subset selection scheme.
In this paper, we focus on nonlinear dimensionality reduction for large datasets via manifold learning. Popular manifold learning techniques include kernel PCA, Isomap [19], and Laplacian eigenmaps [5]. All of these methods are based on a kernel matrix of size $n \times n$ that contains the information about the pairwise relationships between the input points. The spectral decomposition of the kernel matrix leads to the low-dimensional embedding of the points. For large $n$, one seeks to avoid the explicit construction and storage of this matrix. In contrast to general rank-$k$ matrix approximation, this is possible by taking the nature of the non-linear dimensionality reduction into account and relating the entries of the kernel matrix directly to the original point set.
We propose to perform DPP sampling on the original point set to extract a diverse set of landmarks. Since the input points lie in a non-Euclidean space, ignoring the underlying geometry leads to poor results. To account for the non-Euclidean geometry of the input space, we replace the Euclidean distance with the geodesic distance along the manifold, which is approximated by the shortest path distance on the neighborhood graph. Due to the high complexity of exact DPP sampling, we derive an efficient approximation that runs in $O(ndk)$ for input dimensionality $d$ and subset cardinality $k$. The algorithm restricts the updates to be local, which enables sampling on complex geometries. This, together with its low computational complexity, makes the algorithm well suited for subset selection in large scale manifold learning.
A consequence of the landmark selection is that the manifold is less densely sampled than before, making its approximation with neighborhood graphs more difficult. It was noted in [2], as a critical response to [19], that the approximation of manifolds with graphs is topologically unstable. In order to improve the graph construction, we retain the local geometry around each landmark by locally estimating the covariance matrix on the original point set. This allows us to compare multivariate Gaussian distributions with the Bhattacharyya distance for neighborhood selection, yielding improved embeddings.
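To make this criterion concrete, the following minimal sketch (not from the paper) fits a Gaussian to the local neighborhood of each landmark and evaluates the closed-form Bhattacharyya distance between two such Gaussians; the neighborhood size `m`, the ridge term `eps`, and the function names are illustrative assumptions.

```python
import numpy as np

def local_gaussian(X, center_idx, m=20, eps=1e-6):
    """Fit mean and covariance to the m nearest neighbors of a landmark.

    X is the d x n data matrix (columns are points); the ridge eps keeps the
    covariance invertible when the neighborhood is small.
    """
    d2 = np.sum((X - X[:, [center_idx]]) ** 2, axis=0)
    nbrs = np.argsort(d2)[:m]
    P = X[:, nbrs]
    return P.mean(axis=1), np.cov(P) + eps * np.eye(X.shape[0])

def bhattacharyya(mu1, S1, mu2, S2):
    """Closed-form Bhattacharyya distance between two multivariate Gaussians."""
    S = 0.5 * (S1 + S2)
    diff = mu1 - mu2
    maha = 0.125 * diff @ np.linalg.solve(S, diff)
    logdet = 0.5 * (np.linalg.slogdet(S)[1]
                    - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))
    return maha + logdet
```

Landmarks whose local Gaussians are close under this distance would then be connected in the neighborhood graph.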
2 Background
We assume $n$ points in high dimensional space $\mathbb{R}^d$ and let $X = [x_1, \ldots, x_n]$ be the $d \times n$ matrix whose $i$-th column represents point $x_i$. Non-linear dimensionality reduction techniques are based on a positive semidefinite kernel $K$, with a typical choice of Gaussian or heat kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$. The resulting kernel matrix is of size $n \times n$. The eigen decomposition of the kernel matrix is necessary for spectral analysis. Unfortunately, its complexity is $O(n^3)$. Most techniques require only the top $k$ eigenvectors. The problem can therefore also be viewed as finding the best rank-$k$ approximation of the matrix $K$, with the optimal solution $K_k = \sum_{i=1}^{k} \lambda_i v_i v_i^\top$, where $\lambda_i$ is the $i$-th largest eigenvalue and $v_i$ is the corresponding eigenvector.
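For reference, a minimal sketch of this baseline computation is given below: it builds the full $n \times n$ Gaussian kernel matrix and extracts the top $k$ eigenpairs, i.e., exactly the $O(n^2)$ storage and $O(n^3)$ decomposition that landmark selection seeks to avoid. The bandwidth `sigma` is an illustrative choice.

```python
import numpy as np
from scipy.linalg import eigh

def gaussian_kernel_matrix(X, sigma=1.0):
    """Full n x n Gaussian kernel for a d x n data matrix X."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)      # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def top_k_eigenpairs(K, k):
    """Largest k eigenvalues/eigenvectors of the symmetric kernel matrix (O(n^3))."""
    n = K.shape[0]
    vals, vecs = eigh(K, subset_by_index=[n - k, n - 1])  # returned in ascending order
    return vals[::-1], vecs[:, ::-1]                       # reorder: largest first
```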
2.1 Nyström Method
Suppose $J \subset \{1, \ldots, n\}$ is a subset of the original point set of size $k$ and $\bar{J}$ is its complement. We can reorder the kernel matrix $K$ such that

$$K = \begin{pmatrix} K_J & K_{J,\bar{J}} \\ K_{\bar{J},J} & K_{\bar{J}} \end{pmatrix}, \qquad (1)$$

where $K_J$ denotes the $k \times k$ sub-matrix of the selected landmarks. The Nyström extension leads to the approximation

$$\tilde{K} = \begin{pmatrix} K_J & K_{J,\bar{J}} \\ K_{\bar{J},J} & K_{\bar{J},J} K_J^{-1} K_{J,\bar{J}} \end{pmatrix}, \qquad (2)$$

where $K_{\bar{J},J} K_J^{-1} K_{J,\bar{J}}$ is the matrix estimated via the Nyström method [21]. The matrix inverse is replaced by the Moore-Penrose generalized inverse in the case of rank deficiency. The Nyström method leads to the minimal kernel completion [3] conditioned on the selected landmarks and has been reported to perform well in numerous applications [8, 13, 18]. The challenge lies in finding landmarks that minimize the reconstruction error $\|K - \tilde{K}\|_{\mathrm{tr}}$. The trace norm is applied because the results only depend on the spectrum due to its unitary invariance.
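A minimal sketch of the reconstruction in (2), assuming the landmark index set J is already given, might look as follows; for illustration it starts from the full matrix K, whereas in practice only the landmark rows and columns would be evaluated.

```python
import numpy as np

def nystrom_reconstruction(K, J):
    """Nystrom approximation of K from landmark indices J (illustrative sketch)."""
    J = np.asarray(J)
    Jbar = np.setdiff1d(np.arange(K.shape[0]), J)
    K_J = K[np.ix_(J, J)]                           # k x k landmark block
    K_bJ = K[np.ix_(Jbar, J)]                       # (n-k) x k cross block
    # Pseudo-inverse handles a rank-deficient landmark block.
    C_hat = K_bJ @ np.linalg.pinv(K_J) @ K_bJ.T     # estimated (n-k) x (n-k) block
    K_tilde = K.copy()
    K_tilde[np.ix_(Jbar, Jbar)] = C_hat             # reassemble in the original ordering
    return K_tilde
```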
2.2 Annealed Determinantal Sampling
A large variety of methods have been proposed for selecting the subset $J$. For general matrix approximation, this step is referred to as row/column selection of the matrix $K$; in our setting, it is equivalent to selecting a subset of the points $X$. This equivalence is important because it avoids explicit computation of the entries in the kernel matrix $K$. We focus on volume sampling for subset selection because of its theoretical advantages [9]. We employ the factorization $K = Y^\top Y$, which exists because $K$ is positive semidefinite. Columns $y_i$ of $Y$ can be thought of as feature vectors describing the points. Based on this factorization, the volume of the simplex spanned by the origin and the feature vectors $\{y_i\}_{i \in J}$ is calculated, which is, up to a constant factor, the volume $\mathrm{Vol}(\{y_i\}_{i \in J})$ of the parallelepiped spanned by $\{y_i\}_{i \in J}$. The subset $J$ is then sampled proportionally to the squared volume. This is directly related to the calculation of the determinant with $\det(K_J) = \mathrm{Vol}(\{y_i\}_{i \in J})^2$. These ideas were further generalized in [3] based on annealed determinantal distributions

$$p^s(J) \propto \det(K_J)^s. \qquad (3)$$

This distribution is well defined because the principal submatrices of a positive semidefinite matrix are themselves positive semidefinite. Varying the exponent $s$ results in a family of distributions, modeling the annealing behavior as used in stochastic computations. For $s = 0$, this is equivalent to uniform sampling [21]. In the following derivations, we focus on $s = 1$. It was shown in [9] that for $s = 1$

$$\mathbb{E}_{p^1}\!\left[\, \|K - \tilde{K}_J\|_F^2 \,\right] \;\le\; (k+1)\, \|K - K_k\|_F^2, \qquad (4)$$

where $\tilde{K}_J$ is the Nyström reconstruction of the kernel based on the subset $J$, $K_k$ is the best rank-$k$ approximation achieved by selecting the $k$ largest eigenvectors, and $\|\cdot\|_F$ is the Frobenius norm. It was further shown that the factor $(k+1)$ is the best possible for a $k$-subset. Related bounds were presented in [4].
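To illustrate the distribution in (3) for $s = 1$, the following brute-force sketch enumerates all $k$-subsets and samples one proportionally to $\det(K_J)$; it is exponential in $n$ and only meant to make the definition concrete, not to be used in practice.

```python
import numpy as np
from itertools import combinations

def volume_sample(K, k, rng=None):
    """Draw a k-subset J with probability proportional to det(K_J) (s = 1).

    Brute-force enumeration over all k-subsets: illustration only, small n.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = K.shape[0]
    subsets = list(combinations(range(n), k))
    weights = np.array([np.linalg.det(K[np.ix_(J, J)]) for J in subsets])
    probs = weights / weights.sum()
    return subsets[rng.choice(len(subsets), p=probs)]
```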
Fig. 1. DPP sampling from 1,000 points lying on a manifold. We show results for standard DPP sampling, geodesic DPP sampling, and efficient DPP sampling. Note that the sampling is performed in 3D, but we can plot the underlying 2D manifold by reversing the construction of the Swiss roll. Geodesic and efficient sampling yield a diverse subset from the manifold.
3 Method
We first analyze the sampling from determinantal distributions on non-Euclidean geometries. We then introduce an efficient algorithm for approximate DPP sampling on manifolds. Finally, we present our approach for robust graph construction on sparsely sampled manifolds.
3.1 DPP Sampling on Manifolds
As described in Sect. 2.2, sampling from determinantal distributions is used for row/column selection. Independently, determinantal point processes (DPPs) were introduced for modeling probabilistic mutual exclusion. They present an attractive scheme for ensuring diversity in the selected subset. Here we work with the construction of DPPs based on L-ensembles [7]. Given a positive semidefinite matrix $L \in \mathbb{R}^{n \times n}$, the likelihood for selecting the subset $J \subseteq \{1, \ldots, n\}$ is

$$P_L(J) = \frac{\det(L_J)}{\det(L + I)}, \qquad (5)$$

where $I$ is the identity matrix and $L_J$ is the sub-matrix of $L$ containing the rows and columns indexed by $J$. By associating the L-ensemble matrix $L$ with the kernel matrix $K$, we can apply DPPs to sample subsets from the point set $X$.
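Read literally, (5) can be evaluated as in the short sketch below for a dense $L$ of moderate size; `slogdet` keeps the determinant ratio numerically stable.

```python
import numpy as np

def dpp_log_likelihood(L, J):
    """log P(J) = log det(L_J) - log det(L + I) for an L-ensemble DPP."""
    _, logdet_J = np.linalg.slogdet(L[np.ix_(J, J)])
    _, logdet_norm = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return logdet_J - logdet_norm
```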
To date, applications using determinantal point processes have assumed Euclidean geometry [15]. For non-linear dimensionality reduction, we assume that the data points lie in a non-Euclidean space, such as the Swiss roll in Fig. 1(a). To evaluate the performance of DPPs on manifolds, we sample from the Swiss roll. Since we know the construction rule in this case, we can invert it and display the sampled 3D points in the underlying 2D space. The result in Fig. 1(b) shows that the inner part of the roll is almost entirely neglected, as a consequence of not taking the manifold structure into account. A common solution is to use geodesic distances [19], which can be approximated by the graph shortest path algorithm. We replace the Euclidean distance in the construction of the kernel matrix K with the geodesic distance to obtain the result in Fig. 1(c). We observe a clear improvement in the diversity of the sampling, now also including points in the interior part of the Swiss roll.
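A minimal sketch of this substitution, assuming a k-nearest-neighbor graph is an adequate proxy for the manifold (as in Isomap), could look as follows; the neighborhood size, bandwidth, and use of SciPy's shortest-path routine are illustrative choices.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path
from scipy.spatial import cKDTree

def geodesic_kernel(X, n_neighbors=10, sigma=1.0):
    """Gaussian kernel on approximate geodesic distances for a d x n data matrix X."""
    pts = X.T
    tree = cKDTree(pts)
    dist, idx = tree.query(pts, k=n_neighbors + 1)   # first neighbor is the point itself
    n = pts.shape[0]
    rows = np.repeat(np.arange(n), n_neighbors)
    graph = csr_matrix((dist[:, 1:].ravel(), (rows, idx[:, 1:].ravel())), shape=(n, n))
    # Shortest paths on the undirected kNN graph approximate geodesic distances;
    # disconnected pairs get infinite distance and hence kernel value 0.
    geo = shortest_path(graph, method="D", directed=False)
    return np.exp(-geo ** 2 / (2.0 * sigma ** 2))
```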
3.2 Efficient Approximation of DPP Sampling on Manifolds
While it is possible to adapt determinantal sampling to non-Euclidean geometries and to characterize the error for the subset selection, we still lack an efficient sampling algorithm for handling large point sets. In [9], an approximate sampling based on the Markov chain Monte Carlo method is proposed to circumvent the combinatorial problem of $\binom{n}{k}$ possible subsets. Further approximations include sampling proportionally to the diagonal elements $K_{ii}$ or their squared version $K_{ii}^2$, leading to additive error bounds [4, 11]. In [10], an algorithm is proposed that yields a $k!$ approximation to volume sampling, worsening the approximation factor from $(k+1)$ to $(k+1)!$.
An exact sampling algorithm for DPPs was presented in [14, 15], which requires the eigen decomposition of the kernel matrix $K$. Algorithm 1 states an equivalent formulation of this sampling approach. First, eigenvectors are selected proportionally to the magnitude of their eigenvalues and stored as columns in $V$. Assuming $m$ vectors are selected, $V \in \mathbb{R}^{n \times m}$. We use $v_i \in \mathbb{R}^m$ to denote the $i$-th row of $V$. In each iteration, we select one of the $n$ points, where point $i$ is selected proportionally to the squared norm $\|v_i\|^2$. The selected point is added to the subset $J$. After the selection of $i$, all rows are projected onto the orthogonal complement of $v_i$. Since $v_i$ is mapped to the zero vector by this projection, the same point is almost surely not selected twice. The update formulation differs from [15], where an orthonormal basis of the eigenvectors in $V$ perpendicular to the $i$-th standard basis vector is constructed. Both formulations are equivalent but provide a different point of view on the algorithm. This modification is essential to motivate the proposed efficient sampling procedure. The following proposition characterizes the behavior of the update rule in the algorithm.
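Before stating it, a dense reference sketch of this exact sampler in the row-projection formulation is given below. It assumes the standard L-ensemble convention from the DPP literature that eigenvector $i$ is retained with probability $\lambda_i / (\lambda_i + 1)$, and it performs the full $O(n^3)$ eigen decomposition; it is not the efficient approximation proposed in this paper.

```python
import numpy as np

def sample_dpp(K, rng=None):
    """Exact DPP sample from an L-ensemble kernel K (dense sketch, O(n^3)).

    Phase 1: keep eigenvector i with probability lam_i / (lam_i + 1).
    Phase 2: repeatedly pick a point proportionally to the squared row norms of V,
             then project all rows onto the orthogonal complement of the chosen row.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam, U = np.linalg.eigh(K)                       # eigen decomposition of K
    keep = rng.random(lam.shape[0]) < lam / (lam + 1.0)
    V = U[:, keep]                                   # n x m, selected eigenvectors as columns
    J = []
    for _ in range(V.shape[1]):                      # exactly m points are sampled
        p = np.sum(V ** 2, axis=1)                   # squared row norms ||v_i||^2
        p /= p.sum()
        i = rng.choice(len(p), p=p)
        J.append(i)
        v = V[i].copy()
        V -= np.outer(V @ v, v) / (v @ v)            # project rows onto complement of v_i
    return sorted(J)
```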