DiffusionMap
C++ implementation of the Diffusion Map, with Python bindings
view repo
Diffusion maps are a nonlinear manifold learning technique based on harmonic analysis of a diffusion process over the data. Out-of-sample extensions with computational complexity O(N), where N is the number of points comprising the manifold, frustrate applications to online learning applications requiring rapid embedding of high-dimensional data streams. We propose landmark diffusion maps (L-dMaps) to reduce the complexity to O(M), where M ≪ N is the number of landmark points selected using pruned spanning trees or k-medoids. Offering (N/M) speedups in out-of-sample extension, L-dMaps enables the application of diffusion maps to high-volume and/or high-velocity streaming data. We illustrate our approach on three datasets: the Swiss roll, molecular simulations of a C_24H_50 polymer chain, and biomolecular simulations of alanine dipeptide. We demonstrate up to 50-fold speedups in out-of-sample extension for the molecular systems with less than 4 manifold reconstruction fidelity relative to calculations over the full dataset.
READ FULL TEXT VIEW PDF
Non-linear manifold learning enables high-dimensional data analysis, but...
read it
Diffusion Maps is a nonlinear dimensionality reduction technique used to...
read it
While spectral embedding is a widely applied dimension reduction techniq...
read it
Diffusion maps are an emerging data-driven technique for non-linear
dime...
read it
In this paper, a novel surrogate model based on the Grassmannian diffusi...
read it
In this work we introduce a manifold learning-based method for uncertain...
read it
Automatic image annotation is one of the most challenging problems in ma...
read it
C++ implementation of the Diffusion Map, with Python bindings
Linear and nonlinear dimensionality reduction algorithms have found broad applications in diverse application domains including computer vision
^{1}, recommendation engines^{2}^{3}, and protein folding^{4}. These techniques all seek to discover within high-dimensional datasets low-dimensional projections that preserve the important collective variables governing the global system behavior and to which the remaining degrees of freedom are effectively slaved
^{5, 6, 7, 8, 9}. Efficient and robust identification of low-dimensional descriptions is of value in improved time-series forecasting^{10}, understanding protein folding pathways^{11}, and generating automated product recommendations^{12}.Linear techniques such as principal component analysis (PCA)^{13}, multidimensional scaling^{14}, and random projection^{15}, have proven successful due to their simplicity and relative efficiency in identifying these low-dimensional manifolds. Nonlinear methods, such as Isomap^{16}, LLE^{17}, and diffusion maps^{18}, have proven successful in areas with more complex manifolds where linear techniques fail, trading the simplicity and efficiency of linear methods for the capability to discover more complex nonlinear relationships and provide more parsimonious representations. Of these methodologies, diffusion maps have received substantial theoretical and applied interest for their firm mathematical grounding in the harmonic analysis of diffusive modes over the high-dimensional data^{19, 18}
. Specifically, under relatively mild assumptions on the data, the collective order parameters spanning the low-dimensional nonlinear projection discovered by diffusion maps correspond to the slowest dynamical modes of the relaxation of a probability distribution evolving under a random walk over the data
^{19, 18, 20, 21}. This endows the low-dimensional embedding discovered by diffusion maps with two attractive properties. First, “diffusion distances” in the high-dimensional space measuring the accessibility of one system configuration from another map to Euclidean distances in the low-dimensional projection ^{19, 20, 21}. Second, the collective order parameters supporting the low-dimensional embedding correspond to the important nonlinear collective modes containing the most variance in the data
^{21}. These properties of the diffusion map have been exploited in diverse applications, including the understanding of protein folding and dynamics^{11, 22}, control of self-assembling Janus colloids^{23}, tomographic image reconstruction^{24}, image completion^{25}, graph matching^{26}, and automated tissue classification in pathology slices^{19}.Practical application of diffusion maps typically involves two distinct but related operations: (i) analysis of a high-dimensional dataset in to discover and define a low-dimensional projection in where
and (ii) projection of new out-of-sample points into the low-dimensional manifold. Discovery of the manifold requires calculation of pairwise distances between all data points to construct a Markov matrix over the high-dimensional data, and subsequent diagonalization of this matrix to perform a spectral decomposition of the corresponding random walk
^{20, 27, 26}. This eigendecomposition yields an ordered series of increasingly faster relaxing modes of the random walk, and the identification of a gap in the eigenvalue spectrum of implied time scales informs the effective dimensionality of the underlying low-dimensional manifold and collective coordinates with which to parametrize it
^{19, 18, 20, 9, 28, 21}. Exact calculation of all pairwise distances is of complexity , although it is possible to exploit the exponential decay in the hopping probabilities to threshold these matrix dements to zero and to avoid calculation of all distance pairs using clustering or divide and conquer approaches^{29}. Diagonalization of the -by- matrix has complexity ^{30}, but the typically sparse nature of the Markov matrix admits sparse Lanczos algorithms that reduce the complexity to , whereis the number of non-zero matrix elements
^{26, 31, 32}. Calculation of only the top eigenvectors can further reduce the cost to ^{31, 26}. The overall complexity of the discovery of the nonlinear low-dimensional projection is then .Projection of out-of-sample points into the nonlinear manifold is complicated by the unavailability of an explicit transformation matrix for the low-dimensional projection ^{21}. Naïvely, one may augment the original data with the
samples and diagonalize the augmented system, which is an exact but exceedingly expensive operation to perform for each new sample. Accordingly, a number of approximate interpolation techniques have been proposed
^{33}, including principal component analysis-based approaches^{34}, Laplacian pyramids^{35}, and the Nyström extension^{36, 37, 38}. The Nyström extension is perhaps the simplest and most widely used and scales as , requiring the calculation of pairwise distances with the points constituting the low-dimensional manifold. Linear scaling with the size of the original dataset can be prohibitively costly for online dimensionality reduction applications to high-velocity and/or high-volume streaming data, where fast and efficient embedding of new data points is of paramount importance. For example, in online threat or anomaly detection where excursions of the system into unfavorable regions of the nonlinear manifold must be quickly recognized in order to take corrective action^{39}, in robotic motion planning where movements are generated based on localized paths through a low-dimensional manifold to maintain kinematic admissibility while navigating obstacles^{40, 41}, and in parallel replica dynamics^{42} or forward flux sampling simulations^{43, 44} of biomolecular folding where departures from a particular region of the manifold can be used to robustly and rapidly identify conformational changes in any one of the simulation replicas and permit responsive reinitialization of the replicas to make maximally efficient use of computational resources.To reduce the computational complexity of the application of diffusion maps to streaming data we propose a controlled approximation based on the identification of a subset of “landmark” data points with which to construct the original manifold and embed new data. Reducing the number of points participating in these operations can offer substantial computational savings, and the degree of approximation can be controlled and tuned by the number and location of the landmarks. Our approach is inspired by and analogous to the landmark Isomap (L-Isomap) adaptation of the original Isomap nonlinear manifold learning approach due to Tenenbaum, de Silva, and Langford ^{16, 45}, which has demonstrated robust landscape recovery and computational savings considering small numbers of randomly selected landmark points. Silva et al. subsequently introduced a systematic means to select landmarks based on -regularized minimization of a least-squares objective function ^{46}. As observed by de Silva and Tenenbaum ^{45}, the underlying principle of this approach is analogous to global positioning using local distances, whereby a point can be uniquely located within the manifold given sufficient distance measurements to points distributed over its surface ^{47}. In analogy with this work, we term our approach the landmark diffusion maps (L-dMaps).
It is the purpose of this manuscript to introduce L-dMaps, in which the selection of landmark points reduces the computational complexity of the out-of-sample projection of a new data point from to offering speedups , which can be a substantial factor when the landmarks constitute a small fraction of the data points constituting the manifold. The use of landmarks also substantially reduces the memory requirements, leading to savings in both CPU and RAM requirements that enable applications of diffusion maps to higher volume and velocity streaming data than is currently possible.
The structure of this paper is as follows. In Materials and Methods, we introduce the computational and algorithmic details of our L-dMaps approach along with theoretical error bounds on its fidelity relative to diffusion maps applied to the full dataset. In Results and Discussion, we demonstrate and analyze the accuracy and performance of L-dMaps on three test systems – the Swiss roll, molecular simulations of a C24H50 polymer chain, and biomolecular simulations of alanine dipeptide – in which we report up to 50-fold speedups in out-of-sample extension with less than 4% errors in manifold reconstruction fidelity relative to those calculated by dMaps applied to the full dataset. In our Conclusions we close with an appraisal of our approach and its applications, and an outlook for future work.
First, we briefly describe the original diffusion maps (dMaps) approach and the Nyström extension for out-of-sample projection. Second, we introduce landmark diffusion maps (L-dMaps) presenting two algorithms for systematic identification of landmark points – one fully automated spanning tree approach, and one based on k-medoids that can be tuned to achieve specific error tolerances – and the subsequent use of these landmarks to perform nonlinear manifold discovery and Nyström projection of new data. Third, we develop theoretical estimates of L-dMaps error bounds based on a first-order perturbation expansion in the errors introduced by the use of landmarks compared to consideration of the full dataset. Finally, we detail the three datasets upon which we demonstrate and validate our approach: the Swiss roll, molecular simulations of a C24H50 polymer chain, and biomolecular simulations of alanine dipeptide.
The diffusion map is a nonlinear dimensionality reduction technique that discovers low-dimensional manifolds within high-dimensional datasets by performing harmonic analysis of a random walk constructed over the data to identify nonlinear collective variables containing the predominance of the variance in the data^{18, 19, 9, 20}. The first step in applying diffusion maps is to compute a measure of similarity between the high-dimensional data points to construct the -by- pairwise distance matrix with elements,
(1) |
where is an appropriate distance metric for the system under consideration (e.g. Euclidean, Hamming, earth movers distance, rotationally and translational aligned root mean squared deviation (RMSD)). These distances are then used to define a random walk over the data by defining hopping probabilities from point to point as proportional to the convolution of with a Gaussian kernel,
(2) |
where is a soft-thresholding bandwidth that limits transitions between points within an neighborhood. Systematic procedures exist to select appropriate values of for a particular dataset.^{24, 7} Forming the diagonal matrix containing the row sums of the matrix,
(3) |
we normalize the hopping probabilities to obtain the Markov matrix ,
(4) |
describing a discrete diffusion process over the data. Although other choices of kernels are possible, the symmetric Gaussian kernel is the infinitesimal generator of a diffusion process such that the Markov matrix is related to the normalized graph Laplacian,
(5) |
where
is the identity matrix, and in the limit of
and the matrix converges to a Fokker-Planck operator describing a continuous diffusion process over the high-dimensional data ^{18, 19, 48, 7}.Diagonalizing by solving the -by- eigenvalue problem,
(6) |
where is a diagonal matrix holding the eigenvalues in non-ascending order and is a matrix of right column eigenvectors corresponding to the relaxation modes and implied time scales of the random walk ^{18, 19}. By the Markov property the top pair are trivial, and the steady state probability distribution over the high-dimensional point cloud given by the top left eigenvector ^{7}. The graph Laplacian and Markov matrix share left and right biorthogonal eigenvectors, and the eigenvectors of are related to those of as ^{7}. Accordingly, the leading eigenvectors of are the slowest relaxing modes of the diffusion process described by the graph Laplacian ^{19}.
A gap in the eigenvalue spectrum exposes a separation of time scales between slow and fast relaxation modes, informing an embedding into the slow collective modes of the discrete diffusion process that excludes the remaining quickly relaxing fast degrees of freedom ^{48}. Identifying this gap at informs an embedding into the top non-trivial eigenvectors,
(7) |
This diffusion mapping can be interpreted as a nonlinear projection of the high-dimensional data in onto a low-dimensional “intrinsic manifold” in discovered within the data where . The diffusion map determines both the dimensionality of the intrinsic manifold and good collective order parameters with which to parameterize it. As detailed above, the overall complexity of manifold discovery and projection via diffusion maps is , where efficient diagonalization routines leave this calculation dominated by calculation of pairwise distances.
The Nyström extension presents a means to embed new data points outside the original used to define the low-dimensional nonlinear embedding by approximate interpolation of the new data point within the low-dimensional manifold^{36, 37, 38, 49, 33}. This operation proceeds by computing the distances of the new point to the existing points defining the manifold , and using these values to compute and augment the matrix with an additional row ^{33, 49}. The projected coordinates of the new point onto the -dimensional intrinsic manifold defined by the top non-trivial eigenvectors of are then given by,
(8) |
The computational complexity for the Nyström projection of a single new data point is , requiring distance calculations in the original -dimensional space and then projection onto the -dimensional manifold. Projections are exact for points in the original dataset, accurate for the interpolation of new points within the kernel bandwidth of the intrinsic manifold defined by the original points, but poor for extrapolations to points residing beyond this distance away from the manifold ^{33, 21, 50}.
The application of diffusion maps to the online embedding of streaming data is limited by the computational complexity of the out-of-sample extension that requires the calculation of pairwise distances of each new data point with the points defining the embedding. Here we propose landmark diffusion maps (L-dMaps) that employs a subset of points providing adequate support for discovery and construction of the low-dimensional embedding to reduce this complexity to . In this way, we trade-off errors in the embedding of new data with projection speedups that scale in inverse proportion to the number of landmarks . We quantify the reconstruction error introduced by the landmarking procedure, and demonstrate that these errors can be made arbitrarily small by selection of sufficiently many landmark points. The use of landmarks has previously been demonstrated in the L-Isomap variant of the Isomap dimensionality reduction technique to offer substantial gains in computational efficiency^{45}.
Landmarking can be conceived as a form of lossy compression that represents localized groups of points in the high-dimensional feature space by attributing them to a central representative landmark point. Provided the landmarks are sufficiently well distributed over the intrinsic manifold mapped out by the data in the high-dimensional space, then the pairwise distances between landmarks to a new out-of-sample data point provide sufficient distance constraints to accurately embed the new point onto the manifold^{45, 47}. The original L-Isomap algorithm proposed landmarks be selected randomly^{45}, and a subsequent sophistication by Silva et al. proposed a selection procedure based on -regularized minimization of a least-squares objective function ^{46}. In this work, we propose two efficient and systematic approaches to selection: a pruned spanning tree (PST) approach that offers an automated means to select landmarks, and a k-medoids approach that allows the user to tune the number of landmarks to trade-off speed and embedding fidelity to achieve a particular error tolerance. Both approaches require pre-computation of the -by- pairwise distances matrix, making them expensive for large datasets^{46}. However, it is the primary goal of this work to select good landmarks for the rapid and efficient embedding of streaming data into an existing manifold, so the high one-time overhead associated with their selection is of secondary importance relative to optimal landmark identification for subsequent online performance.
Pruned spanning tree (PST) landmark selection. The square root of the soft-thresholding bandwidth employed by diffusion maps defines the characteristic step size of the random walk over the high dimensional data (Eqn. 2)^{7}. In order to reliably construct a low-dimensional embedding, the graph formed by applying this neighborhood threshold to the pairwise distance matrix must be fully connected to assure than no point is unreachable from any other (i.e., the Markov matrix is irreducible) ^{21}. This assures that a single connected random walk can be formed over the data, and that the diffusion map will discover a single unified intrinsic manifold as opposed to a series of disconnected manifolds each containing locally connected subsets of the data. This connectivity criterion imposes two requirements on the selection of landmarks as a subset of the data points : (i) all points are within a neighborhood of (i.e., covered by) a landmark,
(9) |
and (ii) the graph of pairwise distances over the landmarks is fully connected, with each landmark point within a distance of of at least one other,
(10) |
In practice, a threshold of a few multiples of may be sufficient to maintain coverage and connectivity.
These coverage and connectivity conditions motivate a landmark selection procedure based on spanning trees of the pairwise distances matrix that naturally enforces both of these constraints and identifies landmarks that ensure out-of-sample points residing within the manifold can be embedded within a neighborhood of a landmark point. Residing within the characteristic step size of the random walk, this condition is expected to produce accurate interpolative out-of-sample extensions using the Nyström extension. As described above, extrapolative extensions are expected to perform poorly for distances greater than ^{33, 21, 50}. First, we form the binary adjacency matrix by hard-thresholding the -by- pairwise distances matrix (Eqn. 1),
(11) |
The binary adjacency matrix defines a new graph in which two data points and are connected if and only if . Next, we seek the minimal subset of edges that contains no cycles and connects all nodes in the graph, which is equivalent to identifying a spanning tree representation of the graph that may be determined in many ways^{51}. We elect to use Prim’s algorithm^{52}, which randomly selects a root node then recursively adds edges between tree nodes and unassigned nodes until all nodes are incorporated into the tree. As the edge weights of are either 0 or 1, Prim’s algorithm at each step randomly selects an edge from connecting an unassigned node to a node in . By only creating new connections between tree nodes and non-tree nodes, this method guarantees that is cycle-free and, provided that is connected, is a spanning tree of . Finally, we recognize that each leaf node lies within of their parent, meaning that all leaves of the tree can be pruned, with the remaining nodes comprising a pruned spanning tree (PST) defining a set of landmarks satisfying both the covering (Eqn. 9) and connectivity (Eqn. 10) conditions. We summarize PST landmark identification procedure in Algorithm 1.
K-medoid landmark selection. Growing and pruning a spanning tree offers an automated means to identify landmarks that ensure any out-of-sample point can be interpolatively embedded within a neighborhood of a landmark. This procedure is expected to offer good reconstruction accuracy, but the error tolerance is not directly controlled by the user. Accordingly, we introduce a second approach to landmark selection that allows the user to tune the number of landmark points to trade-off computational efficiency against embedding accuracy in the Nyström extension to achieve a particular runtime target or error tolerance relative to the use of all data points. Specifically, we partition the data into a set of
distinct clusters, and define landmarks within each of these clusters to achieve pseudo-optimal coverage of the intrinsic manifold for a particular number of landmark points. Numerous partitioning techniques are available, including spectral clustering
^{53}, affinity propagation^{54}, and agglomerative hierarchical clustering
^{55}. We use the well-known k-medoids algorithm, using Voronoi iteration to update and select medoid points^{56}. Compared to k-means clustering, k-medoids possesses the useful attribute that cluster prototypes are selected as medoids within the data points constituting the cluster as opposed to as linear combinations defining the cluster mean. We select the initial set of landmarks randomly from the pool of all samples, although we note that alternate seeding methods exist such as
-means++^{57}. We summarize the k-medoids landmark selection in Algorithm 2.The primary purpose of landmark identification is to define an ensemble of supports for the efficient out-of-sample extension projection of streaming data. The nonlinear manifold can be defined by applying diffusion maps to all data points. The expensive calculation of the pairwise distances matrix has already been performed for the purposes of landmark identification, leaving only a relatively cheap calculation of the top eigenvectors where is the number of non-zero elements in the matrix ^{31, 26, 32}. Nevertheless, having identified these landmarks, additional computational savings may be achieved by constructing the manifold using only the landmarks.
Naïve application of diffusion maps to the landmarks will yield a poor reconstruction of the original manifold since these landmark points do not preserve the density distribution of the original points over the high-dimensional feature space. Ferguson et al. previously proposed a means to efficiently apply diffusion maps to datasets containing multiple copies of each point in the context of recovering nonlinear manifolds from biased data^{50}. We adapt this approach to apply diffusion maps to only the landmark points while approximately maintaining the density distribution of the full dataset. Given a set of landmark points we may characterize the local density of points in the high-dimensional space around each landmark by counting the number of data points residing within the Voronoi volume of each landmark point defined by the set,
(12) |
Following Ref. 50, we now apply diffusion maps to the landmark points each weighted by a multiplicity . Mathematically, this corresponds to solving the -by- eigenvalue problem analogous to that in Eqn. 6 of Section 2.1,
(13) |
where , the elements of are given by,
(14) |
defining the unnormalized hopping probability between landmark points and , is a diagonal matrix containing the multiplicity of each landmark point,
(15) |
is a diagonal matrix with elements,
(16) |
and is a diagonal matrix holding the eigenvalues in non-ascending order, and is the matrix of right column eigenvectors. It can be shown that by enforcing the normalization condition on the eigenvectors, that the diffusion map embedding,
(17) |
is precisely that which would be obtained from applying diffusion maps to an ensemble of points in which each landmark point is replicated times, and the non-zero eigenvalues are identical to ^{50}.
The net result of this procedure is that we can define the intrinsic manifold by considering only the landmark points and diagonalizing a -by- matrix as opposed to a -by- matrix with an attendant reduction in computational complexity from to . The diffusion mapping in Eqn. 17 defines a reduced -point intrinsic manifold that can be considered a lossy compression of the complete -point manifold, and which can be stored in a smaller memory footprint^{45}. For large datasets, the computational and memory savings associated with calculation and storage of this manifold can be significant. Although not necessary, if desired the non-landmark points can be projected into the intrinsic manifold using the landmark Nyström extension described in the next section. This is also precisely the procedure that will be used to perform out-of-sample embeddings of new data points not contained within the original data points.
Geometrically, the approximation we make in formulating the reduced eigenvalue problem is that all points within the Voronoi cell of a landmark point are equivalent to the landmark point itself, weighting each landmark point according to the number of points inside its Voronoi volume. This is a good assumption provided that the variation between the points within each Voronoi volume is small relative to the variation over the rest of the high-dimensional space, and becomes exact in the limit that every point is treated as a landmark (i.e., ). In Section 2.3 we place theoretical bounds on the errors introduced by this approximation in the nonlinear projection of new out-of-sample points onto the intrinsic manifold relative to that which would have been computed by explicitly considering all points using the original diffusion map.
The heart of our L-dMaps approach is the nonlinear projection of new out-of-sample points using the reduced manifold defined by the landmark points as opposed to the full -point manifold. Nyström embeddings of new points over the reduced manifold proceed in an analogous manner to that detailed in Section 2.1.1, but now considering only the landmark points. Specifically, we compute the distance of the new point to all landmarks to compute elements with which to augment the matrix with an additional row with elements . The projected coordinates of the new point onto the -dimensional reduced intrinsic manifold defined by the top non-trivial eigenvectors of is,
(18) |
This landmark form of the Nyström extension reduces the computational complexity from to
by reducing both the number of distance computations and the size of the matrix-vector multiplication. The attendant runtime speedup
can be substantial for , offering accelerations to permit the application of diffusion map out-of-sample extension to higher volume and higher velocity streaming data than is currently possible.The diffusion mapping in Eqn. 17 defines a reduced intrinsic manifold comprising the landmark points that we subsequently use to perform efficient landmark Nyström projections of out-of-sample data using Eqn. 18. As detailed in Section 2.2.2 the leading eigenvectors defining the intrinsic manifold come from the solution of a -by- eigenvalue problem in which each landmark point is weighted by the number of points falling in its Voronoi volume , that we solve efficiently and exactly by mapping it to the reduced -by- eigenvalue problem in Eqn. 13^{50}. The approximation we make in formulating this eigenvalue problem is to consider each point in the Voroni volume of each landmark point as identical to the landmark itself. Were we not to make the landmark approximation, we would be forced to solve a different -by- eigenvalue problem explicitly treating all points using the original diffusion map with leading eigenvectors . In using landmarks we massively accelerate the out-of-sample embedding of new points, but the penalty we pay is that the resulting intrinsic manifold we discover is not exactly equivalent to that which would be discovered by the original diffusion map in the limit . In this section we estimate the errors in manifold reconstruction introduced by the landmarking procedure by developing approximate analytical expressions for the discrepancy between the landmark and true eigenvectors parameterizing the intrinsic manifold.
We formulate our approximate landmark eigenvalue problem by collapsing the set of points within the Voronoi volume of each landmark point onto the landmark itself. This amounts to perturbing each data point in the high dimensional space by a vector where is the landmark point within the Voronoi volume of which falls. The elements of the -by- pairwise distance matrix between are correspondingly perturbed from to . This introduces perturbations into the pairwise distances, the precise form of which depends on the choice of distance metric . By propagating these differences through the eigenvalue problem formulated by the original diffusion map over the true (unshifted) locations of the points as a first-order perturbation expansion, we will develop analytical approximations valid in the limit of small perturbations (i.e., sufficiently many landmark points) for the corresponding perturbations in the eigenvalues and eigenvectors introduced by our landmarking procedure. We observe that for a given choice of landmark points and distance metric, the elements are explicitly available, allowing is to use our terminal expressions to predict the errors introduced for a particular choice of landmarks. In Section 3.1 we will validate our analytical predictions against errors calculated by explicit numerical solution of the full and landmark eigenvalue problems.
We first show how to compute the perturbations in our kernel matrix and diagonal row sum matrix as a function of the pairwise distance perturbations , from which we estimate perturbations in the Markov matrix using this expression. We then use these values to develop analytical approximations for the perturbations to the eigenvalues and eigenvectors due to the landmark approximation. The perturbation analysis of the eigenvalue problem presented below follows a similar development to that developed by Deif ^{58}.
Starting from the full diffusion map eigenvalue problem over the original data (Eqn. 6), we define the perturbed eigenvalue problem under the perturbation in the pairwise distances matrix as,
(19) |
and where from Eqn. 4,
(20) |
where in going from the first line to the second we have employed the Maclaurin expansion,
(21) |
and Eqn. 4 in going from the third to the fourth. The perturbations in the elements of the Markov matrix may then be estimated to first order in the perturbations as,
(22) |
To estimate the and required by this expression as a function of the perturbations in the pairwise distances , we commence from the expression for the elements of keeping terms in to first order,
(23) |
from which the perturbations in follow as,
(24) |
The elements of the diagonal and matrices are computed from the and row sums respectively, from which the perturbations in follow immediately as,
(25) |
Using Eqn. 22 we now estimate the corresponding perturbations in the eigenvalues and eigenvectors. Expanding the perturbed eigenvalue problem in Eqn. 19 and keeping terms to first order in the perturbations yields,
(26) |
and where in going from the first to the second line we have canceled the first term on each side using original eigenvalue problem (Eqn. 6). Treating the unperturbed eigenvectors as an orthonormal basis, we expand the perturbations to each eigenvector in this basis as,
(27) |
where are the expansion coefficients for the perturbation to the eigenvector .
To solve for the perturbation to the eigenvalue we restrict the perturbed eigenvalue problem in Eqn. 26 to the particular eigenvalue/eigenvector pair and their corresponding perturbations by extracting the column, left multiplying each side by , and substituting in our expansion for ,
(28) |
where we have exploited the orthonormality of the eigenvectors, in going from the third line to the fourth used the original eigenvalue problem in Eqn. 6 to make the substitution , and used Eqn. 20 to go from the penultimate to ultimate line.
Using a similar procedure we develop expressions for the expansion coefficients that we combine with Eqn. 27 to estimate perturbations in the eigenvectors. To compute for , we again extract the column of the perturbed eigenvalue problem in Eqn. 26 but this time left multiply by ,
(29) |
With the for in hand, we recover by enforcing normalization of the perturbed eigenvectors ,
(30) |
where we have appealed to the orthonormality of the eigenvectors and in the last line solved the quadratic in by completing the square and taking the positive root that corresponds to shrinking of the perturbed eigenvector along to maintain normalization while preserving its original sense. Finally, we restrict our perturbative analysis (Eqns. 27 and 28) to the subspace spanned by the top non-trivial eigenvalues and eigenvectors such that we model only the perturbations within the -dimensional subspace containing the intrinsic manifold.
The result of our analysis is a first-order perturbative expression for the errors introduced by the landmarking procedure in the eigenvalues (Eqn. 28) and eigenvectors (Eqns. 27, 29, and 30) from our landmarking procedure. In Section 3.1 we demonstrate that for sufficiently small perturbations in the pairwise distances (i.e., sufficiently many landmark points) that these perturbative analytical predictions are in good agreement with errors calculated by explicit numerical solutions of the full and landmark eigenvalue problems.
We demonstrate and benchmark the proposed L-dMaps approach in applications to three datasets: the Swiss roll, molecular simulations of a C24H50 polymer chain, and biomolecular simulations of alanine dipeptide (Fig. 1).
Swiss roll. The “Swiss roll” – a 2D surface rolled into 3D space – is a canonical test system for nonlinear dimensionality reduction techniques ^{16, 21}. We illustrate in Figure 1a the 20,000-point Swiss roll dataset employed by Tenenbaum et al. in their introduction of the Isomap algorithm ^{16} and available for free public download from http://isomap.stanford.edu/datasets.html. In applying (landmark) diffusion maps to these data, distances between points are measured using Euclidean distances computed by MATLAB’s L2-norm function.
Molecular simulations of a C24H50 n-alkane chain. Hydrophobic polymer chains in water possess rich conformational dynamics, and serve as prototypical models for hydrophobic folding that we have previously studied using diffusion maps to determine collapse pathways^{7} and reconstruct folding funnels from low-dimensional time series^{59}. We consider a molecular dynamics trajectory of a C24H50 n-alkane chain in water reported in Ref. [ 59]. Simulations were conducted in the GROMACS 4.6 simulation suite^{67} at 298 K and 1 bar employing the TraPPE potential^{68} for the chain that treats each CH2 and CH3 group as a single united atom, and the SPC water model^{69}. This dataset comprises 10,000 chain configurations harvested over the course of a 100 ns simulation represented as 72-dimensional vectors corresponding to the Cartesian coordinates of the 24 united atoms. Distances between chains are computed as the rotationally and translationally minimized root-mean-square deviation (RMSD), calculated using the GROMACS g_rms tool (http://manual.gromacs.org/archive/4.6.3/online/g_rms.html).
Molecular simulations of alanine dipeptide. Finally, we study a molecular dynamics simulation trajectory of alanine dipeptide in water previously reported in Ref. [50]. The “hydrogen atom of protein folding”, this peptide is a standard test system for new simulation and analysis methods in biomolecular simulation^{50, 60, 61, 62, 63, 64, 65, 66}. Unbiased molecular dynamics simulations were conducted in the Gromacs 4.0.2 suite^{67} at 298 K and 1 bar modeling the peptide using the OPLS-AA/L force field^{70, 71} and the TIP3P water model^{72}. The dataset comprises 25,001 snapshots of the peptide collected over the course of a 50 ns simulation represented as 66-dimensional vectors comprising the Cartesian coordinates of the 22 atoms of the peptide. Distances between snapshots were again measured as the rotationally and translationally minimized RMSD calculated using the GROMACS g_rms tool.
The principal goal of L-dMaps is to accelerate out-of-sample extension by considering only a subset of landmark points at the expense of the fidelity of the nonlinear embedding relative to that which would have been achieved using all samples. First, we quantify the accuracy of L-dMaps nonlinear embeddings for both PST and k-medoid landmark selection strategies for the three datasets considered, and compare calculated errors with those estimated from our analytical expressions. Second, we benchmark the speedup and performance of L-dMaps for out-of-sample projection of new data.
We numerically quantify the accuracy of L-dMaps manifold reconstruction for each of the three datasets using 5-fold cross validation, where we randomly split the data into five equal partitions and consider each partition in turn as the test set and the balance as the training set. We perform a full diffusion map embedding of the training partition to compute the “true” diffusion map embedding of the complete training partition using Eqn. 6. We then use the embedding of the training data onto the intrinsic manifold to perform the Nyström out-of-sample extension projection of the test data onto the manifold using Eqn. 8 to generate their “true” images .
We assess the fidelity of the nonlinear projections of the training data generated by our landmarking approach by taking the ensemble of training data and defining landmarks using the PST and k-medoids selection criteria. We then compute the nonlinear embedding of these landmarks onto the intrinsic manifold by solving the reduced eigenvalue problem in Eqn. 13 and projecting in the remainder of the training data (i.e., the non-landmark points) using the landmark Nyström extension in Eqn. 18. This defines a landmark projection of the training data . Finally, we use the landmark Nyström extension again to project in the test partition to generate the landmark embedding of the test data .
To compare the fidelity of the landmark embedding we define a normalized percentage deviation between the true and landmark embeddings of each point as,
(31) |
where and are the true and landmark embeddings of data point into the component of the nonlinear embedding into the -dimensional intrinsic manifold spanned by the top non-trivial eigenvectors, is the linear span of dimension , and the dimensionality is determined by a gap in the eigenvalue spectrum at eigenvalue . The dimensionality of the intrinsic manifolds for the Swiss roll, C24H50 chain, and alanine dipeptide have previously been determined to be 2, 4, and 2, respectively.^{16, 59, 50}. We then compute the root mean squared (RMS) normalized percentage error as,
(32) |
where is the number of points constituting either the training or test partition.
We illustrate in Figure 2 the true and landmark embeddings of the training and test data for the Swiss roll dataset, where we selected from the ensemble of = 16,000 points in the training partition a set of = 4,513 landmarks using the PST algorithm. The maximum normalized percentage deviation of any one point in either the training or test data using this set of landmarks is less than 3.2%, and the RMS errors are and . This demonstrates that using only 28% of the training set as landmarks, we can reconstruct a high-fidelity embedding with average normalized percentage errors per point of less than 1%.
We present in Table 1 the results of our error analysis for the three datasets using the PST and k-medoids landmark selection strategies at the smallest value of the kernel bandwidth supporting a fully connected random walk over the data (cf. Eqn. 2). This value of maximally preserves the fine-grained details of the manifold; we consider larger bandwidths below. For each of the three datasets we observe high-fidelity embeddings using relatively small fractions of the data as landmarks. For the Swiss roll, 30% of the data are required to achieve a 2.5% reconstruction error. Due to the approximately constant density of points over the manifold, further reductions provide too few points to support accurate embeddings. For the C24H50 n-alkane chain and alanine dipeptide, the existence of an underlying energy potential produces large spatial variations in the density over the manifold, which is exploited by our density weighted landmark diffusion map (Eqns. 13 and 18) to place landmarks approximately uniformly over the manifold and eliminate large numbers of points in the high density regions without substantial loss of accuracy. For C24H50, PST landmarks constituting 1.19% of the training data attain a RMS reconstruction error of 8%, and k-medoids landmarks comprising 3.75% of the data achieve a reconstruction error of 3%. For alanine dipeptide, landmark selection using the PST selection policy achieves a 1% error rate using only 1.75% of the training points, whereas k-medoids requires more than 5% of the data to achieve that same level of accuracy.
Swiss roll (, , ) | ||||
Method | / (%) | (%) | (%) | |
PST | 4551.0 (25.5) | 28.44 (0.16) | 2.42 (1.94) | 2.43 (1.95) |
2000 | 12.50 | 13.43 (13.41) | 13.37 (13.30) | |
K-medoid | 4000 | 25.00 | 3.74 (3.15) | 3.75 (3.15) |
8000 | 50.00 | 1.22 (1.05) | 1.22 (1.04) | |
C24H50 (, , ) | ||||
Method | / (%) | (%) | (%) | |
PST | 95.2 (2.6) | 1.19 (0.03) | 7.85 (3.44) | 8.48 (3.56) |
50 | 0.63 | 9.97 (2.86) | 10.65 (2.93) | |
K-medoid | 100 | 1.25 | 5.38 (1.23) | 5.76 (1.47) |
300 | 3.75 | 3.19 (0.53) | 3.49 (0.65) | |
Alanine dipeptide (, , ) | ||||
Method | / (%) | (%) | (%) | |
PST | 347.2 (8.4) | 1.74 (0.04) | 0.88 (0.26) | 0.94 (0.30) |
200 | 1.00 | 5.93 (2.81) | 6.31 (3.04) | |
K-medoid | 400 | 2.00 | 2.92 (0.92) | 3.05 (0.84) |
1000 | 5.00 | 1.43 (0.57) | 1.50 (0.60) |
supporting a fully connected random walk was employed in the kernel bandwidth. We report the mean and standard deviation of
Comments
There are no comments yet.