Learning manifolds from non-stationary streams

Streaming adaptations of manifold learning based dimensionality reduction methods, such as Isomap, are based on the assumption that a small initial batch of observations is enough for exact learning of the manifold, while the remaining streaming data instances can be cheaply mapped to this manifold. However, there are no theoretical results to show that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur while the data is streaming. We present theoretical results to show that the quality of a manifold asymptotically converges as the size of the data increases. We then show that a Gaussian Process Regression (GPR) model, which uses a manifold-specific kernel function and is trained on an initial batch of sufficient size, can closely approximate the state-of-the-art streaming Isomap algorithms, and that the predictive variance obtained from the GPR prediction can be employed as an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn low-dimensional representations of high-dimensional data in a streaming setting, while identifying shifts in the generative distribution. For instance, key findings on a gas sensor array data set show that our method can detect changes in the underlying data stream, triggered by real-world factors such as the introduction of a new gas into the system, while efficiently mapping data onto a low-dimensional manifold.


Introduction
High-dimensional data is inherently difficult to explore and analyze, owing to the "curse of dimensionality" that renders many statistical and machine learning techniques inadequate. In this context, non-linear dimensionality reduction (NLDR) has proved to be an indispensable tool. Manifold learning based NLDR methods, such as Isomap [1], Local Linear Embedding (LLE) [2], etc., assume that the distribution of the data in the high-dimensional observed space is not uniform. Instead, the data is assumed to lie near a non-linear low-dimensional manifold embedded in the high-dimensional space. By exploiting the geometric properties of the manifold, e.g., smoothness, such methods infer the low-dimensional representation of the data from the high-dimensional observations.
A key shortcoming of NLDR methods is their O(n^3) complexity, where n is the size of the data [1]. If directly applied to streaming data, where data arrives one point at a time, NLDR methods have to recompute the entire manifold at every time step, making such a naive adaptation prohibitively expensive. To alleviate the computational problem, landmark-based methods [3] and general out-of-sample extension methods [4] have been proposed. However, these techniques are still computationally expensive for practical applications. Recently, a streaming adaptation of the Isomap algorithm [1], a widely used NLDR method, was proposed [5]. This method, called S-Isomap, relies on exact learning from a small initial batch of observations, followed by approximate mapping of the subsequent stream of observations. An extension to the case where the observations are sampled from multiple, and possibly intersecting, manifolds, called S-Isomap++, was subsequently proposed [6].
Empirical results on benchmark data sets show that these methods can reliably learn the manifold with a small initial batch of observations. However, two issues remain. First, no theoretical bounds exist on the quality of the manifold as a function of the initial batch size. Second, these methods assume that the underlying generative distribution is stationary over the stream, and are unable to detect when the distribution "drifts" or abruptly "shifts" away from the base, resulting in incorrect low-dimensional mappings (see Fig. 1).

Fig. 1 Impact of changes in the data distribution on streaming NLDR. In the top panel, the true data lies on a 2D manifold (top-left) and the observed data is in R^3, obtained by applying the swiss-roll transformation to the 2D data (top-middle). The streaming algorithm (S-Isomap [5]) uses a batch of samples from a 2D Gaussian (black), and maps streaming points sampled from a uniform distribution (gray). The streaming algorithm performs well in mapping the batch points to R^2 but fails on the streaming points that "drift" away from the batch (top-right). In the bottom panel, the streaming algorithm (S-Isomap++ [6]) uses a batch of samples from three 2D Gaussians (black). The stream points are sampled from the three Gaussians and a new Gaussian (gray). The streaming algorithm performs well in mapping the batch points to R^2 but fails on the streaming points that are "shifted" from the batch (bottom-right). Both streaming algorithms are discussed in Sect. "Problem statement and preliminaries".
The focus of this paper is two-fold. We first provide theoretical results showing that the quality¹ of a manifold, as learnt by Isomap, asymptotically converges as the data size, n, increases. This is a necessary result for establishing the correctness of streaming methods such as S-Isomap and S-Isomap++ under the assumption of stationarity. Next, we propose a methodology to detect changes in the underlying distribution of the stream (drifts and shifts), and to inform the streaming methods to update the base manifold.
We employ a Gaussian Process (GP) [7] based adaptation of Isomap to process high-throughput streams. The use of a GP is enabled by a kernel that measures the relationship between a pair of observations along the manifold, rather than in the original high-dimensional space. We prove that the low-dimensional representations inferred using the GP based method, GP-Isomap, are equivalent to the representations obtained using the state-of-the-art streaming Isomap methods [5,6]. Additionally, we show empirically, on synthetic and real data sets, that the predictive variance associated with the GP predictions is an effective indicator of changes (either gradual drifts or sudden shifts) in the underlying generative distribution, and can be employed to inform the algorithm to "relearn" the core manifold.

Related works
Processing data streams efficiently using standard approaches is challenging in general, since streams require real-time processing and cannot be stored permanently. Any form of analysis, including the detection of concept drift, requires adequate summarization that can deal with these inherent constraints and approximate the characteristics of the stream well. Sampling based strategies, including random sampling [8,9] and decision-tree based approaches [10], have been used in this context. To identify concept drift, maintaining statistical summaries over a streaming "window" is a typical strategy [11-13]. However, none of these are applicable when learning a latent representation of the data, e.g., manifolds, in the presence of changes in the stream distribution.
We discuss the limitations of existing incremental and streaming solutions that have been developed specifically in the context of manifold learning, and of the Isomap algorithm in particular. Coupling Isomap with GP Regression (GPR) has been explored in the past [14-17], though not in the context of streaming data. This includes a Mercer kernel-based Isomap technique [14] and an emulator pipeline that uses Isomap to determine a low-dimensional representation, whose output is fed to a GPR model [15].
The idea of using GPR for detecting concept drift is novel, even though the Bayesian non-parametric approach of [18], primarily intended for anomaly detection, comes close to our work in a single-manifold setting. However, their choice of a Euclidean distance based kernel (in the original R^D space) for the covariance matrix can result in high Procrustes error, as shown in Fig. 4. Additionally, their approach does not scale, since it does not use any approximation to process the new streaming points "cheaply".

¹ See Sect. "Problem statement and preliminaries" for the definition of manifold quality.
We also note that a family of GP based non-spectral non-linear dimensionality reduction methods exists, namely the Gaussian Process Latent Variable Model (GPLVM) [20] and its variants [19,21]. GPLVM assumes that the high-dimensional observations are generated from the corresponding low-dimensional representations, using a GP prior. The latent low-dimensional representations are then inferred by maximizing the marginalized log-likelihood of the observed data, which is an optimization problem with n unknown d-dimensional vectors, where d is the length of the low-dimensional representation. In contrast, the GP-Isomap algorithm assumes that the low-dimensional representations are generated from the corresponding high-dimensional data, using a manifold-specific kernel matrix.
There is a considerable body of literature on dimensionality reduction [22,23], including recent work that uses deep learning based models [24]; however, these cannot be applied in a streaming setting. While there have been some recent works that use PCA in a streaming setting [25], these are inherently linear and hence not applicable where the manifolds are non-linear.

Problem statement and preliminaries
We first formulate the NLDR problem, provide background on Isomap, and discuss its out-of-sample and streaming extensions [5,6,26,27]. Additionally, we provide a brief introduction to Gaussian Process (GP) analysis.

Non-linear dimensionality reduction
Given high-dimensional data Y = {y_i}_{i=1...n}, where y_i ∈ R^D, the NLDR problem is concerned with finding its corresponding low-dimensional representation X = {x_i}_{i=1...n}, such that x_i ∈ R^d, where d ≪ D.
NLDR methods assume that the data lies along a low-dimensional manifold embedded in a high-dimensional space, and exploit the global (Isomap [1], Minimum Volume Embedding [28]) or local (LLE [2], Laplacian Eigenmaps [29], Hessian Eigenmaps [30]) properties of the manifold to map each y i to its corresponding x i .
The Isomap algorithm [1] maps each y_i to its low-dimensional representation x_i in such a way that the geodesic distance along the manifold between any two points, y_i and y_j, is as close to the Euclidean distance between x_i and x_j as possible. The geodesic distance is approximated by computing the shortest path between the two points using the k-nearest-neighbor graph, and is stored in the geodesic distance matrix G = {g_{i,j}}_{1≤i,j≤n}, where g_{i,j} is the geodesic distance between the points y_i and y_j. The matrix 𝒢 = {g_{i,j}^2}_{1≤i,j≤n} contains the squared geodesic distance values. The Isomap algorithm recovers x_i by applying classical Multi-Dimensional Scaling (MDS) on 𝒢. Let B be the inner-product matrix between the different x_i. B can be retrieved as B = −H𝒢H/2 by assuming Σ_{i=1}^n x_i = 0, where H = {h_{i,j}}_{1≤i,j≤n} and h_{i,j} = δ_{i,j} − 1/n, with δ_{i,j} the Kronecker delta. Isomap uncovers X such that X^T X is as close to B as possible. This is achieved by setting X = [√λ_1 q_1, √λ_2 q_2, ..., √λ_d q_d]^T, where λ_1, λ_2, ..., λ_d are the d largest eigenvalues of B and q_1, q_2, ..., q_d are the corresponding eigenvectors.
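The MDS step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; `classical_mds` is a hypothetical helper name, and rows of the returned array are the x_i (the transpose of the X defined above).

```python
import numpy as np

def classical_mds(G_sq, d):
    """Recover a d-dimensional embedding from squared geodesic distances.

    G_sq : (n, n) matrix of squared geodesic distances (the matrix
           script-G in the text); d : target dimensionality.
    """
    n = G_sq.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix, h_ij = d_ij - 1/n
    B = -H @ G_sq @ H / 2                    # inner-product matrix B
    evals, evecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:d]        # pick the d largest eigenvalues
    lam, Q = evals[idx], evecs[:, idx]
    # rows are the x_i, so that the pairwise inner products approximate B
    return Q * np.sqrt(np.maximum(lam, 0.0))

# Sanity check on points already in R^2 (geodesic distance = Euclidean):
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D_sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
X = classical_mds(D_sq, 2)
```

When the input distances are exactly Euclidean, the recovered configuration reproduces the original pairwise distances (up to rotation and translation), which is the property Isomap relies on.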
The Isomap algorithm makes use of 𝒢 to approximate the pairwise Euclidean distances on the generated manifold. Isomap demonstrates good performance when the computed geodesic distances are close to Euclidean. In this scenario, the matrix B behaves like a positive semi-definite (PSD) kernel. In the opposite scenario, a modification must be made to 𝒢 to make B PSD. In the MDS literature, this is commonly referred to as the Additive Constant Problem (ACP) [14,31,32].
To measure the error between the true underlying low-dimensional representation and that uncovered by NLDR methods, Procrustes analysis [33] is typically used. Procrustes analysis aligns two matrices, A and B, by finding the optimal translation t, rotation R, and scaling factor s that minimize the Frobenius norm between the two aligned matrices, i.e.,

ε_Proc(A, B) = min_{R,t,s} ‖sRB + t − A‖_F.

The above optimization problem has a closed-form solution obtained by performing a Singular Value Decomposition (SVD) of AB^T [33]. Consequently, one of the properties of Procrustes analysis is that ε_Proc(A, B) = 0 when A = sRB + t, i.e., when one of the matrices is a scaled, translated and/or rotated version of the other, a property we leverage in this work.
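The closed-form solution can be sketched as follows; `procrustes_error` is a hypothetical helper (rows are points, so matrices are transposed relative to the text), shown only to make the SVD-based alignment concrete.

```python
import numpy as np

def procrustes_error(A, B):
    """Frobenius error after optimally translating/rotating/scaling B onto A.

    A, B : (n, d) point configurations (one point per row). Returns ~0 when
    one configuration is a scaled, rotated and/or translated copy of the other.
    """
    A_c = A - A.mean(0)                     # optimal translation: center both
    B_c = B - B.mean(0)
    U, s, Vt = np.linalg.svd(A_c.T @ B_c)   # SVD yields the optimal rotation
    R = U @ Vt
    scale = s.sum() / (B_c ** 2).sum()      # optimal scaling factor
    return np.linalg.norm(A_c - scale * (B_c @ R.T), 'fro')

# B is an exact similarity transform of A, so the error should vanish.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 2))
theta = 0.7
Rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
B = 2.5 * A @ Rot.T + np.array([3.0, -1.0])
err = procrustes_error(A, B)
```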

Streaming Isomap
Given that the Isomap algorithm has O(n^3) complexity (where n is the size of the data), since it needs to perform an eigendecomposition of B as described in the previous section, recomputing the manifold at every time step is computationally impractical in a streaming setting. Incremental techniques have been proposed in the past [5,27] that can efficiently process new streaming points without significantly affecting the quality of the embedding.
The S-Isomap algorithm relies on the assumption that a stable manifold can be learnt using only a fraction of the stream (denoted the batch data set B), and that the remaining part of the stream (denoted the stream data set S) can be mapped to the manifold at a significantly lower cost. A convergence proof that justifies this assumption is provided in Sect. "Convergence proofs for S-Isomap and S-Isomap++". Alternatively, it can be justified by considering the convergence of the eigenvectors and eigenvalues of B as the number of points in the batch increases [34]. In particular, the bound on the convergence error for a similar NLDR method, kernel PCA, is shown to be inversely proportional to the batch size [34]. Similar arguments can be made for Isomap by considering the equivalence between Isomap and kernel PCA [26,35]. This relationship has also been shown empirically on multiple data sets [5]. The S-Isomap algorithm computes the low-dimensional representation of each new point, x_{n+1} ∈ R^d, by solving a least-squares problem that matches the dot products of the new point with the low-dimensional embeddings of the batch points (computed using Isomap) to a vector f of normalized squared geodesic distances. The least-squares problem has the form

x_{n+1} = arg min_x Σ_{i=1}^n (x_i^T x − f_i)^2, where f_i = (1/2) ((1/n) Σ_j g_{i,j}^2 − g_{i,n+1}^2)

and g_{i,j} is the geodesic distance discussed in Sect. "Problem statement and preliminaries".
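The mapping of a single stream point can be sketched as below. This is an illustrative reconstruction, assuming the normalized target f_i = ½((1/n) Σ_j g²_{i,j} − g²_{i,n+1}); `map_new_point` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def map_new_point(X_batch, g_sq_batch, g_sq_new):
    """Map a new stream point onto the learnt manifold (S-Isomap-style sketch).

    X_batch    : (n, d) low-dimensional embedding of the batch points.
    g_sq_batch : (n, n) squared geodesic distances within the batch.
    g_sq_new   : (n,)  squared geodesic distances from the new point to the batch.
    """
    # normalized squared-geodesic-distance target vector f
    f = 0.5 * (g_sq_batch.mean(axis=1) - g_sq_new)
    # solve min_x ||X_batch @ x - f||^2 in the least-squares sense
    x_new, *_ = np.linalg.lstsq(X_batch, f, rcond=None)
    return x_new
```

For a centered batch embedding the constant part of f is orthogonal to the columns of X, so when geodesic distances coincide with Euclidean ones the solution recovers the new point's coordinates exactly.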

Handling multiple manifolds
In the ideal case, when manifolds are densely sampled and sufficiently separated, clustering can be performed before applying NLDR techniques [37,38], by choosing a local neighborhood size small enough not to include points from other manifolds, yet large enough to capture the local geometry of each manifold. However, if the manifolds are close or intersecting, such methods typically fail. While methods such as Generalized Principal Component Analysis (GPCA) [39] have been proposed to generalize linear methods such as PCA to the case where the data lies on multiple sub-spaces, such ideas have not been explored for non-linear methods.
The S-Isomap++ [6] algorithm overcomes the limitations of the S-Isomap algorithm and extends it to deal with multiple manifolds. It uses the notion of multi-scale SVD [40] to define tangent manifold planes at each data point, computed at the appropriate scale, and computes similarity in a local neighborhood. Additionally, it includes a novel manifold tangent clustering algorithm that uses these tangent planes to cluster manifolds that are close and, in certain scenarios, intersecting. After initially clustering the high-dimensional batch data set, the algorithm applies NLDR on each manifold individually and eventually "stitches" them together in a global ambient space by defining transformations that map points from the individual low-dimensional manifolds to the global space. S-Isomap++ does not assume that the number of manifolds (p) is specified, and automatically infers p using its clustering mechanism. Given that the data points lie on low-dimensional and potentially intersecting manifolds, it is evident that standard clustering methods, such as K-Means [42], which operate on the observed data in R^D, will fail to correctly identify the clusters.

Note that the Incremental Isomap algorithm [27] has a slightly different formulation of the least-squares target f, where g_{i,j} again refers to the geodesic distance discussed in Sect. "Problem statement and preliminaries". The S-Isomap algorithm assumes that the data stream is drawn from a uniformly sampled, unimodal distribution p(x), and that both the stream data set S and the batch data set B are generated from p(x). Additionally, it assumes that the manifold has stabilized, i.e., that |B| = n is large enough. Under these assumptions, the expectation of the squared geodesic distances for points in the batch data set B is close to that for points in the stream data set S, and the least-squares target simplifies accordingly; the line of reasoning follows from [36].
In cases of uneven or low-density sampling, the clustering strategy discussed above may generate many small clusters. In such cases, one can merge clusters [41], based on their affinity/closeness, to make the cluster sizes reasonable.
However, S-Isomap++ can only detect manifolds that it encounters in its batch learning phase, and not those it might encounter in the streaming phase. Thus, S-Isomap++ ceases to "learn" and evolve, and cannot limit the embedding error for points in the data stream, even though it has a "stitching" mechanism to embed individual low-dimensional manifolds, which may themselves be of different dimensionalities.

Gaussian process regression
Let us assume that we are learning a probabilistic regression model to obtain the prediction at a given test input, y, using a non-linear and latent function, f(·). Assuming d = 1, the observed output, x, is related to the input as:

x = f(y) + ε, where ε ~ N(0, σ_n^2). (5)

Given a training set of inputs, Y = {y_i}_{i=1...n}, and corresponding outputs, X = {x_i}_{i=1...n}, the Gaussian Process Regression (GPR) model assumes a GP prior on the latent function values, i.e., f(y) ~ GP(m(y), k(y, y′)), where m(y) = E[f(y)] is the mean of f(y) and k(y, y′) = E[(f(y) − m(y))(f(y′) − m(y′))] is the covariance between any two evaluations of f(·). Here we use a zero-mean function (m(y) = 0), though other functions could be used as well. The GP prior states that any finite collection of the latent function evaluations is jointly Gaussian, i.e.,

[f(y_1), ..., f(y_n)]^T ~ N(0, K), (6)

where the ij-th entry of the n × n covariance matrix K is given by k(y_i, y_j). The GPR model uses (5) and (6) to obtain the predictive distribution at a new test input, y_{n+1}, as a Gaussian distribution with the following mean and variance:

E[x_{n+1}] = k_{n+1}^T (K + σ_n^2 I)^{-1} X, (7)
var[x_{n+1}] = k(y_{n+1}, y_{n+1}) + σ_n^2 − k_{n+1}^T (K + σ_n^2 I)^{-1} k_{n+1}, (8)

where k_{n+1} is an n × 1 vector with i-th value k(y_{n+1}, y_i).
The kernel function, k(·), specifies the covariance between function values, f(y_i) and f(y_j), as a function of the corresponding inputs, y_i and y_j. A popular choice, used in this work, is the squared exponential kernel:

k(y_i, y_j) = σ_s^2 exp(−‖y_i − y_j‖^2 / (2ℓ^2)),

where σ_s^2 is the signal variance and ℓ is the length scale. The quantities σ_s^2, ℓ, and σ_n^2 from (5) are the hyper-parameters of the model, and can be estimated by maximizing the marginal log-likelihood of the observed data (Y and X) under the GP prior assumption. One can observe that the predictive mean, E[x_{n+1}] in (7), can be written as an inner product, i.e.,

E[x_{n+1}] = k_{n+1}^T β, (10)

where β = (K + σ_n^2 I)^{-1} X. We will utilize this form in subsequent proofs.

Convergence proofs for S-Isomap and S-Isomap++
In this section, we demonstrate the convergence of the S-Isomap algorithm in the single-manifold setting, subsequent to which we extend the result to the multi-manifold setting, i.e., to the S-Isomap++ algorithm described above.
Theorem 1 Given a uniformly sampled, unimodal distribution from which the random batch data set B = {y_i ∈ R^D}_{i=1...n} of the S-Isomap algorithm is derived, there exists a threshold n_0 such that, when n ≥ n_0, the Procrustes error ε_Proc(τ_B, τ_ISO) between the true low-dimensional representation τ_B and the embedding τ_ISO recovered by S-Isomap becomes arbitrarily small.

Proof Based on the setting described above, the S-Isomap algorithm acts like a generative model that tries to learn the inverse mapping φ(·)^{-1}, where the associated embedding error is the Procrustes error ε_Proc(τ_B, τ_ISO).
The proof follows from [43], who showed that, given λ_1, λ_2, μ > 0 and an appropriately chosen ε > 0, for a data set Y = {y_i}_{i=1...n} sampled from a Poisson process whose density function α satisfies a δ-sampling condition (i.e., the manifold is sampled densely enough that, with high probability, every metric ball of radius δ contains at least one sample), and with the ε-rule used to construct a graph G on Y, the ratio between the graph-based distance d_G(x, y) and the true manifold distance d_M(x, y) is bounded for all x, y ∈ Y. More concretely, the following holds with probability at least (1 − μ) for all x, y ∈ Y:

(1 − λ_1) d_M(x, y) ≤ d_G(x, y) ≤ (1 + λ_2) d_M(x, y),

where the required sampling density depends on V, the volume of the manifold M, on the volume of the smallest metric ball in M of radius r, and on a δ > 0 chosen so that the sampling condition holds. A similar result can be derived in the scenario where n points are sampled independently from a fixed probability distribution p(y; θ), where α is the probability of selecting a sample from p(y; θ). Combining these conditions yields a lower bound on n that ensures the δ-sampling condition, and hence the distance bound above, is satisfied.
Thus we have an adequate threshold for the size of the batch data set B which ensures that the above condition is satisfied for the ε-rule. We can derive a similar threshold for the K-rule, observing that there is a direct one-to-one mapping between K and ε (see Sect. "Non-linear dimensionality reduction" for more details).
To complete the proof, we observe that the distance bound above implies that d_G(x, y), the graph-based distance between points x, y ∈ G, is a perturbed version of d_M(x, y), the true Euclidean distance between points x and y in R^d. As n grows, this perturbation, and consequently the Procrustes error between the recovered and the true low-dimensional representation, becomes arbitrarily small.
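The two-sided distance bound at the heart of the proof can be checked empirically on a toy manifold. The sketch below samples points on the unit circle (a 1-D manifold in R^2), builds an ε-rule graph, computes all-pairs shortest paths with Floyd-Warshall, and compares against the true arc-length distances; the sample size and ε are illustrative choices, not values from the paper.

```python
import numpy as np

# Sample n points on the unit circle; the true manifold distance between
# angles a and b is the arc length min(|a-b|, 2*pi - |a-b|).
rng = np.random.default_rng(0)
n = 500
ang = np.sort(rng.uniform(0.0, 2 * np.pi, n))
pts = np.c_[np.cos(ang), np.sin(ang)]

# epsilon-rule graph: connect points whose Euclidean distance is below eps
eps = 0.25
D = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
W = np.where(D < eps, D, np.inf)

# all-pairs shortest paths via Floyd-Warshall (min-plus updates)
d_G = W.copy()
np.fill_diagonal(d_G, 0.0)
for k in range(n):
    d_G = np.minimum(d_G, d_G[:, [k]] + d_G[[k], :])

# true manifold (arc-length) distances
diff = np.abs(ang[:, None] - ang[None, :])
d_M = np.minimum(diff, 2 * np.pi - diff)

mask = d_M > 0.5                     # compare well-separated pairs
ratio = d_G[mask] / d_M[mask]
```

For this densely sampled circle the graph is connected and the ratio d_G/d_M stays in a narrow band just below 1, consistent with the (1 − λ_1), (1 + λ_2) bounds tightening as sampling density grows.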

Extension to the multi-manifold setting
The above proof can be extended to show the convergence of the S-Isomap++ [6] algorithm, described in Sect. "Handling multiple manifolds", as follows.

Corollary 1 The batch phase of the S-Isomap++ algorithm converges under appropriate conditions.
Proof Similar to the proof for the S-Isomap algorithm, we consider a corresponding setting for the multi-manifold scenario, wherein we attempt to learn the inverse mappings {φ_i(·)^{-1}}_{i=1,...,p} for each of the p manifolds. The initial clustering step of the S-Isomap++ algorithm separates the samples of the batch data set B into individual clusters B_i, such that each cluster is mutually exclusive of the others and corresponds to one of the multiple manifolds present in the data, i.e., B = ∪_{i=1}^p B_i with B_i ∩ B_j = ∅ for i ≠ j.
The intuition for clustering, and subsequently processing each cluster separately, is that in the setting described above the observed data was generated by first sampling points from multiple convex domains U_{i=1,...,p} in R^d, and subsequently mapping those points, in a non-linear fashion and using possibly different φ_i(·), to B ∈ R^D. Thus, to learn the different inverse mappings effectively, the data needs to be clustered appropriately.
After the initial clustering step, an analysis similar to that of Theorem 1 provides thresholds n_i, ∀i ∈ {1, ..., p}, for each of the p clusters; the batch phase of the S-Isomap++ algorithm converges provided each of the p clusters B_{i=1,...,p} exceeds its corresponding threshold n_i (similar to the threshold in Theorem 1).
The S-Isomap++ algorithm does not assume that the number of manifolds (p) is specified. Refer to Sect. "Handling multiple manifolds" for more details.

Methodology
The proposed GP-Isomap algorithm follows a two-phase strategy (similar to S-Isomap and S-Isomap++), where exact manifolds are learnt from an initial batch B, and a computationally inexpensive mapping procedure subsequently processes the remainder of the stream. To handle multiple manifolds, the batch data B is first clustered via manifold tangent clustering or other standard techniques. Exact Isomap is applied on each cluster. The resulting low-dimensional data for the clusters is then "stitched" together to obtain the low-dimensional representation of the input data. The difference from past methods is the mapping procedure, which uses GPR to obtain the predictions for the low-dimensional mapping (see (7)). At the same time, the associated predictive variance (see (8)) is used to detect changes in the underlying distribution.
The overall GP-Isomap algorithm is outlined in Algorithm 1, and takes a batch data set B and the streaming data S as inputs, along with other parameters. The processing is split into two phases: a batch learning phase (Lines 1-15) and a streaming phase (Lines 16-32), which are described later in this section.

Kernel function
The key innovation here is to use a manifold-specific kernel matrix in the GPR method. The matrix B, the inner-product matrix between the points in the low-dimensional space (see Sect. "Non-linear dimensionality reduction"), could be a reasonable starting point. However, as past researchers have shown [16], typical kernels, such as the squared exponential kernel, can only be generalized to a positive definite kernel on a geodesic metric space if the space is flat. Thus B will not necessarily yield a valid positive semi-definite kernel matrix. However, a result by [32] shows that a small positive constant, λ_max, can be added to guarantee that the corrected matrix B̃ is PSD. This constant is calculated as the largest eigenvalue of a matrix M constructed from P = −H𝒢H/2, where 𝒢 is the squared geodesic distance matrix and H = {h_{i,j}}_{1≤i,j≤n}, h_{i,j} = δ_{i,j} − 1/n, with δ_{i,j} the Kronecker delta. B̃ is then derived from B via this additive correction [32], where λ_max is the largest eigenvalue of M.
The proposed GP-Isomap algorithm uses a novel geodesic distance based kernel function defined as:

k(y_i, y_j) = σ_s^2 exp(−g̃_{i,j}^2 / (2ℓ^2)),

where g̃_{i,j}^2 is the (ACP-corrected) squared geodesic distance between y_i and y_j, from which the entries b̃_{i,j} of the matrix B̃ are derived, σ_s^2 is the signal variance (whose value we fix at 1 in this work), and ℓ is the length scale hyper-parameter. The kernel matrix K is obtained by applying this kernel function to all pairs of batch points. This kernel function plays a key role in using the GPR model to map streaming points onto the learnt manifold, since it measures similarity along the low-dimensional manifold, instead of in the original space (R^D), as is typically done in GPR based solutions.
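A minimal sketch of assembling this kernel matrix, assuming a squared-exponential form over corrected squared geodesic distances with σ_s² fixed to 1; `geodesic_kernel_matrix` is a hypothetical helper name.

```python
import numpy as np

def geodesic_kernel_matrix(G_sq, ell, sigma_s=1.0):
    """Kernel matrix from (ACP-corrected) squared geodesic distances.

    G_sq : (n, n) symmetric matrix of squared geodesic distances; similarity
    is measured along the manifold rather than in the ambient space R^D.
    """
    return sigma_s ** 2 * np.exp(-G_sq / (2 * ell ** 2))
```

The only difference from a standard squared exponential kernel is the input: squared geodesic distances along the manifold replace squared Euclidean distances in R^D.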
The matrix B̃ is positive semi-definite. Consequently, the kernel matrix K is positive definite. Using Lemma 1, the proposed kernel matrix can be expanded in terms of the eigenvalue and eigenvector pairs of B̃, as discussed in Sect. "Non-linear dimensionality reduction".

Batch learning
The batch learning phase consists of the following tasks:

i). Clustering: The first step in the batch phase involves clustering the batch data set B into p individual clusters, which represent the manifolds (Line 1). In case B contains a single cluster, the algorithm can correctly detect it. Refer to Sect. "Handling multiple manifolds" for more details.

ii). Dimensionality reduction: Subsequently, full Isomap is executed on each of the p individual clusters to get low-dimensional representations LDE_{i=1,2...p} of the data points belonging to each individual cluster (Lines 3-5).

iii). Hyper-parameter estimation: The geodesic distance matrix G_i for the points in the i-th manifold and the corresponding low-dimensional representation LDE_i are fed to the GP model for each of the p manifolds, to perform hyper-parameter estimation, which outputs {φ_i^GP}_{i=1,2...p} (Lines 6-8).

iv). Learning a mapping to the global space: The low-dimensional embeddings uncovered for the manifolds can be of different dimensionalities; consequently, a mapping to a unified global space is needed. To learn this mapping, a support set ξ_s is formulated, which contains the k pairs of nearest points and l pairs of farthest points between each pair of manifolds. Subsequently, MDS is executed on this support set ξ_s to uncover its low-dimensional representation GE_s. Individual scaling and translation factors {R_i, t_i}_{i=1,2...p}, which map points from each of the individual manifolds to the global space, are learnt by solving a least-squares problem involving ξ_s (Lines 9-15).
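The least-squares step of task iv can be sketched as follows; `learn_stitching_map` is a hypothetical helper (the paper's exact formulation of Lines 9-15 may differ), shown only to make the idea of fitting per-manifold scaling/translation factors concrete.

```python
import numpy as np

def learn_stitching_map(local_pts, global_pts):
    """Fit an affine map (R_i, t_i) taking one manifold's embedding into the
    unified global space, via least squares on the support set.

    local_pts : (m, d_i) support-set points in the manifold's own embedding.
    global_pts: (m, d)   their coordinates GE_s in the global space.
    """
    m = local_pts.shape[0]
    A = np.c_[local_pts, np.ones(m)]          # append a column of 1s for t_i
    M, *_ = np.linalg.lstsq(A, global_pts, rcond=None)
    return M[:-1], M[-1]                      # R_i is (d_i, d), t_i is (d,)
```

A stream point embedded on manifold i as x can then be placed in the global space as x @ R_i + t_i, which is the form used in the streaming phase.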

Stream processing
In the streaming phase, each sample s in the stream set S is evaluated with each of the p GP models to obtain the prediction μ_i along with the variance σ_i (Lines 22-24).
The manifold with the smallest variance is chosen to embed the sample s into, using the corresponding scaling factor R_j and translation factor t_j, provided min_i |σ_i| is within the allowed threshold σ_t (Lines 25-28); otherwise, sample s is added to the unassigned set S_u (Lines 29-31). When the size of the unassigned set S_u exceeds a certain threshold n_s, its points are added to the batch data set and the base manifold is re-learnt (Lines 18-20). The assimilation of the new points into the batch may be done more efficiently in an incremental manner.
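The per-sample decision rule can be sketched as below; `process_stream_point` is a hypothetical helper where each GP model is represented as a callable returning a (mean, variance) pair, and `maps` holds the stitching factors (R_i, t_i).

```python
import numpy as np

def process_stream_point(s, gp_models, maps, sigma_t):
    """Assign one stream sample to the manifold with the lowest predictive
    variance, or mark it unassigned (a sketch of Lines 22-31).

    gp_models : list of callables s -> (mean, variance), one per manifold.
    maps      : list of (R_i, t_i) pairs mapping into the global space.
    sigma_t   : allowed variance threshold.
    """
    preds = [gp(s) for gp in gp_models]
    j = int(np.argmin([abs(v) for _, v in preds]))   # lowest-variance manifold
    mu_j, var_j = preds[j]
    if abs(var_j) <= sigma_t:
        R_j, t_j = maps[j]
        return mu_j @ R_j + t_j, j     # embed into the global space
    return None, -1                    # goes into the unassigned set S_u
```

When every model reports high variance the sample lands in S_u, which is exactly the signal used to trigger re-learning of the base manifold.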

Complexity
The runtime complexity of our proposed algorithm is dominated by the GP regression step as well as the Isomap execution step, both of which have O(n^3) complexity, where n is the size of the batch data set B. This is similar to the S-Isomap and S-Isomap++ algorithms, which also have a runtime complexity of O(n^3). The stream processing step is O(n) for each incoming streaming point, since each sample of the stream set S is processed separately. The space complexity of GP-Isomap is dominated by the O(n^2) kernel matrix.
Thus, neither the space requirement nor the runtime complexity grows with the size of the stream, which makes the algorithm appealing for handling high-volume streams.

Theoretical analysis
We first state the main result for the single-manifold case and prove it using results in the Appendix; we then present proofs for the multi-manifold case.
Theorem 2 For a single manifold setting, the prediction τ_GP of GP-Isomap is equivalent to the prediction τ_ISO of S-Isomap, i.e., the Procrustes error ε_Proc(τ_GP, τ_ISO) between τ_GP and τ_ISO is 0.
Proof The prediction of GP-Isomap is given by (10). Using Lemma 5, and the novel kernel function, the vector K_* for GP-Isomap evaluates to:

K_* = σ_s^2 exp(−G_*^2 / (2ℓ^2)) (element-wise),

where G_*^2 represents the vector containing the squared geodesic distances between the new point y_{n+1} and the batch points {y_i}_{i=1,2...n}.
Considering the above equation element-wise, we note that the i-th term of K_* equates to σ_s^2 exp(−g_{i,n+1}^2 / (2ℓ^2)). Using a Taylor series expansion, for large ℓ we have

exp(−g_{i,n+1}^2 / (2ℓ^2)) ≈ 1 − g_{i,n+1}^2 / (2ℓ^2).

The prediction by S-Isomap is obtained from the least-squares problem of Sect. "Streaming Isomap", with f = {f_i} as defined there. Rewriting, we have

f_i = (1/2)(γ − g_{i,n+1}^2),

where γ = (1/n) Σ_j g_{i,j}^2 is a constant with respect to x_{n+1}, since it depends only on squared geodesic distance values associated with the batch data set B, while x_{n+1} is part of the stream data set S.
We now consider only the 1st dimension of the predictions of GP-Isomap and S-Isomap, and demonstrate their equivalence via the Procrustes error; the analysis for the remaining dimensions follows a similar line of reasoning. For the 1st dimension, the S-Isomap prediction τ_1^ISO is a linear function of f; similarly, using Lemma 5 together with the Taylor expansion above, the 1st dimension of the GP-Isomap prediction τ_1^GP can be written in the same form. We can observe that τ_1^GP is a scaled and translated version of τ_1^ISO. Similarly, for each of the dimensions (1 ≤ i ≤ d), the GP-Isomap prediction τ_i^GP can be shown to be a scaled and translated version of the S-Isomap prediction τ_i^ISO. These individual scaling factors s_i and translation factors t_i can be represented together by a single collective scaling factor s and translation factor t. Consequently, the Procrustes error ε_Proc(τ_GP, τ_ISO) is 0 (refer to Sect. "Non-linear dimensionality reduction").
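The first-order approximation the proof relies on can be checked numerically; the range of squared geodesic distances and the length scale below are illustrative values, not ones from the paper.

```python
import numpy as np

# For a large length scale ell, the kernel entry exp(-g^2 / (2 ell^2)) is well
# approximated by its first-order Taylor expansion 1 - g^2 / (2 ell^2), i.e.
# an affine function of the squared geodesic distance, which is the same
# quantity the S-Isomap least-squares target f is built from.
g_sq = np.linspace(0.0, 25.0, 200)        # squared geodesic distances
ell = 50.0
exact = np.exp(-g_sq / (2 * ell ** 2))
taylor = 1 - g_sq / (2 * ell ** 2)
max_gap = np.abs(exact - taylor).max()
```

Since the remainder of the expansion is bounded by (g²/(2ℓ²))²/2, the gap shrinks quadratically as ℓ grows, which is why the equivalence holds in the large-ℓ regime.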

Results and analysis
In this section, we demonstrate the performance of the proposed algorithm on both synthetic and real-world data sets. In Sect. "Results on synthetic data sets", we present results for synthetic data sets, whereas Sect. "Results on sensor data set" contains results on benchmark sensor data sets. All experiments were done using Python 3.0 implementations of the proposed and related methods, and were run on a MacBook Pro (2.8 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3). Our results demonstrate that: (i) GP-Isomap is able to perform good-quality dimensionality reduction on a manifold; (ii) the reduction produced by GP-Isomap is equivalent to the corresponding output of S-Isomap (or S-Isomap++); and (iii) the predictive variance within GP-Isomap is able to identify changes in the underlying distribution of the data stream, on all data sets considered in this paper.
In the interest of space, we avoid comparing the quality of the dimensionality reduction with other methods, and refer readers to S-Isomap and S-Isomap++ where these equivalent methods were shown to be better than existing approaches for dimensionality reduction.
GP-Isomap has the following hyper-parameters: ǫ, k, l, λ, σ_t, and n_s. We set k, l, and λ to values of 16, 1, and 0.005, respectively. We study the effect of σ_t and n_s using the different data sets listed in Sects. "Results on synthetic data sets" and "Results on sensor data set", respectively.

Results on synthetic data sets
Swiss roll data sets are typically used for evaluating manifold learning algorithms. To evaluate our method under concept drift, we use the Euler Isometric Swiss Roll data set [5], consisting of four R² Gaussian patches with n = 2000 points each, chosen at random, which are embedded into R³ using a non-linear function ψ(·). The points for each of the Gaussian modes were divided randomly into equal-sized training and test sets. To test incremental concept drift, we use one of the training data sets from above, along with a uniform distribution of points for testing (refer to Fig. 1 for details). Figures 2a and 3a demonstrate our results on this data set.
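This construction can be sketched as follows. The patch centers and the particular map ψ below are illustrative assumptions, not the exact Euler isometric embedding used in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
centers = [(6, 4), (9, 7), (12, 4), (15, 7)]   # hypothetical patch centers
patches = [rng.normal(loc=c, scale=0.5, size=(2000, 2)) for c in centers]
X2 = np.vstack(patches)                        # four R^2 Gaussian patches

def psi(p):
    """A common swiss-roll embedding psi: R^2 -> R^3 (an assumption;
    the paper's Euler isometric variant may differ)."""
    t, h = p[:, 0], p[:, 1]
    return np.column_stack([t * np.cos(t), h, t * np.sin(t)])

X3 = psi(X2)           # high-dimensional observations fed to the algorithm
print(X3.shape)        # (8000, 3)
```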
To evaluate our method on sudden concept drift, we trained our GP-Isomap model using the first three of the four training sets of the Euler Isometric Swiss Roll data set. Subsequently, we stream points randomly from the test sets of only the first three classes initially, and later stream points from the test set of the fourth class, keeping track of the predictive variance throughout. Figure 2a demonstrates the sudden increase (see red line) in the variance of the stream when the streaming points come from the fourth class, i.e., the unknown mode. Thus GP-Isomap is able to detect concept drift correctly, and is able to map all of the data points correctly onto the lower-dimensional manifold, as shown in Fig. 3a. The bottom panel of Fig. 1 demonstrates the performance of S-Isomap++ on this data set: it fails to map the streaming points of the unknown mode correctly, given that it had not encountered the unknown mode during the batch training phase.
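The variance-based detection idea can be sketched with an off-the-shelf GPR. Note that this is only a minimal sketch: it uses a standard Euclidean RBF kernel on toy Gaussian modes, not the paper's geodesic-distance kernel, and all modes, locations, and hyper-parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
# batch: points from three "known" Gaussian modes in the ambient space
X_batch = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 3))
                     for c in ([0, 0, 0], [3, 0, 0], [0, 3, 0])])
y_batch = X_batch[:, 0]  # stand-in for one low-dimensional coordinate

# fixed kernel hyper-parameters (optimizer=None) to keep the sketch deterministic
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                               alpha=1e-2, optimizer=None)
gpr.fit(X_batch, y_batch)

x_known = rng.normal(loc=[3, 0, 0], scale=0.3, size=(1, 3))  # from a known mode
x_novel = np.array([[8.0, 8.0, 8.0]])                        # from an unseen mode
_, s_known = gpr.predict(x_known, return_std=True)
_, s_novel = gpr.predict(x_novel, return_std=True)
print(s_known[0] < s_novel[0])  # True: the predictive variance flags the drift
```

Tracking a running average of this predictive standard deviation over the stream gives the red-line detector described above.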
In Sect. "Theoretical analysis", we proved the equivalence between the prediction of S-Isomap and that of GP-Isomap using our novel kernel. In Fig. 4, we show empirically, via the Procrustes error (PE), that the prediction of S-Isomap matches that of GP-Isomap, irrespective of the size of the batch used. The PE for GP-Isomap with the Euclidean distance based kernel remains high irrespective of the size of the batch, which clearly demonstrates the unsuitability of this kernel for learning mappings in the low-dimensional space.

Results on sensor data set
In this section, we present results from different benchmark sensor data sets to demonstrate the efficacy of our algorithm.

Results on gas sensor array drift data set
The Gas Sensor Array Drift [45] data set is a benchmark data set (n = 13910) available to research communities to develop strategies for dealing with concept drift. It uses measurements from 16 chemical sensors to discriminate between 6 gases (class labels) at various concentrations. We demonstrate the performance of our proposed method on this data set.

Fig. 2 Using variance to detect concept drift for different data sets. The x-axis represents time and the y-axis represents the model's predictive variance for the stream. Initially, when the stream consists of samples generated from known modes, the variance is low. Later, when samples from an unrecognized mode appear, the variance drastically shoots up. For the first two data sets, noisy instances in the initial part sporadically get assigned a large variance. The variance is well-behaved for the third data set. The optimal values of the hyper-parameters (n_s, σ_t) were set to (1000, 0.7), (412, 1.2), and (855, 0.5) for the three data sets
We first remove instances which had invalid/empty entries as feature values. Subsequently, the data is mean-normalized. Data points from the first five classes were divided into training and test sets. We train our model using the training data from four out of these five classes. While testing, we stream points randomly from the test sets of these four classes first, and later stream points from the test set of the fifth class. Figures 2b and 3b demonstrate our results on this data set. From Fig. 2b, we observe that our model can clearly detect the concept drift due to the unknown fifth class by tracking the variance of the stream, using its running average (red line). However, as shown in Fig. 3b, a two-dimensional manifold is not sufficient to capture the cluster structure in this data set, and a higher-dimensional manifold is possibly required.
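A minimal sketch of this preprocessing pipeline, on a small hypothetical frame rather than the actual gas sensor files (column names, sizes, and the 50/50 per-class split fraction are all illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# hypothetical frame standing in for gas-sensor features plus a class label
df = pd.DataFrame(rng.normal(size=(30, 2)), columns=["s1", "s2"])
df["gas"] = np.repeat([1, 2, 3], 10)
df.loc[[0, 11], "s1"] = np.nan          # a couple of invalid/empty entries

df = df.dropna()                        # remove instances with invalid values
feats = ["s1", "s2"]
df[feats] = df[feats] - df[feats].mean()  # mean-normalize the features

# per-class split: part of each class for batch training, rest for streaming
train = df.groupby("gas").sample(frac=0.5, random_state=0)
test = df.drop(train.index)
print(len(df), len(train), len(test))
```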

Results on human activity recognition (HAR) data set
The Human Activity Recognition [46] data set consists of multiple data sets which focus on discriminating between different activities, i.e., predicting which activity was performed at a specific point in time. In this work, we focus on the Weight Lifting Exercises (WLE) data set (n = 39242), which investigates how well an activity was performed by the wearer of different sensor devices. The WLE data set consists of six young, healthy participants who performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes.
The data set was cleaned, i.e., instances with invalid/empty entries were removed. Subsequently, the data points from the different classes were mean-normalized and divided into training and test sets. Figures 2c and 3c demonstrate our results on this data set; Fig. 2c demonstrates the concept drift phenomenon. Similar to the methodology used earlier to detect concept drift, we initially trained our algorithm using instances from the latter four classes only; during the streaming phase, we first randomly selected instances from the streaming sets of these four classes, and later streamed points from the first class, keeping track of the predictive variance.

Conclusions
We have proposed a streaming Isomap algorithm (GP-Isomap) that can be used to learn non-linear low-dimensional representations of high-dimensional data arriving in a streaming fashion. This algorithm can have significant applications in areas involving analysis of high-dimensional data streams [47], especially in constrained environments, such as real-time situational monitoring [48,49], healthcare monitoring [50], scientific data visualization [51], etc. We prove that using a GPR formulation to map incoming data instances onto an existing manifold is equivalent to using existing geometric strategies [5,6]. Moreover, by utilizing a small batch for the initial learning of the manifold, as well as for training the GPR model, the method scales linearly with the size of the stream, thereby ensuring its applicability to practical problems. The Bayesian inference of the GPR model allows us to estimate the variance associated with the mapping of the streaming instances. This variance is shown to be a strong indicator of changes in the underlying stream properties on a variety of data sets. By utilizing the variance, one can devise re-training strategies that can include expanding the batch data set. While in the experiments we have demonstrated the ability of GP-Isomap to detect sudden shifts in the underlying distributions, the algorithm can also be used to detect gradual shifts, as illustrated in Fig. 1. While we have focused on the Isomap algorithm in this paper, similar formulations can be applied to other NLDR methods, such as LLE [2], Laplacian Eigenmaps [29], and Hessian Eigenmaps [30], and will be explored in future research.

Appendix
Lemma 1 The matrix exponential of M, for rank(M) = d and symmetric M, is given by

e^M = I + Σ_{i=1}^{d} (e^{λ_i} − 1) q_i q_i^⊤,

where {λ_i}_{i=1,2,...,d} are the d largest eigenvalues of M and {q_i}_{i=1,2,...,d} are the corresponding eigenvectors, such that q_i^⊤ q_j = δ_{i,j}.

Proof Let M be an n × n real matrix. The exponential e^M is given by

e^M = I + M + M²/2! + ...,

where I is the identity. A real, symmetric M has real eigenvalues and mutually orthogonal eigenvectors, i.e., M = Σ_{i=1}^{n} λ_i q_i q_i^⊤, where the {λ_i}_{i=1,...,n} are real and q_i^⊤ q_j = δ_{i,j}. Given that M has rank d, we have M = Σ_{i=1}^{d} λ_i q_i q_i^⊤. Substituting this into the series, each power satisfies M^k = Σ_{i=1}^{d} λ_i^k q_i q_i^⊤, so that

e^M = I + (e^{λ_1} − 1) q_1 q_1^⊤ + (e^{λ_2} − 1) q_2 q_2^⊤ + ... + (e^{λ_d} − 1) q_d q_d^⊤.

The Procrustes error between τ, the true underlying representation, and τ^ISO = φ^{−1}(B), the embedding uncovered by Isomap, is small (ǫ_Proc ≈ 0), i.e., the batch phase of the S-Isomap algorithm converges, where φ(·) is the non-linear function which maps data points from the underlying low-dimensional ground-truth representation U to B ∈ R^D, and the ground truth U originally resides in a convex R^d Euclidean space.
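The identity in Lemma 1 can be checked numerically for a random low-rank symmetric M (the size, rank, and eigenvalues below are arbitrary):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
# build a random symmetric rank-d matrix M = sum_i lam_i q_i q_i^T
n, d = 6, 2
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))   # orthonormal columns q_i
lam = np.array([1.5, -0.7])                    # nonzero eigenvalues
M = (Q * lam) @ Q.T

# Lemma 1: e^M = I + sum_i (e^{lam_i} - 1) q_i q_i^T
lemma = np.eye(n) + (Q * (np.exp(lam) - 1)) @ Q.T
print(np.allclose(expm(M), lemma))  # True
```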

Let D_M and D_G represent the squared distance matrices corresponding to d_M(x, y) and d_G(x, y), respectively. Thus we have D_G = D_M + ΔD_M, where ΔD_M = {Δd_M(i, j)}_{1≤i,j≤n} and the Δd_M(i, j) are bounded due to (12). The robustness of MDS to small perturbations was demonstrated in the past [44] as follows. Let F represent the zero-diagonal symmetric matrix which perturbs the true squared distance matrix B to B + ΔB = B + ǫF. Then the Procrustes error between the embeddings uncovered by MDS for B and for B + ΔB is of order ǫ, and hence very small for small entries {f_{i,j}}_{1≤i,j≤n} ∈ F, where {e_k (λ_k)}_{k=1,...,n} represent the eigenvectors (eigenvalues) of B, the double summation in the perturbation expansion is over pairs (j, k) = 1, 2, ..., (n − 1), excluding those pairs (j, k) both of whose entries lie in the range (K + 1), (K + 2), ..., (n − 1), K = Σ_{k=1}^{n} I(λ_k > 0), and I(·) is the indicator function. We substitute ǫ = 1 and replace B with D_M and ΔB with ΔD_M above to complete the proof, since the entries of ΔD_M are very small, i.e., {0 ≤ Δd_M(i, j) ≤ 2ǫ}_{1≤i,j≤n}, where ǫ = max(ǫ_1, ǫ_2) for small ǫ_1, ǫ_2, given that the condition n > n_0 of (12) is satisfied. Thus the embedding uncovered by S-Isomap for a batch data set B with |B| = n > n_0 converges asymptotically to the true embedding, up to translation, rotation and scaling factors.
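The MDS robustness argument can be illustrated numerically: a small symmetric zero-diagonal perturbation of a squared distance matrix changes the classical MDS embedding only slightly in Procrustes terms. The classical MDS implementation below is the standard double-centering construction, written for this illustration, not code from the paper; the data and noise scale are arbitrary.

```python
import numpy as np
from scipy.spatial import procrustes
from scipy.spatial.distance import pdist, squareform

def classical_mds(D2, d=2):
    """Classical MDS on a squared-distance matrix D2 via double centering."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J                      # Gram matrix
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]              # top-d eigenpairs
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))
D2 = squareform(pdist(X)) ** 2

F = rng.normal(scale=1e-3, size=D2.shape)      # small perturbation epsilon*F
F = (F + F.T) / 2                              # symmetric ...
np.fill_diagonal(F, 0.0)                       # ... with zero diagonal

_, _, disparity = procrustes(classical_mds(D2), classical_mds(D2 + F))
print(disparity)  # tiny: the embedding barely moves
```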

Fig. 3
Fig. 3 Low-dimensional representations uncovered by GP-Isomap for three different data sets. For the Swiss roll data, GP-Isomap is able to learn the structure in the data using a 2-D reduction, while for the real-world sensor data sets, the structure is not evident in 2-D and a higher-dimensional manifold is possibly required

Fig. 4
Fig. 4 Procrustes error (PE) between the ground truth and a GP-Isomap (blue line) with the geodesic distance based kernel, b S-Isomap (dashed blue line with dots), and c GP-Isomap (green line) using the Euclidean distance based kernel, for different fractions (f) of data used in the batch B. The behavior of the PE for a closely matches that for b. However, the PE for GP-Isomap using the Euclidean distance kernel remains high irrespective of f, demonstrating its unsuitability for manifolds

Lemma 3 The inverse of the Gaussian kernel, for rank(M) = d and symmetric M, is given by

(K + σ_n² I)^{−1} = αI − α² Σ_{j=1}^{d} c_j q_j q_j^⊤ / (1 + α c_j),

where α = 1/(1 + σ_n²) and c_j = exp(−λ_j/(2ℓ²)) − 1.

Proof Using (22) for d = 1, we have K + σ_n² I = (1 + σ_n²) I + c_1 q_1 q_1^⊤, where c_1 = exp(−λ_1/(2ℓ²)) − 1. Representing 1/(1 + σ_n²) as α, and using (1 + σ_n²)I as A, c_1 q_1 as u, and q_1 as v in the Sherman-Morrison identity [52], we have

(K + σ_n² I)^{−1} = αI − α² c_1 q_1 q_1^⊤ / (1 + α c_1).
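The rank-1 case of this proof can be verified numerically; the values of λ_1, ℓ, and σ_n² below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8
q1 = np.linalg.qr(rng.normal(size=(n, 1)))[0][:, 0]   # unit vector q_1
lam1, ell, sigma_n2 = 2.0, 1.5, 0.1                   # hypothetical values

c1 = np.exp(-lam1 / (2 * ell**2)) - 1.0
# K + sigma_n^2 I for d = 1: (1 + sigma_n^2) I + c_1 q_1 q_1^T
K_plus = (1 + sigma_n2) * np.eye(n) + c1 * np.outer(q1, q1)

alpha = 1.0 / (1 + sigma_n2)
# Sherman-Morrison with A = (1+sigma_n^2)I, u = c_1 q_1, v = q_1:
# (A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u)
inv_sm = alpha * np.eye(n) - alpha**2 * c1 * np.outer(q1, q1) / (1 + alpha * c1)
print(np.allclose(inv_sm, np.linalg.inv(K_plus)))  # True
```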