Learning manifolds from non-stationary streams
Journal of Big Data volume 11, Article number: 42 (2024)
Abstract
Streaming adaptations of manifold-learning-based dimensionality reduction methods, such as Isomap, rest on the assumption that a small initial batch of observations is enough to learn the manifold exactly, while the remaining streaming instances can be cheaply mapped onto it. However, no theoretical results show that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur while the data is streaming. We present theoretical results showing that the quality of a manifold converges asymptotically as the size of the data increases. We then show that a Gaussian Process Regression (GPR) model that uses a manifold-specific kernel function and is trained on an initial batch of sufficient size can closely approximate the state-of-the-art streaming Isomap algorithms, and that the predictive variance obtained from the GPR prediction can be employed as an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn low-dimensional representations of high-dimensional data in a streaming setting, while identifying shifts in the generative distribution. For instance, key findings on a gas sensor array data set show that our method can detect changes in the underlying data stream triggered by real-world factors, such as the introduction of a new gas into the system, while efficiently mapping data onto a low-dimensional manifold.
Introduction
High-dimensional data is inherently difficult to explore and analyze, owing to the “curse of dimensionality” that renders many statistical and machine learning techniques inadequate. In this context, nonlinear dimensionality reduction (NLDR) has proved to be an indispensable tool. Manifold-learning-based NLDR methods, such as Isomap [1] and Locally Linear Embedding (LLE) [2], assume that the distribution of the data in the high-dimensional observed space is not uniform. Instead, the data is assumed to lie near a nonlinear low-dimensional manifold embedded in the high-dimensional space. By exploiting the geometric properties of the manifold, e.g., smoothness, such methods infer the low-dimensional representation of the data from the high-dimensional observations.
A key shortcoming of NLDR methods is their \(O(n^3)\) complexity, where n is the size of the data [1]. If applied directly to streaming data, where data arrives one point at a time, NLDR methods have to recompute the entire manifold at every time step, making such a naive adaptation prohibitively expensive. To alleviate the computational problem, landmark-based methods [3] and general out-of-sample extension methods [4] have been proposed. However, these techniques are still too computationally expensive for practical applications. Recently, a streaming adaptation of the Isomap algorithm [1], a widely used NLDR method, was proposed [5]. This method, called S-Isomap, relies on exact learning from a small initial batch of observations, followed by approximate mapping of the subsequent stream of observations. An extension to the case where the observations are sampled from multiple, possibly intersecting, manifolds, called S-Isomap++, was subsequently proposed [6].
Empirical results on benchmark data sets show that these methods can reliably learn the manifold from a small initial batch of observations. However, two issues remain. First, no theoretical bounds exist on the quality of the manifold as a function of the initial batch size. Second, these methods assume that the underlying generative distribution is stationary over the stream, and they are unable to detect when the distribution “drifts” or abruptly “shifts” away from the base, resulting in incorrect low-dimensional mappings (see Fig. 1).
The focus of this paper is two-fold. We first provide theoretical results showing that the quality^{Footnote 1} of a manifold, as learnt by Isomap, converges asymptotically as the data size, n, increases. This result is necessary to establish the correctness of streaming methods such as S-Isomap and S-Isomap++ under the assumption of stationarity. Next, we propose a methodology to detect changes in the underlying distribution of the stream (drifts and shifts) and to inform the streaming methods to update the base manifold.
We employ a Gaussian Process (GP) [7] based adaptation of Isomap to process high-throughput streams. The use of a GP is enabled by a kernel that measures the relationship between a pair of observations along the manifold, and not in the original high-dimensional space. We prove that the low-dimensional representations inferred using the GP-based method, GP-Isomap, are equivalent to the representations obtained using the state-of-the-art streaming Isomap methods [5, 6]. Additionally, we show empirically, on synthetic and real data sets, that the predictive variance associated with the GP predictions is an effective indicator of changes (either gradual drifts or sudden shifts) in the underlying generative distribution, and can be employed to inform the algorithm to “re-learn” the core manifold.
Related works
Processing data streams efficiently using standard approaches is challenging in general, given that streams require real-time processing and cannot be stored permanently. Any form of analysis, including detecting concept drift, requires adequate summarization that can deal with the inherent constraints and approximate the characteristics of the stream well. Sampling-based strategies used in this context include random sampling [8, 9] as well as decision-tree-based approaches [10]. To identify concept drift, maintaining statistical summaries on a streaming “window” is a typical strategy [11,12,13]. However, none of these are applicable to learning a latent representation from the data, e.g., manifolds, in the presence of changes in the stream distribution.
We discuss limitations of existing incremental and streaming solutions that have been developed specifically for manifold learning, and in particular for the Isomap algorithm, in . Coupling Isomap with GP Regression (GPR) has been explored in the past [14,15,16,17], though not in the context of streaming data. This includes a Mercer kernel-based Isomap technique [14] and an emulator pipeline that uses Isomap to determine a low-dimensional representation, whose output is fed to a GPR model [15].
The idea of using GPR to detect concept drift is novel, although the Bayesian nonparametric approach of [18], primarily intended for anomaly detection, comes close to our work in a single-manifold setting. However, its choice of a kernel based on the Euclidean distance (in the original \({\mathbb {R}}^D\) space) for the covariance matrix can result in a high Procrustes error, as shown in Fig. 4. Additionally, that approach does not scale, since it uses no approximation to process new streaming points “cheaply”.
We also note that a family of GP-based non-spectral^{Footnote 2} nonlinear dimensionality reduction methods exists, comprising the Gaussian Process Latent Variable Model (GPLVM) [20] and its variants [19, 21]. GPLVM assumes that the high-dimensional observations are generated from the corresponding low-dimensional representations, using a GP prior. The latent low-dimensional representations are then inferred by maximizing the marginalized log-likelihood of the observed data, which is an optimization problem with n unknown d-dimensional vectors, where d is the length of the low-dimensional representation. In contrast, the GP-Isomap algorithm assumes that the low-dimensional representations are generated from the corresponding high-dimensional data, using a manifold-specific kernel matrix.
There is a considerable body of literature on dimensionality reduction [22, 23], including recent work that uses deep-learning-based models [24]; however, these methods cannot be applied in a streaming setting. While some recent works use PCA in a streaming setting [25], these are inherently linear and hence are not applicable where the manifolds are nonlinear.
Problem statement and preliminaries
We first formulate the NLDR problem, provide background on Isomap, and discuss its out-of-sample and streaming extensions [5, 6, 26, 27]. Additionally, we provide a brief introduction to Gaussian Process (GP) analysis.
Nonlinear dimensionality reduction
Given high-dimensional data \({\textbf {Y}} = \{ {\textbf {y}} _i \}_{i = 1 \ldots n}\), where \({\textbf {y}} _i \in {\mathbb {R}}^D\), the NLDR problem is concerned with finding its corresponding low-dimensional representation \({\textbf {X}} = \{ {\textbf {x}} _i \}_{i = 1 \ldots n}\), such that \({\textbf {x}} _i \in {\mathbb {R}}^d\), where \({d} \ll {D}\).
NLDR methods assume that the data lies along a low-dimensional manifold embedded in a high-dimensional space, and exploit the global (Isomap [1], Minimum Volume Embedding [28]) or local (LLE [2], Laplacian Eigenmaps [29], Hessian Eigenmaps [30]) properties of the manifold to map each \({\textbf {y}} _i\) to its corresponding \({\textbf {x}} _i\).
The Isomap algorithm [1] maps each \({\textbf {y}} _i\) to its low-dimensional representation \({\textbf {x}} _i\) in such a way that the geodesic distance along the manifold between any two points, \({\textbf {y}} _i\) and \({\textbf {y}} _j\), is as close as possible to the Euclidean distance between \({\textbf {x}} _i\) and \({\textbf {x}} _j\). The geodesic distance is approximated by computing the shortest path between the two points on the k-nearest neighbor graph^{Footnote 3} and is stored in the geodesic distance matrix \({{\textbf {G}} } = \{ {{\textbf {g}} }_{i,j} \}_{1 \le i, j \le n}\), where \({{\textbf {g}} }_{i,j}\) is the geodesic distance between the points \({\textbf {y}} _i\) and \({\textbf {y}} _j\). The matrix \(\widetilde{{{\textbf {G}} }}= \{ {{{\textbf {g}} }^{2}_{i,j}} \}_{1 \le i, j \le n}\) contains the squared geodesic distances. The Isomap algorithm recovers \({\textbf {x}} _i\) by applying classical Multidimensional Scaling (MDS) to \(\widetilde{{{\textbf {G}} }}\). Let \({{\textbf {B}} }\) be the inner product matrix between the different \({\textbf {x}} _i\). Assuming \(\sum\nolimits_{i = 1}^{n} {\textbf {x}} _i = 0\), \({{\textbf {B}} }\) can be retrieved as \({{\textbf {B}} } = -{{\textbf {H}} }\widetilde{{{\textbf {G}} }}{{\textbf {H}} }/2\), where \({{\textbf {H}} } = \{ {{\textbf {h}} }_{i,j} \}_{1 \le i, j \le n}\) and \({{\textbf {h}} }_{i,j} = {{\varvec{\delta }}}_{i,j} - 1/{n}\), with \({{\varvec{\delta }}}_{i,j}\) the Kronecker delta. Isomap uncovers \({\textbf {X}}\) such that \({\textbf {X}} ^T{\textbf {X}}\) is as close to \({{\textbf {B}} }\) as possible.
This is achieved by setting \({\textbf {X}} = \{ \sqrt{{{\varvec{\lambda }}}_1}{{\textbf {q}} }_1 \; \sqrt{{{\varvec{\lambda }}}_2}{{\textbf {q}} }_2 \; \ldots \; \sqrt{{{\varvec{\lambda }}}_d}{{\textbf {q}} }_d \}^T\), where \({{\varvec{\lambda }}}_1, {{\varvec{\lambda }}}_2 \dots {{\varvec{\lambda }}}_d\) are the d largest eigenvalues of \({{\textbf {B}} }\) and \({{\textbf {q}} }_1, {{\textbf {q}} }_2 \dots {{\textbf {q}} }_d\) are the corresponding eigenvectors.
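The batch Isomap steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `geodesic_distances` and `classical_mds` are hypothetical, the k-nearest neighbor graph is symmetrized, and shortest paths are computed with a dense Floyd-Warshall pass that is only practical for small n.

```python
import numpy as np

def geodesic_distances(Y, k):
    """Approximate geodesics by shortest paths on a symmetric k-NN graph.

    Y is an n x D array of observations (one row per point).
    """
    n = Y.shape[0]
    E = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))  # Euclidean distances
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nbrs = np.argsort(E, axis=1)[:, 1:k + 1]      # skip column 0 (the point itself)
    for i in range(n):
        G[i, nbrs[i]] = E[i, nbrs[i]]
        G[nbrs[i], i] = E[i, nbrs[i]]             # symmetrize the graph
    for m in range(n):                            # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return G

def classical_mds(G_sq, d):
    """Return the d x n embedding X from an n x n matrix of squared distances."""
    n = G_sq.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # h_ij = delta_ij - 1/n
    B = -0.5 * H @ G_sq @ H                       # inner-product matrix
    eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]           # indices of the d largest
    lam = np.clip(eigvals[top], 0.0, None)        # guard against small negatives
    return (eigvecs[:, top] * np.sqrt(lam)).T     # rows are sqrt(lambda_k) * q_k
```

For example, `classical_mds(geodesic_distances(Y, k=8) ** 2, d=2)` would produce a two-dimensional embedding with one column per observation, matching the \(d \times n\) layout of \({\textbf {X}}\) used in the text.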
The Isomap algorithm makes use of \(\widetilde{{{\textbf {G}} }}\) to approximate the pairwise Euclidean distances on the generated manifold. Isomap demonstrates good performance when the computed geodesic distances are close to Euclidean. In this scenario, the matrix \({{\textbf {B}} }\) behaves like a positive semidefinite (PSD) kernel. The opposite scenario requires a modification to be made to \(\widetilde{{{\textbf {G}} }}\) to make it PSD. In MDS literature, this is commonly referred to as the Additive Constant Problem (ACP) [14, 31, 32].
To measure the error between the true underlying low-dimensional representation and the one uncovered by NLDR methods, Procrustes analysis [33] is typically used. Procrustes analysis aligns two matrices, \({{\textbf {A}} }\) and \({{\textbf {B}} }\), by finding the optimal translation, \({{\textbf {t}} }\), rotation, \({{\textbf {R}} }\), and scaling factor, \({\textbf {s}}\), that minimize the Frobenius norm of the difference between the two aligned matrices, i.e.,
\[ {{\varvec{\epsilon }}}_{\text {Proc}}({{\textbf {A}} }, {{\textbf {B}} }) = \min _{{{\textbf {R}} }, {{\textbf {t}} }, {\textbf {s}} } \Vert {\textbf {s}} {{\textbf {R}} }{{\textbf {B}} } + {{\textbf {t}} } - {{\textbf {A}} } \Vert _F \]
The above optimization problem has a closed-form solution obtained by performing a Singular Value Decomposition (SVD) of \({{\textbf {A}} }{{\textbf {B}} }^T\) [33]. Consequently, \({{\varvec{\epsilon }}}_{\text {Proc}}({{\textbf {A}} }, {{\textbf {B}} }) = 0\) when \({{\textbf {A}} } = {\textbf {s}} {{\textbf {R}} }{{\textbf {B}} } + {{\textbf {t}} }\), i.e., when one of the matrices is a scaled, translated and/or rotated version of the other, a property that we leverage in this work.
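The closed-form alignment can be illustrated as follows. This is a sketch under assumptions: the function name `procrustes_error` is hypothetical, both matrices are taken to be \(d \times n\) with one column per point, and the optimal map is allowed to be any orthogonal matrix (so it may include a reflection).

```python
import numpy as np

def procrustes_error(A, B):
    """Frobenius residual after optimally aligning B to A (both d x n)."""
    A_c = A - A.mean(axis=1, keepdims=True)   # optimal translation removes the means
    B_c = B - B.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(A_c @ B_c.T)     # SVD of the cross-product matrix
    R = U @ Vt                                # optimal orthogonal map
    s = S.sum() / (B_c ** 2).sum()            # optimal scaling factor
    return np.linalg.norm(s * R @ B_c - A_c)  # Frobenius norm of the residual
```

If `A` is built as a scaled, rotated and translated copy of `B`, the returned value is zero up to floating-point error, which is the invariance the text relies on.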
Streaming Isomap
The Isomap algorithm has a complexity of \({\mathcal {O}}(n^3)\) (where n is the size of the data), since it needs to perform an eigendecomposition of \({{\textbf {B}} }\) as described in the previous section. Recomputing the manifold for every new point is therefore computationally impractical in a streaming setting. Incremental techniques have been proposed in the past [5, 27] that can efficiently process new streaming points without significantly affecting the quality of the embedding.
The S-Isomap algorithm relies on the assumption that a stable manifold can be learnt using only a fraction of the stream (denoted as the batch data set \({{\mathcal {B}}}\)), and that the remaining part of the stream (denoted as the stream data set \({{\mathcal {S}}}\)) can be mapped to the manifold at significantly lower cost. A convergence proof that justifies this assumption is provided in Sect. “Convergence proofs for S-Isomap and S-Isomap++”. Alternatively, this can be justified by considering the convergence of the eigenvectors and eigenvalues of \({{\textbf {B}} }\) as the number of points in the batch increases [34]. In particular, the bound on the convergence error for a similar NLDR method, kernel PCA, is shown to be inversely proportional to the batch size [34]. Similar arguments can be made for Isomap by considering the equivalence between Isomap and kernel PCA [26, 35]. This relationship has also been shown empirically for multiple data sets [5]. The S-Isomap algorithm computes the low-dimensional representation of each new point, \({{\textbf {x}} _{n+1}}\in {\mathbb {R}}^d\), by solving a least-squares problem that matches the dot products of the new point with the low-dimensional embedding of the points in the batch data set, \({\textbf {X}}\) (computed using Isomap), to the normalized squared geodesic distance vector \({\textbf {f}}\). The least-squares problem has the following form:
\[ {\textbf {X}} ^T {\textbf {x}} _{n+1} = {\textbf {f}} \]
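where^{Footnote 4}
\[ {\textbf {f}} _i \simeq \frac{1}{2}\Big ( \frac{1}{n}\sum _{j} {{\textbf {g}} }^{2}_{i,j} - {{\textbf {g}} }^{2}_{i,n+1} \Big ) \]
and \({{\textbf {g}} }_{i,j}\) refers to the geodesic distance discussed in Sect. “Problem statement and preliminaries”.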
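The streaming update can be sketched as an ordinary least-squares solve. The sketch assumes that the normalized vector takes the common form \(f_i = \frac{1}{2}\big (\frac{1}{n}\sum _j g^2_{i,j} - g^2_{i,n+1}\big )\), that the batch embedding is centered, and that the squared geodesic distances from the new point to the batch points have already been computed; the function name `map_new_point` and its argument names are hypothetical.

```python
import numpy as np

def map_new_point(X, G_sq, g_new_sq):
    """Embed one streaming point against a fixed batch embedding.

    X        : d x n batch embedding from Isomap (columns sum to zero)
    G_sq     : n x n squared geodesic distances among batch points
    g_new_sq : length-n squared geodesic distances from the new point
    """
    f = 0.5 * (G_sq.mean(axis=1) - g_new_sq)          # normalized distance vector (assumed form)
    x_new, *_ = np.linalg.lstsq(X.T, f, rcond=None)   # solve X^T x = f in the least-squares sense
    return x_new
```

Because the columns of the batch embedding sum to zero, any constant offset in `f` leaves the least-squares solution unchanged, which is why the simplified form of `f` suffices.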
Handling multiple manifolds
In the ideal case, when manifolds are densely sampled and sufficiently separated, clustering can be performed before applying NLDR techniques [37, 38], by choosing an appropriate local neighborhood size so as not to include points from other manifolds and still be able to capture the local geometry of the manifold. However, if the manifolds are close or intersecting, such methods typically fail. While methods such as Generalized Principal Component Analysis (GPCA) [39] have been proposed to generalize linear methods such as PCA for a case where the data lies on multiple subspaces, such ideas have not been explored for nonlinear methods.
The S-Isomap++ [6] algorithm overcomes limitations of the S-Isomap algorithm and extends it to deal with multiple manifolds. It uses the notion of Multi-scale SVD [40] to define tangent manifold planes at each data point, computed at the appropriate scale, and computes similarity in a local neighborhood. Additionally, it includes a novel manifold tangent clustering algorithm that uses these tangent manifold planes to deal with the above issue of clustering manifolds that are close and, in certain scenarios, intersecting. After initially clustering the high-dimensional batch data set, the algorithm applies NLDR to each manifold individually and eventually “stitches” them together in a global ambient space by defining transformations that map points from the individual low-dimensional manifolds to the global space. S-Isomap++ does not assume that the number of manifolds (p) is specified and automatically infers p using its clustering mechanism.^{Footnote 5} Given that the data points lie on low-dimensional and potentially intersecting manifolds, standard clustering methods, such as K-Means [42], that operate on the observed data in \({\mathbb {R}}^{\textbf {D}}\) will fail to identify the clusters correctly.
However, S-Isomap++ can only detect manifolds that it encounters in its batch learning phase, and not those that it might encounter in the streaming phase. Thus, S-Isomap++ ceases to “learn” and cannot evolve to limit the embedding error for points in the data stream, even though it has a “stitching” mechanism to embed individual low-dimensional manifolds, which might themselves be of different dimensions.
Gaussian process regression
Let us assume that we are learning a probabilistic regression model to obtain the prediction at a given test input, \({\textbf {y}}\), using a nonlinear and latent function, \(f(\varvec{\cdot })\). Assuming^{Footnote 6} \(d=1\), the observed output, x, is related to the input as:
\[ x = f({\textbf {y}} ) + {{\varvec{\varepsilon }}}, \quad {{\varvec{\varepsilon }}} \sim {\mathcal {N}}(0, {{\varvec{\sigma }}}_n^2) \tag{5} \]
Given a training set of inputs, \({\textbf {Y}} = \{ {\textbf {y}} _i \}_{i = 1 \ldots n}\) and corresponding outputs, \({\textbf {X}} = \{{\bf x}_i\}_{i = 1 \ldots n}\),^{Footnote 7} the Gaussian Process Regression (GPR) model assumes a GP prior on the latent function values, i.e., \(f({\textbf {y}} ) \sim GP(m({\textbf {y}} ),k({\textbf {y}} ,{\textbf {y}} '))\), where \(m({\textbf {y}} )\) is the mean of \(f({\textbf {y}} )\) and \(k({\textbf {y}} ,{\textbf {y}} ')\) is the covariance between any two evaluations of \(f(\varvec{\cdot })\), i.e., \(m({\textbf {y}} ) = {\mathbb {E}}[f({\textbf {y}} )]\) and \(k({\textbf {y}} ,{\textbf {y}} ') = {\mathbb {E}}[(f({\textbf {y}} ) - m({\textbf {y}} ))(f({\textbf {y}} ') - m({\textbf {y}} '))]\). Here we use a zero-mean function (\(m({\textbf {y}} ) = 0\)), though other functions could be used as well. The GP prior states that any finite collection of the latent function evaluations is jointly Gaussian, i.e.,
\[ [f({\textbf {y}} _1), \ldots , f({\textbf {y}} _n)]^T \sim {\mathcal {N}}({\textbf {0}}, K) \tag{6} \]
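where the \(ij^{th}\) entry of the \(n \times n\) covariance matrix, K, is given by \(k({\textbf {y}} _i, {\textbf {y}} _j)\). The GPR model uses (5) and (6) to obtain the predictive distribution at a new test input, \({\textbf {y}} _{n+1}\), as a Gaussian distribution with the following mean and variance:
\[ {\mathbb {E}}[{\textbf {x}} _{n+1}] = {\textbf {k}} _{n+1}^T (K + {{\varvec{\sigma }}}_n^2I)^{-1}{\textbf {X}} \tag{7} \]
\[ \text {var}[{\textbf {x}} _{n+1}] = k({\textbf {y}} _{n+1},{\textbf {y}} _{n+1}) - {\textbf {k}} _{n+1}^T (K + {{\varvec{\sigma }}}_n^2I)^{-1}{\textbf {k}} _{n+1} + {{\varvec{\sigma }}}_n^2 \tag{8} \]
where \({\textbf {k}} _{n+1}\) is an \(n \times 1\) vector whose \(i^{th}\) value is \(k({\textbf {y}} _{n+1},{\textbf {y}} _i)\).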
The kernel function, \(k(\varvec{\cdot })\), specifies the covariance between function values, \(f({\textbf {y}} _i)\) and \(f({\textbf {y}} _j)\), as a function of the corresponding inputs, \({\textbf {y}} _i\) and \({\textbf {y}} _j\). A popular choice is the squared exponential kernel, which has been used in this work:
\[ k({\textbf {y}} _i, {\textbf {y}} _j) = {{\varvec{\sigma }}}_s^2 \exp \Big ( -\frac{\Vert {\textbf {y}} _i - {\textbf {y}} _j \Vert ^2}{2{{\varvec{\ell }}}^2} \Big ) \tag{9} \]
where \({{\varvec{\sigma }}}_s^2\) is the signal variance and \({{\varvec{\ell }}}\) is the length scale. The quantities \({{\varvec{\sigma }}}_s^2\), \({{\varvec{\ell }}}\), and \({{\varvec{\sigma }}}_n^2\) from (5) are the hyperparameters of the model and can be estimated by maximizing the marginal log-likelihood of the observed data (\({\textbf {Y}}\) and \({\textbf {X}}\)) under the GP prior assumption.
One can observe that the predictive mean, \({\mathbb {E}}[{\textbf {x}} _{n+1}]\) in (7), can be written as an inner product, i.e.
\[ {\mathbb {E}}[{\textbf {x}} _{n+1}] = {{\varvec{\beta }}}^T {\textbf {k}} _{n+1} \tag{10} \]
where \({{\varvec{\beta }}} = (K + {{\varvec{\sigma }}}_n^2I)^{-1}{\textbf {X}}\). We will utilize this form in subsequent proofs.
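The GPR predictive mean (in its inner-product form) and predictive variance can be sketched as below for the \(d = 1\) case. This is an illustrative sketch only: the function names `se_kernel` and `gpr_predict` are hypothetical, the hyperparameters are fixed rather than learnt, and the plain squared exponential kernel on the inputs is used.

```python
import numpy as np

def se_kernel(A, B, sigma_s2=1.0, ell=1.0):
    """Squared exponential kernel between the rows of A (m x D) and B (p x D)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_s2 * np.exp(-sq / (2.0 * ell ** 2))

def gpr_predict(Y, X, y_new, sigma_n2=1e-2, sigma_s2=1.0, ell=1.0):
    """Predictive mean and variance at y_new for training inputs Y (n x D), outputs X (n,)."""
    K = se_kernel(Y, Y, sigma_s2, ell)
    k_new = se_kernel(Y, y_new[None, :], sigma_s2, ell)[:, 0]   # vector k_{n+1}
    A = K + sigma_n2 * np.eye(len(Y))
    beta = np.linalg.solve(A, X)                                # (K + sigma_n^2 I)^{-1} X
    mean = beta @ k_new                                         # inner-product form of the mean
    var = sigma_s2 - k_new @ np.linalg.solve(A, k_new) + sigma_n2
    return mean, var
```

Note that `beta` depends only on the training data, so in a streaming setting it could be computed once and reused for every new point, leaving only the kernel vector to be evaluated per point. Far from the training inputs the kernel vector vanishes and the variance approaches the prior level, which is the behaviour a change detector can exploit.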
Convergence proofs for S-Isomap and S-Isomap++
In this section, we demonstrate the convergence of the S-Isomap algorithm in the single-manifold setting, and subsequently extend the result to the multi-manifold setting, i.e., to the S-Isomap++ algorithm described above.
Theorem 1
Given a uniformly sampled, unimodal distribution from which the random batch data set \({\mathcal {B}} = \{ {\textbf {y}} _i \in {\mathbb {R}}^{\textbf {D}} \}_{i = 1 \ldots n}\) of the S-Isomap algorithm is derived, there exists a threshold \({\textbf {n}} _{0}\) such that, when \({\textbf {n}} \ge {\textbf {n}} _{0}\), the Procrustes Error \({{\varvec{\epsilon }}}_{\text {Proc}}\big ({{\varvec{\tau }}}_{{\mathcal {B}}}, {{\varvec{\tau }}}_{\text {ISO}}\big )\) between \({{\varvec{\tau }}}_{{\mathcal {B}}} = {{\varvec{\phi }}}^{-1}\big ({\mathcal {B}}\big )\), the true underlying representation, and \({{\varvec{\tau }}}_{\text {ISO}}= \hat{{\varvec{\phi }}}^{-1}\big ({\mathcal {B}}\big )\), the embedding uncovered by Isomap, is small (\({{\varvec{\epsilon }}}_{\text {Proc}}\approx 0\)), i.e., the batch phase of the S-Isomap algorithm converges. Here \({{\varvec{\phi }}}(\varvec{\cdot })\) is the nonlinear function that maps data points from the underlying low-dimensional ground truth representation \({\textbf {U}}\) to \({\mathcal {B}}\in {\mathbb {R}}^{\textbf {D}}\), and the ground truth \({\textbf {U}}\) originally resides in a convex domain of the Euclidean space \({\mathbb {R}}^{\textbf {d}}\).
Proof Based on the setting described above, the S-Isomap algorithm acts like a generative model that is trying to learn the inverse mapping \({{\varvec{\phi }}}(\varvec{\cdot })^{-1}\), where the associated embedding error is the Procrustes Error \({{\varvec{\epsilon }}}_{\text {Proc}}\big ({{\varvec{\tau }}}_{\mathcal {B}}, {{\varvec{\tau }}}_{\text {ISO}}\big )\).
The proof follows from [43], which showed that, given \({{\varvec{\lambda }}}_1\), \({{\varvec{\lambda }}}_2\), \({{\varvec{\mu }}} > 0\), an appropriately chosen \({{\varvec{\epsilon }}} > 0\), and a data set \({\textbf {Y}} = \{ {\textbf {y}} _i \}_{i = 1 \ldots n}\) sampled from a Poisson distribution with density function \({{\varvec{\alpha }}}\) which satisfies the \({{\varvec{\delta }}}\)-sampling condition, i.e.
wherein the \({{\varvec{\epsilon }}}\)-rule is used to construct a graph \({{\textbf {G}} }\) on \({\textbf {Y}}\), the ratio between the graph-based distance \({\textbf {d}} _{G}({{\textbf {x}} , {\textbf {y}} })\) and the true Euclidean distance \({\textbf {d}} _{M}({{\textbf {x}} , {\textbf {y}} })\) is bounded for all \({\textbf {x}}, {\textbf {y}} \in {\textbf {Y}}\). More concretely, the following holds with probability at least \((1 - {{\varvec{\mu }}})\) for all \({\textbf {x}}, {\textbf {y}} \in {\textbf {Y}}\):
where \({\textbf {V}}\) is the volume of the manifold \({\mathcal {M}}\) and
is the volume of the smallest metric ball in \({\mathcal {M}}\) of radius \({\textbf {r}}\) and \({{\varvec{\delta }}} >0\) is such that
A similar result can be derived in the scenario where \({\textbf {n}}\) points are sampled independently from the fixed probability distribution \(p({\textbf {y}}; {\varvec{\theta }})\), in which case we have:
where \(\widetilde{{\varvec{\alpha }}}\) is the probability of selecting a sample from \(p({\textbf {y}}; {\varvec{\theta }})\).
Using (13), (14) and (15) in (11), we have:
where \({\textbf {n}} _{0} = (1/\widetilde{{\varvec{\alpha }}})\big [ \log ({\textbf {V}} /{{\varvec{\mu }}}{{\varvec{\eta }}}_{d}{({{\varvec{\lambda }}}_{2}{{\varvec{\epsilon }}}/16)}^{\textbf {d}} ) \big ]/{{\varvec{\eta }}}_{d}{({{\varvec{\lambda }}}_{2}{{\varvec{\epsilon }}}/8)}^{\textbf {d}}\), is the condition which ensures that (12) is satisfied.
Thus we have an adequate threshold for the size of the batch data set \({\mathcal {B}}\) which ensures (17) is satisfied for the \({{\varvec{\epsilon }}}\)-rule. We can derive a similar threshold for the \({{\textbf {K}} }\)-rule, observing that there is a direct one-to-one mapping between \({{\textbf {K}} }\) and \({{\varvec{\epsilon }}}\) (see Sect. “Nonlinear dimensionality reduction” for more details).
To complete the proof, we observe that (12) implies that \({\textbf {d}} _{\text {G}}({{\textbf {x}} ,{\textbf {y}} })\), the graph-based distance between points \({\textbf {x}}, {\textbf {y}} \in {{\textbf {G}} }\), is a perturbed version of \({\textbf {d}} _{\text {M}}({{\textbf {x}} , {\textbf {y}} })\), the true Euclidean distance between points \({\textbf {x}}\) and \({\textbf {y}}\) in \({\mathbb {R}}^{\textbf {d}}\). Let \(\widetilde{{\textbf {D}} }_{\text {M}}\) and \(\widetilde{{\textbf {D}} }_{\text {G}}\) represent the squared distance matrices corresponding to \({\textbf {d}} _{\text {M}}({{\textbf {x}} ,{\textbf {y}} })\) and \({\textbf {d}} _{\text {G}}({{\textbf {x}} ,{\textbf {y}} })\) respectively. Thus we have \(\widetilde{{\textbf {D}} }_{\text {G}} = \widetilde{{\textbf {D}} }_{\text {M}} + \Delta \widetilde{{\textbf {D}} }_{\text {M}}\), where \(\Delta \widetilde{{\textbf {D}} }_{\text {M}} = \{ \Delta \widetilde{{\textbf {d}} }_{\text {M}}({\textbf {i}} , {\textbf {j}} )\}_{1 \le i, j \le n}\) and the entries \(\Delta \widetilde{{\textbf {d}} }_{\text {M}}({\textbf {i}} , {\textbf {j}} )\) are bounded due to (12).
The robustness of MDS to small perturbations was demonstrated in the past [44] as follows. Let \({\textbf {F}}\) represent the zero-diagonal symmetric matrix which perturbs the true squared distance matrix \({{\textbf {B}} }\) to \({{\textbf {B}} } + {\varvec{\Delta }}{{\textbf {B}} } = {{\textbf {B}} } + {{\varvec{\epsilon }}}{\textbf {F}}\). Then the Procrustes Error between the embeddings uncovered by MDS for \({{\textbf {B}} }\) and for \({{\textbf {B}} } + {\varvec{\Delta }}{{\textbf {B}} }\) is given by \(\frac{{{\varvec{\epsilon }}}^2}{4}\sum\nolimits_{j,k} \frac{\left( {\textbf {e}} _{j}^{T} {\textbf {F}} {\textbf {e}} _{k} \right) ^{2}}{{{\varvec{\lambda }}}_{j} + {{\varvec{\lambda }}}_{k}}\), which is very small for small entries \(\{ {\textbf {f}} _{i,j} \}_{1 \le i, j \le n} \in {\textbf {F}}\). Here \(\{{\textbf {e}} _k ({{\varvec{\lambda }}}_k)\}_{k = 1 \ldots n}\) represent the eigenvectors (eigenvalues) of \({{\textbf {B}} }\), and the double summation is over pairs \(({\textbf {j}}, {\textbf {k}}) = 1,2,\ldots ({\textbf {n}}-1)\), excluding those pairs \(({\textbf {j}}, {\textbf {k}})\) in which both entries lie in the range \(({{\textbf {K}} }+1), ({{\textbf {K}} }+2), \ldots ({\textbf {n}}-1)\), where \({{\textbf {K}} } = \sum\nolimits_{k = 1}^{n} {\mathcal {I}} \left( {{\varvec{\lambda }}}_{k} > 0 \right)\) and \({\mathcal {I}}(\varvec{\cdot })\) is the indicator function. We substitute \({{\varvec{\epsilon }}} = 1\) and replace \({{\textbf {B}} }\) with \(\widetilde{{\textbf {D}} }_{\text {M}}\) and \({\varvec{\Delta }}{{\textbf {B}} }\) with \({\varvec{\Delta }}\widetilde{{\textbf {D}} }_{\text {M}}\) above to complete the proof, since the entries of \({\varvec{\Delta }}\widetilde{{\textbf {D}} }_{\text {M}}\) are very small, i.e., \(\{ 0 \le {\varvec{\Delta }}{\textbf {d}} _{\text {M}}(i, j) \le {{\varvec{\lambda }}}^2 \}_{1 \le i, j \le n}\) where \({{\varvec{\lambda }}} = \max ({{\varvec{\lambda }}}_1, {{\varvec{\lambda }}}_2)\) for small \({{\varvec{\lambda }}}_1\), \({{\varvec{\lambda }}}_2\), given that the condition \({\textbf {n}} > {\textbf {n}} _{0}\) is satisfied for (12). Thus the embedding uncovered by S-Isomap for a batch data set \({\mathcal {B}}\), where \(\left| {\mathcal {B}}\right| = {\textbf {n}} > {\textbf {n}} _{0}\), converges asymptotically to the true embedding up to translation, rotation and scaling factors. \(\square\)
Extension to the multi-manifold setting
The above proof can be extended to show the convergence of the S-Isomap++ [6] algorithm, described in Sect. “Handling multiple manifolds”, as follows.
Corollary 1
The batch phase of the S-Isomap++ algorithm converges under appropriate conditions.
Proof Similar to the proof for the S-Isomap algorithm, we consider a corresponding setting for the multi-manifold scenario, wherein we attempt to learn the inverse mappings \({{\varvec{\phi }}}(\varvec{\cdot })_{i = 1, \ldots , {\textbf {p}}}^{-1}\) for each of the \({\textbf {p}}\) manifolds. The initial clustering step of the S-Isomap++ algorithm separates the samples from the batch data set \({\mathcal {B}}\) into individual clusters \({\mathcal {B}}_{i}\), such that the clusters are mutually exclusive and each corresponds to one of the multiple manifolds present in the data, i.e., \(\bigcup \limits _{i=1}^{\textbf {p}} {\mathcal {B}}_{i} = {\mathcal {B}}\) and \({\mathcal {B}}_{i} \cap {\mathcal {B}}_{j} = \emptyset\) for all \(i \ne j\).
The intuition for clustering and subsequently processing each of the clusters separately is based on the setting described above: the observed data was generated by first sampling points from multiple \({\textbf {U}} _{i = 1, \ldots , {\textbf {p}}}\), i.e., convex domains in \({\mathbb {R}}^{\textbf {d}}\),^{Footnote 8} and subsequently mapping those points in a nonlinear fashion, using possibly different \({{\varvec{\phi }}}(\varvec{\cdot })_{i = 1, \ldots , {\textbf {p}}}\), to \({\mathcal {B}}\in {\mathbb {R}}^{\textbf {D}}\). Thus, to learn the different inverse mappings effectively, the data needs to be clustered appropriately.
After the initial clustering step, an analysis similar to that of Theorem 1 provides thresholds \({\textbf {n}}_i, \forall i \in \{1, \ldots , {\textbf {p}}\}\) for each of the \({\textbf {p}}\) clusters, such that when \(\left| {\mathcal {B}}_{i}\right| = {\textbf {n}} \ge {\textbf {n}} _{i}\), the Procrustes Error \({{\varvec{\epsilon }}}_{\text {Proc}}\big ({{\varvec{\tau }}}_{{\mathcal {B}}_{i}}, {{\varvec{\tau }}}_{\text {ISO}_{i}}\big )\) between \({{\varvec{\tau }}}_{{\mathcal {B}}_{i}} = {{\varvec{\phi }}}_{i}^{-1}\big ({\mathcal {B}}_{i}\big )\), the true underlying representation, and \({{\varvec{\tau }}}_{{\text {ISO}}_{i}}= \hat{{\varvec{\phi }}}_{i}^{-1}\big ({\mathcal {B}}_{i}\big )\), the embedding uncovered by Isomap, is small (\({{\varvec{\epsilon }}}_{\text {Proc}}\approx 0\)), i.e., the batch phase of the S-Isomap++ algorithm converges provided each of the \({\textbf {p}}\) clusters \({\mathcal {B}}_{i = 1, \ldots , {\textbf {p}}}\) exceeds the appropriate threshold \({\textbf {n}}_i\) (similar to (17) above). \(\square\)
The S-Isomap++ algorithm does not assume that the number of manifolds (\({\textbf {p}}\)) is specified. Refer to Sect. “Handling multiple manifolds” for more details.
Methodology
The proposed GP-Isomap algorithm follows a two-phase strategy (similar to S-Isomap and S-Isomap++), where exact manifolds are learnt from an initial batch \({\mathcal {B}}\), and subsequently a computationally inexpensive mapping procedure processes the remainder of the stream. To handle multiple manifolds, the batch data \({\mathcal {B}}\) is first clustered via manifold tangent clustering or other standard techniques. Exact Isomap is applied to each cluster. The resulting low-dimensional data for the clusters is then “stitched” together to obtain the low-dimensional representation of the input data. The difference from past methods is the mapping procedure, which uses GPR to obtain the predictions for the low-dimensional mapping (see (7)). At the same time, the associated predictive variance (see (8)) is used to detect changes in the underlying distribution.
The overall GP-Isomap algorithm is outlined in Algorithm 1 and takes a batch data set, \({\mathcal {B}}\), and the streaming data, \({\mathcal {S}}\), as inputs, along with other parameters. The processing is split into two phases: a batch learning phase (Lines 1–15) and a streaming phase (Lines 16–32), which are described later in this section.
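One simple way to turn the predictive variance into a change detector is sketched below. This is an assumption-laden illustration rather than the procedure of the paper: the function name `detect_shift`, the fixed threshold `var_threshold`, and the rule of triggering a re-learn once `buffer_size` high-variance points have accumulated are all hypothetical choices.

```python
def detect_shift(variances, var_threshold, buffer_size):
    """Return the stream index at which re-learning would be triggered, or None.

    variances     : iterable of GPR predictive variances, one per streaming point
    var_threshold : variance above which a point is treated as poorly explained
    buffer_size   : number of such points to accumulate before triggering
    """
    flagged = []
    for idx, var in enumerate(variances):
        if var > var_threshold:           # point lies far from the learnt manifold
            flagged.append(idx)
            if len(flagged) >= buffer_size:
                return idx                # enough evidence of a drift or shift
    return None
```

Buffering several high-variance points before reacting is one way to avoid re-learning the manifold on isolated outliers; the appropriate threshold and buffer size would depend on the data.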
Kernel function
The key innovation here is to use a manifold-specific kernel matrix in the GPR method. The matrix \({{\textbf {B}} }\), which is the inner product matrix between the points in the low-dimensional space (see Sect. “Nonlinear dimensionality reduction”), could be a reasonable starting point. However, as past researchers have shown [16], typical kernels, such as the squared exponential kernel, can only be generalized to a positive definite kernel on a geodesic metric space if the space is flat. Thus \({{\textbf {B}} }\) will not necessarily yield a valid positive semidefinite (PSD) kernel matrix. However, a result by [32] shows that a small positive constant, \({{\varvec{\lambda }}}_{\text {max}}\), can be added to \({{\textbf {B}} }\) to guarantee that it will be PSD. This constant can be calculated as the largest eigenvalue of the matrix:
where \({\textbf {P}} = -{{\textbf {H}} }{{\textbf {G}} }{{\textbf {H}} }/2\). Here, \({{\textbf {G}} }\) is the geodesic distance matrix and \({{\textbf {H}} } = \{ {{\textbf {h}} }_{i,j} \}_{1 \le i, j \le n}\), \({{\textbf {h}} }_{i,j} = {{\varvec{\delta }}}_{i,j} - 1/{n}\), where \({{\varvec{\delta }}}_{i,j}\) is the Kronecker delta. \(\widetilde{{{\textbf {B}} }}\) can be derived from \({{\textbf {B}} }\) as [32]:
where \({{\varvec{\lambda }}}_{\text {max}}\) is the largest eigenvalue of \({\textbf {M}}\).
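The correction can be checked numerically. The sketch below assumes the standard additive-constant construction of [32], namely \({\textbf {M}} = \begin{bmatrix} {\textbf {0}} & 2{\textbf {B}} \\ -{\textbf {I}} & -4{\textbf {P}} \end{bmatrix}\) and \(\widetilde{{\textbf {B}}} = {\textbf {B}} + 2\lambda _{\text {max}}{\textbf {P}} + \tfrac{1}{2}\lambda _{\text {max}}^2{\textbf {H}}\), with \({\textbf {B}}\) built from squared geodesic distances. These forms are taken from that reference as an assumption, and the function name and the toy ring of points are illustrative only.

```python
import numpy as np


def additive_constant_correction(G):
    """Return (B, B_tilde, lam_max) for a symmetric distance matrix G.

    Assumed construction (Cailliez-style additive constant):
      B = -H G^2 H / 2,  P = -H G H / 2,
      M = [[0, 2B], [-I, -4P]],  lam_max = largest eigenvalue of M,
      B_tilde = B + 2 lam_max P + lam_max^2 H / 2.
    """
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix, h_ij = delta_ij - 1/n
    B = -0.5 * H @ (G ** 2) @ H              # doubly centred squared distances
    P = -0.5 * H @ G @ H                     # doubly centred distances
    M = np.block([[np.zeros((n, n)), 2.0 * B],
                  [-np.eye(n), -4.0 * P]])
    # M is not symmetric, so eigenvalues may be complex; taking the largest
    # real part is a conservative choice for the shift.
    lam_max = float(np.max(np.linalg.eigvals(M).real))
    B_tilde = B + 2.0 * lam_max * P + 0.5 * lam_max ** 2 * H
    return B, B_tilde, lam_max


# Toy example: arc-length (geodesic) distances between 12 points on a circle.
angles = 2.0 * np.pi * np.arange(12) / 12
diff = np.abs(angles[:, None] - angles[None, :])
G = np.minimum(diff, 2.0 * np.pi - diff)

B, B_tilde, lam_max = additive_constant_correction(G)
print("min eigenvalue of B:      ", np.linalg.eigvalsh(B).min())
print("min eigenvalue of B_tilde:", np.linalg.eigvalsh(B_tilde).min())
```

On this toy ring the uncorrected matrix has a clearly negative eigenvalue, while the shifted matrix is PSD up to round-off.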
The proposed GP-Isomap algorithm uses a novel geodesic-distance-based kernel function defined as:
where \(\widetilde{{\textbf {b}} }_{i,j}\) is the \({ij}^{th}\) entry of the matrix \(\widetilde{{{\textbf {B}} }}\), \({{\varvec{\sigma }}}^2_s\) is the signal variance (whose value we fix as 1 in this work) and \({{\varvec{\ell }}}\) is the length scale hyperparameter. Thus the kernel matrix \({{\textbf {K}} }\) can be written as:
This kernel function plays a key role in using the GPR model for mapping streaming points on the learnt manifold, by measuring similarity along the low-dimensional manifold, instead of the original space (\({\mathbb {R}}^D\)), as is typically done in GPR-based solutions.
The matrix \(\widetilde{{{\textbf {B}} }}\) is positive semidefinite. Consequently, we note that the kernel matrix \({{\textbf {K}} }\) is positive definite (refer to (22) below).
Using Lemma 1, the novel kernel we propose can be written as
where \(\widetilde{\varvec{\Lambda }} = \begin{bmatrix} \big [ \exp {\left( \frac{{{\varvec{\lambda }}}_1}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ] &{} 0 &{} 0 \\ 0 &{} \ddots &{} 0 \\ 0 &{} 0 &{} \big [ \exp {\left( \frac{{{\varvec{\lambda }}}_d}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ] \end{bmatrix}\) and \(\{{{\varvec{\lambda }}}_i, {{\textbf {q}} }_i \}_{i = 1 \ldots d}\) are eigenvalue/eigenvector pairs of \(\widetilde{{{\textbf {B}} }}\) as discussed in Sect. “Nonlinear dimensionality reduction”.
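To illustrate how a precomputed kernel matrix feeds the GPR prediction and its predictive variance, the following is a minimal sketch of standard GPR with a zero prior mean. It is not the GP-Isomap kernel itself: as a stand-in it assumes a squared exponential kernel on squared distances along a flat one-dimensional manifold, where geodesic and Euclidean distances coincide. The function names and the noise level are hypothetical.

```python
import numpy as np


def sq_exp_kernel(sq_dists, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel evaluated on squared distances (stand-in kernel)."""
    return signal_var * np.exp(-sq_dists / (2.0 * length_scale ** 2))


def gpr_predict(K, K_star, k_star_star, y, noise_var=1e-2):
    """Standard GPR predictive mean and variance from precomputed kernels.

    K           : (n, n) kernel between training points
    K_star      : (n, m) kernel between training and query points
    k_star_star : (m,)   prior variance at each query point
    y           : (n, d) training targets (here: low-dimensional coordinates)
    """
    n = K.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s^2 I)^{-1} y
    v = np.linalg.solve(L, K_star)
    mean = K_star.T @ alpha
    var = k_star_star - np.sum(v * v, axis=0)
    return mean, var


# Training points on a 1-D manifold; the target is the manifold coordinate.
t_train = np.linspace(0.0, 1.0, 20)
y_train = t_train[:, None]
t_query = np.array([0.5, 5.0])        # one familiar point, one far from the batch

K = sq_exp_kernel((t_train[:, None] - t_train[None, :]) ** 2)
K_star = sq_exp_kernel((t_train[:, None] - t_query[None, :]) ** 2)
mean, var = gpr_predict(K, K_star, np.ones(2), y_train)
print("predictive variance:", var)
```

The query far from the batch keeps a variance close to the prior signal variance, which is the behaviour the change detector relies on.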
Batch learning
The batch learning phase consists of these tasks:
i). Clustering: The first step in the batch phase involves clustering of the batch data set \({\mathcal {B}}\) into \({\textbf {p}}\) individual clusters which represent the manifolds (Line 1). In case \({\mathcal {B}}\) contains a single cluster, the algorithm can correctly detect it. Refer to Sect. “Handling multiple manifolds” for more details.
ii). Dimensionality reduction: Subsequently, full Isomap is executed on each of the \({\textbf {p}}\) individual clusters to get low-dimensional representations \({\mathcal {LDE}}_{i=1,2 \ldots {\textbf {p}}}\) of the data points belonging to each individual cluster (Lines 3–5).
iii). Hyperparameter estimation: The geodesic distance matrix \({{{\mathcal {G}}}_{i}}\) for the points in the \({{\varvec{i}}}^{\text {th}}\) manifold and the corresponding low-dimensional representation \({{\mathcal {LDE}}_{i}}\) are fed to the GP model for each of the \({\textbf {p}}\) manifolds, to perform hyperparameter estimation, which outputs \(\{{{\varvec{\phi }}}_{i}^{GP} \}_{i = 1,2 \ldots {\textbf {p}}}\) (Lines 6–8).
iv). Learning mapping to global space: The low-dimensional embeddings uncovered for the individual manifolds can be of different dimensionalities. Consequently, a mapping to a unified global space is needed. To learn this mapping, a support set \({\varvec{\xi }}_{s}\) is formulated, which contains the \({\textbf {k}}\) pairs of nearest points and \({\textbf {l}}\) pairs of farthest points between each pair of manifolds. Subsequently, MDS is executed on this support set \({\varvec{\xi }}_{s}\) to uncover its low-dimensional representation \({\mathcal{G}\mathcal{E}}_{s}\). Individual scaling and translation factors \(\{ {{\mathcal {R}}}_{i}, {t}_{i} \}_{i = 1,2 \ldots {\textbf {p}}}\), which map points from each of the individual manifolds to the global space, are learnt by solving a least squares problem involving \({\varvec{\xi }}_{s}\) (Lines 9–15).
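The least squares step in task iv) can be sketched as follows. This is a minimal illustration under the assumption that the scaling factor \({{\mathcal {R}}}_{i}\) acts as a linear map and \({t}_{i}\) as a translation vector; the function name and the synthetic support points are hypothetical.

```python
import numpy as np


def fit_affine_map(local_pts, global_pts):
    """Least squares fit of global_pts ~= local_pts @ R + t.

    local_pts  : (m, d_i) support points in one manifold's own embedding
    global_pts : (m, d_g) the same support points in the global (MDS) space
    """
    m = local_pts.shape[0]
    A = np.hstack([local_pts, np.ones((m, 1))])   # extra column of ones for t
    sol, *_ = np.linalg.lstsq(A, global_pts, rcond=None)
    R, t = sol[:-1], sol[-1]
    return R, t


# Synthetic check: recover a known map from 10 support points.
rng = np.random.default_rng(0)
local_pts = rng.normal(size=(10, 2))
R_true = rng.normal(size=(2, 3))
t_true = np.array([1.0, -2.0, 0.5])
global_pts = local_pts @ R_true + t_true

R, t = fit_affine_map(local_pts, global_pts)
print(np.allclose(R, R_true), np.allclose(t, t_true))
```

Because \({{\mathcal {R}}}_{i}\) may map between spaces of different dimension, the same fit covers manifolds whose embeddings have different dimensionalities.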
Stream processing
In the streaming phase, each sample \({\textbf {s}}\) in the stream set \({\mathcal {S}}\) is embedded using each of the \({\textbf {p}}\) GP models to evaluate the prediction \({{\varvec{\mu }}}_{i}\), along with the variance \({{\varvec{\sigma }}}_{i}\) (Lines 22–24). The manifold with the smallest variance is chosen to embed the sample \({\textbf {s}}\) into, using the corresponding scaling \({{\mathcal {R}}}_{j}\) and translation factor \({t}_{j}\), provided \(\min _i \left| {{\varvec{\sigma }}}_{i} \right|\) is within the allowed threshold \({{\varvec{\sigma }}}_{t}\) (Lines 25–28); otherwise, sample \({\textbf {s}}\) is added to the unassigned set \({{\mathcal {S}}}_{u}\) (Lines 29–31). When the size of the unassigned set \({{\mathcal {S}}}_{u}\) exceeds a certain threshold \({\textbf {n}}_{s}\), we add its points to the batch data set and relearn the base manifold (Lines 18–20). The assimilation of the new points into the batch may be done more efficiently in an incremental manner.
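The control flow of the streaming phase can be sketched as below. This is a simplified illustration, not the paper's Algorithm 1: each GP model is represented by a hypothetical callable returning a prediction and a variance, the scaling and translation are reduced to scalars, and re-learning is replaced by a counter.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: each per-manifold model returns (prediction, variance).
Predictor = Callable[[Sequence[float]], Tuple[List[float], float]]


def process_stream(stream, predictors, transforms, var_threshold, max_unassigned):
    """Assign each sample to the manifold whose model is most confident.

    transforms[j] is a (scale, shift) pair standing in for (R_j, t_j).
    Returns the embedded points, the leftover unassigned samples, and the
    number of times re-learning of the batch would have been triggered.
    """
    embedded, unassigned, relearn_requests = [], [], 0
    for s in stream:
        results = [predict(s) for predict in predictors]
        j = min(range(len(results)), key=lambda i: abs(results[i][1]))
        mu, var = results[j]
        if abs(var) <= var_threshold:
            scale, shift = transforms[j]
            embedded.append([scale * m + shift for m in mu])
        else:
            unassigned.append(s)
        if len(unassigned) > max_unassigned:
            # Placeholder: here the unassigned points would be merged into
            # the batch and the base manifolds re-learnt.
            relearn_requests += 1
            unassigned.clear()
    return embedded, unassigned, relearn_requests


# Toy models: the variance grows with distance from each model's "home" region.
predictors = [
    lambda s: ([s[0]], (s[0] - 0.0) ** 2),
    lambda s: ([s[0]], (s[0] - 10.0) ** 2),
]
transforms = [(1.0, 0.0), (2.0, 1.0)]
embedded, unassigned, relearns = process_stream(
    [[0.0], [10.0], [5.0], [5.0]], predictors, transforms,
    var_threshold=0.5, max_unassigned=1)
print(embedded, unassigned, relearns)
```

In this toy run the two samples near a known region are embedded, and the two samples that neither model explains trigger one re-learning request.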
Complexity
The runtime complexity of our proposed algorithm is dominated by the GP regression step as well as the Isomap execution step, both of which have \({\mathcal {O}}(n^3)\) complexity, where n is the size of the batch data set \({\mathcal {B}}\). This is similar to the S-Isomap and S-Isomap++ algorithms, which also have a runtime complexity of \({\mathcal {O}}(n^3)\). The stream processing step is \({\mathcal {O}}(n)\) for each incoming streaming point. The space complexity of GP-Isomap is \({\mathcal {O}}(n^2)\), because each sample of the stream set \({\mathcal {S}}\) is processed separately. Thus, the space requirement and the per-sample runtime do not grow with the size of the stream, which makes the algorithm appealing for handling high-volume streams.
Theoretical analysis
We first state the main result for the single manifold case, and prove it using results in the Appendix section, and then present proofs for the multi-manifold case.
Theorem 2
For a single manifold setting, the prediction \({{\varvec{\tau }}}_{\text {GP}}\) of GP-Isomap is equivalent to the prediction \({{\varvec{\tau }}}_{\text {ISO}}\) of S-Isomap, i.e., the Procrustes Error \({{\varvec{\epsilon }}} _{\text {Proc}}\big ({{\varvec{\tau }}}_{\text {GP}}, {{\varvec{\tau }}}_{\text {ISO}}\big )\) between \({{\varvec{\tau }}}_{\text {GP}}\) and \({{\varvec{\tau }}}_{\text {ISO}}\) is 0.
Proof The prediction of GP-Isomap is given by (10). Using Lemma 5, we note that
The term \({{\textbf {K}} _{*}}\) for GP-Isomap, using the novel kernel function, evaluates to:
where \({{\textbf {G}} }_{*}^2\) represents the vector containing the squared geodesic distances of \({\textbf {x}}_{n+1}\) to \({\textbf {X}}\) containing \(\{ {\textbf {x}} _i \}_{i = 1,2 \ldots n}\).
Considering the above equation elementwise, we note that the \({\textbf {i}} ^{\text {th}}\) term of \({{\textbf {K}} _{*}}\) equates to \(\exp {\left[ -\frac{{{\textbf {g}} }_{i,n+1}^2}{2{{\varvec{\ell }}}^2}\right] }\). Using Taylor’s series expansion we have,
The prediction by S-Isomap is given by (4) as follows:
where \({\textbf {f}} = \{ {\textbf {f}} _i\}\) is as defined by (4).
Rewriting (4) we have:
where \({\varvec{\gamma }} = \big (\frac{1}{n}\sum \limits _{j}{{{\textbf {g}} }_{i,j}^2} \big )\) is a constant with respect to \({\textbf {x}} _{n+1}\), since it depends only on squared geodesic distance values associated within the batch data set \({\mathcal {B}}\) and \({\textbf {x}} _{n+1}\) is part of the stream data set \({\mathcal {S}}\).
We now consider the \({1}^{\text {st}}\) dimension of the predictions for GP-Isomap and S-Isomap only and demonstrate their equivalence via Procrustes Error. The analysis for the remaining dimensions follows a similar line of reasoning.
Thus for the \({1}^{\text {st}}\) dimension, using (27) the S-Isomap prediction is:
Similarly using Lemma 5, (24) and (25), we have that the \({1}^{\text {st}}\) dimension for the GP-Isomap prediction is given by,
We can observe that \({{\varvec{\tau }}}_{\text {GP}_{1}}\) is a scaled and translated version of \({{\varvec{\tau }}}_{\text {ISO}_{1}}\). Similarly for each of the dimensions (\({1} \le {i} \le {d}\)), the prediction for GP-Isomap \({{\varvec{\tau }}}_{\text {GP}_{i}}\) can be shown to be a scaled and translated version of the prediction for S-Isomap \({{\varvec{\tau }}}_{\text {ISO}_{i}}\). These individual scaling \({\textbf {s}} _i\) and translation \({{\textbf {t}} }_i\) factors can be represented together by single collective scaling \({\textbf {s}}\) and translation \({{\textbf {t}} }\) factors. Consequently, the Procrustes Error \({{\varvec{\epsilon }}}_{\text {Proc}} \big ({{\varvec{\tau }}}_{\text {GP}}, {{\varvec{\tau }}}_{\text {ISO}}\big )\) is 0 (refer to Sect. “Nonlinear dimensionality reduction”). \(\square\)
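The role of the Procrustes Error in this argument can be illustrated numerically. The sketch below assumes the usual definition (centre both configurations, normalise to unit Frobenius norm, then remove the optimal rotation and scale); the function name is hypothetical and the uniform scaling used in the example is an assumption made for illustration.

```python
import numpy as np


def procrustes_error(X, Y):
    """Residual after optimally translating, scaling and rotating Y onto X."""
    X0 = X - X.mean(axis=0)
    Y0 = Y - Y.mean(axis=0)
    X0 = X0 / np.linalg.norm(X0)
    Y0 = Y0 / np.linalg.norm(Y0)
    U, s, Vt = np.linalg.svd(Y0.T @ X0)
    R = U @ Vt                      # optimal orthogonal map taking Y0 to X0
    scale = s.sum()                 # optimal isotropic scale for unit-norm inputs
    return float(np.sum((X0 - scale * Y0 @ R) ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# A scaled, rotated and translated copy of X has (numerically) zero error.
theta = 0.3
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
Y = 2.5 * X @ rot + np.array([1.0, -3.0])
err_same = procrustes_error(X, Y)

# An unrelated configuration does not.
Z = rng.normal(size=(20, 2))
err_other = procrustes_error(X, Z)
print(err_same, err_other)
```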
Results and analysis
In this section, we demonstrate the performance of the proposed algorithm on both synthetic and real-world data sets. In Sect. “Results on synthetic data sets”, we present results for synthetic data sets, whereas Sect. “Results on sensor data set” contains results on benchmark sensor data sets. All experiments were done using Python 3.0 implementations of the proposed and related methods, and were run on a MacBook Pro (2.8 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3). Our results demonstrate that: (i) GP-Isomap is able to perform good quality dimension reduction on a manifold, (ii) the reduction produced by GP-Isomap is equivalent to the corresponding output of S-Isomap (or S-Isomap++), and (iii) the predictive variance within GP-Isomap is able to identify changes in the underlying distribution in the data stream on all data sets considered in this paper.
In the interest of space, we avoid comparing the quality of the dimensionality reduction with other methods, and refer readers to S-Isomap and S-Isomap++, where these equivalent methods were shown to be better than existing approaches for dimensionality reduction.
GP-Isomap has the following hyperparameters: \(\epsilon\), k, l, \(\lambda\), \(\sigma _t\), \(n_s\). We set k, l, \(\lambda\) to have values of 16, 1 and 0.005, respectively. We study the effect of \(\sigma _t\) and \(n_s\) using the different data sets listed in Sects. “Results on synthetic data sets” and “Results on sensor data set” respectively.
Results on synthetic data sets
Swiss roll data sets are typically used for evaluating manifold learning algorithms. To evaluate our method on concept drift, we use the Euler Isometric Swiss Roll data set [5] consisting of four \({\mathbb {R}}^{2}\) Gaussian patches having \(n=2000\) points each, chosen at random, which are embedded into \({\mathbb {R}}^{3}\) using a nonlinear function \({\varvec{\psi }}(\cdot )\). The points for each of the Gaussian modes were divided equally into training and test sets randomly. To test incremental concept drift, we use one of the training data sets from the above data set, along with a uniform distribution of points for testing (refer to Fig. 1 for details). Figures 2a and 3a demonstrate our results on this data set.
To evaluate our method on sudden concept drift, we trained our GP-Isomap model using the first three out of four training sets of the Euler Isometric Swiss Roll data set. Subsequently we stream points randomly from the test sets from only the first three classes initially and later stream points from the test set of the fourth class, keeping track of the predictive variance all the while. Figure 2a demonstrates the sudden increase (see red line) in the variance of the stream when streaming points are from the fourth class, i.e. the unknown mode. Thus GP-Isomap is able to detect concept drift correctly, and is able to map all of the data points correctly on the lower dimensional manifold, as shown in Fig. 3a. The bottom panel of Fig. 1 demonstrates the performance of S-Isomap++ on this data set. It fails to map the streaming points of the unknown mode correctly, given it had not encountered the unknown mode during the batch training phase.
In Sect. “Theoretical analysis”, we proved the equivalence between the prediction of S-Isomap and that of GP-Isomap, using our novel kernel. In Fig. 4, we show empirically via Procrustes Error (PE) that the prediction of S-Isomap matches that of GP-Isomap, irrespective of the size of the batch used. PE for GP-Isomap with the Euclidean distance based kernel remains high irrespective of the size of the batch, which clearly demonstrates the unsuitability of this kernel to adequately learn mappings in the low-dimensional space.
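Tracking the running average of the predictive variance, as described above, can be sketched in a few lines. The window length, the threshold and the synthetic variance trace below are assumptions chosen for illustration, not values used in the experiments.

```python
from collections import deque


def detect_drift(variances, window, threshold):
    """Return the first index at which the running mean of the last
    `window` predictive variances exceeds `threshold`, or None."""
    recent = deque(maxlen=window)
    for idx, v in enumerate(variances):
        recent.append(v)
        if len(recent) == window and sum(recent) / window > threshold:
            return idx
    return None


# Synthetic trace: low variance for known modes, then a jump for an unknown mode.
trace = [0.01] * 50 + [0.9] * 50
drift_at = detect_drift(trace, window=10, threshold=0.5)
no_drift = detect_drift([0.01] * 100, window=10, threshold=0.5)
print(drift_at, no_drift)
```

With this trace the change at sample 50 is flagged a few samples later, once enough high-variance points have entered the window. A shorter window reacts faster but is more sensitive to isolated outliers.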
Results on sensor data set
In this section, we present results from different benchmark sensor data sets to demonstrate the efficacy of our algorithm.
Results on gas sensor array drift data set
The Gas Sensor Array Drift [45] data set is a benchmark data set (\(n = 13910\)) available to research communities to develop strategies for dealing with concept drift. It uses measurements from 16 chemical sensors used to discriminate between 6 gases (class labels) at various concentrations. We demonstrate the performance of our proposed method on this data set.
We first remove instances which had invalid/empty entries as feature values. Subsequently the data is mean normalized. Data points from the first five classes were divided into training and test sets. We train our model using the training data from four out of these five classes. While testing, we stream points randomly from the test sets of these four classes first and later stream points from the test set of the fifth class. Figures 2b and 3b demonstrate our results on this data set. From Fig. 2b, we observe that our model can clearly detect concept drift due to the unknown fifth class by tracking the variance of the stream, using the running average (red line). However, as shown in Fig. 3b, a two-dimensional manifold is not sufficient to capture the cluster structure in the data set.
Results on human activity recognition (HAR) data set
The Human Activity Recognition [46] data set consists of multiple data sets which are focused on discriminating between different activities, i.e. predicting which activity was performed at a specific point in time. In this work, we focused on the Weight Lifting Exercises (WLE) data set (\(n = 39242\)), which investigates how well an activity was performed by the wearer of different sensor devices. The WLE data set consists of six young healthy participants who performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.
The data set was cleaned i.e. instances with invalid/empty entries were removed. Subsequently the data points from the different classes were mean normalized and divided into training and test sets. Figures 2c and 3c demonstrate our results on this data set. Figure 2c demonstrates the concept drift phenomenon. Similar to the methodology we used earlier to detect concept drift, we initially trained our algorithm using instances from the latter four classes only, whereas during the streaming phase we randomly selected instances from the streaming set of these four classes first and later streamed points from the first class, keeping track of the predictive variance.
Conclusions
We have proposed a streaming Isomap algorithm (GP-Isomap) that can be used to learn a nonlinear low-dimensional representation of high-dimensional data arriving in a streaming fashion. This algorithm can have significant applications in areas involving analysis of high-dimensional data streams [47], especially in constrained environments, such as real-time situational monitoring [48, 49], healthcare monitoring [50], scientific data visualization [51], etc. We prove that using a GPR formulation to map incoming data instances onto an existing manifold is equivalent to using existing geometric strategies [5, 6]. Moreover, by utilizing a small batch for the initial learning of the manifold, as well as for training the GPR model, the method scales linearly with the size of the stream, thereby ensuring its applicability for practical problems. Using the Bayesian inference of the GPR model allows us to estimate the variance associated with the mapping of the streaming instances. The variance is shown to be a strong indicator of changes in the underlying stream properties on a variety of data sets. By utilizing the variance, one can devise retraining strategies that can include expanding the batch data set. While in the experiments we have demonstrated the ability of GP-Isomap to detect shifts in the underlying distributions, the algorithm can also be used to detect gradual shifts, as illustrated in Fig. 1. While we have focused on the Isomap algorithm in this paper, similar formulations can be applied for other NLDR methods such as LLE [2], Laplacian Eigenmaps [29], Hessian Eigenmaps [30], etc., and will be explored as future research.
Availability of data and materials
Not applicable. For any collaboration, please contact the authors.
Notes
See Sect. “Problem statement and preliminaries” for the definition of manifold quality.
An equivalence between GPLVM and Kernel Principal Component Analysis (KPCA) has been shown in the literature [19].
Actually, there are two variants of Isomap. The first employs a \({{\textbf {K}} }\)-rule to define the neighborhood \({\mathcal {N}}({\textbf {y}} )\) for each point \({\textbf {y}} \in {\textbf {Y}}\), i.e. it considers the k-nearest neighbors of each point \({\textbf {y}}\) to be its neighborhood \({\mathcal {N}}({\textbf {y}} )\). The second variant employs an \({{\varvec{\epsilon }}}\)-rule to define the neighborhood \({\mathcal {N}}({\textbf {y}} )\) of \({\textbf {y}}\), i.e. it considers all points which are within a radius of \({{\varvec{\epsilon }}}\) to be in its neighborhood \({\mathcal {N}}({\textbf {y}} )\). We observe that there is a direct one-to-one relationship between the two rules with regards to computing the neighborhood \({\mathcal {N}}({\textbf {y}} )\) for all \({\textbf {y}} \in {\textbf {Y}}\).
Note that the Incremental Isomap algorithm [27] has a slightly different formulation where
$${{\bf f}_i} \simeq \frac{1}{2} \big(\frac{1}{n}\sum\limits_{j}{{\bf g}_{i,j}^2} - \frac{1}{{n}^2}\sum\limits_{l,m}{{\bf g}_{l,m}^2} \big) + \frac{1}{2}\big(\frac{1}{n}\sum\limits_{j}{{\bf g}_{j,n+1}^2} - {{\bf g}_{i,n+1}^2}\big) \qquad (3)$$
where \({{\textbf {g}} }_{i,j}\) refers to the geodesic distance discussed in Sect. “Problem statement and preliminaries”. The S-Isomap algorithm assumes that the data stream draws from a uniformly sampled, unimodal distribution \(p({\textbf {x}} )\) and that the stream \({{\mathcal {S}}}\) and the batch \({{\mathcal {B}}}\) data sets are generated from \(p({\textbf {x}} )\). Additionally it assumes that the manifold has stabilized, i.e. \(\left| {{\mathcal {B}}}\right| = n\) is large enough. Using these assumptions in (3) above, we have that \(\big (\frac{1}{n}\sum \limits _{j}{{{\textbf {g}} }_{j,n+1}^2} - \frac{1}{{n}^2}\sum \limits _{l,m}{{{\textbf {g}} }_{l,m}^2}\big ) = {{\varvec{\epsilon }}} \simeq 0\), i.e. the expectation of squared geodesic distances for points in the batch data set \({{\mathcal {B}}}\) is close to that for points in the stream data set \({{\mathcal {S}}}\). The line of reasoning for this follows from [36]. Thus (3) simplifies to (4).
In cases of uneven/low density sampling, the clustering strategy discussed might possibly generate many small clusters. In such cases, one can try to merge clusters [41], based on their affinity/closeness to make the clusters’ size reasonable.
For vector-valued outputs, i.e., \({\textbf {x}} \in {\mathbb {R}}^d\), one can consider d independent models.
While the typical notation for GPR models uses \({\textbf {X}}\) as inputs and \({\textbf {Y}}\) as outputs [7], we have reversed the notation to maintain consistency with the rest of the paper.
It is possible that the low-dimensional Euclidean space specific to each manifold is different, i.e. \({\textbf {U}} _{i}\) is a convex domain in \({\mathbb {R}}^{{\textbf {d}} _i}\) space, where \({\textbf {d}} _i \ne {\textbf {d}} _j\). However we can imagine a scenario where we choose a \({\mathbb {R}}^{\textbf {d}}\) global space, where \({\textbf {d}} = \sum _{i}{{\textbf {d}} _i}\), from which the different convex \({\textbf {U}} _{i}\) were sampled. Additionally note that convexity is preserved by linear projections to higher dimensional spaces; thus the convex domains \({\textbf {U}} _{i = 1, \ldots , p}\) remain convex in this new space.
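The Incremental Isomap formulation (3) in the notes above can be checked on a toy case. The sketch assumes the standard out-of-sample rule in which the \(k^{\text {th}}\) coordinate of the new point is \({\textbf {q}}_k^\top {\textbf {f}} / \sqrt{\lambda _k}\), and it uses points in a plane so that geodesic and Euclidean distances coincide. The function names are hypothetical.

```python
import numpy as np


def classical_mds(G, d):
    """Classical MDS on a distance matrix G; returns embedding, eigenvalues, eigenvectors."""
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (G ** 2) @ H
    lam, Q = np.linalg.eigh(B)                     # ascending order
    lam, Q = lam[::-1][:d], Q[:, ::-1][:, :d]      # keep the d largest
    return Q * np.sqrt(lam), lam, Q


def map_new_point(G, g_new, lam, Q):
    """Embed a new point from its distances g_new to the batch, using (3)."""
    sq, sq_new = G ** 2, g_new ** 2
    f = 0.5 * (sq.mean(axis=1) - sq.mean()) + 0.5 * (sq_new.mean() - sq_new)
    return (Q.T @ f) / np.sqrt(lam)


rng = np.random.default_rng(0)
batch = rng.normal(size=(30, 2))
new_pt = np.array([0.3, -1.2])

G = np.linalg.norm(batch[:, None, :] - batch[None, :, :], axis=-1)
g_new = np.linalg.norm(batch - new_pt, axis=1)

X, lam, Q = classical_mds(G, d=2)
x_new = map_new_point(G, g_new, lam, Q)

# Distances from the mapped point to the embedded batch match the input distances.
print(np.allclose(np.linalg.norm(X - x_new, axis=1), g_new))
```

In this flat case the mapping is exact. Dropping the term that (4) treats as approximately zero changes \({\textbf {f}}\) by a constant vector and is only justified when the stream and batch follow the same distribution.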
References
Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–23.
Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290(5500):2323–6.
Silva VD, Tenenbaum JB. Global versus local methods in nonlinear dimensionality reduction. NeurIPS. 2003;721–728.
Wu Y, Chan KL. An extended isomap algorithm for learning multiclass manifold. In: Proceedings of 2004 International Conference on Machine Learning and Cybernetics. 2004:6;3429–3433.
Schoeneman F, Mahapatra S, Chandola V, Napp N, Zola J. Error metrics for learning reliable manifolds from streaming data. In: SDM. 2017:750–758. SIAM
Mahapatra S, Chandola V. S-Isomap++: Multi manifold learning from streaming data. In: 2017 IEEE International Conference on Big Data (Big Data). 2017:716–725.
Williams CK, Seeger M. Using the nyström method to speed up kernel machines. In: NeurIPS. 2001:682–688.
Vitter JS. Random sampling with a reservoir. ACM Trans Math Softw (TOMS). 1985;11(1):37–57.
Chaudhuri S, Motwani R, Narasayya V. On random sampling over joins. ACM SIGMOD Record. 1999;28:263–74.
Domingos P, Hulten G. Mining high-speed data streams. KDD. 2000;2:4.
Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. J Comput Syst Sci. 1999;58(1):137–47.
Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T. Optimal histograms with quality guarantees. VLDB. 1998;98:24–7.
Datar M, Gionis A, Indyk P, Motwani R. Maintaining stream statistics over sliding windows. SIAM J Comput. 2002;31(6):1794–813.
Choi H, Choi S. Kernel isomap. Electron Lett. 2004;40(25):1612–3.
Xing W, Shah AA, Nair PB. Reduced dimensional gaussian process emulators of parametrized partial differential equations based on isomap. Proc Royal Soc A Math Phys Eng Sci. 2015;471(2174):20140697.
Feragen A, Lauze F, Hauberg S. Geodesic exponential kernels: when curvature and linearity conflict. New Orleans: IEEE CVPR; 2015. p. 3032–42.
Chapelle O, Haffner P, Vapnik VN. Support vector machines for histogram-based image classification. IEEE Trans Neural Netw. 1999;10(5):1055–64.
Barkan O, Weill J, Averbuch A. Gaussian process regression for out-of-sample extension. 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing. 2016.
Li P, Chen S. A review on gaussian process latent variable models. CAAI Trans Intell Technol. 2016;1(4):366–76.
Lawrence ND. Gaussian process latent variable models for Visualisation of high dimensional data. NeurIPS. 2003;16:329–36.
Titsias M, Lawrence ND. Bayesian gaussian process latent variable model. AISTATS. 2010;9:844–51.
Henriksen A, Ward R. Adaoja: adaptive learning rates for streaming PCA. CoRR arXiv.1905.12115. https://doi.org/10.48550/arXiv.1905.12115.
Rani R, Khurana M, Kumar A, Kumar N. Big data dimensionality reduction techniques in iot: review, applications and open research challenges. Cluster Computing 2022.
Kiarashinejad Y, Abdollahramezani S, Adibi A. Deep learning approach based on dimensionality reduction for designing electromagnetic nanostructures. npj Comput Mater. 2020;6(1):12.
Balzano L, Chi Y, Lu YM. Streaming pca and subspace tracking: the missing data case. Proc IEEE. 2018;106(8):1293–310. https://doi.org/10.1109/JPROC.2018.2847041.
Bengio Y, Paiement Jf, Vincent P, Delalleau O, Roux NL, Ouimet M. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. NeurIPS. 2004:177–184.
Law MH, Jain AK. Incremental nonlinear dimensionality reduction by manifold learning. IEEE Trans Pattern Anal Mach Intell. 2006;28(3):377–91.
Weinberger KQ, Packer B, Saul LK. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. AISTATS. 2005;2:6.
Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering. NeurIPS, 2002:585–591.
Donoho DL, Grimes C. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci. 2003;100(10):5591–6.
Torgerson WS. Multidimensional scaling: I. theory and method. Psychometrika. 1952;17(4):401–19.
Cailliez F. The analytical solution of the additive constant problem. Psychometrika. 1983;48(2):305–8.
Dryden IL. Shape analysis. Wiley Stats Ref: Statistics reference online; 2014.
Shawe-Taylor J, Williams CK. The stability of kernel principal components analysis and its relation to the process eigenspectrum. 2003:383–390.
Ham JH, Lee DD, Mika S, Schölkopf B. A kernel view of the dimensionality reduction of manifolds. Dep Pap (ESE). 2004;93.
Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13–30.
Polito M, Perona P. Grouping and dimensionality reduction by locally linear embedding. NeurIPS. 2002:1255–1262.
Fan M, Qiao H, Zhang B, Zhang X. Isometric multi-manifold learning for feature extraction. In: 2012 IEEE 12th International Conference on Data Mining, 2012:241–250. IEEE
Vidal R, Ma Y, Sastry S. Generalized principal component analysis (gpca). IEEE Trans Pattern Anal Mach Intell. 2005;27(12):1945–59.
Little AV, Lee J, Jung YM, Maggioni M. Estimation of intrinsic dimensionality of samples from noisy low-dimensional manifolds in high dimensions with multiscale svd. In: 2009 IEEE/SP 15th Workshop on Statistical Signal Processing, 2009:85–88. IEEE.
Comaniciu D, Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell. 2002;5:603–19.
Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surveys (CSUR). 1999;31(3):264–323.
Bernstein M, De Silva V, Langford JC, Tenenbaum JB. Graph approximations to geodesics on embedded manifolds. Citeseer: Technical report. 2000.
Sibson R. Studies in the robustness of multidimensional scaling: perturbational analysis of classical scaling. J Royal Stat Soc Series B Methodol. 1979;41(2):217–29.
Vergara A, Vembu S, Ayhan T, Ryan MA, Homer ML, Huerta R. Chemical gas sensor drift compensation using classifier ensembles. Sens Actuators B Chem. 2012;166:320–9.
Velloso E, Bulling A, Gellersen H, Ugulino W, Fuks H. Qualitative activity recognition of weight lifting exercises. In: Proceedings of the 4th Augmented Human International Conference. 2013:116–123. ACM
Gomes HM, Read J, Bifet A, Barddal JP, Gama JA. Machine learning for streaming data: State of the art, challenges, and opportunities. SIGKDD Explor Newsl. 2019:6–22.
Thudumu S, Branch P, Jin J, Singh JJ. A comprehensive survey of anomaly detection techniques for high dimensional big data. J Big Data. 2020;7(1):42.
Fujiwara T, Chou J, Shilpika S, Xu P, Ren L, Ma K. An incremental dimensionality reduction method for visualizing streaming multidimensional data. IEEE Trans Visualization Comput Graphics. 2020;26(01):418–28.
Gupta V, Mittal M. QRS complex detection using STFT, chaos analysis, and PCA in standard and realtime ECG databases. J Inst Eng India Series B. 2019;100(5):489–97. https://doi.org/10.1007/s40031019003989.
Dorier M, Wang Z, Ayachit U, Snyder S, Ross R, Parashar M. Colza: Enabling elastic in situ visualization for high-performance computing simulations. In: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2022;538–548.
Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in C++. Art Scientific Computi. 1992;2:1002.
Acknowledgements
Access to computing facilities was provided by University of Buffalo Center for Computational Research.
Funding
This material is based in part upon work supported by the National Science Foundation under award numbers CNS-1409551 and IIS-1641475.
Author information
Authors and Affiliations
Contributions
SM performed the literature review, implemented the proposed model, and carried out the experiments. SM and VC cowrote the manuscript. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The author confirms the sole responsibility for this manuscript. The author read and approved the final manuscript.
Consent for publication
The authors consent for publication.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Lemma 1
The matrix exponential for \({\textbf {M}}\) for rank\(\big ({\textbf {M}} \big ) = d\) and symmetric \({\textbf {M}}\) is given by
where \(\{ {{{\varvec{\lambda }}}}_i \}_{i = 1,2 \ldots d}\) are the d largest eigenvalues of \({\textbf {M}}\) and \(\{ {{\textbf {q}} }_i \}_{i = 1,2 \ldots d}\) are the corresponding eigenvectors such that \({{\textbf {q}} _i^\top }{{{\textbf {q}} }_j} = {{\varvec{\delta }}}_{i,j}\).
Proof Let \({\textbf {M}}\) be an \(n \times n\) real matrix. The exponential \(e^{\textbf {M}}\) is given by
where \({{\textbf {I}} }\) is the identity. Real, symmetric \({\textbf {M}}\) has real eigenvalues and mutually orthogonal eigenvectors i.e. \({\textbf {M}} = \sum \nolimits _{i=1}^{n} {{\varvec{\lambda }}}_{i}{{{\textbf {q}} }_i}{{\textbf {q}} _i^\top } \text{ where } \{ {{\varvec{\lambda }}}_{i} \}_{i = 1 \ldots n} \text{ are } \text{ real } \text{ and } {{{\textbf {q}} }_i^\top {{{\textbf {q}} }_j} = {{\varvec{\delta }}}_{i,j}}\). Given \({\textbf {M}}\) has rank d, we have \({\textbf {M}} = \sum \limits _{i=1}^{d} {{\varvec{\lambda }}}_{i}{{\textbf {q}} _i}{{{\textbf {q}} }_i^\top }\).
\(\square\)
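Lemma 1 can be sanity-checked numerically. The sketch assumes the closed form \(e^{{\textbf {M}}} = {\textbf {I}} + \sum _{i=1}^{d} \big (e^{\lambda _i} - 1\big ){\textbf {q}}_i{\textbf {q}}_i^\top\), which is what the eigendecomposition in the proof yields once the \(n - d\) zero eigenvalues each contribute \(e^{0} = 1\). The reference value is computed from a truncated power series, and the helper name is hypothetical.

```python
import numpy as np


def expm_series(M, terms=60):
    """Matrix exponential via a truncated power series (adequate for small norms)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        result = result + term
    return result


rng = np.random.default_rng(0)
n, d = 6, 2
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))   # n x d matrix with orthonormal columns
lam = np.array([1.5, 0.5])                      # the d non-zero eigenvalues

M = (Q * lam) @ Q.T                             # symmetric, rank d
closed_form = np.eye(n) + (Q * (np.exp(lam) - 1.0)) @ Q.T

print(np.allclose(expm_series(M), closed_form))
```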
Lemma 2
The inverse of the Gaussian kernel for rank\(\big ({\textbf {M}} \big ) = 1\) and symmetric \({\textbf {M}}\) is given by
where \({{{\textbf {q}} }_1}\) is the first eigenvector of \({\textbf {M}}\), i.e. \({{{\textbf {q}} }_1^\top }{{{\textbf {q}} }_1} = 1\), \({{{\varvec{\lambda }}}_1}\) is the corresponding eigenvalue, \({{\varvec{\alpha }}} = \frac{1}{\big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big )}\) and \({{\textbf {c}} _1} = \big [ \exp {\left( \frac{{{\varvec{\lambda }}}_1}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ]\).
Proof Using (22) for \(d = 1\), we have
Representing \(\frac{1}{\big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big )}\) as \({{\varvec{\alpha }}}\) and \(\big [\exp {\left( \frac{{{\varvec{\lambda }}}_1}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ]\) as \({{\textbf {c}} _1}\) and using \(\big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big ){{\textbf {I}} }\) as \({{\textbf {A}} }\), \({{\textbf {c}} _1}{{{\textbf {q}} }_1}\) as \({\textbf {u}}\) and \({{\textbf {q}} _1}\) as \({\textbf {v}}\) in the Sherman-Morrison identity [52], we have
\(\square\)
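The Sherman-Morrison substitution used in this proof can be sanity-checked numerically. The following is a minimal sketch, assuming NumPy. The function name `rank_one_inverse` and the values of `c` and `sigma_n` are hypothetical, with `c` standing in for the scalar \({{\textbf {c}} _1}\). It compares the closed form against a direct inverse of \(\big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big ){{\textbf {I}} } + {{\textbf {c}} _1}{{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top }\).

```python
import numpy as np

def rank_one_inverse(q, c, sigma_n):
    """Closed form alpha*I - alpha^2 c q q^T / (1 + alpha c), for a unit vector q."""
    alpha = 1.0 / (1.0 + sigma_n**2)
    coeff = alpha**2 * c / (1.0 + alpha * c)
    return alpha * np.eye(q.size) - coeff * np.outer(q, q)

rng = np.random.default_rng(1)
n = 5
q = rng.standard_normal(n)
q /= np.linalg.norm(q)                 # enforce q^T q = 1
c, sigma_n = 0.7, 0.3                  # illustrative values, not taken from the paper

# A + u v^T with A = (1 + sigma_n^2) I, u = c q, v = q
K_noisy = (1.0 + sigma_n**2) * np.eye(n) + c * np.outer(q, q)

err = float(np.max(np.abs(np.linalg.inv(K_noisy) - rank_one_inverse(q, c, sigma_n))))
print(err)
```

The closed form costs O(n^2) to assemble, against O(n^3) for a general dense inverse, which is the practical point of the lemma.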
Lemma 3
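The inverse of the Gaussian kernel for rank\(\big ({\textbf {M}} \big ) = d\) and symmetric \({\textbf {M}}\) is given by
\({\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{\textbf {I}} \big )}^{-1} = {{\varvec{\alpha }}}{{\textbf {I}} } - \sum \limits _{i=1}^{d} \frac{{{\varvec{\alpha }}}^2{{\textbf {c}} _i}{{{\textbf {q}} }_i}{{{\textbf {q}} }_i^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _i}}\)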
where \(\{ {{\varvec{\lambda }}}_i \}_{i = 1,2 \ldots d}\) are the d largest eigenvalues of \({\textbf {M}}\) and \(\{ {{\textbf {q}} }_i \}_{i = 1,2 \ldots d}\) are the corresponding eigenvectors such that \({{\textbf {q}} _i^\top }{{{\textbf {q}} }_j} = {{\varvec{\delta }}}_{i,j}\).
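Proof Applying the result of the previous lemma iteratively, once for each of the d rank-one terms \({{\textbf {c}} _i}{{{\textbf {q}} }_i}{{{\textbf {q}} }_i^\top }\), we get the required result. The cross terms vanish at every step because \({{\textbf {q}} _i^\top }{{{\textbf {q}} }_j} = {{\varvec{\delta }}}_{i,j}\), so that
\({\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{\textbf {I}} \big )}^{-1} = {{\varvec{\alpha }}}{{\textbf {I}} } - \sum \limits _{i=1}^{d} \frac{{{\varvec{\alpha }}}^2{{\textbf {c}} _i}{{{\textbf {q}} }_i}{{{\textbf {q}} }_i^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _i}}\)
where \({{\varvec{\alpha }}} = \frac{1}{\big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big )}\) and \({{\textbf {c}} _i} = \big [ \exp {\left( \frac{{{\varvec{\lambda }}}_i}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ]\). \(\square\)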
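The iterative argument can be illustrated by applying a generic Sherman-Morrison update once per rank-one term and comparing the outcome with a sum over d terms. This is a sketch under stated assumptions: NumPy is assumed, the helper `sherman_morrison` and the coefficient values in `c` are hypothetical, and the closed form coded here is the one obtained by repeating the rank-one result for orthonormal vectors.

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(2)
n, d = 6, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal q_1 .. q_d
c = np.array([0.9, 0.4, 0.1])                      # illustrative c_i values
sigma_n = 0.25
alpha = 1.0 / (1.0 + sigma_n**2)

# Start from A^{-1} = alpha * I and fold in one rank-one term at a time.
inv_iter = alpha * np.eye(n)
for i in range(d):
    inv_iter = sherman_morrison(inv_iter, c[i] * Q[:, i], Q[:, i])

# Closed form: alpha*I - sum_i alpha^2 c_i / (1 + alpha c_i) q_i q_i^T
inv_closed = alpha * np.eye(n) - (Q * (alpha**2 * c / (1.0 + alpha * c))) @ Q.T

full = (1.0 + sigma_n**2) * np.eye(n) + (Q * c) @ Q.T
err_iter = float(np.max(np.abs(inv_iter - inv_closed)))
err_direct = float(np.max(np.abs(np.linalg.inv(full) - inv_closed)))
print(err_iter, err_direct)
```

Each update leaves the earlier terms untouched because the vectors are mutually orthogonal, which is why the d corrections simply add up.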
Lemma 4
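The solution of the Gaussian Process regression system when rank\(\big ({\textbf {M}} \big ) = 1\) and \({\textbf {M}}\) is symmetric is given by
\({\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{\textbf {I}} \big )}^{-1}{\textbf {y}} = \frac{{{\varvec{\alpha }}}\sqrt{{{\varvec{\lambda }}}_1}{{{\textbf {q}} }_1}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}}\)
Proof Assuming the intrinsic dimensionality of the low-dimensional manifold to be 1 implies that the inverse of the Gaussian kernel is as defined in (32). \({\textbf {y}}\) is \(\sqrt{{{\varvec{\lambda }}}_1}{{\textbf {q}} }_1\) in this case (see Sect. “Nonlinear dimensionality reduction”). Thus we have
\({\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{\textbf {I}} \big )}^{-1}{\textbf {y}} = \Big ( {{\varvec{\alpha }}}{{\textbf {I}} } - \frac{{{\varvec{\alpha }}}^2{{\textbf {c}} _1}{{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}} \Big )\sqrt{{{\varvec{\lambda }}}_1}{{{\textbf {q}} }_1} = \sqrt{{{\varvec{\lambda }}}_1}\Big ( {{\varvec{\alpha }}} - \frac{{{\varvec{\alpha }}}^2{{\textbf {c}} _1}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}} \Big ){{{\textbf {q}} }_1} = \frac{{{\varvec{\alpha }}}\sqrt{{{\varvec{\lambda }}}_1}{{{\textbf {q}} }_1}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}}\)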
\(\square\)
Lemma 5
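The solution of the Gaussian Process regression system when rank\(\big ({\textbf {M}} \big ) = d\) and \({\textbf {M}}\) is symmetric is given by
\({\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{\textbf {I}} \big )}^{-1}{\textbf {y}} = \Big \{ \frac{{{\varvec{\alpha }}}\sqrt{{{\varvec{\lambda }}}_1}{{{\textbf {q}} }_1}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}} \; \frac{{{\varvec{\alpha }}}\sqrt{{{\varvec{\lambda }}}_2}{{{\textbf {q}} }_2}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _2}} \; \ldots \; \frac{{{\varvec{\alpha }}}\sqrt{{{\varvec{\lambda }}}_d}{{{\textbf {q}} }_d}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _d}} \Big \}\)
Proof Assuming the intrinsic dimensionality of the low-dimensional manifold to be d implies that the inverse of the Gaussian kernel is as defined in (33). \({\textbf {y}}\) is \(\{ \sqrt{{{\varvec{\lambda }}}_1}{{\textbf {q}} }_1 \; \sqrt{{{\varvec{\lambda }}}_2}{{\textbf {q}} }_2 \; \ldots \; \sqrt{{{\varvec{\lambda }}}_d}{\textbf {q}} _d \}\) in this case (see Sect. “Nonlinear dimensionality reduction”), where \({{\textbf {q}} _i^\top }{{{\textbf {q}} }_j} = {{\varvec{\delta }}}_{i,j}\). Each of the d dimensions of \({\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{\textbf {I}} \big )}^{-1}{\textbf {y}}\) can be processed independently, similar to the previous lemma. For the \({i}^{\text {th}}\) dimension, we have,
\({\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{\textbf {I}} \big )}^{-1}\sqrt{{{\varvec{\lambda }}}_i}{{{\textbf {q}} }_i} = \Big ( {{\varvec{\alpha }}}{{\textbf {I}} } - \sum \limits _{j=1}^{d} \frac{{{\varvec{\alpha }}}^2{{\textbf {c}} _j}{{{\textbf {q}} }_j}{{{\textbf {q}} }_j^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _j}} \Big )\sqrt{{{\varvec{\lambda }}}_i}{{{\textbf {q}} }_i} = \frac{{{\varvec{\alpha }}}\sqrt{{{\varvec{\lambda }}}_i}{{{\textbf {q}} }_i}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _i}}\)
Thus we get the result,
\({\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{\textbf {I}} \big )}^{-1}{\textbf {y}} = \Big \{ \frac{{{\varvec{\alpha }}}\sqrt{{{\varvec{\lambda }}}_1}{{{\textbf {q}} }_1}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}} \; \frac{{{\varvec{\alpha }}}\sqrt{{{\varvec{\lambda }}}_2}{{{\textbf {q}} }_2}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _2}} \; \ldots \; \frac{{{\varvec{\alpha }}}\sqrt{{{\varvec{\lambda }}}_d}{{{\textbf {q}} }_d}}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _d}} \Big \}\)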
\(\square\)
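The column-by-column argument of this proof can be checked with a small numerical experiment. The sketch below assumes NumPy and uses hypothetical values for the eigenvalues, the length scale `ell` and the noise level `sigma_n`. The sign inside the exponential defining `c` follows the text as printed and is an assumption here; the comparison holds for either sign. The sketch solves the regression system directly and compares each column with a rescaled eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 7, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal q_1 .. q_d
lam = np.array([2.0, 1.0, 0.5])                    # illustrative eigenvalues
ell, sigma_n = 1.5, 0.2                            # illustrative hyperparameters

alpha = 1.0 / (1.0 + sigma_n**2)
c = np.expm1(lam / (2.0 * ell**2))                 # c_i = exp(lambda_i / (2 ell^2)) - 1 (sign assumed)

K = np.eye(n) + (Q * c) @ Q.T                      # kernel as identity plus d rank-one terms
Y = Q * np.sqrt(lam)                               # column i is sqrt(lambda_i) q_i

# Direct solve of (K + sigma_n^2 I) W = Y, one right-hand side per dimension
direct = np.linalg.solve(K + sigma_n**2 * np.eye(n), Y)

# Closed form: column i is alpha * sqrt(lambda_i) / (1 + alpha c_i) times q_i
closed = Q * (alpha * np.sqrt(lam) / (1.0 + alpha * c))

gap = float(np.max(np.abs(direct - closed)))
print(gap)
```

Since each column of the solution is a scalar multiple of the matching eigenvector, the regression weights can be formed without ever inverting the n-by-n kernel.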
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Mahapatra, S., Chandola, V. Learning manifolds from nonstationary streams. J Big Data 11, 42 (2024). https://doi.org/10.1186/s40537-023-00872-8