
Learning manifolds from non-stationary streams

Abstract

Streaming adaptations of manifold learning based dimensionality reduction methods, such as Isomap, are based on the assumption that a small initial batch of observations is enough for exact learning of the manifold, while the remaining streaming data instances can be cheaply mapped to this manifold. However, there are no theoretical results to show that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur when the data is streaming. We present theoretical results to show that the quality of a manifold asymptotically converges as the size of the data increases. We then show that a Gaussian Process Regression (GPR) model, which uses a manifold-specific kernel function and is trained on an initial batch of sufficient size, can closely approximate the state-of-the-art streaming Isomap algorithms, and that the predictive variance obtained from the GPR prediction can be employed as an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn lower dimensional representations of high dimensional data in a streaming setting, while identifying shifts in the generative distribution. For instance, key findings on a Gas sensor array data set show that our method can detect changes in the underlying data stream, triggered by real-world factors such as the introduction of a new gas in the system, while efficiently mapping data on a low-dimensional manifold.

Introduction

High-dimensional data is inherently difficult to explore and analyze, owing to the “curse of dimensionality” that renders many statistical and machine learning techniques inadequate. In this context, non-linear dimensionality reduction (NLDR) has proved to be an indispensable tool. Manifold learning based NLDR methods, such as Isomap [1], Local Linear Embedding (LLE) [2], etc., assume that the distribution of the data in the high-dimensional observed space is not uniform. Instead, the data is assumed to lie near a non-linear low-dimensional manifold embedded in the high-dimensional space. By exploiting the geometric properties of the manifold, e.g., smoothness, such methods infer the low-dimensional representation of the data from the high-dimensional observations.

A key shortcoming of NLDR methods is their \(O(n^3)\) complexity, where n is the size of the data [1]. If directly applied on streaming data, where data arrives one point at a time, NLDR methods have to recompute the entire manifold at every time step, making such a naive adaptation prohibitively expensive. To alleviate the computational problem, landmark-based methods [3] or general out-of-sample extension methods [4] have been proposed. However, these techniques are still computationally expensive for practical applications. Recently, a streaming adaptation of the Isomap algorithm [1], which is a widely used NLDR method, was proposed [5]. This method, called S-Isomap, relies on exact learning from a small initial batch of observations, followed by approximate mapping of subsequent stream of observations. An extension to the case when the observations are sampled from multiple, and possibly intersecting, manifolds, called S-Isomap++, was subsequently proposed [6].

Empirical results on benchmark data sets show that these methods can reliably learn the manifold with a small initial batch of observations. However, two issues still remain. First, no theoretical bounds on the quality of the manifold, as a function of the initial batch size, exist. Second, these methods assume that the underlying generative distribution is stationary over the stream, and are unable to detect when the distribution “drifts” or abruptly “shifts” away from the base, resulting in incorrect low-dimensional mappings (see Fig. 1).

Fig. 1

Impact of changes in the data distribution on streaming NLDR. In the top panel, the true data lies on a 2D manifold (top-left) and the observed data is in \({\mathbb {R}}^3\), obtained by using the swiss-roll transformation of the 2D data (top-middle). The streaming algorithm (S-Isomap [5]) uses a batch of samples from a 2D Gaussian (black), and maps streaming points sampled from a uniform distribution (gray). The streaming algorithm performs well on mapping the batch points to \({\mathbb {R}}^2\) but fails on the streaming points that “drift” away from the batch (top-right). In the bottom panel, the streaming algorithm (S-Isomap++ [6]) uses a batch of samples from three 2D Gaussians (black). The stream points are sampled from the three Gaussians and a new Gaussian (gray). The streaming algorithm performs well on mapping the batch points to \({\mathbb {R}}^2\) but fails on the streaming points that are “shifted” from the batch (bottom-right). Both streaming algorithms are discussed in Sect. “Problem statement and preliminaries”

The focus of this paper is two-fold. We first provide theoretical results that show that the quality (see Footnote 1) of a manifold, as learnt by Isomap, asymptotically converges as the data size, n, increases. This is a necessary result to show the correctness of streaming methods such as S-Isomap and S-Isomap++, under the assumption of stationarity. Next, we propose a methodology to detect changes in the underlying distribution of the stream properties (drifts and shifts), and inform the streaming methods to update the base manifold.

We employ a Gaussian Process (GP) [7] based adaptation of Isomap to process high-throughput streams. The use of GP is enabled by a kernel that measures the relationship between a pair of observations along the manifold, and not in the original high-dimensional space. We prove that the low-dimensional representations inferred using the GP based method – GP-Isomap – are equivalent to the representations obtained using the state-of-the-art streaming Isomap methods [5, 6]. Additionally, we empirically show, on synthetic and real data sets, that the predictive variance associated with the GP predictions is an effective indicator of the changes (either gradual drifts or sudden shifts) in the underlying generative distribution, and can be employed to inform the algorithm to “re-learn” the core manifold.

Related works

Processing data streams efficiently using standard approaches is challenging in general, since streams require real-time processing and cannot be stored permanently. Any form of analysis, including detecting concept drift, requires adequate summarization that can deal with these inherent constraints and approximate the characteristics of the stream well. Strategies used in this context include random sampling [8, 9] as well as decision-tree based approaches [10]. To identify concept drift, maintaining statistical summaries over a streaming “window” is a typical strategy [11,12,13]. However, none of these are applicable in the setting of learning a latent representation of the data, e.g., manifolds, in the presence of changes in the stream distribution.

We discuss the limitations of existing incremental and streaming solutions that have been developed specifically in the context of manifold learning, and in particular the Isomap algorithm, in the following sections. Coupling Isomap with GP Regression (GPR) has been explored in the past [14,15,16,17], though not in the context of streaming data. This includes a Mercer kernel-based Isomap technique [14] and an emulator pipeline using Isomap to determine a low-dimensional representation, whose output is fed to a GPR model [15].

The intuition to use GPR for detecting concept drift is novel, even though the Bayesian non-parametric approach of [18], primarily intended for anomaly detection, comes close to our work in a single manifold setting. However, their choice of a Euclidean distance (in the original \({\mathbb {R}}^D\) space) based kernel for the covariance matrix can result in high Procrustes error, as shown in Fig. 4. Additionally, their approach does not scale, since it does not use any approximation to process the new streaming points “cheaply”.

We also note that a family of GP based non-spectral (see Footnote 2) non-linear dimensionality reduction methods exist, called Gaussian Process Latent Variable Model (GPLVM) [20] and its variants [19, 21]. GPLVM assumes that the high-dimensional observations are generated from the corresponding low-dimensional representations, using a GP prior. The latent low-dimensional representations are then inferred by maximizing the marginalized log-likelihood of the observed data, which is an optimization problem with n unknown d-dimensional vectors, where d is the length of the low-dimensional representation. In contrast, the GP-Isomap algorithm assumes that the low-dimensional representations are generated from the corresponding high-dimensional data, using a manifold-specific kernel matrix.

There has been a considerable body of literature dealing with dimensionality reduction [22, 23], including recent work that uses deep learning based models [24]; however, these cannot be applied in a streaming setting. While there have been some recent works that use PCA in a streaming setting [25], these are inherently linear and hence are not applicable where the manifolds are non-linear.

Problem statement and preliminaries

We first formulate the NLDR problem, provide background on Isomap, and discuss its out-of-sample and streaming extensions [5, 6, 26, 27]. Additionally, we provide a brief introduction to Gaussian Process (GP) analysis.

Non-linear dimensionality reduction

Given high-dimensional data \({\textbf {Y}} = \{ {\textbf {y}} _i \}_{i = 1 \ldots n}\), where \({\textbf {y}} _i \in {\mathbb {R}}^D\), the NLDR problem is concerned with finding its corresponding low-dimensional representation \({\textbf {X}} = \{ {\textbf {x}} _i \}_{i = 1 \ldots n}\), such that \({\textbf {x}} _i \in {\mathbb {R}}^d\), where \({d} \ll {D}\).

NLDR methods assume that the data lies along a low-dimensional manifold embedded in a high-dimensional space, and exploit the global (Isomap [1], Minimum Volume Embedding [28]) or local (LLE [2], Laplacian Eigenmaps [29], Hessian Eigenmaps [30]) properties of the manifold to map each \({\textbf {y}} _i\) to its corresponding \({\textbf {x}} _i\).

The Isomap algorithm [1] maps each \({\textbf {y}} _i\) to its low-dimensional representation \({\textbf {x}} _i\) in such a way that the geodesic distance along the manifold between any two points, \({\textbf {y}} _i\) and \({\textbf {y}} _j\), is as close to the Euclidean distance between \({\textbf {x}} _i\) and \({\textbf {x}} _j\) as possible. The geodesic distance is approximated by computing the shortest path between the two points using the k-nearest neighbor graph (see Footnote 3) and is stored in the geodesic distance matrix \({{\textbf {G}} } = \{ {{\textbf {g}} }_{i,j} \}_{1 \le i, j \le n}\), where \({{\textbf {g}} }_{i,j}\) is the geodesic distance between the points \({\textbf {y}} _i\) and \({\textbf {y}} _j\). \(\widetilde{{{\textbf {G}} }}= \{ {{{\textbf {g}} }^{2}_{i,j}} \}_{1 \le i, j \le n}\) contains the squared geodesic distance values. The Isomap algorithm recovers \({\textbf {x}} _i\) by using the classical Multi Dimensional Scaling (MDS) on \(\widetilde{{{\textbf {G}} }}\). Let \({{\textbf {B}} }\) be the inner product matrix between different \({\textbf {x}} _i\). \({{\textbf {B}} }\) can be retrieved as \({{\textbf {B}} } = -{{\textbf {H}} }\widetilde{{{\textbf {G}} }}{{\textbf {H}} }/2\) by assuming \(\sum \nolimits _{i=1}^{n} {\textbf {x}} _i = 0\), where \({{\textbf {H}} } = \{ {{\textbf {h}} }_{i,j} \}_{1 \le i, j \le n}\) and \({{\textbf {h}} }_{i,j} = {{\varvec{\delta }}}_{i,j} - 1/{n}\), where \({{\varvec{\delta }}}_{i,j}\) is the Kronecker delta. Isomap uncovers \({\textbf {X}}\) such that \({\textbf {X}} ^T{\textbf {X}}\) is as close to \({{\textbf {B}} }\) as possible. This is achieved by setting \({\textbf {X}} = \{ \sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1 \; \sqrt{{\varvec{\lambda }}}_2{{\textbf {q}} }_2 \; \ldots \; \sqrt{{\varvec{\lambda }}}_d{{\textbf {q}} }_d \}^T\) where \({{\varvec{\lambda }}}_1, {{\varvec{\lambda }}}_2 \dots {{\varvec{\lambda }}}_d\) are the d largest eigenvalues of \({{\textbf {B}} }\) and \({{\textbf {q}} }_1, {{\textbf {q}} }_2 \dots {{\textbf {q}} }_d\) are the corresponding eigenvectors.
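As a concrete illustration, the MDS step above can be sketched in a few lines of Python. This is a minimal sketch that assumes the squared geodesic distance matrix has already been computed from the neighborhood graph; the function and variable names (e.g., `isomap_mds`, `G_sq`) are illustrative and not taken from the authors' implementation.

```python
# Minimal sketch of the MDS step of Isomap: recover the d-dimensional
# embedding from an n x n matrix of squared geodesic distances G_sq.
import numpy as np

def isomap_mds(G_sq, d):
    n = G_sq.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H
    B = -0.5 * H @ G_sq @ H                    # inner-product matrix B
    eigvals, eigvecs = np.linalg.eigh(B)       # eigendecomposition (ascending)
    idx = np.argsort(eigvals)[::-1][:d]        # d largest eigenvalues
    lam, Q = eigvals[idx], eigvecs[:, idx]
    X = Q * np.sqrt(np.clip(lam, 0, None))     # rows are the embeddings x_i
    return X, lam, Q
```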

The Isomap algorithm makes use of \(\widetilde{{{\textbf {G}} }}\) to approximate the pairwise Euclidean distances on the generated manifold. Isomap demonstrates good performance when the computed geodesic distances are close to Euclidean. In this scenario, the matrix \({{\textbf {B}} }\) behaves like a positive semi-definite (PSD) kernel. The opposite scenario requires a modification to be made to \(\widetilde{{{\textbf {G}} }}\) to make it PSD. In MDS literature, this is commonly referred to as the Additive Constant Problem (ACP) [14, 31, 32].

To measure the error between the true, underlying low-dimensional representation and that uncovered by NLDR methods, Procrustes analysis [33] is typically used. Procrustes analysis involves aligning two matrices, \({{\textbf {A}} }\) and \({{\textbf {B}} }\), by finding the optimal translation, \({{\textbf {t}} }\), rotation, \({{\textbf {R}} }\), and scaling factor, \({\textbf {s}}\), that minimize the Frobenius norm between the two aligned matrices, i.e.,:

$$\begin{aligned} {{\varvec{\epsilon }}}_{\text {Proc}}({{\textbf {A}} }, {{\textbf {B}} }) = \min _{{{\textbf {R}} },{{\textbf {t}} },{\textbf {s}} } \Vert {\textbf {s}} {{\textbf {R}} }{{\textbf {B}} } + {{\textbf {t}} } - {{\textbf {A}} }\Vert _{\text {F}} \end{aligned}$$
(1)

The above optimization problem has a closed-form solution obtained by performing Singular Value Decomposition (SVD) of \({{\textbf {A}} }{{\textbf {B}} }^T\) [33]. Consequently, one of the properties of Procrustes analysis is that \({{\varvec{\epsilon }}}_{\text {Proc}}({{\textbf {A}} }, {{\textbf {B}} }) = 0\) when \({{\textbf {A}} } = {\textbf {s}} {{\textbf {R}} }{{\textbf {B}} } + {{\textbf {t}} }\), i.e., when one of the matrices is a scaled, translated and/or rotated version of the other, which we leverage in this work.
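For reference, the following is a minimal sketch of computing this Procrustes error, assuming \(A\) and \(B\) are \(n \times d\) matrices of corresponding points and using the standard SVD-based closed form; centering both matrices accounts for the optimal translation, and the names are illustrative.

```python
# Minimal sketch of the Procrustes error of Eq. (1).
import numpy as np

def procrustes_error(A, B):
    A0 = A - A.mean(axis=0)                    # optimal translation: center both
    B0 = B - B.mean(axis=0)
    U, S, Vt = np.linalg.svd(A0.T @ B0)        # SVD of A0^T B0
    R = U @ Vt                                 # optimal rotation
    s = S.sum() / (B0 ** 2).sum()              # optimal scaling
    return np.linalg.norm(s * B0 @ R.T - A0, 'fro')
```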

Streaming Isomap

The Isomap algorithm has a complexity of \({\mathcal {O}}(n^3)\), where n is the size of the data, since it needs to perform an eigendecomposition of \({{\textbf {B}} }\) as described in the previous section; recomputing the manifold at every time step is therefore computationally impractical in a streaming setting. Incremental techniques have been proposed in the past [5, 27], which can efficiently process the new streaming points, without affecting the quality of the embedding significantly.

The S-Isomap algorithm relies on the assumption that a stable manifold can be learnt using only a fraction of the stream (denoted as the batch data set \({{\mathcal {B}}}\)), and the remaining part of the stream (denoted as the stream data set \({{\mathcal {S}}}\)) can be mapped to the manifold in a significantly less costly manner. A convergence proof that justifies this assumption is provided in Sect. “Convergence proofs for S-Isomap and S-Isomap++”. Alternatively, this can be justified by considering the convergence of eigenvectors and eigenvalues of \({{\textbf {B}} }\), as the number of points in the batch increases [34]. In particular, the bounds on the convergence error for a similar NLDR method, i.e., kernel PCA, are shown to be inversely proportional to the batch size [34]. Similar arguments can be made for Isomap, by considering the equivalence between Isomap and Kernel PCA [26, 35]. This relationship has also been empirically shown for multiple data sets [5]. The S-Isomap algorithm computes the low-dimensional representation for each new point, i.e., \({{\textbf {x}} _{n+1}}\in {\mathbb {R}}^d\), by solving a least-squares problem formulated by matching the dot product of the new point with the low-dimensional embedding of the points in the batch data set \({\textbf {X}}\), computed using Isomap, to the normalized squared geodesic distances vector \({\textbf {f}}\). The least-squares problem has the following form:

$$\begin{aligned} {\textbf {X}} ^T{{\textbf {x}} _{n+1}} = {\textbf {f}} \end{aligned}$$
(2)

where (see Footnote 4)

$${{\bf f}_i} \simeq \frac{1}{2} \big(\frac{1}{n}\sum\limits_{j}{{\bf g}_{i,j}^2} - {{\bf g}_{i,n+1}^2}\big)$$
(4)

where \({{\textbf {g}} }_{i,j}\) refers to the geodesic distances discussed in Sect. “Problem statement and preliminaries”.
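A minimal sketch of this mapping step, assuming the batch embedding (as a \(d \times n\) array), the batch squared geodesic distances, and the squared geodesic distances from the new point to the batch points are available as NumPy arrays, is shown below; the names are illustrative only.

```python
# Minimal sketch of the S-Isomap mapping step (Eqs. (2) and (4)).
import numpy as np

def s_isomap_map(X, G_sq, g_new_sq):
    # f_i = 0.5 * ( (1/n) * sum_j g_{i,j}^2  -  g_{i,n+1}^2 )
    f = 0.5 * (G_sq.mean(axis=1) - g_new_sq)
    # least-squares solution of X^T x_{n+1} = f
    x_new, *_ = np.linalg.lstsq(X.T, f, rcond=None)
    return x_new                               # d-dimensional embedding of the new point
```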

Handling multiple manifolds

In the ideal case, when manifolds are densely sampled and sufficiently separated, clustering can be performed before applying NLDR techniques [37, 38], by choosing an appropriate local neighborhood size so as not to include points from other manifolds and still be able to capture the local geometry of the manifold. However, if the manifolds are close or intersecting, such methods typically fail. While methods such as Generalized Principal Component Analysis (GPCA) [39] have been proposed to generalize linear methods such as PCA for a case where the data lies on multiple sub-spaces, such ideas have not been explored for non-linear methods.

The S-Isomap++ [6] algorithm overcomes limitations of the S-Isomap algorithm and extends it to be able to deal with multiple manifolds. It uses the notion of Multi-scale SVD [40] to define tangent manifold planes at each data point, computed at the appropriate scale, and computes similarity in a local neighborhood. Additionally, it includes a novel manifold tangent clustering algorithm to be able to deal with the above issue of clustering manifolds which are close and, in certain scenarios, intersecting, using these tangent manifold planes. After initially clustering the high-dimensional batch data set, the algorithm applies NLDR on each manifold individually and eventually “stitches” them together in a global ambient space by defining transformations which can map points from the individual low-dimensional manifolds to the global space. S-Isomap++ does not assume that the number of manifolds (p) is specified and automatically infers p using its clustering mechanism (see Footnote 5). Given that the data points lie on low-dimensional and potentially intersecting manifolds, it is evident that the standard clustering methods, such as K-Means [42], that operate on the observed data in \({\mathbb {R}}^{\textbf {D}}\), will fail in correctly identifying the clusters.

However, S-Isomap++ can only detect manifolds which it encounters in its batch learning phase, and not those which it might encounter in the streaming phase. Thus, S-Isomap++ ceases to “learn” and evolve, and cannot limit the embedding error for points in the data stream, even though it has a “stitching” mechanism to embed individual low-dimensional manifolds, which might themselves be of different dimensions.

Gaussian process regression

Let us assume that we are learning a probabilistic regression model to obtain the prediction at a given test input, \({\textbf {y}}\), using a non-linear and latent function, \(f(\varvec{\cdot })\). Assuming \(d=1\) (see Footnote 6), the observed output, x, is related to the input as:

$${\bf x} = f({\bf y}) + {\boldsymbol \varepsilon},\text{ where, } {\boldsymbol \varepsilon}\sim \mathcal{N}(0,{\boldsymbol \sigma}_n^2)$$
(5)

Given a training set of inputs, \({\textbf {Y}} = \{ {\textbf {y}} _i \}_{i = 1 \ldots n}\), and corresponding outputs, \({\textbf {X}} = \{{\bf x}_i\}_{i = 1 \ldots n}\) (see Footnote 7), the Gaussian Process Regression (GPR) model assumes a GP prior on the latent function values, i.e., \(f({\textbf {y}} ) \sim GP(m({\textbf {y}} ),k({\textbf {y}} ,{\textbf {y}} '))\), where \(m({\textbf {y}} )\) is the mean of \(f({\textbf {y}} )\) and \(k({\textbf {y}} ,{\textbf {y}} ')\) is the covariance between any two evaluations of \(f(\varvec{\cdot })\), i.e., \(m({\textbf {y}} ) = {\mathbb {E}}[f({\textbf {y}} )]\) and \(k({\textbf {y}} ,{\textbf {y}} ') = {\mathbb {E}}[(f({\textbf {y}} ) - m({\textbf {y}} ))(f({\textbf {y}} ') - m({\textbf {y}} '))]\). Here we use a zero-mean function (\(m({\textbf {y}} ) = 0\)), though other functions could be used as well. The GP prior states that any finite collection of the latent function evaluations are jointly Gaussian, i.e.,

$$\begin{aligned} f({\textbf {y}} _1,{\textbf {y}} _2,\ldots , {\textbf {y}} _n) \sim {\mathcal {N}}({\textbf {0}} , K) \end{aligned}$$
(6)

where the \(ij^{th}\) entry of the \(n \times n\) covariance matrix, K, is given by \(k({\textbf {y}} _i, {\textbf {y}} _j)\). The GPR model uses (5) and (6) to obtain the predictive distribution at a new test input, \({\textbf {y}} _{n+1}\), as a Gaussian distribution with following mean and variance:

$${\mathbb {E}}[{\bf x}_{n+1}] = {\textbf {k}} _{n+1}^\top (K + {{\varvec{\sigma }}}_n^2I)^{-1}{\textbf {X}}$$
(7)
$$\begin{aligned} var[{\bf x}_{n+1}] = k({\bf y}_{n+1},{\bf y}_{n+1}) - {\bf k}_{n+1}^\top(K + {{\varvec{\sigma }}}_n^2I)^{-1}{\bf k}_{n+1} + {{\varvec{\sigma }}}_n^2 \end{aligned}$$
(8)

where \({\textbf {k}} _{n+1}\) is a \(n \times 1\) vector with \(i^{th}\) value as \(k({\textbf {y}} _{n+1},{\textbf {y}} _i)\).

The kernel function, \(k(\varvec{\cdot })\), specifies the covariance between function values, \(f({\textbf {y}} _i)\) and \(f({\textbf {y}} _j)\), as a function of the corresponding inputs, \({\textbf {y}} _i\) and \({\textbf {y}} _j\). A popular choice is the squared exponential kernel, which has been used in this work:

$$\begin{aligned} k({\bf y}_i, {\bf y}_j) = {{\varvec{\sigma }}}^2_s\exp {\left[ -\frac{{\Vert {\bf y}_i-{\bf y}_j\Vert }^2}{2{{\varvec{\ell }}}^2}\right] } \end{aligned}$$
(9)

where \({{\varvec{\sigma }}}_s^2\) is the signal variance and \({{\varvec{\ell }}}\) is the length scale. The quantities \({{\varvec{\sigma }}}_s^2\), \({{\varvec{\ell }}}\), and \({{\varvec{\sigma }}}_n^2\) from (5) are the hyper-parameters of the model and can be estimated by maximizing the marginal log-likelihood of the observed data (\({\textbf {Y}}\) and \({\textbf {X}}\)) under the GP prior assumption.

One can observe that the predictive mean, \({\mathbb {E}}[{\textbf {x}} _{n+1}]\) in (7), can be written as an inner product, i.e.

$$\begin{aligned} {\mathbb {E}}[{\bf x}_{n+1}] = {{\varvec{\beta }}}^\top {\textbf {k}} _{n+1} \end{aligned}$$
(10)

where \({{\varvec{\beta }}} = (K + {{\varvec{\sigma }}}_n^2I)^{-1}{\textbf {X}}\). We will utilize this form in subsequent proofs.
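For reference, a minimal sketch of these GPR predictive equations with the squared exponential kernel is given below, using placeholder hyper-parameter values and illustrative names (this is not the kernel used by GP-Isomap, which is introduced later).

```python
# Minimal sketch of GPR prediction (Eqs. (7)-(10)) with the squared
# exponential kernel of Eq. (9); Y is n x D inputs, X is the n-vector of
# outputs (d = 1), y_new is a single D-dimensional test input.
import numpy as np

def sq_exp_kernel(A, B, sigma_s=1.0, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_s ** 2 * np.exp(-d2 / (2 * ell ** 2))

def gpr_predict(Y, X, y_new, sigma_n=0.1, ell=1.0):
    n = len(Y)
    K = sq_exp_kernel(Y, Y, ell=ell)
    k_star = sq_exp_kernel(Y, y_new[None, :], ell=ell).ravel()
    A = K + sigma_n ** 2 * np.eye(n)
    beta = np.linalg.solve(A, X)                       # (K + sigma_n^2 I)^{-1} X
    mean = k_star @ beta                               # Eq. (10)
    var = (sq_exp_kernel(y_new[None, :], y_new[None, :], ell=ell)[0, 0]
           - k_star @ np.linalg.solve(A, k_star)
           + sigma_n ** 2)                             # Eq. (8)
    return mean, var
```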

Convergence proofs for S-Isomap and S-Isomap++

In this section, we demonstrate the convergence of the S-Isomap algorithm in a single manifold setting, and subsequently extend the result to the multi-manifold setting, i.e., to the S-Isomap++ algorithm described above.

Theorem 1

Given a uniformly sampled, uni-modal distribution from which the random batch data set \({\mathcal {B}} = \{ {\textbf {y}} _i \in {\mathbb {R}}^{\textbf {D}} \}_{i = 1 \ldots n}\) of the S-Isomap algorithm is derived, there exists a threshold \({\textbf {n}} _{0}\), such that when \({\textbf {n}} \ge {\textbf {n}} _{0}\), the Procrustes Error \({{\varvec{\epsilon }}}_{\text {Proc}}\big ({{\varvec{\tau }}}_{{\mathcal {B}}}\), \({{\varvec{\tau }}}_{\text {ISO}}\big )\) between \({{\varvec{\tau }}}_{{\mathcal {B}}} = {{\varvec{\phi }}}^{-1}\big ({\mathcal {B}}\big )\), the true underlying representation, and \({{\varvec{\tau }}}_{\text {ISO}}= \hat{{\varvec{\phi }}}^{-1}\big ({\mathcal {B}}\big )\), the embedding uncovered by Isomap, is small (\({{\varvec{\epsilon }}}_{\text {Proc}}\approx 0\)), i.e., the batch phase of the S-Isomap algorithm converges, where \({{\varvec{\phi }}}(\varvec{\cdot })\) is the non-linear function which maps data points from the underlying low-dimensional ground truth representation \({\textbf {U}}\) to \({\mathcal {B}}\in {\mathbb {R}}^{\textbf {D}}\), and the ground truth \({\textbf {U}}\) originally resides in a convex \({\mathbb {R}}^{\textbf {d}}\) Euclidean space.


Proof Based on the setting described above, the S-Isomap algorithm acts like a generative model which is trying to learn the inverse mapping \({{\varvec{\phi }}}(\varvec{\cdot })^{-1}\), where the associated embedding error is the Procrustes Error \({{\varvec{\epsilon }}}_{\text {Proc}}\big ({{\varvec{\tau }}}_{\mathcal {B}}\), \({{\varvec{\tau }}}_{\text {ISO}}\big )\).

The proof follows from [43], which showed that, given \({{\varvec{\lambda }}}_1\), \({{\varvec{\lambda }}}_2\), \({{\varvec{\mu }}} > 0\), an appropriately chosen \({{\varvec{\epsilon }}} > 0\), and a data set \({\textbf {Y}} = \{ {\textbf {y}} _i \}_{i = 1 \ldots n}\) sampled from a Poisson distribution with density function \({{\varvec{\alpha }}}\) which satisfies the \({{\varvec{\delta }}}\)-sampling condition, i.e.

$$\begin{aligned} {{\varvec{\alpha }}} > \log ({\textbf {V}} /({{\varvec{\mu }}} \widetilde{{\textbf {V}} }({{\varvec{\delta }}}/4)))/\widetilde{{\textbf {V}} }({{\varvec{\delta }}}/2) \end{aligned}$$
(11)

wherein the \({{\varvec{\epsilon }}}\)-rule is used to construct a graph \({{\textbf {G}} }\) on \({\textbf {Y}}\), the ratio between the graph based distance \({\textbf {d}} _{G}({{\textbf {x}} , {\textbf {y}} })\) and the true Euclidean distance \({\textbf {d}} _{M}({{\textbf {x}} , {\textbf {y}} })\), \(\forall {\textbf {x}}\), \({\textbf {y}} \in {\textbf {Y}}\), is bounded. More concretely, the following holds with probability at least \((1 - {{\varvec{\mu }}})\), \(\forall {\textbf {x}}\), \({\textbf {y}} \in {\textbf {Y}}\):

$$\begin{aligned} 1 - {{\varvec{\lambda }}}_1 \le \frac{{\textbf {d}} _{G}({{\textbf {x}} , {\textbf {y}} })}{{\textbf {d}} _{M}({{\textbf {x}} , {\textbf {y}} })} \le 1 + {{\varvec{\lambda }}}_2 \end{aligned}$$
(12)

where \({\textbf {V}}\) is the volume of the manifold \({\mathcal {M}}\) and

$$\begin{aligned} \widetilde{{\textbf {V}} }({\textbf {r}} ) = \min \limits _{{\textbf {x}} \in {\mathcal {M}}}\text{ Vol }({\mathcal {B}}_{\textbf {x}} ({\textbf {r}} )) = {{\varvec{\eta }}}_{\textbf {d}} {\textbf {r}} ^{\textbf {d}} \end{aligned}$$
(13)

is the volume of the smallest metric ball in \({\mathcal {M}}\) of radius \({\textbf {r}}\) and \({{\varvec{\delta }}} >0\) is such that

$$\begin{aligned} {{\varvec{\delta }}} = {{\varvec{\lambda }}}_2{{\varvec{\epsilon }}}/4 \end{aligned}$$
(14)

A similar result can be derived in the scenario where \({\textbf {n}}\) points are sampled independently from the fixed probability distribution \(p({\textbf {y}} ; {\varvec{\theta }})\), in which case we have:

$$\begin{aligned} {\textbf {n}} \widetilde{{\varvec{\alpha }}} = {{\varvec{\alpha }}} \end{aligned}$$
(15)

where \(\widetilde{{\varvec{\alpha }}}\) is the probability of selecting a sample from \(p({\textbf {y}} ; {\varvec{\theta }})\).

Using (13), (14) and (15) in (11), we have:

$$\begin{aligned} {\textbf {n}} \widetilde{{\varvec{\alpha }}}&> \log ({\textbf {V}} /({{\varvec{\mu }}} \widetilde{{\textbf {V}} }({{\varvec{\delta }}}/4)))/\widetilde{{\textbf {V}} }({{\varvec{\delta }}}/2) \\&= \big [ \log ({\textbf {V}} /{{\varvec{\mu }}}{{\varvec{\eta }}}_{d}{({{\varvec{\lambda }}}_{2}{{\varvec{\epsilon }}}/16)}^{\textbf {d}} ) \big ]/{{\varvec{\eta }}}_{d}{({{\varvec{\lambda }}}_{2}{{\varvec{\epsilon }}}/8)}^{\textbf {d}} \end{aligned}$$
(16)
$$\begin{aligned} {\textbf {n}}&> (1/\widetilde{{\varvec{\alpha }}})\big [ \log ({\textbf {V}} /{{\varvec{\mu }}}{{\varvec{\eta }}}_{d}{({{\varvec{\lambda }}}_{2}{{\varvec{\epsilon }}}/16)}^{\textbf {d}} ) \big ]/{{\varvec{\eta }}}_{d}{({{\varvec{\lambda }}}_{2}{{\varvec{\epsilon }}}/8)}^{\textbf {d}} \\&= {\textbf {n}} _{0} \end{aligned}$$
(17)

where \({\textbf {n}} _{0} = (1/\widetilde{{\varvec{\alpha }}})\big [ \log ({\textbf {V}} /{{\varvec{\mu }}}{{\varvec{\eta }}}_{d}{({{\varvec{\lambda }}}_{2}{{\varvec{\epsilon }}}/16)}^{\textbf {d}} ) \big ]/{{\varvec{\eta }}}_{d}{({{\varvec{\lambda }}}_{2}{{\varvec{\epsilon }}}/8)}^{\textbf {d}}\), is the condition which ensures that (12) is satisfied.

Thus we have an adequate threshold for the size of the batch data set \({\mathcal {B}}\) which ensures (17) is satisfied for the \({{\varvec{\epsilon }}}\)-rule. We can derive a similar threshold for the \({{\textbf {K}} }\)-rule, observing that there is a direct one-to-one mapping between \({{\textbf {K}} }\) and \({{\varvec{\epsilon }}}\) (See Sect. “Non-linear dimensionality reduction” for more details).

To complete the proof, we observe that (12) implies that \({\textbf {d}} _{\text {G}}({{\textbf {x}} ,{\textbf {y}} })\), the graph based distance between points \({\textbf {x}}\), \({\textbf {y}} \in {{\textbf {G}} }\) is a perturbed version of \({\textbf {d}} _{\text {M}}({{\textbf {x}} , {\textbf {y}} })\), the true Euclidean distance between points \({\textbf {x}}\) and \({\textbf {y}}\) in \({\mathbb {R}}^{\textbf {d}}\). Let \(\widetilde{{\textbf {D}} }_{\text {M}}\) and \(\widetilde{{\textbf {D}} }_{\text {G}}\) represent the squared distance matrix corresponding to \({\textbf {d}} _{\text {M}}({{\textbf {x}} ,{\textbf {y}} })\) and \({\textbf {d}} _{\text {G}}({{\textbf {x}} ,{\textbf {y}} })\) respectively. Thus we have \(\widetilde{{\textbf {D}} }_{\text {G}} = \widetilde{{\textbf {D}} }_{\text {M}}\) \(+\) \(\Delta \widetilde{{\textbf {D}} }_{\text {M}}\) where \(\Delta \widetilde{{\textbf {D}} }_{\text {M}} = \{ \Delta \widetilde{{\textbf {d}} }_{\text {M}}({\textbf {i}} , {\textbf {j}} )\}_{1 \le i, j \le n}\) and \(\Delta \widetilde{{\textbf {d}} }_{\text {M}}({\textbf {i}} , {\textbf {j}} )\) are bounded due to (12).

In the past [44], the robustness of MDS to small perturbations was demonstrated as follows. Let \({\textbf {F}}\) represent the zero-diagonal symmetric matrix which perturbs the true squared distance matrix \({{\textbf {B}} }\) to \({{\textbf {B}} } + {\varvec{\Delta }}{{\textbf {B}} } = {{\textbf {B}} } + {{\varvec{\epsilon }}}{\textbf {F}}\). Then the Procrustes Error between the embeddings uncovered by MDS for \({{\textbf {B}} }\) and for \({{\textbf {B}} } + {\varvec{\Delta }}{{\textbf {B}} }\) is given by \(\frac{{{\varvec{\epsilon }}}^2}{4}\sum \nolimits _{j,k} \frac{({{\textbf {e}} }_{j}^{T}{\textbf {F}} {{\textbf {e}} }_{k})^{2}}{{{\varvec{\lambda }}}_{j} + {{\varvec{\lambda }}}_{k}}\), which is very small for small entries \(\{ {\textbf {f}} _{i,j} \}_{1 \le i, j \le n}\) of \({\textbf {F}}\). Here, \(\{{\textbf {e}} _k ({{\varvec{\lambda }}}_k)\}_{k = 1 \ldots n}\) represent the eigenvectors (eigenvalues) of \({{\textbf {B}} }\), and the double summation is over pairs \(({\textbf {j}}, {\textbf {k}}) = 1,2,\ldots ({\textbf {n}}-1)\), excluding those pairs \(({\textbf {j}}, {\textbf {k}})\) both of whose entries lie in the range \(({{\textbf {K}} }+1), ({{\textbf {K}} }+2), \ldots ({\textbf {n}}-1)\), where \({{\textbf {K}} } = \sum \nolimits _{k=1}^{n} {\mathcal {I}}({{\varvec{\lambda }}}_{k} > 0)\) and \({\mathcal {I}}(\varvec{\cdot })\) is the indicator function. We substitute \({{\varvec{\epsilon }}} = 1\) and replace \({{\textbf {B}} }\) with \(\widetilde{{\textbf {D}} }_{\text {M}}\) and \({\varvec{\Delta }}{{\textbf {B}} }\) with \({\varvec{\Delta }}\widetilde{{\textbf {D}} }_{\text {M}}\) above to complete the proof, since the entries of \({\varvec{\Delta }}\widetilde{{\textbf {D}} }_{\text {M}}\) are very small, i.e., \(\{ 0 \le {\varvec{\Delta }}{\textbf {d}} _{\text {M}}(i, j) \le {{\varvec{\lambda }}}^2 \}_{1 \le i, j \le n}\) where \({{\varvec{\lambda }}} = \max ({{\varvec{\lambda }}}_1, {{\varvec{\lambda }}}_2)\) for small \({{\varvec{\lambda }}}_1\), \({{\varvec{\lambda }}}_2\), given the condition \({\textbf {n}} > {\textbf {n}} _{0}\) is satisfied for (12). Thus we have that the embedding uncovered by S-Isomap for a batch data set \({\mathcal {B}}\), where \(\left| {\mathcal {B}}\right| = {\textbf {n}} > {\textbf {n}} _{0}\), converges asymptotically to the true embedding up to translation, rotation and scaling factors. \(\square\)

Extension to the multi-manifold setting

The above proof can be extended to show the convergence of the S-Isomap++ [6] algorithm, described in Sect. “Handling multiple manifolds”, as follows.

Corollary 1

The batch phase of the S-Isomap++ algorithm converges under appropriate conditions.

Proof Similar to the proof for the S-Isomap algorithm, we consider a corresponding setting for the multi-manifold scenario, wherein we are attempting to learn the inverse mappings \({{\varvec{\phi }}}(\varvec{\cdot })_{i = 1, \ldots , {\textbf {p}}}^{-1}\) for each of the \({\textbf {p}}\) manifolds. The initial clustering step of the S-Isomap++ algorithm separates the samples from the batch data set \({\mathcal {B}}\) into different individual clusters \({\mathcal {B}}_{i}\), such that each cluster is mutually exclusive of the others and corresponds to one of the multiple manifolds present in the data, i.e., \(\bigcup \limits _{i=1}^{\textbf {p}} {\mathcal {B}}_{i} = {\mathcal {B}}\) and \({\mathcal {B}}_{i} \cap {\mathcal {B}}_{j} = \emptyset\), \(\forall i \ne j\).

The intuition for clustering and subsequently processing each of the clusters separately is based on the setting described above, in which the observed data was generated by first sampling points from multiple \({\textbf {U}} _{i = 1, \ldots , {\textbf {p}}}\), i.e., convex domains in \({\mathbb {R}}^{\textbf {d}}\) (see Footnote 8), and subsequently mapping those points in a non-linear fashion, using possibly different \({{\varvec{\phi }}}(\varvec{\cdot })_{i = 1, \ldots , {\textbf {p}}}\), to \({\mathcal {B}}\in {\mathbb {R}}^{\textbf {D}}\). Thus, to learn the different inverse mappings effectively, there is a need to be able to cluster the data appropriately.

After the initial clustering step, a similar analysis as in Theorem 1 provides thresholds \({\textbf {n}}_i, \forall i \in \{1, \ldots , {\textbf {p}}\}\) for each of the \({\textbf {p}}\) clusters beyond which when \(\left| {\mathcal {B}}_{i}\right| = {\textbf {n}} \ge {\textbf {n}} _{i}\), the Procrustes Error \({{\varvec{\epsilon }}}_{\text {Proc}}\big ({{\varvec{\tau }}}_{{\mathcal {B}}_{i}}\), \({{\varvec{\tau }}}_{\text {ISO}_{i}}\big )\) between \({{\varvec{\tau }}}_{{\mathcal {B}}_{i}} = {{\varvec{\phi }}}_{i}^{-1}\big ({\mathcal {B}}_{i}\big )\), the true underlying representation and \({{\varvec{\tau }}}_{{\text {ISO}}_{i}}= \hat{{\varvec{\phi }}}_{i}^{-1}\big ({\mathcal {B}}_{i}\big )\), the embedding uncovered by Isomap is small (\({{\varvec{\epsilon }}}_{\text {Proc}}\approx 0\)) i.e. the batch phase of the S-Isomap++ algorithm converges provided each of the \({\textbf {p}}\) clusters \({\mathcal {B}}_{i = 1, \ldots , {\textbf {p}}}\) exceeds the appropriate threshold \({\textbf {n}}_i\) (similar to (17) above). \(\square\)

The S-Isomap++ algorithm does not assume that the number of manifolds (\({\textbf {p}}\)) is specified. Refer to Sect. “Handling multiple manifolds” for more details.

Methodology

The proposed GP-Isomap algorithm follows a two-phase strategy (similar to the S-Isomap and S-Isomap++), where exact manifolds are learnt from an initial batch \({\mathcal {B}}\), and subsequently a computationally inexpensive mapping procedure processes the remainder of the stream. To handle multiple manifolds, the batch data \({\mathcal {B}}\) is first clustered via manifold tangent clustering or other standard techniques. Exact Isomap is applied on each cluster. The resulting low-dimensional data for the clusters is then “stitched” together to obtain the low-dimensional representation of the input data. The difference from the past methods is the mapping procedure which uses GPR to obtain the predictions for the low-dimensional mapping (see (7)). At the same time, the associated predictive variance (see (8)) is used to detect changes in the underlying distribution.

The overall GP-Isomap algorithm is outlined in Algorithm 1 and takes a batch data set, \({\mathcal {B}}\), and the streaming data, \({\mathcal {S}}\), as inputs, along with other parameters. The processing is split into two phases: a batch learning phase (Lines 1–15) and a streaming phase (Lines 16–32), which are described later in this section.

Algorithm 1: GP-Isomap

Kernel function

The key innovation here is to use a manifold-specific kernel matrix in the GPR method. The matrix \({{\textbf {B}} }\), which is the inner product matrix between the points in the low-dimensional space (see Sect. “Non-linear dimensionality reduction”), could be a reasonable starting point. However, as past researchers have shown [16], typical kernels, such as the squared exponential kernel, can only be generalized to a positive definite kernel on a geodesic metric space if the space is flat. Thus \({{\textbf {B}} }\) will not necessarily yield a valid positive semi-definite kernel matrix. However, a result by [32] shows that a small positive constant, \({{\varvec{\lambda }}}_{\text {max}}\), can be added to \({{\textbf {B}} }\) to guarantee that it will be PSD. This constant can be calculated as the largest eigenvalue of the matrix:

$$\begin{aligned} {\textbf {M}} = \begin{bmatrix} 0 & 2{{\textbf {B}} } \\ -{{\textbf {I}} } & -4{\textbf {P}} \end{bmatrix} \end{aligned}$$
(18)

where \({\textbf {P}} = -{{\textbf {H}} }{{\textbf {G}} }{{\textbf {H}} }/2\). Here, \({{\textbf {G}} }\) is the geodesic distance matrix and \({{\textbf {H}} } = \{ {{\textbf {h}} }_{i,j} \}_{1 \le i, j \le n}\), \({{\textbf {h}} }_{i,j} = {{\varvec{\delta }}}_{i,j} - 1/{n}\), where \({{\varvec{\delta }}}_{i,j}\) is the Kronecker delta. \(\widetilde{{{\textbf {B}} }}\) can be derived from \({{\textbf {B}} }\) as [32]:

$$\begin{aligned} \widetilde{{{\textbf {B}} }} = {{\textbf {B}} } + 2{{\varvec{\lambda }}}_{\text {max}} {\textbf {P}} + \frac{1}{2}{{\varvec{\lambda }}}_{\text {max}} ^2{{\textbf {H}} } \end{aligned}$$
(19)

where \({{\varvec{\lambda }}}_{\text {max}}\) is the largest eigenvalue of \({\textbf {M}}\).

The proposed GP-Isomap algorithm uses a novel geodesic distance based kernel function defined as:

$$\begin{aligned} k({\textbf {y}} _i,{\textbf {y}} _j) = {{\varvec{\sigma }}}^2_s\exp \left( -\frac{\widetilde{{\textbf {b}} }_{i,j}}{2{{\varvec{\ell }}}^2}\right) \end{aligned}$$
(20)

where \(\widetilde{{\textbf {b}} }_{i,j}\) is the \({ij}^{th}\) entry of the matrix \(\widetilde{{{\textbf {B}} }}\), \({{\varvec{\sigma }}}^2_s\) is the signal variance (whose value we fix as 1 in this work) and \({{\varvec{\ell }}}\) is the length scale hyper-parameter. Thus the kernel matrix \({{\textbf {K}} }\) can be written as:

$$\begin{aligned} {{\textbf {K}} } = \exp {\left( -\frac{\widetilde{{{\textbf {B}} }}}{2{{\varvec{\ell }}}^2}\right) } \end{aligned}$$
(21)

This kernel function plays a key role in using the GPR model for mapping streaming points on the learnt manifold, by measuring similarity along the low-dimensional manifold, instead of the original space (\({\mathbb {R}}^D\)), as is typically done in GPR based solutions.

The matrix \(\widetilde{{{\textbf {B}} }}\) is positive semi-definite. Consequently, we note that the kernel matrix, \({{\textbf {K}} }\), is positive definite (see (22) below).
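A minimal sketch of constructing this kernel matrix from a geodesic distance matrix, following (18)–(21) with \({{\varvec{\sigma }}}^2_s = 1\), is shown below; the function and variable names are illustrative, and the largest eigenvalue of the non-symmetric matrix \({\textbf {M}}\) is approximated by the maximum real part returned by the eigensolver.

```python
# Minimal sketch of the geodesic-distance-based kernel of Eqs. (18)-(21),
# where G is the n x n geodesic distance matrix of the batch and ell is
# the length-scale hyper-parameter.
import numpy as np

def geodesic_kernel(G, ell):
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (G ** 2) @ H                  # inner products from squared geodesics
    P = -0.5 * H @ G @ H
    M = np.block([[np.zeros((n, n)), 2 * B],
                  [-np.eye(n),       -4 * P]])   # Eq. (18)
    lam_max = np.max(np.linalg.eigvals(M).real)  # additive constant (ACP)
    B_tilde = B + 2 * lam_max * P + 0.5 * lam_max ** 2 * H   # Eq. (19)
    return np.exp(-B_tilde / (2 * ell ** 2))     # Eq. (21), element-wise
```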

Using the eigendecomposition of \(\widetilde{{{\textbf {B}} }}\), the novel kernel we propose can be written as

$$\begin{aligned} {{\textbf {K}} }\big ( {\textbf {x}} , {\textbf {y}} \big ) = {{\textbf {I}} } + \sum \limits _{i=1}^{d} \big [ \exp {\left( -\frac{{{\varvec{\lambda }}}_i}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ]{{{\textbf {q}} }_i}{{{\textbf {q}} }_i^T} = {{\textbf {I}} } + {\textbf {Q}} \widetilde{\varvec{\Lambda }}{\textbf {Q}} ^{T} \end{aligned}$$
(22)

where \(\widetilde{\varvec{\Lambda }} = \begin{bmatrix} \big [ \exp {\left( -\frac{{{\varvec{\lambda }}}_1}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ] & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \big [ \exp {\left( -\frac{{{\varvec{\lambda }}}_d}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ] \end{bmatrix}\) and \(\{{{\varvec{\lambda }}}_i, {{\textbf {q}} }_i \}_{i = 1 \ldots d}\) are eigenvalue/eigenvector pairs of \(\widetilde{{{\textbf {B}} }}\), as discussed in Sect. “Non-linear dimensionality reduction”.

Batch learning

The batch learning phase consists of these tasks:

i) Clustering: The first step in the batch phase involves clustering of the batch data set \({\mathcal {B}}\) into \({\textbf {p}}\) individual clusters which represent the manifolds (Line 1). In case \({\mathcal {B}}\) contains a single cluster, the algorithm can correctly detect it. Refer to Sect. “Handling multiple manifolds” for more details.

ii) Dimensionality reduction: Subsequently, full Isomap is executed on each of the \({\textbf {p}}\) individual clusters to get low-dimensional representations \({\mathcal {LDE}}_{i=1,2 \ldots {\textbf {p}}}\) of the data points belonging to each individual cluster (Lines 3–5).

iii) Hyper-parameter estimation: The geodesic distance matrix for the points in the \({{\varvec{i}}}^{\text {th}}\) manifold \({{{\mathcal {G}}}_{i}}\) and the corresponding low-dimensional representation \({{\mathcal {LDE}}_{i}}\) are fed to the GP model for each of the \({\textbf {p}}\) manifolds, to perform hyper-parameter estimation, which outputs \(\{{{\varvec{\phi }}}_{i}^{GP} \}_{i = 1,2 \ldots {\textbf {p}}}\) (Lines 6–8).

iv) Learning mapping to global space: The low-dimensional embeddings uncovered for each of the manifolds can be of different dimensionalities. Consequently, a mapping to a unified global space is needed. To learn this mapping, a support set \({\varvec{\xi }}_{s}\) is formulated, which contains the \({\textbf {k}}\) pairs of nearest points and \({\textbf {l}}\) pairs of farthest points between each pair of manifolds. Subsequently, MDS is executed on this support set \({\varvec{\xi }}_{s}\) to uncover its low-dimensional representation \({\mathcal{G}\mathcal{E}}_{s}\). Individual scaling and translation factors \(\{ {{\mathcal {R}}}_{i}, {t}_{i} \}_{i = 1,2 \ldots {\textbf {p}}}\), which map points from each of the individual manifolds to the global space, are learnt by solving a least squares problem involving \({\varvec{\xi }}_{s}\) (Lines 9–15).

Stream processing

In the streaming phase, each sample \({\textbf {s}}\) in the stream set \({\mathcal {S}}\) is embedded using each of the \({\textbf {p}}\) GP models to evaluate the prediction \({{\varvec{\mu }}}_{i}\), along with the variance \({{\varvec{\sigma }}}_{i}\) (Lines 22–24). The manifold with the smallest variance gets chosen to embed the sample \({\textbf {s}}\) into, using the corresponding scaling \({{\mathcal {R}}}_{j}\) and translation factor \({t}_{j}\), provided \({ min_i} \left| {{\varvec{\sigma }}}_{i} \right|\) is within the allowed threshold \({{\varvec{\sigma }}}_{t}\) (Lines 25–28); otherwise, sample \({{\varvec{s}}}\) is added to the unassigned set \({{\mathcal {S}}}_{u}\) (Lines 29–31). When the size of the unassigned set \({{\mathcal {S}}}_{u}\) exceeds a certain threshold \({\textbf {n}}_{s}\), we add its points to the batch data set and re-learn the base manifold (Lines 18–20). The assimilation of the new points in the batch may be done more efficiently in an incremental manner.
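A minimal sketch of this per-sample decision rule is shown below. The `gp_models`, `R` and `t` objects are placeholders for the per-manifold GP models and the scaling/translation factors learnt in the batch phase, and the `predict` interface is an assumption made purely for illustration.

```python
# Minimal sketch of the streaming-phase decision rule: assign the sample to
# the manifold with the smallest predictive variance if that variance is
# within sigma_t, otherwise buffer it in the unassigned set S_u.
import numpy as np

def process_stream_point(s, gp_models, R, t, sigma_t, S_u):
    preds = [gp.predict(s) for gp in gp_models]     # (mean, variance) per manifold
    j = int(np.argmin([var for _, var in preds]))
    mean_j, var_j = preds[j]
    if abs(var_j) <= sigma_t:
        return R[j] @ mean_j + t[j], S_u            # embed into the global space
    S_u.append(s)                                   # possible drift/shift: defer
    return None, S_u
```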

Complexity

The runtime complexity of our proposed algorithm is dominated by the GP regression step as well as the Isomap execution step, both of which have \({\mathcal {O}}(n^3)\) complexity, where n is the size of the batch data set \({\mathcal {B}}\). This is similar to the S-Isomap and S-Isomap++ algorithms, which also have a runtime complexity of \({\mathcal {O}}(n^3)\). The stream processing step is \({\mathcal {O}}(n)\) for each incoming streaming point. The space complexity of GP-Isomap is \({\mathcal {O}}(n^2)\). Since each of the samples of the stream set \({\mathcal {S}}\) gets processed separately, the space requirement as well as the runtime complexity do not grow with the size of the stream, which makes the algorithm appealing for handling high-volume streams.

Theoretical analysis

We first state the main result for the single manifold case and prove it using results in the Appendix; we then present proofs for the multi-manifold case.

Theorem 2

For a single manifold setting, the prediction \({{\varvec{\tau }}}_{\text {GP}}\) of GP-Isomap is equivalent to the prediction \({{\varvec{\tau }}}_{\text {ISO}}\) of S-Isomap, i.e., the Procrustes Error \({{\varvec{\epsilon }}} _{\text {Proc}}\big ({{\varvec{\tau }}}_{\text {GP}}\), \({{\varvec{\tau }}}_{\text {ISO}}\big )\) between \({{\varvec{\tau }}}_{\text {GP}}\) and \({{\varvec{\tau }}}_{\text {ISO}}\) is 0.


Proof The prediction of GP-Isomap is given by (10). Using Lemma 5, we note that

$$\begin{aligned} {{\varvec{\beta }}} = \Big \{ \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1}{1 + {{\varvec{\alpha }}}{{\textbf {c}} }_1} \;\; \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_2{{\textbf {q}} }_2}{1 + {{\varvec{\alpha }}}{{\textbf {c}} }_2} \;\; \ldots \;\; \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_d{{\textbf {q}} }_d}{1 + {{\varvec{\alpha }}}{{\textbf {c}} }_d} \Big \} \end{aligned}$$
(23)

The term \({{\textbf {K}} _{*}}\) for GP-Isomap, using the novel kernel function evaluates to:

$$\begin{aligned} {{\textbf {K}} _{*}} = \exp {\left( -\frac{{{\textbf {G}} }_{*}^2}{2{{\varvec{\ell }}}^2}\right) } \end{aligned}$$
(24)

where \({{\textbf {G}} }_{*}^2\) represents the vector containing the squared geodesic distances of \({\textbf {x}}_{n+1}\) to \({\textbf {X}}\) containing \(\{ {\textbf {x}} _i \}_{i = 1,2 \ldots n}\).

Considering the above equation element-wise, we note that the \({\textbf {i}} ^{\text {th}}\) term of \({{\textbf {K}} _{*}}\) equates to \(\exp {\left[ -\frac{{{\textbf {g}} }_{i,n+1}^2}{2{{\varvec{\ell }}}^2}\right] }\). Using Taylor’s series expansion we have,

$$\begin{aligned} \exp {\left[ -\frac{{{\textbf {g}} }_{i,n+1}^2}{2{{\varvec{\ell }}}^2}\right] } \simeq \big (1 -\frac{{{\textbf {g}} }_{i,n+1}^2}{2{{\varvec{\ell }}}^2}\big ) \text{ for } \text{ large } {{\varvec{\ell }}} \end{aligned}$$
(25)

The prediction by the S-Isomap is given by (4) as follows:

$$\begin{aligned} {{\varvec{\tau }}}_{\text {ISO}} = \{ \sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1^T{\textbf {f}} \; \sqrt{{\varvec{\lambda }}}_2{{\textbf {q}} }_2^T{\textbf {f}} \; \ldots \; \sqrt{{\varvec{\lambda }}}_d{{\textbf {q}} }_d^T{\textbf {f}} \}^T \end{aligned}$$
(26)

where \({\textbf {f}} = \{ {\textbf {f}} _i\}\) is as defined by (4).

Rewriting (4) we have:

$$\begin{aligned} {{\textbf {f}} _i} \simeq \frac{1}{2} \big ({\varvec{\gamma }} - {{{\textbf {g}} }_{i,n+1}^2}\big ) \end{aligned}$$
(27)

where \({\varvec{\gamma }} = \big (\frac{1}{n}\sum \limits _{j}{{{\textbf {g}} }_{i,j}^2} \big )\) is a constant with respect to \({\textbf {x}} _{n+1}\), since it depends only on squared geodesic distance values associated within the batch data set \({\mathcal {B}}\) and \({\textbf {x}} _{n+1}\) is part of the stream data set \({\mathcal {S}}\).

We now consider the \({1}^{\text {st}}\) dimension of the predictions for GP-Isomap and S-Isomap only and demonstrate their equivalence via Procrustes Error. The analysis for the remaining dimensions follows a similar line of reasoning.

Thus for the \({1}^{\text {st}}\) dimension, using (27) the S-Isomap prediction is:

$$\begin{aligned} \begin{aligned} {{\varvec{\tau }}}_{\text {ISO}_{1}}&= \sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1^T{\textbf {f}} \\&= \sqrt{{\varvec{\lambda }}}_1\sum \limits _{i=1}^{n} {{{\textbf {q}} }_{1,i}} \big (\frac{1}{2} \big ({\varvec{\gamma }} - {{{\textbf {g}} }_{i,n+1}^2}\big )\big )\\&= \frac{\sqrt{{\varvec{\lambda }}}_1}{2} \sum \limits _{i=1}^{n} {{{\textbf {q}} }_{1,i}} \big ({\varvec{\gamma }} - {{{\textbf {g}} }_{i,n+1}^2}\big )\\ \end{aligned} \end{aligned}$$
(28)

Similarly using Lemma 5, (24) and (25), we have that the \({\textbf {1}} ^{\text {st}}\) dimension for GP-Isomap prediction is given by,

$$\begin{aligned} \begin{aligned} {{\varvec{\tau }}}_{\text {GP}_{1}}&= \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1^T}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}}} {{\textbf {K}} _{*}} \\&= \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_1}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}} \sum \limits _{i=1}^{n} {{{\textbf {q}} }_{1,i}} \big (1 - \frac{{{{\textbf {g}} }_{i,n+1}^2}}{2{{\varvec{\ell }}}^2}\big )\\ \end{aligned} \end{aligned}$$
(29)

We can observe that \({{\varvec{\tau }}}_{\text {GP}_{1}}\) is a scaled and translated version of \({{\varvec{\tau }}}_{\text {ISO}_{1}}\). Similarly, for each of the dimensions (\({1} \le {i} \le {d}\)), the prediction of GP-Isomap \({{\varvec{\tau }}}_{\text {GP}_{i}}\) can be shown to be a scaled and translated version of the prediction of S-Isomap \({{\varvec{\tau }}}_{\text {ISO}_{i}}\). These individual scaling \({\textbf {s}} _i\) and translation \({{\textbf {t}} }_i\) factors can be represented together by single collective scaling \({\textbf {s}}\) and translation \({{\textbf {t}} }\) factors. Consequently, the Procrustes Error \({{\varvec{\epsilon }}}_{\text {Proc}} \big ({{\varvec{\tau }}}_{\text {GP}}\), \({{\varvec{\tau }}}_{\text {ISO}}\big )\) is 0 (see Sect. “Non-linear dimensionality reduction”). \(\square\)

Results and analysis

In this section, we demonstrate the performance of the proposed algorithm on both synthetic and real-world data sets. In Sect. “Results on synthetic data sets”, we present results for synthetic data sets, whereas Sect. “Results on sensor data set” contains results on benchmark sensor data sets. All experiments were done using Python 3.0 implementations of the proposed and related methods, and were run on a MacBook Pro (2.8 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3). Our results demonstrate that: (i). GP-Isomap is able to perform good quality dimension reduction on a manifold, (ii). the reduction produced by GP-Isomap is equivalent to the corresponding output of S-Isomap (or S-Isomap++), and (iii). the predictive variance within GP-Isomap is able to identify changes in the underlying distribution in the data stream on all data sets considered in this paper.

In the interest of space, we avoid comparing the quality of the dimensionality reduction with other methods, and refer readers to S-Isomap and S-Isomap++ where these equivalent methods were shown to be better than existing approaches for dimensionality reduction.

GP-Isomap has the following hyper-parameters: \(\epsilon\), k, l, \(\lambda\), \(\sigma _t\), \(n_s\). We set k, l, \(\lambda\) to have values of 16, 1 and 0.005, respectively. We study the effect of \(\sigma _t\) and \(n_s\) using the different data sets listed in Sects. “Results on synthetic data sets” and “Results on sensor data set” respectively.

Results on synthetic data sets

Swiss roll data sets are typically used for evaluating manifold learning algorithms. To evaluate our method on concept drift, we use the Euler Isometric Swiss Roll data set [5] consisting of four \({\mathbb {R}}^{2}\) Gaussian patches having \(n=2000\) points each, chosen at random, which are embedded into \({\mathbb {R}}^{3}\) using a non-linear function \({\varvec{\psi }}(\cdot )\). The points for each of the Gaussian modes were divided equally into training and test sets randomly. To test incremental concept drift, we use one of the training data sets from the above data set, along with a uniform distribution of points for testing (refer to Fig. 1 for details). Figures 2a and 3a demonstrate our results on this data set.
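For readers who wish to reproduce a similar setup, the following is a minimal sketch of generating Gaussian patches in \({\mathbb {R}}^{2}\) and lifting them to \({\mathbb {R}}^{3}\) with a swiss-roll-style transform; the patch centers and the transform \({\varvec{\psi }}(\cdot )\) used here are illustrative assumptions, not the exact construction of [5].

```python
# Minimal sketch of swiss-roll-style synthetic data: 2-D Gaussian patches
# mapped into R^3 with a swiss-roll transform (illustrative only).
import numpy as np

def swiss_roll_patches(n_per_patch=2000, seed=0):
    rng = np.random.default_rng(seed)
    centers = [(7, 4), (10, 10), (13, 4), (10, 16)]   # hypothetical 2-D modes
    low_dim, high_dim = [], []
    for cx, cy in centers:
        u = rng.normal([cx, cy], 1.0, size=(n_per_patch, 2))
        x, y = u[:, 0], u[:, 1]
        # psi(u): lift the 2-D points onto a swiss roll in R^3
        high_dim.append(np.column_stack([x * np.cos(x), y, x * np.sin(x)]))
        low_dim.append(u)
    return np.vstack(low_dim), np.vstack(high_dim)
```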

Fig. 2

Using variance to detect concept drift for different data sets. The x-axis represents time and the y-axis represents the model’s predictive variance for the stream. Initially, when the stream consists of samples generated from known modes, the variance is low. Later, when samples from an unrecognized mode appear, the variance drastically shoots up. For the first two data sets, noisy instances in the initial part get assigned a large variance, sporadically. The variance is well-behaved for the third data set. The optimal values of the hyper-parameters, \(n_s\) and \(\sigma _t\), were set to (1000, 0.7), (412, 1.2), (855, 0.5), for the three data sets

Fig. 3

Low dimensional representations uncovered by GP-Isomap for three different data sets. For the Swiss roll data, GP-Isomap is able to learn the structure in the data using a 2-D reduction, while for the real-world sensor data sets, the structure is not evident in 2-D and possibly a higher dimensional manifold is required

To evaluate our method on sudden concept drift, we trained our GP-Isomap model using the first three out of four training sets of the Euler Isometric Swiss Roll data set. Subsequently we stream points randomly from the test sets from only the first three classes initially and later stream points from the test set of the fourth class, keeping track of the predictive variance all the while. Figure 2a demonstrates the sudden increase (see red line) in the variance of the stream when streaming points are from the fourth class i.e. unknown mode. Thus GP-Isomap is able to detect concept drift correctly, and is able to map all of the data points correctly on the lower dimensional manifold, as shown in Fig. 3a. The bottom panel of Fig. 1 demonstrates the performance of S-Isomap++ on this data set. It fails to map the streaming points of the unknown mode correctly, given it had not encountered the unknown mode during the batch training phase.

In Sect. “Theoretical analysis”, we proved the equivalence between the prediction of S-Isomap and that of GP-Isomap, using our novel kernel. In Fig. 4, we show empirically, via the Procrustes Error (PE), that the prediction of S-Isomap matches that of GP-Isomap, irrespective of the size of the batch used. The PE for GP-Isomap with the Euclidean distance based kernel remains high irrespective of the size of the batch, which clearly demonstrates the unsuitability of this kernel to adequately learn mappings in the low-dimensional space.

Fig. 4

Procrustes error (PE) between the ground truth and a GP-Isomap (blue line) with the geodesic distance based kernel, b S-Isomap (dashed blue line with dots) and c GP-Isomap (green line) using the Euclidean distance based kernel, for different fractions (f) of data used in the batch \({\mathcal {B}}\). The behavior of PE for a closely matches that for b. However, the PE for GP-Isomap using the Euclidean distance kernel remains high irrespective of f, demonstrating its unsuitability for manifolds

Results on sensor data set

In this section, we present results from different benchmark sensor data sets to demonstrate the efficacy of our algorithm.

Results on gas sensor array drift data set

The Gas Sensor Array Drift [45] data set is a benchmark data set (\(n = 13910\)) made available to the research community to develop strategies for dealing with concept drift; it contains measurements from 16 chemical sensors used to discriminate between 6 gases (class labels) at various concentrations. We demonstrate the performance of our proposed method on this data set.

We first remove instances which had invalid/empty entries as feature values. Subsequently the data is mean normalized. Data points from the first five classes were divided into training and test sets. We train our model using the training data from four out of these five classes. While testing, we stream points randomly from the test sets of these four classes first and later stream points from the test set of the fifth class. Figures 2b and 3b demonstrate our results on this data set. From Fig. 2b, we observe that our model can clearly detect concept drift due to the unknown fifth class by tracking the variance of the stream, using the running average (red line). However, as shown in Fig. 3b, a two-dimensional manifold is not sufficient to capture the cluster structure in the data set.

Results on human activity recognition (HAR) data set

The Human Activity Recognition [46] data set consists of multiple data sets which are focused on discriminating between different activities, i.e., predicting which activity was performed at a specific point in time. In this work, we focused on the Weight Lifting Exercises (WLE) data set (\(n = 39242\)), which investigates how well an activity was performed by the wearer of different sensor devices. The WLE data set consists of six young healthy participants who performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.

The data set was cleaned by removing instances with invalid or empty entries. The data points from the different classes were then mean-normalized and divided into training and test sets. Figures 2c and 3c show our results on this data set; Fig. 2c demonstrates the concept drift phenomenon. Following the same methodology used earlier to detect concept drift, we trained our algorithm on instances from only the latter four classes; during the streaming phase, we first randomly selected instances from the streaming sets of these four classes and later streamed points from the first class, tracking the predictive variance throughout.

Conclusions

We have proposed a streaming Isomap algorithm (GP-Isomap) that learns non-linear, low-dimensional representations of high-dimensional data arriving in a streaming fashion. The algorithm can have significant applications in areas involving the analysis of high-dimensional data streams [47], especially in constrained environments such as real-time situational monitoring [48, 49], healthcare monitoring [50], and scientific data visualization [51]. We prove that using a GPR formulation to map incoming data instances onto an existing manifold is equivalent to using existing geometric strategies [5, 6]. Moreover, by utilizing a small batch for the initial learning of the manifold, as well as for training the GPR model, the method scales linearly with the size of the stream, ensuring its applicability to practical problems. The Bayesian inference of the GPR model allows us to estimate the variance associated with the mapping of each streaming instance, and this variance is shown to be a strong indicator of changes in the underlying stream properties on a variety of data sets. By utilizing the variance, one can devise re-training strategies, such as expanding the batch data set. While the experiments demonstrate the ability of GP-Isomap to detect sudden shifts in the underlying distribution, the algorithm can also be used to detect gradual drifts, as illustrated in Fig. 1. Finally, while we have focused on the Isomap algorithm in this paper, similar formulations can be applied to other NLDR methods such as LLE [2], Laplacian Eigenmaps [29], and Hessian Eigenmaps [30], and will be explored in future research.

Availability of data and materials

Not applicable. For any collaboration, please contact the authors.

Notes

  1. See Sect. “Problem statement and preliminaries” for the definition of manifold quality.

  2. An equivalence between GPLVM and Kernel Principal Component Analysis (KPCA) has been shown in the literature [19].

  3. There are two variants of Isomap. The first employs a \({{\textbf {K}} }\)-rule to define the neighborhood \({\mathcal {N}}({\textbf {y}} )\) of each point \({\textbf {y}} \in {\textbf {Y}}\), i.e., it considers the k-nearest neighbors of \({\textbf {y}}\) to be its neighborhood \({\mathcal {N}}({\textbf {y}} )\). The second employs an \({{\varvec{\epsilon }}}\)-rule, i.e., it considers all points within a radius \({{\varvec{\epsilon }}}\) of \({\textbf {y}}\) to be in its neighborhood \({\mathcal {N}}({\textbf {y}} )\). There is a direct one-to-one relationship between the two rules with regard to computing the neighborhood \({\mathcal {N}}({\textbf {y}} )\) for all \({\textbf {y}} \in {\textbf {Y}}\).

  4. Note that the Incremental Isomap algorithm [27] has a slightly different formulation where

    $$\begin{aligned} {{\textbf {f}} }_i \simeq \frac{1}{2} \Big (\frac{1}{n}\sum \limits _{j}{{{\textbf {g}} }_{i,j}^2} - \frac{1}{{n}^2}\sum \limits _{l,m}{{{\textbf {g}} }_{l,m}^2} \Big ) + \frac{1}{2}\Big (\frac{1}{n}\sum \limits _{j}{{{\textbf {g}} }_{j,n+1}^2} - {{{\textbf {g}} }_{i,n+1}^2}\Big ) \end{aligned}$$
    (3)

    where \({{\textbf {g}} }_{i,j}\) refers to the geodesic distance discussed in Sect. “Problem statement and preliminaries”. The S-Isomap algorithm assumes that the data stream is drawn from a uniformly sampled, unimodal distribution \(p({\textbf {x}} )\), and that both the stream \({{\mathcal {S}}}\) and the batch \({{\mathcal {B}}}\) data sets are generated from \(p({\textbf {x}} )\). Additionally, it assumes that the manifold has stabilized, i.e., \(\left| {{\mathcal {B}}}\right| = n\) is large enough. Under these assumptions, the term \(\big (\frac{1}{n}\sum \limits _{j}{{{\textbf {g}} }_{j,n+1}^2} - \frac{1}{{n}^2}\sum \limits _{l,m}{{{\textbf {g}} }_{l,m}^2}\big ) = {{\varvec{\epsilon }}} \simeq 0\) in (3), i.e., the expectation of squared geodesic distances for points in the batch data set \({{\mathcal {B}}}\) is close to that for points in the stream data set \({{\mathcal {S}}}\); the line of reasoning follows from [36]. Thus (3) simplifies to (4).

  5. In cases of uneven or low-density sampling, the clustering strategy discussed might generate many small clusters. In such cases, one can merge clusters [41] based on their affinity/closeness to make the cluster sizes reasonable.

  6. For vector-valued outputs, i.e., \({\textbf {x}} \in {\mathbb {R}}^d\), one can consider d independent models.

  7. While the typical notation for GPR models uses \({\textbf {X}}\) as inputs and \({\textbf {Y}}\) as outputs [7], we have reversed the notation to maintain consistency with rest of the paper.

  8. It is possible that the low-dimensional Euclidean space specific to each manifold is different, i.e., \({\textbf {U}} _{i}\) is a convex domain in \({\mathbb {R}}^{{\textbf {d}} _i}\), where \({\textbf {d}} _i \ne {\textbf {d}} _j\). However, we can imagine a scenario where we choose a global space \({\mathbb {R}}^{\textbf {d}}\), with \({\textbf {d}} = \sum _{i}{{\textbf {d}} _i}\), from which the different convex \({\textbf {U}} _{i}\) were sampled. Additionally, note that convexity is preserved by linear projections to higher-dimensional spaces; thus the convex domains \({\textbf {U}} _{i = 1, \ldots , p}\) remain convex in this new space.

References

  1. Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–23.

  2. Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290(5500):2323–6.

  3. Silva VD, Tenenbaum JB. Global versus local methods in nonlinear dimensionality reduction. NeurIPS. 2003;721–728.

  4. Wu Y, Chan KL. An extended isomap algorithm for learning multi-class manifold. In: Proceedings of 2004 International Conference on Machine Learning and Cybernetics. 2004:6;3429–3433.

  5. Schoeneman F, Mahapatra S, Chandola V, Napp N, Zola J. Error metrics for learning reliable manifolds from streaming data. In: SDM. 2017:750–758. SIAM

  6. Mahapatra S, Chandola V. S-isomap++: Multi manifold learning from streaming data. In: 2017 IEEE International Conference on Big Data (Big Data). 2017:716–725.

  7. Williams CK, Seeger M. Using the nyström method to speed up kernel machines. In: NeurIPS. 2001:682–688.

  8. Vitter JS. Random sampling with a reservoir. ACM Trans Math Softw (TOMS). 1985;11(1):37–57.

  9. Chaudhuri S, Motwani R, Narasayya V. On random sampling over joins. ACM SIGMOD Record. 1999;28:263–74.

  10. Domingos P, Hulten G. Mining high-speed data streams. Kdd. 2000;2:4.

  11. Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. J Comput Syst Sci. 1999;58(1):137–47.

  12. Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T. Optimal histograms with quality guarantees. VLDB. 1998;98:24–7.

  13. Datar M, Gionis A, Indyk P, Motwani R. Maintaining stream statistics over sliding windows. SIAM J Comput. 2002;31(6):1794–813.

  14. Choi H, Choi S. Kernel isomap. Electron Lett. 2004;40(25):1612–3.

  15. Xing W, Shah AA, Nair PB. Reduced dimensional gaussian process emulators of parametrized partial differential equations based on isomap. Proc Royal Soc A Math Phys Eng Sci. 2015;471(2174):20140697.

  16. Feragen A, Lauze F, Hauberg S. Geodesic exponential kernels: when curvature and linearity conflict. New Orleans: IEEE CVPR; 2015. p. 3032–42.

  17. Chapelle O, Haffner P, Vapnik VN. Support vector machines for histogram-based image classification. IEEE Trans Neural Netw. 1999;10(5):1055–64.

  18. Barkan O, Weill J, Averbuch A. Gaussian process regression for out-of-sample extension. 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing. 2016.

  19. Li P, Chen S. A review on gaussian process latent variable models. CAAI Trans Intell Technol. 2016;1(4):366–76.

  20. Lawrence ND. Gaussian process latent variable models for Visualisation of high dimensional data. NeurIPS. 2003;16:329–36.

  21. Titsias M, Lawrence ND. Bayesian gaussian process latent variable model. AISTATS. 2010;9:844–51.

  22. Henriksen A, Ward R. AdaOja: adaptive learning rates for streaming PCA. arXiv preprint arXiv:1905.12115. https://doi.org/10.48550/arXiv.1905.12115.

  23. Rani R, Khurana M, Kumar A, Kumar N. Big data dimensionality reduction techniques in IoT: review, applications and open research challenges. Cluster Computing. 2022.

  24. Kiarashinejad Y, Abdollahramezani S, Adibi A. Deep learning approach based on dimensionality reduction for designing electromagnetic nanostructures. npj Comput Mater. 2020;6(1):12.

  25. Balzano L, Chi Y, Lu YM. Streaming pca and subspace tracking: the missing data case. Proc IEEE. 2018;106(8):1293–310. https://doi.org/10.1109/JPROC.2018.2847041.

  26. Bengio Y, Paiement J-f, Vincent P, Delalleau O, Roux NL, Ouimet M. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. NeurIPS. 2004:177–184.

  27. Law MH, Jain AK. Incremental nonlinear dimensionality reduction by manifold learning. IEEE Trans Pattern Anal Mach Intell. 2006;28(3):377–91.

  28. Weinberger KQ, Packer B, Saul LK. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. AISTATS. 2005;2:6.

  29. Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering. NeurIPS, 2002:585–591.

  30. Donoho DL, Grimes C. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci. 2003;100(10):5591–6.

  31. Torgerson WS. Multidimensional scaling: I. theory and method. Psychometrika. 1952;17(4):401–19.

  32. Cailliez F. The analytical solution of the additive constant problem. Psychometrika. 1983;48(2):305–8.

  33. Dryden IL. Shape analysis. Wiley Stats Ref: Statistics reference online; 2014.

  34. Shawe-Taylor J, Williams CK. The stability of kernel principal components analysis and its relation to the process eigenspectrum. 2003:383–390.

  35. Ham JH, Lee DD, Mika S, Schölkopf B. A kernel view of the dimensionality reduction of manifolds. Dep Pap (ESE). 2004;93.

  36. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13–30.

  37. Polito M, Perona P. Grouping and dimensionality reduction by locally linear embedding. NeurIPS. 2002:1255–1262.

  38. Fan M, Qiao H, Zhang B, Zhang X. Isometric multi-manifold learning for feature extraction. In: 2012 IEEE 12th International Conference on Data Mining, 2012:241–250. IEEE

  39. Vidal R, Ma Y, Sastry S. Generalized principal component analysis (gpca). IEEE Trans Pattern Anal Mach Intell. 2005;27(12):1945–59.

  40. Little AV, Lee J, Jung Y-M, Maggioni M. Estimation of intrinsic dimensionality of samples from noisy low-dimensional manifolds in high dimensions with multiscale svd. In: 2009 IEEE/SP 15th Workshop on Statistical Signal Processing, 2009:85–88. IEEE.

  41. Comaniciu D, Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell. 2002;5:603–19.

  42. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surveys (CSUR). 1999;31(3):264–323.

  43. Bernstein M, De Silva V, Langford JC, Tenenbaum JB. Graph approximations to geodesics on embedded manifolds. Citeseer: Technical report. 2000.

  44. Sibson R. Studies in the robustness of multidimensional scaling: perturbational analysis of classical scaling. J Royal Stat Soc Series B Methodol. 1979;41(2):217–29.

  45. Vergara A, Vembu S, Ayhan T, Ryan MA, Homer ML, Huerta R. Chemical gas sensor drift compensation using classifier ensembles. Sens Actuators B Chem. 2012;166:320–9.

  46. Velloso E, Bulling A, Gellersen H, Ugulino W, Fuks H. Qualitative activity recognition of weight lifting exercises. In: Proceedings of the 4th Augmented Human International Conference. 2013:116–123. ACM

  47. Gomes HM, Read J, Bifet A, Barddal JP, Gama JA. Machine learning for streaming data: State of the art, challenges, and opportunities. SIGKDD Explor Newsl. 2019:6–22.

  48. Thudumu S, Branch P, Jin J, Singh JJ. A comprehensive survey of anomaly detection techniques for high dimensional big data. J Big Data. 2020;7(1):42.

  49. Fujiwara T, Chou J, Shilpika S, Xu P, Ren L, Ma K. An incremental dimensionality reduction method for visualizing streaming multidimensional data. IEEE Trans Visualization Comput Graphics. 2020;26(01):418–28.

  50. Gupta V, Mittal M. QRS complex detection using STFT, chaos analysis, and PCA in standard and real-time ECG databases. J Inst Eng India Series B. 2019;100(5):489–97. https://doi.org/10.1007/s40031-019-00398-9.

  51. Dorier M, Wang Z, Ayachit U, Snyder S, Ross R, Parashar M. Colza: Enabling elastic in situ visualization for high-performance computing simulations. In: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2022;538–548.

  52. Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in C++. Art Scientific Computi. 1992;2:1002.

Acknowledgements

Access to computing facilities was provided by the University of Buffalo Center for Computational Research.

Funding

This material is based in part upon work supported by the National Science Foundation under award numbers CNS - 1409551 and IIS - 1641475.

Author information

Contributions

SM performed the literature review, implemented the proposed model, and carried out the experiments. SM and VC co-wrote the manuscript. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Suchismit Mahapatra.

Ethics declarations

Ethics approval and consent to participate

The author confirms the sole responsibility for this manuscript. The author read and approved the final manuscript.

Consent for publication

The authors consent to publication.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Appendix

Lemma 1

The matrix exponential of a symmetric matrix \({\textbf {M}}\) with rank\(\big ({\textbf {M}} \big ) = d\) is given by

$$\begin{aligned} e^{\textbf {M}} = {{\textbf {I}} } + \sum \limits _{i=1}^{d} \big ( e^{{{\varvec{\lambda }}}_i} - 1 \big ){{{\textbf {q}} }_i}{{{\textbf {q}} }_i^\top } \end{aligned}$$

where \(\{ {{{\varvec{\lambda }}}}_i \}_{i = 1,2 \ldots d}\) are the d largest eigenvalues of \({\textbf {M}}\) and \(\{ {{\textbf {q}} }_i \}_{i = 1,2 \ldots d}\) are the corresponding eigenvectors such that \({{\textbf {q}} _i^\top }{{{\textbf {q}} }_j} = {{\varvec{\delta }}}_{i,j}\).


Proof Let \({\textbf {M}}\) be an \(n \times n\) real matrix. The exponential \(e^{\textbf {M}}\) is given by

$$\begin{aligned} e^{\textbf {M}} = \sum \limits _{k=0}^{\infty } \frac{1}{{k}\,!} {\textbf {M}} ^{\textbf {k}} = {{\textbf {I}} } + \sum \limits _{k=1}^{\infty } \frac{1}{{k}\,!} {\textbf {M}} ^{\textbf {k}} \end{aligned}$$

where \({{\textbf {I}} }\) is the identity. Real, symmetric \({\textbf {M}}\) has real eigenvalues and mutually orthogonal eigenvectors i.e. \({\textbf {M}} = \sum \nolimits _{i=1}^{n} {{\varvec{\lambda }}}_{i}{{{\textbf {q}} }_i}{{\textbf {q}} _i^\top } \text{ where } \{ {{\varvec{\lambda }}}_{i} \}_{i = 1 \ldots n} \text{ are } \text{ real } \text{ and } {{{\textbf {q}} }_i^\top {{{\textbf {q}} }_j} = {{\varvec{\delta }}}_{i,j}}\). Given \({\textbf {M}}\) has rank d, we have \({\textbf {M}} = \sum \limits _{i=1}^{d} {{\varvec{\lambda }}}_{i}{{\textbf {q}} _i}{{{\textbf {q}} }_i^\top }\).

$$\begin{aligned} \begin{aligned} e^{\textbf {M}}&= {{\textbf {I}} } + \sum \limits _{i=1}^{\infty } \frac{1}{i\,!} {\textbf {M}} ^i \\&= {{\textbf {I}} } + \frac{1}{1\,!} \big ({{{\varvec{\lambda }}}_1}{{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top } + {{{\varvec{\lambda }}}_2}{{{\textbf {q}} }_2}{{{\textbf {q}} }_2^\top } + \ldots + {{{\varvec{\lambda }}}_d}{{{\textbf {q}} }_d}{{{\textbf {q}} }_d^\top } \big ) \\&+ \frac{1}{2\,!}\big ({{{\varvec{\lambda }}}_1}{{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top } + {{{\varvec{\lambda }}}_2}{{{\textbf {q}} }_2}{{{\textbf {q}} }_2^\top } + \ldots +{{{\varvec{\lambda }}}_d}{{{\textbf {q}} }_d}{{{\textbf {q}} }_d^\top } \big )^2 + \ldots \\&= {{\textbf {I}} } + \big ( \frac{{{{\varvec{\lambda }}}_1}}{1\,!} + \frac{{{{\varvec{\lambda }}}_1^2}}{2\,!} + \ldots \big ){{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top } + \big ( \frac{{{{\varvec{\lambda }}}_2}}{1\,!} + \frac{{{{\varvec{\lambda }}}_2^2}}{2\,!} + \ldots \big ){{{\textbf {q}} }_2}{{{\textbf {q}} }_2^\top } + \ldots \\&+ \big ( \frac{{{{\varvec{\lambda }}}_d}}{1\,!} + \frac{{{{\varvec{\lambda }}}_d^2}}{2\,!} + \ldots \big ){{{\textbf {q}} }_d}{{{\textbf {q}} }_d^\top }\\&= {{\textbf {I}} } + \big ( e^{{{\varvec{\lambda }}}_1} - 1 \big ){{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top } + \big ( e^{{{\varvec{\lambda }}}_2} - 1 \big ){{{\textbf {q}} }_2}{{{\textbf {q}} }_2^\top } + \ldots + \big ( e^{{{\varvec{\lambda }}}_d} - 1 \big ){{{\textbf {q}} }_d}{{{\textbf {q}} }_d^\top }\\&= {{\textbf {I}} } + \sum \limits _{i=1}^{d} \big ( e^{{{\varvec{\lambda }}}_i} - 1 \big ){{{\textbf {q}} }_i}{{{\textbf {q}} }_i^\top } \end{aligned} \end{aligned}$$
(30)

\(\square\)
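
The identity in Lemma 1 can be checked numerically. The sketch below builds a random rank-d symmetric matrix from orthonormal vectors and compares scipy's matrix exponential with the low-rank expansion (30); the dimensions and eigenvalues are arbitrary choices used only for illustration.

```python
import numpy as np
from scipy.linalg import expm

# Numerical check of Lemma 1 for a random rank-d symmetric matrix M.
rng = np.random.default_rng(2)
n, d = 6, 2
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))      # d orthonormal columns q_i
lam = rng.uniform(0.5, 2.0, size=d)               # d nonzero eigenvalues
M = (Q * lam) @ Q.T                               # symmetric, rank d

low_rank = np.eye(n) + sum(
    (np.exp(lam[i]) - 1.0) * np.outer(Q[:, i], Q[:, i]) for i in range(d))
print(np.allclose(expm(M), low_rank))             # True
```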

Lemma 2

The inverse of the Gaussian kernel, for symmetric \({\textbf {M}}\) with rank\(\big ({\textbf {M}} \big ) = 1\), is given by

$$\begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1} = {{\varvec{\alpha }}}{{\textbf {I}} } - \frac{{{\varvec{\alpha }}}^2{{\textbf {c}} _1}{{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}} \end{aligned}$$

where \({{{\textbf {q}} }_1}\) is the first eigenvector of \({\textbf {M}}\), i.e., \({{{\textbf {q}} }_1^\top }{{{\textbf {q}} }_1} = 1\), \({{{\varvec{\lambda }}}_1}\) is the corresponding eigenvalue, \({{\varvec{\alpha }}} = \frac{1}{\big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big )}\) and \({{\textbf {c}} _1} = \big [ \exp {\left( -\frac{{{\varvec{\lambda }}}_1}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ]\).


Proof Using (22) for \(d = 1\), we have

$$\begin{aligned} \begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1}&= {\big ( {{\textbf {I}} } + \big [\exp {\left( -\frac{{{\varvec{\lambda }}}_1}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ]{{{\textbf {q}} }_1}{{\textbf {q}} _1^\top } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1} \\&= {\big ( \big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big ){{\textbf {I}} } + \big [\exp {\left( -\frac{{{\varvec{\lambda }}}_1}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ]{{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top } \big )}^{-1} \end{aligned} \end{aligned}$$
(31)

Representing \(\frac{1}{\big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big )}\) as \({{\varvec{\alpha }}}\) and \(\big [\exp {\left( -\frac{{{\varvec{\lambda }}}_1}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ]\) as \({{\textbf {c}} _1}\) and using \(\big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big ){{\textbf {I}} }\) as \({{\textbf {A}} }\), \({{\textbf {c}} _1}{{{\textbf {q}} }_1}\) as \({\textbf {u}}\) and \({{\textbf {q}} _1}\) as \({\textbf {v}}\) in the Sherman-Morrison identity [52], we have

$$\begin{aligned} \begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1}&= {{\varvec{\alpha }}}{{\textbf {I}} } - \frac{{{\varvec{\alpha }}}{{\textbf {I}} }{{\textbf {c}} _1}{{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top }{{\varvec{\alpha }}}{{\textbf {I}} }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}}\\&= {{\varvec{\alpha }}}{{\textbf {I}} } - \frac{{{\varvec{\alpha }}}^2{{\textbf {c}} _1}{{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}} \end{aligned} \end{aligned}$$
(32)

\(\square\)

Lemma 3

The inverse of the Gaussian kernel, for symmetric \({\textbf {M}}\) with rank\(\big ({\textbf {M}} \big ) = d\), is given by

$$\begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1} = {{\varvec{\alpha }}}{{\textbf {I}} } - {{\varvec{\alpha }}}^2 \sum \limits _{i=1}^{d} \frac{{{\textbf {c}} _i}{{{\textbf {q}} }_i}{{\textbf {q}} _i^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _i}} \end{aligned}$$

where \(\{ {{\varvec{\lambda }}}_i \}_{i = 1,2 \ldots d}\) are the d largest eigenvalues of \({\textbf {M}}\) and \(\{ {{\textbf {q}} }_i \}_{i = 1,2 \ldots d}\) are the corresponding eigenvectors such that \({{\textbf {q}} _i^\top }{{{\textbf {q}} }_j} = {{\varvec{\delta }}}_{i,j}\).


Proof Applying the result of the previous lemma iteratively, we obtain the required result

$$\begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1} = {{\varvec{\alpha }}}{{\textbf {I}} } - {{\varvec{\alpha }}}^2 \sum \limits _{i=1}^{d} \frac{{{\textbf {c}} _i}{{\textbf {q}} _i}{{{\textbf {q}} }_i^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _i}} \end{aligned}$$
(33)

where \({{\varvec{\alpha }}} = \frac{1}{\big ( 1 + {{{\varvec{\sigma }}}_n}^2 \big )}\) and \({{\textbf {c}} _i} = \big [ \exp {\left( -\frac{{{\varvec{\lambda }}}_i}{2{{\varvec{\ell }}}^2}\right) } - 1 \big ]\). \(\square\)
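
The closed-form inverse of Lemma 3 (and, for \(d = 1\), Lemma 2) can likewise be verified numerically. The sketch below forms the kernel matrix from the low-rank expansion of Lemma 1 applied to \(-{\textbf {M}}/2{{\varvec{\ell }}}^2\) and compares the closed form against a direct matrix inverse; all sizes and hyperparameter values are arbitrary.

```python
import numpy as np

# Numerical check of Lemma 3 (Lemma 2 is the d = 1 case): closed-form
# inverse of K + sigma_n^2 I, with K built from the eigenpairs of M via
# the expansion of Lemma 1 applied to -M / (2 l^2).
rng = np.random.default_rng(3)
n, d, ell, sigma_n = 8, 3, 1.3, 0.1
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))      # orthonormal q_i
lam = rng.uniform(0.5, 2.0, size=d)               # eigenvalues lambda_i

c = np.exp(-lam / (2 * ell**2)) - 1.0             # c_i
K = np.eye(n) + (Q * c) @ Q.T
alpha = 1.0 / (1.0 + sigma_n**2)

closed_form = alpha * np.eye(n) - alpha**2 * sum(
    c[i] / (1.0 + alpha * c[i]) * np.outer(Q[:, i], Q[:, i]) for i in range(d))
print(np.allclose(np.linalg.inv(K + sigma_n**2 * np.eye(n)), closed_form))  # True
```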

Lemma 4

The solution of the Gaussian Process regression system, for symmetric \({\textbf {M}}\) with rank\(\big ({\textbf {M}} \big ) = 1\), is given by

$$\begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1}{\textbf {y}} = \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_1{\textbf {q}} _1}{{1 + {{\varvec{\alpha }}}{\textbf {c}} _1}} \end{aligned}$$

Proof Assuming the intrinsic dimensionality of the low-dimensional manifold is 1, the inverse of the Gaussian kernel is as given in (32). In this case, \({\textbf {y}} = \sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1\) (see Sect. “Non-linear dimensionality reduction”). Thus we have

$$\begin{aligned} \begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1}{\textbf {y}}&= \big ( {{\varvec{\alpha }}}{{\textbf {I}} } - \frac{{{\varvec{\alpha }}}^2{{\textbf {c}} _1}{{{\textbf {q}} }_1}{{{\textbf {q}} }_1^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}} \big ) \big (\sqrt{{\varvec{\lambda }}}_1{\textbf {q}} _1\big )\\&= {{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1 - \frac{{{\varvec{\alpha }}}^2\sqrt{{\varvec{\lambda }}}_1{{\textbf {c}} _1}{{\textbf {q}} }_1}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}}} = \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}}} \end{aligned} \end{aligned}$$
(34)

\(\square\)

Lemma 5

The solution of the Gaussian Process regression system, for symmetric \({\textbf {M}}\) with rank\(\big ({\textbf {M}} \big ) = d\), is given by

$$\begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1}{\textbf {y}} = \{ \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_1{\textbf {q}} _1}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}}} \; \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_2{{\textbf {q}} }_2}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _2}}} \; \ldots \; \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_d{{\textbf {q}} }_d}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _d}}} \} \end{aligned}$$

Proof Assuming the intrinsic dimensionality of the low-dimensional manifold is d, the inverse of the Gaussian kernel is as given in (33). In this case, \({\textbf {y}} = \{ \sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1 \; \sqrt{{\varvec{\lambda }}}_2{{\textbf {q}} }_2 \; \ldots \; \sqrt{{\varvec{\lambda }}}_d{\textbf {q}} _d \}\) (see Sect. “Non-linear dimensionality reduction”), where \({{\textbf {q}} _i^\top }{{{\textbf {q}} }_j} = {{\varvec{\delta }}}_{i,j}\). Each of the d dimensions of \({\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{\textbf {I}} \big )}^{-1}{\textbf {y}}\) can be processed independently, as in the previous lemma. For the \({i}^{\text {th}}\) dimension, we have

$$\begin{aligned} \begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1}{\textbf {y}} _i&= \big ( {{\varvec{\alpha }}}{{\textbf {I}} } - {{\varvec{\alpha }}}^2 \sum \limits _{j=1}^{d} \frac{{{\textbf {c}} _j}{{{\textbf {q}} }_j}{{\textbf {q}} _j^\top }}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _j}} \big ) \big (\sqrt{{\varvec{\lambda }}}_i{{\textbf {q}} }_i\big )\\&= {{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_i{{\textbf {q}} }_i - {{\varvec{\alpha }}}^2 \sum \limits _{j=1}^{d} \frac{{{\textbf {c}} _j}{{{\textbf {q}} }_j}{{{\textbf {q}} }_j^\top }{{\textbf {q}} }_i\big (\sqrt{{\varvec{\lambda }}}_i\big )}{1 + {{\varvec{\alpha }}}{{\textbf {c}} _j}} \\&= {{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_i{{\textbf {q}} }_i - \frac{{{\varvec{\alpha }}}^2\sqrt{{\varvec{\lambda }}}_i{{\textbf {c}} _i}{{\textbf {q}} }_i}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _i}}} = \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_i{{\textbf {q}} }_i}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _i}}} \end{aligned} \end{aligned}$$
(35)

Thus we get the result,

$$\begin{aligned} {\big ( {{\textbf {K}} } + {{{\varvec{\sigma }}}_n}^2{{\textbf {I}} } \big )}^{-1}{\textbf {y}} = \{ \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_1{{\textbf {q}} }_1}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _1}}} \; \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_2{{\textbf {q}} }_2}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _2}}} \; \ldots \; \frac{{{\varvec{\alpha }}}\sqrt{{\varvec{\lambda }}}_d{{\textbf {q}} }_d}{{1 + {{\varvec{\alpha }}}{{\textbf {c}} _d}}} \} \end{aligned}$$
(36)

\(\square\)
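
Finally, the GPR solution of Lemma 5 can be checked in the same setting by stacking the columns \(\sqrt{{\varvec{\lambda }}}_i{{\textbf {q}} }_i\) into \({\textbf {y}}\) and comparing a direct linear solve against the closed form (36); again, the sizes and hyperparameters below are arbitrary choices for illustration.

```python
import numpy as np

# Numerical check of Lemma 5: column-wise closed form of (K + sigma_n^2 I)^{-1} y,
# with y = [sqrt(lambda_1) q_1, ..., sqrt(lambda_d) q_d].
rng = np.random.default_rng(4)
n, d, ell, sigma_n = 8, 3, 1.3, 0.1
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
lam = rng.uniform(0.5, 2.0, size=d)

c = np.exp(-lam / (2 * ell**2)) - 1.0
K = np.eye(n) + (Q * c) @ Q.T
alpha = 1.0 / (1.0 + sigma_n**2)

Y = Q * np.sqrt(lam)                                   # column i is sqrt(lambda_i) q_i
direct = np.linalg.solve(K + sigma_n**2 * np.eye(n), Y)
closed = (alpha * np.sqrt(lam) / (1.0 + alpha * c)) * Q
print(np.allclose(direct, closed))                     # True
```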

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Mahapatra, S., Chandola, V. Learning manifolds from non-stationary streams. J Big Data 11, 42 (2024). https://doi.org/10.1186/s40537-023-00872-8
