The proposed UbiSOM algorithm relies on two learning assessment metrics, namely the average quantization error and the average neuron utility, computed over a sliding window. While the former assesses the trend of the vector quantization process towards the underlying distribution, the latter detects regions of the map that may become “unused” after changes in the distribution, e.g., the disappearance of clusters. Both metrics are weighted in a drift function that gives an overall indication of the performance of the map over the data stream and is used to estimate the learning parameters.
The UbiSOM implements a finite state machine consisting of two states, namely ordering and learning. The ordering state allows the map to initially unfold over the underlying distribution with monotonically decreasing learning parameters; it is also used to obtain the first values of the assessment metrics, after which the algorithm transitions to the learning state. Here, the learning parameters, i.e., the learning rate and the neighborhood radius, are decreased or increased based on the drift function. This allows the UbiSOM to retain indefinite plasticity over non-stationary data streams while maintaining the original SOM properties. These states also coincide with the two typical training phases suggested by Kohonen. It is possible, however, that an unrecoverable situation caused by an abrupt change in the underlying distribution is detected, in which case the algorithm transitions back to the ordering state.
Notation
Each UbiSOM neuron \(k\) is a tuple \(\mathcal {W}_{k}=\langle {\mathbf{w}}_{k},\, t_{k}^{update}\rangle\), where \({\mathbf{w}}_{k}\in \mathbb {R}^{d}\) is the prototype and \(t_{k}^{update}\) stores the time stamp of the last time its prototype was updated. For each incoming observation \({\mathbf{x}}_{t}\), presented at time t, two metrics are computed within a sliding window of length T, namely the average quantization error \(\overline{qe}(t)\) and the average neuron utility \(\overline{\lambda }(t)\). We assume that all features of the data stream are normalized to the same range \([d_{min},d_{max}]\). The local quantization error \({E(t)}\) is normalized by \(|\Omega |=(d_{max}-d_{min})\sqrt{d}\), so that \(\overline{qe}(t)\in [0,1]\). The \(\overline{\lambda }(t)\) metric averages neuron utility values \(\lambda (t)\), computed as the ratio of neurons updated during the last T observations. Both metrics are combined in a drift function \(d(t)\), where the parameter \(\beta \in [0,1]\) weighs the two metrics.
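To make the notation concrete, the following minimal sketch (in Python, not the authors' implementation) shows one possible in-memory representation of a UbiSOM neuron and lattice; the names `Neuron`, `w` and `t_update` are illustrative choices, not prescribed by the text.

```python
# Minimal sketch of the notation above (assumed names, not the authors' code).
from dataclasses import dataclass
import numpy as np

@dataclass
class Neuron:
    w: np.ndarray      # prototype vector w_k in R^d
    t_update: int = 0  # time stamp t_k^update of the last prototype update

# Example: a 10x10 lattice for 3-dimensional data normalized to [0, 1].
width, height, d = 10, 10, 3
lattice = [[Neuron(w=np.random.rand(d)) for _ in range(width)] for _ in range(height)]
```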
The UbiSOM switches between the ordering and learning states, both using the classical SOM update rule but with different mechanisms for estimating the learning parameters \(\sigma\) and \(\eta\). The ordering state endures for \(T\) examples, until the first values of \(\overline{qe}(t)\) and \(\overline{\lambda }(t)\) are available, establishing an interval \([t_{i},t_{f}]\) during which monotonically decreasing functions \(\sigma (t)\) and \(\eta (t)\) decrease the parameters from \(\sigma _{i}\) to \(\sigma _{f}\) and from \(\eta _{i}\) to \(\eta _{f}\), respectively. The learning state estimates the learning parameters as a function of the drift function. The UbiSOM neighborhood function is defined so that it uses \(\sigma \in [0,1]\), as opposed to existing variants, where the domain of the values is problem-dependent.
Online assessment metrics
The purpose of these metrics is to assess the “fit” of the map to the underlying distribution. Both proposed metrics are computed over a sliding window of length \(T\).
Average quantization error
The widely used global quantization error (QE) metric is the standard measure of fit of a SOM model to a particular distribution. It is typically used in a batch setting to compare SOM models obtained from different runs and/or parameterizations. The rationale is that the model which exhibits a lower QE value is better at summarizing the input space.
With data streams this metric, as it stands, is not applicable because the data is potentially infinite. Competing approaches to the proposed UbiSOM use only the local quantization error \(E(t)\). Kohonen stated that both \(\eta (t)\) and \(\sigma (t)\) should decrease monotonically with time, a critical condition to achieve convergence [8]. However, the local error is very unstable because \(\Omega \rightarrow \mathcal {K}\) is a many-to-few mapping, in which some observations are better represented than others. As an example, even with stationary data the local error does not decrease monotonically over time. We argue this is the reason why other existing approaches, e.g., PLSOM and DSOM, fail to model the input space density correctly.
In the proposed algorithm, the quantization error was replaced by a running mean, the average quantization error \(\overline{qe}(t)\), based on the premise that the error of a learner decreases over time for an increasing number of examples if the underlying distribution is stationary; otherwise, if the distribution changes, the error increases. For each observation \({\mathbf{x}}_t\) the local quantization error \(E{^{\prime }}(t)\) is obtained during the BMU search, as the normalized Euclidean distance
$$\begin{aligned} E{^{\prime }}(t)=\frac{\Vert \,{\mathbf{x}}_{t}-{\mathbf{w}}_{c}\,\Vert }{|\Omega |}. \end{aligned}$$
(5)
These values are averaged over a window of length \(T \gg 0\) to obtain \(\overline{qe}(t)\), defined in Eq. (6). Consequently, the value of \(T\) establishes whether a short-, medium- or long-term trend of the model adaptation is captured.
$$\begin{aligned} \overline{qe}(t)=\frac{1}{T}\sum _{i=t-T+1}^{t}E{^{\prime }}(i) \end{aligned}$$
(6)
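A minimal sketch of how Eqs. (5) and (6) can be computed incrementally is given below, assuming Python with NumPy and a fixed-length deque as the sliding window; the helper names are hypothetical, not the authors' implementation.

```python
# Sketch of Eqs. (5)-(6): normalized local error and its sliding-window mean.
from collections import deque
import numpy as np

d_min, d_max, d = 0.0, 1.0, 3           # assumed feature range and dimensionality
omega = (d_max - d_min) * np.sqrt(d)    # |Omega|, the normalization constant
T = 2000                                # sliding-window length
qe_window = deque(maxlen=T)             # keeps only the last T local errors

def local_error(x, w_bmu):
    """E'(t): Euclidean distance between x_t and the BMU prototype, divided by |Omega|."""
    return np.linalg.norm(x - w_bmu) / omega

def avg_qe():
    """qe_bar(t): mean of the last T local errors (meaningful once the window is full)."""
    return sum(qe_window) / len(qe_window)

# Per observation: qe_window.append(local_error(x_t, w_c)); then read avg_qe().
```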
Figure 1 depicts the typical behavior of \(E{^{\prime }}(t)\) values obtained during a run of the classical SOM algorithm, together with the computed \(\overline{qe}(t)\) values for \(T=2000\), over a data stream whose underlying distribution suffers an abrupt change at \(t=50\,000\). We can observe that \(E{^{\prime }}(t)\) values exhibit a large variance over time, as opposed to \(\overline{qe}(t)\), which is smoother and indicates the convergence trend. Therefore, it is implicitly assumed that if \(\overline{qe}(t)\) is decreasing, then the underlying distribution is stationary; otherwise, it is changing.
Average neuron utility
The average quantization error \(\overline{qe}(t)\) may be a good overall indicator of the fit of the model. Despite that, it may be unable to detect the abrupt disappearance of clusters. Figure 2 illustrates such a scenario, depicting the “unused” area of the map after the inner cluster disappears. Here \(\overline{qe}(t)\) does not increase; however, in this situation the learning parameters should increase and allow the map to recover. As a consequence, the average neuron utility was proposed as a means to detect these cases.
To compute this assessment metric each UbiSOM neuron \(k\) is extended with a time stamp \(t_{k}^{update}\), which stores the last time the corresponding prototype was updated, functioning as an aging mechanism. A prototype is updated if it belongs to the BMU or if it falls within the influence region of the BMU, limited by the neighborhood function. Initially, \(t_{k}^{update}=0\). The neuron utility \(\lambda (t)\), given by Eq. (7), measures the ratio of neurons that were updated within the last \(T\) observations over the total number of neurons. Consequently, if all neurons have been recently updated, then \(\lambda (t)=1\). These values are then averaged by Eq. (8) to obtain \(\overline{{\lambda }}(t)\).
$$\begin{aligned} \lambda (t)=\frac{\sum _{k=1}^{K}1_{\{t-t_{k}^{update}\le T\}}}{K} \end{aligned}$$
(7)
$$\begin{aligned} \overline{\lambda }(t)=\frac{1}{T}\sum _{i=t-T+1}^{t}\lambda (i). \end{aligned}$$
(8)
As a result, a decrease in \(\overline{\lambda }(t)\) indicates that there are neurons that are not being used to quantize the data stream. While it is not unusual to obtain these “dead units” with stationary data after the map has converged, a decreasing trend should signal changes in the underlying distribution.
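The sketch below illustrates Eqs. (7) and (8) under the same assumptions as the previous examples (Python, hypothetical names); the neuron time stamps correspond to the \(t_{k}^{update}\) fields described above.

```python
# Sketch of Eqs. (7)-(8): neuron utility from update time stamps and its window average.
from collections import deque

T = 2000
lambda_window = deque(maxlen=T)   # keeps only the last T utility values

def neuron_utility(t, update_stamps):
    """lambda(t): fraction of neurons whose last update happened within the last T steps."""
    updated = sum(1 for t_upd in update_stamps if t - t_upd <= T)
    return updated / len(update_stamps)

def avg_utility():
    """lambda_bar(t): mean of the last T utility values."""
    return sum(lambda_window) / len(lambda_window)

# Per observation: lambda_window.append(neuron_utility(t, [n.t_update for n in neurons]))
```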
The drift function
The previous metrics \(\overline{qe}(t)\) and \(\overline{\lambda }(t)\) are both weighted in a drift function that is used by the UbiSOM to estimate the learning parameters. In short:
- \(\overline{qe}(t)\): The average quantization error, defined in Eq. (6), gives an indication of how well the map is currently quantizing the underlying distribution. In most situations where the underlying data stream is stationary, \(\overline{qe}(t)\) is expected to decrease and stabilize, i.e., the map is converging. If the shape of the distribution changes, \(\overline{qe}(t)\) is expected to increase.
- \(\overline{\lambda }(t)\): The average neuron utility, defined in Eq. (8), is an additional measure that gives an indication of the proportion of neurons that are actively being updated. A decrease of \(\overline{\lambda }(t)\) indicates neurons are being underused, which can reflect changes in the underlying distribution not detected by \(\overline{qe}(t)\).
The drift function is defined as
$$\begin{aligned} d(t)=\beta \,\overline{qe}(t)+(1-\beta )\,(1-\overline{\lambda }(t)) \end{aligned}$$
(9)
where \(\beta \in [0,1]\) is a weighting factor that establishes the balance of importance between the two metrics. Since both \(\overline{qe}(t)\) and \(\overline{\lambda }(t)\) are only obtained after \(T\) observations, so is \(d(t)\).
A quick analysis of \(d(t)\) is in order: with high learning parameters, especially the neighborhood radius \(\sigma\), \(\overline{\lambda }(t)\) is expected to be \(\thickapprox 1\), which practically eliminates the second term of the equation. Consequently, the drift function is only governed by \(\overline{qe}(t)\). When the neuron utility decreases, the second term contributes to the increase of \(d(t)\) in proportion to the chosen \(\beta\) value. Ultimately, if \(\beta =1\) then the drift function is defined solely by the \(\overline{qe}(t)\) metric. Empirically, \(\beta\) should be parameterized with relatively high values, establishing \(\overline{qe}(t)\) as the main measure of “fit” and using \(\overline{\lambda }(t)\) as a failsafe mechanism.
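The drift function itself is a one-line computation; the following sketch assumes \(\beta =0.7\) purely as an illustrative choice of a "relatively high" weight.

```python
# Sketch of Eq. (9); beta = 0.7 is an assumed example of a relatively high weight.
def drift(avg_qe, avg_utility, beta=0.7):
    """d(t) = beta * qe_bar(t) + (1 - beta) * (1 - lambda_bar(t))."""
    return beta * avg_qe + (1.0 - beta) * (1.0 - avg_utility)

# With lambda_bar ~ 1 the second term vanishes and d(t) tracks qe_bar(t) alone.
print(drift(avg_qe=0.05, avg_utility=1.0))   # ~ 0.035
print(drift(avg_qe=0.05, avg_utility=0.8))   # ~ 0.095, the failsafe term kicks in
```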
The neighborhood function
The UbiSOM algorithm uses a normalized neighborhood radius \(\sigma\) learning parameter and a truncated neighborhood function. The latter is what effectively allows \(\overline{\lambda }(t)\) to be computed.
The classical SOM neighborhood function relies on a \(\sigma\) value that is problem-dependent, i.e., the values used depend on the lattice size. This complicates the parameterization of \(\sigma\) for different lattice sizes \(K\), i.e., \(width\times height\).
The performed normalization is based on the maximum distance between any two neurons in the lattice. In rectangular maps the farthest neurons are the ones at opposing diagonals, e.g., positions (0, 0) and \((width-1,\, height-1)\) in Fig. 3. Hence distances within the lattice are normalized by the Euclidean norm of the vector \({\mathbf{diag}}=(width-1,height-1)\), defined as
$$\begin{aligned} \| {\mathbf{diag}}\| =\sqrt{(width-1)^{2}+(height-1)^{2}}. \end{aligned}$$
(10)
This effectively limits the maximum neighborhood width the UbiSOM can use and establishes \(\sigma \in [0,1]\).
The neighborhood function of the UbiSOM variant is given by
$$\begin{aligned} h_{ck}^{\prime }(t)=e^{-\Big (\frac{\parallel r_{c}-r_{k}\parallel }{\sigma \,\parallel {\mathbf{diag}}\parallel }\Big )^{2}} \end{aligned}$$
(11)
where \(r_{c}\) is the position in the grid of the BMU for observation \({\mathbf{x}}_{t}\). To give a grasp of how \(\sigma\) determines the influence region around the BMU, Fig. 4 depicts Eq. (11) for different \(\sigma\) values. Neurons whose values of \(h_{ck}^{\prime }(t)\) fall below a threshold of 0.01 are not updated. This is critical for the computation of \(\lambda (t)\), since \(h_{ck}^{\prime }(t)\) is a continuous function and, as \(h_{ck}^{\prime }(t)\rightarrow 0\), all other neurons would still be updated with very small values. The truncated neighborhood function is also a performance improvement, avoiding negligible updates to prototypes.
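A sketch of the normalized and truncated neighborhood function of Eqs. (10) and (11) follows, assuming Python/NumPy and hypothetical names; the 0.01 cut-off is the threshold mentioned above.

```python
# Sketch of Eqs. (10)-(11): normalized lattice distance, Gaussian kernel, 0.01 truncation.
import numpy as np

def neighborhood(r_c, r_k, sigma, width, height, threshold=0.01):
    """h'_ck(t): Gaussian of the lattice distance normalized by ||diag||, truncated at 0.01."""
    diag_norm = np.sqrt((width - 1) ** 2 + (height - 1) ** 2)   # Eq. (10)
    dist = np.linalg.norm(np.asarray(r_c) - np.asarray(r_k))    # ||r_c - r_k||
    h = np.exp(-(dist / (sigma * diag_norm)) ** 2)              # Eq. (11)
    return h if h > threshold else 0.0                          # truncation

# Example: a corner BMU has no influence on the opposite corner of a 20x30 lattice.
print(neighborhood(r_c=(0, 0), r_k=(19, 29), sigma=0.3, width=20, height=30))  # -> 0.0
```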
States and transitions
The UbiSOM algorithm implements a finite state machine, i.e., it can switch between two states. This design was, on one hand, imposed by the initial delay in obtaining values for the assessment metrics and, as a consequence, for the drift function \(d(t)\); on the other hand, it was seen as a desirable mechanism to conform to Kohonen’s proposal of an ordering and a convergence phase for the SOM [8] and to deal with drastic changes that can occur in the underlying distribution.
The two possible states of the UbiSOM algorithm, namely the ordering state and the learning state, are depicted in Fig. 5 and described next. Both use an update equation similar to that of the classical algorithm, but with the neighborhood function defined in Eq. (11), as given in Eq. (12). Note that prototypes are only updated above the neighborhood function threshold.
$$\begin{aligned} {\mathbf{w}}_{k}(t+1)={\left\{ \begin{array}{ll} {\mathbf{w}}_{k}(t)+\eta (t)h_{ck}^{\prime }(t)\left[ {\mathbf{x}}_{t}-{\mathbf{w}}_{k}(t)\right] &\quad h_{ck}^{\prime }(t)>0.01\\ {\mathbf{w}}_{k}(t) &\quad otherwise \end{array}\right. } \end{aligned}$$
(12)
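The update of Eq. (12) can be sketched as a single step per neuron, as below (Python/NumPy, illustrative names); only neurons above the neighborhood threshold are touched.

```python
# Sketch of Eq. (12): prototype update gated by the truncated neighborhood function.
import numpy as np

def update_prototype(w_k, x_t, eta, h_ck, threshold=0.01):
    """w_k(t+1) = w_k(t) + eta(t) * h'_ck(t) * (x_t - w_k(t)) when h'_ck(t) > threshold."""
    if h_ck > threshold:
        return w_k + eta * h_ck * (x_t - w_k)
    return w_k   # neuron left untouched; its t_k^update time stamp is not refreshed
```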
However, each state estimates learning parameters with different functions for \(\eta (t)\) and \(\sigma (t)\).
Ordering state
The ordering state is the initial state of the UbiSOM algorithm and the one to which it reverts if it cannot recover from an abrupt change in the data stream. It endures for \(T\) observations, during which the learning parameters are estimated with monotonically decreasing, i.e., time-dependent, functions, similarly to the classical SOM. Thus, the parameter \(T\) simultaneously defines the window length of the assessment metrics and dictates the duration of the ordering state. The initial parameter values should be relatively high, so the map can order itself over the underlying distribution from a totally unordered initialization. This phase also allows the first value of the drift function \(d(t)\) to become available. After \(T\) observations the algorithm switches to the learning state.
Let \(t_{i}\) and \(t_{f}=t_{i}+T-1\) be the first and last iterations of the ordering phase, respectively. This state requires choosing appropriate parameter values for \(\eta _{i}\), \(\eta _{f}\), \(\sigma _{i}\) and \(\sigma _{f}\), which are, respectively, the initial and final values for the learning rate and the normalized neighborhood radius. The choice of values will greatly impact the initial ordering of the prototypes and will affect the estimation of parameters of the learning state. Any monotonically decreasing function can be used, although in this research the following were used:
$$\begin{aligned} \sigma (t)=\sigma _{i}\left( \frac{\sigma _{f}}{\sigma _{i}}\right) ^{t/t_{f}},\quad \eta (t)=\eta _{i}\left( \frac{\eta _{f}}{\eta _{i}}\right) ^{t/t_{f}}\qquad \forall t\in \{t_{i},t_{i}+1,\ldots ,t_{f}\} \end{aligned}$$
(13)
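As a sketch, the schedules of Eq. (13) reduce to a single exponential-decay helper; the parameter values below are assumed examples, not values prescribed by the text, and the ordering state is assumed to start at \(t_{i}=0\).

```python
# Sketch of Eq. (13): exponential decay used for both sigma(t) and eta(t) during ordering.
def ordering_schedule(t, t_f, p_i, p_f):
    """Monotonically decreasing from p_i (at t = 0) to p_f (at t = t_f)."""
    return p_i * (p_f / p_i) ** (t / t_f)

T = 2000                          # ordering state lasts T observations
sigma_i, sigma_f = 0.6, 0.2       # assumed example values
eta_i, eta_f = 0.1, 0.08          # assumed example values
sigma_t = ordering_schedule(1000, T - 1, sigma_i, sigma_f)  # sigma halfway through ordering
eta_t = ordering_schedule(1000, T - 1, eta_i, eta_f)
```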
At the end of iteration \(t_{f}\), the first value of the drift function is obtained, i.e., \(d(t_{f})\), and the UbiSOM algorithm transitions to the learning state.
Learning state
The learning state begins at \(t_{f}+1\) and is the main state of the UbiSOM algorithm, during which learning parameters are estimated in a time-independent manner. Here learning parameters are estimated solely based on the drift function \(d(t)\), decreasing or increasing relative to the first computed value \(d(t_{f})\) and final values (\(\eta _{f},\sigma _{f}\)) of the ordering state.
Given that in this state the map is expected to start converging, the values of \(d(t)\) should also decrease. Hence, the value \(d(t_{f})\) is used as a reference establishing a threshold above which the map is considered to be irrecoverably diverging due to changes in the underlying distribution, e.g., after some abrupt changes the drift function can increase rapidly to very high values. Consequently, it also limits the maximum values that the learning parameters can attain during this state.
The learning parameters \(\eta (t)\) and \(\sigma (t)\) are estimated for an observation presented at time t by Eq. (14), where \(d(t)\) is defined as in Eq. (9). One can easily see that the learning parameters are estimated proportionally to \(d(t)\). Also, the final values \(\eta _{f}\) and \(\sigma _{f}\) of the ordering state establish an upper bound for the learning parameters in this state.
$$\begin{aligned} \eta (t)={\left\{ \begin{array}{ll} \frac{\eta _{f}}{d(t_{f})}\, d(t) &\quad d(t)<d(t_{f})\\ \eta _{f} &\quad otherwise \end{array}\right. }\quad \sigma (t)={\left\{ \begin{array}{ll} \frac{\sigma _{f}}{d(t_{f})}\, d(t) &\quad d(t)<d(t_{f})\\ \sigma _{f} &\quad otherwise. \end{array}\right. } \end{aligned}$$
(14)
The outcome of these equations is that if the distribution is stationary the learning parameters accompany the decrease of the drift function values, allowing the map to converge to a stable state. On the contrary, if changes occur, the drift function values rise, consequently increasing the learning parameters and the plasticity of the map, up to a point where \(d(t)\) should start decreasing again. The increased plasticity should allow the map to adjust to the distribution change.
However, there may be cases of abrupt changes from which the map cannot recover, i.e., the map does not resume convergence with decreasing \(d(t)\) values. Therefore, if the learning parameters remain at their peak values for at least \(T\) iterations, i.e., \(\sum 1_{\{d(t)\ge d(t_{f})\}}\ge T\), this situation is confirmed and the UbiSOM transitions back to the ordering state.
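The following sketch combines Eq. (14) with the revert condition described above (Python, hypothetical names); resetting the counter whenever \(d(t)\) drops below \(d(t_{f})\) is one possible reading of the condition.

```python
# Sketch of Eq. (14) plus the peak-value counter that triggers a return to ordering.
def learning_state_params(d_t, d_tf, eta_f, sigma_f):
    """eta(t), sigma(t): proportional to d(t), capped at the ordering-state final values."""
    if d_t < d_tf:
        return (eta_f / d_tf) * d_t, (sigma_f / d_tf) * d_t
    return eta_f, sigma_f

class DivergenceMonitor:
    """Counts iterations with d(t) >= d(t_f); after T of them the map is deemed unrecoverable."""
    def __init__(self, T):
        self.T, self.count = T, 0

    def should_revert(self, d_t, d_tf):
        self.count = self.count + 1 if d_t >= d_tf else 0   # reset on recovery (assumed)
        return self.count >= self.T   # True -> transition back to the ordering state
```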
Time and space complexity
The UbiSOM algorithm (and model) does not increase the time complexity of the classical SOM algorithm, since all the potentially penalizing additional operations, namely the computations of the assessment metrics, can be performed in O(1). Regarding space complexity, it increases the space needed for: (1) storing an additional timestamp for each neuron \(k\); (2) storing two queues for the assessment metrics \(\overline{qe}(t)\) and \(\overline{\lambda }(t)\), each of length \(T\). Therefore, after the initial creation of the data structures (map and queues) in O(\(K\)) time and \(O(Kd+2K+2T)\) space, every observation \({\mathbf{x}}_{t}\) is processed in O(2Kd) time, which is constant with respect to the stream length, and in constant space. No observations are kept in memory.
Hence, the UbiSOM algorithm is scalable with respect to the number of observations N, since the cost per observation is kept constant. However, increasing the number of neurons \(K\), i.e., the size of the lattice, or the dimensionality d of the data stream increases this cost linearly.
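To make the stated per-observation cost concrete, the self-contained sketch below (illustrative only, with an assumed signature) spells out one processing step: a BMU search and a truncated neighborhood update, each O(Kd); appending the metric values to their fixed-length windows adds O(1).

```python
# Self-contained sketch of one O(Kd) processing step (illustrative, not the authors' code).
import numpy as np

def process_observation(x_t, prototypes, positions, t_updates, t, eta, sigma, diag_norm,
                        threshold=0.01):
    """One step: O(Kd) BMU search followed by an O(Kd) truncated neighborhood update."""
    # BMU search: K distance computations of cost O(d) each.
    dists = np.linalg.norm(prototypes - x_t, axis=1)
    c = int(np.argmin(dists))

    # Truncated neighborhood update; skipped neurons keep their old time stamp.
    for k in range(len(prototypes)):
        h = np.exp(-(np.linalg.norm(positions[c] - positions[k]) / (sigma * diag_norm)) ** 2)
        if h > threshold:
            prototypes[k] += eta * h * (x_t - prototypes[k])
            t_updates[k] = t

    # Appending qe and lambda values to their fixed-length windows adds O(1) per observation.
    return c
```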