 Methodology
 Open access
 Published:
On data efficiency of univariate time series anomaly detection models
Journal of Big Data volume 11, Article number: 83 (2024)
Abstract
In machine learning (ML) problems, it is widely believed that more training samples lead to improved predictive accuracy but incur higher computational costs. Consequently, achieving better data efficiency, that is, the tradeoff between the size of the training set and the accuracy of the output model, becomes a key problem in ML applications. In this research, we systematically investigate the data efficiency of Univariate Time Series Anomaly Detection (UTSAD) models. We first experimentally examine the performance of nine popular UTSAD algorithms as a function of the training sample size on several benchmark datasets. Our findings confirm that most algorithms become more accurate when more training samples are used, whereas the marginal gain for adding more samples gradually decreases. Based on the above observations, we propose a novel framework called FastUTSAD that achieves improved data efficiency and reduced computational overhead compared to existing UTSAD models with little loss of accuracy. Specifically, FastUTSAD is compatible with different UTSAD models, utilizing a sampling and scaling lawbased heuristic method to automatically determine the number of training samples a UTSAD model needs to achieve predictive performance close to that when all samples in the training set are used. Comprehensive experimental results show that, for the nine popular UTSAD algorithms tested, FastUTSAD reduces the number of training samples and the training time by 91.09–91.49% and 93.49–93.82% on average without significant decreases in accuracy.
Introduction
With the rapid advancement of sensor [2] and Internet of Things (IoT) [14] technologies, large volumes of timeseries data are generated at an unprecedented speed. Such big timeseries data has been widely used to assist realtime decision making in many areas, including IT operations [64], finance [4], and healthcare [28]. For example, cloud service providers collect a series of key performance indicators (KPIs), such as CPU utilization, memory usage, and network I/O, to analyze and optimize the performance and health of their servers. As another example, wearable sensors continuously monitor heart rates, blood pressure, and other measures of body conditions, which can be analyzed to provide actionable information on the health and wellbeing of patients.
Among various problems with time series data, anomaly detection plays a central role due to its prevalence and importance in industrial applications. Specifically, timeseries anomaly detection (TSAD) aims to identify unexpected patterns that do not follow the expected behavior from a series of data points observed over time. Those unexpected patterns, or anomalies, typically signify unusual events, such as attacks in enterprise networks [54], structural defects in jet turbine engineering [58], seizures in brain activities [31], and ecosystem disturbances in earth sciences [16]. Accurate anomaly detection can therefore trigger prompt warnings and troubleshooting, helping to avoid potential losses. As such, to detect different types of anomalies from timeseries data in various domains, numerous TSAD algorithms have been proposed over the last decades. For example, [48] reported 158 different methods to detect time series anomalies, ranging from statistical analysis, signal processing, and data mining to deep learning models.
Despite extensive studies on TSAD in the literature, to the best of our knowledge, there have not yet been many explorations on their efficiency, that is, how to build a TSAD model with high accuracy using fewer computational resources. Generally, existing studies follow the common assumption that more training samples (typically sliding windows in the time series for TSAD) lead to better predictive performance, known as scaling laws [27]. Meanwhile, building a largescale machine learning (ML) model for TSAD also incurs high costs, especially when computational resources are limited. According to our experimental results, building a deep learning model, e.g., LSTMAD [37], on the electrocardiogram (ECG) [40] dataset with more than 200k training samples takes nearly 1 week using an Nvidia RTX A6000 GPU (with batch size 128, subsequence length 64, and number of epochs 50). Furthermore, training a classic ML model such as LOF [11] on the same ECG dataset takes about 11 h using a server with 16 CPU cores at 2.40 GHz. Consequently, they fail to support realtime decision making when time series are generated rapidly. Improving the efficiency of TSAD models becomes imperative for their industrial implementation. Toward this end, we aim to improve the data efficiency of the TSAD models, that is, the tradeoff between the size of the training set and the accuracy of the output model, as fewer data naturally leads to higher time efficiency and lower computational resource consumption.
In this paper, we systematically investigate the data efficiency of Univariate Time Series Anomaly Detection (UTSAD) models, for which the data points in the time series consist of only one variable without intervariable dependencies and correlations. In addition, we focus on the task of detecting subsequence anomalies over sliding windows in time series. We first benchmark nine popular UTSAD methods in different areas (data mining, classic ML, and deep learning) with different learning paradigms (unsupervised and semisupervised) on univariate time series randomly sampled from three datasets with various anomaly ratios and average subsequence lengths. Based on the benchmark results (see Fig. 1 for details), we obtain three key observations. First, the accuracy of the output model initially improves when the number of training samples increases but then becomes stable or even decreases. Second, the accuracy of the output model built on only a small fraction of sliding windows can be very close to that of all sliding windows. Third, in most cases, the time to build a model with little loss of accuracy on the “small” data is much less than the time to build a model on the full “big” data. These findings demonstrate that there is much room to improve the data efficiency of the UTSAD methods, thus providing strong support for our motivation.
Based on the above experimental observations, we propose a novel and generic framework called FastUTSAD to improve the data efficiency and reduce the computational overhead of different UTSAD models with little loss of accuracy. Specifically, instead of feeding all training data to the UTSAD model at once, FastUTSAD trains the model in multiple stages, each of which only inputs a small fraction of the training set. Then, inspired by scaling laws [27], FastUTSAD utilizes a heuristic method, based on the observation that the accuracy of the model remains nearly unchanged when too much data is used, to automatically and adaptively determine the number of sliding windows the UTSAD model requires. In this way, once the model performance becomes stable, FastUTSAD will terminate the incremental training procedure and output the model trained only on the sampled data.
Finally, we conduct extensive experiments among eight benchmark datasets and nine popular UTSAD methods. The results indicate that our FastUTSAD framework exhibits much higher data efficiency than existing methods. For all the UTSAD methods that we test, FastUTSAD reduces the number of training samples and the training time by 91.09–91.49% and 93.49–93.82% on average without significant decreases in different accuracy measures.
The main contributions of this paper are summarized as follows:

We experimentally explore the relationship between the training sample size, training time, and accuracy of different UTSAD algorithms on large benchmark datasets, thus finding a chance to improve their data efficiencies.

We propose a novel FastUTSAD framework that improves data efficiency and reduces computational overhead for UTSAD tasks at little expense of accuracy. In the FastUTSAD framework, we design a multistep continual training (MCT) strategy to automatically determine the appropriate training data size for a given dataset and a UTSAD model to achieve the desired performance.

We perform comprehensive experiments on eight benchmark datasets to confirm the superior performance of FastUTSAD for the tradeoff between data efficiency and model accuracy.
Paper Organization: The rest of this paper is organized as follows. Section Related work reviews the related work. Section Background introduces the basic concepts and formally defines the UTSAD problem. Section Experimental observations on data efficiency of UTSAD methods provides our key observations from an experimental study on the relationship between model performance and the number of training samples. Section The FastUTSAD framework presents the FastUTSAD framework in detail. Section Experimental evaluation for FastUTSAD demonstrates the experimental results to evaluate the performance of FastUTSAD. Section Conclusion concludes this work and indicates possible future directions.
Related work
Univariate time series anomaly detection
Numerous algorithms have been proposed for the problem of univariate time series anomaly detection (UTSAD). We broadly categorize existing UTSAD algorithms into three types: unsupervised, semisupervised, and supervised methods, based on whether and how the training sliding windows are labeled. We refer interested readers to [13, 17, 18, 44, 48, 64] for extensive surveys on UTSAD methods. Next, we briefly discuss algorithms of each kind separately.
Unsupervised UTSAD methods
Methods of this type do not require training sliding windows to be labeled and are thus widely applicable in different scenarios. They implicitly assume that anomalous instances of the time series can be distinguished from their normal counterparts since they are scarce and generated from a different distribution. Generally, they consider using different measures to assign an anomaly score to each instance and find those with the highest anomaly scores as anomalies. Histogrambased Outlier Score (HBOS) [22] is a histogrambased anomaly detection algorithm. It models the densities of instance features using histograms with a fixed or dynamic bin width and then computes the anomaly score of each instance by computing how likely the instance is to fall within the histogram bins for each dimension. The Local Outlier Factor (LOF) [11] is a local densitybased method that measures the degree to which a data point is isolated by comparing it with its neighbors. LOF finds data points with lower local densities than their neighbors as anomalies. Isolation Forest (IForest) [56] is a treebased method for anomaly detection. Its basic idea is that anomalous instances are easier to separate from the rest of the samples. As such, it recursively generates random partitions on the samples and organizes them hierarchically in a tree structure, where an instance closer to the root of the tree is assigned a higher anomaly score. The Deep Autoencoding Gaussian Mixture Model (DAGMM) [65] is a deep learning method for anomaly detection based on reconstruction, which assumes that anomalies cannot be effectively reconstructed from lowdimensional projections. DAGMM utilizes the autoencoder to reconstruct the input data and the Gaussian Mixture Model (GMM) for density estimation.
Semisupervised UTSAD methods
Methods of this type assume that only the normal class of time series is labeled and build models to identify normal patterns. Consequently, new instances will be recognized as abnormal if they diverge from the expected patterns. The oneclass support vector machine (OCSVM) [49] is a classic support vector method that identifies the boundary of normal data and regards instances outside the boundary as anomalies. Autoencoder (AE) [47] is a neural networkbased method that performs a nonlinear dimensionality reduction to detect subtle anomalies in which the linear principal component analysis fails. AE is to detect anomalies by learning the representation of normal patterns and computing the reconstruction errors as anomaly scores. The variable autoencoder (VAE) [5, 29] is a deep generative Bayesian network, a.k.a. probabilistic encoders and decoders, with the latent variable and the observed variable. VAE computes the reconstruction probability [5] between the input and output as anomaly scores. DeepAnT [42] is a convolutional neural network (CNN) model that uses the concept of forecasting. It uses a CNN to predict the next value of l, where l is the prediction length. Then, the predicted errors between the real values and the predicted values are seen as anomaly scores. LSTMAD [37] is a forecasting model that trains a Long ShortTerm Memory (LSTM) network on nonanomalous data to forecast future values and uses the prediction error as an indicator of anomalies. The main difference between semisupervised and unsupervised methods is that semisupervised methods represent normal patterns based on labeled normal instances, whereas unsupervised methods depend only on the data distribution.
Supervised UTSAD methods
Methods of this type consider that the training sets contain labeled instances of normal and abnormal classes and build predictive models to distinguish differences between the two classes, which are then applied to unseen instances for prediction. Opprentice [33] is an ensemble method in which multiple existing detectors are used to extract anomaly features and a random forest classifier is applied to automatically select the appropriate combinations of detector parameters and thresholds. RobustTAD [21] is a time series anomaly detection framework that combines time series decomposition and CNNs to handle complicated anomaly patterns. TCQSA [34] is a generic time series anomaly detection method consisting of a twolevel clusteringbased segmentation algorithm and a hybrid attentional LSTMCNN model. Random Forest (RF) [10] is a treebased ensemble method that fits several decision trees on different subsamples of the subsequence and uses averaging to improve predictive accuracy and reduce overfitting. It is widely used for UTSAD, such as [33, 36, 48]. [36] proposed a novel framework that supports anomaly detection in uncertain data streams based on effective period pattern recognition and feature extraction techniques. However, supervised methods are very restricted for UTSAD because labeled instances are often unavailable [48]. Therefore, most of the recent literature, such as [9, 44, 48, 51], mainly focuses on semisupervised and unsupervised methods for UTSAD.
Despite the extensive methods for UTSAD in the literature, to the best of our knowledge, they have paid little attention to data efficiency. In this work, we propose to improve the data efficiency for two of the three types of UTSAD methods in a generic framework. We do not consider supervised algorithms because labeled instances are often unavailable in reallife scenarios [48].
Data efficiency of ML models
The data efficiency of ML models has been extensively studied in the literature. Existing studies on data efficiency exploit the relationship between accuracy and training data size for many ML models and problems, including CNNs [25], metalearning [3], graph neural networks (GNNs) [62], and reinforcement learning [61]. Their results mostly indicate that using more training samples can lead to better model performance to some extent. However, excessive training samples bring little gain in accuracy, but, on the contrary, incur unnecessary computational costs. Note that no prior work on data efficiency is specific to UTSAD models, to the best of our knowledge.
A fundamental approach to improving the data efficiency of ML models is to perform sampling on the training set. There are three main types of sampling methods for ML tasks [46]: uniform, importance, and grid. Uniform sampling methods [32], which include each sample in the training set with equal probability, are the simplest and most common method that works on a wide spectrum of ML tasks. Importance sampling methods, often based on the notion of coresets [26], extract a small subset of samples based on their importance for the given ML problem. Grid sampling methods [1] are to divide the input space into small groups, extract the representatives from each group, and then weight them by the number of input points in the group. Although these methods have been applied to many ML tasks, as far as we know, we are the first to study the data efficiency of UTSAD.
Background
This section provides an overview of the fundamental concepts considered in this paper, followed by a formal definition of the problem under investigation.
Univariate time series
Time series data can be divided into univariate time series (UTS) and multivariate time series. In this work, we focus on the UTS data. UTS data refers to a series of observations at a single variable, which are usually collected at regular time intervals, such as 1 min. Formally, a UTS is denoted as \(D = D^n = \{d_1, d_2, \dots , d_n\}\), where n is the length of the UTS and \(d_i\) is an observation at timestamp i. Each data point \(d_i\) has a label (0 or 1) that indicates whether the data point is an anomaly or not. By default, we use “0” for a normal data point and “1” for an anomaly data point. Therefore, a data point in UTS can be represented as a triple \(d_i = (t_i, v_i, l_i)\), where \(t_i\) is the timestamp at which \(d_i\) is observed, \(v_i\) is the observation value, and \(l_i\) is the label for \(d_i\). For example,
is a UTS with five data points. In this example, the first data point \(d_1 = (\text {12:30}, 0.5, 0)\) is observed at \(\text {12:30}\) with a value of 0.5 and labeled as 0 (i.e., normal), and the third point \(d_3 = (\text {12:32},0.9,1)\) is the only anomaly data point observed. The time interval for this UTS is \(t_{i}  t_{i1} = 1\).
Sliding window
A sliding window is a consecutive subsequence of length l from a UTS D. We use \(W_{D}\) to denote all sliding windows generated from D with length l and time step \(t_{step}\). Formally, a sliding window \(w_i\) is defined as \(w_i = \{d_{il+1},d_{il+2}, \dots , d_i\} \in W_{D^n}\) (\(l \le n\)), where l is the length of the sliding window and n is the length of the UTS D. Taking the UTS D in Eq. 1 as an example, the first sliding window contains the first three data points in D when the window size \(l=3\) and the time step \(t_{step}=1\), i.e., \(w_3=\{d_1,d_2,d_3\}\) (\(w_1, w_2\) are omitted because they do not have enough points). The second and third sliding windows are \(w_4=\{d_2,d_3,d_4\}\) and \(w_5=\{d_3,d_4,d_5\}\). Therefore, \(W_D\) consists of three sliding windows: \(w_3\), \(w_4\), and \(w_5\). The sliding window is an effective method for processing UTS data and has been widely used in the literature [15, 24, 41, 55, 59, 60]. It captures the correlation between the data point \(d_i\) and neighboring data points. In addition, it allows us to divide a UTS into multiple subsequences with fixed length l and time step \(t_{step}\) as input for ML algorithms.
Rangebased anomaly detection
We need to convert point labels to window labels when using sliding windows for semisupervised anomaly detection since anomaly windows should be removed. In this work, following the existing literature [15, 41, 55, 59, 60], we consider a sliding window to be abnormal if it contains any anomaly data points. Also taking the UTS D in Eq. 1 as an example, the first window \(w_3=\{d_1,d_2,d_3\}\) is anomalous since \(d_3\) is an anomaly data point. Formally, if \(\exists d_j \in w_i\) is an anomaly, then \(w_i\) is labeled as an anomalous sliding window.
We consider detecting anomalies based on ranges, that is, anomalies that occur consecutively over a period of time [52]. For example, if the window labels of a UTS are \(\{a,b,c,d,e\}=\{1,0,1,1,0\}\), then the UTS contains two range anomalies instead of three (there are three 1’s in the labels, but only two consecutive ranges), where the first range anomaly only contains the first value a, and the second range anomaly consists of the third and fourth values \(\{c,d\}\). The normal labels act as a separator for dividing the original continuous labels into multiple subintervals. Each subinterval with continuous anomalies indicates a range anomaly.
Problem formulation
Our goal in this paper is to improve the data efficiency of a specified UTSAD algorithm. We treat the problem of improving data efficiency as identifying the minimum set of sliding window training \(W_D^* \subseteq W_D\), where \(W_D\) is the set of all sliding windows generated from D, on which the trained model can achieve a performance close to that trained on \(W_D\). Given a UTS D and an anomaly detection algorithm \(\mathcal {A}\), the problem is formalized as follows:
where \(W'_D\) is a subset of training sliding windows sampled from \(W_D\), \(Acc_{\mathcal {A}(W_{D})}\) and \(Acc_{\mathcal {A}(W'_D)}\) are the accuracy measures of \(\mathcal {A}\) trained on the original and sampled sets of sliding windows, and \(\theta \in (0, 1)\) is the threshold to control the accuracy loss. Note that we evaluate the accuracy of an algorithm at a UTS level and perform hypothesis testing at a dataset level. Because for a specific UTS D, the number of accuracy values (\(Acc_{\mathcal {A}(W_D)}\) and \(Acc_{\mathcal {A}(W'_D)}\)) is not enough to perform hypothesis testing. For a dataset consisting of m individual UTS \(D_{1}, \ldots , D_{m}\), we obtain m different (original) models \(\mathcal {A}(W_{D_1}), \ldots , \mathcal {A}(W_{D_m})\) by running \(\mathcal {A}\) on all windows of each UTS. Meanwhile, by solving Eq. 2, we also obtain m models \(\mathcal {A}(W^*_{D_1}), \ldots , \mathcal {A}(W^*_{D_m})\) by running \(\mathcal {A}\) on the windows sampled from each UTS. We use Welch’s ttest to decide whether the accuracy significantly decreases between \(\mathcal {A}(W_{D_1}), \ldots , \mathcal {A}(W_{D_m})\) and \(\mathcal {A}(W^*_{D_1}), \ldots , \mathcal {A}(W^*_{D_m})\) since the variances of the accuracy measures are often not available. The null hypothesis is that the accuracy measures \(Acc_{\mathcal {A}(W^*_{D_1})},\ldots , Acc_{\mathcal {A}(W^*_{D_m})}\) on \(\mathcal {A}(W^*_{D_1}),\) \(\ldots , \mathcal {A}(W^*_{D_m})\) are not less than the accuracy measures \(Acc_{\mathcal {A}(W_{D_1})},\) \(\ldots , Acc_{\mathcal {A}(W_{D_m})}\) on \(\mathcal {A}(W_{D_1}), \ldots , \mathcal {A}(W_{D_m})\). If the pvalue of Welch’s ttest is larger than or equal to the significant level (e.g., 0.01), we say that the hypothesis is supported.
Experimental observations on data efficiency of UTSAD methods
In this section, we will start by examining the UTSAD approaches that are under experimental evaluation. Subsequently, the accuracy measures, datasets, and fundamental settings of our experimental investigation are presented. Finally, we present the primary observations on the data efficiency of existing UTSAD approaches.
Review of UTSAD methods
Local outlier factor (LOF)
LOF [11] is an unsupervised densitybased method for anomaly detection. It computes the local density deviation of a given data instance (a sliding window \(w_i\) in our context) with respect to its neighbors. The data point will be regarded as an anomaly if its local density is lower than that of its neighbors. On the contrary, the local density of a normal data instance is typically similar to that of its neighbors.
LOF follows the following two steps to detect anomalies: 1) Calculate the Local Reachability Density (LRD) of a data instance x in its kneighborhoods as Eq. 3.
where k is the number of neighbors, \(N_k(x)\) is the set of knearest neighbors of x, and \(r_k\) is the reachability distance of the data instance x. The reachability distance is used to reduce statistical fluctuations in the assessment of the anomaly score. 2) Calculate the anomaly score \(s_{\text {LOF}}\) as the average of the ratio between the LRD of x and those of the knearest neighbors of x by Eq. 4.
where \({\text {LRD}}_k(\cdot )\) is the local reachability density of a data instance computed by Eq. 3. It is easy to see that the lower \({\text {LRD}}_k(x)\) is and the higher \(\sum _{o \in N_k(x)} {\text {LRD}}_k(o)\) is, the higher \(s_{\text {LOF}}(x)\) is. For LOF and all remaining methods, a higher anomaly score indicates a high probability of being an anomaly.
Histogrambased outlier score (HBOS)
HBOS [22] is an unsupervised histogrambased method for anomaly detection. It creates a histogram for each dimension (feature) and calculates the anomaly score based on the density estimation of each feature on the corresponding histogram bins. Then, the anomaly score is used to determine whether a data instance is an anomaly.
The HBOS anomaly detection procedure is presented as follows: 1) Compute an individual histogram for each of the d dimensions (features), where the height of each single bin represents a density estimation. 2) Normalize each histogram to ensure an equivalent weight of each dimension in the outlier score. 3) Calculate the HBOS score of an instance x using the corresponding height of the bins where the instance is located by Eq. 5.
where d is the number of features of data x (the window length l in our context). The score \(s_{\text {HBOS}}(x)\) is a multiplication of the inverse of the estimated densities, assuming the independence of each feature.
Isolation Forest (IForest)
IForest [35] is an unsupervised treebased method for anomaly detection. IForest uses the concept of isolation instead of measuring distance or density to detect anomalies and assumes that an anomaly can be isolated in a few steps. A short path indicates that a data instance is easy to isolate because its attribute values are significantly different from other values. IForest builds multiple trees by performing recursive random splits on attribute values. Then, the anomaly score of a given data instance x is defined as the average path length from a particular sample to the root in Eq. 6.
where h(x) is the length of the path for a data point from its leaf to the tree root, E(h(x)) is the average of h(x) from a collection of the build trees, c(n) is the average path length of an unsuccessful search in a binary search tree, and n is the number of the original data instances. According to Eq. 6, the lower the path to the root of x, the higher the anomaly score of x.
Oneclass support vector machine (OCSVM)
OCSVM [49] is a semisupervised SVMbased method for anomaly detection. It learns a decision boundary (a.k.a. hyperplane) from the normal data instances. Then, data instances outside the boundary are considered anomalies. The distance of an instance from the boundary is used to compute an anomaly score. The larger the distance (outside the hyperplane), the higher the anomaly score. OCSVM solves the following quadratic problem to learn a decision boundary around normal instances:
where N is the number of training samples, \(\Phi (\cdot )\) is a nonlinear feature map that can be computed by evaluating a kernel function \(k(\cdot , \cdot )\), \(\xi\) is a parameter to avoid overfitting, \(\omega\) and \(\rho\) are the parameters to define the hyperplane, \(\nu \in [0,1]\) controls the tradeoff between the training classification error and the margin maximization in one class. Once \(\omega\) and \(\rho\) are obtained, the anomaly score is computed by
where \(a_i, i \in \{1,2,\dots , n\}\) is the Lagrange parameter and \(k(\cdot )\) is a kernel function. We use the RBF kernel \(k_\gamma (x, y) = \exp (\gamma \Vert xy\Vert ^2)\) in our experiments. To ensure that the notion of an anomaly score is consistent, we assign the original decision score a negative sign. The modified \(s_{\text {ocsvm}}(x_i)\) represents the signed distance to the separating hyperplane, where a positive value is given for an outlier and a negative value for an inlier.
AutoEncoder (AE)
AE [39] is a semisupervised reconstructionbased method for anomaly detection. AE tries to learn the normal pattern of the given data to minimize the reconstruction errors for normal instances. Then, new data instances with large reconstruction errors are considered anomalies. AE consists of two components: an encoder network and a decoder network. The encoder \(f(\cdot )\) compresses the input data x into a lowdimension representation z, and then the decoder \(g(\cdot )\) projects z to the original dimension as the output \({x'}\), where \(x^\prime =g(f(x))\). The objective function of AE is given in Eq. 9.
where N is the number of training samples and \(\theta\) is the parameter of AE. In Eq. 9, the mean squared error (MSE) is used to measure the reconstruction error. Other error measures, such as the mean absolute error (MAE), cross entropy, and binary cross entropy, can be used in place of the MSE. We also note that some variants of AE [39] introduce additional regularization terms to Eq. 9 to improve robustness. In this work, we use MSEbased AE without regularization following existing UTSAD methods [44, 63]. After solving the parameters \(\theta\), the anomaly score can be calculated from the reconstruction error in Eq. 10.
Variational autoencoder (VAE)
VAE [5, 29] is also a semisupervised reconstructionbased method for anomaly detection. Its main component is a deep generative Bayesian network, a.k.a. probabilistic encoders and decoders, with the latent variable z and the observed variable x. The idea of VAE for anomaly detection is similar to that of AE, which first trains a VAE using normal data and then considers a new data instance with a large deviation as an anomaly. VAE consists of a probabilistic encoder network and a decoder network, with the latent variable z and the observed variable x (i.e., data instances). The generative process of VAE starts with a variable z with a prior distribution p(z). Then, the decoder network g is applied to z and then outputs \({x'}\) with distribution \(p({x'}g(z))\). Variational inference techniques are applied to perform posterior inference of q(zx). The posterior inference aims to train a separated distribution q(zf(x)) to approximate q(zx) by the encoder network f. The overall objective of VAE is to maximize the likelihood of the observation x in Eq. 11.
where p(z) is the prior distribution and p(xz) is the posterior distribution of variable x. Since sampling from z is a stochastic process, the gradients may not be correctly estimated. To overcome this issue, the reparameterization trick is introduced to maximize the evidence lower bound through variational inference for each observation x. Thus, Eq. 11 can be rewritten as Eq. 12.
where \(\text {KL}(\cdot )\) is the KL divergence. Since \(\text {KL}(\cdot ) \ge 0\), we only need to maximize \(L_b\). Thus, VAE uses q(zx) to approximate p(zx) by the encoder network f. After p and q are solved, the reconstruction probability [5] is computed as the anomaly score, i.e.,
where L is the number of samples from z, \(\mu _{\hat{x}}, \sigma _{\hat{x}}\) are the distribution sampled from z.
Deep autoencoding Gaussian mixture model (DAGMM)
DAGMM [65] is an unsupervised method that combines AE with the Gaussian mixture model (GMM) to detect anomalies. DAGMM consists of an AEbased representation network to generate a lowdimensional representation of input data instances and a GMMbased estimation network to compute reconstruction errors and anomaly scores. The objective function of DAGMM is given in Eq. 14.
where \(L(x_i,x_i^{\prime })\) is the reconstruction error, \(E(z_i)\) is the probabilities of observed input samples, N is the number of input instances, \(P(\hat{\Sigma })\) is a penalty term to tackle the singularity of GMM, \(\lambda _1\) and \(\lambda _2\) are hyperparameters. Once the unknown parameters are solved, the anomaly score is given by
where \(\hat{\phi }_k,\hat{\mu }_k, \hat{\Sigma _k}\) are the mixture probability, mean, and covariance for the kth component in GMM, and \(\varphi (x)\) is the features (combining \(z_c\) and \(z_r\)) generated by the compression network.
LSTMAD
LSTMAD [37] is an LSTMbased semisupervised model for anomaly detection. LSTMAD trains an LSTM network on normal data instances to forecast future values and uses the prediction error as an indicator of anomalies. For a new data instance, a higher prediction error implies a more likely anomaly. LSTMAD detects anomalies in the following steps. First, it uses a stacked LSTM network trained on normal data to predict the next l value, where l is the prediction length. The prediction errors are then fed to a multivariate Gaussian distribution to compute the anomaly score. In this work, we use prediction errors as anomaly scores. The anomaly score is given by
where \(\varphi (x)\) is the last value of the input data instance x (x is a sliding window in our context) and \(g(\cdot )\) is the value predicted by LSTMAD.
DeepAnT
DeepAnT [42] is a semisupervised CNNbased model for anomaly detection. It uses a convolutional neural network (CNN) to predict the next l value, where l is the prediction length. The objective function of DeepAnT is given by
where \(f(x_i)\) is the actual value of the sliding window and \(f(x_i^{\prime })\) is the predicted value of the sliding window associated with \(x_i\). Then, the predicted value is fed to an anomaly detector, which uses the Euclidean distance to compute the anomaly score:
where \(\varphi (x)\) is the last value of the input data instance x (x is a sliding window in our context) and \(g(\cdot )\) is the value predicted by DeepAnT.
Accuracy measures
In the UTSAD task, the accuracy metrics in most existing studies [8, 21, 43, 44, 48, 57] are computed by the anomaly score. Following the existing literature, we calculate a score for each corresponding sliding window \(w_i\), where a higher score means a high likelihood that the window is an anomaly. Furthermore, we evaluate the accuracy metrics of a UTS method based on the predicted window labels and their corresponding real window labels. Based on the anomaly score, several accuracy measures have been proposed to quantitatively assess the effectiveness of the UTSAD methods. Subsequently, we review these measures and present some of their limitations.
Precision, recall, and Fscore
Let P and N be the number of actual positive and negative windows, and TP, FP, TN, and FN be the number of true positive, false positive, true negative, and false negative classification results. The precision and recall of a classification method are defined as:
However, one limitation of these two measures is that high precision may cause low recall and vice versa. To overcome the shortcomings of precision and recall, the Fscore is proposed. The FScore is calculated as the harmonic mean of precision and recall:
Rangeprecision, rangerecall, and rangeFScore (RF)
The original precision, recall, and FScore are designed primarily for point anomalies. However, time series anomalies are rangebased, which means that they occur over a period of time [52]. In other words, the three measures suffer from the inability to represent rangebased time series anomalies. To alleviate the shortcomings of traditional measures, [52] expanded the wellknown definitions of precision and recall to measure ranges instead of points. They proposed two measures, RangePrecision and RangeRecall, accordingly. Note that although the point labels have been converted to window labels in our study, the converted window labels are also a continuous sequence. The RangeRecall is defined as
Here, R is a set of real anomaly ranges \(R = \{R_1, \dots , R_{N_r}\}\) and P is a set of predicted anomaly ranges \(P = \{P_1, \dots , P_{N_p}\}\). In our context, \(R_i\) and \(P_i\) are subsequences of the real window labels and the predicted window labels, respectively. \(N_r\) and \(N_p\) are the total numbers of the real and predicted anomaly ranges. \(\text {ER}(\cdot , \cdot )\) indicates whether a true anomaly range is detected; \(\text {OR}(\cdot , \cdot )\) evaluates the overlap between the real and predicted anomaly ranges in three aspects: (1) \(\omega\) measures how many points in \(R_i\) are detected; (2) \(\delta\) measures the position bias between the real and predicted anomaly ranges; and (3) \(\text {CF}(\cdot , \cdot )\) measures the number of fragmented regions corresponding to a given anomaly range \(R_i\). A typical choice for \(\text {CF}(R_i, P)\) is \(\frac{1}{x}\), where x is the number of distinct overlapped ranges. Finally, the value of \(\alpha\) controls the tradeoff between \(\text {ER}(\cdot , \cdot )\) and \(\text {OR}(\cdot , \cdot )\). Similarly, RangePrecision is defined as
Finally, RangeFScore (RF) is defined as
AUC ROC
The above measures depend on a predefined threshold in the anomaly score to decide whether a window is anomalous. Typically, the threshold \(\theta\) is set to \(\mu + 3\sigma\) according to [7], where \(\mu\) is the mean and \(\sigma\) is the standard deviation. Given a threshold \(\theta\) and an anomaly score \(s_i\) of the sliding window \(w_i\), one can decide whether a sliding window \(w_i\) is an anomaly or not:
However, thresholdbased measures are sensitive to the threshold value. To remove the effect of the threshold, the Area Under the Receiver Operating Characteristics Curve (AUC ROC) [20] is introduced. AUC ROC is defined as the area under the curve that represents the relationship between the true positive rate (TPR) on the yaxis and the false positive rate (FPR) on the xaxis as we vary the anomaly score threshold. The area under the curve is calculated by the trapezoidal rule. Let \(\Theta =\{\theta _0,\theta _1,\dots ,\theta _N\}, \theta _i< \theta _j, i<j; \theta _i \in [0,1]\) be a set of thresholds. Then, AUC ROC is defined as
AUC PR
Since the AUC ROC may be excessively optimistic when applied to unbalanced samples, another AUCbased metric, the Area Under the PrecisionRecall Curve (AUC PR) [19], is introduced. The AUC PR is defined as the area under the curve corresponding to the recall on the xaxis and precision on the yaxis when we vary the anomaly score threshold. The area under the curve is calculated by the trapezoidal rule. Let \(\Theta =\{\theta _0,\theta _1,\dots ,\theta _N\}, \theta _i< \theta _j, i<j; \theta _i \in [0,1]\) be a set of thresholds, AUC PR is defined as
VUS ROC and VUS PR
As mentioned above, rangebased measures are better than pointbased measures in the UTSAD task. However, AUC ROC and AUC PR [43] are pointbased metrics. To address this issue, [43] proposed the Volume Under the Surface (VUS) for the ROC and PR curves: AUC ROC and AUC PR. AUCbased measures are robust to lag, noise, anomaly cardinality ratio, and high separability between accurate and inaccurate AD methods [43]. Let \(\Theta =\{\theta _0,\theta _1,\dots ,\theta _N\}, \theta _i< \theta _j, i<j; \theta _i \in [0,1]\) be a set of thresholds, \(\mathcal {L}=\{\ell _0, \ell _1, \dots , \ell _l \}, \ell \in [0,1]; \ell _i< \ell _j, i<j\) be the buffer length introduced based on the idea that there should be a transition region between the normal and abnormal subsequences. The \(\ell\) is used to accommodate the false tolerance of the labeling in the ground truth. Then, the VUS ROC is defined as
Here, \(\text {TPR}_\ell\) and \(\text {FPR}_\ell\) are rangebased versions of \(\text {TPR}\) and \(\text {FPR}\). Similarly, VUS PR is defined as
Here, \(\text {Recall}_\ell\) and \(\text {Precision}_\ell\) are the rangebased measures corresponding to traditional recall and precision in the AUC PR.
Key observations
To evaluate the data efficiency of different UTSAD methods, we first observe the relationship between the number of sliding windows sampled \(n_{sw}\) for training and their performance (accuracy and training time). We evaluate all UTSAD methods in section Review of UTSAD methods, as summarized in Table 1. We evaluate these methods on 45 randomly selected time series from three benchmark datasets, namely YAHOO, SMD, and DAP in Table 2, which span two different domains: YAHOO and SMD include KPIs of computer servers, while DAP consists of healthcare data. The selected datasets also vary in length and anomaly rates, with YAHOO having an average length of 1.54k and an anomaly rate of 0.45%, SMD having an average length of 25.37k and an anomaly rate of 3.52%, and DAP having an average length of 20.3k and an anomaly rate of 10.37%. A more detailed description of the datasets is provided in section Experimental setup. The accuracy and training time of each method on all sliding windows are shown in Table 3 and these results by varying the number of sliding windows sampled are shown in Fig. 1. Based on the results, we draw three main observations.
Observation 1: As the amount of data increases, most accuracy metrics initially improve and then tend to stabilize or decrease. In other words, a method generally becomes more accurate when more training windows are fed, but the marginal utility of the extra windows gradually diminishes. Taking the AE model as an example, the VUS ROC shows a growing trend with increasing \(n_{sw} \in [2,16]\), where \(n_{sw}\) is the number of sliding windows used to train the model. However, the VUS ROC does not increase when \(n_{sw} \in [16, 10240]\). A counterintuitive fact is that only 16 sliding windows can train an AE model with good accuracy. This can be explained by the UTS in Fig. 2: two sliding windows (\(n_{sw}=2\)) can be used to train a good semisupervised model because they have represented typical regular patterns of the UTS, which the AE model needs to learn the normal pattern of the UTS.
Observation 2: The accuracy of the model obtained with a small number of sliding windows is essentially equal to the accuracy obtained with all sliding windows. Taking AE as an example, the VUS ROC value has reached 71.90 when \(n_{sw} =16\), which is higher than the VUS ROC value of 69.62, as shown in Table 3. In other words, one can train a model with a small number of sliding windows instead of all. This finding can be verified across all nine methods with different \(n_{sw}\), as shown in Fig. 1.
Observation 3: In most cases, the time to train a model on a small number of sliding windows is much shorter than the time to train it on full sliding windows. Also, taking AE as an example, the training time on a few sliding windows is much less than the full training time of 23.88 h (as shown in Table 3). Combining the above observations, we can reduce the number of training data instances and time and improve the data efficiency if we develop a strategy to find the smallest \(n_{sw}\) (denoted by \(n_{sw}^*\)) for each algorithm and UTS. Here, we use \(W_D^*\) to denote the smallest training windows by the strategy and \(\text {len}(W_D^*)=n_{sw}^*\), where \(\text {len}(W_D^*)\) is the number of sliding windows in \(W_D^*\).
Remark: The above three observations clearly imply great opportunities to improve the data efficiency of UTSAD methods. However, there are four challenges to address for such improvements. First, the accuracy increase is not always continuing, that is, the accuracy can decrease with increasing \(n_{sw}\). Second, different accuracy measures can show different trends, that is, there exist various accuracy measures (e.g., VUS ROC, VUS PR, and RF in our evaluation) and they can show different trends when the number of sliding windows for training increases. Third, anomaly detection methods and datasets vary with each other, that is, different combinations of algorithms and datasets have different appropriate \(n_{sw}^*\). Fourth, most methods have an extremely long training time, as shown in Table 3. In the following, we will elaborate on how to design a generic framework to improve data efficiency and address the above challenges.
The FastUTSAD framework
Overview of FastUTSAD
To solve the problem of finding the smallest training sliding windows \(W_D^*\), we propose the FastUTSAD framework. The basic idea of FastUTSAD involves integrating sampling techniques with scaling laws. This integration can be succinctly described as a multistep continual training (MCT) technique that trains the model \(\mathcal {A}\) iteratively in many steps, as opposed to a single training process with full training sliding windows \(W_D\). In each step, we gradually increase the number of sliding windows \(n_{sw}\) sampled from the original sliding windows for training. We observe an increase in model accuracy \(Acc_{inc}\) (such as VUS ROC, VUS PR, and RF). If \(Acc_{inc}\) is too small (i.e., \(Acc_{inc} < \alpha\) for some prespecified small \(\alpha\)), we will stop the training process and return the best algorithm \(\mathcal {A}^*\) and the smallest training windows \(W_D^*\) found so far. Otherwise, we increase the number of training windows \(n_{sw}\) and then retrain the model \(\mathcal {A}\). In the MCT process, the value of \(\alpha\) reflects the sensitivity of FastUTSAD: A smaller \(\alpha\) means that FastUTSAD will train the model with more training windows, although the increase in accuracy has been small. Figure 3 illustrates the architecture of the FastUTSAD framework. In subsequent subsections, we will describe each component of FastUTSAD in more detail.
Input of FastUTSAD
The input of FastUTSAD consists of three parts: (1) a UTS, (2) an anomaly detection algorithm, and (3) the configuration settings of FastUTSAD. We consider that the UTS contains labels for the accuracy calculation to help decide whether to stop training progress. The algorithm can be any UTSAD algorithm that works on sliding windows. The configuration consists of the sampling method used, the sampling gap \(s_{gap}\), and the stop threshold \(\alpha\). Here, the sampling gap \(s_{gap}\) indicates how many training windows will increase in the next iteration. In practice, we typically set \(s_{gap}\) to a constant, e.g., 256. The stop threshold \(\alpha\) controls when FastUTSAD terminates the training process and is usually set to \(0.1\%\).
Data processing
The first step of the FastUTSAD framework is data processing, which consists of data imputation, data standardization, conversion to sliding windows, and data splitting. First, data imputation fills in the missing value by the mean of the original UTS. Second, the UTS is standardized by mean and standard deviation. Formally, \(v_i^*=\frac{v_i\mu }{\sigma }\), where \(v_i\) is the original value, \(\sigma\) is the standard deviation, and \(v_i^*\) is the standardized value. Third, the standardized UTS is converted to sliding windows with a window size of 64 and a time step of 1. We note that the window size and time step are flexible in FastUTSAD and can be changed to other values. Finally, we use a fivefold crossvalidation to split the dataset into training, test, and validation sets in order to fairly evaluate each algorithm’s accuracy.
Data sampling
The second step of the FastUTSAD framework is data sampling, which draws a subset of sliding windows from the original training sliding windows. We consider the following three sampling methods for FastUTSAD, namely simple random sampling, Latin hypercube sampling, and distance sampling.
Simple Random Sampling: Random sampling is a basic method that randomly and uniformly samples sliding windows with a given size \(n_{sw}\) from \(W_D\) without replacement. It does not assume data distribution and reduces sampling bias [38].
Latin Hypercube Sampling (LHS): Latin hypercube sampling divides the UTS into \(n_{sw}\) equally spaced intervals and randomly draws a sample sliding window from each interval to achieve a uniform distribution over time.
Distance Sampling: Distance sampling refers to sample data based on their distances, which is specific to time series data over sliding windows. Its basic idea is similar to that of stratified sampling: it calculates the distances of instances (i.e., sliding windows) and then divides them into multiple groups of equal distance intervals accordingly. Finally, we randomly draw samples of equal size from each group. The distance calculation takes into account three aspects of each sliding window \(w_i\): the 95% maximum value, the 5% minimum value, and the Euclidean distance from the zero vector. Formally,
where L is the length of the sliding window \(w_i\).
Model training and performance collecting
The third step of the FastUTSAD framework is to perform the training procedure of the UTSAD algorithm with the sampled windows, instead of the original sliding windows. Then, the full test set is used to evaluate the accuracy of the algorithm. Notice that the accuracy is averaged in five folds. Regarding the evaluation of algorithm accuracy, we adopt all 13 accuracy measures as used by [44]. Furthermore, we also collect measures of time efficiency, including training time and data processing time. The training time only contains the time to perform the training procedure, where the test time is excluded. Data processing time consists mainly of sampling time and distance computation time. Data sampling time refers to the time to sample the subset of the original training sliding windows, and distance computation time refers to the time to calculate the distances of all sliding windows using Eq. 30.
Heuristic MCT method
The fourth step of the FastUTSAD framework is to run the heuristic multistep continual training (MCT) strategy, which considers the accuracy measures in the most recent three iterations, i.e., those from iteration \(i  2\) to i in the ith iteration, to determine whether to increase the number of training sliding windows in the next iteration or terminate the training procedure. MCT proceeds in the following subroutines. First, it calculates the increase in accuracy \(Acc_{inc}\) in the ith iteration by comparing the accuracy measures \(Acc_{i1}\) and \(Acc_{i2}\) in the previous two iterations with \(Acc_{i}\). Formally,
In Eq. 31, we consider the accuracy measures in the previous two iterations for two reasons. On the one hand, MCT provides a relatively loose condition for incremental training. On the other hand, MCT increases the robustness of FastUTSAD. According to our observations in the experiments, the accuracy of a UTSAD algorithm might not always improve when more training samples are used. For example, the increment \(Acc_{inc}\) is small between the ith iteration and the \((i  1)\)th iteration, but it becomes large enough between the ith iteration and the \((i  2)\)th iteration. Then, MCT decides whether the training process is continued. If \(Acc_{inc} < \alpha\), we break the training loops and return the best \(\mathcal {A}^*\) and \(W_D^*\) found so far. If \(Acc_{inc} \ge \alpha\), we increase the number of training windows from \(n_{sw_i}\) to \(n_{sw_{i+1}} = n_{sw_i} + s_{gap}\) and continue to train the algorithm with \(n_{sw_{i+1}}\) samples in the next iteration.
Experimental evaluation for FastUTSAD
In this section, we evaluate the performance of FastUTSAD on the eight benchmark datasets in Table 2, as well as compare it with the nine widely recognized UTSAD methods in Table 1. The experimental setup is briefly described at the beginning of the experiment.
Experimental setup
Datasets: To systematically evaluate FastUTSAD, we used the eight datasets in Table 2. These datasets span three different domains: computer systems, healthcare, and social. In addition, they also vary in length and distribution (anomaly rates). For each dataset, we randomly selected 15 UTS as representatives. In total, we selected 115 UTS to evacuate FastUTSAD (note that MGAB contains only 10 UTS). Every point in a UTS is associated with a label of either “normal” (0) or “abnormal” (1). Table 2 summarizes the relevant characteristics of the datasets, including their size, length, and distribution. Here, the length indicates the average size of every UTS in the corresponding dataset. We run each anomaly detection method on each UTS separately. We briefly describe the eight datasets in the following.

SVDB (MITBIH Supraventricular Arrhythmia Database) [23] includes 78 halfhour ECG recordings chosen to supplement the examples of supraventricular arrhythmias in the MITBIH Arrhythmia Database.

DAP (Daphnet) [6] contains the annotated readings of three acceleration sensors at the hip and leg of Parkinson’s disease patients that experience freezing of gait (FoG) during walking tasks.

ECG [40] is a standard electrocardiogram dataset, and the anomalies represent ventricular premature contractions. The long series (MBA_ECG14046) of length \(\sim 10^7\) was divided into 47 series by identifying the periodicity of the signal according to TSBUAD.

OPP (OPPORTUNITY) [45] is a dataset for human activity recognition from wearable, object, and ambient sensors, which is devised to benchmark human activity recognition algorithms (classification, automatic data segmentation, sensor fusion, feature extraction, etc.).

IOPS [12] is a public dataset consisting of 27 key performance indicators (KPIs) for artificial intelligencebased IT operations (AIOps), which was collected from five large internet companies, including Sougo, eBay, Baidu, Tencent and Alibaba.

SMD (Server Machine Dataset) [50] is a 5weeklong dataset collected from a large Internet company, which contains 3 groups of entities from 28 different machines.

YAHOO [30] is a dataset published by Yahoo! Labs consisting of real and synthetic time series based on the real production traffic to some of the Yahoo! production systems.

MGAB (MackeyGlass anomaly benchmark) [53] is composed of MackeyGlass time series with nontrivial anomalies. MackeyGlass time series exhibit chaotic behavior that is difficult for humans to distinguish.
Algorithms: To evaluate the robustness and effectiveness of FastUTSAD, we used the nine popular anomaly detection algorithms as described in Table 1.
Hyperparameters: For each algorithm, most of the hyperparameters are kept default configurations as [44, 48]. The following hyperparameters are changed for better performance. The length of the sliding window is set to 64, and the time step is set to 1 when we convert the UTS to sliding windows. For deep learning methods, the batch size is set to 128 and the epoch is set to 50. Other parameters we have not mentioned here can be found in [44, 48].
Performance Metrics: In this work, we consider three accuracy metrics: VUS ROC, VUS PR, and rangebase F1 score (RF). According to [43], thresholdindependent metrics are more suitable for time series. Therefore, we selected two thresholdindependent metrics: VUS ROC and VUS PR. VUS ROC measures the overall performance of a classification algorithm at different classification thresholds, and VUS PR measures the precision and recall at different thresholds. To fairly assess the accuracy of the model, we also selected a threshold RF metric, and the threshold \(\theta\) is set to \(\mu + 3\sigma\) according to [7], where \(\mu\) is the mean and \(\sigma\) is the standard deviation. Furthermore, we collected metrics for time efficiency, including algorithm training time and data processing time, as in section Model training and performance collecting.
Environment and Implementation: We conducted the experiments on two servers with the following hardware configuration. For nondeep learning methods, we train them on a server with an Intel (R) Xeon (R) CPU E5–2630 v3 (2 sockets, 72 cores) at 2.40GHz and 256GB of memory. We use 64 cores in parallel when training the algorithms. For deep learning algorithms, we train them on a server with eight Nvidia A6000 GPUs. Each GPU runs six jobs in parallel when training those algorithms. We implemented our algorithms using Python 3.8, scikitlearn 1.2.0, TensorFlow 2.9.1, and PyTorch 1.10.0.
Overall results
We demonstrate the performance of FastUTSAD in three aspects: (1) accuracy; (2) data efficiency; and (3) time efficiency. We compare the accuracy measures (VUS ROC, VUS PR, and RF) between the original model (\(Acc_{ori}\)) and FastUTSAD (\(Acc_{fast}\)). We also used statistical hypothesis testing to verify whether the accuracy measures of each method in each dataset have decreased significantly. In detail, we use Welch’s ttest to test \(Acc_{fast}\) and \(Acc_{ori}\) for each method on each dataset since we do not know the variance of the accuracy measures. The null hypothesis is that \(Acc_{fast}\) is not less than \(Acc_{ori}\). Both \(Acc_{ori}\) and \(Acc_{fast}\) are the averaged values among 5folds. Data efficiency shows the smallest training sliding windows \(n_{sw}^*\) (%) found by FastUTSAD. Time efficiency shows how much training time FastUTSAD can reduce. The default parameters of FastUTSAD are set as follows. The sampling method is simple random sampling. The sampling gap \(s_{gap}\) is set to 256 (as described in Section Overall results). Consequently, the training sample sizes \(n_{sw}\) are \(\{256, 512, 768, \dots , N\}\) over iterations, where N is the number of sliding windows in a given UTS. The stop condition \(\alpha\) is fixed to \(0.1\%\). The significance level is set to 0.01. The hyperparameters of the models are discussed in section Experimental setup. The results for the accuracy and data efficiency of different algorithms are shown in Table 4, and their time efficiency is shown in Table 5.
Accuracy: In general, the accuracy of the model returned by FastUTSAD does not decrease significantly compared to the original model. As shown in Table 4, the minimum pvalues for RF, VUS ROC, and VUS PR are 0.16, 0.55, and 0.63, respectively. According to the null hypothesis (\(Acc_{fast}\) is not less than \(Acc_{ori}\)) and \(\forall \text {pvalue} > 0.01\), we do not have enough evidence to reject the null hypothesis. Therefore, we accept the null hypothesis: \(Acc_{fast}\) is not less than \(Acc_{ori}\) at the significant level of 0.01. The main reason why the accuracy of the model does not decrease is that a small number of sliding windows is enough to train a good model.
Furthermore, the average accuracy found by FastUTSAD is greater than the original accuracy of 6.55–17.58% for different accuracy metrics. The FastUTSAD model shows an increase in accuracy compared to the original accuracy for the three accuracy metrics. Specifically, for the RF, \(Acc_{fast}\) is 23.08 on average, which is 17.58% more than the original \(Acc_{ori}\) of 19.63. Similarly, for the VUS PR, \(Acc_{fast}\) is 40.64, which is 6.92% higher than \(Acc_{ori}\) of 38.01. Lastly, for the VUS ROC, \(Acc_{fast}\) is 72.44, which is 6.55% higher than \(Acc_{ori}\) of 67.99. It should be noted that \(Acc_{fast}\) is higher than \(Acc_{ori}\) for all accuracy metrics. The main reason is that more training windows will cause the model to overfit the training set and thus perform poorly on the test set. To explain the relationship between the number of training windows and accuracy, we explore the relationship between the training loss and the test loss on the number of training windows since a lower test loss usually means better accuracy. We randomly sample a UTS from the IOPS dataset and plot the relationship between training/test loss and the percentage of training windows, as shown in Fig. 4. The test loss at 30% of the original windows is the lowest, instead of 100% of the windows. Therefore, we believe that 30% of the training sliding windows can achieve better accuracy than the original sliding windows. In addition, 80% to 100% of the original windows will have a lower training loss but a higher test loss, a.k.a. overfitting.
Data Efficiency: According to the previous section, FastUTSAD can achieve good accuracy on average. Here, we show that FastUTSAD only needs a small number of sliding windows (8.51–8.91% for different accuracy metrics) to train a model with the promised accuracy. In detail, with respect to the accuracy metric RF, FastUTSAD only needs 8.51% on average of the original training sliding windows to train a model, that of 8.91% for VUS PR, and that of 8.70% for VUS ROC. In summary, training a model without significantly losing model accuracy does not need to use all the sliding windows. Instead, a small number of training sliding windows are enough (8.51–8.91%). To clear the distribution of \(n_{sw}^*\), we counted the number of times on different \(n_{sw}^*\) on the selected 9 algorithms and 115 UTS. We show the results in Fig. 5. According to Fig. 5, 85% of the models only need less than 30% of the original windows. A small number of models need more than 30% of the original windows. These models are trained on the YAHOO dataset. Since the average length of YAHOO is small, this ratio becomes higher. See more information in section Effect of data size.
Time Efficiency: According to the above two sections, FastUTSAD has been shown to achieve better model accuracy and reduce the model training sliding windows. Because the sliding windows in model training are reduced, the training time is naturally reduced. As shown in Table 5, we reduced model training from 67.77 h to less than 2.5 h. In summary, FastUTSAD can reduce the model training time by about 93.49–93.82% without significantly losing algorithm accuracy within the significant level of 0.01. Here, we have gained some new insights about UTSAD tasks: (1) More training data do not necessarily lead to better model performance when the training data is huge; (2) Scaling laws may work in classical algorithms.
Effects of sampling methods and gaps
In this section, we show the effect of sampling methods (Random, LHS, and Dist) and sampling gaps (64, 128, 256, 512, and 1024) on model accuracy (VUS ROC, VUS PR, and RF) and training time (hour). These performance metrics are the averaged value over five folds. The results are shown in Fig. 6. According to Fig. 6, the Dist is shown to have poor accuracy on both three accuracy metrics (e.g., VUS ROC). Instead, the random method can achieve the best accuracy on all three measures with a sample gap of 256. It achieves the best accuracy on all three accuracy metrics, and the training time is minimal. The Dist sampling method achieved the worst accuracy because it changed the distribution (anomaly rate) of the anomaly sliding windows. To verify the distribution of the sampling window, we draw 8.5% sliding windows (achieved the best result in section Experimental observations on data efficiency of UTSAD methods) from the 115 \(W_D\) generated from the selected 115 UTS. We then calculated the anomaly rate of the sampled sliding windows. We find that the anomaly rate of the Dist method is 21.84%, which is significantly different from the original anomaly rate of 10.44%. That is, the distribution of the sliding windows has changed by the Dist method. Therefore, Dist performs worse than LHS and random. Instead, the sampling distribution of LHS and random is the same as the original distribution (LHS is 10.44%, random is 10.46%). This could explain why the effect of LHS is often similar to that of random in Fig. 6.
Effect of stop threshold
This subsection evaluates the influence of different stop thresholds \(\alpha\) in FastUTSAD. As shown in Table 6, it can be seen that all accuracy measures decrease with increasing \(\alpha\). Similarly, the computation time \(T_{fast}\) of FastUTS decreases. This means that the smaller the value of \(\alpha\) is, the better FastUTSAD performs. However, the model training time used by FastUTSAD will be large. The reason is that with a larger \(\alpha\), FastUTSAD becomes stricter, that is, requiring a larger \(Acc_{inc}\). And it is easy to stop training after the first three steps. That is, it can stop before finding the optimal sliding windows for training \(W_D^*\). Hence, it exhibits worse accuracy. The value of \(\alpha\) depends on whether the user is concerned with time or accuracy. If the user cares about time, one can choose a larger \(\alpha\) such as 0.1. If the user cares about accuracy, one can choose a smaller \(\alpha\) such as 0.001.
Effect of data size
In this section, we explore how the size of the original training data affects the minimal training sliding windows \(W_D^*\) FastUTSAD found. We computed the mean, median, and standard deviation of \(W_D^*\) on all algorithms and UTS corresponding to the specified dataset. The results of \(W_D^*\) for the different datasets are shown in Table 7. As shown in Table 7, YAHOO has the highest mean, median, and standard deviation, as the original data size is small (1.54k data points). Therefore, \(W_D^*\) will be large. For large datasets (\(\ge 230k\)) such as SVDB, FastUTSAD only needs a small proportion (0.45%) of the original sliding windows to train a model without nearly losing the accuracy of the model. Therefore, \(W_D^*\) will be small.
Conclusion
In this paper, we introduce FastUTSAD, a new framework that improves the data efficiency of UTSAD methods. In detail, FastUTSAD reduces the number of training sliding windows and the training time of existing UTSAD methods without significantly reducing their accuracy. It features a highly adaptable and expandable architecture with different sampling methods and a heuristic MCT method to decide the smallest number of training windows a UTSAD method requires. We evaluate FastUTSAD on nine UTSAD algorithms on eight benchmark datasets. The experiments show that FastUTSAD reduces the training data and the training time by about 91.09–91.49% and 93.49–93.82% without significantly losing model accuracy at a significant level of 0.01. In general, FastUTSAD provides significant insights on improving the data efficiency of UTSAD methods, which can facilitate the deployment of UTSAD models in realworld industrial scenarios.
There are still several problems that we have not addressed in this work. First, we have not yet considered anomaly detection on multivariate time series. Second, we focus on subsequencebased anomaly detection but ignore pointbased anomaly detection. Finally, FastUTSAD can only work well on large datasets. We will continue our efforts in future work to overcome these limitations.
Availability of data and materials
In this work, we used eight publicly available datasets and nine popular algorithms to validate our proposed framework. The datasets are available at https://www.thedatum.org/datasets/TSBUADPublic.zip. The algorithms used in this study are modified from public algorithms, which are available at https://github.com/TheDatumOrg/TSBUAD and https://github.com/timeeval/timeevalalgorithms.
Code availability
The source code will be available upon publication.
Abbreviations
 TSAD:

Time series anomaly detection
 UTSAD:

Univariate time series anomaly detection
 FastUTSAD:

The framework we propose in this research
 UTS:

Univariate time series
 MTS:

Multivariate time series
 D :

A univariate time series
 l :

The size (length) of a sliding window on a UTS
 \(t_{step}\) :

The time step for generating sliding windows
 W :

The sliding windows generated from a UTS D
 \(w_i\) :

A sliding window in W, \(w_i \in W\)
 \(W_D^*\) :

The smallest training sliding windows found by FastUTSAD
 \(\mathcal {A}\) :

An algorithm for UTSAD
 \(n_{sw}\) :

The number of sampled sliding windows
 \(n_{sw}^*\) :

The minimal number of sampled sliding windows
 \(Acc_{inc}\) :

The increased accuracy
 MCT:

Multistep continual training strategy
 \(Acc_{ori}\) :

The model accuracy trained by the original sliding windows
 \(Acc_{fast}\) :

The model accuracy trained by FastUTSAD
 \(T_{ori}\) :

The training time on the original sliding windows
 \(T_{fast}\) :

The training time by the FastUTSAD
 \(T_{dp}\) :

The time for data processing
 \(T_{redu}\) :

The time reduction by the FastUTSAD
 \(\alpha\) :

The stop threshold of the MCT
 \(s_{gap}\) :

The sampling gap of the MCT
References
Agarwal PK, HarPeled S, Varadarajan KR. Geometric approximation via coresets. Comb Comput Geom. 2005;52(1):1–30.
Akyildiz IF, Su W, Sankarasubramaniam Y, et al. A survey on sensor networks. IEEE Commun Mag. 2002;40(8):102–14. https://doi.org/10.1109/MCOM.2002.1024422.
AlShedivat M, Li L, Xing EP, et al (2021) On data efficiency of metalearning. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, pp 1369–1377, http://proceedings.mlr.press/v130/alshedivat21a.html
Amihud Y. Illiquidity and stock returns: crosssection and timeseries effects. J Financial Markets. 2002;5(1):31–56. https://doi.org/10.1016/S13864181(01)000246.
An J, Cho S. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture IE. 2015;2(1):1–18.
Bachlin M, Plotnik M, Roggen D, et al. Wearable assistant for parkinson’s disease patients with the freezing of gait symptom. IEEE Trans Inf Technol Biomed. 2009;14(2):436–46. https://doi.org/10.1109/TITB.2009.2036165.
Barnett V, Lewis T. Outliers in statistical data. New York: Wiley; 1994.
Boniol P, Linardi M, Roncallo F, et al. Unsupervised and scalable subsequence anomaly detection in large data series. VLDB J. 2021;30(6):909–31. https://doi.org/10.1007/s00778021006558.
Boniol P, Paparrizos J, Kang Y, et al. Theseus: navigating the labyrinth of timeseries anomaly detection. Proc VLDB Endow 2022;15(12):3702–05.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. https://doi.org/10.1023/A:1010933404324.
Breunig MM, Kriegel HP, Ng RT, et al. LOF: Identifying densitybased local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp 93–104, 2000; https://doi.org/10.1145/342009.335388
Challenge A (2018) Kpi anomaly detection competition. https://competition.aiopschallenge.com/home/competition/1484452272200032281. Accessed 7 Nov 2023
Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Comput Surv. 2009;41(3):1–58. https://doi.org/10.1145/1541880.1541882
Chatterjee A, Ahmed BS. IoT anomaly detection methods and applications: a survey. Internet of Things. 2022;19: 100568. https://doi.org/10.1016/j.iot.2022.100568.
Chen W, Xu H, Li Z, et al. Unsupervised anomaly detection for intricate kpis via adversarial training of VAE. In: IEEE INFOCOM 2019  IEEE Conference on Computer Communications; 2019, p. 1891–1899
Cheng H, Tan PN, Potter C, et al. Detection and characterization of anomalies in multivariate time series. In: Proceedings of the 2009 SIAM International Conference on Data Mining (SDM), pp 413–424; 2009. https://doi.org/10.1137/1.9781611972795.36
Cook AA, Misirli G, Fan Z. Anomaly detection for iot timeseries data: A survey. IEEE Internet Things J. 2020;7(7):6481–94. https://doi.org/10.1109/JIOT.2019.2958185.
Darban ZZ, Webb GI, Pan S, et al (2022) Deep learning for time series anomaly detection: a survey. CoRR abs/2211.05244. https://doi.org/10.48550/arXiv.2211.05244
Davis J, Goadrich M. The relationship between precisionrecall and ROC curves. In: Proceedings of the TwentyThird International Conference on Machine Learning, 2006, pp 233–240, https://doi.org/10.1145/1143844.1143874
Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74. https://doi.org/10.1016/J.PATREC.2005.10.010.
Gao J, Song X, Wen Q, et al. Robusttad: robust time series anomaly detection via decomposition and convolutional neural networks. CoRR abs/2002.09545. 2020; https://doi.org/10.48550/arXiv.2002.09545.
Goldstein M, Dengel A. Histogrambased outlier score (HBOS): a fast unsupervised anomaly detection algorithm. In: Poster and Demo Track of the 35th German Conference on Artificial Intelligence (KI2012), 2012, pp 59–63.
Greenwald SD, Patil RS, Mark RG. Improved detection and classification of arrhythmias in noisecorrupted electrocardiograms using contextual information. In: [1990] Proceedings Computers in Cardiology; 1990, p. 461–464. https://doi.org/10.1109/CIC.1990.144257.
de Haan P, Löwe S. Contrastive predictive coding for anomaly detection. CoRR abs/2107.07820. 2021. https://arxiv.org/abs/2107.07820.
Hlynsson HD, EscalanteB. AN, Wiskott L. Measuring the data efficiency of deep learning methods. In: Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM)  Volume 1, 2019, pp 691–698, https://doi.org/10.5220/0007456306910698
Jubran I, Maalouf A, Feldman D. Overview of accurate coresets. WIREs Data Mining and Knowl Discov. 2021;11(6): e1429. https://doi.org/10.1002/widm.1429.
Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models. 2020; CoRR abs/2001.08361. https://doi.org/10.48550/arXiv.2001.08361
Kaushik S, Choudhury A, Sheron PK, et al. AI in healthcare: Timeseries forecasting using statistical, neural, and ensemble architectures. Front Big Data. 2020;3:4. https://doi.org/10.3389/fdata.2020.00004.
Kingma DP, Welling M. Autoencoding variational bayes. In: 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, arXiv:1312.6114
Laptev N, Amizadeh S, Billawala Y (2015) S5a labeled anomaly detection dataset, version 1.0 (16m). 2015. https://webscope.sandbox.yahoo.com/catalog.php
Lehnertz K, Elger CE. Can epileptic seizures be predicted? evidence from nonlinear time series analysis of brain electrical activity. Phys Rev Lett. 1998;80(22):5019. https://doi.org/10.1103/PhysRevLett.80.5019.
Li Y, Long PM, Srinivasan A. Improved bounds on the sample complexity of learning. J Comput Syst Sci. 2001;62(3):516–27. https://doi.org/10.1006/JCSS.2000.1741.
Liu D, Zhao Y, Xu H, et al. Opprentice: towards practical and automatic anomaly detection through machine learning. In: Proceedings of the 2015 ACM Internet Measurement Conference, 2015, pp 211–224, https://doi.org/10.1145/2815675.2815679
Liu F, Zhou X, Cao J, et al. Anomaly detection in quasiperiodic time series based on automatic data segmentation and attentional LSTMCNN. IEEE Trans Knowl Data Eng. 2022;34(6):2626–40. https://doi.org/10.1109/TKDE.2020.3014806.
Liu FT, Ting KM, Zhou Z. Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp 413–422, https://doi.org/10.1109/ICDM.2008.17
Ma J, Sun L, Wang H, et al. Supervised anomaly detection in uncertain pseudoperiodic data streams. ACM Trans Internet Technol (TOIT). 2016;16(1):1–20.
Malhotra P, Vig L, Shroff G, et al. Long short term memory networks for anomaly detection in time series. In: 23rd European Symposium on Artificial Neural Networks, ESANN 2015, Bruges, Belgium, April 22–24, 2015, pp 89–94, https://www.esann.org/sites/default/files/proceedings/legacy/es201556.pdf
Mayeza CA, Munyeka W. The socialization of first entering students: an exploratory study at south african university. Int J Educ Excell. 2021;7(1):99–115.
Michelucci U (2022) An introduction to autoencoders. arXiv preprint arXiv:2201.03898
Moody GB, Mark RG. The impact of the MITBIH arrhythmia database. IEEE Eng Med Biol Mag. 2001;20(3):45–50. https://doi.org/10.1109/51.932724.
Moon J, Yu J, Sohn K. An ensemble approach to anomaly detection using high and lowvariance principal components. Comput Electr Eng. 2022;99: 107773.
Munir M, Siddiqui SA, Dengel A, et al. DeepAnT: a deep learning approach for unsupervised anomaly detection in time series. IEEE Access. 2019;7:1991–2005.
Paparrizos J, Boniol P, Palpanas T, et al. Volume under the surface: a new accuracy evaluation measure for timeseries anomaly detection. Proc VLDB Endow. 2022;15(11):2774–87.
Paparrizos J, Kang Y, Boniol P, et al. TSBUAD: An endtoend benchmark suite for univariate timeseries anomaly detection. Proc VLDB Endow. 2022;15(8):1697–711. https://doi.org/10.14778/3529337.3529354.
Roggen D, Calatroni A, Rossi M, et al. Collecting complex activity datasets in highly rich networked sensor environments. In: Seventh International Conference on Networked Sensing Systems (INSS), 2010, pp 233–240, https://doi.org/10.1109/INSS.2010.5573462
Ros F, Guillaume S. Sampling techniques for supervised or unsupervised tasks. Springer. 2020. https://doi.org/10.1007/9783030293499.
Sakurada M, Yairi T. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, 2014, pp. 4–11.
Schmidl S, Wenig P, Papenbrock T. Anomaly detection in time series: a comprehensive evaluation. Proc VLDB Endow. 2022;15(9):1779–97. https://doi.org/10.14778/3538598.3538602.
Schölkopf B, Williamson RC, Smola A, et al. Support vector method for novelty detection. Adv Neural Inf Process Syst. 1999;12:582–588.
Su Y, Zhao Y, Niu C, et al. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2828–2837.
Sylligardos E, Boniol P, Paparrizos J, et al. Choose wisely: an extensive evaluation of model selection for anomaly detection in time series. Proc VLDB Endow. 2023;16(11):3418–32.
Tatbul N, Lee TJ, Zdonik S, et al. Precision and recall for time series. Adv Neural Inf Process Syst. 2018:31.
Thill M, Konen W, B ̈ack T (2020) Time series encodings with temporal convolu tional networks. In: Bioinspired optimization methods and their applications  9th international conference, BIOMA 2020, Brussels, Belgium, November 19–20, 2020, Proceedings, pp 161–173.
Van NT, Thinh TN, et al (2017) An anomalybased network intrusion detection system using deep learning. In: 2017 international conference on system science and engineering (ICSSE), IEEE, pp 210–214
Wagner D, Michels T, Schulz FCF, et al. Timesead: benchmarking deep multivariate timeseries anomaly detection. Trans Mach Learn Res. 2023. https://openreview.net/forum?id=iMmsCI0JsS.
Wang R, Nie F, Wang Z, et al. Multiple features and isolation forestbased fast anomaly detector for hyperspectral imagery. IEEE Tran Geosci Remote Sens. 2020;58(9):6664–76. https://doi.org/10.1109/TGRS.2020.2978491.
Wang R, Liu C, Mou X, et al. Deep contrastive oneclass time series anomaly detection. In: Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), pp 694–702, 2023;https://doi.org/10.1137/1.9781611977653.ch78.
Woike M, AbdulAziz A, Clem M. Structural health monitoring on turbine engines using microwave blade tip clearance sensors. In: Smart Sensor Phenomena, Technology, Networks, and Systems Integration 2014. SPIE, p 90620L, 2014; https://doi.org/10.1117/12.2044967.
Xu H, Chen W, Zhao N, et al. Unsupervised anomaly detection via variational autoencoder for seasonal KPIs in web applications. In: Proceedings of the 2018 World Wide Web Conference on World Wide Web, 2018, pp. 187–196
Yao Y, Ma J, Ye Y. Regularizing autoencoders with wavelet transform for sequence anomaly detection. Pattern Recognit. 2023;134: 109084.
Yuan Y, Yu ZL, Gu Z, et al. A novel multistep qlearning method to improve data efficiency for deep reinforcement learning. Knowl Based Syst. 2019;175:107–17. https://doi.org/10.1016/j.knosys.2019.03.018.
Zhang W, Yang Z, Wang Y, et al. Grain: Improving data efficiency of graph neural networks via diversified influence maximization. Proc VLDB Endow. 2021;14(11):2473–82. https://doi.org/10.14778/3476249.3476295.
Zhao Y, Nasrullah Z, Li Z. Pyod: a python toolbox for scalable outlier detection. J Mach Learn Res. 2019;20(96):1–7.
Zhong Z, Fan Q, Zhang J, et al (2023) A survey of time series anomaly detection methods in the AIOps domain. CoRR abs/2308.00393. https://doi.org/10.48550/arXiv.2308.00393
Zong B, Song Q, Min MR, et al. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In: Conference Track Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018. https://openreview.net/forum?id=BJJLHbb0
Funding
The National Natural Science Foundation of China (No. 61562010), National Key Research and Development Program of China (No. 2023YFC3341205), and the Research Projects of the Science and Technology Plan of Guizhou Province (Nos. [2021] 449, [2021] 261, [2023] 010, [2023] 276, and [2023] 338) funded this work.
Author information
Authors and Affiliations
Contributions
Conceptualization, HL and WS; investigation, WS; methodology, WS; administration, HL; software, WS; supervision, HL, QL, MC, XZ, and YW; validation, HL, YW, and QL; visualization, WS and QL; analysis, WS and QL; writing—original draft, WS; writingreview and editing, HL and YW. All authors have read and agreed to the published version of the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
This article does not contain any studies with human participants or animals performed by any authors.
Consent for publication
I hereby give my consent to publish the research article. I confirm that all authors have agreed to the publication. I understand that the article will be made publicly available.
Competing interests
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sun, W., Li, H., Liang, Q. et al. On data efficiency of univariate time series anomaly detection models. J Big Data 11, 83 (2024). https://doi.org/10.1186/s40537024009407
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40537024009407