Deep anomaly detection through visual attention in surveillance videos

This paper describes a method for learning anomalous behavior in video by finding an attention region from spatiotemporal information, in contrast to full-frame learning. In the proposed method, a robust background subtraction (BG) technique is employed to extract motion, which indicates the location of attention regions. The resulting regions are then fed into a three-dimensional Convolutional Neural Network (3D CNN). Specifically, by taking advantage of C3D (Convolutional 3D) to fully exploit spatiotemporal relations, a deep convolutional network is developed to distinguish normal from anomalous events. Our system is trained and tested on the large-scale UCF-Crime anomaly dataset to validate its effectiveness. This dataset contains 1900 long, untrimmed real-world surveillance videos, split into 950 anomalous and 950 normal videos. In total, approximately 13 million frames are used during the training and testing phases. As shown in the experiments section, the proposed visual attention model obtains an accuracy of 99.25%. From the industrial application point of view, the extraction of this attention region can help a security officer focus on the corresponding anomalous region instead of inspecting the wider full frame.

Recently, Sultani et al. [6] introduced a broad dataset and a multiple-instance learning (MIL)-based solution [7,8] for this computer vision challenge in order to bridge the gap between the volume of stored surveillance footage and the limited number of human monitors. Unlike [6], Landi et al. [9] and Xu et al. [10] introduce localized detection instead of full-frame video processing. In particular, Landi et al. [9] propose exploiting the inherent location of anomalies and investigate whether spatiotemporal information [11] can help detect them. They combine the model with a tube extraction module, which helps the analysis concentrate on a specific set of spatiotemporal coordinates. A downside of this approach is that the authors rely on manual annotation rather than localization driven automatically by computer vision techniques, which is time-consuming and inefficient. In contrast, Xu et al. [10] automatically locate all potential attention regions where fighting actions may occur by extracting several activation boxes from a motion activation map that measures the level of activity at each position. The authors then cluster all localized proposals around the extracted attention regions based on the spatial relationship between each pair of human proposals and activation boxes. It is important to note that Xu et al. [10] focus only on localizing fight events in public areas; thus, their approach is not applicable to a unified anomaly detection system.
In fact, occlusions, illumination changes, motion blur, and other environmental variations [12] remain challenging in untrimmed public video footage. Therefore, in this paper, we propose an automatic yet efficient approach to attention region localization through background subtraction. First, attention (moving) regions are located using a robust background subtraction method. Once the attention regions are obtained, they are fed into a 3D CNN for action recognition. It is noteworthy that our model uses only the attention region obtained from each frame during training. Like [6], we address the detection of anomalies as a regression problem and propose a model consisting of a video encoder followed by a fully trainable regression network. In summary, this paper makes the following contributions.
A hybrid approach incorporating background subtraction and a bilateral filter to localize attention regions for efficient anomaly detection is proposed.
A novel localization idea for a deep learning network to learn anomaly scores for video segments is introduced.
This paper is organized as follows. In section II, we present related works. In section III, we introduce our proposed method, including the extraction of attention regions from spatiotemporal information and the detailed implementation of localized anomaly detection. In section IV, we test our method and summarize our results. Finally, in section V, we conclude the paper.

Related works
Anomaly detection is one of the most difficult and long-standing problems in computer vision. With the increasing demand for public safety and surveillance, vast numbers of cameras have been installed in public spaces, including airports, plazas, subway stations, and train stations. These cameras generate huge amounts of video data, which makes finding suspicious or unusual occurrences an inefficient and exhausting process for a human operator. Consequently, there is an urgent need for an automated system to increase productivity and save energy. As a result, significant efforts have been made towards smart video surveillance, and many approaches have been proposed that have brought significant progress in the detection of video anomalies.
Various approaches to detecting abnormal behavior have been developed in the past. In [34], video and audio data are used to identify violent behavior in video surveillance. Mohammadi et al. [1] proposed a new behavior-based heuristic approach to classifying violent and non-violent videos. Unlike these works, the authors of [14,15] suggested using tracking to model normal motion and treating deviations from it as anomalies. Owing to difficulties in obtaining reliable tracks, a number of approaches avoid tracking and learn global motion patterns using histogram-based methods [16], social force models [30], mixtures of dynamic texture models [20], Hidden Markov Models (HMMs) [21], topic modeling [18], motion patterns [35], and context-driven methods [19]. For instance, Mehran et al. [30] use the social force model to measure the interaction force in a scene as the difference between the desired and actual velocities obtained by particle advection. Given training videos of normal behavior, these approaches learn the distribution of normal motion patterns and detect low-probability patterns as anomalies.
Recently, approaches based on deep learning have been presented. Xu et al. [13] use a machine learning framework to learn video features instead of relying on hand-crafted features. Multiple one-class Support Vector Machine (SVM) models are then applied to the learned features to score the anomaly level of each input, and their results are combined for the final detection of anomalies. Hasan et al. [24] proposed a convolutional auto-encoder (Conv-AE) framework for reconstructing scenes and then computed the reconstruction costs to identify anomalies. Zhou et al. [36] proposed spatio-temporal CNNs to learn joint appearance and motion characteristics. Sultani et al. [6] combined a deep neural network with multiple instance learning to classify real-world anomalies, such as accident, explosion, fighting, abuse, and arson. Similar to [6], our approach considers both normal and anomalous behaviors for the detection of anomalies. In addition, our work introduces a visual attention idea in order to localize the region of interest (ROI).

Methods
Our algorithm combines BG subtraction with a bilateral filter. The bilateral filter is used to reduce the noise in the incoming frames of untrimmed public footage. BG subtraction is used to retrieve the foreground (the candidate attention regions to be registered). Finally, the extracted attention regions of various anomaly events are classified by a deep learning pipeline. Figure 1 gives an overview of the proposed work, which is discussed in detail in the following parts. We divide this section into two primary parts: visual attention detection and event action detection.

Visual attention detection
We use a texture-based bilateral BG subtraction approach, which effectively eliminates noise and retains the edges of observed areas [37,38]. In our case, this technique builds a stable BG model so that the area of moving objects can be clearly identified as the visual attention region and the rest as a region of no interest.
As shown in Fig. 2, we compare the bilateral BG subtraction with two well-known approaches, namely the improved Mixture of Gaussians model (MOG2) [39] and K-Nearest Neighbors (KNN) [40]. Both methods process the image pixel by pixel and often regard noise, such as shadow interference and intermittent motion, as candidate moving pixels. In contrast, the bilateral texture-based approach excludes the noise properly and extracts the correct region. Specifically, in Fig. 2, the noisy or misclassified regions are highlighted with red circles (mostly shadows regarded as moving pixels by the MOG2 and KNN approaches), while the correct regions are drawn as green rectangles in the original image. Although the previous methods produce a more complete segmented foreground object, the misclassified regions can affect the extraction of the region of interest (ROI). Therefore, in our proposed anomaly detection pipeline, the visual attention region is obtained more accurately and efficiently through the bilateral BG subtraction method. The comparative evaluations in Fig. 2 were conducted on the UCF-Crime [6] dataset, which contains various classes of anomalous activity. This approach achieves approximately 100 fps for 240 × 320 pixel input on a Graphics Processing Unit (GPU). It is therefore efficient enough to localize the region before anomaly detection is performed by the deep learning pipeline.
First, we apply bilateral filtering [41] to an input frame $I$ and denote the greyscale output image as $I_{bilateral}$. $I_{bilateral}$ is used to generate a non-overlapping block-based texture. More specifically, $I_{bilateral}$ is divided into blocks of $n \times n$ pixels, where $n$ is set to 4 in our system. Then, we calculate the mean of each block and use it to construct a binary bitmap. The bitmap $BM_{bil}$ is obtained by comparing each pixel value in a block with the block mean: if the pixel value is below the mean, the binary value is 0; otherwise, it is 1. Finally, the $BM_{bil}$ of each block is used to build the initial BG model $BM_{mod}$, which becomes the reference when a new frame arrives. Our BG model update rule and its learning rate are similar to those of our previous method [42].
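The block-based bitmap construction described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `block_bitmap` and `frame_bitmaps` are hypothetical, the input is assumed to be a greyscale frame that has already been bilaterally filtered, the frame sides are assumed to be divisible by n, and a pixel exactly equal to the block mean is assumed to map to 1.

```python
from typing import List

Block = List[List[int]]


def block_bitmap(block: Block) -> Block:
    """Binarise one n x n block against its own mean value."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    # Pixels below the block mean map to 0, all others to 1
    # (treating "equal to the mean" as 1 is an assumption).
    return [[0 if p < mean else 1 for p in row] for row in block]


def frame_bitmaps(gray: List[List[int]], n: int = 4) -> List[List[Block]]:
    """Split a greyscale frame into non-overlapping n x n blocks and
    return one bitmap per block. Frame sides are assumed divisible by n."""
    height, width = len(gray), len(gray[0])
    bitmaps = []
    for top in range(0, height, n):
        row_of_blocks = []
        for left in range(0, width, n):
            block = [r[left:left + n] for r in gray[top:top + n]]
            row_of_blocks.append(block_bitmap(block))
        bitmaps.append(row_of_blocks)
    return bitmaps


# A single 4 x 4 block with a dark left half and a bright right half.
example_block = [
    [10, 12, 200, 210],
    [11, 13, 205, 220],
    [9, 14, 198, 215],
    [10, 12, 202, 208],
]
print(block_bitmap(example_block))
```

Because each block is binarised against its own mean, a uniform change in brightness across the block leaves the bitmap unchanged, which is one reason a texture descriptor of this kind is less sensitive to illumination changes than raw pixel values.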
When a new frame arrives, we calculate the Hamming distance for each block, that is, the number of positions at which the bits of the observed block $BM_{bil\_obs}$ differ from those of the BG model, to decide whether the observed block is regarded as a BG block or an attention region block. Here, $b_{ij}$ denotes the bit value at position $(i, j)$ of a block. Although the bilateral filter keeps the edges of the active area relatively sharp, it is very slow compared with most filters. Therefore, instead of using global memory, we use CUDA texture memory to process an input frame and perform bilateral filtering on the GPU [41]. For clarity, Fig. 3 describes the step-by-step generation of texture information.
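The per-block decision can be illustrated with a short sketch. It assumes that the observed bitmap is compared with the corresponding block of the BG model and that a block is marked as an attention region when the distance exceeds a threshold. The threshold value of 4 and the names `hamming_distance` and `is_attention_block` are hypothetical and are not taken from this paper.

```python
from typing import List

Bitmap = List[List[int]]


def hamming_distance(bm_obs: Bitmap, bm_mod: Bitmap) -> int:
    """Count the positions (i, j) where the two bitmaps disagree."""
    return sum(
        b_obs != b_mod
        for row_obs, row_mod in zip(bm_obs, bm_mod)
        for b_obs, b_mod in zip(row_obs, row_mod)
    )


def is_attention_block(bm_obs: Bitmap, bm_mod: Bitmap, threshold: int = 4) -> bool:
    """Treat a block as an attention region when its bitmap differs from
    the BG model in more than `threshold` bits (threshold is illustrative)."""
    return hamming_distance(bm_obs, bm_mod) > threshold


model_block = [[0, 0, 1, 1] for _ in range(4)]     # BG model bitmap
static_block = [[0, 0, 1, 1] for _ in range(4)]    # unchanged texture
moving_block = [[1, 1, 0, 0] for _ in range(4)]    # texture flipped by motion

print(hamming_distance(static_block, model_block), is_attention_block(static_block, model_block))
print(hamming_distance(moving_block, model_block), is_attention_block(moving_block, model_block))
```

For a 4 × 4 block the distance ranges from 0 to 16, so the threshold trades sensitivity to small texture changes against robustness to residual noise.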
In addition, Fig. 4 illustrates a simple example of generating the texture information for a single 4 × 4 pixel block. Figure 5 shows the images generated using different block sizes. Figure 5b, f removes too many details, whereas Fig. 5d, h shows excessive unimportant details. Figure 5c, g demonstrates the validity of this texture descriptor and shows that a block size of 4 is an excellent choice.

Feature extraction through the pre-trained C3D model
The 3D CNN is commonly used in various computer vision applications, especially for classification, detection, and recognition tasks. Typically, a 3D CNN model consists of several types of layers, namely convolutional, pooling, and fully connected (FC) layers. Each layer is connected to the preceding layer through kernels with a predefined, fixed-size receptive field. The model learns its parameters from a large data collection to represent the global or local characteristics of a video clip. Its combination of layer types and activation functions yields more representative features than hand-engineered ones.
As illustrated in Fig. 6, the popular Convolutional 3D network (C3D) [43] is selected as our pre-trained feature extractor for its good performance and efficiency. Recent studies [44][45][46][47] have shown that fine-tuning a model pre-trained on the Sports-1M dataset [48] on a more complex dataset results in excellent classification and detection performance. The reason for this training procedure is that the 3D CNN acquires a general representation of video clips from pre-training. After fine-tuning, the model adjusts its parameters to capture the specific features of the video segment while retaining its ability to represent general video segments. This training strategy is implemented implicitly, coupled with shuffled sampling and cross-validation. Figure 7 shows the flow diagram of the proposed visual attention-based anomaly detection. Specifically, we extract visual features from the fully connected (FC) layer FC6 of the C3D network. Before computing features, we resize each video frame to 240 × 320 pixels and set the frame rate to 30 fps. We compute C3D features for every 16-frame video clip, followed by $\ell_2$ normalization. To obtain the features of a video segment, we take the average of all 16-frame clip features within that segment. These 4096-dimensional features are input into a 3-layer FC neural network. For detection, we run inference every 160 frames (every 10 extracted C3D feature files) to determine whether an anomalous scene is present. The regression network outputs the video anomaly score. Since the score ranges from 0 to 1, we can interpret it as the likelihood of an unusual event occurring in the localized segment under investigation. Inspired by [6], we use the MIL ranking loss with sparsity and smoothness constraints [8] and consider each video segment as an instance of the bag. Given an input video M, its anomaly score Sc(M) must comply with the following:

$$Sc(M) \begin{cases} \geq T, & \text{if } M \text{ is anomalous} \\ < T, & \text{otherwise} \end{cases}$$

Implementation details of anomaly detection
Here, the threshold T drives the binary classification into normal and anomalous videos. Ideally, anomalous segments score close to 1, while normal videos score close to 0. In a typical setting, T is set to 0.5. We use the rectified linear unit (ReLU) activation function and apply 50% dropout regularization [49] between the FC layers.
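The feature pooling and thresholding steps described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: toy 2-dimensional vectors stand in for the 4096-dimensional FC6 features, the function names are hypothetical, and a score exactly equal to T is assumed to count as anomalous.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a clip feature to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def segment_feature(clip_features: List[Vector]) -> Vector:
    """Normalise each 16-frame clip feature, then average the clips of a segment."""
    normalized = [l2_normalize(f) for f in clip_features]
    dim = len(normalized[0])
    return [sum(f[d] for f in normalized) / len(normalized) for d in range(dim)]


def is_anomalous(score: float, threshold: float = 0.5) -> bool:
    """Binary decision on an anomaly score in [0, 1]; T = 0.5 as in the text.
    Counting score == threshold as anomalous is an assumption."""
    return score >= threshold


# Two toy clip features belonging to one segment.
clips = [[3.0, 4.0], [0.0, 2.0]]
print(segment_feature(clips))
print(is_anomalous(0.93), is_anomalous(0.04))
```

Normalising before averaging means that every clip contributes equally to the segment feature regardless of the magnitude of its raw activations.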
It is important to note that each training sample is a 16-frame video segment M in which every frame has already been reduced to its localized area by the previous steps. We train the localized model on the public, comprehensive UCF-Crime dataset [6]. The training data consist of 800 normal videos and 810 anomalous videos. In total, the authors of [6] provide 14 classes of events (Table 1).
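For the training stage, the MIL ranking loss with sparsity and smoothness constraints can be sketched in the form given by [6]: a hinge term on the highest-scoring segment of an anomalous bag versus that of a normal bag, plus a temporal smoothness term and a sparsity term on the anomalous scores. The sketch below operates on plain lists of segment scores instead of network outputs, and the function name and the default weights are illustrative assumptions, not values reported in this paper.

```python
from typing import List


def mil_ranking_loss(
    anomalous_scores: List[float],
    normal_scores: List[float],
    lambda_smooth: float = 8e-5,   # illustrative weight (assumption)
    lambda_sparse: float = 8e-5,   # illustrative weight (assumption)
) -> float:
    """MIL ranking loss for one anomalous bag and one normal bag.

    Each list holds the anomaly scores of the segments (instances)
    of one video (bag), in temporal order.
    """
    # Ranking term: the top anomalous segment should outscore the
    # top normal segment by a margin of 1.
    hinge = max(0.0, 1.0 - max(anomalous_scores) + max(normal_scores))
    # Smoothness term: neighbouring segments should have similar scores.
    smoothness = sum(
        (a - b) ** 2 for a, b in zip(anomalous_scores, anomalous_scores[1:])
    )
    # Sparsity term: only a few segments of an anomalous video are anomalous.
    sparsity = sum(anomalous_scores)
    return hinge + lambda_smooth * smoothness + lambda_sparse * sparsity


anomalous_bag = [0.1, 0.9, 0.2]
normal_bag = [0.1, 0.2, 0.1]
print(mil_ranking_loss(anomalous_bag, normal_bag))
```

Only video-level labels are needed to evaluate this loss, since the maximum over each bag selects the segment that is most likely to be anomalous.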

Results and discussion
In the experiments, we use a PC equipped with an Intel i7-7700HQ processor, 16 GB of memory, and an NVIDIA GeForce GTX 1050 Ti GPU with 4 GB of memory. PyTorch 1.2 is employed as the framework, and a pre-trained C3D model is used for spatiotemporal feature extraction. Table 2 describes the statistics of the localized UCF-Crime dataset used throughout the training and testing stages. In addition, Fig. 8 shows some examples of localized anomalous regions in various classes. The UCF-Crime dataset [6] consists of surveillance videos obtained from LiveLeak and YouTube. In this experiment, we evaluate three classes from the UCF-Crime dataset as a baseline test set for assessing accuracy. The dataset provides a binary ground-truth label for each tested video, so the detection can be evaluated in a straightforward manner. In Fig. 9, we present a qualitative comparison on anomalous events from the UCF-Crime dataset, namely robbery, fighting, and road accident. As Fig. 9 shows, we feed only the attention regions to the deep learning pipeline. In other words, the regions of no interest are blurred and do not produce any important visual features during extraction, training, and inference. Note that the x- and y-axes indicate the frame number and the anomaly score (probability), respectively. We compare our proposed method with the previous work of Sultani et al. [6]. Although several studies introduce localization into anomaly detection, they focus on a single anomalous event, such as fighting (as proposed by Xu et al. [10]). Therefore, in order to evaluate several anomalous events, we compare the proposed work with the full-frame approach. Fig. 9a shows two people approaching a man, robbing him of his mobile phone, and then leaving the area by motorbike.
Clearly, both methods detect the robbery event with high probability (see the highlighted frame No. 550 in Fig. 9a). In addition, the subsequent highlighted frame No. 700 in Fig. 9a shows the normal event in which the two robbers leave the scene by motorbike. Here, our approach is more accurate, yielding a significantly lower anomaly probability (almost zero). Note that the higher the probability score, the more likely the event is anomalous. In the anomalous fighting scene, a security officer tries to protect the area from an intruder, while the normal scene shows the intruder leaving the area after failing to fight the officer. As visualized in Fig. 9b, the anomaly scores of Sultani et al. [6] and the proposed work are comparable. Similarly, Fig. 9c visualizes the road accident scene, in which the anomaly score of our proposed work outperforms that of the previous work [6].
Colored window (blue) shows ground truth anomalous region. a-c show videos containing robbery, fighting, and road accident, respectively.
Fig. 10a-c provides the respective highlighted frames from Fig. 9 at higher resolution. It clearly shows that our method successfully localizes the anomalous regions through BG subtraction. From the industrial application point of view, the extraction of this attention region can help a security officer focus on the corresponding anomalous region instead of inspecting the wider full frame.
As shown in Table 3, applying the proposed localization approach achieves higher accuracy on several tested videos. The accuracy of each tested video is calculated over the segments that contain anomalous events. For example, the robbery takes place in video segments 14 to 15, the fighting in segments 4 to 18, and the road accident, which occurs very quickly, in segments 5 to 7. Therefore, the accuracy is evaluated on several segments and averaged. On average, our proposed approach obtains stable accuracy on every tested video (as summarized in Fig. 11). Furthermore, to compare the accuracy of the trained models, we also collect 135 test videos from the UCF-Crime dataset and extract the corresponding C3D features. The accuracy is calculated as the number of correct predictions divided by the number of tested videos. The method of [6] labels 133 of 135 videos correctly, while our proposed visual attention learning classifies 134 of 135 videos correctly. The details can be found in Table 4.
In real-world scenes, multiple events may occur in a single piece of CCTV footage. For example, a robber tries to rob someone while the victim fights back. Therefore, we also evaluate one test video that we obtained arbitrarily from YouTube. In this scene, a robber approaches a group of people in a station but fails to carry out the robbery because the people successfully defend their belongings. As in the previous UCF-Crime evaluation, we conduct a thorough analysis by examining each video segment and measuring its accuracy. As the qualitative measurements show, the proposed localized approach detects the two separate anomalous segments, whereas the previous work fails to detect the event in which the group of people fight back and force the robber to flee the area.
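As a quick check of the video-level accuracy defined earlier in this section (correct predictions divided by the number of tested videos), the two counts on the 135 test videos can be converted into percentages. The helper name `accuracy_percent` is hypothetical, and truncating to two decimals instead of rounding is an assumption made here because it reproduces the reported figures.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy in percent, truncated (not rounded) to two decimals.
    Integer arithmetic avoids floating-point surprises in the truncation."""
    return (correct * 10000 // total) / 100


full_frame = accuracy_percent(133, 135)   # method of [6]
localized = accuracy_percent(134, 135)    # proposed visual attention learning
print(full_frame, localized)
```

With 135 test videos, the difference between the two methods corresponds to a single additional correctly classified video.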
The ground truth is labeled manually through frame-by-frame inspection. The events occur from frame no. 500 to 1500 and then from frame no. 2300 to 2500. The corresponding accuracies of the anomalous segments are provided in Table 5.

Table 4 The comparison of accuracy (%) between full-frame and our proposed locality learning

Method                Accuracy
Sultani et al. [6]    98.51
Proposed method       99.25

Table 5 The accuracy (%) evaluation of a multi-event tested video from YouTube between full-frame and our proposed locality learning

Conclusions
In this paper, automatic localization through robust computer vision techniques is proposed for anomaly detection. Experimental results show that (1) finding a localized attention region in each segment helps anomaly detection; (2) our method obtains accurate results for different kinds of events, e.g., road accident, robbery, and fighting; and (3) incorporating robust BG subtraction helps find the region of interest (ROI) as accurately as possible. In terms of accuracy, the proposed visual attention model obtains 99.25%. It is noteworthy that we utilize a weakly supervised network for training. More generally, we believe that our approach of extracting the visual attention region could benefit many other online tasks, such as video object localization and classification, and we plan to pursue this in future work. Our work is limited to anomalous events that contain moving objects.