Anomaly behaviour detection based on the meta-Morisita index for large scale spatiotemporal data set
Journal of Big Data volume 5, Article number: 23 (2018)
Abstract
In this paper, we propose a framework for processing and analysing large-scale spatiotemporal data that uses a battery of machine learning methods based on a metadata representation of point patterns. Existing spatiotemporal analysis methods do not include a specific mechanism for analysing metadata (point pattern information). In this work, we extend a spatial point pattern analysis method (the Morisita index) with metadata analysis, which includes anomaly behaviour detection and unsupervised learning, to support spatiotemporal data analysis, and we demonstrate its practical use. The resulting framework is robust and can detect anomalies in large-scale spatiotemporal data using metadata based on point pattern analysis. It returns visualized reports to end users.
Introduction
Anomaly detection in spatiotemporal data is a rapidly growing problem in the wake of an ever-increasing number of advanced sensors that continuously generate large-scale datasets. For example, vehicle GPS tracking, social media, financial network and router logs, and high-resolution surveillance cameras all generate huge amounts of spatiotemporal data. This technology is also important in the context of cyber security, since cyber data carries an IP address, which can be mapped to a specific geolocation, and a timestamp. Yet current cybersecurity approaches are not able to process this kind of information effectively. To illustrate this deficiency, consider a distributed denial-of-service (DDoS) attack in which the network packets come from different IP addresses at sparse locations. In such a case, a spatiotemporal analysis system [1] is required to analyse the spatial pattern of the DDoS attack. Yet user-oriented analytic environments for cyber security data with spatiotemporal marks are currently limited to traditional statistical methods such as spatial-temporal outlier detection and hotspot detection [2].^{Footnote 1} Furthermore, much of the current work in large-scale analytics focuses on automating analysis tasks, such as detecting suspicious activity in a wide area motion and time interval. These approaches do not give analysts of cyber security data with spatiotemporal marks the flexibility to employ creativity and discover new trends in the data while operating over extremely large datasets. Current solutions are prohibitive because they require a multidisciplinary skillset.
One possible solution to performing analytics on such large scale spatiotemporal data is to retrieve the metadata of spatial point patterns [5], and apply metadata processing and storage approaches [6], together with domain knowledge derived by machine learning and statistical means. An added advantage of this method is that metadata hides the details of the point patterns thus providing privacy while still supporting a variety of analytics.
We, thus, propose a framework for performing analytics with spatiotemporal data that has the following properties:

Privacy protection: We use a meta analysis of tracking data as an indicator of subjects’ behavior. The geolocation of the subject will not be exposed to the system user.

High scalability: We are able to retrieve the behavior pattern for different amounts of data since the Morisita index provides the scalability adapted to different amounts of tracking data.

Convenience: We designed a convenient way to map the anomaly event of the cyber threat to the physical threat since the cyber threat can be visualized on the real map.
In more detail, we propose a framework to store and process large-scale spatiotemporal data over a “metadata based point pattern” infrastructure. It provides users with a metadata analysis that hides the details of the large-scale spatiotemporal data, and with a front-end interface that allows them to run a variety of security checks, including outlier detection for a single subject, anomaly group detection, anomaly behavior detection and anomaly event detection. Furthermore, the spatiotemporal data is stored in various data stores. As a result, this framework provides high-performance analytical features, flexibility, and extensibility.
The theoretical contribution and novelty of our work lies in the combination of methods from the areas of spatiotemporal analysis, machine learning and statistical analysis. By extracting relevant methods from these three fields of research, we created an effective and efficient tool for anomaly detection by monitoring the cyber and physical levels, simultaneously.
Background
Spatiotemporal data differs from traditional data since both spatial and temporal attributes are available in addition to the actual measurements/attributes.
In this work, we treat the spatiotemporal data as a time series of spatial data. We are using spatial point patterns as the “snapshot” of spatial data within a specific time frame. The Morisita index has been used as the measure of the spatial point patterns.
Spatial point patterns
Spatial point patterns can be stored in a two-dimensional data format to which a variety of analytical methods can be applied to discover useful data patterns in large-scale spatial data. Extracting exact patterns from geospatial data is more complicated than doing so with ordinary data sets because of the nature of geospatial data sources and their associated data structures, which are two- or three-dimensional.
Figure 1 represents six possible spatial distributions of points: random, semi-regular, aggregated, random with a density trend, semi-regular with a density trend, and aggregated with a density trend [5].
Common spatial analysis packages use real numbers [e.g. geospatial POIs (Points of Interest) with uncertainty information, such as GPS navigation data with errors], categorical values (e.g. fishery production by species) and logical values (e.g. saline water/freshwater) to mark the point patterns [7]. These point patterns are composed of a huge number of points that follow distributions such as those illustrated in Fig. 1 [8]. The region of spatial data can have a complicated shape, such as an arbitrary polygon or an irregular pixel image pattern. In this work, the spatial data-mining functions are implemented in R.
Morisita index
The Morisita index is a common method for analyzing spatial patterns. It is a statistical measure of dispersion based on the spatial Poisson process. To compute the Morisita index, the function first calculates the quadrat counts of the spatial point pattern.^{Footnote 2} The index of spatial aggregation computed from these counts is the Morisita index. In more detail, the algorithm divides the spatial domain into Q quadrats of equal size and shape.
Then the algorithm counts the number of points falling in every quadrat. The count for the ith quadrat is denoted n[i], and the counts together form the vector of values from which the Morisita index is calculated. The total number of points is denoted N.
We can also plot the result of this analysis, and the Morisita index of dispersion can be used to calculate the overlap between samples.
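The formula used in the analysis package is
\[ MI = Q \, \frac{\sum_{i=1}^{Q} n_i (n_i - 1)}{N (N - 1)} \]
where Q is the number of quadrats, \(n_i\) is the number of points in the ith quadrat, and N is the total number of points [10].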
This formula is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats (i.e. different faunas) [11]. The Morisita index is used to compare the similarity between different samples. The advantage of the Morisita index is that it is a vector of data that only varies with size of quadrats, not with population density. For more information about the statistical description of the Morisita index, please see [11].
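The quadrat-count computation described in this section can be sketched in a few lines. The following is a minimal illustration in Python, not the R implementation used in this work; the function name `morisita_index`, the rectangular `window` argument and the synthetic point sets are assumptions made for the example.

```python
import random
from collections import Counter


def morisita_index(points, nx, ny, window=(0.0, 1.0, 0.0, 1.0)):
    """Morisita index of 2-D points over an nx-by-ny grid of equal quadrats.

    window = (xmin, xmax, ymin, ymax) is an assumed rectangular study region;
    all points are assumed to lie inside it.
    """
    xmin, xmax, ymin, ymax = window
    counts = Counter()
    for x, y in points:
        # Map each point to its quadrat; clamp so the upper edge stays inside.
        i = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        j = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        counts[(i, j)] += 1
    n_total = sum(counts.values())
    if n_total < 2:
        raise ValueError("need at least two points")
    q = nx * ny
    # Empty quadrats contribute n_i * (n_i - 1) = 0, so they can be skipped.
    pair_sum = sum(n * (n - 1) for n in counts.values())
    return q * pair_sum / (n_total * (n_total - 1))


# Synthetic patterns (illustrative only): one random, one clumped near the centre.
rng = random.Random(0)
uniform = [(rng.random(), rng.random()) for _ in range(2000)]
clumped = [(rng.gauss(0.5, 0.03), rng.gauss(0.5, 0.03)) for _ in range(2000)]
clumped = [(min(max(x, 0.0), 1.0), min(max(y, 0.0), 1.0)) for x, y in clumped]

print(round(morisita_index(uniform, 4, 4), 2))  # close to 1 for a random pattern
print(round(morisita_index(clumped, 4, 4), 2))  # well above 1 for a clumped pattern
```

When every point falls into a single quadrat the function returns Q, and when no quadrat holds more than one point it returns 0, which matches the range of values discussed in the text.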
Related work
In their seminal book “Statistics for spatiotemporal data” [12], Cressie et al. characterize the process of statistical spatiotemporal data analysis in the presence of uncertain and (often) incomplete observations. This work includes prediction in space (interpolation), prediction in time (forecasting), assimilation of observations and mechanistic models, and inference on controlling process parameters. The concept of the Poisson point process in the book is also the foundation of our research, which relies on the Morisita index. The Morisita index itself was originally designed for ecological research by Morisita [11]. The method has been implemented in R by Baddeley et al. [7] to analyze spatial point pattern data. Our work extends that of Baddeley et al. [7] to spatiotemporal data by retrieving a characteristic value using the Morisita index.
Some earlier methods in spatial statistics, such as Kriging [13, 14], interpolate spatial data. The interpolated values conform to a Gaussian process. The meta-Morisita index is different, as it calculates the density of the clusters in each quadrat. Although the two methods appear related, they have different functions. Kriging interpolates predicted points into an insufficient (usually undersized) geospatial data set, which may contain missing points and irregular spatial objects such as polygons. These predicted values are generated by a model of spatial autocorrelation. The Morisita index supports exploratory analysis of a large-scale (usually oversized) geospatial data set at different scales (defined by the diameter of the quadrats). That is, Kriging predicts unknown values [15], whereas the Morisita index does not insert any extra values into the raw geospatial data set.
Other spatial statistical models have been established for spatiotemporal processes and spatial point processes. Examples of temporal point process models include Cox and Isham [16] and Daley and Vere-Jones [17]. Examples of spatial point process models include Cressie [18], Diggle [19], and Møller and Waagepetersen [20]. Spatiotemporal process modelling is not as well defined a field as spatial modelling. Pioneering work in this area includes Diggle [21] and Diggle and Gabriel [22]. Literature reviews such as Zhuang et al. [23] can be used as references for this field. Some work, such as Illian et al. [24], introduces methods such as goodness-of-fit tests and the calculation of summary statistics to process a single realization of the point process.
Our study, however, is functionally closer to clustering work in machine learning than to the spatiotemporal analyses just mentioned. Performance comparisons are described in the "Results and discussion" section on quantitative evaluation. Typical clustering algorithms in machine learning, including K-means [25], density-based spatial clustering of applications with noise (DBSCAN) [26], and expectation maximization (EM) [27],^{Footnote 3} are efficient methods for detecting clusters in spatial point patterns.
The advantage of K-means and its derivatives, K-medoids [28], CLARANS (Clustering Large Applications based on RANdomized Search) [29], K-modes [30], ISODATA (Iterative Self-Organizing Data Analysis Technique) [31], and FCM (fuzzy c-means) [32], is scalability to large data and efficiency. However, the k value is difficult to predict. Expectation maximization (EM) also has disadvantages: convergence is slow, and it cannot provide an estimate of the asymptotic variance-covariance matrix of the maximum likelihood estimator (MLE). Density-based algorithms such as DBSCAN [26] and its derivatives, GDBSCAN [33], DBRS [34], ST-DBSCAN [35], and OPTICS (ordering points to identify the clustering structure) [36], have advantages such as detecting outliers efficiently. However, it is hard to set the global parameter. DBSCAN is also not precise enough to separate clusters that are adjacent to each other (the neck problem).
We now describe how machine learning and clustering have previously been applied to spatiotemporal analyses and contrast these works with our proposed approach. Yang et al. [2] highlight the demands of analysing human mobility data and detecting hot spots. The authors propose a framework to identify human mobility hotspots that represent the status of human mobility in local areas, and group these hotspots into different classes by clustering their temporal signatures. Their work focuses on converting spatiotemporal data into convergent and dispersive hotspots, followed by clustering analysis based on the temporal characteristics. It concentrates on the trajectory of human mobility and models behaviour, but only over a short time window such as one hour. Izakian et al. [37] consider fuzzy c-means (FCM) as a conceptual and algorithmic setting for the problem of anomaly detection. Their work is also based on short time series, which contain only 10 data points. Our work uses 1 day as the interval for calculating the behaviour indicator, which can represent the pattern of the subject over a long period of time and on a large-scale data set.
Birant et al. [38] propose a three-step approach, consisting of clustering, checking spatial neighbours, and checking temporal neighbours, to detect spatiotemporal outliers in large databases. Cheng et al. [39] propose a multi-scale approach to detect spatiotemporal outliers by evaluating the change between consecutive spatial and temporal scales. Their work is based on classification, aggregation, comparison, and verification. Both works focus on detecting spatiotemporal outliers in large data sets. They do not analyse the behaviour pattern as our framework does.
Saligrama et al. [40] propose a novel graph-based statistical notion called MAXLCS (local neighborhood-based composite scores) that unifies the ideas of temporal and spatial locality. Their work focuses on detecting local anomalies, not on large-scale group detection as in our work.
Young et al. [41] propose scalable time-series models so that geographically aggregated call volume can accurately identify the onset of major events when the approximate time and location of the event are known. Their work relies on known events. Our work, which is based on unsupervised learning, does not require any prerequisite information.
Liu et al. [42] present a new data-driven framework for a spatiotemporal feature extraction scheme built on the concept of symbolic dynamics for discovering and representing causal interactions. The extracted spatiotemporal features are then used to learn system-wide patterns via a restricted Boltzmann machine (RBM). Their work is implemented on an energy system with intelligent sensing and control systems. Its limitation is that it is based on anomalies in sensor data; they do not design a behaviour indicator for tracking data as our work does.
Capdevila et al. [43] discuss mining events from Twitter data. The authors propose Warble, a new probabilistic model and learning scheme. Their work focuses on event detection with a probabilistic model, whereas our work originates in spatial analysis, which is a different way to analyze geospatial data. Anagnostopoulos et al. [44] also discuss Twitter data; however, their paper focuses on targeted outdoor advertising.
Pappalardo et al. [45] present DITRAS (DIary-based TRAjectory Simulator), a framework to simulate the spatiotemporal patterns of human mobility. The authors propose the framework to identify human mobility using diary and trajectory generators. Their work focuses on the statistical properties of real trajectories. Our work is based not on the trajectories but on the characteristic value from the statistical model.
The framework design
The purpose of this section is to present the essential elements of our study: the characteristics of the method on which our work is based; the data sets we employed to demonstrate the usefulness of these characteristics and the framework we built to exploit the basic method to the fullest. In more detail, we first describe the way in which the Morisita index can be used to detect anomalous events on a synthetic example. We then describe the data sets on which we conducted our actual study. The first one of these data sets contains a large number of taxi trajectories over a week. It is a physical security problem chosen to illustrate the usefulness of our network on a large data set. Most of our subsequent analysis was conducted on that data set. The second data set is based on Twitter data and contains spatiotemporal information about tweets. As such, it is a cyber dataset and shows how our framework allows cyber and physical security to be considered jointly, as it should be in order to improve real safety. However, the individual Twitter user information is not available from the data set which restricts the type of analysis that can be conducted on it. This is why the Taxi data set was used to demonstrate the versatility of our approach. The third part of this section introduces our framework and explains its functionality.
In this work, we use the spatstat [7] package which is capable of analysing three or more dimensional point pattern datasets. This spatial analysis package supports a variety of statistical analysis methods such as model fitting, spatial data sampling, and statistical formulation. The particular method used in this work is the Morisita index value which was presented in "Morisita index" section. Other models will be explored in future work.
In this study, we calculate the highest Morisita index value over a long time slot (1 day) as the behaviour indicator, to avoid sampling bias. If the data is mostly distributed uniformly, the Morisita index value falls between 0 and 1. However, if the data has a clumped distribution, the Morisita value falls between 1 and n [10], where n is the maximum value of the Morisita index. A value greater than 1 therefore suggests clustering.
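As an illustration of taking the highest Morisita value as the behaviour indicator, the sketch below evaluates the index on successively finer square grids and keeps the maximum. It is written in Python rather than with spatstat, and the names `morisita` and `behaviour_indicator`, the grid range 2x2 to `max_k` x `max_k`, and the rectangular window are all assumptions made for the example.

```python
from collections import Counter


def morisita(points, k, window):
    """Morisita index on a k-by-k grid over window = (xmin, xmax, ymin, ymax)."""
    xmin, xmax, ymin, ymax = window
    counts = Counter(
        (min(int((x - xmin) / (xmax - xmin) * k), k - 1),
         min(int((y - ymin) / (ymax - ymin) * k), k - 1))
        for x, y in points
    )
    n = len(points)
    return k * k * sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def behaviour_indicator(points, window, max_k=10):
    """Highest Morisita value over grids 2x2 ... max_k x max_k (assumed range)."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    return max(morisita(points, k, window) for k in range(2, max_k + 1))


# Five reports from exactly the same place: maximally clumped at every scale.
stacked = [(0.05, 0.05)] * 5
print(behaviour_indicator(stacked, (0.0, 1.0, 0.0, 1.0)))  # 100.0, i.e. Q on the 10x10 grid
```

Because the index can reach Q when all points share one quadrat, the maximum over scales depends on the finest grid considered, so the same grid range should be used for every subject being compared.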
Morisita index as the real-time clustering indicator
In this study, we use the Morisita index method to process the point pattern before detecting an anomalous event. The Morisita index is designed to determine the density of point patterns according to their statistical characteristics; the method is extensively applied to classify and visualize data with geolocation. We chose this method to indicate the point patterns of crowds of people and mark each situation using the corresponding Morisita value to indicate the density of the crowd. Thus, the process includes methods from computational statistics. Given that we want to indicate that these point patterns belong to different social events according to the statistical characteristics of their density values, the method will return Morisita values as a density value. For example, the density value of a downtown social event yields a large Morisita value, whereas the density value of a suburban social event yields a small value. For each social event, the density value is correlated to the Morisita value in indicating the social event. This is illustrated in Figs. 2 and 3.
Figure 2 shows that a crowd with tracking devices is uniformly distributed in the metro area. Figure 4 shows the Morisita plot corresponding to Fig. 2. In this plot, the Morisita value falls into the [0, 1] range. That means that the point pattern is close to the uniform distribution. Figure 2, therefore, shows that the people with tracking devices are uniformly distributed.
Figure 5 shows a crowd with tracking devices gathering in the downtown area. Figure 3 shows the Morisita plot corresponding to Fig. 5. The highest value of the Morisita plot corresponds to the highest density of the point pattern. The maximum value of the Morisita index is close to 8, which shows that people with tracking devices are gathering in some area.
For the downtown case, the maximum of Morisita index values climbs up to 8, which means our raw data was clumped together. For the suburban case, the maximum of Morisita index values falls down to 1.5. The value is significantly smaller than the maximum value 8 in Fig. 3. The result means that the point pattern in this case was not as clumped as in the downtown case.
Figure 7 shows a crowd with tracking devices gathering in the suburban area. Figure 6 shows the Morisita plot corresponding to Fig. 7. The highest value of the Morisita plot corresponds to the highest density of the point pattern. The maximum value of the Morisita index is close to 1.5.
It also shows that the crowd with tracking devices in Fig. 7 was not as clumped as in Fig. 5.
The Morisita index is well suited to comparing the similarities between different samples. If the data is mostly uniformly distributed, the Morisita index value falls between 0 and 1. However, if the data is in a clumped formation, then the Morisita value falls between 1 and N [7], where N is the highest positive number, indicating the highest degree of density. In this case, we use our framework to obtain the distribution and point pattern of the large-scale tracking data with spatiotemporal marks. Users should check the Morisita value as the real-time indicator to determine whether a gathering event is taking place. The gathering can be physical or cyber (with IP address mapping). This example shows a function that can help users find social events from point patterns.
Data sets
The first data set is a sample of the T-Drive trajectory dataset from Microsoft Research [46, 47] that contains the trajectories of 10,357 taxis from 02/02/2008 to 02/08/2008. The total number of points in this dataset is about 15 million and the total distance of the trajectories reaches 9 million km. Taxi drivers are experienced drivers who usually drive throughout the metro area. The taxis with tracking devices act as mobile sensors probing the behaviour pattern of the subject. The taxi tracking record therefore contains information on both the spatiotemporal pattern and the drivers’ behaviour patterns.
The second data set is a Twitter data set. It contains data derived from the spatiotemporal information of tweets that originated from the city of Milan during November and December 2013 [48]. The content of the tweets is not included in this data set.
The simulated data set was used to show the behavior of the Morisita index on specific spatial point patterns. The taxi and Twitter data sets were used to show the Morisita index values generated by the spatiotemporal data.
The taxi and Twitter data sets are both typical large-scale spatiotemporal data sets. Statistical anomaly events, new trends and spatiotemporal pattern changes in sequences of social and political events are hidden behind such large-scale data. The taxi data is a typical tracking data set with spatiotemporal marks. The Twitter data is a typical social media data set with spatiotemporal marks.
Meta-Morisita index architecture
Figure 8 shows the system architecture of the meta-Morisita index based framework. As shown in the figure, we propose a framework able to handle three different tasks: outlier detection for a single subject, anomaly detection for a group of subjects, and anomalous social event detection. Outlier detection for a single subject means that the system can detect anomalous behaviour for a single person, for example, frequent credit card charges in an anomalous location. Anomaly detection for a group of subjects means that the system can identify anomalous subjects by their behaviour, for instance, a person who posts far more tweets than other people. Anomalous social event detection means that the system can detect social events, such as a burst in network flow.
We will now describe each of the components of the flow chart of Fig. 8.
During data processing, we extract the highest Morisita value of taxi drivers for each day from the dataset. The Morisita value which is the indicator of density value can be treated as the indicator of behaviour.
First, to identify outliers, a box plot of the Morisita values is used to generate an outlier list. Second, we employ a clustering method to classify taxi drivers into several groups according to the statistical characteristics of the Morisita value. The K-means algorithm is used to extract the specific locally convergent and dispersive drivers from the metadata of point patterns. Finally, we execute a time series analysis using a change point detection algorithm to extract the change points of human convergent and dispersive behaviours from the metadata of point patterns. We now describe the data processing and anomaly detection parts of our framework in more detail.
Data preprocessing
For a taxi driver, we can construct an individual’s behaviour pattern by analysing the point pattern in time sequence, where each record holds the longitude, latitude, and timestamp of a mobile tracking subject, and the timestamp represents the time when the spatiotemporal point was updated. The time interval between two adjacent Morisita values is 1 day, so each value identifies the density of the point pattern during that interval. The Morisita value, as the behaviour indicator, can thus be extracted from the spatiotemporal data for every day. For example, the first Morisita index value was collected on 02/02/2008 and the second on 02/03/2008; in this way we extract one time series of Morisita values for 02/02/2008–02/08/2008. The time attributes of two adjacent records are considered to be the time slot. Thus, we can extract the flow matrices (time slots) of Morisita values for 1 week. We use the time series of Morisita index values to denote the behaviour pattern of taxi drivers over 1 week. For each day, we calculate the Morisita value of each taxi driver, which represents how the behaviour pattern of the taxi driver moves from day to day during that week. We define the time series of the Morisita value as the largest difference observed during that week. The time series thus provides a behaviour pattern for the taxi driver during the week. The variation of Morisita values during a week can reveal potential anomalies. The clustering analysis of the Morisita values can then be used, as in the following section, to identify convergent and dispersive behaviours.
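The preprocessing step above, which reduces raw tracking records to one behaviour indicator per driver per day, can be sketched as follows. This is an illustrative Python outline rather than the pipeline used in this work: the record layout (taxi ID, timestamp string, longitude, latitude), the timestamp format, the bounding box and the sample rows are assumptions, and the indicator is the highest Morisita value over an assumed range of square grids.

```python
from collections import Counter, defaultdict
from datetime import datetime


def morisita(points, k, window):
    """Morisita index on a k-by-k grid over window = (xmin, xmax, ymin, ymax)."""
    xmin, xmax, ymin, ymax = window
    counts = Counter(
        (min(int((x - xmin) / (xmax - xmin) * k), k - 1),
         min(int((y - ymin) / (ymax - ymin) * k), k - 1))
        for x, y in points
    )
    n = len(points)
    return k * k * sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def daily_indicator(records, window, max_k=8):
    """Map (taxi_id, 'YYYY-MM-DD HH:MM:SS', lon, lat) rows to {(taxi_id, date): value}."""
    groups = defaultdict(list)
    for taxi_id, stamp, lon, lat in records:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
        groups[(taxi_id, day)].append((lon, lat))
    result = {}
    for key, pts in groups.items():
        if len(pts) >= 2:  # the index is undefined for fewer than two points
            result[key] = max(morisita(pts, k, window) for k in range(2, max_k + 1))
    return result


# Made-up rows for illustration; the bounding box is a hypothetical study window.
rows = [
    ("1", "2008-02-02 08:00:00", 116.40, 39.90),
    ("1", "2008-02-02 08:05:00", 116.40, 39.90),
    ("1", "2008-02-02 08:10:00", 116.41, 39.91),
    ("1", "2008-02-03 09:00:00", 116.30, 39.80),
    ("1", "2008-02-03 09:05:00", 116.60, 40.10),
]
for (taxi, day), value in sorted(daily_indicator(rows, (116.0, 117.0, 39.5, 40.5)).items()):
    print(taxi, day, round(value, 2))
```

Sorting the resulting values by date for each taxi gives the weekly time series that the later box plot, clustering and change point steps operate on.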
Anomaly behaviour detection
Box plot method for outlier detection in a single subject
In order to detect anomalies in a single subject’s behavior, we used a box plot and studied the outliers in that plot.
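A box plot conventionally flags as outliers the values lying more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule in Python; the 1.5 multiplier, the inclusive quantile method and the sample values are assumptions, since the text does not state the settings that were used.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return values outside the fences [Q1 - w*IQR, Q3 + w*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical daily Morisita values for one driver over a week.
week = [10, 12, 11, 13, 12, 11, 90]
print(boxplot_outliers(week))  # [90]
```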
Clustering method for anomaly group detection
For anomaly group detection, we used a clustering method, namely k-means [49]. k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. In this work, k-means was used with the value k = 4, which provided the best visualization. Since the box plot approach could also be used to detect anomalies in a group situation, we validated the results obtained by k-means against the output of the box plot.
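The assignment and update steps of k-means can be sketched with a plain Lloyd iteration. This Python version is illustrative only and is not the implementation used in this work; the random initialisation, the iteration cap and the toy vectors are assumptions.

```python
import random


def kmeans(vectors, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids) for a list of numeric vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Toy behaviour vectors (made up): two low-valued and two high-valued subjects.
data = [[1.0, 1.2], [0.9, 1.1], [8.0, 7.5], [8.2, 7.9]]
labels, centroids = kmeans(data, k=2)
print(labels)
```

In the setting of this paper each vector would hold the daily Morisita values of one driver, and k would be set to 4.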
Time series analysis for social event detection
In order to detect social events, we performed time series analysis using the PELT (Pruned Exact Linear Time) [50, 51] algorithm, which is a change point detection method. The PELT algorithm is a multiple change point method that is both computationally efficient and flexible in its application [52]. It has been shown that under certain conditions, especially when the number of change points increases linearly with n, the computational cost of PELT is O(n). As a result, we can detect change points efficiently from the time series of the behaviour pattern of our data sets.
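To make the pruning idea behind PELT concrete, the following Python sketch detects changes in the mean of a series. It is a simplified illustration, not the implementation used in this work: the squared-error segment cost, the function name `pelt_mean_shift` and the penalty value are assumptions, and the cost function and penalty used for the results in this paper are not stated here.

```python
def pelt_mean_shift(series, penalty):
    """Change points (as start indices of new segments) minimising cost + penalty."""
    n = len(series)
    # Prefix sums give the squared-error cost of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for y in series:
        s1.append(s1[-1] + y)
        s2.append(s2[-1] + y * y)

    def cost(a, b):  # squared deviation from the mean over series[a:b]
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [-penalty] + [0.0] * n  # best[t]: optimal penalised cost of series[:t]
    last = [0] * (n + 1)           # last[t]: start of the final segment in that optimum
    candidates = [0]
    for t in range(1, n + 1):
        scores = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(scores)
        # Pruning: drop starts that can never again be optimal (K = 0 for this cost).
        candidates = [s for score, s in scores if score - penalty <= best[t]] + [t]
    # Walk back through the stored segment starts to recover the change points.
    changes, t = [], n
    while t > 0:
        if last[t] > 0:
            changes.append(last[t])
        t = last[t]
    return sorted(changes)


# Made-up series with one level shift after the tenth observation.
print(pelt_mean_shift([1.0] * 10 + [5.0] * 10, penalty=2.0))  # [10]
```

The pruning step is what keeps the candidate list short when change points occur regularly; a larger penalty yields fewer detected change points.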
Experimental methods
We applied our full framework to the taxi driver data set and illustrated its usefulness on the Twitter data. The first two parts of this section describe our analysis of individual taxi drivers and groups of taxi drivers, respectively. The third part is an initial illustration of our framework on the Twitter data set.
The “Experimental methods” section describes the application of the meta-Morisita index and demonstrates its effectiveness. The “Results and discussion” section evaluates our meta-Morisita based approach by comparing its performance to that of the most popular clustering algorithms from machine learning.
Behaviour pattern for taxi driver
Table 1 shows spatiotemporal data from taxi driver records. The data was obtained from Microsoft Research [46, 47].
The first column in Table 1 is the ID of the Taxi. The second column is the timestamp of the tracking data. The third and fourth columns are the coordinates of the Taxi. The data was collected in 5 min intervals.
Table 2 shows the information from Table 1 mapped onto the Morisita representation. The first column in Table 2 is the ID of the Taxi. The second column is the highest Morisita value of tracking data per day for each driver. The third column is the date.
In this first study, we use the box plot method to process the metadata of point patterns before detecting the anomalies. The metadata values are designed to determine the abnormal values according to their statistical characteristics; the method is extensively applied to classify and visualize the abnormal points in the metadata. We chose this method to indicate the outliers of the metadata of the point patterns and marked each outlier using the corresponding metadata value. Figure 9 shows the box plot of the tracking metadata of 30 taxi drivers.
Given that we want to indicate these outliers according to the statistical characteristics of their density values, the method will return Morisita values as density values. The density value indicates that the taxi driver displayed anomalous behaviour on the day the outlier was recorded. From the box plot, we can find the outliers, such as Taxi 549 on 02/07/2008. The Morisita value is extremely high on that day. That means Taxi 549 displayed anomalous behaviour on that day. Taxi 549, Taxi 493, Taxi 1011, and Taxi 430 have box plots that are different from other taxi drivers.
Statistical analysis for metadata of spatiotemporal data from taxi drivers
As mentioned in the "Anomaly behaviour detection" section, anomaly detection for a group of drivers was conducted using clustering and time series analysis based on metadata of point patterns, where the output from the metadata of point patterns is converted to a simple time series using the Morisita value. We used K-means as the unsupervised learning method to cluster the group of drivers. We also used change point detection to find the change points of the behaviour pattern.
In particular, we used PELT (Pruned Exact Linear Time) [50] as the change point detection approach. PELT was used to detect the exact moment of breakout when the algorithm reports a change in distribution (if at all), along with precision, recall, and the F-measure.
Anomaly detection for the behaviour pattern of taxi drivers
In this framework we chose K-means to cluster the group of drivers. Figure 10 shows the 2D representation of a clustering plot displaying the behaviour patterns of 30 taxi drivers using the K-means algorithm. Figure 11 shows the four-cluster plot displaying the behaviour patterns of 30 taxi drivers using the K-means algorithm.
As a result of K-means clustering, Taxi 549, Taxi 493, Taxi 1011, and Taxi 430 were grouped as anomalies. The box plot was used to validate the cluster analysis. The K-means results were shown to match the results of the box plot.
Figure 12 shows the spatial distribution of Taxi 549 and Taxi 234 in the city of Beijing between 02/02/2008 and 02/08/2008. The pink lines represent the road network. The red dots are the locations of the taxis. Taxi 549 is marked as an anomalous taxi driver as a result of the K-means analysis. The median values of the Morisita index in Fig. 9 are 8000 for Taxi 549 and 400 for Taxi 234. Taxi 549’s tracking records are more clumped than those of normal taxis in the upper east area (airport). The spatial distribution makes it clear that Taxi 549 behaves differently from the normal driver, Taxi 234.
Change point detection for groups of taxi drivers
In statistical analysis, change detection or change point detection tries to identify times when the probability distribution of a stochastic process or time series changes. In general, the problem concerns both detecting whether a change has occurred, or whether several changes might have occurred, and identifying the times of any such changes. In this framework we chose the PELT (Pruned Exact Linear Time) [50] method to detect the change points in the behaviour patterns of taxi drivers.
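To make the idea concrete, the following is a compact sketch of a PELT-style search. It assumes a change-in-mean model with a squared-error segment cost and a fixed penalty per change point; the cost function and penalty used in the study are not stated, so these choices, the function name `pelt_mean` and the example series are all illustrative.

```python
def pelt_mean(series, penalty):
    """Change points for a change-in-mean model, found with PELT pruning.

    Returns the sorted indices at which a new segment starts.
    """
    n = len(series)
    # Prefix sums let each segment cost be evaluated in constant time.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(series):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(s, t):
        # Sum of squared deviations from the mean of series[s:t].
        total = s1[t] - s1[s]
        return (s2[t] - s2[s]) - total * total / (t - s)

    best = [0.0] * (n + 1)   # best[t]: optimal penalised cost of series[:t]
    best[0] = -penalty
    last = [0] * (n + 1)     # last[t]: start of the final segment of series[:t]
    candidates = [0]
    for t in range(1, n + 1):
        scored = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(scored)
        # Pruning: drop any start that can no longer be optimal later on.
        candidates = [s for value, s in scored if value - penalty <= best[t]]
        candidates.append(t)

    # Walk back through the stored segment starts.
    changepoints = []
    t = n
    while last[t] > 0:
        t = last[t]
        changepoints.append(t)
    return sorted(changepoints)


# Hypothetical Morisita time series with two shifts in level.
series = [2.0, 2.0, 3.0, 2.0, 2.0, 9.0, 9.0, 8.0, 9.0, 9.0, 2.0, 3.0, 2.0, 2.0, 2.0]
print(pelt_mean(series, penalty=3.0))
```

The penalty controls how readily a new segment is opened: a larger value yields fewer change points, a smaller value yields more.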
Figure 13 shows the change point value of the time series of Morisita values between 02/02/2008–02/08/2008. The vertical red lines mark the change point.
Figure 14 shows the decomposition of the additive time series between 02/02/2008–02/08/2008.
As a result of change point detection, Fig. 13 shows that 02/06/2008 is the date at which the maximum change point in the behaviour patterns of taxi drivers was observed. It turns out that 02/06/2008 was the Eve of the Chinese New Year. This may explain the high density value of the point patterns, which may have been caused by the high traffic volume during the holiday. Figure 14 also shows that every morning represents a change point in the behaviour patterns of taxi drivers. These two observations validate the usefulness of our approach since they show that the events detected by our approach correspond to actual events. In the future, this approach could be used to detect spontaneous gatherings resulting from incidents such as accidents, natural disasters or spontaneous demonstrations and could help alert emergency services and security patrols faster, thus providing greater security.
Analysis for twitter data
We now turn to the analysis of the second data set, the Twitter data set. Figure 15 shows the spatial distribution of tweets in Milan, Italy at 9 a.m. and 8 p.m. on 1 December 2013 [53]. The pink lines represent the road network. The red dots are the locations of Twitter users.
Figure 16 shows the line chart of changes in Morisita values over time for this Twitter data on 12/01/2013. The line chart in Fig. 16 shows that the Morisita values are higher at night than during the daytime. This may be explained by the fact that at night (see Fig. 15, right plot), Twitter users do not move around as much, but instead stay put in a concentrated area of the city, whereas during the day (see Fig. 15, left plot), they are more dispersed and, therefore, tweet from various parts of the city. This is not particularly novel or useful information in itself, but it illustrates how gatherings of Twitter users in a single location can be detected, which could be useful for security reasons.
Figure 17 shows the spatial distribution of tweets in Milan, Italy on 12/02/2013 (Monday) and 12/07/2013 (Saturday) [53]. The pink lines represent the road network. The red dots are the locations of Twitter users.
Figure 18 shows the line chart of changes in the Morisita value over time for this Twitter data between 12/01/2013 and 12/12/2013. The line chart in Fig. 18 shows that the Morisita value is higher during the weekend than during the working days. The Morisita values on the Mondays (12/02/2013 and 12/09/2013) were local minima. This may be explained by the fact that on weekends (see Fig. 17, right plot), Twitter users do not move around as much, but instead stay put in a concentrated area of the city, whereas during the working day (see Fig. 17, left plot), they are more dispersed and, therefore, tweet from various parts of the city. This, once again, is not particularly novel or useful information in itself, but it illustrates how gatherings of Twitter users in a single location can be detected, which could be useful for security reasons.
The user information has been masked in the raw data by the original distributor for privacy reasons. We therefore cannot analyse individual Twitter users with the metaMorisita index method, so we use the Taxi data as a substitute. The report based on a single Taxi user can be found in the "Statistical analysis for metadata of spatiotemporal data from taxi drivers" section. Using the same learning method as the one used on the Taxi data, anomalies among Twitter users could be detected efficiently, provided that data on individual users is made available.
Results and discussion
Once again, given the lack of individual Twitter data, we could not perform a full quantitative analysis on that data set. Instead, we performed a full quantitative analysis of the Taxi data set. Our experimental environment is based on an eight-core server equipped with an Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70 GHz and 16 GB of memory. The operating system is Ubuntu 16.04.1.
Time analysis
In this section we report the running time of the different algorithms on our Taxi database.
Figure 19 shows the elapsed time of different analysis methods. The four methods in the comparison are K-means, density-based spatial clustering of applications with noise (DBSCAN), the expectation maximization (EM) clustering algorithm and the metaMorisita index algorithm.
There is no significant difference when processing small scale data, such as data containing fewer than \(10^6\) points. The metaMorisita index, however, performs better on large scale data, such as data containing \(10^8\) points. Thus, the experiment shows that the metaMorisita index obtains better performance than traditional clustering methods for large scale spatiotemporal data. This is because the metaMorisita index retrieves the characteristic value of different time windows for the whole spatiotemporal data set. The machine learning algorithm then runs on the metadata only, which significantly reduces the computational complexity. The other clustering algorithms, such as K-means, in contrast, compute their results directly on the entire large scale data set. The remaining question concerns the quality of the results obtained by the metaMorisita index. We now explore this question by comparing the results obtained by each of the methods just considered to a reference method.
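The reduction from raw points to metadata can be sketched as follows. The example assumes the standard form of the Morisita index, I = Q Σ n_i (n_i − 1) / (N (N − 1)), where Q is the number of quadrats, n_i the number of points in quadrat i and N the total number of points; the fixed square grid, the function names and the sample records are hypothetical.

```python
from collections import Counter


def morisita_index(points, origin, cell_size, n_cols, n_rows):
    """Morisita index of a point pattern over a fixed grid of square quadrats."""
    counts = Counter()
    for x, y in points:
        col = int((x - origin[0]) // cell_size)
        row = int((y - origin[1]) // cell_size)
        if 0 <= col < n_cols and 0 <= row < n_rows:
            counts[(col, row)] += 1  # points outside the grid are ignored
    n = sum(counts.values())
    if n < 2:
        raise ValueError("the index needs at least two points inside the grid")
    q = n_cols * n_rows
    return q * sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def morisita_series(records, origin, cell_size, n_cols, n_rows):
    """Collapse (window, x, y) records into one Morisita value per time window."""
    windows = {}
    for window, x, y in records:
        windows.setdefault(window, []).append((x, y))
    return {
        window: morisita_index(pts, origin, cell_size, n_cols, n_rows)
        for window, pts in sorted(windows.items())
    }


# Hypothetical records on a 4 x 4 grid of unit quadrats.
records = [
    ("09:00", 0.5, 0.5), ("09:00", 1.5, 2.5), ("09:00", 3.5, 0.5), ("09:00", 2.5, 3.5),
    ("20:00", 1.2, 1.3), ("20:00", 1.4, 1.1), ("20:00", 1.7, 1.8), ("20:00", 3.5, 3.5),
]
series = morisita_series(records, origin=(0.0, 0.0), cell_size=1.0, n_cols=4, n_rows=4)
print(series)
```

Each window is read once to count points per quadrat, after which any learning method only has to handle one number per window instead of the full set of points.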
Reference method
Map algebra [54] is a basic set-based algorithm for manipulating geospatial data. Algebraic operations such as addition and subtraction can be performed on two or more raster layers of similar dimensions. The output of a map algebra primitive operation is a new raster layer (map). Map algebra operations fall into four classes: local, focal, global and zonal. Operations on individual raster cells (pixels) are local operations. Operations on the neighbourhood of a cell are focal operations. Operations on the entire layer are global operations. Operations on the cells that share the same value are zonal operations. In Geographic Information Systems (GIS), map algebra is implemented by script or procedure, and all the operations are displayed on the map. Map algebra calculates the exact number of incidents (here, Taxi occurrences, but these could also be tweets, etc.) that occur in a particular location. While map algebra gives exact results, it is not practical for large scale analyses, which is why alternative methods, such as clustering methods and metaMorisita analysis, were sought, and map algebra was used only as a reference on a small sample of data as a sanity check. In this section we use map algebra as the reference method to detect clusters in the Taxi data. The number of clusters is recorded as the reference value.
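A cell-by-cell (local) map algebra operation can be sketched with plain nested lists standing in for raster layers. The layer names, values and the threshold of three occurrences are illustrative assumptions; a GIS would supply its own raster type and operators.

```python
def local_add(layer_a, layer_b):
    """Local operation: add two raster layers cell by cell."""
    same_shape = len(layer_a) == len(layer_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(layer_a, layer_b)
    )
    if not same_shape:
        raise ValueError("raster layers must have the same dimensions")
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(layer_a, layer_b)]


def dense_cells(layer, threshold):
    """List (row, col, count) for cells holding at least `threshold` occurrences."""
    return [(r, c, value)
            for r, row in enumerate(layer)
            for c, value in enumerate(row)
            if value >= threshold]


# Hypothetical occurrence counts per cell for two time slices.
morning = [[0, 1, 0],
           [2, 0, 0],
           [0, 0, 1]]
evening = [[1, 0, 0],
           [3, 0, 0],
           [0, 0, 0]]

total = local_add(morning, evening)   # a new raster layer
dense = dense_cells(total, threshold=3)
print(total, dense)
```

The result is exact because every occurrence is counted in its own cell, which is also why the approach becomes expensive as the number of layers and cells grows.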
Figure 20 shows the point pattern of Taxi 39 on 02/07/2008. This figure is generated by map algebra operations. The behaviour of the taxi driver can be visualized by the density value of the spatiotemporal tracking record. In this map the density value has been displayed with color. Red corresponds to a high density value, and blue corresponds to a low density value. In this case we manually count the number of clusters of size greater than or equal to 3. This count is recorded in Table 3 as the reference value.
Accuracy, precision, recall evaluation
In this section we recorded the accuracy, precision and recall rate of the four algorithms previously considered.
The first column in Table 3 is the ID of the Taxi. The other entries represent the number of clusters detected by the different algorithms. The results are also visualized in Fig. 21. The assumption is that the closer the match in the number of clusters detected, the closer the actual match is between detected and reference clusters.
The first column in Table 4 is the evaluation metric under consideration. The other entries in the table are the results obtained for these metrics by the four methods under consideration. The results are also visualized in Fig. 22.
Figure 21 shows the number of clusters (size ≥ 3) obtained by different analysis methods. The four different methods in the comparison are Kmeans, DBSCAN, EM Clustering and metaMorisita index. The reference value for the number of clusters for each taxi is shown by the purple curve. The metaMorisita value is shown in red. The figure clearly shows that the red curve is the closest match to the purple curve in Fig. 21.
Figure 22 shows the evaluation of different analysis methods. Once again, the four methods in the comparison are K-means, DBSCAN, EM clustering and the metaMorisita index. K-means and EM obtain similar results: both yield a high number of false negatives (FN). DBSCAN is a little different: it detects all the clusters but produces too many false positives (FP).
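The effect of false negatives and false positives on the three metrics can be seen in a short calculation. The counts below are invented to mimic the two failure modes described above and are not taken from Table 4; the function name `evaluate` is hypothetical, and how true negatives were counted in the study is not stated, so they default to zero here.

```python
def evaluate(tp, fp, fn, tn=0):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical method that finds every reference cluster but over-detects.
over_detecting = evaluate(tp=10, fp=10, fn=0)
# Hypothetical method that never over-detects but misses many clusters.
under_detecting = evaluate(tp=4, fp=0, fn=6)
print(over_detecting, under_detecting)
```

Many false positives lower precision while leaving recall intact, and many false negatives do the opposite, so the two failure modes show up in different columns of an evaluation table.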
Map algebra algorithms [54] have been used to produce the reference values since they count occurrences in the same location directly. The clusters in the metaMorisita index are detected by the number of points falling into the same quadrat. As a result, the metaMorisita index significantly improves the accuracy, precision and recall rate compared with the other learning algorithms. The reason is that the metaMorisita index algorithm comes from spatial statistics, which is similar to map algebra. Furthermore, the Euclidean distance used in the machine learning algorithms caused instability in the clustering result for different data samples. The corresponding concept in the Morisita index is the smallest diameter of the quadrat, which is a constant value for geospatial data recorded to the same number of decimal places. For example, for coordinates (in the WGS84 geodetic datum) given to four decimal places, such as (Longitude: 116.2936, Latitude: 39.9227), the scale of the data is about 10 m. That constant measure ensures that the metaMorisita value is stable for different data samples.
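The figure of about 10 m for coordinates such as 116.2936, 39.9227 can be checked with a rough calculation. The sketch below assumes a spherical Earth with a mean radius of 6,371 km, which is a simplification of the WGS84 ellipsoid; the function name `ground_resolution` is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)


def ground_resolution(decimal_places, latitude_deg):
    """Approximate ground distance, in metres, of one unit in the last decimal place."""
    step_rad = math.radians(10 ** -decimal_places)
    north_south = EARTH_RADIUS_M * step_rad
    # Meridians converge towards the poles, so east-west steps shrink with latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


north_south, east_west = ground_resolution(decimal_places=4, latitude_deg=39.9227)
print(round(north_south, 1), round(east_west, 1))
```

At the latitude of Beijing a step of 0.0001 degrees is on the order of 10 m in both directions, which matches the scale quoted above.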
Figure 23 shows the relationship between the point pattern and the Morisita index. The outlier is the cluster which contains 146 points and is displayed in red on the picture. This cluster has been detected by the Morisita index. However, the Morisita index is based on random sampling. As a result, duplicate clusters have also been erroneously detected by the Morisita index, which suggests that more than one cluster of this type is present. Please note that the errors made by the metaMorisita index occurred only in the vicinity of the outlier area, where the density of the cluster is too high for several clusters to form.
Conclusions and future work
The availability of large-scale spatiotemporal data sets (e.g., social media, vehicle or cellphone tracking data, financial network logs) provides both the opportunity and the challenge to study behaviour patterns to better understand the interactions between cyber and physical events. In this paper, we explore tracking data by investigating the spatiotemporal patterns of taxi drivers and Twitter users. A brief workflow is proposed to identify and extract spatiotemporal patterns of outliers based on metadata of the tracking data. Two case studies, of Beijing, China and Milan, Italy, are employed to test the proposed method; multiple typical spatiotemporal convergent and dispersive patterns are identified in the large area. We discuss the spatiotemporal distribution of these patterns in different functional areas to obtain better knowledge of the behaviour pattern.
In the paper “Statistical Modeling: The Two Cultures” [55], Breiman compared the data modeling and algorithmic modeling cultures. The framework we presented is a combination of spatial statistics and machine learning. The Morisita index is a data modeling approach for spatial data in statistics. However, the original Morisita index does not provide the ability to learn. In our work, the Morisita index (data model) has been used to generate the characteristic value of raw data, then learning approaches (algorithm model) were applied to the metadata generated by the Morisita index. Our framework thus combines the advantages of both machine learning and spatial statistics. At the same time, the metaMorisita index avoids the high computational complexity of current clustering algorithms applied to spatial data.
The findings derived from this study provide insights into the location, time and intensity of the activity of taxi drivers in Beijing and Twitter users in Milan, which is helpful for mining behaviour patterns and for surveillance of cyber-physical subjects. The identified patterns can help government agencies and urban administrations make targeted adjustments to the monitoring of cyber-physical events with high anomaly activity as a way to improve the efficiency of the methods used to maintain security in society. In addition, the findings can be used as a reference for understanding subject behaviour. For example, if we only know the spatiotemporal distribution of the active areas of a city, it is possible to have a general understanding of daily subject behaviour and dispersion in other cities according to the discussion in the "Statistical analysis for metadata of spatiotemporal data from taxi drivers" section.
In the future, we will use spatiotemporal statistical models to analyse intelligence information about the behaviour of each subject, to determine the spatiotemporal interactions among different areas of the city, and to explore the behaviour patterns among different social roles, which can provide in-depth knowledge regarding the interactions between subjects and their social roles.
Our plan is to collect tracking data for different social roles, such as teachers, police officers and UPS drivers. We will train the system on each social role separately in order to learn the behaviour patterns of each category, which will help us detect behaviours that do not fall into the expected categories and may be considered suspicious.
In general, the framework provides a general solution for spatiotemporal analytic tasks built on the metadata of spatial point patterns, which is needed by statisticians and researchers. The Morisita index was selected as the indicator of daily behaviour from the spatial point pattern. Multiple analytic methods have been used as efficient statistical computing approaches to accommodate multiple spatiotemporal data sources and data schemas. The statistical analyses built on the metadata of spatial point patterns, which reduce the computational complexity for large scale spatiotemporal data, are flexible enough to be added to existing spatiotemporal data warehouse systems. This framework gives data analysts and statisticians a more convenient, flexible and scalable way to process and analyse large-scale cyber security data with spatiotemporal marks.
Notes
The term quadrat in ecology and geography means a plot to isolate a standard size of area to study the distribution of an item over a large area [9].
EM is a much broader statistical estimation method than a simple clustering algorithm. It is a general modelling process which can be applied to clustering.
References
Shrestha A, Zhu Y, Manandhar K. Nettimeview: applying spatiotemporal data visualization techniques to ddos attack analysis. LNCS. 2014;8887:357–66.
Yang X, Zhao Z, Lu S. Exploring spatialtemporal patterns of urban human mobility hotspots. Sustainability. 2016;8(7):674.
Chen D, Lu CT, Kou Y, Chen F. On detecting spatial outliers. GeoInformatica. 2008;12:455–75.
Brimicombe AJ. Cluster detection in point event data having tendency towards spatially repetitive events. In: Proceedings of the 8th international conference on geocomputation, Ann Arbor; 2005.
Hijbeek R, Koedam N, Khan MNI, Kairo JG, Schoukens J, DahdouhGuebas F. An evaluation of plotless sampling using vegetation simulations and field data from a Mangrove forest. PLoS ONE. 2013;8(6):e67201.
Yang Z. Spatial data mining analytical environment for large scale geospatial data. Ph.D. thesis, University of New Orleans. University of New Orleans Theses and Dissertations. 2284. 2016. http://scholarworks.uno.edu/td/2284
Baddeley A, Rubak E, Turner R. Spatial point patterns: methodology and applications with R. Boca Raton: CRC Press; 2015.
Ioup E, Yang Z, Barré B, Sample J, Shaw KB, Abdelguerfi M. Annotating uncertainty in geospatial and environmental data. IEEE Internet Comput. 2015;19:18–27.
Stiling P. Ecology: global insights and investigations. New York: McGrawHill Education; 2011.
Berthelsen KK, Jalilian A, van Lieshout MC, Rajala T, Schuhmacher D, Waagepetersen R. Spatstat quick reference guide. http://spatstat.org/resources/spatstatQuickref.pdf
Morisita M. Measuring of the dispersion and analysis of distribution patterns. Memoires of the Faculty of Science, Kyushu University, Series E. Biology. 1959;2:215–35.
Cressie N, Wikle CK. Statistics for spatiotemporal data. New York: Wiley; 2011.
Wahba G. Spline models for observational data. Philadelphia: SIAM; 1990.
Koziel S. Accurate modeling of microwave devices using krigingcorrected space mapping surrogates. Int J Numer Model. 2011;25:1–4.
ArcMap 10.3. How kriging works. Redlands: ESRI. 2016. http://desktop.arcgis.com/en/arcmap/10.3/tools/spatialanalysttoolbox/howkrigingworks.htm
Cox DR, Isham V. Point processes. Boca Raton: Chapman and Hall; 1980.
Daley DJ, VereJones D. An introduction to the theory of point processes volume I: elementary theory and methods. Berlin: Springer; 2003.
Cressie NAC. Statistics for spatial data, revised edition. New York: Wiley; 2015.
Diggle PJ. Statistical analysis of spatial point patterns. London: Hodder Education Publishers; 2003.
Møller J, Waagepetersen RP. Statistical inference and simulation for spatial point processes. Boca Raton: CRC Press; 2003.
Diggle PJ. Spatiotemporal point processes, partial likelihood, foot and mouth disease. Stat Methods Med Res. 2006. https://doi.org/10.1191/0962280206sm454oa.
Diggle PJ, Kaimi I, Abellana R. Partiallikelihood analysis of spatiotemporal pointprocess data. Biometrics. 2009. https://doi.org/10.1111/j.15410420.2009.01304.x.
Zhuang J, Ogata Y, VereJones D. Stochastic declustering of spacetime earthquake occurrences. J Am Stat Assoc. 2002;97:369–80.
Illian J, Penttinen A, Stoyan H, Stoyan D. Statistical analysis and modelling of spatial point patterns. New York: Wiley; 2008.
MacQueen J. Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp Math Stat Prob. 1967;1:281–97.
Ester M, Kriegel HP, Sander J, Xu X. A densitybased algorithm for discovering clusters in large spatial databases with noise. In: KDD’96 Proceedings of the second international conference on knowledge discovery and data mining. Cambridge: AAAI Press; 1996. pp. 226–231.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc. 1977;39:1–38.
Park HS, Jun CH. A simple and fast algorithm for kmedoids clustering. Exp Syst Appl. 2009;36:3336–41.
Ng RT, Han J. Clarans: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng. 2002;14:1003–16.
He Z, Xu X, Deng S. Attribute value weighting in kmodes clustering. 2007.
Merzougui M, Nasri M, Bouali B. Isodata classification with parameters estimated by evolutionary approach. In: 2013 8th International Conference on intelligent systems: theories and applications (SITA); 2013.
Bezdek C, Ehrlich J, Full W. Fcm: The fuzzy cmeans clustering algorithm. Comput Geosci. 1984;10:191–203.
Sander J, Ester M, Kriegel HP, Xu X. Densitybased clustering in spatial databases: the algorithm gdbscan and its applications. Data Mining Knowl Dis. 1998;2:169–94.
Wang X, Hamilton HJ. Dbrs: a densitybased spatial clustering method with random sampling. Lecture notes in computer science book series. Berlin: Springer; 2003.
Birant D, Kut A. Stdbscan: an algorithm for clustering spatialtemporal data. Data Knowl Eng. 2007;60:208–21.
Ankerst M, Breunig MM, Kriegel HP, Sander J. Optics: ordering points to identify the clustering structure. In: Proceeding SIGMOD ’99 proceedings of the 1999 ACM SIGMOD international conference on management of data. 1999.
Izakian H, Pedrycz W. Anomaly detection and characterization in spatial time series data: a clustercentric approach. IEEE Trans Fuzzy Syst. 2014;22:1612–24.
Birant D, Kut A. Spatiotemporal outlier detection in large databases. J Comput Inf Technol. 2006;14:291–7.
Cheng T, Li Z. A multiscale approach for spatiotemporal outlier detection. Trans GIS. 2006;10:253–63.
Saligrama V, Zhao M. Local anomaly detection. In: Proceedings of the 15th international conference on artificial intelligence and statistics (AISTATS), vol. 22. La Palma. 2012.
Young WC, Blumenstock JE, Fox EB, McCormick TH. Detecting and classifying anomalous behavior in spatiotemporal network data. New York: KDDLESI; 2014.
Liu C, Ghosal S, Jiang Z, Sarkar S. An unsupervised spatiotemporal graphical modeling approach to anomaly detection in distributed cps. In: Proceedings of the 7th international conference on cyberphysical systems, 1. Vienna. 2016.
Capdevila J, Cerquides J, Torres J. Mining urban events from the tweet stream through a probabilistic mixture model. Data Mining Knowl Discov. 2017;93:58–68.
Anagnostopoulos A, Petroni F, Sorella M. Targeted interestdriven advertising in cities using twitter. Data Mining Knowl Discov. 2017;32(3):737–63.
Pappalardo L, Simini F. Datadriven generation of spatiotemporal routines in human mobility. Data Mining Knowl Discov. 2017;91:511–24.
Yuan J, Zheng Y, Xie X, Sun G. Driving with knowledge from the physical world. In: The 17th ACM SIGKDD international conference on knowledge discovery and data mining, KDD’11. 2011.
Yuan J, Zheng Y, Zhang C, Xie W, Xie X, Sun G, Huang Y. Tdrive: driving directions based on taxi trajectories. In: Proceedings of the 18th SIGSPATIAL international conference on advances in geographic information systems, GIS’10. 2010.
di Milano DP, SpazioDati. Social pulse—Milano. Harv Dataverse. 2015;12:1. https://doi.org/10.7910/DVN/9IZALB.
Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28:129–37.
Killick R, Eckley IA. Changepoint: an R package for changepoint analysis. J Stat Softw. 2014;58:1–9.
Killick R, Fearnhead P, Eckley IA. Optimal detection of changepoints with a linear computational cost. J Am Stat Assoc. 2012;107:1590–8.
Lesmeister C. Changepoint analysis of time series? Tech Rep. 2013. https://www.rbloggers.com/changepointanalysisoftimeseries/
Center of Computational Communication of Nanjing University. Case study of spatial analysis: spatial point pattern analysis (In Chinese). https://site.douban.com/146782/widget/notes/15468638/note/337537003/.
Longley PA, Goodchild M, Maguire DJ, Rhind DW. Geographic information systems and science. 3rd ed. New York: Wiley; 2010.
Breiman L. Statistical modeling: the two cultures. Stat Sci. 2001;16(3):199–231.
Authors' contributions
ZY and NJ conceived of and designed this study. ZY performed the analysis and drafted the manuscript. NJ gave many valuable suggestions about the machine learning algorithms and helped edit the language of the manuscript. Both authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The data sets supporting the results of this article are included within the article.
Consent for publication
All authors have approved the manuscript and agree with its submission to the journal.
Ethics approval and consent to participate
Not applicable.
Funding
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Yang, Z., Japkowicz, N. Anomaly behaviour detection based on the metaMorisita index for large scale spatiotemporal data set. J Big Data 5, 23 (2018). https://doi.org/10.1186/s4053701801338
DOI: https://doi.org/10.1186/s4053701801338