Discovery of crime event sequences with constricted spatiotemporal sequential patterns
Journal of Big Data volume 10, Article number: 98 (2023)
Abstract
In this article, we introduce a novel type of spatiotemporal sequential patterns called Constricted SpatioTemporal Sequential (CSTS) patterns and thoroughly analyze their properties. We demonstrate that the set of CSTS patterns is a concise representation of all spatiotemporal sequential patterns that can be discovered in a given dataset. To measure the significance of the discovered CSTS patterns, we adapt the participation index measure. We also provide CSTSMiner: an algorithm that discovers all participation index strong CSTS patterns in event data. We experimentally evaluate the proposed algorithm using two crime-related datasets: the Pittsburgh Police Incident Blotter Dataset and the Boston Crime Incident Reports Dataset. In the experiments, the CSTSMiner algorithm is compared with four other state-of-the-art algorithms: STSMiner, CSTPM, STBFM and CSTSPMiner. As the results of the experiments suggest, the proposed algorithm discovers far fewer patterns than the other selected algorithms. Finally, we provide examples of interesting crime-related patterns discovered by the proposed CSTSMiner algorithm.
Introduction
Discovering knowledge in the form of various types of patterns, inference rules or motifs from spatiotemporal event data is a topic attracting increasing attention from researchers worldwide [1,2,3]. Specifically, many real-world spatiotemporal datasets consist of a set of event types and their event instances defined by geographical locations and occurrence times. An example of a spatiotemporal event dataset is a set of crime event incidents, such as arson, homicide or vandalism, each of which is assigned an event type, a geographical location and time of occurrence. Discovering sequences of crime types that occur in a spatial area over a time period can contribute to a better understanding of the causes of these crimes and to their elimination [4,5,6,7,8].
In order to discover such sequences of spatiotemporal event types, one can consider applying one of the algorithms for spatiotemporal sequential patterns discovery. A spatiotemporal sequential pattern (in brief, ST sequential pattern), introduced in [9], is defined as a sequence of event types. By discovering ST sequential patterns, one can obtain insight into spatiotemporal relations between various event types. For example, the discovery of an ST sequential pattern arson \(\rightarrow\) vandalism \(\rightarrow\) bomb can reveal a critical behavior pattern of a dangerous terrorist. Another ST sequential pattern that could be discovered from the dataset presented in Fig. 1 is \(A \rightarrow B \rightarrow C \rightarrow D\).
Limitations of the existing work
Several methods and algorithms have already been developed for the discovery of ST sequential patterns (see, for example, [9,10,11,12,13,14]). The first algorithm for the discovery of significant ST sequential patterns, called STSMiner, was introduced in [9]. In [11, 12], a significant ST sequential pattern is defined as a pattern whose participation index measure is greater than the user-specified threshold \(PI_{min}\). Such a pattern is referred to in [12] as a PI-strong ST sequential pattern (we adopt this naming convention in this paper).
Although the existing algorithms (such as STSMiner [9], STBFM [11], CSTSPMiner [12], CSTPM [10], STES [15]) can discover PI-strong (closed) ST sequential patterns in some datasets, in practice the number of discovered redundant patterns can still be too large to be analyzed by a user of the algorithm. Hence, in this paper, we offer a novel representation of ST sequential patterns which we call Constricted SpatioTemporal Sequential patterns (in brief, CSTS patterns) and analyze their theoretical properties. As presented in the paper, given the set of CSTS patterns one can approximate participation index values of all ST sequential patterns.
To verify the efficiency and effectiveness of the proposed approach, we use two real-world datasets of crime events: the Pittsburgh Police Incident Blotter Dataset and the Boston Crime Incident Reports Dataset. For example, one of the conducted experiments shows that for the Boston Crime Incident Reports Dataset, the proposed approach is able to discover 65,967 CSTS patterns, while the three algorithms discovering all spatiotemporal sequential patterns provide as many as 228,285 patterns. The discovered 65,967 CSTS patterns can be used to derive all 228,285 spatiotemporal sequential patterns and approximate the participation index of each of them with a maximal error of \(\pm ~0.025\).
Contributions
The contributions of the paper are as follows:

We introduce the notion of a Constricted SpatioTemporal Sequential (CSTS) pattern that constitutes a concise representation of all ST sequential patterns.

We thoroughly analyze theoretical properties of CSTS patterns. Specifically, we show that the set of CSTS patterns is a subset of the set of closed ST sequential patterns and that each CSTS pattern is also a closed ST sequential pattern. Moreover, we show that given the set of Participation Index (PI)-strong CSTS patterns one can obtain the set of all PI-strong ST sequential patterns and approximate the participation index of each of them with an approximation margin of \(\pm ~\varepsilon\).

We offer a new algorithm called CSTSMiner that discovers PI-strong CSTS patterns. CSTSMiner uses a newly introduced MAXTree structure for more efficient generation of candidate patterns. The proposed MAXTree is generated in two main phases of CSTSMiner: the first phase, called “top-down”, in which all PI-strong ST sequential patterns are generated using the breadth-first strategy, and the second phase, called “bottom-up”, which calculates PI-strong CSTS patterns. We also offer an analysis of the computational and memory complexity of CSTSMiner.

We experimentally compare the results obtained with the CSTSMiner algorithm with three other state-of-the-art algorithms discovering ST sequential patterns: the adapted version of STSMiner [9], STBFM [11] and CSTPM [10], which discover PI-strong ST sequential patterns. Moreover, we also compare CSTSMiner with the CSTSPMiner algorithm [12], which discovers PI-strong closed ST sequential patterns. Similarly to our CSTS patterns, closed spatiotemporal sequential patterns discovered by CSTSPMiner also constitute a concise representation of all ST sequential patterns. For the purpose of experiments, we selected and preprocessed two crime events datasets: the Pittsburgh Police Incident Blotter Dataset and the Boston Crime Incident Reports Dataset. As we show, CSTSMiner discovers far fewer redundant patterns than the other selected algorithms. Specifically, as the results of the experiments confirm, CSTSMiner provides up to 60% fewer patterns compared to the STSMiner, STBFM and CSTPM algorithms and up to 40% fewer patterns compared to the CSTSPMiner algorithm.

We provide an experimental comparison of the effectiveness and efficiency of the above-mentioned STSMiner, CSTPM, STBFM and CSTSPMiner algorithms discovering (closed) spatiotemporal sequential patterns. In our comparison, we report the number of discovered patterns and the execution time of each algorithm for the same participation index threshold and neighborhood specification. Our implementations (prepared in the C++ language) of the selected algorithms (STSMiner, STBFM, CSTPM, CSTSPMiner) as well as the proposed CSTSMiner algorithm are available at the GitHub repository.^{Footnote 1}

Finally, we provide examples of interesting crime-related patterns discovered by CSTSMiner from the Pittsburgh Police Incident Blotter Dataset.
Structure
The structure of the article is as follows. In Sect. Related work, we offer a brief review of the related work. Section Basic notions offers basic notions of ST sequential patterns. In Sect. Discovery of constricted ST sequential patterns, we introduce the notion of a CSTS pattern. In Sect. Theoretical properties of CSTS patterns, we analyze theoretical properties of CSTS patterns. Section Constricted ST sequential patterns miner describes the proposed CSTSMiner algorithm. In Sect. Experiments, we provide the results of experiments and in Sect. Conclusion we conclude the article.
Related work
The discovery of concise representations of various patterns (especially frequent patterns and sequential patterns) attracts researchers’ attention. In [16], closed sequential patterns representation was introduced for the first time. Following [16], numerous works were dedicated to the problem of more efficient mining of closed sequential patterns. The examples include methods and algorithms offered in [17, 18] or, more recently, in [19] and [20]. Other related research directions include discovery of top closed sequential patterns with the highest support [21, 22] and the parallel discovery of closed sequential patterns [23]. A survey of the current methods for the discovery of closed sequential patterns can be found in [24].
Our proposed notion of CSTS patterns is similar to the notion of delta closed sequential patterns [25] in that they both avoid returning patterns whose significance measure is no greater than the significance measure of their supersequence patterns plus the approximation margin (which in the case of delta closed patterns is called \(\delta\)-tolerance). However, unlike our CSTS patterns, delta closed sequential patterns are not designed to work with spatiotemporal event data.
While many methods and algorithms were offered to discover various types of spatiotemporal patterns, relatively few of them focused on discovering ST sequential patterns. To this end, in our experiments the proposed CSTSMiner algorithm is compared with the most representative algorithms mining ST sequential patterns, namely:

STSMiner [9]—the first algorithm offered for the discovery of ST sequential patterns. STSMiner uses the depth-first patterns generation strategy. In this work, we adapted STSMiner to use the participation index measure instead of the sequence index measure introduced in [9]. Thus, the adapted version of STSMiner is capable of discovering PI-strong ST sequential patterns.

STBFM [11]—the algorithm discovering PI-strong ST sequential patterns by means of the breadth-first pattern generation strategy. Maciąg and Bembenik [11] introduced a structure called SPTree that allows candidate patterns to be generated efficiently using their first and second parent patterns and a children list of the first parent pattern. In addition, [11] presented two variations of STBFM that can discover top-K PI-strong ST sequential patterns. The experiments of [11] showed that STBFM is capable of discovering some interesting crime-related patterns.

CSTSPMiner [12]—the algorithm discovering closed PI-strong ST sequential patterns. CSTSPMiner applies the breadth-first candidate patterns generation strategy to obtain all PI-strong closed ST sequential patterns. For each obtained PI-strong ST sequential pattern \(\overrightarrow{s}\), CSTSPMiner determines if this pattern is closed or not using a check condition verifying if \(\overrightarrow{s}\) is a closure pattern of any of its parent patterns.

CSTPM [10]—the algorithm discovering Cascading SpatioTemporal Patterns (CSTP). CSTP patterns consist not only of ST sequential patterns but also of cascades of event types. In our work, to directly compare CSTPM with the proposed CSTSMiner, we adapted the CSTPM algorithm for the discovery of ST sequential patterns rather than cascading ST patterns.
The reviews of methods and algorithms for (spatiotemporal) patterns discovery with particular emphasis on spatiotemporal event datasets can be found in [15, 26,27,28,29,30,31,32,33].
Basic notions
The definitions presented in this section are formulated based on the works [9, 11, 12].
ST sequential patterns
Let \({\textbf{F}}\) denote a set of n event types and \({\textbf{D}}\) denote a dataset of event instances in which each \(e \in {\textbf{D}}\) consists of a spatial location, an occurrence time (timestamp) and an event type \(F \in {\textbf{F}}\). \({\textbf{D}}\) will be called a spatiotemporal event dataset. Moreover, let \(|{\textbf{D}}|\) denote the number of event instances in \({\textbf{D}}\). The set of all event instances of type F in dataset \({\textbf{D}}\) will be denoted by \({\textbf{D}}(F)\).
Spatiotemporal event datasets are common in the real world. Examples include a dataset of crime instances and crime event types or a dataset of conflict incidents and their types. Let us consider an example of a spatiotemporal event dataset presented in Fig. 1. This dataset consists of event instances \({\textbf{D}} = \{a_1, a_2, b_1, \dots , b_8, c_1, \dots , c_8, d_1, \dots , d_3, e_1, \dots , e_5\}\) and a set of five event types \({\textbf{F}} = \{A, B, C, D, E\}\).^{Footnote 2} To better illustrate the notions introduced in this section, let us map the event types \({\textbf{F}} = \{A, B, C, D, E\}\) of the dataset presented in Fig. 1 to real-world crime event types as follows:

A—Vandalism,

B—Robbery,

C—Simple assault,

D—Arson,

E—Aggravated assault.
A spatiotemporal sequential pattern (ST sequential pattern) \(\overrightarrow{s}\) is defined as a sequence of elements, each of which is an event type from \({\textbf{F}}\) [9]. Please note that an event type \(F \in {\textbf{F}}\) can occur multiple times in an ST sequential pattern \(\overrightarrow{s}\)^{Footnote 3}. Before we provide a formal definition of an ST sequential pattern, we recall the definition of a spatiotemporal neighborhood of an event instance with respect to an event type:
Definition 1
(Neighborhood of an event instance with respect to an event type [9, 12]) For an event instance e, the neighborhood of e with respect to an event type \(F \in {\textbf{F}}\) is denoted by \({\textbf{N}}(e, F)\) and is defined as follows:
\({\textbf{N}}(e, F) = \{p \mid p \in {\textbf{D}}(F) \wedge distance(p.location, e.location) \le R \wedge (p.time - e.time) \in (0, T]\}\),
where R and T are user-given spatial distance and time window thresholds, respectively.
In Fig. 1, the neighborhood \({\textbf{N}}(a_1, B)\) consists of event instances \(\{b_1, b_2\}\); that is, with respect to event type B, the neighborhood of the Vandalism event \(a_1\) contains two Robbery events, \(b_1\) and \(b_2\). Similarly, \({\textbf{N}}(c_1, E) = \{e_1, e_2\}\); that is, with respect to Aggravated assault, the neighborhood of the Simple assault instance \(c_1\) contains two events, \(e_1\) and \(e_2\).
The spatial distance threshold of neighborhoods given in Fig. 1 is \(R = 10\), while the time window threshold is equal to \(T = 20\). In our experiments, the spatial distance between locations of two event instances is calculated as the Euclidean distance between these locations. Please note that according to Definition 1, an event instance \(e_j\) can be located in a neighborhood of an event instance \(e_i\) only if the occurrence time of \(e_j\) is greater than the occurrence time of \(e_i\) (i.e., two event instances with the same occurrence time cannot belong to each other's neighborhoods, because the difference between their occurrence times would be \(e_i.time - e_j.time = 0\)).
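The neighborhood test can be sketched in Python as follows. This is a minimal illustration, assuming that a neighbor must lie within Euclidean distance R of e and occur strictly later than e, but no more than T time units later. The `Event` record, the inclusive bounds and the sample coordinates are hypothetical and are not the data of Fig. 1.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Event:
    name: str      # instance identifier, e.g. "a1"
    etype: str     # event type, e.g. "A"
    x: float
    y: float
    time: float


def neighborhood(e, event_type, dataset, R, T):
    """Return instances of `event_type` within distance R of e that occur
    after e and at most T time units later (assumed inclusive bounds)."""
    return [
        p for p in dataset
        if p.etype == event_type
        and hypot(p.x - e.x, p.y - e.y) <= R
        and 0 < p.time - e.time <= T
    ]


# Hypothetical toy data (not taken from Fig. 1).
dataset = [
    Event("a1", "A", 0, 0, 0),
    Event("b1", "B", 3, 4, 5),     # distance 5, 5 time units later: neighbor
    Event("b2", "B", 6, 8, 20),    # distance 10, 20 time units later: neighbor
    Event("b3", "B", 1, 1, 0),     # same timestamp as a1: excluded
    Event("b4", "B", 30, 0, 5),    # too far away: excluded
]

neighbors = neighborhood(dataset[0], "B", dataset, R=10, T=20)
print([p.name for p in neighbors])
```

The strict lower bound on the time difference is what excludes `b3`, which shares its timestamp with `a1`.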
Definition 2
(Spatiotemporal sequential pattern) A spatiotemporal sequential pattern (in brief, ST sequential pattern) is a sequence of event types in \({\textbf{F}}\). The i-th element of sequence \(\overrightarrow{s}\) is denoted by \(\overrightarrow{s}[i]\). Sequence \(\overrightarrow{s}\) which consists of m elements is denoted as \(\overrightarrow{s}[1] \rightarrow \overrightarrow{s}[2] \rightarrow \dots \rightarrow \overrightarrow{s}[m]\). The number of elements of sequence \(\overrightarrow{s}\) is defined as its length.
An example of an ST sequential pattern for the dataset presented in Fig. 1 is \(\overrightarrow{s} = A \rightarrow B \rightarrow C\), the length of which is 3. The important question is how to efficiently calculate neighborhoods of event instances. In our implementation, we adapted the computationally efficient plane sweep algorithm (see e.g. [34]).
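One simple way to avoid comparing every pair of instances is to sweep along the time axis. The sketch below is not the authors' plane sweep implementation. It is a simplified illustration that sorts instances by time and, for each instance, scans only those whose timestamps fall in the window (t, t + T] before applying the distance test. The tuple layout, the inclusive bounds and the sample data are hypothetical.

```python
from bisect import bisect_right
from math import hypot


def all_neighborhoods(events, R, T):
    """events: list of (name, etype, x, y, time).
    Returns {(name, neighbor_type): [neighbor names]} using a sweep over time."""
    ordered = sorted(events, key=lambda ev: ev[4])
    times = [ev[4] for ev in ordered]
    result = {}
    for name, _, x, y, t in ordered:
        lo = bisect_right(times, t)        # first instance strictly later than t
        hi = bisect_right(times, t + T)    # one past the last instance with time <= t + T
        for p_name, p_type, px, py, _ in ordered[lo:hi]:
            if hypot(px - x, py - y) <= R:
                result.setdefault((name, p_type), []).append(p_name)
    return result


# Hypothetical toy data (not taken from Fig. 1).
events = [
    ("a1", "A", 0, 0, 0),
    ("b1", "B", 3, 4, 5),
    ("b2", "B", 6, 8, 20),
    ("b3", "B", 1, 1, 0),
    ("c1", "C", 4, 4, 30),
]
hoods = all_neighborhoods(events, R=10, T=20)
print(hoods[("a1", "B")])
```

Because the instances are sorted once, each lookup of the time window costs only two binary searches; the distance test is then applied to the instances inside the window.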
Definition 3
(Set of event instances supporting an element of an ST sequential pattern [9, 12]) The set of event instances supporting the i-th element of ST sequential pattern \(\overrightarrow{s}\) is denoted by \({\textbf{I}}(\overrightarrow{s},i)\) and is defined as follows:
\({\textbf{I}}(\overrightarrow{s}, 1) = {\textbf{D}}(\overrightarrow{s}[1])\) and, for \(i > 1\), \({\textbf{I}}(\overrightarrow{s}, i) = \bigcup \limits _{e \in {\textbf{I}}(\overrightarrow{s}, i - 1)} {\textbf{N}}(e, \overrightarrow{s}[i])\).
For each ST sequential pattern \(\overrightarrow{s}\), we can unambiguously distinguish sets of event instances supporting elements of that pattern. For the first element of a pattern, the set of event instances \({\textbf{I}}(\overrightarrow{s}, 1)\) supporting that element is defined simply as all event instances of event type \(\overrightarrow{s}[1]\) in \({\textbf{D}}\). For every next element of \(\overrightarrow{s}\) (say i), the set of supporting event instances \({\textbf{I}}(\overrightarrow{s}, i)\) consists of all those event instances of event type \(\overrightarrow{s}[i]\) which belong to neighborhoods of instances contained in the supporting set \({\textbf{I}}(\overrightarrow{s}, i - 1)\).
Let us consider an example of a previously given ST sequential pattern \(\overrightarrow{s} = A \rightarrow B \rightarrow C\) (Vandalism \(\rightarrow\) Robbery \(\rightarrow\) Simple assault) of the dataset in Fig. 1. The sets of event instances supporting \(\overrightarrow{s}\) are as follows:

\({\textbf{I}}(\overrightarrow{s}, 1) = {\textbf{D}}(A) = \{a_1, a_2\}\);

\({\textbf{I}}(\overrightarrow{s}, 2) = \bigcup \limits _{a \in {\textbf{I}}(\overrightarrow{s},1)} {\textbf{N}}(a, B) = \{b_1, b_2, b_3, b_4\}\);

\({\textbf{I}}(\overrightarrow{s}, 3) = \bigcup \limits _{b \in {\textbf{I}}(\overrightarrow{s}, 2)} {\textbf{N}}(b, C) = \{c_1, c_2, c_3 \}\);
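The step-by-step construction of the supporting sets can be sketched as follows. The neighborhood table below is hypothetical: only \({\textbf{N}}(a_1, B) = \{b_1, b_2\}\) is stated in the text, and the remaining entries are chosen so that the resulting sets match the example. The function and variable names are also illustrative.

```python
def supporting_sets(pattern, instances_by_type, neighbors):
    """Return [I(s,1), ..., I(s,m)] for `pattern` (a list of event types).

    instances_by_type: {event type: set of instance names}, i.e. D(F)
    neighbors: {(instance name, event type): set of instance names}, i.e. N(e, F)
    """
    sets = [set(instances_by_type[pattern[0]])]       # I(s,1) = D(s[1])
    for event_type in pattern[1:]:
        nxt = set()
        for e in sets[-1]:                            # union of N(e, s[i]) over I(s,i-1)
            nxt |= neighbors.get((e, event_type), set())
        sets.append(nxt)
    return sets


instances_by_type = {
    "A": {"a1", "a2"},
    "B": {f"b{i}" for i in range(1, 9)},
    "C": {f"c{i}" for i in range(1, 9)},
}
# Hypothetical neighborhoods; only N(a1, B) = {b1, b2} is stated in the text.
neighbors = {
    ("a1", "B"): {"b1", "b2"},
    ("a2", "B"): {"b3", "b4"},
    ("b1", "C"): {"c1"},
    ("b3", "C"): {"c2", "c3"},
    ("b5", "C"): {"c4"},   # b5 does not support A -> B, so c4 is not counted
}

for i, s in enumerate(supporting_sets(["A", "B", "C"], instances_by_type, neighbors), 1):
    print(i, sorted(s))
```

Note that `c4` is left out of the third set even though it has a Robbery neighbor: that neighbor (`b5`) is not itself in the supporting set of the second element.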
In this article, we apply the participation ratio and participation index measures, previously introduced in [10, 11, 15], to assess the significance of discovered patterns. The participation ratio of the i-th element of an ST sequential pattern \(\overrightarrow{s}\) expresses the quotient of the number of event instances supporting the i-th element of \(\overrightarrow{s}\) to the number of event instances of \(\overrightarrow{s}[i]\) event type in \({\textbf{D}}\).
Definition 4
(Participation Ratio (PR) and Participation Index (PI) [12]) The participation ratio of the i-th element of ST sequential pattern \(\overrightarrow{s}\), where \(i \ge 1\), is denoted by \(PR(\overrightarrow{s}, i)\) and is defined as the ratio of the cardinality of the set of event instances supporting the i-th element of \(\overrightarrow{s}\) to the number of all instances of type \(\overrightarrow{s}[i]\) in the dataset \({\textbf{D}}\); that is: \(PR(\overrightarrow{s}, i) = \dfrac{\big |{\textbf{I}}(\overrightarrow{s}, i)\big |}{\big |{\textbf{D}}(\overrightarrow{s}[i])\big |}\).
The participation index of ST sequential pattern \(\overrightarrow{s} = \overrightarrow{s}[1] \rightarrow \overrightarrow{s}[2] \rightarrow \dots \rightarrow \overrightarrow{s}[m]\) is denoted by \(PI(\overrightarrow{s})\) and is defined as the minimum of the participation ratios of all elements of \(\overrightarrow{s}\); that is, \(PI(\overrightarrow{s}) = \text {min}\big (\{PR(\overrightarrow{s}, i) \mid i = 1, 2, \dots , m\}\big )\).
By Definition 4, the value of participation ratio is always in the range [0,1]. Participation index is defined as the minimum of participation ratios of all elements of an ST sequential pattern.
Let us consider the pattern \(\overrightarrow{s} = A \rightarrow B \rightarrow C\) (Vandalism \(\rightarrow\) Robbery \(\rightarrow\) Simple assault) and let us calculate PR and PI values of this pattern given the dataset and neighborhoods parameters of Fig. 1.

\(PR(\overrightarrow{s}, 1)= 1\),

\(PR(\overrightarrow{s}, 2)= \frac{4}{8} = 0.5\) (that is half of all Robbery event instances occur in neighborhoods of Vandalism event instances),

\(PR(\overrightarrow{s}, 3)= \frac{3}{8} = 0.375\) (only three instances of Simple assault event type occur in neighborhoods of event instances of the set \({\textbf{I}}(\overrightarrow{s}, 2)\)).
Thus, the participation index of \(\overrightarrow{s} = A \rightarrow B \rightarrow C\) equals \(PI(\overrightarrow{s}) = 0.375\).
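The same arithmetic can be written as a short Python sketch. The set sizes are those of the example above (2 of 2 Vandalism, 4 of 8 Robbery and 3 of 8 Simple assault instances); the function names are illustrative.

```python
from fractions import Fraction


def participation_ratios(supporting_sizes, type_sizes):
    """PR(s, i) = |I(s, i)| / |D(s[i])| for each element i of the pattern."""
    return [Fraction(k, n) for k, n in zip(supporting_sizes, type_sizes)]


def participation_index(ratios):
    """PI(s) is the minimum participation ratio over all elements."""
    return min(ratios)


# Pattern A -> B -> C: |I(s,i)| = 2, 4, 3 and |D(A)|, |D(B)|, |D(C)| = 2, 8, 8.
ratios = participation_ratios([2, 4, 3], [2, 8, 8])
pi = participation_index(ratios)
print([float(r) for r in ratios], float(pi))   # [1.0, 0.5, 0.375] 0.375
```

Exact fractions are used here only to keep the comparison of ratios free of rounding effects.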
In Table 1, we listed all ST sequential patterns whose participation indexes are greater than 0 that can be discovered in the dataset of Fig. 1.
Definition 5
(PI-strong ST sequential pattern) A candidate ST sequential pattern \(\overrightarrow{s}\) is called PI-strong if its participation index \(PI(\overrightarrow{s})\) is greater than the participation index threshold \(PI_{min}\).
Discovery of all PI-strong ST sequential patterns can be performed, for example, using the STBFM algorithm introduced in [11].
Let us assume that \(PI_{min} = 0.5\). The set of all PI-strong ST sequential patterns for the dataset of Fig. 1 is: A(1), B(1), C(1), D(1), E(1), \(B \rightarrow B(0.625)\), \(B \rightarrow D(1)\), \(C \rightarrow E(0.8)\).
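Once participation indexes are known, selecting the PI-strong patterns is a simple filter. The sketch below uses only some of the participation indexes quoted in this section (it is not the full content of Table 1), and the dictionary layout is an assumption made for illustration.

```python
# Participation indexes quoted in the text for some patterns of Fig. 1.
pattern_pi = {
    ("A",): 1.0, ("B",): 1.0, ("C",): 1.0, ("D",): 1.0, ("E",): 1.0,
    ("A", "B"): 0.5,
    ("B", "B"): 0.625,
    ("B", "D"): 1.0,
    ("C", "E"): 0.8,
    ("A", "B", "C"): 0.375,
}


def pi_strong(patterns, pi_min):
    """Keep patterns whose participation index is strictly greater than pi_min."""
    return {p: pi for p, pi in patterns.items() if pi > pi_min}


strong = pi_strong(pattern_pi, pi_min=0.5)
print([" -> ".join(p) for p in strong])
```

Because the comparison is strict, a pattern whose participation index equals the threshold exactly (here A -> B with 0.5) is not PI-strong.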
Definition 6
Let \(\overrightarrow{s_1} = \overrightarrow{s_1}[1] \rightarrow \overrightarrow{s_1}[2] \rightarrow \dots \rightarrow \overrightarrow{s_1}[m_1]\) and \(\overrightarrow{s_2} = \overrightarrow{s_2}[1] \rightarrow \overrightarrow{s_2}[2] \rightarrow \dots \rightarrow \overrightarrow{s_2}[m_2]\) be ST sequential patterns. \(\overrightarrow{s_1}\) is a subsequence of \(\overrightarrow{s_2}\) and \(\overrightarrow{s_2}\) is a supersequence of \(\overrightarrow{s_1}\) if \(m_1 \le m_2\) and there exists an integer k, where \(0 \le k \le m_2 - m_1\), such that \(\overrightarrow{s_1}[1] = \overrightarrow{s_2}[1+k] \wedge \overrightarrow{s_1}[2] = \overrightarrow{s_2}[2+k] \wedge \dots \wedge \overrightarrow{s_1}[m_1] = \overrightarrow{s_2}[m_1 + k]\).
If \(m_1 < m_2\), then \(\overrightarrow{s_1}\) is a proper subsequence of \(\overrightarrow{s_2}\) and \(\overrightarrow{s_2}\) is a proper supersequence of \(\overrightarrow{s_1}\).
Theorem 1
(Anti-monotonicity property of the participation index for supersequences [12]) Let \(\overrightarrow{s_1}\) and \(\overrightarrow{s_2}\) be ST sequential patterns. If \(\overrightarrow{s_1}\) is a subsequence of \(\overrightarrow{s_2}\), then \(PI(\overrightarrow{s_1}) \ge PI(\overrightarrow{s_2})\).^{Footnote 4}
For the dataset presented in Fig. 1, \(\overrightarrow{s_1} = A \rightarrow B\) is a proper subsequence of \(\overrightarrow{s_2} = A \rightarrow B \rightarrow C \rightarrow E\) (\(\overrightarrow{s_2}\) is a proper supersequence of \(\overrightarrow{s_1}\)).
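Definition 6 treats a subsequence as a contiguous run of elements, shifted by some offset k. A minimal Python sketch of this check (the function names are illustrative) is:

```python
def is_subsequence(s1, s2):
    """True if s1 occurs in s2 as a contiguous block starting at some offset k,
    with 0 <= k <= len(s2) - len(s1) (Definition 6)."""
    m1, m2 = len(s1), len(s2)
    return any(s2[k:k + m1] == s1 for k in range(m2 - m1 + 1))


def is_proper_subsequence(s1, s2):
    """A proper subsequence is a subsequence that is strictly shorter."""
    return len(s1) < len(s2) and is_subsequence(s1, s2)


s1 = ["A", "B"]
s2 = ["A", "B", "C", "E"]
print(is_proper_subsequence(s1, s2))      # True
print(is_subsequence(["A", "C"], s2))     # False: A and C are not adjacent in s2
print(is_subsequence(s2, s2))             # True, but not a proper subsequence
```

Under this definition, gaps are not allowed: A -> C is not a subsequence of A -> B -> C -> E even though both event types appear in order.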
As follows from Theorem 1, the PI value of ST sequential pattern \(\overrightarrow{s_2}\) is always less than or equal to the PI value of any of its proper subsequences \(\overrightarrow{s_1}\). The STBFM, CSTSPMiner and CSTPM algorithms apply Theorem 1 to efficiently generate candidate ST sequential patterns using the breadth-first search strategy.
Closed ST sequential patterns
Maciąg [12] introduced a concise and lossless representation of ST sequential patterns called closed ST sequential patterns. The important property of closed ST sequential patterns is that one can obtain the value of participation index of any ST sequential pattern given only the set of all closed ST sequential patterns. In Sect. “Discovery of constricted ST sequential patterns” of this work, we theoretically and experimentally compare the proposed constricted ST sequential patterns to the closed ST sequential patterns. Thus, Definition 7 recalls the notions of closed ST sequential pattern, closure of an ST sequential pattern and PI-strong closed ST sequential pattern.
Definition 7
(Closed ST sequential pattern and closure of an ST sequential pattern [12]) ST sequential pattern \(\overrightarrow{s_1}\) is closed if there exists no proper supersequence \(\overrightarrow{s_2}\) of \(\overrightarrow{s_1}\), such that the participation index \(PI(\overrightarrow{s_2}) = PI(\overrightarrow{s_1})\).
A closure of ST sequential pattern \(\overrightarrow{s_1}\) is a supersequence \(\overrightarrow{s_2}\) of \(\overrightarrow{s_1}\), such that \(\overrightarrow{s_2}\) is a closed ST sequential pattern and \(PI(\overrightarrow{s_2}) = PI(\overrightarrow{s_1})\).
A PI-strong closed ST sequential pattern is a closed ST sequential pattern whose participation index is greater than the threshold \(PI_{min}\).
For example, for the set of ST sequential patterns of Table 1, \(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E \rightarrow C(0.25)\) is a closed ST sequential pattern. \(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E \rightarrow C(0.25)\) is also a closure of the following patterns:

\(A \rightarrow B \rightarrow B(0.25)\),

\(C \rightarrow E \rightarrow C(0.25)\),

\(A \rightarrow B \rightarrow B \rightarrow C(0.25)\),

\(B \rightarrow C \rightarrow E \rightarrow C(0.25)\),

\(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E(0.25)\),

\(A \rightarrow B \rightarrow C \rightarrow E \rightarrow C(0.25)\),

\(B \rightarrow B \rightarrow C \rightarrow E \rightarrow C(0.25)\).
An ST sequential pattern can be closed and be its own closure. For example, \(B \rightarrow B(0.625)\) is a closed ST sequential pattern and it is also its own closure. Please note that according to Definition 7 an ST sequential pattern can have more than one closure. For example, pattern \(B \rightarrow C \rightarrow E(0.375)\) of the dataset in Fig. 1 has two closures:

\(A \rightarrow B \rightarrow C \rightarrow E(0.375)\),

\(B \rightarrow B \rightarrow C \rightarrow E(0.375)\).
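Closedness and closures can be checked directly from Definition 7 when participation indexes are available. The sketch below is illustrative only: it uses a partial table containing participation indexes quoted in this section, not the complete Table 1, so "closed" here means closed with respect to the patterns in that table. The helper names are assumptions.

```python
def is_subsequence(s1, s2):
    """Contiguous subsequence test of Definition 6."""
    return any(s2[k:k + len(s1)] == s1 for k in range(len(s2) - len(s1) + 1))


# Partial table: pattern -> participation index, as quoted in the text.
PI = {
    ("B", "B"): 0.625,
    ("B", "C", "E"): 0.375,
    ("A", "B", "C", "E"): 0.375,
    ("B", "B", "C", "E"): 0.375,
    ("A", "B", "B", "C", "E"): 0.25,
    ("B", "B", "C", "E", "C"): 0.25,
    ("A", "B", "B", "C", "E", "C"): 0.25,
}


def is_closed(s, table):
    """No proper supersequence in the table has the same participation index."""
    return not any(
        len(t) > len(s) and is_subsequence(s, t) and table[t] == table[s]
        for t in table
    )


def closures(s, table):
    """Closed supersequences of s (possibly s itself) with the same PI."""
    return [
        t for t in table
        if is_subsequence(s, t) and table[t] == table[s] and is_closed(t, table)
    ]


print(is_closed(("B", "B"), PI))        # True: B -> B is its own closure
print(closures(("B", "C", "E"), PI))    # the two closures listed above
```

The second call returns both A -> B -> C -> E and B -> B -> C -> E, which illustrates that a closure need not be unique.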
Similarly to Table 1, in Table 2, we provide the set of all closed ST sequential patterns that can be discovered from the dataset of Fig. 1.
Discovery of constricted ST sequential patterns
In this section, we first present our motivation for introducing Constricted ST Sequential patterns. Next, we provide the elementary notions of such type of patterns.
Motivation
Closed ST sequential patterns of Definition 7 are a concise representation of all ST sequential patterns: hence, the set of closed ST sequential patterns can be used to derive all ST sequential patterns. However, in the case of many real-world spatiotemporal event data, participation indexes of ST sequential patterns strictly depend on the specification of the neighborhoods of event instances as well as spatial and temporal distribution of the locations of event instances in the dataset \({\textbf{D}}\). Specifically, for a given ST sequential pattern \(\overrightarrow{s}\) of length k, its proper supersequence patterns of length greater than k usually have only slightly smaller values of participation indexes than \(\overrightarrow{s}\). According to Definition 7, none of such supersequences can constitute a closure of \(\overrightarrow{s}\) and thus be a closed ST sequential pattern.
To illustrate such a situation, let us consider pattern \(\overrightarrow{s_1} = B \rightarrow B \rightarrow C (0.5)\) from Table 1. As follows from Table 2, pattern \(B \rightarrow B \rightarrow C (0.5)\) is a closed pattern (and thus as follows from Definition 7 it is its own closure). Let us now consider pattern \(\overrightarrow{s_2} = B \rightarrow B \rightarrow C \rightarrow E(0.375)\). \(\overrightarrow{s_2}\) is a proper supersequence of \(\overrightarrow{s_1}\) (as follows from Definition 6). However, because the PI value of \(\overrightarrow{s_2}\) is less than the PI value of \(\overrightarrow{s_1}\), \(\overrightarrow{s_2}\) cannot be a potential closure of \(\overrightarrow{s_1}\) despite the fact that the difference between the participation indexes of the two patterns is only 0.125.
Nevertheless, providing only supersequences of \(\overrightarrow{s}\) can often allow us to approximate the participation index of \(\overrightarrow{s}\).
Hence, in this paper, we offer a notion of a constricted ST sequential pattern. We define a constricted ST sequential pattern \(\overrightarrow{s_1}\) as such a maximal (that is, the longest) supersequence of pattern \(\overrightarrow{s}\) for which (i) the difference between participation indexes of \(\overrightarrow{s}\) and \(\overrightarrow{s_1}\) is minimal and (ii) the participation index of \(\overrightarrow{s_1}\) is greater than or equal to the participation index of \(\overrightarrow{s}\) minus approximation margin \(\varepsilon\)^{Footnote 5}. We show in Sect. “Theoretical properties of CSTS patterns” that given a set of PI-strong constricted ST sequential patterns, one can obtain a set of all PI-strong ST sequential patterns and approximate the participation index of each of them with an approximation margin \(\pm ~\varepsilon\).
Elementary notions
Let us begin with the definitions of a maximal supersequence of an ST sequential pattern and a minimal proper supersequence of an ST sequential pattern.
Definition 8
(Maximal supersequence of an ST sequential pattern) For an ST sequential pattern \(\overrightarrow{s_1}\) of a dataset \({\textbf{D}}\), its maximal ST supersequence pattern \(\overrightarrow{s_2}\) is such a supersequence of \(\overrightarrow{s_1}\) whose length is the greatest.
Please note that \(\overrightarrow{s_1}\) can have more than one maximal ST supersequence pattern.
Definition 9
(Minimal proper supersequence of an ST sequential pattern) For an ST sequential pattern \(\overrightarrow{s_1}\) of a dataset \({\textbf{D}}\), its minimal proper supersequence pattern \(\overrightarrow{s_2}\) is such a proper supersequence of \(\overrightarrow{s_1}\) whose length is the smallest.
Please note that \(\overrightarrow{s_1}\) can have more than one minimal proper ST supersequence pattern.
Let us consider the set of all supersequences of the pattern \(\overrightarrow{s_1} = B \rightarrow B\) presented in Table 1:

\(B \rightarrow B\),

\(A \rightarrow B \rightarrow B\),

\(B \rightarrow B \rightarrow C\),

\(B \rightarrow B \rightarrow D\),

\(A \rightarrow B \rightarrow B \rightarrow C\),

\(B \rightarrow B \rightarrow C \rightarrow E\),

\(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E\),

\(B \rightarrow B \rightarrow C \rightarrow E \rightarrow C\),

\(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E \rightarrow C\).
The maximal supersequence of the pattern \(\overrightarrow{s_1}\) is \(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E \rightarrow C\) and the minimal proper supersequences of \(\overrightarrow{s_1}\) are \(A \rightarrow B \rightarrow B\), \(B \rightarrow B \rightarrow C\), \(B \rightarrow B \rightarrow D\).
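Definitions 8 and 9 can be illustrated on patterns from the list above. The sketch assumes the candidate patterns are given explicitly; the function names are illustrative.

```python
def is_subsequence(s1, s2):
    """Contiguous subsequence test of Definition 6."""
    return any(s2[k:k + len(s1)] == s1 for k in range(len(s2) - len(s1) + 1))


def maximal_supersequences(s, patterns):
    """Supersequences of s with the greatest length (Definition 8)."""
    supers = [t for t in patterns if is_subsequence(s, t)]
    if not supers:
        return []
    longest = max(len(t) for t in supers)
    return [t for t in supers if len(t) == longest]


def minimal_proper_supersequences(s, patterns):
    """Proper supersequences of s with the smallest length (Definition 9)."""
    proper = [t for t in patterns if len(t) > len(s) and is_subsequence(s, t)]
    if not proper:
        return []
    shortest = min(len(t) for t in proper)
    return [t for t in proper if len(t) == shortest]


patterns = [tuple(p.split()) for p in [
    "B B", "A B B", "B B C", "B B D", "A B B C", "B B C E",
    "A B B C E", "A B B C E C",
]]
s = ("B", "B")
print(maximal_supersequences(s, patterns))
print(minimal_proper_supersequences(s, patterns))
```

Both functions return lists because, as noted above, neither the maximal nor the minimal proper supersequence is guaranteed to be unique.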
Definition 10
(\(\varepsilon\)-constricted maximal supersequence of an ST sequential pattern) An \(\varepsilon\)-constricted maximal supersequence \(\overrightarrow{s_1}\) of an ST sequential pattern \(\overrightarrow{s}\) (in brief, Constricted ST Sequential pattern, CSTS pattern) is a supersequence of \(\overrightarrow{s}\) that satisfies the following two conditions:

1. \(PI(\overrightarrow{s_1}) \ge PI(\overrightarrow{s}) - \varepsilon\), and

2. the difference \(PI(\overrightarrow{s}) - PI(\overrightarrow{s_1})\) is minimal over the set of all maximal supersequences of \(\overrightarrow{s}\).
We denote the set of all \(\varepsilon\)constricted maximal supersequences of pattern \(\overrightarrow{s}\) as \({\mathcal {C}}^{max}(\overrightarrow{s})\). \(\varepsilon\) is an approximation margin parameter, whose value is userspecified and is in the range [0,1].
To illustrate Definition 10, let us consider again the pattern \(\overrightarrow{s} = B \rightarrow B\) presented in Table 1 and let us assume that \(\varepsilon = 0.25\). The participation index of \(\overrightarrow{s}\) equals 0.675. The \(\varepsilon\)constricted supersequence of \(\overrightarrow{s}\) is the ST sequential pattern \({\mathcal {C}}^{max}(\overrightarrow{s}) = \{B \rightarrow B \rightarrow C \rightarrow E\}\), whose participation index equals 0.325.
Please note that an ST sequential pattern can have more than one \(\varepsilon\)constricted maximal supersequence. For example, let us consider ST sequential pattern \(\overrightarrow{s} = B \rightarrow C \rightarrow E\) presented in Table 1 and let us assume that approximation margin \(\varepsilon = 0.1\). The two \(\varepsilon\)constricted maximal supersequences of \(\overrightarrow{s}\) are: \({\mathcal {C}}^{max}(\overrightarrow{s}) = \{A \rightarrow B \rightarrow C \rightarrow E, B \rightarrow B \rightarrow C \rightarrow E\}\), whose PI values are both equal to 0.375. Please also note that according to Definition 10 an ST sequential pattern \(\overrightarrow{s}\) can be its own \(\varepsilon\)constricted maximal supersequence^{Footnote 6}.
It follows from Definition 10 that an ST sequential pattern \(\overrightarrow{s}\) can be a CSTS pattern of another ST sequential pattern and, at the same time, have its own CSTS patterns. For example, let us consider the pattern \(\overrightarrow{s} = B \rightarrow B \rightarrow C \rightarrow E\) from Table 1 and let us assume that \(\varepsilon = 0.25\). \(\overrightarrow{s}\) is a CSTS pattern of the ST sequential pattern \(B \rightarrow B\), but it also has its own CSTS pattern \({\mathcal {C}}^{max}(\overrightarrow{s}) = \{A \rightarrow B \rightarrow B \rightarrow C \rightarrow E \rightarrow C\}\).
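Definition 10 can also be read operationally. The following minimal sketch is only an illustration and not the implementation used in this study. It assumes that patterns are tuples of event-type labels, that a supersequence contains the pattern as a contiguous run, and that "maximal" means the longest qualifying supersequence, with ties resolved by the smallest PI difference. The names `contains_run` and `c_max` are hypothetical; the PI values are the ones quoted above for Table 1.

```python
def contains_run(sup, sub):
    """True if `sub` occurs in `sup` as a contiguous run (assumed supersequence test)."""
    m = len(sub)
    return any(sup[i:i + m] == sub for i in range(len(sup) - m + 1))


def c_max(s, pi, eps, tol=1e-12):
    """Return the assumed set C^max(s) of eps-constricted maximal supersequences.

    pi  -- dict mapping each known pattern (tuple of event types) to its PI
    eps -- approximation margin in [0, 1]
    """
    # Condition 1: keep supersequences whose PI is at least PI(s) - eps.
    ok = [t for t in pi if contains_run(t, s) and pi[t] >= pi[s] - eps - tol]
    # "Maximal" is read here as "longest"; s itself always qualifies.
    longest = max(len(t) for t in ok)
    ok = [t for t in ok if len(t) == longest]
    # Condition 2: among those, keep the ones with the smallest PI difference.
    best = max(pi[t] for t in ok)
    return {t for t in ok if abs(pi[t] - best) <= tol}


# PI values quoted in the text for patterns of Table 1.
pi = {
    ("B", "C", "E"): 0.375,
    ("A", "B", "C", "E"): 0.375,
    ("B", "B", "C", "E"): 0.375,
    ("A", "B", "B", "C", "E"): 0.25,
}
result = c_max(("B", "C", "E"), pi, eps=0.1)
```

Under these assumptions, `result` contains the two supersequences named in the example for \(\varepsilon = 0.1\); the longer pattern \(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E\) is rejected by the first condition.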
Definition 11
By \(\mathcal{R}\mathcal{C}^{max}(\overrightarrow{s_1})\) (reverse maximal closure set) we denote the set of ST sequential patterns for which \(\overrightarrow{s_1}\) is an \(\varepsilon\)-constricted maximal supersequence (that is, for which \(\overrightarrow{s_1}\) is a CSTS pattern).
Let us consider an ST sequential pattern \(\overrightarrow{s_1} = A \rightarrow B \rightarrow B \rightarrow C \rightarrow E \rightarrow C\) from Table 1, whose \(PI = 0.25\). Additionally, let us assume that approximation margin is \(\varepsilon = 0.25\). The set \(\mathcal{R}\mathcal{C}^{max}(\overrightarrow{s_1})\) is (the numbers in parentheses specify participation indexes):

\(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E \rightarrow C(0.25)\),

\(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E(0.25)\),

\(B \rightarrow B \rightarrow C \rightarrow E \rightarrow C(0.25)\),

\(A \rightarrow B \rightarrow B \rightarrow C(0.25)\),

\(B \rightarrow B \rightarrow C \rightarrow E(0.375)\),

\(B \rightarrow C \rightarrow E \rightarrow C(0.25)\),

\(A \rightarrow B \rightarrow B(0.25)\),

\(B \rightarrow B \rightarrow C(0.5)\),

\(B \rightarrow C \rightarrow E(0.375)\),

\(A \rightarrow B(0.5)\), \(B \rightarrow C(0.5)\), \(E \rightarrow C(0.5)\).
The set \(\mathcal{R}\mathcal{C}^{max}(\overrightarrow{s_1})\) is applied by the proposed CSTSMiner algorithm to identify CSTS patterns.
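If the sets \({\mathcal {C}}^{max}\) are already known, the reverse sets of Definition 11 can be obtained by inverting that mapping. The sketch below illustrates this with three \({\mathcal {C}}^{max}\) sets quoted in the text for \(\varepsilon = 0.25\); the function name `reverse_closure` and the dictionary layout are assumptions made for illustration only.

```python
from collections import defaultdict


def reverse_closure(c_max_sets):
    """Invert {pattern: C^max(pattern)} into {CSTS pattern: RC^max(CSTS pattern)}."""
    rc = defaultdict(set)
    for s, supersequences in c_max_sets.items():
        for s1 in supersequences:
            rc[s1].add(s)  # s1 is a CSTS pattern of s
    return dict(rc)


s1 = ("A", "B", "B", "C", "E", "C")
c_max_sets = {
    ("B", "B"): {("B", "B", "C", "E")},
    ("B", "B", "C"): {s1},
    ("B", "B", "C", "E"): {s1},
}
rc_max = reverse_closure(c_max_sets)
```

In this reading, a pattern is a CSTS pattern exactly when it appears as a key of `rc_max`, that is, when its reverse set is non-empty.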
Definition 12
(The set of all PI-strong CSTS patterns) The set of all PI-strong CSTS patterns is defined as the set of all ST sequential patterns that are PI-strong (according to Definition 5) and are CSTS patterns (according to Definition 10).
The proposed CSTSMiner algorithm first discovers all PI-strong ST sequential patterns and then returns only those of them that are PI-strong CSTS patterns.
Theoretical properties of CSTS patterns
In this section, we derive theoretical properties of the introduced notion of (PI-strong) CSTS patterns. Specifically, we show that:

for the parameter \(\varepsilon = 0\) the set of all CSTS patterns is equivalent to the set of all closed ST sequential patterns (Lemma 1);

each CSTS pattern is a closed ST sequential pattern regardless of the value of the approximation margin \(\varepsilon\) (Lemma 2);

the set of CSTS patterns is a subset of the set of closed ST patterns for any value of the approximation margin parameter \(\varepsilon\) (Theorem 2);

one can verify whether an ST sequential pattern \(\overrightarrow{s}\) is PI-strong given the set of PI-strong CSTS patterns, and one can approximate the PI value of \(\overrightarrow{s}\) (Lemma 3);

one can approximate the value of PI of \(\overrightarrow{s}\) with an error no greater than \(\pm ~\varepsilon\) (Lemma 4).
Lemma 1 shows that given an ST sequential pattern \(\overrightarrow{s}\) and its CSTS pattern \(\overrightarrow{s_1}\) as well as for \(\varepsilon = 0\), \(\overrightarrow{s_1}\) is a closure of \(\overrightarrow{s}\) and \(\overrightarrow{s_1}\) is a closed ST sequential pattern.
Lemma 1
Let \(\overrightarrow{s}\) be an ST sequential pattern, \(\varepsilon = 0\) and let \(\overrightarrow{s_1}\) be a CSTS pattern of \(\overrightarrow{s}\). The CSTS pattern \(\overrightarrow{s_1}\) is a closure of \(\overrightarrow{s}\) and \(\overrightarrow{s_1}\) is a closed ST sequential pattern.
Proof
The proof of lemma follows from Definitions 7 and 10. By Definition 7, a closure of ST sequential pattern \(\overrightarrow{s}\) is such a closed ST sequential pattern \(\overrightarrow{s*}\) that is a supersequence of \(\overrightarrow{s}\) and whose \(PI(\overrightarrow{s*}) = PI(\overrightarrow{s})\). For \(\varepsilon = 0\) we have:

the PI value of \(\overrightarrow{s_1}\) being a CSTS pattern of \(\overrightarrow{s}\) equals PI value of \(\overrightarrow{s}\),

CSTS pattern \(\overrightarrow{s_1}\) is a maximal supersequence of \(\overrightarrow{s}\).
Thus, for \(\varepsilon = 0\) the CSTS pattern \(\overrightarrow{s_1}\) is a closed ST sequential pattern. \(\square\)
It follows from Lemma 1 that for the parameter \(\varepsilon = 0\) the set of all PI-strong CSTS patterns is equivalent to the set of all PI-strong closed ST sequential patterns (in other words, for \(\varepsilon = 0\) the proposed CSTSMiner algorithm returns the same set of patterns as the CSTSPMiner of [12]).
In Lemma 2 we show that each CSTS pattern is a closed ST sequential pattern regardless of the value of the approximation margin \(\varepsilon\).
Lemma 2
Each CSTS pattern is a closed ST sequential pattern regardless of the value of approximation margin \(\varepsilon\).
Proof
Let us assume that \(\overrightarrow{s_1}\) is a CSTS pattern of an ST sequential pattern \(\overrightarrow{s}\) given any value of \(\varepsilon\). Now, let us assume that there exists a proper maximal supersequence \(\overrightarrow{s_2}\) of \(\overrightarrow{s_1}\) whose participation index \(PI(\overrightarrow{s_2})\) equals the participation index \(PI(\overrightarrow{s_1})\), that is, \(\overrightarrow{s_2}\) is a closure of \(\overrightarrow{s_1}\). By Definition 7, \(\overrightarrow{s_2}\) has to be a closed ST sequential pattern. However, as follows from Definition 10, this would contradict the assumption that \(\overrightarrow{s_1}\) is a CSTS pattern. Hence, the CSTS pattern \(\overrightarrow{s_1}\) is a closed ST sequential pattern. \(\square\)
In Theorem 2, we show that the set of CSTS patterns is a subset of the set of closed ST patterns for any value of the approximation margin parameter \(\varepsilon\).
Theorem 2
For any \(\varepsilon \in [0,1]\), the set of all CSTS patterns is a subset of the set of all closed ST sequential patterns.
Proof
We already showed in Lemma 1 that for \(\varepsilon = 0\), the set of all CSTS patterns is equal to the set of all closed ST sequential patterns. We also presented in Lemma 2 that each CSTS pattern is a closed ST sequential pattern regardless of the value of \(\varepsilon\). Let us assume that \(\varepsilon > 0\). If there exists a closed ST sequential pattern \(\overrightarrow{s}\) that has a CSTS pattern and \(\overrightarrow{s}\) is not a CSTS pattern itself, then \(\overrightarrow{s}\) will not be included in the set of CSTS patterns (in such a case, the CSTS patterns set is a proper subset of the set of closed ST sequential patterns). Otherwise, if each closed ST sequential pattern is also a CSTS pattern, then the set of CSTS patterns is equal to the set of all closed ST sequential patterns. In either case, the CSTS patterns set is a subset of the closed ST sequential patterns set.
\(\square\)
Theorem 2 shows that the number of discovered CSTS patterns is always less than or equal to the number of closed ST sequential patterns.
Let us illustrate Lemmas 1 and 2 as well as Theorem 2 with example patterns from the dataset presented in Fig. 1:

For \(\varepsilon = 0.5\), pattern \(B \rightarrow B \rightarrow C \rightarrow E\ (PI = 0.375)\) is a CSTS pattern (as it is an \(\varepsilon\)-constricted maximal supersequence of, e.g., pattern \(C \rightarrow E\ (PI = 0.8)\)). At the same time, pattern \(B \rightarrow B \rightarrow C \rightarrow E\) is also a closed ST sequential pattern (as follows from Table 2).

For \(\varepsilon = 0\), the set of CSTS patterns is the same as the set of closed ST sequential patterns. Thus, each closed ST sequential pattern presented in Table 2 is also a CSTS pattern (e.g. pattern \(B \rightarrow D\ (PI = 1)\) is a closed ST sequential pattern and a CSTS pattern of the singleton patterns \(B\ (PI = 1)\) and \(D\ (PI = 1)\)).
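The closedness property used in Lemmas 1 and 2 can be checked mechanically once PI values are available. The sketch below is a minimal illustration that assumes a pattern is closed when no proper supersequence has the same PI, and that supersequences contain the pattern as a contiguous run. The helper names are hypothetical, and the PI table is only a partial excerpt of the values quoted earlier for Table 1.

```python
def contains_run(sup, sub):
    """True if `sub` occurs in `sup` as a contiguous run (assumed supersequence test)."""
    m = len(sub)
    return any(sup[i:i + m] == sub for i in range(len(sup) - m + 1))


def is_closed(s, pi, tol=1e-12):
    """A pattern is taken to be closed if no proper supersequence has the same PI."""
    return not any(
        len(t) > len(s) and contains_run(t, s) and abs(pi[t] - pi[s]) <= tol
        for t in pi
    )


# Partial excerpt of PI values quoted in the text.
pi = {
    ("A", "B", "B", "C", "E", "C"): 0.25,
    ("A", "B", "B", "C", "E"): 0.25,
    ("B", "B", "C", "E", "C"): 0.25,
    ("B", "B", "C", "E"): 0.375,
}
closed = {s for s in pi if is_closed(s, pi)}
```

With these values, \(B \rightarrow B \rightarrow C \rightarrow E\) is reported as closed, in line with the first example above, whereas \(A \rightarrow B \rightarrow B \rightarrow C \rightarrow E\) is not, because its longer supersequence has the same PI of 0.25.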
Lemma 3 shows how to verify whether an ST sequential pattern \(\overrightarrow{s}\) is PI-strong given the set of PI-strong CSTS patterns and how to approximate its PI value.
Lemma 3
For each ST sequential pattern the following hold:

(i)
An ST sequential pattern \(\overrightarrow{s}\) is PI-strong only if there exists a supersequence of \(\overrightarrow{s}\) in the set of PI-strong CSTS patterns.

(ii)
The participation index value of a PI-strong ST sequential pattern \(\overrightarrow{s}\) is equal to or less than \(PI(\overrightarrow{s_1}) + \varepsilon\), where \(\overrightarrow{s_1}\) is the minimal proper supersequence of \(\overrightarrow{s}\) in the set of PI-strong CSTS patterns whose \(PI(\overrightarrow{s_1})\) is the greatest.
Proof
 Ad(i):

Follows immediately from Theorem 1.
 Ad(ii):

Case 1. Let us first assume that there is only one supersequence \(\overrightarrow{s_1}\) of \(\overrightarrow{s}\) in the set of PI-strong CSTS patterns. In this case, \(\overrightarrow{s_1}\) is an \(\varepsilon\)-constricted maximal ST supersequence of \(\overrightarrow{s}\) and, by Definition 10, \(PI(\overrightarrow{s}) \in \Big [PI(\overrightarrow{s_1}), PI(\overrightarrow{s_1}) + \varepsilon \Big ]\). Case 2. Now let us assume that there is more than one proper supersequence of \(\overrightarrow{s}\) in the set of PI-strong CSTS patterns. Since we cannot indicate which one of them is the \(\varepsilon\)-constricted maximal ST supersequence, \(PI(\overrightarrow{s}) \in \Big [PI(\overrightarrow{s_1}), PI(\overrightarrow{s_1}) + \varepsilon \Big ]\), where \(\overrightarrow{s_1}\) is a minimal proper supersequence among all proper supersequences of \(\overrightarrow{s}\) in the set of PI-strong CSTS patterns.
\(\square\)
Lemma 3 indicates that the set of PI-strong CSTS patterns (unlike the set of all PI-strong closed ST sequential patterns) is a lossless representation of all PI-strong ST sequential patterns, but it is not informative (in the sense that the PI value of a PI-strong ST sequential pattern can be obtained from the set of PI-strong CSTS patterns only with a certain approximation). The question is how much the approximated PI value of a PI-strong ST sequential pattern differs from its exact PI value. In Lemma 4, we provide the maximal value of this approximation error.
Lemma 4
Given an ST sequential pattern \(\overrightarrow{s}\) that has a supersequence in the CSTS patterns set one can not:

underestimate the exact PI value of \(\overrightarrow{s}\) by less than \(PI(\overrightarrow{s}) - \varepsilon\), and

overestimate the exact PI value of \(\overrightarrow{s}\) by more than \(PI(\overrightarrow{s}) + \varepsilon\).
Proof
Let us assume that \(\overrightarrow{s_1}\) is any proper supersequence of \(\overrightarrow{s}\) in the set of CSTS patterns. We will consider two extreme cases:
Case 1. The value of \(PI(\overrightarrow{s_1})\) equals \(PI(\overrightarrow{s}) - \varepsilon\). In such a case, one cannot underestimate \(PI(\overrightarrow{s})\) by less than \(PI(\overrightarrow{s}) - \varepsilon\).
Case 2. The value of \(PI(\overrightarrow{s_1})\) equals \(PI(\overrightarrow{s})\). In such a case, one cannot overestimate \(PI(\overrightarrow{s})\) by more than \(PI(\overrightarrow{s}) + \varepsilon\).
Thus, the estimate of the \(PI(\overrightarrow{s})\) value of the pattern \(\overrightarrow{s}\) given the set of CSTS patterns is always in the range \(\Big [PI(\overrightarrow{s}) - \varepsilon , PI(\overrightarrow{s}) + \varepsilon \Big ]\). \(\square\)
Constricted ST sequential patterns miner
This section introduces our algorithm called CSTSMiner for discovering the set of all PI-strong CSTS patterns. In Table 3, we present the notation used in the algorithms of this section. The main CSTSMiner procedure is presented in Algorithm 1. The algorithm consists of two phases: (i) “top-down”: iterative generation of all PI-strong ST sequential patterns of length k from patterns of length \(k - 1\) until it is impossible to generate new patterns; (ii) “bottom-up”: calculation of PI-strong CSTS patterns. The “top-down” phase adapts the STBFM algorithm [11] for the discovery of PI-strong ST sequential patterns. Specifically, to efficiently generate new PI-strong patterns of length k from PI-strong patterns of length \(k-1\), we adapted the SPTree structure and extended it to the proposed MAXTree structure. Unlike the SPTree offered in [11], MAXTree is used not only to iteratively generate new patterns but also to identify all PI-strong CSTS patterns. The “bottom-up” phase calculates PI-strong CSTS patterns recursively, starting with the set of the longest PI-strong ST sequential patterns \(L_k\) obtained in the “top-down” phase.
“Top-down” phase of the CSTSMiner algorithm
The “top-down” phase of Algorithm 1 is conducted by executing steps 1–18 of this algorithm. In step 1 of Algorithm 1, a singular candidate ST sequential pattern is generated from each event type in \({\textbf{F}}\) and stored in the set \(L_1\) (patterns in \(L_1\) constitute the first level of MAXTree). By Definition 4, singular patterns are always PI-strong since their PI values are equal to 1.
Subsequently, the PI-strong ST sequential patterns of length 2 are generated and stored as \(L_2\). The generation of such PI-strong ST sequential patterns is conducted in steps 3–13 using two nested loops, each of which iterates over all patterns in \(L_1\). A new candidate pattern \(\overrightarrow{s}\) of length 2 is always generated by concatenating the two event types of the singular patterns \(\overrightarrow{s_i}\) and \(\overrightarrow{s_j}\). We will refer to the patterns \(\overrightarrow{s_i}\) and \(\overrightarrow{s_j}\) as the first and the second parent of \(\overrightarrow{s}\), respectively. Please note that \(\overrightarrow{s}\) can consist of the same event type repeated twice (in such a case, \(\overrightarrow{s_i} = \overrightarrow{s_j}\)). The set of instances supporting the second element of \(\overrightarrow{s}\) is calculated in step 6 and consists of event instances of type \(\overrightarrow{s}[2]\) in \({\textbf{D}}\) which belong to neighborhoods of event instances in the set \({\textbf{I}}(\overrightarrow{s_i}, 1)\).
The participation index of the generated candidate pattern \(\overrightarrow{s}\) is calculated and verified in steps 7 and 8 of Algorithm 1. If \(\overrightarrow{s}\) turns out to be PI-strong, then \(\overrightarrow{s}\) is inserted into the list of children of its first parent \(\text {children}(\overrightarrow{s_i})\) and into the set \(L_2\).
Steps 15 to 18 of Algorithm 1 consist of the iterative generation and verification of PI-strong ST sequential patterns of length greater than 2 using the function \(\textit{GenAndVerify}{}\) shown in Algorithm 2. Specifically, the \(\textit{GenAndVerify}{}\) function generates all PI-strong ST sequential patterns \(L_k\) from the PI-strong ST sequential patterns \(L_{k-1}\). To this end, the function uses the first parent, the second parent as well as the children list of the patterns in \(L_{k-1}\). As follows from Lemma 5 (presented in [12] as Lemma 5),^{Footnote 7} a candidate ST sequential pattern \(\overrightarrow{s}\) of length \(\ge 3\) can be obtained by concatenating all elements of its first parent with the last element of its second parent.
Lemma 5
[Construction of an ST sequential pattern from its first and second parent [12]] Let \(\overrightarrow{s}\) be an ST sequential pattern of length \(m \ge 3\) and \(\overrightarrow{s_1}\) be the first parent of \(\overrightarrow{s}\). Then, \(\overrightarrow{s} = \overrightarrow{s_1} \rightarrow F_m\), where \(F_m\) is the last element of \(\text {parent}_2(\overrightarrow{s})\), and \(\text {parent}_2(\overrightarrow{s}) = \text {parent}_2(\overrightarrow{s_1}) \rightarrow F_m\).
Algorithm 2 proceeds as follows. First, the set \(L_k\) is initialized. Subsequently, the loop in step 2 iterates over all patterns \(\overrightarrow{s_i} \in L_{k-1}\) (\(\overrightarrow{s_i}\) is the first parent of a new candidate pattern \(\overrightarrow{s}\)) and the loop in step 4 iterates over all children patterns \(\overrightarrow{s_j} \in \text {children}(\text {parent}_2(\overrightarrow{s_i}))\) of the second parent of \(\overrightarrow{s_i}\). Each such child pattern \(\overrightarrow{s_j}\) constitutes the second parent of a new candidate pattern \(\overrightarrow{s}\). Thus, the elements of \(\overrightarrow{s}\) are: \(\overrightarrow{s} {:}{=} \overrightarrow{s_i}[1] \rightarrow \overrightarrow{s_i}[2] \rightarrow\) \(\dots \rightarrow \overrightarrow{s_i}[k - 1] \rightarrow \overrightarrow{s_j}[k - 1]\).
After the elements of \(\overrightarrow{s}\) are obtained, the set of instances supporting its last element is computed according to step 6 of Algorithm 2 and the PI value of \(\overrightarrow{s}\) is calculated according to step 7 of Algorithm 2. If the PI value is not less than the participation index threshold \(PI_{min}\), then \(\overrightarrow{s}\) is inserted into \(L_{k}\) and appended to the children list of \(\overrightarrow{s_i}\): \(\text {children}(\overrightarrow{s_i})\). Otherwise, \(\overrightarrow{s}\) is discarded as not being a PI-strong pattern.
The generation of new candidate ST sequential patterns ends when the \(\textit{GenAndVerify}{}\) function cannot generate any new patterns.
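The level-wise generation described above can be sketched as follows. This is a simplified illustration, not Algorithms 1 and 2 themselves: the `Node` layout and the function `grow` are hypothetical, the threshold test is taken as \(PI \ge PI_{min}\), and the computation of supporting instances and of the participation index is replaced by an assumed callback `pi_of` backed by a toy lookup table.

```python
class Node:
    """One tree node (hypothetical layout): a pattern plus its two parents."""

    def __init__(self, seq, parent1=None, parent2=None):
        self.seq = seq            # tuple of event types
        self.parent1 = parent1    # pattern without its last element
        self.parent2 = parent2    # pattern without its first element
        self.children = []        # PI-strong extensions of this pattern


def grow(event_types, pi_of, pi_min):
    """Return the levels L_1, L_2, ... of PI-strong patterns."""
    level1 = [Node((f,)) for f in event_types]
    # Length 2: pair every singleton with every singleton (including itself).
    level2 = []
    for si in level1:
        for sj in level1:
            seq = si.seq + sj.seq
            if pi_of(seq) >= pi_min:
                node = Node(seq, si, sj)
                si.children.append(node)
                level2.append(node)
    levels = [level1, level2]
    # Length k >= 3: extend s_i with the last element of each child of parent_2(s_i).
    while levels[-1]:
        nxt = []
        for si in levels[-1]:
            for sj in si.parent2.children:
                seq = si.seq + (sj.seq[-1],)
                if pi_of(seq) >= pi_min:
                    node = Node(seq, si, sj)
                    si.children.append(node)
                    nxt.append(node)
        levels.append(nxt)
    return levels[:-1]  # drop the trailing empty level


# Toy PI values (hypothetical); every pattern not listed has PI 0.
toy_pi = {("A", "B"): 0.5, ("B", "C"): 0.5, ("A", "B", "C"): 0.4}
levels = grow("ABC", lambda seq: toy_pi.get(seq, 0.0), pi_min=0.25)
found = [[n.seq for n in level] for level in levels]
```

Only candidates reachable through the children of the second parent are ever tested, which is the pruning that the parent and children links provide.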
“Bottom-up” phase of the CSTSMiner algorithm
While the “top-down” phase generates all PI-strong ST sequential patterns, the “bottom-up” phase of Algorithm 1 starts in step 19 and is entirely dedicated to identifying those PI-strong ST sequential patterns that are PI-strong CSTS patterns. The verification of PI-strong ST sequential patterns as being CSTS patterns starts from the set \(L_k\) of the longest patterns and is iteratively continued for the subsequent sets of patterns \(L_{k-1}, L_{k-2}, \dots , L_2\).
The function \(\textit{VerifySupersequence}{}\) presented in Algorithm 3 obtains two ST sequential patterns \(\overrightarrow{s}\) and \(\overrightarrow{s_i}\) and verifies if \(\overrightarrow{s}\) is a CSTS pattern of \(\overrightarrow{s_i}\). To this end, function \(\textit{VerifySupersequence}{}\) applies Definition 10 and Theorem 1. Specifically, \(\textit{VerifySupersequence}{}\) conducts the following steps:

1
Checks if the PI value of \(\overrightarrow{s}\) is greater than or equal to the PI value of \(\overrightarrow{s_i}\) minus the approximation margin \(\varepsilon\) (in step 1). This corresponds to the first condition of Definition 10.

2
Checks if \(\overrightarrow{s_i}\) already belongs to the \(RC^{max}\) list of \(\overrightarrow{s}\) (in step 2). Due to construction of MAXTree, it can happen that the function \(\textit{VerifySupersequence}{}\) will be invoked multiple times for the same two sequences \(\overrightarrow{s}\) and \(\overrightarrow{s_i}\). Thus, the check prevents the situation when \(\overrightarrow{s_i}\) is added multiple times to the list \(RC^{max}(\overrightarrow{s})\).

3
Verifies whether the set \(C^{max}(\overrightarrow{s_i})\) is empty or the length of \(\overrightarrow{s}\) equals the length of the patterns in \(C^{max}(\overrightarrow{s_i})\). In such a case, two situations are possible:

Either \(C^{max}(\overrightarrow{s_i}) = \emptyset\) or the PI value of \(\overrightarrow{s}\) equals the PI values of the patterns in \(C^{max}(\overrightarrow{s_i})\). In either case, \(\overrightarrow{s}\) is inserted into \(C^{max}(\overrightarrow{s_i})\) and \(\overrightarrow{s_i}\) is inserted into \(\mathcal{R}\mathcal{C}^{max}(\overrightarrow{s})\).

Alternatively, if the PI value of \(\overrightarrow{s}\) is greater than the PI values of patterns in \(C^{max}(\overrightarrow{s_i})\), then \(\overrightarrow{s_i}\) is removed from the \(\mathcal{R}\mathcal{C}^{max}\) lists of all patterns in \(C^{max}(\overrightarrow{s_i})\) and \(C^{max}(\overrightarrow{s_i})\) is set to be empty. Next, \(\overrightarrow{s}\) is added to \(C^{max}(\overrightarrow{s_i})\) and \(\overrightarrow{s_i}\) is added to \(\mathcal{R}\mathcal{C}^{max}(\overrightarrow{s})\). These operations fulfill the second condition of Definition 10.

If the PI value of \(\overrightarrow{s}\) is greater than or equal to the PI value of \(\overrightarrow{s_i}\) minus the approximation margin \(\varepsilon\), then the \(\textit{VerifySupersequence}\) function is recursively invoked for \(\overrightarrow{s}\) and the parent patterns \(\text {parent}_1(\overrightarrow{s_i})\), \(\text {parent}_2(\overrightarrow{s_i})\) of \(\overrightarrow{s_i}\) to verify whether \(\overrightarrow{s}\) is also their CSTS pattern. Otherwise, as follows from Theorem 1, \(\overrightarrow{s}\) cannot be a CSTS pattern of any of the parent patterns \(\text {parent}_1(\overrightarrow{s_i})\), \(\text {parent}_2(\overrightarrow{s_i})\) of \(\overrightarrow{s_i}\). Thus, the invocations \(\textit{VerifySupersequence}(\overrightarrow{s}, \text {parent}_1(\overrightarrow{s_i}))\) and \(\textit{VerifySupersequence}(\overrightarrow{s}, \text {parent}_2(\overrightarrow{s_i}))\) are skipped.
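The checks listed above can be sketched as a recursive function. The code below is an assumption-laden illustration and not Algorithm 3 itself: the `Node` layout, the exact float comparisons, and in particular the driver `csts_patterns`, which seeds every call with the pattern itself so that a pattern can become its own CSTS pattern, are hypothetical. The toy PI values are likewise invented for the example.

```python
class Node:
    """Tree node (hypothetical layout) with the two bookkeeping lists."""

    def __init__(self, seq, pi, parent1=None, parent2=None):
        self.seq, self.pi = seq, pi
        self.parent1, self.parent2 = parent1, parent2
        self.c_max = []   # current candidates for C^max(self)
        self.rc_max = []  # patterns for which self is currently a CSTS pattern


def verify_supersequence(s, si, eps):
    # Step 1: first condition of Definition 10; if it fails, no ancestor can pass.
    if si is None or s.pi < si.pi - eps:
        return
    # Step 2: do the bookkeeping only if s_i is not yet registered for s.
    if si not in s.rc_max:
        # Step 3: only supersequences of the length already stored may compete.
        if not si.c_max or len(s.seq) == len(si.c_max[0].seq):
            if si.c_max and s.pi > si.c_max[0].pi:
                # s is strictly better: withdraw the previously stored patterns.
                for old in si.c_max:
                    old.rc_max.remove(si)
                si.c_max = []
            if not si.c_max or s.pi == si.c_max[0].pi:
                si.c_max.append(s)
                s.rc_max.append(si)
    # Continue with both parents of s_i.
    verify_supersequence(s, si.parent1, eps)
    verify_supersequence(s, si.parent2, eps)


def csts_patterns(levels, eps):
    """Assumed driver: process levels from longest to shortest patterns."""
    for level in levels:
        for s in level:
            verify_supersequence(s, s, eps)
    return {s.seq for level in levels for s in level if s.rc_max}


def build_toy_tree():
    """Toy tree with hypothetical PI values; returns levels L_3, L_2."""
    a, b, c = (Node((x,), 1.0) for x in "ABC")
    ab = Node(("A", "B"), 0.5, a, b)
    bc = Node(("B", "C"), 0.5, b, c)
    abc = Node(("A", "B", "C"), 0.4, ab, bc)
    return [[abc], [ab, bc]]


wide = csts_patterns(build_toy_tree(), eps=0.25)
narrow = csts_patterns(build_toy_tree(), eps=0.05)
```

In this toy tree, the larger margin lets the longest pattern absorb both of its parents, whereas the smaller margin keeps all three patterns; this mirrors how a larger \(\varepsilon\) yields a more concise result.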
The complete MAXTree created for the dataset shown in Fig. 1 is presented in Fig. 2. The tree is created by Algorithm 1 using the following input parameters: spatial threshold value \(R = 10\), time window threshold value \(T = 20\), participation index threshold value \(PI_{min} = 0.25\), approximation margin value \(\varepsilon = 0.25\). All patterns in the tree are PI-strong ST sequential patterns. However, only the blue boxes represent those of them that are also PI-strong CSTS patterns.
For the pattern \(\overrightarrow{s} = B \rightarrow B \rightarrow C\), the set of its CSTS patterns is \({\mathcal {C}}^{max}(\overrightarrow{s}) = \{A \rightarrow B \rightarrow B \rightarrow C \rightarrow E \rightarrow C\}\), while the minimal proper supersequence of \(\overrightarrow{s}\) in the set of PI-strong CSTS patterns returned by Algorithm 1 is \(\overrightarrow{s_1} = B \rightarrow B \rightarrow C \rightarrow E\). Let us consider how one can approximate the participation index value of the pattern \(\overrightarrow{s} = B \rightarrow B \rightarrow C\) given the set of all PI-strong CSTS patterns presented in Fig. 2. Since the set of PI-strong CSTS patterns contains more than one supersequence of \(\overrightarrow{s}\), Lemma 3 gives \(PI(\overrightarrow{s}) \le PI(\overrightarrow{s_1}) + \varepsilon = 0.375 + 0.25 = 0.625\). In fact, the exact PI value of \(\overrightarrow{s}\) is \(PI(\overrightarrow{s}) = 0.5\).
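The approximation of Lemma 3 can be reproduced with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, supersequences are assumed to contain the pattern as a contiguous run, and the dictionary `csts` holds just the two PI-strong CSTS patterns named in the example rather than the full content of Fig. 2.

```python
def contains_run(sup, sub):
    """True if `sub` occurs in `sup` as a contiguous run (assumed supersequence test)."""
    m = len(sub)
    return any(sup[i:i + m] == sub for i in range(len(sup) - m + 1))


def pi_bounds(s, csts, eps):
    """Return (lower, upper) bounds on PI(s), or None if s is not PI-strong.

    csts -- dict mapping each PI-strong CSTS pattern to its PI value
    """
    if s in csts:
        return csts[s], csts[s]  # exact value is stored
    supers = [t for t in csts if len(t) > len(s) and contains_run(t, s)]
    if not supers:
        return None  # Lemma 3 (i): no supersequence, so s is not PI-strong
    # Lemma 3 (ii): minimal proper supersequence with the greatest PI.
    shortest = min(len(t) for t in supers)
    pi1 = max(csts[t] for t in supers if len(t) == shortest)
    return pi1, pi1 + eps


csts = {
    ("B", "B", "C", "E"): 0.375,
    ("A", "B", "B", "C", "E", "C"): 0.25,
}
bounds = pi_bounds(("B", "B", "C"), csts, eps=0.25)
```

For the pattern \(B \rightarrow B \rightarrow C\) the returned interval is \([0.375, 0.625]\), which contains the exact value 0.5 reported above.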
CSTSMiner complexity analysis
Let us start with the analysis of the computational cost of the “top-down” phase of Algorithm 1. Let \(|{\textbf{F}}|\) and \(|{\textbf{D}}|\) denote the number of event types and event instances in the dataset, respectively. Furthermore, let \(A_{chl}\) be the average number of children of a node in the MAXTree and \(A_{I}\) be the average number of instances supporting an element of a sequential pattern. The plane sweep algorithm of [34] applied in this study requires on average \(A_{I} \cdot \log {|{\textbf{D}}|}\) operations to obtain the neighborhoods of the event instances \(I_{\overrightarrow{s}[i]}\) supporting the ith element of a sequential pattern, assuming that the event instances in \({\textbf{D}}\) are initially sorted in nondecreasing order of their occurrence times. Thus, the average number of computations needed to obtain the children patterns of a sequential pattern in MAXTree equals \(A_{chl} \cdot A_{I} \cdot \log {|{\textbf{D}}|}\). Let \(L_k\) denote the average number of PI-strong ST sequential patterns of length \(k = 2, 3, \dots\). The number of computations needed to generate the patterns of MAXTree of length \(3, 4, \dots\) equals \(L_{k} \cdot A_{chl} \cdot A_{I} \cdot \log {|{\textbf{D}}|}\). Additionally, let \(T - 1\) denote the number of levels of MAXTree excluding level 1, which contains all event types in \({\textbf{F}}\). Since Algorithm 1 generates all patterns in \(L_2\) in two nested loops iterating over the patterns in \(L_1\), which consist of the singleton event types in \({\textbf{F}}\), we can approximate the number of computations of the “top-down” phase as
Let us now analyse the computational complexity of the “bottom-up” phase of Algorithm 1. CSTSMiner calls Algorithm 3 for both parents of each pattern of levels \(k = 2, 3, \dots\). Given the height of MAXTree equal to T, the computational cost of the “bottom-up” phase depends strictly on the specified value of the \(\varepsilon\) parameter. Let us assume that for a generated MAXTree, the participation indexes change from 1 for the singleton patterns in \(L_1\) to \(PI_{min}\) for the patterns in \(L_{T}\). Given that each pattern has two parent patterns, we can assume that the number of ascendant patterns to be verified for a given ST sequential pattern equals: \(2^{(1 - PI_{min}) \cdot \varepsilon \cdot T}\). Thus, the computational cost of the “bottom-up” phase equals:
\({\mathcal {C}}^{max}(\overrightarrow{s})\) is the average number of CSTS patterns of sequence \(\overrightarrow{s}\). In the pessimistic case, Algorithm 3 has to scan this list whenever a CSTS pattern is detected.
To analyse the memory complexity of Algorithm 1, let us note that in order to restore the sequence of elements constituting a pattern, at each level \(k = 1, 2, \dots\) of the tree it is enough to store only the last element of the sequence of event types constituting the pattern. Thus, the memory required to store the elements of the discovered PI-strong ST sequential patterns along with their participation indexes equals \(T \cdot L_k\). Also, in order to generate new candidate patterns it is enough to store in the computer’s memory only the event instances of the last elements of patterns of two subsequent lengths \(k-1\) and k. Hence, the memory complexity of the “top-down” phase of Algorithm 1 equals:
In order to assess the memory complexity of the “bottom-up” phase, let us assume that \({\mathcal {C}}^{max}(\overrightarrow{s})\) and \(\mathcal{R}\mathcal{C}^{max}(\overrightarrow{s})\) are the average number of CSTS patterns of \(\overrightarrow{s}\) and the average number of patterns for which \(\overrightarrow{s}\) is a CSTS pattern, respectively. For each pattern \(\overrightarrow{s}\) of MAXTree, Algorithm 3 maintains the list \({\mathcal {C}}^{max}(\overrightarrow{s})\). However, the list \(\mathcal{R}\mathcal{C}^{max}(\overrightarrow{s})\) needs to be maintained for all patterns except the singleton patterns. Thus, we can assess the memory complexity of the second phase of Algorithm 1 as:
Experiments
In this section, we first review the datasets used for the experiments and describe our experimental setup. Then we provide the results of the comparison of the proposed CSTSMiner algorithm with the STSMiner [9], CSTSPMiner [12], STBFM [11] and CSTPM [10] algorithms.
Selected datasets
For the experiments with the proposed CSTSMiner algorithm we selected two publicly available datasets, each of which consists of crime event incidents.
Pittsburgh police incident blotter dataset
The first dataset selected for the experiments is the Pittsburgh Police Incident Blotter Dataset, which contains crime incidents collected by the Police Department of the City of Pittsburgh over the period 31.12.1989–31.12.2019 [35]. The dataset was validated according to the Uniform Crime Reporting (UCR) standards [36] and consists of attributes such as incident time, incident location (defined by street name and number), incident neighborhood, incident type, description of offense, and incident longitude and latitude. For the purposes of the experiments we selected the attributes that are directly used by the proposed CSTSMiner algorithm: incident type, incident time, longitude and latitude. In the experiments, we use only crime incidents reported between 01.01.2017 and 31.12.2019. The number of crime incidents reported over this time period is 122,895. However, approximately 40% of them contain missing values for one of the selected attributes: incident type, geographical location or incident time. Thus, we decided to remove such incidents. The resultant dataset contains 72,867 crime event incidents of 236 unique incident types. All of these 236 incident types are selected as the event type set F. The most frequent incident types include theft from auto, simple assault, public drunkenness, criminal mischief and harassment. The attribute incident time (which specifies the incident occurrence time as an exact date and a timestamp given in hours, minutes and seconds) is transformed into the number of minutes that passed since the timestamp 01.01.2017 00:00. Thus, the time window parameter T used by the CSTSMiner algorithm is specified in minutes. In Table 4, we present the characteristics of the resultant dataset.
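The preprocessing described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the record field names (`type`, `time`, `lat`, `lon`) are hypothetical, and truncating the elapsed time to whole minutes is an assumption.

```python
from datetime import datetime

ORIGIN = datetime(2017, 1, 1, 0, 0)   # reference timestamp 01.01.2017 00:00
END = datetime(2020, 1, 1, 0, 0)      # first instant after 31.12.2019


def minutes_since_origin(ts):
    """Whole minutes elapsed between ORIGIN and `ts` (truncation is assumed)."""
    return int((ts - ORIGIN).total_seconds() // 60)


def clean(records):
    """Drop incomplete or out-of-range incidents and convert times to minutes."""
    kept = []
    for r in records:
        fields = (r.get("type"), r.get("time"), r.get("lat"), r.get("lon"))
        if any(v is None for v in fields):
            continue  # a selected attribute is missing
        if not (ORIGIN <= r["time"] < END):
            continue  # outside the period used in the experiments
        kept.append((r["type"], minutes_since_origin(r["time"]), r["lat"], r["lon"]))
    return kept


rows = [
    {"type": "ARSON", "time": datetime(2017, 1, 2, 0, 30), "lat": 40.44, "lon": -79.99},
    {"type": "THEFT", "time": datetime(2018, 5, 1, 12, 0), "lat": None, "lon": -79.99},
    {"type": "THEFT", "time": datetime(2016, 12, 31, 23, 0), "lat": 40.40, "lon": -79.90},
]
cleaned = clean(rows)
```

With times expressed in minutes, a time window such as 8 days corresponds to the value 11,520.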
In Fig. 3, we present the location of the first two thousand crime incidents of ten most frequent incident types from the resultant Pittsburgh Police Incident Blotter Dataset.
Boston crime incident reports dataset
The second of the selected datasets is the Boston Crime Incidents Reports Dataset provided by the Boston Police Department [37]. The dataset was collected over the period 08.07.2012–10.08.2015. However, in the experiments, we extracted only crime incidents that occurred between 01.01.2014 and 31.12.2014. The Boston Crime Incidents Reports Dataset contains several attributes, such as incident location, incident time, incident type, used weapon type (for example, unarmed or firearm), shooting presence, police shift, occurrence district and occurrence area. Similarly to the Pittsburgh Police Incident Blotter Dataset, for the experiments we selected only the attributes incident location (given by longitude and latitude), incident time and incident type. Since the attribute incident type contains not only crime incident types but also other incident types, such as medical assist or property found, we preprocessed the dataset to obtain the incidents of all 26 crime event types present in the dataset. Examples of the event types are: aggravated assault, arson, auto theft, drug charges.
Additionally, to better analyze the results that can be obtained with CSTSMiner, we selected incidents of only the ten least frequent crime types in the complete dataset to create the reduced dataset. These crime types are violation of liquor laws, operating under influence, manslaughter, homicide, harassment, gambling offense, embezzlement, crimes against children, bomb, arson.
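The selection of the least frequent crime types can be sketched as below. The sketch assumes that incidents are tuples whose first field is the incident type; the function names are hypothetical, and ties between equally frequent types are resolved arbitrarily.

```python
from collections import Counter


def least_frequent_types(incidents, n=10):
    """Return the n event types with the fewest incidents."""
    counts = Counter(inc[0] for inc in incidents)
    # most_common() sorts by descending count, so the tail holds the rarest types.
    return {t for t, _ in counts.most_common()[-n:]}


def reduce_dataset(incidents, n=10):
    """Keep only the incidents of the n least frequent event types."""
    keep = least_frequent_types(incidents, n)
    return [inc for inc in incidents if inc[0] in keep]


# Toy incidents: (type, time in minutes); values are illustrative only.
toy = [("auto theft", t) for t in range(5)] + [("arson", 7)] + [("bomb", 1), ("bomb", 9)]
reduced = reduce_dataset(toy, n=2)
```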
In Table 5, we present the characteristics of the complete dataset, while the characteristics of the reduced dataset are shown in Table 6. Since there are only 26 crime event types in the complete dataset, we decided to present the number of incidents of each type in the histogram shown in Fig. 4.
Experimental setup
In our experiments, the spatial distance threshold R of a neighborhood of an event instance e is specified in meters. However, the locations of event instances in the obtained datasets are specified using longitude and latitude coordinates. Thus, we apply the following procedure to transform the distance between two event instances \(e_1, e_2 \in {\textbf{D}}\) into meters. First, each coordinate (either longitude or latitude) of an instance e is converted into radians according to Eq. (3), in which e.lat refers to the latitude coordinate and e.lon refers to the longitude coordinate of instance e. Next, the distance in meters between two instances is obtained according to Eq. (4). In Eq. (4), \(\text {earthRadius}\) denotes the radius of Earth in kilometers.
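A distance computation of this kind can be sketched as follows. Since the exact form of Eqs. (3) and (4) is not reproduced here, the sketch assumes the standard haversine great-circle formula and a mean Earth radius of 6371 km, converted to meters at the end; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in kilometers


def to_radians(degrees):
    """Convert a latitude or longitude given in degrees into radians."""
    return degrees * math.pi / 180.0


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points (haversine formula)."""
    phi1, phi2 = to_radians(lat1), to_radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = to_radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    # Clamp against rounding so that asin never receives a value above 1.
    central_angle = 2 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_KM * 1000.0 * central_angle


def in_spatial_neighborhood(e1, e2, r_meters):
    """True if two (lat, lon) locations are at most r_meters apart."""
    return distance_m(e1[0], e1[1], e2[0], e2[1]) <= r_meters
```

A spatial neighborhood test then reduces to comparing `distance_m` with the threshold R, for example R = 350 for the Pittsburgh dataset.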
The implementations of all algorithms selected for the experiments are written in C++. We ran the experiments on a computer equipped with an Apple M1 processor and 16 GB of RAM. Our implementations of the algorithms (STSMiner, STBFM, CSTPM, CSTSPMiner, CSTSMiner) are available at the GitHub repository.^{Footnote 8}
Results of the experiments
In the experiments, we compare our proposed CSTSMiner with four other algorithms. Three of them discover PI-strong ST sequential patterns: STSMiner applying the participation index measure [9], CSTPM [10] and STBFM [11]. The fourth, CSTSPMiner [12], discovers PI-strong closed ST sequential patterns.
We aim to measure the number of patterns discovered by each algorithm and its computation time for each of the selected datasets. The obtained results for each dataset are presented in Tables 7, 8 and 9. The results presented in these tables were obtained for the following input parameters of the five compared algorithms:

For the Pittsburgh Police Incident Blotter Dataset: R = 350 m, T = 11,520 min (8 days), \(PI_{min}\) = \(\{0.33, 0.32, \dots , 0.25\}\), \(\varepsilon\) = \(\{0.025, 0.05, 0.075\}\).

For the complete Boston Crime Incident Reports Dataset: R = 300 m, T = 5760 min (4 days), \(PI_{min}\) = \(\{0.055, 0.05, \dots , 0.015\}\), \(\varepsilon\) = \(\{0.05, 0.1, 0.15\}\).

For the reduced Boston Crime Incident Reports Dataset: R = 500 m, T = 43,200 min (30 days), \(PI_{min} = \{0.01, 0.0095, \dots , 0.005\}\), \(\varepsilon = \{0.05, 0.1, 0.15\}\).
Table 7 presents the results for the Pittsburgh Police Incident Blotter Dataset. As can be seen from the table, for all three selected values of its approximation margin \(\varepsilon\), CSTSMiner discovers far fewer patterns than the other four algorithms. For example, for the participation index threshold \(PI_{min} = 0.26\), there are 143,666 PI-strong ST sequential patterns and 100,235 PI-strong closed ST sequential patterns, while the number of PI-strong CSTS patterns discovered for the approximation margin \(\varepsilon = 0.1\) is only 58,519. Table 7 also shows that increasing the approximation margin \(\varepsilon\) can significantly increase the computation time of CSTSMiner. For example, for \(PI_{min} = 0.25\), STBFM and CSTSPMiner both executed in 207 s, while CSTSMiner executed in 253 s for \(\varepsilon = 0.025\) and in 1744 s for \(\varepsilon = 0.075\).
Slightly different results are presented in Table 8 for the complete Boston Crime Incident Reports Dataset. For this dataset, it was possible to obtain a reduction in the number of discovered patterns similar to that obtained for the Pittsburgh Police Incident Blotter Dataset. For example, for the participation index threshold \(PI_{min} = 0.015\), the numbers of PI-strong ST sequential patterns and PI-strong closed ST sequential patterns are 2,819,490 and 2,040,303, respectively. For the same value of \(PI_{min}\) and \(\varepsilon = 0.15\), CSTSMiner provided 1,171,955 PI-strong CSTS patterns. Thus, the number of patterns is reduced by 58% compared to the STBFM algorithm and by 43% compared to the CSTSPMiner algorithm.
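The reduction percentages quoted above follow directly from the reported pattern counts. The short sketch below reproduces the arithmetic; the helper name `reduction_percent` is hypothetical, while the three counts are the ones reported for \(PI_{min} = 0.015\) and \(\varepsilon = 0.15\).

```python
def reduction_percent(baseline, reduced):
    """Percentage by which `reduced` is smaller than `baseline`."""
    return 100.0 * (baseline - reduced) / baseline


# Pattern counts reported for the complete Boston dataset (PI_min = 0.015, eps = 0.15).
st_patterns = 2_819_490      # PI-strong ST sequential patterns
closed_patterns = 2_040_303  # PI-strong closed ST sequential patterns
csts_patterns = 1_171_955    # PI-strong CSTS patterns

vs_st = reduction_percent(st_patterns, csts_patterns)
vs_closed = reduction_percent(closed_patterns, csts_patterns)

print(round(vs_st))      # 58
print(round(vs_closed))  # 43
```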
Finally, in Table 9 we present the results obtained for the reduced Boston Crime Incident Reports Dataset. The reduction in the number of discovered patterns is even more pronounced for this dataset. For the participation index threshold \(PI_{min} = 0.005\) and approximation margin \(\varepsilon = 0.15\), CSTSMiner provided 65,899 PI-strong CSTS patterns. For the same \(PI_{min}\) value, the numbers of PI-strong ST sequential patterns and PI-strong closed ST sequential patterns discovered are 228,285 and 76,894, respectively. Thus, the reduction in the number of patterns discovered by CSTSMiner is 71% when compared to STSMiner, CSTPM and STBFM, and 24% when compared to CSTSPMiner. However, it is worth noting that for the reduced dataset the computation time of CSTSMiner can be significantly higher than that of STBFM and CSTSPMiner. Also, for the reduced Boston Crime Incident Reports Dataset, CSTPM takes much longer to execute than the other algorithms.
In Fig. 5, we plot the number of discovered PI-strong CSTS patterns as a percentage of the number of discovered PI-strong ST sequential patterns. As can be seen from the plots, this percentage ranges between 15% and 90%. The minimal values (around 15%) were obtained for the reduced Boston Crime Incident Reports Dataset when the spatial threshold \(R = 600\) meters and the temporal window \(T = 28800\) minutes were applied and the \(PI_{min}\) threshold was in the range \(0.29\)–\(0.28\). For both the Pittsburgh Police Incident Blotter and the complete Boston Crime Incident Reports datasets, for the smaller values of the \(PI_{min}\) threshold (\(< 0.3\)) and the approximation margin \(\varepsilon\) equal to 0.2, the obtained percentage is usually below 50%. Interestingly, a smaller value of the \(PI_{min}\) threshold did not always result in a more significant reduction of the number of discovered patterns. For example, for the Pittsburgh Police Incident Blotter dataset and parameters \(R = 350\) meters, \(T = 11,520\) minutes and approximation margin \(\varepsilon\) equal to 0.2 or 0.1, the smallest percentage was obtained for \(PI_{min} = 0.28\).
In Fig. 6, we plot the number of PI-strong CSTS patterns as a percentage of the number of PI-strong closed ST sequential patterns. As follows from Lemma 1, for \(\varepsilon = 0\) the set of PI-strong CSTS patterns is equal to the set of PI-strong closed ST sequential patterns. However, for greater values of \(\varepsilon\) (for example, 0.1 or 0.2), CSTSMiner is capable of providing as few as 50% of the number of PI-strong closed ST sequential patterns discovered by CSTSPMiner. Importantly, even for smaller values of the \(\varepsilon\) parameter (for example, 0.01), CSTSMiner provided as few as 70% of the number of patterns provided by CSTSPMiner (as shown, for example, in the plots for the complete Boston Crime Incident Reports Dataset presented in Fig. 6).
To summarize, the plots presented in Figs. 5 and 6 show that the CSTSMiner algorithm can discover significantly fewer PI-strong CSTS patterns than there are PI-strong ST sequential patterns and PI-strong closed ST sequential patterns, even when small values of the \(\varepsilon\) parameter are applied.
In our next experiment, we aimed to compare the computation times of steps 1–18 (the “top-down” phase) and steps 19–25 (the “bottom-up” phase) of Algorithm 1. As previously noted, the “top-down” phase is responsible for generating all PI-strong ST sequential patterns, while the “bottom-up” phase finds those patterns which are also PI-strong CSTS patterns.
In Fig. 7, we present the comparison of computation times (using the logarithmic scale) of both phases obtained for the Pittsburgh Police Incident Blotter Dataset. Please note that for the smaller values of the approximation margin \(\varepsilon\) (such as \(\varepsilon = 0.025\)), the computation times of the “top-down” phase are more significant than those of the “bottom-up” phase. However, with increasing values of \(\varepsilon\) and for smaller values of \(PI_{min}\), the computation time of the “bottom-up” phase can increase significantly. For example, for \(\varepsilon\) equal to 0.1 and \(PI_{min}\) equal to 0.25, the computation time of the “bottom-up” phase of Algorithm 1 can be up to four times longer than the computation time of the “top-down” phase for the same values of these parameters.
The results presented in Fig. 7 are in line with the computational and space complexity analysis presented in Sect. Theoretical properties of CSTS patterns. In particular, for increasing values of \(\varepsilon\) the computational cost of the “bottom-up” phase of Algorithm 1 can be much higher than the computational cost of the “top-down” phase. This results from the fact that the “bottom-up” time complexity, \(O\big ((T - 1)\cdot L_k \cdot 2^{(1 - PI_{min}) \cdot \varepsilon \cdot T} \cdot {\mathcal {C}}^{max}(\overrightarrow{s})\big )\), contains an exponential function with base 2 and an exponent assessed by us as \((1 - PI_{min}) \cdot \varepsilon \cdot T\).
Representative patterns selection
In this subsection, we provide some interesting examples of discovered sequential patterns of different types of crimes. To this end, we ran CSTSMiner on the Pittsburgh Police Incident Blotter Dataset with the following parameters: \(R = 300\) m, \(T = 11,520\) minutes (8 days), \(\varepsilon = 0.05\), \(PI_{min} = 0.3\).
The examples of interesting resultant patterns include:

1. public_drunkenness \(\rightarrow\) robbery/ bank/ knife (PI = 0.5).

2. public_drunkenness \(\rightarrow\) robbery/ bank/ strongarm (PI = 0.44).

3. simple_assault/ injury \(\rightarrow\) public_drunkenness \(\rightarrow\) public_drunkenness \(\rightarrow\) all_other_offenses (expt_traff) \(\rightarrow\) fail_ disord_per_to_disperse (PI = 0.30).

4. simple_assault/ injury \(\rightarrow\) public_drunkenness \(\rightarrow\) robbery/ bank/ _strongarm (PI = 0.33).

5. robbery/ highway/ gun \(\rightarrow\) sale/ use_of_air_rifles (PI = 0.5).
Patterns 1 and 2 could provide essential information about types of bank robberies. Pattern 1 states that half of the bank robberies using a knife were committed within 300 m of reported public drunkenness incidents and up to 8 days after they occurred. Similarly, pattern 2 communicates that 44% of all strongarm bank robberies occurred within 300 m of reported public drunkenness incidents and up to 8 days after they occurred. Another interesting example is pattern 5, which states that half of the incidents of sale or use of air rifles occurred within 300 m of highway robbery incidents involving a gun and up to 8 days after they happened.
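The reading of a two-element pattern given above can be illustrated with a small sketch that computes, for a pattern A \(\rightarrow\) B, the share of B instances that have at least one A instance within R meters and at most T minutes before them. This is a simplified illustration of the interpretation only, not the participation index definition used by CSTSMiner. All names (`Event`, `follows`, `follower_share`) are hypothetical, locations are assumed to be planar coordinates in meters, and the toy events are made up for the example.

```python
import math
from typing import NamedTuple


class Event(NamedTuple):
    etype: str  # event type, e.g. "public_drunkenness"
    x: float    # planar x coordinate in meters (assumption of this sketch)
    y: float    # planar y coordinate in meters
    t: float    # occurrence time in minutes


def follows(a, b, r, t_window):
    """True if b occurs after a, at most t_window minutes later and within r meters."""
    dt = b.t - a.t
    return 0 < dt <= t_window and math.hypot(b.x - a.x, b.y - a.y) <= r


def follower_share(events, first, second, r, t_window):
    """Share of `second` instances preceded by some `first` instance within (r, t_window)."""
    firsts = [e for e in events if e.etype == first]
    seconds = [e for e in events if e.etype == second]
    if not seconds:
        return 0.0
    hits = sum(1 for b in seconds if any(follows(a, b, r, t_window) for a in firsts))
    return hits / len(seconds)


# Toy data (made up): one drunkenness incident followed by two robberies,
# only one of which lies within 300 m.
toy = [
    Event("public_drunkenness", 0.0, 0.0, 0.0),
    Event("robbery", 100.0, 0.0, 1000.0),
    Event("robbery", 1000.0, 0.0, 2000.0),
]
share = follower_share(toy, "public_drunkenness", "robbery", r=300.0, t_window=11520.0)
print(share)  # 0.5
```

In this toy example, half of the robberies are preceded by a nearby public drunkenness incident, which mirrors how a value such as PI = 0.5 is read in the examples above.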
Conclusion
In this article, we proposed a new type of ST sequential patterns called \(\varepsilon\)-constricted ST sequential patterns (CSTS patterns) and thoroughly analyzed their theoretical properties. Specifically, we showed that the set of CSTS patterns is a subset of the set of closed ST sequential patterns, that is, each CSTS pattern is also a closed ST sequential pattern. Moreover, we showed that given the set of PI-strong CSTS patterns one can obtain the set of all PI-strong ST sequential patterns and approximate the participation index of each of them with the approximation margin \(\pm ~\varepsilon\). We also proposed a new algorithm called CSTSMiner that discovers all PI-strong CSTS patterns. CSTSMiner adapts the MAXTree structure for more efficient generation of candidate patterns. The proposed MAXTree is generated in the two main phases of CSTSMiner: the first one, called “top-down”, in which all PI-strong ST sequential patterns are generated using the breadth-first strategy, and the second one, called “bottom-up”, which identifies those PI-strong ST sequential patterns that are CSTS patterns. We analyzed the properties and computation times of CSTSMiner.
The experiments with the CSTSMiner algorithm were conducted using two crime-related datasets for the cities of Pittsburgh and Boston: the Pittsburgh Police Incident Blotter Dataset and the Boston Crime Incident Reports Dataset. Each of the selected datasets consists of various types of crime and numerous event instances. To better verify the capabilities of the proposed algorithm, we also extracted a reduced dataset from the complete Boston Crime Incident Reports Dataset. The resultant reduced dataset contains the 10 least frequent crime event types and 896 event instances.
During the experimental evaluation, we compared the results obtained with the proposed CSTSMiner algorithm to four other state-of-the-art algorithms: STSMiner [9], CSTPM [10], STBFM [11] and CSTSPMiner [12]. The STSMiner, CSTPM and STBFM algorithms discover PI-strong ST sequential patterns, whereas CSTSPMiner discovers PI-strong closed ST sequential patterns. Each of the selected algorithms, as well as the proposed algorithm, uses the participation index to measure the significance of the discovered patterns. As the experiments showed, for the Pittsburgh Police Incident Blotter Dataset and for both the complete and the reduced Boston Crime Incident Reports Dataset, CSTSMiner can return far fewer patterns than the other selected algorithms. In particular, for the Pittsburgh Police Incident Blotter Dataset, CSTSMiner provides up to 60% fewer patterns than STBFM and up to 50% fewer patterns than CSTSPMiner. Similarly, for the complete Boston Crime Incident Reports Dataset, CSTSMiner provides up to 60% fewer patterns than STBFM and up to 40% fewer patterns than CSTSPMiner. For the reduced Boston Crime Incident Reports Dataset, CSTSMiner provides up to 85% fewer patterns than STBFM and up to 50% fewer patterns than CSTSPMiner.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on a reasonable request.
Notes
Please note that the spatial location of each event instance presented in Fig. 1 is for simplicity specified using only one dimension. However, in real datasets, the spatial location of event instances is usually defined by coordinates of two dimensions (for example, in the datasets selected for the experiments, spatial location is defined using longitude and latitude coordinates).
An example of such an ST sequential pattern to be discovered for the dataset in Fig. 1 is \(A \rightarrow B \rightarrow C \rightarrow E \rightarrow C\) (or Vandalism \(\rightarrow\) Robbery \(\rightarrow\) Simple assault \(\rightarrow\) Aggravated assault \(\rightarrow\) Simple assault).
We refer the reader to [12] for the proof of the theorem.
As we show in Sect. 5, for approximation margin \(\varepsilon = 0\), the proposed notion of a constricted ST sequential pattern is equivalent to the notion of a closed ST sequential pattern.
In fact, as we present in Sect. 6, each ST sequential pattern of the maximal length patterns set (say \(L_k\)) is its own \(\varepsilon\)-constricted maximal supersequence.
References
Han J, Pei J, Kamber M. Data mining: concepts and techniques. Elsevier; 2011.
Zaki MJ, Meira W Jr. Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press; 2014.
Hu Z, Wang L, Tran V, Chen H. Efficiently mining spatial co-location patterns utilizing fuzzy grid cliques. Inf Sci. 2022;592:361–88.
Buczak AL, Gifford CM. Fuzzy Association Rule Mining for Community Crime Pattern Discovery. In: ACM SIGKDD Workshop on Intelligence and Security Informatics. ISI-KDD ’10. New York, NY, USA: Association for Computing Machinery; 2010. Available from: https://doi.org/10.1145/1938606.1938608.
Yu CH, Ding W, Morabito M, Chen P. Hierarchical spatiotemporal pattern discovery and predictive modeling. IEEE Trans Knowl Data Eng. 2016;28(4):979–93.
He J, Zheng H. Prediction of crime rate in urban neighborhoods based on machine learning. Eng Appl Artif Intell. 2021;106: 104460.
Wu J, Abrar SM, Awasthi N, Frias-Martinez E, Frias-Martinez V. Enhancing short-term crime prediction with human mobility flows and deep learning architectures. EPJ Data Sci. 2022;11(1):53.
Dao THD, Thill JC. CrimeScape: analysis of socio-spatial associations of urban residential motor vehicle theft. Soc Sci Res. 2022;101: 102618.
Huang Y, Zhang L, Zhang P. A framework for mining sequential patterns from spatiotemporal event data sets. IEEE Trans Knowl Data Eng. 2008;20(4):433–48.
Mohan P, Shekhar S, Shine JA, Rogers JP. Cascading spatiotemporal pattern discovery. IEEE Trans Knowl Data Eng. 2012;24(11):1977–92.
Maciąg PS, Bembenik R. A Novel Breadth-first Strategy Algorithm for Discovering Sequential Patterns from Spatio-temporal Data. In: Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods—Volume 1: ICPRAM, INSTICC. SciTePress; 2019. p. 459–466.
Maciąg PS, Kryszkiewicz M, Bembenik R. Discovery of closed spatio-temporal sequential patterns from event data. In: Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 23rd International Conference KES2019, Budapest, Hungary; 2019. p. 707–716. https://doi.org/10.1016/j.procs.2019.09.226.
He Z, Deng M, Cai J, Xie Z, Guan Q, Yang C. Mining spatiotemporal association patterns from complex geographic phenomena. Int J Geogr Inf Sci. 2020;34(6):1162–87. https://doi.org/10.1080/13658816.2019.1566549.
Andrzejewski W, Boinski P. Maximal mixed-drove co-occurrence patterns. Inf Syst Front. 2022. p. 1–24.
Aydin B, Angryk RA. Spatiotemporal event sequence mining from evolving regions. In: 2016 23rd International Conference on Pattern Recognition (ICPR); 2016. p. 4172–4177.
Yan X, Han J, Afshar R. CloSpan: Mining Closed Sequential Patterns in Large Datasets. In: Proceedings of the 2003 SIAM International Conference on Data Mining; 2003. p. 166–177.
Wang J, Han J, Li C. Frequent closed sequence mining without candidate maintenance. IEEE Trans Knowl Data Eng. 2007;19(8):1042–56.
Wang J, Han J. BIDE: efficient mining of frequent closed sequences. In: Proceedings. 20th International Conference on Data Engineering; 2004. p. 79–90.
Fumarola F, Lanotte PF, Ceci M, Malerba D. CloFAST: closed sequential pattern mining using sparse and vertical idlists. Knowl Inf Syst. 2016;48(2):429–63.
Gomariz A, Campos M, Marin R, Goethals B. ClaSP: an efficient algorithm for mining frequent closed sequences. In: Pei J, Tseng VS, Cao L, Motoda H, Xu G, editors. Advances in knowledge discovery and data mining. Berlin, Heidelberg: Springer, Berlin Heidelberg; 2013. p. 50–61.
Tzvetkov P, Yan X, Han J. TSP: mining top-K closed sequential patterns. In: Third IEEE International Conference on Data Mining; 2003. p. 347–354.
Zhang J, Wang Y, Yang D. CCSpan: mining closed contiguous sequential patterns. Knowl-Based Syst. 2015;89:1–13.
Cong S, Han J, Padua D. Parallel Mining of Closed Sequential Patterns. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. KDD ’05. New York, NY, USA: Association for Computing Machinery; 2005. p. 562–567.
Fournier-Viger P, Lin JCW, Kiran RU, Koh YS, Thomas R. A survey of sequential pattern mining. Data Sci Pattern Recogn. 2017;1(1):54–77.
Wong AKC, Zhuang D, Li GCL, Lee ESA. Discovery of Delta closed patterns and noninduced patterns from sequences. IEEE Trans Knowl Data Eng. 2012;24(8):1408–21.
Kisilevich S, Mansmann F, Nanni M, Rinzivillo S. In: Maimon O, Rokach L, editors. Spatiotemporal clustering. Boston: Springer; 2010. p. 855–74.
Li Z. Spatiotemporal pattern mining: algorithms and applications. Cham: Springer International Publishing; 2014. p. 283–306.
Sunitha G, Reddy M, Rama A. Mining frequent patterns from spatiotemporal data sets: a survey. J Theor Appl Inf Technol. 2014;68(2).
Maciąg PS. A survey on data mining methods for clustering complex spatiotemporal data. In: Kozielski S, Mrozek D, Kasprowski P, Małysiak-Mrozek B, Kostrzewa D, editors. Beyond databases, architectures and structures towards efficient solutions for data analysis and knowledge representation. Cham: Springer International Publishing; 2017. p. 115–26.
Maciąg PS. Efficient Discovery of Top-K Sequential Patterns in Event-Based Spatio-Temporal Data. In: 2018 Federated Conference on Computer Science and Information Systems (FedCSIS); 2018. p. 47–56.
Atluri G, Karpatne A, Kumar V. Spatiotemporal data mining: a survey of problems and methods. ACM Comput Surv. 2018;51(4):83:1–83:41.
Ansari MY, Ahmad A, Khan SS, Bhushan G, et al. Spatiotemporal clustering: a review. Artif Intell Rev. 2019;1–43.
Aydin B, Boubrahimi SF, Kucuk A, Nezamdoust B, Angryk RA. Spatiotemporal event sequence discovery without thresholds. Geoinformatica. 2020;1–29.
Arge L, Procopiuc O, Ramaswamy S, Suel T, Vitter JS. Scalable Sweeping-Based Spatial Join. In: Proceedings of the 24th International Conference on Very Large Data Bases. VLDB ’98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1998. p. 570–581.
Pittsburgh Police Department. Pittsburgh Police Incident Blotter; 2020. Date accessed: 28.10.2020. http://plenar.io/explore/event/police_incident_blotter_archive.
UCR program. Uniform Crime Reporting program; 2020. Date accessed: 28.10.2020. https://www.fbi.gov/services/cjis/ucr.
Boston Police Department. Crime Incident Reports; 2014. Date accessed: 25.05.2018. http://plenar.io/explore.
Funding
This work was supported by the RENOIR (Reverse Engineering of Social Information Processing) program (Grant no. 691152) and by the Institute of Computer Science of Warsaw University of Technology.
Author information
Contributions
PM proposed the notion of CSTS patterns, analyzed its theoretical properties, designed Algorithm 3 and provided its description, selected datasets for the experiments and conducted the experiments as well as their analysis. RB contributed to the writing of the manuscript, analysis of the proposed notions, algorithms and results of the experiments. AD contributed to the writing of the manuscript and analysis of the results of the experiments. All authors read and approved the final manuscript.
Ethics declarations
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Maciąg, P.S., Bembenik, R. & Dubrawski, A. Discovery of crime event sequences with constricted spatiotemporal sequential patterns. J Big Data 10, 98 (2023). https://doi.org/10.1186/s4053702300780x
DOI: https://doi.org/10.1186/s4053702300780x
Keywords
 Data mining
 Spatiotemporal sequential patterns
 Crime-data analysis
 Patterns discovery
 Concise representation