 Research
 Open Access
 Published:
Dynamic order Markov model for categorical sequence clustering
Journal of Big Data volume 8, Article number: 154 (2021)
Abstract
Markov models are extensively used for categorical sequence clustering and classification due to their inherent ability to capture complex chronological dependencies hidden in sequential data. Existing Markov models are based on an implicit assumption that the probability of the next state depends on the preceding context/pattern which is consist of consecutive states. This restriction hampers the models since some patterns, disrupted by noise, may be not frequent enough in a consecutive form, but frequent in a sparse form, which can not make use of the information hidden in the sequential data. A sparse pattern corresponds to a pattern in which one or some of the state(s) between the first and last one in the pattern is/are replaced by wildcard(s) that can be matched by a subset of values in the state set. In this paper, we propose a new model that generalizes the conventional Markov approach making it capable of dealing with the sparse pattern and handling the length of the sparse patterns adaptively, i.e. allowing variable length pattern with variable wildcards. The model, named Dynamic order Markov model (DOMM), allows deriving a new similarity measure between a sequence and a set of sequences/cluster. DOMM builds a sparse pattern from subfrequent patterns that contain significant statistical information veiled by the noise. To implement DOMM, we propose a sparse pattern detector (SPD) based on the probability suffix tree (PST) capable of discovering both sparse and consecutive patterns, and then we develop a divisive clustering algorithm, named DMSC, for Dynamic order Markov model for categorical sequence clustering. Experimental results on realworld datasets demonstrate the promising performance of the proposed model.
Introduction
Categorical sequence data have grown enormously in commercial and scientific studies over the past decades. Categorical sequences, such as the biological sequences, webclicking or behavioral sequences [1,2,3,4,5,6], are consist of a finite set of categorical values with chronological (or spatial) dependencies that are characteristic of each one of them. For instance, protein sequence is characterized by the 20 possible amino acids with their spatial orders. Discovering and understanding the sequential dependencies hidden in a set of sequences are very important tasks for characterizing sequence families and still pose significant challenges to the field of data mining, in particular, sequence clustering and classification analysis.
Categorical sequence clustering aims at partitioning a set of categorical sequences into homogeneous groups by maximizing their intracluster similarity and minimizing their intercluster similarity, and has been extensively studied in pattern recognition, web mining, and bioinformatics [7,8,9,10,11]. Many applications have been put into practice by academics and practitioners, such as inferring protein subfamilies [12], predicting biological sequence function [13], discovering behavior patterns from web user profiles [14], and detecting fraud patterns in credit card behaviors [15]. Moreover, sequence clustering helps provide critical insights into the mountains of sequential data without requiring prior knowledge.
The quality of clustering results depends on the similarity measure used to evaluate how well a sequence matches with another sequence or fits a set of sequences/model. Consequently, the biggest challenge in categorical sequence clustering is to define an effective similarity measure between a set of sequences. The existing similarity measures can be classified into two categories: alignmentbased and alignmentfree methods. The alignmentbased methods focus on finding an optimal match by identifying regions shared by sequences, such as the local alignment [16] and global alignment [17], which are often associated with high time complexity or impaired by the presence of statistically significant patterns^{Footnote 1} in reverse orders. The alignmentfree methods, such as the frequent patterns and Markov model based methods [18,19,20], aim to capture the underlying patterns of the sequences that can be used to represent each sequence (by vectorization) or build a statistic model for clustering analysis. They are in general much more efficient and effective compared to the alignmentbased methods for clustering analysis.
In this paper, we are interested in the methods based on Markov techniques since they have shown to be highly effective in encoding chronological dependencies in sequential data [19, 21, 22]. Generally, such methods proceed in defining a probability framework for capturing the statistically significant patterns hidden in the sequences for sequence representation, while assuming that the probability of next categorical value depends on the proceeding context/pattern^{Footnote 2}. For simplifying computation, the existing methods impose a constraint on each pattern assuming that it is consecutive. This constraint has an effect of preventing exploration of many subpatterns that are not statistically significant enough in their consecutive form, but could become statistically significant if some noisy or nonfrequent values in the patterns are masked. For instance, the pattern GTIT and GTAT shown in Fig. 1a are subfrequent individually, but can become frequent in a sparse form of \(GT*T\)^{Footnote 3} in Fig. 1d. More formally, a sparse pattern, such as \(G*GT\) and \(GT*T\), is a pattern in which one or more of the value/s between the first and last value in the pattern are replaced by wildcard/s that can be matched by a subset of categorical values in the categorical value set. Figure 1d shows that one consecutive pattern GTGT and two sparse pattern \(G*GT\) and \(GT*T\) are discovered in a protein dataset. In this dataset, \(G*GT\) is frequent while both patterns GVGT and GSGT are subfrequent because of the noisy values V and S, which is often the case in categorical sequences being imbued with significant quantities of noise and will lead to information loss during the clustering analysis.
A major drawback of conventional Markov model based methods is that they take advantage of only a portion of statistical significant contexts as consecutive pattern detected according to a predefined threshold. All the patterns in Fig. 1a would be frequent while the threshold was set as \(\tau =15\), and if we keep all the patterns to make maximal use of information for model learning, which can make use of information underlying in the sequences on the cost of the exponential increase of parameter learning, but also will suffer from the overfitting problem. However, if the threshold were set such as \(\tau =20\), only one consecutive pattern would be detected as shown in Fig. 1b, which will cause underfitting problem. Overfitting or Underfitting describes sufficiency or insufficiency in the number and content of the patterns detected for the knowledge representation in the discovery process, which may lead to information redundancy or information loss. Although, the wildcard constrained approaches, a wellstudied topic, can retain information in the subfrequent pattern by containing wildcards [23,24,25,26], the issue of these methods is that the frequent consecutive patterns will be masked by the sparse patterns. For example, compared with the Fig. 1c, d, the frequent consecutive patterns (GTGT) is covered by the sparse pattern \(GT*T\) and \(G*GT\), which can not make use of the information in frequent consecutive patterns (GTGT) and may result in underfitting problem. In summary, the existing models likely yield underfitting or overfitting from the model perspective, which will inevitably lead to information loss and has impacts on the performance of clustering analysis, a common limitation of categorical sequence mining.
To best of our knowledge, the existing sequence clustering methods are either focus on detecting the frequent consecutive patterns or sparse patterns for categorical sequence analysis, which results in overfitting or underfitting problem in terms of knowledge discovery and representation. However, no literature reports mining both of them simultaneously for similarity measure for categorical sequence analysis, which promotes us to propose the sparse pattern detector (SPD) to detect both sparse and consecutive pattern for sequence clustering that is a tradeoff between underfitting and overfitting for model learning. The basic idea of proposing SPD is to relax the constraint of variable order Markov model on patterns being comprised of consecutive patterns, and this relaxation model is capable of exploiting both sparse and consecutive patterns with variable length, named Dynamic order Markov model (DOMM). In fact, DOMM is a generalization of the Markov approach that can make use of both frequent consecutive pattern and sparse patterns detected from the subfrequent patterns as shown in Fig. 1d. The major contributions of this work can be summarized as follows:

We propose a sparse pattern detector (SPD) to discover patterns by using probability suffix tree (PST) with wildcards. SPD not only inherits the capabilities of PST to discover consecutive patterns, but also is capable of discovering sparse patterns from subfrequent patterns with wildcard, which is completely different from conventional pattern detection methods since it retains the consecutive patterns, and also can detects the sparse patterns from subfrequent patterns simultaneously.

We propose a new Dynamic order Markov model that extends conventional Markov model to account for sparse patterns in addition to consecutive patterns, which can take advantage of the statistical information in subfrequent patterns. Among others, it allows the design of a new similarity measure between a sequence and a cluster.

We design a cascade algorithm for categorical sequence clustering based on the proposed DOMM, named Dynamic order Markov model for categorical sequence clustering (DMSC). Comprehensive experimental results on realworld datasets demonstrate the promising performance of our algorithm.
The remainder of this paper is organized as follows: we present the related work and preliminaries on Markov model based similarity measures. This is followed by a detailed description of the proposed model, including Dynamic Order Markov Model (DOMM), sparse pattern detector (SPD), the clustering optimizer based on DOMM and the divisive algorithm DMSC; and then we describe the experimental results and analysis on the performance of DMSC. Finally, we present a conclusion of our work and some potential works in the near future.
Related work and preliminaries
In this section, we will introduce the related work on categorical sequence clustering and the premises of typical Markov chain based models on categorical sequence analysis.
Categorical sequence clustering
Recent literature reports a number of patternbased categorical sequence clustering approaches [27,28,29,30,31,32,33,34]. SigClust, proposed by [20], is a signature based clustering method by an ngram model, which makes use of frequent grams as signature for sequence representation. Ref. [28] proposed an ngram based method (SCS) that allows mismatches within fixed length frequent consecutive grams to detect shorter grams with variable length for sequence representation. Those approaches suffer, however, from the gram length problem since a fixed length is the cause of either overfitting or pattern underfitting and the optimal length is difficult to obtained. This leads to the development of variable length models [19] and [27] who, in a perspective of pattern exploration, employed the PST to search for frequent and consecutive subsequences (patterns) with maximumlength from a set of sequences and use them as patterns for their Markov chain models. On the other hand, several gap/wildcard constrained methods have been proposed for extracting contexts/patterns for sequential data mining [26, 35,36,37]. For instance, [36] proposed a sequential pattern mining method using wildcard constraints that are specified by user with flexibility, which needs extra prior insights into the dataset. The model proposed by [37], named SMCD, groups fixed preceding patterns with identical conditional distributions into invariance classes and represents each class by a sparse pattern, which is used as the fixed order of a Markov chain model. Existing gap/wildcard constrained methods address overfitting problem by compressing multiple patterns into a sparse pattern containing wildcards [36], but result in underfitting problem. For instance, in [37], the frequent and consecutive pattern is also included by a sparse pattern (\(GTGT \in G*GT\)), which is a typical example of underfitting that results in information loss, a common issue for sequential data mining.
Generative model and similarity measure
The generative Model, widely used to model categorical sequences for clustering and classification, is based on the assumption that the conditional probability distribution of the next categorical value depends on the immediate preceding pattern [14, 21, 27, 37, 38]. For instance, given a sequence \(S = s_{1} s_{2}\ldots s_{n}\) of length n composed of categorical values from the set \(\Sigma\) and a set of sequences denoted by a Markov chain model \({\mathcal {M}}\), the generative probability of the sequence S over the model \({\mathcal {M}}\) is defined as follows:
where \(s_{0}=\emptyset\) and \(P_{{\mathcal {M}}}(s_{1}s_{0})= P_{{\mathcal {M}}}(s_{1})\) is the frequency of symbol \(s_{1}\) over the model \({\mathcal {M}}\), and \(P_{{\mathcal {M}}}(s_{i}s_{1} \ldots s_{i1})\) is the conditional probability of the next symbol \(s_{i}\) right after the preceding pattern \(s_{1} \ldots s_{i1}\) over model \({\mathcal {M}}\). Thereby, \(P_{{\mathcal {M}}}(S)\) should be relatively higher, if sequence S is subsumed by a more similar model \({\mathcal {M}}\) compared with the other models, which means that \(P_{{\mathcal {M}}}(S)\) can be used as the similarity measure between sequence S and model \({\mathcal {M}}\) (derived by a set of sequences \({\mathcal {M}}\)); namely, the higher the value of \(P_{{\mathcal {M}}}(S)\), the more likely that S comes from that group of sequences. However, such model requires that all preceding patterns are statistically significant which provides enough statistical information from the model perspective, but it is unrealistic in practice. Thus, the high or variable order Markov model are proposed to provide a better estimation by using constrains on the preceding patterns, yielding the Markov model based similarity measures which are widely used in sequence clustering and classification.
Markov model
Many Markov model based methods have been proposed for modeling the chronological dependencies of sequential data, which exhibit many interesting characteristics for sequence analysis, such as speech and pattern recognition [39] and sequence clustering [19]. Ref. [21] proposed a higher order Markov chain model for categorical data sequences, which has the potential to capture the chronological dependencies or patterns in sequential data where the dependencies are in a fixed length whose value is set to L for simplicity. From the Markov chain model perspective, the high order Markov model can be viewed as an generalization of the Markov chain model, since the pattern \(s_{1} \ldots s_{i1}\) is approximated by the subpattern \(s_{iL+1} \ldots s_{i1}\) with length L. Consequently, the L order Markov model is defined as follows:
The issue associated with the high order Markov model is that the patterns \(s_{iL+1} \ldots s_{i1}\) in \(s_{i}\) (\(1 \le i\le n\)) can not all be statistically significant over model \({\mathcal {M}}\) in practice due to the parameter of order L. Consequently, they can not provide reliable statistics for for estimating the conditional probability of \(s_{i}\), which in turn affects the estimated probability \(P_{{\mathcal {M}}}(S)\) from the model perspective. Moreover, the number of model parameters grows exponentially with the orders (\(\Sigma ^{L}(\Sigma 1)\) for L order). Thus, this model based methods may suffer from overfitting problem. Ref. [40] proposed a SMCD model based on the High order Markov model, which classifies all patterns with length L into several groups shared the same transition probability, such classification can reduce dimension of the parameter space effectively. For instance, there are five pattern GTGT, GTIT, GTAT, GSGT and GVGT in the Fig. 1a, which can be viewed as a fourthorder Markov chain. The five patterns can be classified into two groups of patterns: {GSGT, GVGT}, {GTGT, GTIT, GTAT}, which are represented by two sparse patterns \(G*GT\) and \(GT*T\) respectively as shown in Fig. 1c. However, it is not fair enough for the probability calculation of pattern GTGT since it is frequent inherently as shown in Fig. 1d, which will obviously result in underfitting by such representation as aforementioned. Ref. [22] proposed a sequence mining method to represent the higherorder dependencies in networks. Those high order Markov model based methods only can tackle the pattern with fixed length, which may also break chronological dependencies or structurally relevant information that vary in length among sequences.
As aforementioned, the fixed length pattern, to some extent, still can not satisfy the statistical needs of Markov model since any fixed length value may break the chronological pattern dependencies of sequential data. To deal with the pattern with variable length, [41] proposed the variable order Markov model for data compression, which is viewed as a further generalization of Markov chain model. The pattern \(s_{1} \ldots s_{i1}\) is generalized to learn from the given previous pattern that could be varying in length of patterns based on the data being modeled. Ref. [19] proposed a variable order Markov model for categorical sequence clustering by detect only the frequent consecutive pattern by PST for the model representation. Thus, the Eq. (1) is extended to dealing with pattern in variable length as follows:
In fact, the pattern \(s_{1} \ldots s_{i1}\) is approximated by the latest frequent subpattern \(s_{j} \ldots s_{i1}\) whose length can be larger or smaller than L which is capable of capturing the variable chronological dependencies in flexibility. Thus, such approximation can guarantee a reliable statistic for conditional probability estimation accordingly, since if \(s_{iL+1} \ldots s_{i1}\) is not frequent, \(s_{1} \ldots s_{i1}\) will never be frequent (apriori algorithm), but there exists a subpattern \(s_{j} \ldots s_{i1} (j>iL+1)\) being frequent that avoids making probability approximation meaningless. However, such models suffer from overfitting since they can not deal with the sparse patterns detected from subfrequent patterns.
The proposed model
In this section, we will first propose the derivation of Dynamic order Markov model (DOMM) that serves as similarity measure of the sequence clustering, and then describe the sparse pattern detector (SPD), which discovers the sparse and consecutive patterns of DOMM from the categorical sequences. Finally, we will present the divisive algorithm (DMSC) for categorical sequence clustering based on DOMM and SPD.
Dynamic order Markov model
In light of the above Markov models, we propose a new Dynamic order Markov model (DOMM) that is capable of modeling the sparse and consecutive patterns of variable length, including consecutive pattern (GTGT) and sparse pattern (\(GT*T\), \(G*GT\)) shown in Fig. 1d, in order to solve the challenges of the conventional Markov models. According to the Markov chain model, we therefore define the new DOMM in a general form as follows:
where \(\psi _{m}=im\sum _{j=0}^{m}\phi _{j}\), \(s_{\psi }\) denotes the \(\psi{\text{th}}\) categorical value of the pattern \(s_{1} \ldots s_{i1}\), while \(w^{\phi }\) denotes \(\phi\) consecutive wildcards ‘*’. Note that the model in Eq. (4) will decay to the variable order Markov model in Eq. (3) if \(\phi _{j}=0\) for all j. \(P_{{\mathcal {M}}}(s_{i}s_{\psi _{m}}w^{\phi _{m}} \ldots s_{\psi _{2}}w^{\phi _{2}} s_{\psi _{1}}w^{\phi _{1}})\) is the conditional probability of symbol \(s_{i} (s_{i} \in \Sigma )\) occurring right after the preceding pattern (consecutive or sparse pattern) \(s_{\psi _{m}}w^{\phi _{m}} \ldots s_{\psi _{2}}w^{\phi _{2}} s_{\psi _{1}}w^{\phi _{1}}\) denoted by \(\theta _{i}\). Thereby, \(P_{{\mathcal {M}}}(s_{i}s_{\psi _{m}}w^{\phi _{m}} \ldots s_{\psi _{2}}w^{\phi _{2}} s_{\psi _{1}}w^{\phi _{1}})\) in Eq. (4) is simplified to \(P_{{\mathcal {M}}}(s_{i}\theta _{i})\). Statistical information in this paper denotes the number of sequences in corresponding group or cluster, the length of each sequence and the pattern occurrence in each sequence. In order to make use of these statistical information in sequence S and model \({\mathcal {M}}\), the probability \(P_{{\mathcal {M}}}(s_{i}\theta _{i})\) in Eq. (4) is divided into two part: one part is from sequence S: \({\hat{P}}_{S}(s_{i}\theta _{i})\), the other is from model \({\mathcal {M}}\): \({\hat{P}}_{{\mathcal {M}}}(s_{i}\theta _{i})\). We therefore redefined model probability \(P_{{\mathcal {M}}}(s_{i}\theta _{i})\) in Eq. (4) as follows:
where \({\hat{P}}_{S}(s_{i}\theta _{i})\) is the conditional probability of \(s_{i}\) given the preceding pattern \(\theta _{i}\) in sequence S, and \({\hat{P}}_{{\mathcal {M}}}(s_{i} \theta _{i})\) is referred to as the model parameter derived from cluster C, which will be inferred later. If sequence S coming from model M, the conditional probability distribution of \(s_{i}\) right after pattern \(\theta _{i}\) in sequence S will be the same as model \({\mathcal {M}}\) and the \(\exp (\cdot )\) is equal to 0, thus, \(P_{{\mathcal {M}}}(s_{i}\theta _{i})\) would be equal to 1; otherwise, the \(\exp (\cdot )\) is equal to \(\infty\), and \(P_{{\mathcal {M}}}(s_{i}\theta _{i})\) would be approaching but not equal to 0. Thus, we guaranteed that the probability \(P_{{\mathcal {M}}}(s_{i}\theta _{i})\) is in the range 0 to 1.0. To take the sequence length and pattern occurrence into account, we estimate \({\hat{P}}_{S}(s_{i}\theta _{i})\) as follows:
where \(P_{S}(s_{i}\theta _{i})=\frac{n(\theta _{i}s_{i},S)}{n(\theta _{i},S)}\) is the experience probability of \(s_{i}\) following the pattern \(\theta _{i}\) in sequence S, while \(\omega _{S}^{\theta _{i}} = \frac{n(\theta _{i},S)}{S}\) is the frequency of \(\theta _{i}\) occurs in S; \(n(\theta _{i}s_{i},S)\), \(n(\theta _{i},S)\) denote the number of occurrence of pattern \(\theta _{i}s_{i}\) and \(\theta _{i}\), respectively, in sequence S.
For the occurrence calculation as \(n(\theta _{i}s_{i},S)\) in Eq. (6), if the pattern \(\theta _{i}\) is consecutive, it is easy to calculate \({\hat{P}}_{S}(s_{i}\theta _{i})\) by the occurrence of \(\theta _{i}s_{i}\) and sequence length S. However, if \(\theta _{i}\) is a sparse pattern with wildcard/s whose number and position varies among different pattern. We assume that there is one wildcard (denoted by *) in pattern \(\theta _{i}\) for simplicity, so that the pattern \(\theta _{i}\) can be denoted as \(\theta _{1}*\theta _{2}\) (\(\theta _{1}\) and \(\theta _{2}\) being the consecutive parts of sparse pattern \(\theta _{i}\)). Thereby, \(n(\theta _{i}s_{i},S)\) and \(n(\theta _{i},S)\) in Eq. (6) can be calculated as follows:
Thus, according to the Markov chain model in Eq. (1), and the DOMM in Eq. (5), the probability of sequence S coming from model \({\mathcal {M}}\) can be redefined as follows:
where \({\mathcal {M}}\) is derived from a set of sequences (refered to as cluster C), the larger \(P_{{\mathcal {M}}}(S)\), the more likelihood of sequence S coming from cluster C. Thus, this generative probability can be similarity measurement between a sequence and a cluster. According to the principle of relativity for similarity measures, we omitted exponential function in Eq. (8) to simplify calculation, and defined a new DOMM based similarity measurement as follows:
To be consistent with the conventional similarity measures, we turn the similarity measure in Eq. (9), into a dissimilarity measure (dsim) as follows:
As is clear from Eq. (10), the dissimilarity measure dsim(S, C) is meaningless when i is equal to 1 as \(s_{1}\) has no preceding pattern. However, there are two issue during the Eq. (10) calculation: one is that \(s_{i}\) will be repeated multiple times in S and map to the same pattern, another is that there are many short patterns detected by SPD. Thus, we ignore the short patterns and use all the longest frequent patterns to represent the sequences as for the dissimilarity approximation. Therefore, the dissimilarity approximation in Eq. (10) can be rewritten as follows:
where \(\Theta\) (\(\theta \in \Theta\)), \(\Sigma\) (\(s \in \Sigma\)) denotes all the longest frequent pattern detected by SPD in cluster C and the distinct categorical values that consist of the whole sequence in cluster C. We thus can use the dissimilarity measure in Eq. (11) to refine clusters by relocating sequence and estimate the compactness of a cluster. Based on Eq. (10), we can therefore define the clustering objective function of a categorical sequence dataset with K clusters. The objective function for the Sum of Clustering Error (SCError) is given as follows:
By minimizing the objective function Eq. (12) as presented in [42], we can obtain the parameter \({\hat{P}}_{{\mathcal {M}}}(s_{i}\theta _{i})\) of model \({\mathcal {M}}\) derived from the corresponding cluster as follows:
Sparse pattern detector
The sparse pattern detector (SPD) is used to detect the sparse and consecutive pattern for DOMM. The intuition behind the SPD is derived from the framework of PST containing wildcards that not only inherits capability of PST detecting the frequent consecutive patterns, but also exploits sparse pattern from the subfrequent patterns, which can make use of the statistical information in the form of sparse pattern. To provide an illustrative building process of SPD, we decompose the building process into three steps for a better understanding that happens simultaneously in practice. We start by building a PST on a cluster C with 250 sequences over \(\Sigma = \lbrace {a, b, c\rbrace }\) without pruning as shown in Fig. 2a. For simplicity, we set the maximum depth to L = 3 and the threshold to \(\tau =250\), while the depth is not limited in actuality, but we will be set to a maximal depth for the sake of space complexity in practice.
Then, we shall select those nodes with occurrence under the support threshold \(\tau\) to merge to retain the frequent consecutive patterns, and identify the sparse patterns from subfrequent consecutive ones. We thus need to scan the number of occurrences at each node and compare it to the threshold \(\tau\), starting from the root. If the number of occurrences is smaller than \(\tau\), the node will be pruned, since its descendants’ occurrence will never be greater than \(\tau\) according to Apriori algorithm; otherwise the children of that node are classified into two groups: U (occurrence greater than \(\tau\)) and V (occurrence less than \(\tau\)). Technically, \(U \cup V=\Sigma\), \(\lbrace {a\rbrace } \in U\) and \(\lbrace {b, c\rbrace } \in V\)) as shown in Fig. 2b. Then, we keep the node in U and merge the node in V if necessary. Actually, the node merging process is subtrees merging, since each node in V can be regarded as the root of a subtree. Thus, we merge all the subtrees in V with their descendants, and then update the occurrences at each node and label the edge by \(*\). For instance, the two subtrees in blue and the two in green in Fig. 2b will be merged to yield the results shown in Fig. 2c.
Finally, we obtain the final SPD by node pruning. We compare the number of occurrences in each children of the merged subtree with the threshold \(\tau\) in order to determine whether or not any sparse pattern has been detected after the node merging process. If the number of occurrences at the root of the merged subtree is still below the threshold \(\tau\) and no new child with occurrence above \(\tau\) has been produced, the merged subtree will be pruned (subtree in green in Fig. 2c; otherwise the merged subtree is kept and the edge is labeled by \(*\) (new child, pink node, with occurrence 290 in Fig. 2c, and the final SPD is shown in Fig. 2d. However, the parameter \(\tau\) for construction here is simply set to the number of sequences in the corresponding cluster, which is based on the frequent pattern assumption that a frequent pattern should occur at least once in each sequence in the cluster. The thresholds change dynamically according to the number of sequences in each clusters during the iterative process of clustering refinement, which can overcome the hindrance to pattern detection caused by a fixed threshold for all clusters since some patterns may be only frequent in one cluster, but not necessary frequent in another. Moreover, if we use the same fixed threshold for all clusters, most of the patterns may be filtered by the same threshold, especially for the unbalance clusters. The construction process of SPD is given in Algorithm 1:
Clustering optimizer
In this subsection, we design a clustering optimizer to refine the clusters based on the DOMM. The procedure of clustering optimizer can be divided into three steps: firstly, we need to build a SPD on each cluster, which is used to detect the sparse and consecutive pattern for DOMM, and then we update the model parameters of each cluster based on the patterns detected by SPD from each clusters. Finally, we relocate sequences to clusters iteratively based on the dissimilarity measure Eq. (11) until no membership changes. The process of the clustering optimizer are detailed in Algorithm 2.
The DMSC algorithm
In this subsection, we describe the divisive algorithm of DMSC in details. The process of DMSC involves the sequence vectorization and an iterative phase. The sequence vectorization is to transform each sequence in the dataset into a fixed length vector by the first order Markov chain. The iterative phase includes two steps: initialization and clustering refinement. The initialization includes bisecting one cluster in two initial clusters based on the sequence vector matrix obtained from sequence vectorization phase, and optimizing them according to Chisquare distance. The clustering refinement step refines the two optimized clusters by the clustering optimizer shown in the Algorithm 2 in “Clustering optimizer” subsection, and then choose one cluster for the next iteration until the predefined K clusters achieved. The dissimilarity measure in Eq. (11) can also be used as the estimation of compactness for a cluster, we thus can choose the cluster with maximal compactness for the bisecting process in the initialization phase. The overview of DMSC is shown in Fig. 3.
Sequence vectorization
In this phase, we use the first order Markov chain model to transform each categorical sequence into a numerical vector with fixed length (\(\Sigma ^{2}\)), which is referred to as Markov Vector of Sequence (\(\mathbf {MVS}\)). Thus, the sequence set will be converted into a sequence \(\mathbf {MVS}\) matrix that will be used for initialization. As example for the transformation process of a sequence to an \(\mathbf {MVS}\), we select a sequence \(S=ababababbcbabcabacbacbacacaa\) (\(\Sigma = \{a, b, c\}\) and \(\Sigma  = 3\)) from the previous dataset whose length n is 28. The symbol frequency vector \(\mathbf {V_{F}}\) and transition occurrence matrix \(M_{O}\) are obtained from the frequency of symbol and the state transition by the first order Markov model from sequence S respectively, which are shown as follows:
where \(F_{S_{a}}\) is the frequency of a in sequence S, and \(O_{S_{ij}}\) is the occurrence of state transition from i to j in sequence S. However, the occurrence of observed state transition shown in matrix \(M_{O}\) is not fair enough to serve as the representation for a stochastic Markov process of a categorical sequence. We thus introduce the Bayesian approach to model the transition probability among states, other than just use the state transition occurrence dividing by total occurrence, which can guarantee the stochastic process of a Markov model (for more details, please read the reference [43]). According to the Bayesian estimation, the state transition probability thus can be defined as follows:
where \(\eta _{ij}\) is a hyperparameter and \(O_{ij}\) is the state transition occurrence of \(s_{j}\) given \(s_{i}\), \(\eta _{i} = \sum _{j} \eta _{ij}\), \(f_{i} = \sum _{j} O_{ij}\), \(\eta = \Sigma \) and \(\eta _{ij} = \frac{1}{\Sigma }\). Thus, the occurrence transition matrix \(M_{O}\) can be transformed into the state transition probability as \(M_{P}\) by Bayesian estimation in Eq. (14) as follows:
Using the symbol frequency as weight, the state probability transition can be further estimated as \(M_{FP}\) by the state probability transition \(M_{P}\) and symbol frequency \(\mathbf {V_{F}}\) as follows:
Finally, sequence S is transformed into an \(\mathbf {MVS}\) by flattening the state probability transition matrix \(M_{FP}\) into one dimension vector as follows:
Therefore, any sequence set can be transformed into a sequence vector matrix composed of the \(\mathbf {MVS}\) of each sequence, which can be used for cluster splitting and optimization in the initialization phase.
Initialization
In this phase, we need to bisect one cluster (initial dataset denoted as one cluster), and optimize the two clusters according to the Chisquare distance. They will be used as input for further optimization in the phase of clustering refinement. As presented above, each sequence in the cluster has been converted into a sequence \(\mathbf {MVS}\). However, such \(\mathbf {MVS}\) is highly sparse in dimensionality as 9 dimensions with \(\Sigma =3\) and 400 dimension for a protein sequence (\(\Sigma =20\)). We therefore utilize the principal component analysis (PCA) to reduce the dimensionality on the \(\mathbf {MVS}\) matrix of cluster. To obtain two initial robust clusters with maximal distance, we take first component of PCA, explains the largest portion of the variance on the sequence matrix, to bisect the cluster into two initial clusters, namely, sequence with positive value in first component will be assigned to one cluster; otherwise it will be assigned to the other. We then obtain two initial clusters and use the Chisquare distance to optimize the two initial clusters on their \(\mathbf {MVS}\) to obtain two robust optimal \(C_{1}\) and \(C_{2}\) for clustering refinement.
Clustering refinement
In this phase, we need to refine the two optimal clusters from the initialization phase. Since we have obtained two optimal clusters \(C_{1}\) and \(C_{2}\) in initialization phase, and then we can obtain the parameter \(\tau\) for SPD construction whose value is equal to the number of sequences in the corresponding cluster as discussed before. Thus, two SPDs will be built on \(C_{1}\) and \(C_{2}\), which serves as the sparse and consecutive pattern detector shown in Algorithm 1 for the similarity measure in the clustering optimizer in Algorithm 2. The clustering refinement is an iterative process that operates by reassigning sequences to the cluster with minimal dissimilarity measure in Eq. (11). After the clustering refinement, as aforementioned, we need to choose one cluster to repeat for the next iteration until the desired K clusters obtained. The procedure of algorithm DMSC is shown in Algorithm 3.
Experiment
In this section, we will compare the results of our model with the stateoftheart algorithms on six realworld datasets, and evaluate its performance based on the empirical results.
Datasets
Clustering biological sequences poses significant challenges for biological sequence analysis in bioinformatics, we therefore choose two biological sequence datasets, one speech sequence dataset and other three datasets to evaluate the performance of our model. The details of the six datasets are shown as follows:

Protein: This dataset contains 2074 protein sequences from three families, which is available from the SWISSPROT protein sequence data bank. These sequences are arbitrarily chosen from the globin, immunoglobulin, and 3\(\beta\)HSD families, with respective sizes of 1500, 313 and 261. Sequence lengths vary from 10 to 1709, with an average length of 310.

RNA: This dataset comprises three classes of RNA sequence from RFAM database of noncoding RNA families: transfer messenger RNA (abbreviated tmRNA), signal recognition particle RNA (SRP) and transfer RNA (tRNA). The respective sizes of classes are 390, 254 and 640 sequences. Sequence lengths vary from 54 to 380, and the average length is 201.

Speech: This dataset includes 500 sequences from five classes (100 sequences for each class). All the sequences were generated from the pronunciation of five isolated French vowels (i.e., ‘a’, ‘e’, ‘i’, ‘o’, ‘u’) and each pronunciation is characterized with a categorical sequence with 20 different categorical values [44]. Sequence lengths vary from 701 to 3753, with an average length being 1898.

Robot: This dataset consists of two classes: MoveForward with 2205 samples and SharpRightTurn with 2097 samples. Each sequence consists of ultrasound readings collected by the sensors of a robot, and is discretised into 95 categorical values, which is used by [45]. Sequence lengths vary from 4 to 100, with an average length being 40.

Pioneer: this dataset is derived from the pioneer1 dataset in UCI repository. The time series is manually discretized into three bins (referred to as gripper, move, or turn) based on exploratory data analysis. The numerical time series thus can be categorized by one of three scenarios: gripper, move, or turn [46]. Each sequence is with the same lengths being 24.

Pattern: this dataset is also comporsed of time interval sequence converted into categorical sequence based on exploratory data analysis. Each time interval in the sequence is characterized by one of five categorical value such as street or meeting [47]. Sequence lengths vary from 22 to 246, with an average length being 88.
The statistics of the datasets are shown in Table 1.
Baselines
To demonstrate the performance of DMSC, we present our results in comparison with following baselines on the six test datasets. The description of the baselines is as follows:

SigClust [20] is a ngram based approach, in which each sequence is represented by a binary document signature vector, and uses kmeans for clustering on the sequence vector matrix.

CLUSEQ [27] is a Markov chain model and PSTbased method, which uses PST to capture the frequent consecutive patterns as entities of the conditional pattern for the similarity calculation.

SCS [28] is a variation of ngram that allows mismatches between patterns within length L by detecting the frequent consecutive patterns with variable length for sequence representation.

WCPD [19] is also a variable order Markov model, which employs the PST to detect consecutive, statistically significant pattern for similarity estimation and model representation.

SMCD [37] is a fixed Markov chain model for sequence classification, utilizes Delaunay triangulation to cluster the possible preceding pattern in invariance classes with identical conditional distributions, but here we modify it as a sequence clustering method by using the entites of a fixed Markov model to predict the next categorical value for the generative probability of a given sequence.
Performance metrics
As for the performance metric, besides accuracy, we also employ the F1measure and Normalized Mutual Information (NMI) to evaluate the quality of the clustering results. They are extensively used metrics for evaluating the performance of clustering analysis [48]. Given two sets of the cluster results \({\mathcal {C}}\) = {\(C_{1},\ldots ,C_{K}\)} and the ground truth \({\mathcal {G}}\) = {\(G_{1},\ldots ,G_{K}\)}, the accuracy, F1measure and NMI are defined respectively as follows:
where the \(L_{S}\) and \({\mathcal {L}}_{S}\) is the clustering label and ground truth of sequence S respectively, and the accuracy, precision and recall are also calculated as the fraction of sequences correctly clustered against the ground truth [49], while \(MI({\mathcal {C}},{\mathcal {G}})=\sum _{i=1}^{K}\sum _{j=1}^{K}\left(\frac{C_{i}\cap G_{j}}{N}\right)\log \left(\frac{N\times C_{i}\cap G_{j}}{C_{i}\times G_{j}}\right)\) is the mutual information, and \(H({\mathcal {C}})=\sum _{k=1}^{K}\left(\frac{C_{k}}{N}\right)\log \left(\frac{C_{k}}{N}\right)\) is the entropy. In fact, all those measures are derived from clustering results whose cluster number is the same as the actual class number, which are used to reveal how well the clustering results match with the actual class distribution. Larger value means better clustering performance.
Results and analysis
To show the performance of DMSC, three performance metrics accuracy, F1measure and NMI obtained on the test datasets by our model and the baselines are demonstrated in Tables 2, 3 and 4, respectively. It is easy to see that DMSC shows much better performance than that of the baselines on the six test datasets. DMSC achieves an improvement of 6.3%, 7.0% and 14.6% on average than that of secondplace SMCD model on accuracy, F1measure and NMI on all the test dataset, respectively. The significant improvement on the NMI means that DMSC outperforms the baselines in terms of the clustering quality and stability.
For the accuracy in Table 2, DMSC achieves the highest accuracy overall on the six test datatsets with average value is 91.7%; 10.6%, 6.6% and 4.8% higher than the secondplace SMCD model on the protein, robot and pioneer dataset, and also betters the secondplace SCS model by 3.2%, 3.3% and 3.0% on the pattern dataset, RNA and speech dataset respectively. In terms of the clustering quality and stability performance: the F1measure in Table 3 and NMI in Table 4. It can be readily concluded that DMSC also outperforms the other baselines. DMSC earns the highest F1measure and NMI value on all test datasets, especially for pioneer and robot dataset, where it achieves highest NMI over the secondplace SMCD with a significant improvement of 24.5% and 13.5%, while yields 9.3% and 7.2% higher than that of the secondplace SCS model on RNA and speech dataset. In short, DMSC achieves better performance over other baselines in terms of the clustering quality and stability based on the accuracy, F1measure and NMI on the six test datasets.
As we can see the results in Table 2, 3 and 4, SigClust achieves the lowest performance on all the test datasets due to the fact that it utilizes fixedlength patterns as signatures for sequence representation, assuming the pattern with fixed length is not realistic, since the chronological dependent structures or patterns varies differently in practice. SCS allows mismatching with fixed ngram patterns, which allows to discover grams in variable length to make use of the subfrequent patterns within ngram for sequence representation. Thereby, it is reasonable that SCS outperforms SigClust. The CLUSEQ can detect variable length patterns by PST but suffers from underfitting that ignores the statistical information of individual sequences so that CLUSEQ shows better performance than that of SigClust but is inferior to SCS due to the variable length patterns and statistical information underlying sequences.
However, the WCPD model takes into account both the frequent consecutive patterns of variable length and the statistical information of individual sequences, but loss the information hidden in sparse patterns, which suffers from underfitting or overfitting because of the improper support threshold. Both of them thus show good performance on the test dataset which are better than CLUSEQ, but still inferior to DMSC due to the fact that they can not take advantage of the statistical information in subfrequent patterns. SMCD, a Markov model, uses data compression technique to clustering the possible preceding subsequences with same length into invariance classes, which are used as the patterns of a fixed order Markov model, to predict the next categorical value for generative probability of a given sequence. It can take the information from the subfrequent patterns into account, but still suffers from both the issue of fixed length pattern and underfitting (frequent consecutive patterns GTGT are represented by pattern \(G*GT\) or \(GT*T\) as argued above), which may explain the reason why SMCD performs better than others, but is also inferior to DMSC. As for DMSC, it neither suffers from overfitting nor underfitting, which not only retains the information hidden in consecutive patterns, but also makes use of the statistical information underlying subfrequent patterns by the form of sparse pattern during the pattern detection process. All in all, the results also prove that DMSC outperforms the other baselines with respect to the performance metric of accuracy, F1measure and NMI in Tables 2, 3 and 4 respectively.
Scalability analysis
DMSC is scalable to the number of sequences and average length of the sequences, and the time complexity can be analysed by three parts of DMSC. The first part is data vectorization. The time complexity of this phase is mainly about of vectorizing all sequence in the dataset into an \(\mathbf {MVS}\) matrix. The time of vectorizing sequence \(S=s_{1}\ldots s_{n}\) to an \(\mathbf {MVS}\) is \({\mathcal {O}}(n\Sigma +\Sigma ^{2})\). Thus, the time complexity of sequence vectorization is \({\mathcal {O}}(N{\overline{n}}\Sigma +N\Sigma ^{2})\), where \({\overline{n}}\) is the average length of the sequences in dataset, and \(\Sigma\) is the number of categorical value set. The second part is initialization. The time complexity of this phase can be divided into two part: one is time complexity of PCA on the \(\mathbf {MVS}\) matrix: \({\mathcal {O}}(N\Sigma R)\), where \(R\le \sqrt{\min (n,\Sigma ^{2})}\). The other is optimization by Chisquare distance \({\mathcal {O}}({\mathcal {I}}NM\log K)\), where \({\mathcal {I}}\) is the average number of iterations for the whole optimization process, M is the dimension of \(\mathbf {MVS}\) after processed by PCA and K number of final clusters. Thus, the complexity of initialization is \({\mathcal {O}}(N\Sigma R+{\mathcal {I}}NM\log K)\). The final part is the clustering refinement. The time complexity of this phase also includes three part. The first part is the construction of SPD: \({\mathcal {O}}(N{\overline{n}}L\Sigma )\), where L is the maximal depth of SPD, the second is to detect patterns from SPD: \({\mathcal {O}}(\Sigma ^{l})\), where \(l \le L\) is the average length of pattern and involves the number of wildcards in the pattern. The final is clustering refinement, updating model parameters for one iteration is \({\mathcal {O}}(C_{1}\Sigma ^{l})\), the complexity of clustering refinement itself is \({\mathcal {O}}(N{\overline{n}}L\Sigma +{\mathcal {I}}NM\log K)\). Consequently, the time complexity of DMSC is linear with number of sequence N and sublinear in terms of the average sequence length \({\overline{n}}\).
To examine the scalability of DMSC with respect to number of sequences and average length of sequences, we generated eight datasets with two class from two RNA families, and each class shares the same number of sequences. We use four datasets with size of \(10^{2}\), \(10^3\), \(10^4\) and \(10^5\) whose sequence length is 100 for the scalability of sequence numbers, while the rest four datasets with 1000 sequences whose average length sequences are 10, \(10^2\), \(10^3\) and \(10^4\), respectively, which are used to test scalability of the average sequence length sequences. The running time are shown in Fig. 4. It can be clearly seen that the running time of DMSC increases linearly with respect to the number of sequences and the average length of sequences, respectively.
Further analysis
Despite the performance against the baselines, we also present a indepth analysis to investigate whether the sparse pattern is helpful for the clustering analysis or not by experiments. We have done a series of experiment to explore above question in the following two domains:

SPD method can rich in pattern structure and cluster characterization?

Sparse pattern can really improve the clustering performance?
Pattern richness and cluster characterization analysis
As for the pattern richness, we use SPD to detect the patterns from each cluster on the RNA dataset compared with consecutive detection methods. However, apart from the frequent consecutive patterns, SPD also detected the sparse patterns from each cluster, and top five frequent sparse patterns are shown in Table 5. What is most interesting is that these unique sparse patterns detected from one cluster can not be found in another cluster, which means that they are never cooccurred in two clusters and can be used to characterize the properties of cluster in view point of knowledge representation for clustering analysis. However, we will not focus on the biological analysis of these detected patterns like \(UGGU**U\) in cluster \(C_{3}\), and that will a part of our future work to investigate in the near future.
We also compared the detected patterns using for clustering against the baselines from each cluster on test datasets. The patterns detected from the final clustering results are shown in Fig. 5. We thus can learn that DMSC has the largest number of patterns due to the fact that SPD detects sparse patterns from subfrequent patterns by containing wildcards, while also remains the frequent consecutive patterns simultaneously. It can be obviously concluded that such pattern richness is of great help for sequence analysis and knowledge representation in view point of a statistic model. Moreover, these sparse patterns are capable of taking advantage of the statistical information underlying the subfrequent patterns, a common suboptimization problem of pattern detection of sequence mining, which may explain the reason why DMSC has the largest number of patterns and outperforms the other baselines in terms of clustering results.
Clustering improvement of sparse pattern analysis
As for the clustering performance improvement, we experiment the DMSC without the sparse patterns on test datasets, named DMSCSP. Such implementation can directly validate the usefulness of the sparse patterns whether they can improve the performance of clustering result or not. As the results shown in Fig. 6, it can easily be concluded that the clustering performance of DMSC on the test dataset outperforms that of DMSCSP without sparse patterns, as results of DMSC is obviously higher than that of DMSCSP. Actually, DMSC improves significantly on three performance metrics on text dataset, around 6.0% improvement on accuracy, F1measure, and NMI on average, which validates that the sparse pattern is of great importance to improve the performance of clustering. Thereby, we can make sure that those sparse patterns detected by SPD can be of great help to characterizing clusters for clustering analysis.
In summary, DMSC, a based statistical model for categorical sequence clustering, not only takes the consecutive patterns with variable length and statistical information, such as the number of sequence in each cluster and sequence length, into account, but also makes use of the information underlying in sparse patterns detected from the subfrequent patterns for the clustering analysis. SPD is a powerful pattern detector in terms of the pattern richness in the domain of pattern recognition, since it not only detects consecutive pattern, but also retains the possibility of discovering frequent sparse patterns from the subfrequent patterns. Moreover, it also avoids overfitting and underfitting by retaining the some subfrequent patterns, thus sidestepping the suboptimal pattern detection problem, a common limitation of pattern mining. Finally, the experimental results on realworld datasets also exhibit the superiority of the DMSC over a number of stateoftheart competitors.
Conclusion
In this paper, we propose a Dynamic order Markov model (DOMM) that is capable of dealing with the sparse patterns and handling the length of the sparse patterns adaptively, i.e. allowing variable length pattern, which allows deriving a similarity measure between a sequence and a set of sequences/cluster. To implement DOMM, we propose a sparse pattern detector (SPD) to discover sparse and consecutive patterns based on the PST containing wildcards, which can take advantage of significant statistical information, specially the information veiled by the noise in the form of sparse pattern. We then develop a divisive clustering algorithm, named DMSC, for Dynamic order Markov model for Categorical Sequence Clustering. Experimental results on realworld datasets demonstrate the promising performance of the proposed model.
Our future work may focus on the following aspects: on one hand, we would like to improve the performance of the for sparse pattern detection, and investigate how this kind of sparse patterns would be useful for knowledge discovery and model representation in the field of bioinformatics, such as the mutation patterns and the evolution of biological sequence. On the other hand, we would like to apply the SPD on financial data or event data to detect complex patterns for decision making or behavior analysis. In short, we see the significant challenges for our future work, but we are confident that the proposed method has great potential in real applications.
Availability of data and materials
Protein dataset is available at: http://www.uniprot.org/. RNA dataset is available at: http://rfam.xfam.org. Speech dataset is available from the corresponding author on reasonable request. Robot dataset is available at: https://archive.ics.uci.edu/ml/datasets/WallFollowing+Robot+Navigation+Data. Pioneer dataset is availabe at: https://archive.ics.uci.edu/ml/datasets/Pioneer1+Mobile+Robot+Data. Pattern dataset is available at http://www.cis.hut.fi/jhimberg/patterndata/index.shtml.
Notes
 1.
A pattern is statistically significant whose occurrence is no less than the userdefined support threshold \(\tau\), otherwise the pattern is or subfrequent or nonstatistically significant.
 2.
In this paper, the terminology of context and pattern are used interchangeably.
 3.
The * denotes a wildcard. To compared with the consecutive pattern, we call the pattern containing wildcard/s as sparse pattern. Sparse pattern \(GT*T\) consists of subfrequent pattern GTAT and GTIT, and the wildcard * only matches with I, A.
Abbreviations
 DOMM:

Dynamic order Markov model
 SPD:

Sparse pattern detector
 PST:

Probability suffix tree
 DMSC:

Dynamic order Markov model for categorical sequence clustering
 DMSCSP:

DMSC without sparse pattern
 MVS:

Markov vector of sequence
 PCA:

Principal component analysis
 NMI:

Normalized mutual information
References
 1.
Agrawal R, Srikant R. Mining sequential patterns. In: Data engineering, 1995. Proceedings of the eleventh international conference on, 1995. IEEE. p. 3–14.
 2.
Mobasher B, Dai H, Luo T, Nakagawa M. Using sequential and nonsequential patterns in predictive web usage mining tasks. In: Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE international conference on, 2002. IEEE. p. 669–72.
 3.
Petridou SG, Koutsonikola VA, Vakali AI, Papadimitriou GI. A divergenceoriented approach for web users clustering. In: International conference on computational science and its applications. Springer; 2006. p. 1229–38.
 4.
Gebser M, Guyet T, Quiniou R, Romero J, Schaub T. Knowledgebased sequence mining with asp. In: IJCAI 201625th international joint conference on artificial intelligence, AAAI. 2016. p. 8.
 5.
He Z, Zhang S, Wu J. Significancebased discriminative sequential pattern mining. Expert Syst Appl. 2019;122:54–64.
 6.
Ibrahim R, Shafiq MO. Detecting taxi movements using random swap clustering and sequential pattern mining. J Big Data. 2019;6(1):1–26.
 7.
Lin Z, Lanchantin J, Qi Y. Mustcnn: a multilayer shiftandstitch deep convolutional architecture for sequencebased protein structure prediction. 2016. arXiv preprint arXiv:1605.03004.
 8.
Katoh K, Rozewicki J, Yamada KD. Mafft online service: multiple sequence alignment, interactive sequence choice and visualization. Brief Bioinform. 2019;20(4):1160–6.
 9.
FournierViger P, Lin JCW, Kiran RU, Koh YS, Thomas R. A survey of sequential pattern mining. Data Sci Pattern Recogn. 2017;1(1):54–77.
 10.
Rathee S, Kashyap A. Streamaligner: a streaming based sequence aligner on apache spark. J Big Data. 2018;5(1):1–18.
 11.
Yuan L, Wang W, Chen L. Twostage pruning method for grambased categorical sequence clustering. Int J Mach Learn Cybern. 2019;10(4):631–40.
 12.
Wicker N, Perrin GR, Thierry JC, Poch O. Secator: a program for inferring protein subfamilies from phylogenetic trees. Mol Biol Evol. 2001;18(8):1435–41.
 13.
Watson JD, Laskowski RA, Thornton JM. Predicting protein function from sequence and structural data. Curr Opin Struct Biol. 2005;15(3):275–84.
 14.
Borges J, Levene M. Evaluating variablelength markov chain models for analysis of user web navigation sessions. IEEE Trans Knowl Data Eng. 2007;19(4):441–52.
 15.
Ki Y, Yoon JW. Pdfds: purchase density based online credit card fraud detection system. In: KDD 2017 workshop on anomaly detection in finance. 2018. p. 76–84.
 16.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
 17.
Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443–53.
 18.
Liu L, Ho YK, Yau S. Clustering DNA sequences by feature vectors. Mol Phylogenet Evol. 2006;41(1):64–9.
 19.
Xiong T, Wang S, Jiang Q, Huang JZ. A novel variableorder markov model for clustering categorical sequences. IEEE Trans Knowl Data Eng. 2014;26(10):2339–53.
 20.
Chappell T, Geva S, Hogan J. Kmeans clustering of biological sequences. In: Proceedings of the 22nd Australasian document computing symposium, ACM. 2017. p. 2.
 21.
Ching WK, Fung ES, Ng MK. Higherorder markov chain models for categorical data sequences. Naval Res Logist. 2004;51(4):557–74.
 22.
Xu J, Wickramarathne TL, Chawla NV. Representing higherorder dependencies in networks. Sci Adv. 2016;2(5):1600028.
 23.
Garofalakis MN, Rastogi R, Shim K. Spirit: sequential pattern mining with regular expression constraints. In: VLDB 1999.
 24.
Zaki MJ. Sequence mining in categorical domains: incorporating constraints. In: Proceedings of the ninth international conference on information and knowledge management, ACM. 2000. p. 422–9.
 25.
Pei J, Han J, MortazaviAsl B, Pinto H, Chen Q, Dayal U, Hsu MC. Prefixspan: mining sequential patterns efficiently by prefixprojected pattern growth. In: Icccn, IEEE. 2001. p. 0215 .
 26.
Ji X, Bailey J, Dong G. Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inf Syst. 2007;11(3):259–86.
 27.
Yang J, Wang W. Cluseq: efficient and effective sequence clustering. In: Data engineering, 2003. Proceedings. 19th international conference on, IEEE. 2003; p. 101–12.
 28.
Kelil A, Wang S, Jiang Q, Brzezinski R. A general measure of similarity for categorical sequences. Knowl Inf Syst. 2010;24(2):197–220.
 29.
Fowkes J, Sutton C. A subsequence interleaving model for sequential pattern mining. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM. 2016. p. 835–44.
 30.
Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1):2542.
 31.
Mitra S, Sastry P. Summarizing event sequences with serial episodes: a statistical model and an application. 2019. arXiv preprint arXiv:1904.00516.
 32.
Martin DE. Minimal auxiliary markov chains through sequential elimination of states. Commun StatSimul Comput. 2019;48(4):1040–54.
 33.
Amer AA, Abdalla HI. A set theory based similarity measure for text clustering and classification. J Big Data. 2020;7(1):1–43.
 34.
Park S, Kang I, Shin C. MicroRNA clustering on the biogenesis of suboptimal microRNAs. Appl Biol Chem. 2021;64(1):1–9.
 35.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped blast and psiblast: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
 36.
Xie F, Wu X, Hu X, Gao J, Guo D, Fei Y, Hua E. Sequential pattern mining with wildcards. In: Tools with artificial intelligence (ICTAI), 2010 22nd IEEE international conference on, IEEE; vol. 1. 2010. p. 241–7.
 37.
Xiong J, Jääskinen V, Corander J, et al. Recursive learning for sparse markov models. Bayesian Anal. 2016;11(1):247–63.
 38.
Martin DE. Distributions of pattern statistics in sparse Markov models. Ann Inst Stat Math. 2020;72(4):895–913.
 39.
Hu J, Turin W, Brown MK. Language modeling using stochastic automata with variable length contexts. Comput Speech Lang. 1997;11(1):1–16.
 40.
Jääskinen V, Xiong J, Corander J, Koski T. Sparse markov chains for sequence data. Scand J Stat. 2014;41(3):639–55.
 41.
Rissanen J. A universal data compression system. IEEE Trans Inf Theory. 1983;29(5):656–64.
 42.
Xiong T, Wang S, Jiang Q, Huang JZ. A new markov model for clustering categorical sequences. In: 2011 IEEE 11th international conference on data mining, IEEE. 2011; p. 854–63.
 43.
Ramoni M, Sebastiani P. Bayesian methods. In: Berthold M, Hand DJ, editors. Intelligent data analysis. Berlin: Springer; 1999. p. 129–66.
 44.
Loiselle S, Rouat J, Pressnitzer D, Thorpe S. Exploration of rank order coding with spiking neural networks for speech recognition. In: Neural networks. Proceedings. 2005 IEEE international joint conference on, IEEE; vol. 4. 2005. p. 2076–80.
 45.
Zhou C, Cule B, Goethals B. Pattern based sequence classification. IEEE Trans Knowl Data Eng. 2015;28(5):1285–98.
 46.
Fradkin D, Mörchen F. Mining sequential patterns for classification. Knowl Inf Syst. 2015;45(3):731–49.
 47.
Flanagan JA, Mäntyjärvi J, Himberg J. Unsupervised clustering of symbol strings and context recognition. In: 2002 IEEE international conference on data mining, 2002. Proceedings. 2002. p. 171–8.
 48.
García S, Fernández A, Luengo J, Herrera F. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf Sci. 2010;180(10):2044–64.
 49.
Banerjee A, Krumpelman C, Ghosh J, Basu S, Mooney RJ. Modelbased overlapping clustering. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, ACM. 2005. p. 532–7.
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and corrections.
Funding
This work is financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant No.3960972015, the National Natural Science Foundation of China under Grant No.U1805263 and No.61672157.
Author information
Affiliations
Contributions
RB performed the literature review, data collection, model building and experiment analysis, and prepared this manuscript. All authors had given substantial comments and suggestions on model building, and contributed to the final version of the paper. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The authors Ethics approval and consent to participate.
Consent for publication
The authors consent for publication
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chen, R., Sun, H., Chen, L. et al. Dynamic order Markov model for categorical sequence clustering. J Big Data 8, 154 (2021). https://doi.org/10.1186/s40537021005472
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40537021005472
Keywords
 Sparse pattern
 Pattern detection
 Dynamic order Markov model
 Categorical sequence clustering