Horizontally scalable probabilistic generalized suffix tree (PGST) based route prediction using map data and GPS traces
 Vishnu Shankar Tiwari^{1} and
 Arti Arya^{1}
Received: 25 May 2017
Accepted: 9 July 2017
Published: 19 July 2017
Abstract
Route prediction is an essential requirement for many intelligent transport systems (ITS) services such as VANETs, traffic congestion estimation and resource prediction in grid computing. This work focuses on building an end-to-end horizontally scalable route prediction application based on statistical modeling of user travel data. The probabilistic suffix tree (PST) is a widely used sequence indexing technique that serves as a model for prediction. The probabilistic generalized suffix tree (PGST) is a variant of the PST and is essentially a suffix tree built from a huge number of smaller sequences. We construct a generalized suffix tree model from a large number of trips completed by users. Raw GPS traces of user trips are mapped to the digitized road network by a map matching technique parallelized with the MapReduce framework. Sequential construction of a PGST from a huge volume of data is a bottleneck in practical realization, and most existing work focuses on time-space trade-offs on a single machine. The proposed technique solves this problem with a two-step process that is intuitive to execute in the MapReduce framework. The first step computes all the suffixes along with their frequency of occurrence, and the second step builds the probabilistic generalized suffix tree. The probabilistic aspect of the tree is also handled so that it can be used as a model for the prediction application. The datasets used are road network spatial data and GPS traces of users. Experiments are carried out on real datasets available in the public domain.
Keywords
Introduction
After the raw GPS trace based trips are converted into trips composed of a sequence of road network edges, a suffix tree based model is constructed. The suffix tree is widely used in pattern recognition and machine learning [10]. Traditionally it is used in compression, text analytics, bioinformatics, genome sequence analysis, route prediction, speech and language modeling, text mining etc. [10–12]. Suffix trees improve the performance of searching on the indexed string [10]. A compact trie data structure created from the suffixes of a long sequence is known as a suffix tree [11, 13]. A variant created from a large number of smaller sequences is known as a generalized suffix tree [11]. Since trip data is a large collection of smaller sequences called trips, the probabilistic generalized suffix tree (PGST) is more suitable than the probabilistic suffix tree (PST). Most of the existing work in this area focuses on the efficient construction of a suffix tree on a single machine [14–19]. In most existing techniques, scalability was achieved by leveraging multiprocessor systems and increasing internal memory [20, 21], that is, by vertical scaling. A better alternative is horizontal scalability, where the process runs on distributed commodity hardware. In [13, 22, 23], the authors achieve parallelism by splitting the input string into smaller units and processing them in parallel to produce intermediate trees on disk. The intermediate trees are eventually merged to obtain the final suffix tree, which leads to a huge number of subtrees to merge [13].
Step 1: Suffixes of the input sequences, along with their frequency of occurrence, are computed.
Step 2: A probabilistic generalized suffix tree is constructed from these suffixes.
In the MapReduce model, the mappers compute all the suffixes along with their frequency counts, and the reducer constructs the final probabilistic generalized suffix tree (PGST).
Trajectory S is first converted to a sequence of road network edges using the map matching process, \(S = e_{i}, e_{i+1}, \ldots, e_{i+k}\) [4, 6, 7, 24]. This problem is an instance of a Markov process: given a partial trajectory S, it is required to predict the next road segment σ. The PGST here serves as a Markov model whose memory length equals the height (h) of the PGST trained over the edges of the road network.
In this case, the edges of the PGST are labeled with their probability of occurrence in the corpus of historical trajectories during the training phase. Given a partial trajectory traveled, the PGST is traversed starting from the root node to find the partially matching branch with the highest probability. The remaining edge sequence in that branch is output as the predicted path σ. This is described in detail with various example scenarios in upcoming sections. The implementation and horizontal scalability of the model construction and route prediction modules are evaluated on a real dataset and the evaluation results are discussed. The major contributions of this work are the application of the generalized suffix tree to route prediction and a technique for distributed construction of the probabilistic generalized suffix tree. A MapReduce based distributed construction of the generalized suffix tree is proposed and evaluated.
1. Collection of historical travel location trace data. The large collection of GPS traces from the Geolife project, which Microsoft has made publicly available for research, is used in this application.
2. Preprocessing of raw GPS location trace data into trips (trip segmentation).
3. Use of spatial and non-spatial road network data. Open street map (OSM) road network data, which is available as an open project, is used.
4. Conversion of raw GPS trace based trips into trips composed of a sequence of road network edges. This is also a preprocessing step; it reduces the storage space requirement and enhances processing speed.
5. PGST construction from trips composed of a sequence of road network edges. This serves as the model for route prediction.
6. Route prediction from a partial trajectory using the PGST.
All these steps deal with a huge collection of historical travel data, so scalability is a major concern. This work focuses on building a horizontally scalable route prediction system by implementing the above steps in parallel.
Some basic terminologies related to route prediction
Location traces of users are collected using vehicles equipped with a GPS device. Each data point is a time-stamped pair of latitude/longitude coordinates. If a user travels in the sequence home → bus stand → railway station → office, stays there for a day and then travels back office → railway station → bus stand → home, this is considered as two trips.
Definition 1
Definition 2
Definition 3
A road network is a graph G(V, E), where V is a set of vertices, which are point features representing road intersections and terminal points, and E is a set of edges representing road segments, each connecting two vertices. Two edges are adjacent if they share a common vertex. If the vertices that constitute each edge are ordered, then G is a directed graph.
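As a minimal sketch of this definition, a directed road network can be held as a collection of edges, each an ordered pair of vertices. The class name `Edge`, the helper `are_adjacent` and all vertex and edge identifiers are illustrative assumptions and do not come from the original implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed road segment from `source` to `target` (hypothetical names)."""
    edge_id: str
    source: str
    target: str


def are_adjacent(a: Edge, b: Edge) -> bool:
    """Two edges are adjacent if they share at least one vertex."""
    return bool({a.source, a.target} & {b.source, b.target})


# Toy network: v1 -> v2 -> v3, plus a disconnected segment v4 -> v5.
e1 = Edge("e1", "v1", "v2")
e2 = Edge("e2", "v2", "v3")
e3 = Edge("e3", "v4", "v5")

# The vertex set V is recovered from the end points of the edges in E.
vertices = {v for e in (e1, e2, e3) for v in (e.source, e.target)}
print(are_adjacent(e1, e2), are_adjacent(e1, e3), len(vertices))
```

Keeping the end points ordered is what makes the graph directed; an undirected variant would simply ignore the order of `source` and `target`.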
Definition 4
Probabilistic generalized suffix tree
Suffix tree related work and literature
(a) Partition the long string into multiple subtrees,
(b) Write them to disk, and
(c) Merge them.
The complexity of building the tree is reported to be \(O(n^{2})\). None of them is fully parallel and distributed, and the merging phase requires a lot of interprocessor communication. If everything fits into memory they perform better than Ukkonen’s algorithm, but as soon as the input goes beyond memory capacity they become inefficient. Scalability is a severe issue with these approaches. Two more recent methods, Wavefront [23] and B²ST [31], made it possible to compute the tree even when the input is larger than the available memory. B²ST [31] uses suffix arrays instead of partitioning the suffix tree. A suffix array is an array containing all suffixes of the input sorted in lexicographic order. The long input string is chopped into smaller units, suffix arrays are built for them in memory and dumped onto disk, and merging is performed in the next phase. The focus is on reducing I/O; parallelism is not addressed. Its complexity is also \(O(n^{2})\). Wavefront [23] focuses on parallelism. Unlike the others, it works on the whole string, and partitions are created from suffixes that share a common S-prefix [13, 32]. A suffix subtree is constructed for each partition and the subtrees are merged in the last phase. It requires two buffers at all times: one for the tree being merged, which consumes around 50% of memory, and another for the resultant tree. It is implemented and tested on an IBM BlueGene/L supercomputer. However, scalability cannot be achieved indefinitely because of tiling overhead [23]. ERA [13] takes a more intelligent approach: it first partitions the single long input sequence into smaller segments and then builds a subtree for each independent partition. Each subtree is then divided horizontally and the subtrees are merged to give the final subtrees. It avoids multiple traversals of a subtree to reduce I/O costs. Parallelism is tested on multicore as well as distributed systems, and the reported complexity is \(O(n^{2})\).
Wherever parallelism is achieved, these approaches rely on partitioning the input string into smaller segments and constructing a subtree for each such segment.
Comparison of the most important algorithms for suffix tree construction
Algorithms  Complexity  Memory locality  Parallel  Probabilistic 

McCreight [14]  \(O(n)\)  Poor  No  No 
Ukkonen [15]  \(O(n)\)  Poor  No  No 
Hunt [17]  \(O(n^{2} )\)  Good  No  No 
TRELLIS [30]  \(O(n^{2} )\)  Good  No  No 
TDD [28]  \(O(n^{2} )\)  Good  No  No 
STMerge [29]  \(O(n^{2} )\)  Good  No  No 
Wavefront [23]  \(O(n^{2} )\)  Good  Yes  No 
B2ST [31]  \(O(n^{2} )\)  Good  No  No 
ERA [13]  \(O(n^{2} )\)  Good  Yes  No 
MapReduce Ukkonen [22]  \(O(n^{2} )\)  Good  Yes  No 
Proposed PGST  \(O(n^{2} )\)  Good  Yes  Yes 
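To make the suffix array mentioned in the discussion of B²ST concrete, the following is a naive sketch that sorts suffix start positions lexicographically. It is only an illustration of the data structure on a toy string; it is not the external-memory construction used by B²ST, and the function name `suffix_array` is an assumption.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction: sorting compares whole suffixes, so it is only
    suitable for short illustrative inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "ABCB$"
order = suffix_array(text)
for position in order:
    print(position, text[position:])
```

In practice the array stores start positions rather than the suffix strings themselves, since each suffix can be recovered from its position in the input.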
Suffix tree and generalized suffix tree basics
Let \(\Sigma = \{A, B, \ldots, Z\}\) denote a finite alphabet of characters. \(\Sigma^{*}\) denotes the set of all finite length strings formed using characters from \(\Sigma\). Let \(X = x_{0}, x_{1}, \ldots, x_{n-1}\$\) with \(x_{i} \in \Sigma\) and \(X \in \Sigma^{*}\) denote an input string of length \(n = |X|\), terminated by a symbol \(\$ \notin \Sigma\). The concatenation of two strings X and Y, denoted XY, has length \(|X| + |Y|\) and consists of the characters of X followed by the characters of Y, such that \(XY = x_{0}, x_{1}, \ldots, x_{n-1}, y_{0}, y_{1}, \ldots, y_{m-1}\). A string Y is a prefix of another string X if \(X = YZ\) for some string \(Z \in \Sigma^{*}\). Similarly, a string Y is a suffix of another string X if \(X = ZY\) for some string \(Z \in \Sigma^{*}\).
For \(X = ABCB\), all prefixes are \(\emptyset, A, AB, ABC, ABCB\) and all suffixes are \(\emptyset, B, CB, BCB, ABCB\). The empty string \(\emptyset \in \Sigma^{*}\) has length zero and is both a prefix and a suffix. Hence a string X has \(|X|\) non-empty prefixes and \(|X|\) non-empty suffixes. Given a string X, all prefixes as well as all suffixes can be computed in time \(\Theta(|X|)\) each. The ordered arrangement of all \(|X|\) suffixes of string X in a compact trie is known as the suffix tree T of X.
Example
All suffixes for \(e_{1} ,e_{1} ,e_{2} ,e_{3} ,e_{3} ,{\$}\)
i  x_{i}  Suffix 

1  x _{1}  \(e_{1} ,e_{1} ,e_{2} ,e_{3} ,e_{3} ,{\$}\) 
2  x _{2}  \(e_{1} ,e_{2} ,e_{3} ,e_{3} ,{\$}\) 
3  x _{3}  \(e_{2} ,e_{3} ,e_{3} ,{\$}\) 
4  x _{4}  \(e_{3} ,e_{3} ,{\$}\) 
5  x _{5}  \(e_{3} ,{\$}\) 
6  x _{6}  $ 
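The suffix enumeration in the table can be sketched in a few lines. The function name `all_suffixes` and the representation of a trip as a list of edge identifiers are assumptions made for illustration.

```python
def all_suffixes(trip, terminal="$"):
    """Return every suffix of an edge sequence, longest first.

    The terminal symbol is appended once, so each suffix ends with it and
    the last suffix is the terminal symbol alone.
    """
    seq = tuple(trip) + (terminal,)
    return [seq[i:] for i in range(len(seq))]


suffixes = all_suffixes(["e1", "e1", "e2", "e3", "e3"])
for i, suffix in enumerate(suffixes, start=1):
    print(i, ",".join(suffix))
```

For a trip of five edges this yields six suffixes, matching the six rows of the table.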
The generalized suffix tree was initially proposed in [31]. Unlike other suffix trees, which process one long sequence, the generalized suffix tree is constructed from a set of strings. This work focuses on the construction of a generalized suffix tree through distributed computing, leveraging the MapReduce processing framework. As part of the processing, the frequency of occurrence of each suffix is recorded and the probabilities of the nodes are computed. The resulting tree is known as the probabilistic generalized suffix tree (PGST).
Proposed approach
Two phase probabilistic generalized suffix tree (PGST) construction basics
1. The first phase computes all suffixes.
2. The second phase constructs the tree from the suffixes computed earlier.
The two-phase design is intuitive to implement under the distributed MapReduce framework. The first phase is implemented by the mapper module and the second phase by the reducer. The process is described in the next section. In the first phase, the suffixes of each \(S_{i} \in S\) are calculated and put in a map whose key is the suffix itself and whose value is the number of times it occurs in S. The processing is described in Algorithm 1.
Suffixes and their occurrence count
i  x_{i}  Suffix  Count 

1  x_{1}  \(e_{1} ,e_{2} ,e_{3} ,e_{4} ,{\$}\)  2 
2  x_{2}  \(e_{2} ,e_{3} ,e_{4} ,{\$}\)  2 
3  x_{3}  \(e_{3} ,e_{4} ,{\$}\)  4 
4  x_{4}  \(e_{4} ,{\$}\)  4 
5  x_{5}  \({\$}\)  4 
6  x_{6}  \(e_{1} ,e_{5} ,e_{3} ,e_{4} ,{\$}\)  1 
7  x_{7}  \(e_{5} ,e_{3} ,e_{4} ,{\$}\)  1 
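A minimal sketch of the first phase follows: every suffix of every trip is used as a key and its occurrence count as the value. The four trips below are an assumption, chosen only because they reproduce the counts in the table; the function name `count_suffixes` is likewise hypothetical.

```python
from collections import Counter

TERMINAL = "$"


def count_suffixes(trips):
    """Map each suffix (as a tuple ending in the terminal) to its frequency."""
    counts = Counter()
    for trip in trips:
        seq = tuple(trip) + (TERMINAL,)
        for start in range(len(seq)):
            counts[seq[start:]] += 1
    return counts


# Assumed input trips, selected so that the counts match the table above.
trips = [
    ["e1", "e2", "e3", "e4"],
    ["e3", "e4"],
    ["e1", "e2", "e3", "e4"],
    ["e1", "e5", "e3", "e4"],
]

suffix_counts = count_suffixes(trips)
for suffix, count in suffix_counts.items():
    print(",".join(suffix), count)
```

Using the suffix itself as the key is what later allows partial counts from independent workers to be combined by simple addition.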
At any point of time, the height of the tree satisfies \(h \le \max_{S_{i} \in S} |S_{i}|\), the length of the longest suffix. The total number of suffixes to be inserted in the tree is \(n = \sum_{S_{i} \in S} |S_{i}|\). Each insertion starts at the root and can go down to a leaf, i.e. the number of nodes traversed can be up to \(h\). So the complexity of this phase is \(O(nh) \le O(n^{2})\), which dominates the overall complexity.
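The second phase can be sketched as repeated insertion of counted suffixes into a tree. This sketch makes two simplifying assumptions that are not taken from the paper: the trie is left uncompacted (one symbol per edge), and the probability of an edge is taken to be the child's count divided by its parent's count. The names `Node`, `build_pgst`, `edge_probability` and `height` are hypothetical.

```python
class Node:
    """Trie node holding how many suffix occurrences pass through it."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_pgst(suffix_counts):
    """Insert each suffix with its frequency; each insertion walks at most h nodes."""
    root = Node()
    for suffix, count in suffix_counts.items():
        root.count += count
        node = root
        for symbol in suffix:
            node = node.children.setdefault(symbol, Node())
            node.count += count
    return root


def edge_probability(parent, symbol):
    """Assumed normalization: child count relative to the parent's count."""
    return parent.children[symbol].count / parent.count


def height(node):
    """Number of edges on the longest root-to-leaf branch."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children.values())


suffix_counts = {
    ("e1", "e2", "e3", "e4", "$"): 2,
    ("e2", "e3", "e4", "$"): 2,
    ("e3", "e4", "$"): 4,
    ("e4", "$"): 4,
    ("$",): 4,
    ("e1", "e5", "e3", "e4", "$"): 1,
    ("e5", "e3", "e4", "$"): 1,
}

root = build_pgst(suffix_counts)
e1 = root.children["e1"]
print(e1.count, edge_probability(e1, "e2"), edge_probability(e1, "e5"), height(root))
```

Because the loop body touches one node per symbol, the work per suffix is bounded by the tree height, which is the source of the \(O(nh)\) bound.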
Distributed construction of PGST tree
MapReduce is a programming technique for processing large data sets with a parallel, distributed algorithm on a cluster. MapReduce works in two phases: the map phase and the reduce phase. Each phase has 〈key, value〉 pairs as input and output. The mapper is a function that iterates over the input data set and transforms the data into key-value pairs [33]. In order to achieve parallelism, the framework library splits the huge input data set into multiple chunks of typically 16–64 MB, based on the configuration provided [34]. Multiple copies of the mapper module are then spawned over different nodes in the cluster, each of which processes a chunk of the input data. Once the mappers finish computing key-value pairs of the input data, it is the responsibility of the MapReduce framework to group all intermediate values associated with the same intermediate key and pass them to the reducer function. The reducer is a function that is triggered after all the mappers have finished and the intermediate 〈key, value〉 pairs have been aggregated. The reducer function iterates over the key set and, for the set of values associated with each key, reduces them to a possibly smaller set of values [35]. In summary, the reducer merges all intermediate values associated with the same key. MapReduce runs on a huge cluster composed of commodity computing nodes and is highly scalable. This kind of scalability is known as horizontal scalability.
The two-phase suffix tree construction technique described in the previous section is executed under the MapReduce framework to achieve horizontal scalability. The motive for designing a two-phase algorithm for suffix tree construction was that it makes the technique easily portable to the MapReduce framework. The first phase, in which the contexts (suffixes) are calculated, is executed by the mapper module, and the second phase, which constructs the suffix tree from the context strings, is executed by the reducer module.
All suffixes and their counts computed by mapper x
i  x_{i}  Suffix  Count 

1  x_{1}  \({\text{e}}_{1} ,{\text{e}}_{2} ,{\text{e}}_{3} ,{\text{e}}_{4} ,{\$}\)  1 
2  x_{2}  \({\text{e}}_{2} ,{\text{e}}_{3} ,{\text{e}}_{4} ,{\$}\)  1 
3  x_{3}  \({\text{e}}_{3} ,{\text{e}}_{4} ,{\$}\)  2 
4  x_{4}  \({\text{e}}_{4} ,{\$}\)  2 
5  x_{5}  \({\$}\)  2 
All suffixes and their frequency counts computed by mapper y
j  y_{j}  Suffix  Count 

1  y_{1}  \({\text{e}}_{1} ,{\text{e}}_{2} ,{\text{e}}_{3} ,{\text{e}}_{4} ,{\$}\)  1 
2  y_{2}  \({\text{e}}_{2} ,{\text{e}}_{3} ,{\text{e}}_{4} ,{\$}\)  1 
3  y_{3}  \({\text{e}}_{3} ,{\text{e}}_{4} ,{\$}\)  2 
4  y_{4}  \({\text{e}}_{4} ,{\$}\)  2 
5  y_{5}  \({\$}\)  2 
6  y_{6}  \({\text{e}}_{1} ,{\text{e}}_{5} ,{\text{e}}_{3} ,{\text{e}}_{4} ,{\$}\)  1 
7  y_{7}  \({\text{e}}_{5} ,{\text{e}}_{3} ,{\text{e}}_{4} ,{\$}\)  1 
The mapper function iterates over the input sequences and prepares intermediate 〈key, value〉 pairs. Intermediate results are buffered in the memory of the mapper worker and periodically written to the distributed file system.
Result of merging of intermediate key/value pairs by MapReduce framework
i  z_{j}  Key (k)  Values (v)  〈K, 〈value_list〉〉 

1  z_{1}  \(e_{1} ,e_{2} ,e_{3} ,e_{4} ,{\$}\)  1, 1  〈\(e_{1} ,e_{2} ,e_{3} ,e_{4} ,{\$}\), 〈1, 1〉〉 
2  z_{2}  \(e_{2} ,e_{3} ,e_{4} ,{\$}\)  1, 1  〈\(e_{2} ,e_{3} ,e_{4} ,{\$}\), 〈1, 1〉〉 
3  z_{3}  \({\text{e}}_{3} ,{\text{e}}_{4} ,{\$}\)  2, 2  〈\(e_{3} ,e_{4} ,{\$}\), 〈2, 2〉〉 
4  z_{4}  \(e_{4} ,{\$}\)  2, 2  〈\(e_{4} ,{\$}\), 〈2, 2〉〉 
5  z_{5}  \({\$}\)  2, 2  〈\({\$}\), 〈2, 2〉〉 
6  z_{6}  \(e_{1} ,e_{5} ,e_{3} ,e_{4} ,{\$}\)  1, 0  〈\(e_{1} ,e_{5} ,e_{3} ,e_{4} ,{\$}\), 〈1, 0〉〉 
7  z_{7}  \(e_{5} ,e_{3} ,e_{4} ,{\$}\)  1, 0  〈\(e_{5} ,e_{3} ,e_{4} ,{\$}\), 〈1, 0〉〉 
Computation of sum of value_list associated with keys by reducer function
i  z_{j}  〈K, 〈value_list〉〉  〈K, sum (values)〉 

1  z_{1}  〈\(e_{1} ,e_{2} ,e_{3} ,e_{4} ,{\$}\), 〈1, 1〉〉  〈\(e_{1} ,e_{2} ,e_{3} ,e_{4} ,{\$}\), 〈2〉〉 
2  z_{2}  〈\(e_{2} ,e_{3} ,e_{4} ,{\$}\), 〈1, 1〉〉  〈\(e_{2} ,e_{3} ,e_{4} ,{\$}\), 〈2〉〉 
3  z_{3}  〈\(e_{3} ,e_{4} ,{\$}\), 〈2, 2〉〉  〈\(e_{3} ,e_{4} ,{\$}\), 〈4〉〉 
4  z_{4}  〈\(e_{4} ,{\$}\), 〈2, 2〉〉  〈\(e_{4} ,{\$}\), 〈4〉〉 
5  z_{5}  〈\({\$}\), 〈2, 2〉〉  〈\({\$}\), 〈4〉〉 
6  z_{6}  〈\(e_{1} ,e_{5} ,e_{3} ,e_{4} ,{\$}\), 〈1, 0〉〉  〈\(e_{1} ,e_{5} ,e_{3} ,e_{4} ,{\$}\), 〈1〉〉 
7  z_{7}  〈\(e_{5} ,e_{3} ,e_{4} ,{\$}\), 〈1, 0〉〉  〈\(e_{5} ,e_{3} ,e_{4} ,{\$}\), 〈1〉〉 
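The flow shown in the tables above (two mappers, grouping by key, summation in the reducer) can be simulated in a single process. This is a sketch and not Hadoop code: the functions `mapper`, `shuffle` and `reducer` and the split of trips into `chunk_x` and `chunk_y` are assumptions chosen to reproduce the tabulated counts, and the tree construction step of the reducer is omitted. Where the tables show a 0 for a mapper that never saw a key, this sketch simply emits no value.

```python
from collections import Counter, defaultdict

TERMINAL = "$"


def mapper(trips):
    """Emit (suffix, local_count) pairs for one input chunk."""
    local = Counter()
    for trip in trips:
        seq = tuple(trip) + (TERMINAL,)
        for start in range(len(seq)):
            local[seq[start:]] += 1
    return list(local.items())


def shuffle(mapper_outputs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Sum the partial counts reported for one suffix."""
    return key, sum(values)


# Assumed chunks: each one reproduces the per-mapper table above.
chunk_x = [["e1", "e2", "e3", "e4"], ["e3", "e4"]]
chunk_y = [["e1", "e2", "e3", "e4"], ["e1", "e5", "e3", "e4"]]

grouped = shuffle([mapper(chunk_x), mapper(chunk_y)])
totals = dict(reducer(key, values) for key, values in grouped.items())
for suffix, total in totals.items():
    print(",".join(suffix), grouped[suffix], total)
```

Since addition is associative and commutative, the totals do not depend on how the trips are split across mappers, which is what allows the first phase to scale out.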
Route prediction
Route prediction related work and literature
Approaches to route prediction found in the literature can be categorized into two groups: spatiotemporal correlation methods and statistical methods. The authors in [3] presented a spatiotemporal correlation based route prediction technique. Trip similarity is established and trips are clustered using a hierarchical agglomerative clustering approach. Two route prediction approaches are proposed. In the first approach, the closest match to a partial trajectory is computed on traversal of each new edge, and the next link is forecast as the next edge of the closest matching historical trip.
In the second approach, a threshold match algorithm returns a route and a confidence measure given the partial trajectory traveled so far. Horizontal scalability is not addressed. The authors in [36] proposed urban traffic amount prediction for route guidance systems. The prediction is based on the spatiotemporal correlation of the road network. It does not make use of historical travel patterns of traffic but is purely based on the current state of the system.
Laasonen [37] proposed a route prediction system using cellular data. The geographical area is divided into cellular regions. The mobility of a user is defined in terms of movement between cellular regions, and the mobility vector is composed of the IDs of cellular regions. The user location, i.e. the cellular region, is predicted based on the historical travel pattern of the user between the cellular regions. It does not make use of digitized road networks. None of these algorithms addresses scalability issues.
Statistical approaches are based on variable order Markov models (VMM). Various VMM approaches such as context tree weighting (CTW), prediction by partial match (PPM), probabilistic suffix tree (PST) and Lempel–Ziv (LZ78) algorithms for prediction of sequences based on partial sequences are well explored [10]. Sequence predictions were evaluated for protein sequences, text data, music files etc. Route prediction is not covered by that work, and scalability issues are not addressed. The work presented in [38] summarized various statistical approaches to route prediction: Markov model (MM), hidden Markov model (HMM) and variable order Markov model (VMM). In the proposed approach, the issue of horizontal scalability is addressed.
Route prediction using probabilistic suffix tree (PST)
Problem statement
Given a partial trajectory traveled \((x_{t^{0}}, y_{t^{0}}, t^{0}), (x_{t^{1}}, y_{t^{1}}, t^{1}), \ldots, (x_{t^{n}}, y_{t^{n}}, t^{n})\), predict the next link σ ∈ E on the road network based on the historical travel pattern.
Given a partial trajectory, the first step is to map the trajectory onto the road network using map matching (as described earlier): \(f((x_{t^{0}}, y_{t^{0}}, t^{0}), (x_{t^{1}}, y_{t^{1}}, t^{1}), \ldots, (x_{t^{n}}, y_{t^{n}}, t^{n})) \to e_{i}, e_{i+1}, \ldots, e_{i+k}\), where \(e_{i}, e_{i+1}, \ldots, e_{i+k} \in E\).
As defined above, f is the map matching function, which maps an ordered, time-stamped sequence of location traces to edges of the road network. Hence route prediction is a function R that takes the input trajectory \(e_{i}, e_{i+1}, \ldots, e_{i+k}\) and predicts the next link σ on the road, represented as \(R(e_{i}, e_{i+1}, \ldots, e_{i+k}) \to \sigma \in E\).
The probabilistic suffix tree captures the Markov contexts of every order from 1 to k over road segment sequences in order to predict the next road segment. In this case, k is the length of the longest trip. The length of the longest branch in the tree, and hence the height of the tree, is O(k).
Scenario I: The input trajectory is ε. This is the case when the input trajectory is empty and none of the edges of the road network graph has been traversed yet. According to the suffix tree, the candidate edges for travel are \(\{e_{1}, e_{2}, e_{3}, e_{4}, e_{5}\}\). The probabilities of the candidates are \(p(e_{1} \mid \epsilon) = \frac{3}{10}\), \(p(e_{2} \mid \epsilon) = \frac{2}{10}\), \(p(e_{3} \mid \epsilon) = \frac{4}{10}\), \(p(e_{5} \mid \epsilon) = \frac{1}{10}\). Hence \(R(\epsilon) \to e_{3}\), i.e. \(e_{3}\) is the predicted edge.
Scenario II: The input trajectory is \(\{e_{2}\}\). This is the case when the input trajectory is of length 1. According to the suffix tree, the only candidate edge for travel is \(\{e_{3}\}\). The probability of the candidate is \(p(e_{3} \mid e_{2}) = \frac{2}{2} = 1\). Hence \(R(e_{2}) \to e_{3}\), i.e. \(e_{3}\) is the predicted edge.
Scenario III: The input trajectory is \(\{e_{1}\}\). According to the suffix tree, the candidate edges for travel are \(\{e_{2}, e_{5}\}\). The probabilities of the candidates are \(p(e_{2} \mid e_{1}) = \frac{2}{3}\) and \(p(e_{5} \mid e_{1}) = \frac{1}{3}\). Hence \(R(e_{1}) \to e_{2}\), i.e. \(e_{2}\) is the predicted edge.
Scenario IV: The input trajectory is \(\{e_{1}, e_{5}\}\). According to the suffix tree, the candidate edge for travel is \(\{e_{3}\}\). The probability of the candidate is \(p(e_{3} \mid e_{1}, e_{5}) = 1\). Hence \(R(e_{1}, e_{5}) \to e_{3}\), i.e. \(e_{3}\) is the predicted edge.
Scenario V: The input trajectory is \(\{e_{3}, e_{4}\}\). According to the suffix tree, the set of candidate edges for travel is \(\emptyset\). Hence \(R(e_{3}, e_{4}) \to \epsilon\), i.e. prediction is not possible. This can happen in two situations:
a. The traveler has reached the destination.
b. The traveler has taken a new route that the model has not observed earlier.
Scenario VI: All of the above cases focus on predicting the next edge, one hop ahead. The same model can be used to predict an end-to-end path as well. The input trajectory is \(\epsilon\). As in Scenario I, the next edge selected is \(e_{3}\). From \(e_{3}\) the next probable edge is \(e_{4}\), and so on.
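The traversal used in Scenarios II to V can be sketched as follows. The training trips, the uncompacted trie and the normalization (child count divided by the count of the matched node) are assumptions made for illustration; the sketch does not attempt to reproduce the empty-context probabilities of Scenario I, which depend on how counts at the root are normalized. The names `build_tree`, `predict_next` and `predict_path` are hypothetical.

```python
TERMINAL = "$"


def new_node():
    return {"count": 0, "children": {}}


def build_tree(trips):
    """Insert every suffix of every trip into an uncompacted counting trie."""
    root = new_node()
    for trip in trips:
        seq = list(trip) + [TERMINAL]
        for start in range(len(seq)):
            root["count"] += 1
            node = root
            for symbol in seq[start:]:
                node = node["children"].setdefault(symbol, new_node())
                node["count"] += 1
    return root


def predict_next(root, partial):
    """Return (edge, probability) of the most likely next edge, or (None, 0.0)."""
    node = root
    for edge in partial:
        node = node["children"].get(edge)
        if node is None:  # route never observed during training
            return None, 0.0
    candidates = {
        edge: child["count"]
        for edge, child in node["children"].items()
        if edge != TERMINAL
    }
    if not candidates:  # every observed trip ended at this point
        return None, 0.0
    best = max(candidates, key=candidates.get)
    return best, candidates[best] / node["count"]


def predict_path(root, partial, max_hops=20):
    """Repeat one-hop prediction until no further edge can be predicted."""
    path = list(partial)
    predicted = []
    for _ in range(max_hops):
        edge, _probability = predict_next(root, path)
        if edge is None:
            break
        path.append(edge)
        predicted.append(edge)
    return predicted


# Assumed training trips, consistent with the suffix counts tabulated earlier.
trips = [
    ["e1", "e2", "e3", "e4"],
    ["e3", "e4"],
    ["e1", "e2", "e3", "e4"],
    ["e1", "e5", "e3", "e4"],
]
root = build_tree(trips)

print(predict_next(root, ["e2"]))
print(predict_next(root, ["e1"]))
print(predict_next(root, ["e1", "e5"]))
print(predict_next(root, ["e3", "e4"]))
print(predict_path(root, ["e1"]))
```

The `max_hops` guard is a defensive choice for the multi-hop case, so that a cyclic pattern in the training data cannot cause an unbounded prediction loop.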
Implementation and evaluation
1. Map data and
2. GPS location traces.
Map dataspatial road network data

Open layers (maps formed by fetching data from data base) and

Base map images (Google Map Images or OSM Map Images can be used).
GPS location traces data
Future work
The proposed end-to-end route prediction application is based on the probabilistic generalized suffix tree (PGST) model. Alternative statistical modeling approaches exist in the literature and are applied in different fields of study such as text mining, speech recognition, bioinformatics and string matching. A few such methods are CTW (context tree weighting), PPM (prediction by partial match) and LZ78. To the best of our knowledge, horizontal scalability has not been resolved for them. These alternatives can be tried out to see whether they have the potential to further enhance accuracy and computational efficiency. Beyond statistical methods, there are approaches such as clustering, association rule mining and neural networks. Very little research has been done on their applicability to route prediction, and horizontally scalable versions of them are still a challenge.
Conclusion
In this research work, the focus was on building a horizontally scalable end-to-end application for route prediction. Raw GPS traces are converted to trips composed of edges of the digitized road network by map matching. A MapReduce based parallel version of map matching is implemented to overcome the huge computational time required during this step. HBase is used for storage and the MapReduce framework for performing map matching. This reduces both the storage requirement and the computational time. The next phase is model construction, for which the probabilistic generalized suffix tree (PGST) is leveraged. In this work, the PGST is used for route prediction. The challenging part is the scalability of PGST construction. The MapReduce framework is employed to achieve horizontal scalability. A two-step process is followed, which is intuitive to implement in the MapReduce framework. The end-to-end horizontally scalable application is designed and implemented for route prediction. Data visualization is equally important and is therefore implemented using an open source spatial data visualization tool integrated with the application. All tools and technologies used are open source. Experiments are performed on real datasets and snapshots are taken from real data.
Declarations
Authors’ contributions
VST and AA discussed the idea of PST and PGST with respect to route prediction and its implementation aspects. VST has implemented the idea and contributed towards the first draft of the paper under the guidance of AA. AA thoroughly proofread the manuscript and made all vital corrections. Both authors read and approved the final manuscript.
Authors’ information
Vishnu Shankar Tiwari is a postgraduate (Master of Technology, M. Tech.) in Computer Engineering from the Department of Computer Engineering, Indian Institute of Technology (IIT) Bombay, Mumbai, India. He also holds an M. Tech. (Computer Applications) from YMCA University of Science and Technology, India and a Master of Computer Application (MCA) from Maharshi Dayanand University, India. He has been working in the software industry for more than 8 years.
Arti Arya is Head of Department (HOD) and Professor at Department of Computer Application, PES Institute of Technology, Bangalore South Campus. She holds Ph. D in Computer Science from Faculty of Technology and Engineering, Maharshi Dayanand University, India. She has M. Tech in Computer Science from Allahabad Agricultural Institute, Master of Science (Mathematics) and Bachelor of Science (Mathematics) from Delhi University. Her areas of interests are Spatial Data Mining, Knowledge based systems, Machine Learning, Artificial Intelligence, Data Analysis. She has approx. 17 years of teaching experience (of which 10 years of research) at Undergraduate and Post Graduate level. She is Senior Member IEEE, Life Member CSI and Life Member IAENG.
Acknowledgements
We thank the Geospatial Information Science and Engineering (GISE) Lab, Indian Institute of Technology (IIT) Bombay, India, where some initial part of the work was carried out. We would also like to thank the anonymous reviewers, whose reviews helped us bring this manuscript to its current form.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
All data and material used are open source. The GPS data points are mainly from the GPS trajectory dataset collected in the (Microsoft Research Asia) Geolife project. The dataset has been made available for research by Microsoft Research since 2012 (https://geotime.com/general/geolifeproject/). The map data used is from Open street map (OSM), which is an open project (http://www.openstreetmap.org).
Consent for publication
The authors consent to the publication of this article by Springer Open.
Ethics approval and consent to participate
This is the authors’ own personal research work. The authors self-approve the ethics of this work and provide consent for participation.
Funding
This work is purely the authors’ own work, and the authors themselves bear the funding required for publishing this research work.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 Tiwari VS, Arya A, Chaturvedi SS. Route prediction using trip observations and map matching. In: Advance computing conference (IACC), 2013 IEEE 3rd international, 22–23 Feb 2013, p. 583–7.Google Scholar
 Tiwari VS, Arya A, Chaturvedi S. Framework for horizontal scaling of map matching using mapreduce. In: IEEE, 13th international conference on information technology, ICIT 2014, 22–24 Dec 2014. http://www.icit2014.in/.
 Froehlich J, Krumm J. Route prediction from trip observations. In: Society of automotive engineers (SAE) 2008 World Congress, Apr 2008, Paper 2008010201.Google Scholar
 Zhou J, Golledge R. A threestep general map matching method in the gis environment: travel/transportation study perspective. Int J Geogr Inf Syst. 2006;8(3):243–60.Google Scholar
 Greenfeld JS. Matching GPS observations to locations on a digital map. In: Proceedings of the 81st annual meeting of the transportation research board, Washington DC; 2002.Google Scholar
 Meng Y, Chen W, Chen Y, Chao JCH. A simplified mapmatching algorithm for invehicle navigation unit. Research report, Department of Land Surveying and Geoinformatics, Hong Kong Polytechnic University; 2003.Google Scholar
 Li J, Fu M. Research on route planning and mapmatching invehicle GPS/dead reckoning/electronic map integrated navigation system. In: IEEE proceedings on intelligent transportation systems; 2003, 2, p. 1639–43.Google Scholar
 Bernstein D, Kornhauser A. An introduction to map matching for personal navigation assistants, technical report. NewJersey TIDE Center Technical Report, 1996.Google Scholar
 Li Y, Huang Q, Kerber M, Zhang L, Guibas L. Largescale joint map matching of GPS traces. In: Accepted for the 21st ACM SIGSPATIAL international conference on advances in geographic information systems (ACM SIGSPATIAL GIS 2013).Google Scholar
 Begleiter R, ElYaniv R, Yona G. On prediction using variable order Markov models. J Artif Intell Res. 2004;22:385–421.MathSciNetMATHGoogle Scholar
 Comin M, Farreras M. Efficient parallel construction of suffix trees for genomes larger than main memory. In: EuroMPI’13, Proceedings of the 20th European MPI user’s group meeting; 2013.Google Scholar
 Bieganski P, Riedl J, Carlis J, Retzel EF. Generalized suffix trees for biological sequence data. In: Proceedings of the twentyseventh Hawaii international conference on biotechnology computing. p. 35–44, 1994.Google Scholar
 Mansour E, Allam A, Skiadopoulos S, Kalnis P. ERA: efficient serial and parallel suffix tree construction for very long strings. Proc VLDB Endow (PVLDB). 2011;5(1):49–60.View ArticleGoogle Scholar
 McCreight EM. A space-economical suffix tree construction algorithm. J ACM. 1976;23(2):262–72.
 Ukkonen E. On-line construction of suffix trees. Algorithmica. 1995;14(3):249–60.
 Farach-Colton M, Ferragina P, Muthukrishnan S. On the sorting complexity of suffix tree construction. J ACM. 2000;47(6):987–1011.
 Hunt E, Atkinson MP, Irving RW. A database index to large biological sequences. In: VLDB; 2001, p. 139–48.
 Bedathur SJ, Haritsa JR. Engineering a fast online persistent suffix tree construction. In: ICDE; 2004, p. 720–31.
 Cheung CF, Yu JX, Lu H. Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Trans Knowl Data Eng. 2005;17(1):90–105.
 Hariharan R. Optimal parallel suffix tree construction. In: STOC; 1994, p. 290–9.
 Landau GM, Schieber B, Vishkin U. Parallel construction of a suffix tree (extended abstract). In: ICALP; 1987, p. 314–25.
 Satish UC, Kondikoppa P, Park S, Patil M, Shah R. MapReduce based parallel suffix tree construction for human genome. In: 20th IEEE international conference on parallel and distributed systems ICPADS 2014 Hsinchu Taiwan, 2014, 201, p. 664–70.
 Ghoting A, Makarychev K. Indexing genomic sequences on the IBM blue gene. In: Proceedings of conference on high-performance computing networking, storage and analysis (SC); 2009, p. 1–11.
 Marchal F, Hackney J, Axhausen KW. Efficient map-matching of large GPS data sets—Tests on a speed monitoring experiment in Zurich. Arbeitsbericht Verkehrs- und Raumplanung, Institut für Verkehrsplanung und Transportsysteme, ETH Zürich, Zürich; 2004.
 Quddus MA. High integrity map-matching algorithms for advanced transport telematics applications, Ph.D. Thesis. UK: Centre for Transport Studies, Imperial College London; 2006.
 Quddus MA, Noland RB, Ochieng WY. A high accuracy fuzzy logic based map matching algorithm for road transport. J Intell Transp Syst. 2006;10(3):103–15.
 Weiner P. Linear pattern matching algorithms. In: SWAT; 1973, p. 1–11.
 Tata S, Hankins RA, Patel JM. Practical suffix tree construction. In: Proceedings of VLDB; 2004, p. 36–47.
 Tian Y, Tata S, Hankins RA, Patel JM. Practical methods for constructing suffix trees. VLDB J. 2005;14(3):281–99.
 Phoophakdee B, Zaki MJ. Genome-scale disk-based suffix tree indexing. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data, SIGMOD’07. New York, NY, USA: ACM, p. 833–44.
 Barsky M, Stege U, Thomo A. Suffix trees for inputs larger than main memory. Inf Syst. 2011;36(3):644–54.
 Erciyes K. Distributed and sequential algorithms for bioinformatics. Cham: Springer International Publishing; 2015.
 Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design & implementation, 06–08 Dec 2004, San Francisco, CA, p. 10.
 Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE. Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS). 2008;26(2):1–26. doi:10.1145/1365815.1365816.
 Lammel R. Google’s MapReduce programming model—Revisited. Sci Comput Program. 2008;70:1–30.
 Liang Z, Wakahara Y. Real-time urban traffic amount prediction models for dynamic route guidance systems. J Wirel Commun Netw. 2014;2014:85.
 Laasonen K. Route prediction from cellular data. In: Workshop on context-awareness for proactive systems (CAPS); 2005.
 Nagraj U, Kadam N. Vehicular route prediction in city environment based on statistical models. J Eng Res Appl. 2013;3(5):701–4.
 https://www.openstreetmap.org.
 https://www.microsoft.com/en-us/research/publication/geolife-gps-trajectory-dataset-user-guide/.