STVG: an evolutionary graph framework for analyzing fast-evolving networks
Journal of Big Data volume 6, Article number: 55 (2019)
Abstract
Sequences of graph snapshots have commonly been used in the literature to represent changes in a dynamic graph. This approach may be suitable for small-size and slowly evolving graphs; however, it incurs high storage overhead for massive and fast-evolving graphs because the entire graph is replicated from one snapshot to the next at shorter temporal resolutions. This is a drawback especially where efficient evolutionary analytics relies on the explanatory power of representing the dynamics of the graph across different temporal resolutions. In this paper, we propose a framework based on our Space–Time-varying graph (STVG) formalism, which uses the Wholegraph approach to model the dynamics of a graph such that the evolution of the graph materializes in the time-varying changes of its Projected graphs. The STVG framework provides an approach to reduce high storage overhead in massively changing graphs where new nodes and edges arrive every second. It affords the capability to extract Projected graphs at different time-windows and analyze their metrics across varying temporal resolutions. We demonstrate how the proposed STVG framework can be exploited to identify and extract evolutionary patterns in a public bus transit graph using metrics such as graph density, volume and average path length. The results reveal evolutionary patterns in the overall network density, traffic congestion density as well as graph density with respect to bus movement at hourly, daily and monthly temporal resolutions.
Introduction
Evolutionary graph analytics has attracted attention from many research communities, with the main purpose of understanding the changing patterns of real-world networks through evolutionary analysis of graph metrics and dynamic interactions between entities. Graphs of real-world networks evolve as new nodes and edges continually appear and disappear in the structure but, more importantly, their metrics such as density, average path length and network diameter also evolve. Uncovering and understanding hidden patterns in an evolving network requires evolutionary analysis of the network over different temporal resolutions. Evolutionary graph analytics has been explored for use in different types of networks including web citation and co-authorship networks [1,2,3,4], online social networks [5,6,7,8,9,10], biology and disease networks [11,12,13,14], as well as communication networks [15,16,17,18,19,20]. Not all networks evolve at the same rate; some evolve swiftly while others evolve slowly [21]. For example, in transit and mobility networks, nodes and edges are added to the graph at a temporal resolution of seconds, whereas in co-authorship networks, nodes and edges are added at a temporal resolution of months or years. Correspondingly, these cases require different approaches for the evolutionary analysis.
Networks whose graphs evolve slowly over time and space can be effectively analyzed through a sequence of graph snapshots, where snapshots of the graph at two distinct timestamps t_{1} and t_{2} can be the basis for change extraction. The entire graph can be replicated from one snapshot to another with few computational and storage limitations because of the coarse temporal resolution. The challenge is that for fast-evolving networks, where the evolutionary analysis relies on the explanatory power of representing the changes in the network across smaller temporal resolutions, the snapshot method can become ineffective due to high computation and storage demands. What is needed is an approach that reduces replication of the entire graph from one time-window to another, avoiding high storage overhead even as the graphs grow at different temporal resolutions. The Wholegraph approach presents a method where the entire evolution of a graph through time is captured in a single unabridged graph that evolves compactly as nodes and edges appear and disappear. Each node or edge has a timestamp depicting its valid time in the graph. The valid time captures when nodes, edges and their attributes are inserted into or deleted from the graph.
A disadvantage of the Wholegraph approach is the longer processing time involved in extracting time-dependent subgraphs, in contrast to the convenience that the snapshot method provides through versioning of a smaller graph volume at each time-window. This occurs mainly when the time dimension of a Wholegraph is represented by timestamps that are stored as attributes of the nodes or edges. Updates and query processing take longer because there is a massive number of node and edge attributes to be scanned. This bottleneck can be reduced by representing the time dimension as a graph, for example by using a timegraph [22]. The advantages of the Wholegraph approach over the snapshot method include no replication of static nodes and edges across time, which leads to less storage overhead. Also, longitudinal queries across time are less complex than in the snapshot approach because a simple time-dependent query can run through the entire graph, then retrieve and compare the state of the graph across different temporal resolutions.
In this paper, we present the space–time varying graph (STVG) framework for evolutionary graph analytics of fast-evolving real-world networks. The underlying graph G in our STVG is modeled based on the Wholegraph approach, where the network structure evolves in space and time in such a way that evolutionary patterns are due to the changes in the connectivity and adjacency relationships among nodes in the network. The graph grows continuously as new nodes and edges emerge through time. The Wholegraph is composed of subgraphs, including the timegraph. Subgraphs are used to facilitate the conceptual modelling of connectivity between entities in distinct spaces of the real-world network. The subgraphs are connected to the timegraph to keep track of the evolution of the Wholegraph. Projected graphs retrieve the state of the Wholegraph at any given temporal resolution. The evolution of the Wholegraph becomes visible in the space–time varying changes of the Projected graphs. Relevant graph metrics are used to retrieve the Wholegraph evolution from its Projected graphs. The STVG is therefore realized through the Wholegraph, Subgraphs and Projected graphs formalism.
We applied the STVG framework for evolutionary graph analytics of a dynamic transit network to uncover the evolving behavior of the network at various temporal resolutions. The dynamic interactions between places and the mobility characteristics of moving vehicles in a transit network present an evolving graph that changes swiftly in structure and size [23]. The analytical results show the potential of using the framework for transit and mobility trend analysis, with possible application to time-dependent transit recommendation systems.
Our research challenge consists of processing a vast volume of transit feeds continuously coming at high velocity from a large fleet of buses, but also making sure that our STVG framework captures the evolving behavior of subgraphs belonging to a Wholegraph at different time resolutions. Towards this challenge, our scientific contributions are as follows:

Previous research on evolving graphs has focussed on small complex networks. This paper contributes to the emerging research work on massive and fast-evolving networks.

Real-world networks such as transit networks change at shorter temporal and spatial resolutions, and the proposed STVG framework explores how to capture and analyse the evolution of fast-evolving graphs at different time resolutions.

This paper also contributes to the use of graph databases and graph queries to store and capture the evolutionary behavior of fast-evolving networks at different spatial and temporal resolutions with less storage overhead and less complex query processing.
The remainder of this paper is organized as follows: “Related work” section summarizes the previous research work on graph data analytics. “Modelling methods” section describes the modeling process of our approach, including the overall framework and the details. “Experimental study” section presents the experimental study and the discussion of results. Finally, conclusions are presented in “Conclusion” section.
Related work
Previous research work has applied the snapshot approach to real-world networks such as social, web-citation and co-authorship networks. In [24], a survey was carried out on time-evolving graphs of a social network with emphasis on the temporal metrics. The authors propose a general formalism to study the evolution of the temporal metrics based on a sequence of snapshots across time-windows. Graph temporal metrics such as density, average path, proximity, reachability time, centralities, clustering coefficient and conductance were discussed in this study.
In contrast, Huo and Tsotras [25] focused on the problem of efficient temporal shortest path queries on graph snapshots of a social network. The study used fixed time-window partitioning on a slowly evolving graph to generate a sequence of snapshots on which the queries are applied, one at a time. They had a total of 165 daily graph snapshots of a social network graph and adjusted the query time-windows to 5-day and 25-day intervals. The concept of temporal shortest path in this study is noted to be useful for analysis of users' historical trends, for example, to discover how close two users were in the past and how their closeness has evolved over time. This work uses ad hoc queries over a sequence of graph snapshots, but the major drawback is high storage overhead.
Some strategies have been proposed to address the storage overhead issue by using a set of deltas that store only the changes that are needed to construct a snapshot. The most efficient compression approach is known as SMFVF, which creates clusters of k snapshots, where a compressed graph has no redundancy among delta files of the same cluster, but still has redundancy between delta files of different clusters [26]. Meanwhile, Gottumukkala et al. [24] propose integrating Neo4J and Spark’s GraphX in a workflow to optimize the storage, processing and visualization of big graphs. They used this method to carry out visual analytics of an evolving social network in graph snapshots to track the evolutionary state of the graph, including its structure and topology.
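The general idea of delta-based storage can be sketched as follows. This is a minimal illustration of rebuilding a snapshot from a base graph plus a list of deltas; it is not the SMFVF method of [26], and the names `apply_delta` and `snapshot_at` as well as the representation of a delta as an (added, removed) pair of edge sets are assumptions made for the example.

```python
def apply_delta(edges, delta):
    """Return a new edge set after applying one delta.

    A delta is assumed to be a pair (added, removed) of edge sets.
    """
    added, removed = delta
    return (edges | added) - removed


def snapshot_at(base, deltas, k):
    """Rebuild the k-th snapshot by replaying the first k deltas on the base."""
    edges = set(base)
    for delta in deltas[:k]:
        edges = apply_delta(edges, delta)
    return edges


# Hypothetical data: one base snapshot and two deltas.
base = {("a", "b")}
deltas = [
    ({("b", "c")}, set()),          # delta 1: edge b->c appears
    ({("c", "a")}, {("a", "b")}),   # delta 2: c->a appears, a->b disappears
]
latest = snapshot_at(base, deltas, 2)
```

Only the changes are stored between snapshots, at the cost of replaying deltas whenever a particular snapshot is needed.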
Lerman et al. [27] worked on a different formulation of the centrality metric for evolving graph analysis that measures the number of paths that exist between source and destination nodes over time in a graph. Their model was based on a sequence of graph snapshots and was evaluated on citation graphs of scientific papers, whose dynamics occur slowly. The paper concluded that centrality measures of some articles were under- or overestimated by previous studies that did not take into account the evolutionary nature of the graph. The evolution of essential genes (central nodes) and their roles in cell survival and development were studied by Jalili et al. [14], using node centrality evolution in a sequence of graph snapshots of a biological network. Quattrociocchi et al. [28] used the time-varying graph framework proposed by Casteigts et al. [23], based on a sequence of graph snapshots, to study the dynamics of coexisting co-authorship and citation networks. The graph snapshots reveal the evolution of a scientific network extracted from a portion of the arXiv repository covering a period of 10 years of publications in Physics. It shows that scientific networks are dynamic as nodes (the scientists) join, participate, attract, compete, cooperate and disappear in the network, which affects the shape and strength of the graph’s connectivity. They discovered how the selection process of citations may change the shape of the co-authorship network from a sparser and disconnected structure to a dense and homogeneous one.
While the sequence of graph snapshots method is associated with the disadvantages of high storage overhead and computational complexity, the Wholegraph approach provides an alternative. The Wholegraph approach represents graph evolution not as a sequence of static snapshots but as an unabridged graph that evolves compactly as nodes and edges appear and disappear. This approach was used by Pereira et al. [8] to analyze evolving centrality metrics in a dynamic Twitter graph. The study reveals how closeness and betweenness centrality measures evolve in the follower/followed relationship. They found that the Twitter network is dynamic: users can assume or leave central roles in the graph through time. However, the time dimension of the graph is based on timestamp attributes of the nodes and edges of the graph. Each node and edge in the graph has a valid time at which it existed in the graph. Varying time-windows (weekly, fortnightly and monthly) were used in the analysis of the graph’s evolution. As mentioned earlier, having timestamps as attributes of the nodes and edges is inefficient in cases where there are billions of nodes and edges to scan through. Meanwhile, Kumar et al. [29] similarly examined the evolution of the Flickr and Yahoo 360 social graphs in parallel. They found similarities in the evolution patterns and properties of the two different graphs. The Wholegraph approach was also used in burst area discovery based on evolving top-k changes in a stream of fast bipartite graph evolutions of users and stories on the Digg.com website [30]. The authors, however, used the Haar wavelet tree to reduce the high computational complexity associated with a fast-evolving graph stream. Top-k burst areas are computed incrementally from small hop sizes to large hop sizes of time windows.
Recently, Qiangjuan et al. [1] proposed the supra-adjacency matrix approach to define a temporal network used to analyze the evolution of important nodes in the Enron email communication network and the DBLP co-authorship network. They treated temporal networks as a special case of multilayer networks where the emerging graph G is defined as a graph of N nodes divided into T time-windows, which therefore creates a supra-adjacency matrix of NT × NT dimensions. This approach could be suitable for a slowly evolving graph whose properties change at longer time intervals, such as in co-authorship networks, but will face massive computational challenges in a mobility graph that evolves, for example, every 5 s. A supra-adjacency matrix created at second, minute and even hourly time-windows for a period of 18 months or more will face huge computational overhead.
Overall, these studies represent a growing interest in evolutionary analytics of dynamic networks. However, the focus has been on slowly evolving social graphs, dealing mainly with small snapshots of graphs. Real-world networks such as the transit network and communication network, however, present fast-evolving graphs that change at shorter temporal and spatial resolutions. We therefore present an evolutionary graph analytics framework that is based on the Wholegraph approach to capture evolutionary patterns at various temporal resolutions. The STVG framework provides an approach to reduce high storage overhead and affords us the ability to extract Projected graphs at different time-windows and analyze their metrics across varying temporal resolutions.
Modelling methods
This section describes the main components of the STVG framework and the methodology for modeling and capturing the evolution of the graphs.
The main components of our STVG framework
In this section, we present our STVG framework which consists of three main components as illustrated in Fig. 1:

Wholegraph represents the entire content of an STVG in a given lifetime. The main purpose is to store all data based on an unabridged graph that grows and evolves through time rather than graph snapshots.

Subgraphs represent entities and their relationships in distinct non-metric spaces (e.g. place, event, mobility, and social spaces) of the Wholegraph. The main purpose is to facilitate the conceptual modelling of the connectivity between entities in distinct spaces that generate the Wholegraph.

Projected graphs represent nodes and edges that are retrieved at varying time-windows and temporal resolutions, and for a subgraph of interest, for use in the evolutionary analytics of the Wholegraph.
Figure 1 illustrates the main components of our STVG framework: Wholegraph, Subgraphs and Projected graphs. The Wholegraph G consists of all nodes and edges belonging to multiple subgraphs, \(\left( {S_{i}^{\prime } ,\;S_{j}^{\prime } , \ldots S_{n}^{\prime } } \right)\). In this case, \(S_{i}^{\prime }\) represents the nodes and edges of a mobility space, while \(S_{j}^{\prime }\) represents the nodes and edges of a geographical space. The Projected graphs at any timestamp can consist of one subgraph of interest, as illustrated at timestamps t_{i} and t_{j}, or of multiple subgraphs at the same timestamp, as shown at t_{n} in Fig. 1.
There are conceptual challenges in representing the dynamic behavior of a real-world network, especially where the evolutionary analysis relies on the explanatory power of representing interactions in the network across different spatial and temporal resolutions. An efficient modelling and analytical approach should reduce replication of the entire graph from one time-window to another to reduce high storage overhead. Our approach combines the Wholegraph, Subgraphs and Projected graphs in a framework to model and analyze the evolution of fast-evolving networks. Essential evolutionary assumptions of the Wholegraph, Subgraphs and Projected graphs must be taken into account to uncover the dynamic properties of the STVG using classical graph metrics. We therefore present in “Wholegraph”, “Subgraphs” and “Projected graphs” sections the modeling concepts of the three components of the STVG framework and their corresponding evolutionary assumptions as formulated in [23].
Wholegraph
The Wholegraph G represents the entire content in a given lifetime \(T(t_{1} - t_{n})\) of the STVG, whose nodes and edges grow through time. G consists of all the nodes and edges through time \(t_{1} - t_{n}\), \([N_{[t_{1} - t_{n}]}, E_{[t_{1} - t_{n}]}]\), where \(N_{[t_{1} - t_{n}]}\) and \(E_{[t_{1} - t_{n}]}\) are the sets of all node and edge instances of the Wholegraph stored in a database. Conceptually, the Wholegraph G has two dimensions, namely the Space dimension and the Time dimension, which are represented as (S, T) where S is a digraph and T is a tree graph. S is composed of subgraphs \(\left( {S_{1}^{\prime } ,\;S_{2}^{\prime } , \ldots S_{n}^{\prime } } \right)\), and each subgraph is a subset of the Wholegraph G in the STVG. Each node or edge in S has a timestamp depicting its valid time in the graph. This is used to capture the insertion of new nodes, edges and attributes or the deletion of nodes, edges and their attributes, and to track the evolution of the graph G.
The evolutionary assumptions underlying the Wholegraph G are described based on the following:

Vertex-centric evolution: The evolution of the Wholegraph materializes in the dynamic changes within a node’s neighborhood. This is an important assumption to enable the capability to extract both global and local evolution of the graph’s properties.

Complete graph connectivity: At any point in time, a node in G must have an edge connection with an existing node. This assumption is essential to support global reachability across the entire graph, which is important in querying the global evolution of a real-world network at any point in time. This is one of the advantages of capturing evolution in a dynamic graph based on the Wholegraph formalism instead of as snapshots.

Periodicity of nodes: The assumption holds in practice in many real-world networks. More specifically, it provides the ability to exploit STVG properties using nodal graph metrics that are generated by the dynamics of a network (e.g. recurrent existence of nodes). This assumption is also important in defining temporal resolutions at which the dynamics of the graph can be discovered.
Subgraphs
Subgraphs \(\left( {S_{1}^{\prime } ,\;S_{2}^{\prime } , \ldots S_{n}^{\prime } } \right)\) are used to facilitate conceptual modelling of connectivity within and between entities in distinct spaces of a real-world network. However, in the database, the graph G is not stored in subgraphs but as a Wholegraph with complete connectivity between entities. It is also important to point out that subgraphs are not graph snapshots since they require constant and complementary connectivity (at any point in time, the nodes belonging to one subgraph are linked to the nodes belonging to another subgraph) as well as eventual adjacency (at any point in time, a target node will be sequentially linked to a source node in the same subgraph).
Multiple subgraphs \(S^{\prime}\) can be defined such that \(S = \left( {S_{1}^{\prime } ,\;S_{2}^{\prime } , \ldots S_{n}^{\prime } } \right)\), each consisting of nodes and edges that belong to a specific type of entity in a real-world network such as places, people, events and so on. Nodes belonging to different subgraphs \(S_{i}^{\prime } \subset S\) and \(S_{j}^{\prime } \subset S\) are linked by complementary connectivity edges since for every node in a subgraph there is always a corresponding node in another subgraph that it is related to. However, only one subgraph \(S_{i}^{\prime }\) is connected to T, which is another subgraph of G (the timegraph) that represents the natural levels of temporal resolution. Nodes belonging to the same subgraph \(S_{i}^{\prime } \subset S\) are linked by eventual adjacency edges (“NEXT”) between an existing node (source node) and a future node (target node), as described in Fig. 2.
Essentially, T is a hierarchical (tree) structure where a root node is followed by year nodes, month nodes, day nodes, and further temporal resolutions, such that each node of a subgraph \(S_{i}^{\prime }\) links to a leaf node of T. This is an important constraint to support the capturing of vertex-centric evolution of the graph G at any defined temporal resolution. Similarly, every node in the subgraphs has a label and properties, which are essential for label-based projections. Labels assign roles (e.g. “Moving” representing the state of a moving entity) and objects (e.g. “Street” and “Intersection”). Multiple properties can also be attached to the nodes and edges of a subgraph \(S_{i}^{\prime }\); these are attributes (\(a_{1} \ldots a_{n}\)) containing semantic information about the entity.
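The hierarchical structure of T can be illustrated with a small sketch. It assumes a nested dictionary as a stand-in for the timegraph and an hourly leaf level; the function names `time_tree_path` and `attach` and the node identifiers are hypothetical, and a graph database would represent the same hierarchy with nodes and edges instead.

```python
from datetime import datetime


def time_tree_path(ts):
    """Return the root-to-leaf path of (level, value) pairs for a timestamp."""
    return [
        ("year", ts.year),
        ("month", ts.month),
        ("day", ts.day),
        ("hour", ts.hour),  # assumed leaf level for this sketch
    ]


def attach(tree, ts, node_id):
    """Link node_id to the leaf of the time tree that matches ts."""
    level = tree
    for key in time_tree_path(ts):
        # Create the branch on first use, then descend into it.
        level = level.setdefault(key, {})
    level.setdefault("nodes", []).append(node_id)


tree = {}
attach(tree, datetime(2016, 6, 1, 8, 15), "move-1")
attach(tree, datetime(2016, 6, 1, 8, 45), "move-2")
attach(tree, datetime(2016, 6, 2, 9, 0), "move-3")

# All nodes attached to the leaf for 1 June 2016, 08:00-08:59.
leaf = tree[("year", 2016)][("month", 6)][("day", 1)][("hour", 8)]
```

Retrieving the nodes for a coarser resolution (a day or a month) then amounts to walking a higher branch of the tree rather than scanning timestamp attributes on every node.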
The evolutionary assumptions underlying the Subgraphs \(S_{i}^{\prime }\) are described based on the following:

Node connectivity over time: At any point in time, at least one node belonging to a subgraph \(S_{i}^{\prime }\) can be reached by all the other nodes in S. This assumption is essential to support any graph query/algorithm whose input is spread over all nodes generated at different time instances, with at least one node capable of generating the output.

No recurrence of edges over time: An edge is created only once at any point in time. This assumption is critical to maintain the eventual adjacency relationships and thus to perform any computation of a graph metric such as shortest path and centrality.

Constant connectivity over time: Subgraphs are in constant connectivity between themselves; however, only one subgraph \(S_{i}^{\prime }\) and T are always linked at any point in time. There must exist an edge between a node belonging to a subgraph \(S_{i}^{\prime } \subset S\) and a leaf node of T.

Eventual adjacency: There is always a future target node in subgraph \(S_{i}^{\prime } \subset S\) that a source node will be linked to.
Conceptually, the subgraphs are essential for conveniently defining the nodes and edges that are projected for evolutionary analysis. The Projected graphs can be locally extracted from a single subgraph or globally from all the subgraphs.
Projected graphs
Projected graphs are vital for retrieving nodes and edges that define the state of the Wholegraph at varying temporal resolutions and time-windows, based on subgraphs of interest. Projected graphs are the key graphs on which graph metrics are computed for evolutionary analytics of the Wholegraph G. From the vertex-centric standpoint, the evolution of the Wholegraph G materializes in the time-varying changes of its Projected graphs, PG_{(Ʈ)}. Depending on the network and application, varying resolutions of time can be defined to appropriately capture the evolutionary behavior of the graph. At the extreme, each interval could correspond to the smallest time unit in the dataset or to the time between any two consecutive modifications of the graph [31]. In some cases, every Projected graph of the Wholegraph G becomes equivalent to an evolving graph model [32].
The time Ʈ of the Projected graph can represent temporal resolutions of the graph such as seconds, minutes, hours, days, weeks, months and years. It can also represent time-windows such as a 2-week window, a one-month window, and a 2-year window. Time-windows can also be event-based, for example to capture changes in the network through the period of a snow storm or other event.
The time-window can be a sliding window or a tumbling window depending on the application. A sliding window defines a time period that goes back in time from the present. For instance, a sliding window of 1 h includes nodes and edges that have appeared in the last 1 h. In a tumbling window, nodes and edges are grouped in a single window based on time. For example, in a tumbling window with a length of 2 h, the first window (w_{1}) contains nodes and edges that appeared between the zeroth and second hours. The second window (w_{2}) contains nodes and edges that appeared between the second and fourth hours and the third window (w_{3}) contains nodes and edges that appeared between the fourth and sixth hours. In this case, the Wholegraph G is projected every 2 h, and none of the windows overlap; each Projected graph represents a distinct and unique state of the Wholegraph.
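The difference between the two window types can be sketched as follows. Times are expressed in hours since the start of the graph's lifetime, and window boundaries are treated as half-open intervals; both choices, like the function names, are assumptions of this sketch, since the text does not say to which window an element that appears exactly on a boundary belongs.

```python
def tumbling_window_index(t_hours, length_hours=2.0):
    """1-based index of the tumbling window that contains time t_hours.

    Windows are assumed half-open: w_1 = [0, 2), w_2 = [2, 4), ...
    """
    return int(t_hours // length_hours) + 1


def in_sliding_window(t_hours, now_hours, length_hours=1.0):
    """True if t_hours falls in the last length_hours before now_hours."""
    return now_hours - length_hours < t_hours <= now_hours


# An edge that appeared 3.5 h into the lifetime belongs to w_2 only.
window_of_edge = tumbling_window_index(3.5)

# With "now" at hour 10, a 1 h sliding window keeps 9.5 but not 8.9.
recent = [t for t in (8.9, 9.5, 10.0) if in_sliding_window(t, 10.0)]
```

Every timestamp maps to exactly one tumbling window, which is why tumbling Projected graphs never overlap, whereas a sliding window is re-evaluated relative to the current time.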
The evolutionary assumptions underlying the Projected graphs can be described based on the following:

Recurrence of nodes: A node of a Projected graph PG_{(Ʈ)} can reoccur in different Projected graphs at different time-windows, \(w_{i} , w_{j} \ldots w_{k}\), even in the case of tumbling time-windows. This assumption is important because some entities (nodes) in a real-world network (e.g. places) do not change, but other entities and the connectivity around them do change. Such a node will appear in more than one time-window because its neighborhood changes across time.

No recurrence of edges: In tumbling time-windows, an edge in a Projected graph \(PG_{{\left( {t_{i} } \right)}}\) cannot reoccur in another Projected graph because connectivity in each Projected graph is distinct and unique.
Capturing the evolution of STVG
Graph metrics are generally used to uncover the dynamics of the Wholegraph globally or dynamics in the graph locally over the Projected graphs. The process of capturing the STVG evolution involves a pipeline of steps that can be described as follows:

Step 1: Define a time-window Ʈ ⊂ T for analysis, where Ʈ is the selected time-window and T is the entire lifetime of the Wholegraph.

Step 2: Define the subgraph(s) of interest S′ ⊂ S, where S′ is the selected subgraph(s) and S is the entire space dimension of the Wholegraph.

Step 3: Define the temporal resolution \(t_{i}\) ⊂ Ʈ for iterative projections. For each \(t_{i}\) in Ʈ:

Step 4: Project the graph \(PG_{{\left( {t_{i} } \right)}}\), where \(PG_{{\left( {t_{i} } \right)}}\) is the Projected graph at temporal resolution \(t_{i}\). For each Projected graph \(PG_{{\left( {t_{i} } \right)}}\):

Step 5: Compute graph metrics \(M_{i}\), where \(M_{i}\) is a graph metric (e.g. graph density, volume and average path length).

Step 6: Compare and correlate consecutive \(M_{i}\) values and compute the delta ∆ between them.

Step 7: Identify and analyse evolutionary patterns in the ∆ values.
These steps allow a user to define a time-window within the lifetime of the Wholegraph where the evolutionary analysis is required. Subgraphs of interest can also be defined such that the evolutionary analysis involves only one or more subgraphs in the space dimension of the network. After defining the time-windows and subgraphs of interest, the temporal resolution (such as seconds, minutes, hours, days, months and years) is defined. This is the temporal resolution at which the Wholegraph is projected, and the evolution is captured using selected graph metrics. These steps are captured in an algorithm described in Table 1.
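As an illustration only (this is not the algorithm of Table 1), the pipeline can be sketched in a few lines. The sketch assumes the Wholegraph is available as a list of timestamped directed edges `(u, v, t)`, that Step 2 has already been applied by filtering the edges to the subgraph of interest, and that projections are tumbling and half-open; the function names are hypothetical.

```python
def project(edges, start, end):
    """Step 4: edges (u, v) whose timestamp t falls in [start, end)."""
    return [(u, v) for u, v, t in edges if start <= t < end]


def density(edge_list):
    """Step 5 (one possible metric): |E| / (|N| * (|N| - 1))."""
    nodes = {n for e in edge_list for n in e}
    n = len(nodes)
    if n < 2:
        return 0.0
    return len(set(edge_list)) / (n * (n - 1))


def evolution(edges, window_start, window_end, resolution, metric):
    """Steps 1, 3-6: metric per Projected graph and deltas between them."""
    values = []
    t = window_start
    while t < window_end:
        pg = project(edges, t, min(t + resolution, window_end))
        values.append(metric(pg))
        t += resolution
    # Step 6: delta between consecutive metric values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values, deltas


# Hypothetical edges with timestamps in hours.
edges = [("a", "b", 0), ("b", "c", 1), ("a", "c", 2), ("c", "a", 3)]
values, deltas = evolution(edges, 0, 4, 2, density)
```

Step 7 would then inspect `values` and `deltas` for trends; in a graph database the `project` step would typically be a query against the timegraph rather than a scan over edges.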
The algorithmic workflow can be implemented in a graph query to project the state of the Wholegraph and compute the graph metrics at varying temporal resolutions within any given time-window. Time-series analysis methods can be used thereafter to analyze the computed graph metric values across time for evolutionary behavior of the graph [33]. For example, if the graph density at time \(t_{i}\) is \(D_{t_{i}}\), at time \(t_{j}\) is \(D_{t_{j}}\), and it continually evolves throughout the entire graph lifetime, \((D_{t_{i}}, \ldots, D_{t_{n}})\), we can use a time-series analysis method to depict the evolution of the graph’s density at different temporal resolutions.
Graph metrics
Graph metrics such as network density, network diameter and average path length have been utilized in this case study to retrieve the evolutionary pattern of the transit network structure at different temporal resolutions over monthly and yearly timewindows.
Graph density
One of the most commonly used metrics for observing evolution in network structure is the graph density. In this experimental study, the graph density of each Projected graph \(PG_{{\left( {t_{i} } \right)}}\) in the STVG measures how close a given \(PG_{{\left( {t_{i} } \right)}}\) is to a complete graph, that is, the ratio between the number of its edges and the number of all possible edges between n nodes.
However, in a practical sense, the true graph density is \(\rho (G) = {\text{Mass}}(G)/{\text{Volume}}(G)\), where the Mass of the graph G is the total mass of its edges and nodes, and Volume(G) is a size-like characteristic of the graph [34].
\({\text{Mass}}(G) = \left| E \right|\), which approximately estimates the Mass of the graph, while
\({\text{Volume}}(G) = \left| N \right|\left( {\left| N \right| - 1} \right)\), which approximately estimates the Volume (size) of the graph and depends on the number of nodes. The evolution of the density of the Wholegraph can be observed by computing the density of each Projected graph in \(PG_{{\left( {t_{i} - t_{n} } \right)}}\) and observing the trend through time, that is, \(\rho (PG_{{\left( {t_{i} } \right)}} ) = \left| {E_{{t_{i} }} } \right|/\left( {\left| {N_{{t_{i} }} } \right|\left( {\left| {N_{{t_{i} }} } \right| - 1} \right)} \right)\) for each \(t_{i}\), where \(N_{{t_{i} }}\) and \(E_{{t_{i} }}\) are the node and edge sets of \(PG_{{\left( {t_{i} } \right)}}\).
Average path length
Average path length, otherwise known as characteristic path length, is also one of the most robust measures of network topology [34]. It is denoted by \(\langle {\text{d}}\rangle\) and can be used as a measure of the efficiency of travel or mass transit efficiency in the network. For each \(PG_{i}\), the average path length can be measured, in its standard form, as:
\(\langle {\text{d}}\rangle = \frac{1}{n(n - 1)}\sum\nolimits_{i \ne j} {d_{ij} }\)
where n is the number of nodes in \(PG_{i}\) and \(d_{ij}\) is the number of edges on the shortest path from node i to node j. The evolution of \(\langle {\text{d}}\rangle\) can be observed through \(PG_{i}\) over time \(\left( {t_{i} - t_{n} } \right)\), where the \(\langle {\text{d}}\rangle\) of \(PG_{{\left( {t_{i} - t_{n} } \right)}}\) is the sequence of \(\langle {\text{d}}\rangle\) values computed for the successive Projected graphs.
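A minimal sketch of computing the average path length of one Projected graph is given below. It assumes an unweighted directed graph stored as an adjacency dictionary and uses breadth-first search for shortest paths. As an assumption of this sketch, unreachable ordered pairs are skipped, so the result equals the n(n − 1)-normalized average only when every node can reach every other node.

```python
from collections import deque


def average_path_length(adj):
    """Mean shortest-path length (in edges) over reachable ordered pairs."""
    total = 0
    pairs = 0
    for source in adj:
        # Breadth-first search gives shortest hop counts from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Hypothetical Projected graph: a directed cycle a -> b -> c -> a.
cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
avg = average_path_length(cycle)
```

For the cycle, each node reaches one node in one hop and another in two hops, so the six ordered pairs sum to 9 hops and the average is 1.5.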
Experimental study
In this section, we describe an experimental study and implementation, utilizing the STVG framework to carry out evolutionary graph analytics of a bus transit network.
Data description
The datasets used for the implementation are the Automatic Vehicle Location (AVL) feeds from the CODIAC bus transit network in Greater Moncton, New Brunswick, Canada, and a GIS shapefile dataset containing information about the bus routes, bus stops, street segments, street intersections and civic addresses, obtained from the GeoNB online service (http://www.snb.ca/geonb1/e/indexE.asp). We extracted a total of 59,617 completed bus trips from the AVL feeds, which were pulled from the buses every 5 s into a PostgreSQL/PostGIS database for a period of 18 months (from June 2016 to December 2017). The raw data set from the AVL feeds consists of the following attributes:

vlr id: The ID of the data point in the vehicle location reports table.

route id vlr: The route ID in the vehicle location reports table.

route name: The route name.

route id rta: The route ID in the route transit authority table.

route nickname: The abbreviation of the route.

trip id br: The trip ID in the bid route table.

transit authority service time id: Transit authority service time ID.

trip id tta: Transit authority trip ID.

trip start: Start time of the trip.

trip finish: Finish time of the trip.

lat: Latitude.

lng: Longitude.

timestamp.
Automated preprocessing tasks
The data preprocessing tasks involve the steps that have been used to transform the raw datasets into a suitable format for building the STVG. These tasks were implemented by an automated data preprocessing algorithm developed by [35] which includes the following steps:
Step 1: Transformation of GPS points into moves/stops The algorithm uses a fixed distance between two consecutive points to determine whether the bus was moving (“Move”) or stopped (“Stop”). This distance was empirically determined to be 15 m for a transit network. If the distance between the previous point and the current point is more than 15 m, the current point is annotated as a “Move”; if it is less than 15 m, the current point is annotated as a “Stop”. These distances are Euclidean; network distance could be considered, but over a distance as short as 15 m Euclidean distance is adequate for this purpose.
Step 2: Street name annotation The algorithm annotates the “Moves” and “Stops” computed from the previous step with the street names if the event was performed on a street. For this task, a query on a PostgreSQL database is run to retrieve the names of the streets where a “Move” or “Stop” is located. This is a nontrivial step because the GPS coordinates usually have 10 m of accuracy in urban areas [36].
Step 3: Bus stop identification This step annotates the “Moves” and “Stops” with the “Bus Stop” names if the event took place within a 30 m buffer zone around a bus stop. It is important to point out that the algorithm also needs to verify the direction of a moving bus (e.g. eastbound or westbound) to identify the “Bus Stop” at which a “Stop” or “Move” is located.
Step 4: Street intersection identification This step also annotates the “Moves” and “Stops” with the “Street intersection” names if the event took place within a street intersection. To determine whether the mobility action is within a street intersection, the algorithm creates a buffer zone of 30 m radius (determined empirically) for each street intersection. The “Stops” and “Moves” that are located inside a given buffer zone are annotated with the intersection identifier.
Step 5: Origin/destination trip identification The algorithm in this step identifies the origin and destination of each trip. The first GPS point of a bus trip located at a bus stop or station is tagged as the origin, and the last point of that same trip ID located at a bus stop or station is tagged as the destination. The GPS points between the origin and destination points are sequentially indexed in order of occurrence.
Step 6: Bus trip labeling The algorithm labels the trips in the form (Route Number-Run Number-Run Day-Run Month-Run Year), where (50-10-23-12-2016) would represent Route #50 during the 10th run of the 23rd day of the 12th month of 2016. The date had to be concatenated to the trip IDs so that they would be unique in the database. A “Trip” here is a completed journey of a bus from an origin to a final destination.
At the end of this preprocessing pipeline, a CSV file is automatically generated containing all the data needed to build the STVG of the bus transit network in the Neo4j graph database.
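The first and last preprocessing steps can be sketched as follows. This is an illustration only and not the algorithm of [35]: it assumes points are already projected to planar coordinates in metres, it treats a distance of exactly 15 m as a “Stop” (the text does not specify this case), and the names classify_points, trip_label and the hyphen separator are hypothetical.

```python
import math

STOP_THRESHOLD_M = 15.0  # fixed distance reported for Step 1


def classify_points(points):
    """Label each point after the first as 'Move' or 'Stop'.

    `points` is an ordered list of (x, y) coordinates in metres
    (assumption: raw AVL lat/lng values have been projected beforehand).
    """
    labels = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        distance = math.hypot(x1 - x0, y1 - y0)  # Euclidean distance
        # Assumption: exactly 15 m counts as a 'Stop'.
        labels.append("Move" if distance > STOP_THRESHOLD_M else "Stop")
    return labels


def trip_label(route, run, day, month, year, sep="-"):
    """Concatenate route, run number and run date into a trip ID (Step 6)."""
    return sep.join(str(part) for part in (route, run, day, month, year))


# Hypothetical track: 20 m, then 5 m, then 40 m between consecutive points.
labels = classify_points([(0, 0), (20, 0), (25, 0), (25, 40)])
label = trip_label(50, 10, 23, 12, 2016)
```

Steps 2 to 5 depend on spatial queries against the street, bus stop and intersection layers and are therefore not reproduced in this sketch.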
Building the subgraphs
We define the Platial and Mobility subgraphs for generating the Wholegraph of the bus transit network, which represents the connectivity between mobility events and places over time. The preprocessed data sets in the CSV file are used to build the subgraphs of the bus transit network, which together generate the Wholegraph. The smallest temporal resolution at which nodes and edges were inserted into the subgraphs was 5 s, corresponding to the time interval of the AVL feeds. This means that changes in the subgraphs can be tracked down to every 5 s. We utilized the Neo4j graph database for the implementation. Neo4j is a popular native graph database widely used for graph data management and analytics. Its query language, Cypher, was used both to write the nodes and edges continuously into the graph and to encode the graph metrics for evolutionary graph analytics.
Platial subgraph
The Platial subgraph consists of a sequence of physical geographical places that represent a “Trip” of the bus within the transit network. In this case, a “Trip” is represented as a bus route that consists of a static “Bus Stop” (i.e. a designated place for passengers to board or alight from a bus), “Street Segment” (i.e. the transit segment between two bus stops belonging to a bus route), and “Street Intersection” (e.g. existing 3-way or 4-way intersections of a bus route). More geographical places can be defined depending on the availability of datasets and on the application. In this case study, nodes are created in the Platial subgraph according to the actual locations of the bus trips at any given time. The “NEXT” edges represent adjacency relationships between two consecutive geographical places in a trip. Figure 3 illustrates the network structure of the Platial subgraph in the STVG of the bus transit network. A node in the Platial subgraph can reoccur in more than one Projected graph, even in tumbling timewindows.
The Cypher query shown in Table 2 was used to import the bus stop, intersection and street segment points into the graph database as Bus stop nodes, Intersection nodes and Street segment nodes, respectively. This step also sets node properties such as intersection ID, intersection name, street segment ID and street segment name.
Mobility subgraph
The Mobility subgraph represents the discrete sequence of mobility events (moves and stops) of a moving bus in space and time. In this subgraph, as shown in Fig. 4, the primary entity is the “Trip” that represents the trajectory of a moving bus within a bus route of a transit network. A “Trip” node is a composite node created by the sequence of “Move” nodes (i.e. nodes representing locations where the bus is in motion) and “Stop” nodes (i.e. nodes representing locations where the bus is not in motion) from an “Origin” (i.e. a node representing the first location of a trip) to a “Destination” (i.e. a node representing the last location of a trip). The connectivity between these nodes is represented by the “NEXT” adjacency relationship, which is a space–time relation such as Origin–Move, Stop–Stop, Stop–Move, Stop–Destination, Move–Destination, and Move–Stop.
Each node in the Mobility subgraph occurs at a specific point in time and is added to the graph together with its respective NEXT relation. These nodes are connected to the timetree’s leaf nodes using their date and timestamps, starting from the year down to the second leaf nodes. This is important for tracking the evolution of the subgraph. The “Origin” and “Destination” nodes of a trip have relationships to the “Trip” node labelled “START_AT” and “ENDS_AT” respectively, which give meaning to the edges and are useful for the evolutionary analysis of the trips. This subgraph is chiefly responsible for the dynamics of the graph. In this case study, one or more nodes from the Mobility subgraph are added to the Wholegraph every 5 s.
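To make the structure of a trip in the Mobility subgraph concrete, the sketch below derives the “NEXT”, “START_AT” and “ENDS_AT” edges from an ordered list of mobility events. It is a plain-Python illustration under stated assumptions rather than the Cypher used in the study: the function name build_trip_edges, the event identifiers and the edge direction (from the event node to the “Trip” node) are all hypothetical.

```python
def build_trip_edges(trip_id, events):
    """Return (source, relation, target) triples for one trip.

    `events` is an ordered list of (node_id, kind) pairs, where kind is
    one of 'Origin', 'Move', 'Stop' or 'Destination'.
    """
    if len(events) < 2:
        raise ValueError("a trip needs at least an Origin and a Destination")
    if events[0][1] != "Origin" or events[-1][1] != "Destination":
        raise ValueError("events must run from an Origin to a Destination")

    # Consecutive mobility events are linked by NEXT edges.
    edges = [(a_id, "NEXT", b_id) for (a_id, _), (b_id, _) in zip(events, events[1:])]
    # Assumption: START_AT / ENDS_AT point from the event node to the Trip node.
    edges.append((events[0][0], "START_AT", trip_id))
    edges.append((events[-1][0], "ENDS_AT", trip_id))
    return edges


# Hypothetical four-event trip.
trip_edges = build_trip_edges(
    "50-10-23-12-2016",
    [("e1", "Origin"), ("e2", "Move"), ("e3", "Stop"), ("e4", "Destination")],
)
```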
The Mobility subgraph nodes were continuously loaded while concurrently establishing the connectivity with the Platial subgraph nodes. The Cypher statements for importing the “Moves” and “Stops” nodes into the Wholegraph are shown in Table 3.
Connectivity between mobility and platial subgraphs
Ideally, each node in the Mobility subgraph has a corresponding node in the Platial subgraph at which it is located at a given time instance. Platial and Mobility subgraph nodes are complementary in space and time (Fig. 5): for each node in the Mobility subgraph there is always a corresponding node in the Platial subgraph, with which a connection is established using a Cypher statement as shown in Table 4.
The connectivity between the two subgraphs is semantically represented by the “LOCATED_AT” edges that exist between the corresponding nodes. Independent of the number of subgraphs represented in the graph model, all nodes are connected to a timegraph based on the temporal relationship between the nodes in only one of the subgraphs and the leaf nodes of the timegraph. The Mobility subgraph nodes are connected to the timegraph from the top to the lowest level of temporal resolution using, for example, the Cypher statement in Table 5 to create the sequential “HAPPEN_AT” edges between the “Move” nodes and their corresponding timegraph leaf nodes.
The timegraph is the timedimension of the Wholegraph which contains all possible time instances of every node in the subgraphs throughout the graph lifetime. We built a timegraph of a 2year lifetime (2016–2017) in hierarchical and sequential order of temporal resolutions as described in the Cypher statements in Table 6.
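The hierarchical and sequential ordering of a timegraph can be illustrated with a small Python sketch. It is not the Cypher of Table 6: as a simplification it stops at the day level rather than descending to seconds, and the function name build_timegraph and the tuple encoding of time nodes are assumptions made for illustration.

```python
import calendar


def build_timegraph(start_year, end_year):
    """Return (child_edges, next_edges) for a Year -> Month -> Day hierarchy.

    Time nodes are encoded as tuples: (year,), (year, month), (year, month, day).
    """
    child_edges, days = [], []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            child_edges.append(((year,), (year, month)))
            days_in_month = calendar.monthrange(year, month)[1]
            for day in range(1, days_in_month + 1):
                child_edges.append(((year, month), (year, month, day)))
                days.append((year, month, day))
    # Sequential order: each day leaf is linked to the day that follows it.
    next_edges = list(zip(days, days[1:]))
    return child_edges, next_edges


# Timegraph skeleton for the 2016-2017 lifetime used in the case study.
child_edges, next_edges = build_timegraph(2016, 2017)
```

The hierarchical edges allow a query to descend to a chosen temporal resolution, while the sequential edges allow it to walk across consecutive time instances, including across month and year boundaries.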
The Wholegraph
The Wholegraph is composed of the subgraphs and the timegraph. It presents the complete connectivity between the space and time dimensions of the network. Figure 6 gives an overview of the two dimensions of the graph model: the space dimension, which consists of the subgraphs, and the time dimension, which consists of the timegraph. The connectivity between these dimensions enables the evolutionary analytics of the transit network graph. The timegraph not only tracks the evolution of the space dimension but also facilitates efficient retrieval of time-dependent Projected graphs. Table 7 describes the statistics of the Wholegraph in the graph database.
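The idea of retrieving time-dependent Projected graphs from a single Wholegraph can be sketched as follows. In the study the retrieval goes through the timegraph in Neo4j; here, as an assumption made purely for illustration, the Wholegraph is stood in for by a flat list of timestamped edges, and the function name project_by_hour is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def project_by_hour(timed_edges):
    """Group (u, v, timestamp) edges into tumbling hourly windows.

    Returns a dict mapping the start of each hour to the list of (u, v)
    edges observed in that hour, i.e. one edge list per hourly projection.
    """
    windows = defaultdict(list)
    for u, v, timestamp in timed_edges:
        window_start = timestamp.replace(minute=0, second=0, microsecond=0)
        windows[window_start].append((u, v))
    return dict(windows)


# Hypothetical mobility edges recorded around a window boundary.
sample_edges = [
    ("m1", "m2", datetime(2016, 12, 23, 8, 0, 5)),
    ("m2", "s1", datetime(2016, 12, 23, 8, 59, 55)),
    ("s1", "m3", datetime(2016, 12, 23, 9, 0, 0)),
]
projected = project_by_hour(sample_edges)
```

Because every projection is derived from the same stored edges, changing the temporal resolution only changes how timestamps are truncated and does not require storing a separate copy of the graph per window.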
Results and discussion
From a global perspective, the trend in the graph density and volume shown in Fig. 7a, b reflects the variations in the bus transit network’s topology within a 2-year time window at hourly temporal resolution. This trend could be useful for observing increases or decreases in bus transit mobility activities over time, as well as trip density within the entire network through time. We could also observe whether there is a significant correlation between global network density and congestion density in the network through time. Congestion in our graph model is represented by many “Stops” on a street with a “Move”.
Figure 7a also reveals an evolutionary trend in the Wholegraph of the transit network such that the hourly network density and the graph densities with respect to traffic congestion and bus movement have their lowest values at peak transit hours (8 a.m. and 4 p.m.). In a practical sense, the trend reveals the highest volume of transit mobility activities (“Moves” and “Stops”), trips and traffic congestion at these peak hours. In other words, between 8 a.m. and 9 a.m. in the mornings and between 4 p.m. and 5 p.m. in the evenings, more nodes and edges are added to the Wholegraph of the transit network, signifying a high level of mobility activity within the network during these times. The inverse relationship between graph density and graph volume is clearly depicted in Fig. 8a, b, where an increase in volume means a decrease in density and vice versa. The evolutionary trend shows a strong correlation between the global network density of the graph, graph density with respect to traffic congestion and density with respect to bus movement. Similar behavior is seen in the network, congestion and movement volumes: as one increases, the others increase as well.
At monthly temporal resolution, Fig. 8a, b reveal an evolutionary behavior with peaks in August 2016, December 2016 and May 2017 in terms of movement, traffic congestion and network volume as well as density of the graph.
From June to December 2016, graph density with respect to the entire network, movement and congestion reveals an alternating (up and down) zigzag behavior different from what is observed in the same months in 2017. The graph volume with respect to congestion, movement and the entire network did not change significantly from June to December 2017. The peaks observed in August 2016 and December 2016 did not reoccur in the following year, 2017. Figure 9a, b provide a closer look at the behavior of the transit network in a monthly timewindow (the December 2016 peak) at daily temporal resolution. The daily behavior of transit activity in December 2016 is compared to that of the previous month, November 2016, to depict the differences in daily transit pattern and the reason for such a peak in December.
The daily graph volume and density for the December 2016 peak depict a high volume of bus transit activity on many days of the month, except for the lows observed between Tuesday 6th and Friday 9th as well as on Monday 12th. There are also weekend lows, such as on Sunday 18th, Saturday 24th and Saturday 31st of December 2016. Comparing the December 2016 daily network volume with that of the previous month, November 2016, as shown in Fig. 10, we can deduce the reason for the December peak. November has more low-volume days than December, and from the 12th to the 30th of December there are more mobility events in the network than in November. In a graph topological sense, more nodes and edges are added to the Wholegraph in December than in November 2016.
The evolutionary trend in the Average path length of the network would reflect the average trip length (travel time) over time. One analysis of interest would be to observe how evolving average path length correlates with evolving trip density/volume over time in the network.
It can be observed in Fig. 11a, b that average path length has a very strong correlation with graph density and volume. It increases with an increase in graph volume and decreases with an increase in graph density. In a practical sense, average path length in this case study is the average trip length (travel time) from an origin to a destination within the network. As expected, travel time (trip length) increases when the volume of congestion and mobility activity in the network increases, as observed in Fig. 11a. We can also observe in Fig. 11b that the highest average trip lengths occur at the peak hours of the day (8 a.m. and 4 p.m.), when traffic congestion and network volume are highest.
Conclusions
Evolutionary graph analytics based on sequences of graph snapshots have been commonly used in the literature, mainly because snapshots make it convenient to manage a small-size graph in discrete versions and to compare differences between snapshots across consecutive time windows. This approach may be suitable for small graphs and for cases where changes do not occur at short temporal resolutions. However, it is associated with high storage overhead in proportion to the size of the evolving graph and the time intervals between snapshots, because the entire graph is usually replicated from one snapshot to another. Computing evolving queries across the snapshots is also computationally intensive and complicated. These are major disadvantages, especially where the graph is massive and changes frequently at short temporal resolutions, and where the evolutionary analysis relies on the explanatory power of representing the dynamics across different temporal resolutions.
We therefore propose a framework based on our Space–Timevarying graph (STVG) formalism, which utilizes the Wholegraph approach to model the dynamics of a graph whose evolution materializes in the timevarying changes of its Projected graphs. The STVG framework provides an approach to reduce the high storage overhead of massively changing graphs where new nodes and edges arrive every second. It affords the capability to extract Projected graphs at different timewindows and analyze their metrics across varying temporal resolutions.
The framework was implemented for a transit case study using the AVL feeds of the bus transit network of Greater Moncton, New Brunswick, Canada, which generated a Wholegraph of 44.2 GB in the database. In contrast, the Snapshot method created 7280 hourly snapshots, 732 daily snapshots, 18 monthly snapshots and 2 yearly snapshots in the graph database, where the smallest snapshot amounted to 1.3 GB of graph. The total storage cost for the over 18,000 snapshots needed for our evolutionary graph analytics amounted to 23.4 TB, because the total storage and computational cost increases linearly with the number of snapshots; that is, the total cost equals the cost per snapshot multiplied by the number of snapshots. The Snapshot method therefore incurs a more than 500-fold increase in storage overhead compared with our Wholegraph approach.
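The storage figures quoted here can be checked with a short calculation. The sketch assumes the linear cost model stated in the text, takes the snapshot count of 18,000 and the 1.3 GB size of the smallest snapshot as given, and uses decimal units (1 TB = 1000 GB); the variable names are illustrative.

```python
SNAPSHOT_COUNT = 18_000       # "over 18,000 snapshots", as stated in the text
SMALLEST_SNAPSHOT_GB = 1.3    # size of the smallest snapshot
WHOLEGRAPH_GB = 44.2          # size of the Wholegraph in the database

# Linear cost model: total cost = cost per snapshot * number of snapshots.
snapshot_total_gb = SNAPSHOT_COUNT * SMALLEST_SNAPSHOT_GB
snapshot_total_tb = snapshot_total_gb / 1000  # assumption: decimal terabytes

# Storage overhead of the Snapshot method relative to the Wholegraph.
overhead_ratio = snapshot_total_gb / WHOLEGRAPH_GB

print(f"{snapshot_total_tb:.1f} TB, about {overhead_ratio:.0f}x the Wholegraph")
```

Under these assumptions the total is 23.4 TB and the ratio is roughly 529, which agrees with the reported increase of over 500 times. Since 1.3 GB is the size of the smallest snapshot, the product is best read as a lower bound on the snapshot storage cost.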
Our evolutionary analysis was based on graph density, volume and average path length of the Projected graphs at varying temporal resolutions across different time windows. The analysis reveals evolutionary patterns in the overall network density of the graph, traffic congestion density and graph density with respect to bus movement at hourly, daily and monthly temporal resolutions. Similar patterns are observed in the evolution of the network, congestion and movement volumes. This type of analytics potentially provides any transit network with an efficient way to uncover the dynamics of the network as well as the dynamics in the network over space and time. The evolutionary pattern of transit network properties such as average path length, network density and volume, as a result of the dynamics in human mobility patterns, may become vital for transit network optimization. Potential applications can be found in transit trend analysis as well as in time-dependent transit recommendation systems.
Availability of data and materials
The data is not available to the public because the CODIAC bus transit agency allowed us to use the data on the condition that it not be distributed or shared.
Abbreviations
AVL: automatic vehicle location
CSV: comma separated values
GIS: geographic information systems
GPS: global positioning system
ID: identity
NB: New Brunswick
PG: Projected graphs
STVG: Space–Timevarying graph
References
Huang Q, Zhao C, Zhang X, Wang X, Yi D. Centrality measures in temporal networks with time series analysis. Europhys Lett. 2017;118(3):36001.
Khurana U, Deshpande A. Storing and analyzing historical graph data at scale. 2016; 65–76.
Khurana U. An introduction to temporal graph data management. 2016. http://citeseerx.ist.psu.edu. p. 1–11.
Tong H, Papadimitriou S, Yu PS, Faloutsos C. Proximity tracking on dynamic bipartite graphs: problem definitions and fast solutions. In: Tong H, Papadimitriou S, Yu PS, Faloutsos C, editors. Link mining: models, algorithms, and applications. New York: Springer; 2010. p. 211–36.
Rossi RA, Gallagher B, Neville J, Henderson K. Modeling dynamic behavior in large evolving graphs. In: Proc. sixth ACM Int. Conf. Web search data Min.—WSDM’13. 2013. p. 667.
Yang Y, Yu JX, Gao H, Pei J, Li J. Mining most frequently changing component in evolving graphs. World Wide Web. 2014;17(3):351–76.
Yang J, Leskovec J. Patterns of temporal variation in online media. In: Proc. fourth ACM Int. Conf. Web search data Min.—WSDM’11. 2011. p. 177.
Pereira FSF, de Amo S, Gama J. Evolving centralities in temporal graphs: a twitter network analysis. In: 2016 17th IEEE Int. Conf. Mob. Data Manag. 2016. p. 43–48.
Koloniari G, Pitoura E. Partial view selection for evolving social graphs. In: GRADES’13 first international workshop on graph data management experiences and systems. 2013. p. 4503–2188.
Aridhi S, Montresor A, Velegrakis Y. BLADYG: a graph processing framework for large dynamic graphs. Big Data Res. 2017;9:9–17.
Leventhal GE, Hill AL, Nowak MA, Bonhoeffer S. Evolution and emergence of infectious diseases in theoretical and real-world networks. Nat Commun. 2015;6:6101.
Magnien C, Tarissan F. Time evolution of the importance of nodes in dynamic networks. In: IEEE/ACM international conference on advances in social networks analysis and mining. 2015.
Starnini M, Machens A, Cattuto C, Barrat A, Pastor-Satorras R. Immunization strategies for epidemic processes in time-varying contact networks. J Theor Biol. 2013;337:89–100.
Jalili M, Salehzadeh-Yazdi A, Gupta S, Wolkenhauer O, Yaghmaie M, Resendis-Antonio O, Alimoghaddam K. Evolution of centrality measurements for the detection of essential proteins in biological networks. Front Physiol. 2016;7:375.
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008;10:2008.
Yarlagadda R, Pinnaka S, Çetinkaya EK. A time-evolving weighted-graph analysis of global petroleum exchange. In: 7th international workshop on reliable networks design and modeling (RNDM), 2015.
Von Landesberger T, Brodkorb F, Roskosch P, Andrienko N, Andrienko G, Kerren A. Mobility graphs: visual analysis of mass mobility dynamics via spatio-temporal graphs and clustering. IEEE Trans Vis Comput Graph. 2015;22:11–20.
Qi X, Liu L, Cai G, Xie M. A topology evolution model based on revised PageRank algorithm and node importance for wireless sensor networks. Hindawi. 2015.
Glacet C, Fiore M, Gramaglia M. Temporal connectivity of vehicular networks: The power of storecarryandforward. In: IEEE Vehicular Networking Conference, VNC. 2016.
Küçükkeçeci C, Yazıcı A. Big data model simulation on a graph database for surveillance in wireless multimedia sensor networks. Big Data Res. 2018;11:33–43.
Aggarwal C, Subbian K. Evolutionary network analysis: a survey. ACM Comput Surv. 2014;47:10.
Ivarsson T, Kollegger A, Neubauer P, Svensson J, Webber J. The Neo4j manual v1. 3. USA: NeoTechnology. 2014.
Casteigts A, Flocchini P, Quattrociocchi W, Santoro N. TimeVarying graphs and dynamic networks. Ad hoc Mob Wirel Netw. 2011;6811:346–59.
Gottumukkala RN, Venna SR, Raghavan V. Visual analytics of time evolving largescale graphs. IEEE Intell Inf Bull. 2015;16(1):10–6.
Huo W, Tsotras VJ. Efficient temporal shortest path queries on evolving social graphs. In: Proceedings of the 26th international conference on scientific and statistical database management—SSDBM’14. 2014. p. 1–4.
Ren C, Lo E, Kao B, Zhu X, Cheng R. On Querying Historical Evolving Graph Sequences. Proc VLDB Endow. 2011;4(11):726–37.
Lerman K, Ghosh R, Kang JH. Centrality Metric for Dynamic Networks. Inf. Sci. (Ny). 2010;354(Pt 3):70–7.
Quattrociocchi W, Amblard F, Galeota E. Selection in scientific networks. Soc Netw Anal Min. 2012;2(3):229–37.
Kumar R, Novak J, Tomkins A. Structure and evolution of online social networks. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining—KDD’06. 2006. p. 611.
Liu Z, Yu JX. Discovering burst areas in fast evolving graphs. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2010, vol. 5981 LNCS, no. PART 1, p. 171–185.
Santoro N, Quattrociocchi W, Flocchini P, Casteigts A, Amblard F. Timevarying graphs and social network analysis : temporal indicators and metrics. In: 3rd AISB Soc. networks multiagent Syst. Symp. 2011. p. 32–38.
Ferreira A. Building a reference combinatorial model for MANETs. IEEE Netw. 2004;18(5):24–9.
Fenn DJ, Porter MA, Williams S, McDonald M, Johnson NF, Jones NS. Temporal evolution of financialmarket correlations. Phys Rev E Stat Nonlinear Soft Matter Phys. 2011;84(2):1–13.
Edwards B, Hofmeyr S, Stelle G, Forrest S. Internet topology over time. 2012. Arxiv, p. 6.
Cao H, Wachowicz M. The design of an IoTGIS platform for performing automated analytical tasks. Comput Environ Urban Syst. 2019;74:23–40.
Salarian M, Manavella A, Ansari R. Accurate localization in dense urban area using Google street view images. In: 2015 SAI intelligent systems conference (IntelliSys), 2015. p. 485–490.
Acknowledgements
We want to acknowledge CODIAC for giving us access to the dataset.
Funding
This research was supported by TETFund and NSERC/Cisco Industrial Research Chair.
Contributions
MW and IM developed the conceptual framework. IM carried out the implementation. All the authors contributed to the writing of the paper. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Maduako, I., Wachowicz, M. & Hanson, T. STVG: an evolutionary graph framework for analyzing fast-evolving networks. J Big Data 6, 55 (2019). https://doi.org/10.1186/s40537-019-0218-z
DOI: https://doi.org/10.1186/s40537-019-0218-z
Keywords
 Space–Time varying graph
 Evolutionary graph analytics
 Fast-evolving networks
 Transit networks