S-RASTER: Contraction Clustering for Evolving Data Streams

Contraction Clustering (RASTER) is a very fast algorithm for density-based clustering, which requires only a single pass. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. Yet, RASTER is limited to batch processing. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step in the sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. It is therefore very well suited to real-world scenarios where clustering does not happen continually but only periodically. We describe the algorithm, including a discussion of implementation details.


Introduction
Clustering is a standard method for data analysis. Many clustering methods have been proposed [11]. Some of the most well-known clustering algorithms are DBSCAN [7], k-means clustering [10], and CLIQUE [1] [2]. Yet, they have in common that they do not perform well when clustering big data, i.e. data that far exceeds available main memory. Based on a real-world challenge we faced in industry, i.e. clustering of large amounts of geospatial data, we developed Contraction Clustering (RASTER), a very fast linear-time clustering algorithm for identifying approximate density-based clusters. The original presentation was focused on sequential processing of batch data [13] and was followed by a description of a parallel version of this algorithm [14]. A key aspect of RASTER is that it does not exhaustively cluster its input but instead identifies their approximate location in linear time. As it only requires constant space, it is eminently suitable for clustering big data. The variant RASTER retains its input and still runs in linear time while requiring only a single pass. Of course, it cannot operate in constant memory.
A common motivation for stream processing is that data does not fit into working memory and therefore cannot be retained. This is not a concern for RASTER as it can process an arbitrary amount of data in limited working memory. One could therefore divide a stream of data into discrete batches and consecutively cluster them. Yet, this approach does not address the problem that, in a given stream of data, any density-based cluster may only be temporary. In order to solve this problem, this paper presents Contraction Clustering for evolving data streams (S-RASTER). This algorithm has been designed for identifying density-based clusters in infinite data streams within a sliding window. S-RASTER is not a replacement of RASTER, but a complement, enabling this pair of algorithms to efficiently cluster data, regardless of whether it is available as a batch or a stream.
In the remainder of this paper, we provide relevant background in Sec. 2, which contains a brief recapitulation of RASTER and the motivating use case for S-RASTER, i.e. identifying evolving hubs in streams of GPS data. In Sect. 3 we provide a detailed description of S-RASTER, including a complete specification in pseudocode. 3 After a brief theoretical evaluation in Sec. 4, we highlight related work in Sec. 5 and future work in Sec. 6.

Background
In this section, we give a brief presentation of the sequential RASTER algorithm in Subsect. 2.1. This is followed by a description of the motivating problem behind S-RASTER, i.e. the identification of so-called hubs within a sliding window, in Subsect. Only tiles to which more than a predefined threshold number τ of data points have been projected are retained. These are referred to as significant tiles σ, which are subsequently clustered by exhaustive lookup of neighboring tiles in a depth-first manner. Clustering continues for as long as there are significant tiles left. To do so, the algorithm selects an arbitrary tile as the seed of a new cluster. This cluster is grown iteratively by looking up all neighboring tiles within a given Manhattan or Chebyshev distance δ. This takes only O(1) as the location of all potential neighbors is known due to their location in the grid. Only clusters that contain more than a predefined number µ of significant tiles are kept. The projection operation consists of reducing the precision of the input by scaling a floating-point number to an integer. For instance, take an arbitrary GPS coordinate (34.59204302, 106.36527351), which is already truncated compared to the full representation with double-precision floating-point numbers. GPS data is inherently imprecise, yet stored in floating-point format with the maximum precision, which is potentially misleading, considering that consumergrade GPS is only accurate to within about ten meters under ideal conditions [6].
Consequently, these data suggest a level of precision they do not possess. Thus, by dropping a few digit values, we do not lose much, if any, information. Furthermore, vehicle GPS data is sent by vehicles that may, in the case of trucks with an attached trailer, be more than 25 meters long. To cluster such coordinates, we can truncate even more digit points. In our example, if we concluded that a precision of around 11.1 meters is sufficient, we would transform the input to (34.5920, 106.3652). To avoid issues pertaining to working with floating-point numbers, the input is instead scaled to (345920, 1063652). Before generating the final output, all significant tiles of the resulting clusters are scaled back to floating-point numbers.
RASTER is a single-pass linear time algorithm. However, in a big data context, its biggest benefit is that it only requires constant memory, assuming a finite range of inputs. This is the case with GPS data. It is therefore possible to process an arbitrary amount of data on a resource-constrained workstation with this algorithm. We have also shown that it can be effectively parallelized [14]. A variation of this algorithm that retains its inputs is referred to as RASTER . It is less suited for big data applications. However, it is effective for general-purpose density-based clustering and very competitive against standard clustering methods; cf. Appendix A in [14].

Identifying Evolving Hubs
RASTER was designed for finite batches of GPS traces of commercial vehicles. The goal was to identify hubs, i.e. locations where many vehicles come to a halt, for instance vans at delivery points or buses at bus stops. After identifying all hubs in a data set, it is possible to construct vehicular networks. However, what if the data does not represent a static reality? It is a common observation that the location of hubs changes over time. A bus stop may get moved or abolished, for instance. This motivates adapting RASTER so that it is able to detect hubs over time and maintaining hubs within a sliding window W . The length of W depends on the actual use case. With GPS traces of infrequent but important deliveries, many months may be necessary. Yet, with daily deliveries, a few days would suffice to detect new hubs as well as discard old ones.

S-RASTER
This section starts with a concise general description of S-RASTER in Sect. 3.1, followed by a detailed specification in Sect. 3.2. Afterwards, we highlight some implementation details in Sect. 3.3 and outline, in Sect. 3.4, how S-RASTER has to be modified to retain its inputs, which makes this algorithm applicable to different use cases.

Idea
S-RASTER (cf. Fig. 2) performs, for an indefinite amount of time, projection and accumulation continually, and clustering periodically. Projection nodes π The flow graph of RASTER with evolving data streams. The input source node s distributes values arbitrarily to projection nodes π. In turn, they send projected values to accumulation nodes α. These determine significant tiles for the chosen sliding window. Should a tile become significant or a once significant tile no longer be significant, a corresponding update is sent to the clustering node κ. Node κ periodically performs clustering of significant tiles, which is a very fast operation.
receive their input from the source node s. Each incoming (x, y), where x, y ∈ R, is surjected to (x π , y π ), where x π , y π ∈ Z. Together with a period indicator ∆ z , these values are sent to accumulation nodes α. The identifier ∆ z ∈ N 0 designates a period with a fixed size, e.g. one day, and is non-strictly increasing. Each α-node maintains a sliding window W of length c, which is constructed from c multisets W ∆z . Each such multiset W ∆z keeps running totals of how many times the input was surjected to any given tuple (x π , y π ) in the chosen period. The sliding window starting at period ∆ i is defined as W ∆i+c ∆i = i+c z=i W ∆z . It contains the set of significant tiles σ = {(x π , y π ) d ∈ W ∆i+c ∆i | d ≥ τ }, where d indicates the number of appearances in the multiset and τ the threshold for a significant tile. If a tile becomes significant, its corresponding value (x π , y π ) is forwarded to the clustering node κ. Whenever W advances to the next period ∆ i+1 , the oldest entry W ∆i is removed from W . Furthermore, all affected running totals are adjusted, which may lead to some significant tiles no longer being significant. If so, node κ receives corresponding updates to likewise remove those entries. Node κ keeps track of all significant tiles, which it clusters whenever W advances (cf. Alg. 1, lls. 12-24). Call the resulting set of clusters ks. The resulting clusters are defined as {k ∈ ks | k > µ}, where µ is the minimum cluster size. Each (x π , y π ) in each k is finally projected to (x π , y π ), where x π , y π ∈ R. Together with cluster and period IDs, these values are sent to the sink node t.

Detailed Description
S-RASTER processes the data stream coming from source s in Fig. 2 and outputs clusters to sink t. There are three different kinds of nodes: projection nodes π project points to tiles, accumulation nodes α determine significant tiles within each sliding window W ∆i+c ∆i , and one clustering node κ outputs, for each W ∆i+c ∆i , all identified clusters. Below, we describe the three nodes in detail.
Algorithm 2 Projection node π input: stream of tuples (x, y, ∆z) where x, y are coordinates and ∆z is the current period; precision prec output: stream of tuples (xπ, yπ, ∆z) 1: xπ := projection of x with prec 2: yπ := projection of y with prec 3: send (xπ, yπ, ∆z) to node α Projection. Nodes labeled with π project incoming values to tiles with the specified precision (cf. Alg. 2). In general, the input (x, y, ∆ z ) is transformed into (x π , y π , ∆ z ). The period ∆ z is a non-strictly increasing integer, for instance uniquely identifying each day. Projection is a stateless operation that can be executed in parallel. It is irrelevant which node π performs projection on which input value as they are interchangeable. However, the input of the subsequent accumulator nodes α needs to be grouped by values, for instance by assigning values within a certain segment of the longitude range of the input values. Yet, this is not strictly necessary as long as it is ensured that every unique surjected value (x π , y π ) is sent to the same α-node.
Accumulation. The accumulator nodes α keep track of the number of counts per projected tile for the sliding window W ∆i+c ∆i (cf. Alg. 3). The input consists of a stream of tuples (x π , y π , ∆ z ) as well as the size of the sliding window c and the threshold value τ . A global variable ∆ j keeps track of the current period. Each α-node maintains two persistent data structures. For reasons of efficiency, the hashmap totals records, for each tile, how many points were projected to it in W ∆i+c ∆i . Tiles become significant once their associated count reaches τ . In addition, the hashmap window records the counts per tile for each period W ∆z in W ∆i+c ∆i . Given that input ∆ z is non-strictly increasing, there are two cases to consider: (i) ∆ z = ∆ j , i.e. the period is unchanged. In this case, the counts of tile (x π , y π ) in totals and its corresponding value ∆ z in window are incremented by 1. If the total count for (x π , y π ) has just reached τ , an update is sent to the κ-node, containing (x π , y π ) and the flag 1. (ii) ∆ z > ∆ j , i.e. the current input belongs to a later period. Now the sliding window needs to be advanced, which means that the oldest entry gets pruned. But first, an update is sent to the κ-node with ∆ j and the flag 0. Afterwards, the entries in the hashmap window have to be adjusted. Entry ∆ z−c is removed and for each coordinate pair and its associated counts, the corresponding entry in the hashmap totals gets adjusted downward as the totals now should no longer include these values. In case the associated value of a coordinate pair in totals drops below τ , an update is sent to κ, consisting of (x π , y π ) and the flag -1. Should a value in totals reach 0, the corresponding key-value pair is removed. Afterwards, the steps outlined in the previous case are performed.
Regarding significant tiles, only status changes are communicated from α nodes to κ, which is much more efficient than sending a stream with constant status updates for each tile. send (0, ∆j) to node κ Re-cluster 7: ∆j := ∆z 8: key := ∆j − c Oldest Entry 9: if key ∈ keys of window then Prune 10: vals := window [key] 11: for (a, b) in keys of vals do 12: old Clustering. The clustering node κ (cf. Alg. 4) takes as input a precision value prec, which is identical to the one that was used in the α-node, the minimum cluster size µ, and a stream consisting of tuples of a flag ∈ {−1, 0, 1} as well as a value val ∈ {(x π , y π ), ∆ j }. This node keeps track of the significant tiles σ of the current sliding window, based on updates received from all α nodes. If val = 1, the associated coordinate pair is added to σ. On the other hand, if val = −1, tile (x π , y π ) is removed from σ. Thus, σ is synchronized with the information stored in all α nodes. Lastly, if val = 0 the associated value of the input tuple represents a period identifier ∆ j . This is interpreted as the beginning of this period and, conversely, the end of period ∆ j−1 . Now κ clusters, taking µ into account, the set of significant tiles (cf. Alg. 1, lls. 12 -24) and produces an output stream that represents the clusters found within the current sliding window. In this stream, each projected coordinate (x π , y π ) is assigned period and cluster identifiers, which leads, after re-scaling the coordinate pairs (x π , y π ) to floating point numbers (x π , y π ), to an output stream of tuples of the format (∆ j , cluster id , x π , y π ). clusters := cluster(σ, µ) cf. Alg. 1, lls. 12 -24 12:

Implementation
id := 0 Cluster ID 13: for cluster in clusters do 14: for (xπ, yπ) in cluster do 15: (x π , y π ) := rescale(xπ, yπ, prec) Int to Float 16: send (∆j, id , x π , y π ) 17: id += 1 The previous description is idealized. Yet, our software solution has to take the vagaries of real-world data into account. Below, we therefore highlight two relevant practical aspects that the formal definition of our algorithm does not capture.
Out-of-Order Processing. Tuples are assigned a timestamp at the source, which is projected to a period identifier ∆ z . Assuming that the input stream is in-order, parallelizing the α operator could nonetheless lead to out-of-order input of some tuples at the κ node, i.e. the latter could receive a notification about the start of period ∆ j , cluster all significant tiles as they were recorded up to the seeming end of ∆ j−1 , but receive further tuples pertaining to it from other α-nodes afterwards. One solution is to simply ignore these values as their number should be minuscule. Commonly used periods, e.g. one day, are quite large and the expected inconsistencies are confined to their beginning and end. Thus, it may make sense to set the start of a period to a time where very little data is generated, e.g. 3:00 a.m. for a 24-hour period when processing data of commercial vehicles. Alternatively, clustering could be triggered with a delay of one full period, where the chosen length of the period exceeds the expected delay.
Interpolation. The sliding window also has to advance when there are missing period identifiers in the data. It is not sufficient to simply remove the oldest key-value pair. Instead, missing periods need to be interpolated. If a gap greater than one period between the current and the last encountered period is detected, the algorithm advances the sliding window as many times as needed, one period at a time. This is omitted from Alg. 3 for the sake of brevity but part of our published implementation.

Retaining data points with S-RASTER
There are use cases where it is desirable to not only identify clusters based on their significant tiles but also on the data points that were projected to those tiles (cf. Figs. 1c and 1d). In the following, we refer to the variant of S-RASTER that retains relevant input data as S-RASTER . The required changes are minor.
With the goal of keeping the overall design of the algorithm unchanged, first the π nodes have to be modified to produce a stream of tuples (x π , y π , x, y, ∆ z ), i.e. it retains the original input coordinates. In the α nodes the hashmaps totals and window have to be changed to retain sets of unscaled coordinates (x, y) per projected pair (x π , y π ). Counts are given by the size of those sets. This assumes that determining the size of a set is an O(1) operation in the implementation language, for instance due to the underlying object maintaining this value as a variable. In case a tile becomes significant, each α-node sends not just the tile (x π , y π ) but also a bounded stream to κ that includes all coordinate pairs (x, y) that were projected to it. After a tile has become significant, every additional point (x π , y π ) that maps to it also has to be forwarded to κ. Lastly, in the κ node, the set tiles has to be turned into a hashmap that maps projected tiles (x π , y π ) to their corresponding points (x, y) and the output stream modified to return, for each point (x, y) that is part of a cluster, the tuple (∆ j , cluster id , x π , y π , x, y).

Evaluation
As we have shown previously, RASTER is very fast and generates results more quickly than competing algorithms. Of course, the caveat is that the resulting clusters may not include elements at the periphery that other methods include. Yet, the big benefit of our algorithm is its very fast throughput, being able to process an arbitrary amount of data in linear time and constant memory. This makes it possible to use a standard workstation, while comparable workloads with other algorithms would necessitate a much more expensive and timeconsuming cloud computing setup. The same holds true for S-RASTER as it is an adaptation of RASTER that does not change its fundamental properties. Consequentially, S-RASTER, just like RASTER, requires only constant memory, needs only a single pass, i.e. every element of the input is only processed once, and performs clustering in linear time. We have made a prototypal implementation of S-RASTER publicly available. However, a thorough comparative evaluation is currently missing. That being said, based on the specification we provided, there is no reason to assume that the performance benefits of RASTER for data batches are not also evident in S-RASTER for evolving data streams. Our reasoning is outlined below.
Runtime. RASTER is a single-pass linear time algorithm. The same is true for S-RASTER. Data is continually processed as a stream, in which every input is only processed one single time, implying linear time. Clustering does not happen continually, but periodically. Furthermore, clustering significant tiles is a very fast procedure. As the cost of this operation is quite small, it can be viewed as a constant. On a related note, the performance impact of periodic clustering in the κ node can be mitigated by running this node on a separate CPU core, which is straightforward in the context of stream processing.
Memory. RASTER needs constant memory M to keep track of all counts per tile of the entire (finite) grid. In contrast, S-RASTER uses a sliding window of a fixed length c. In the worst case, the data that was stored in M with RASTER requires cM memory in S-RASTER, which is the case if there is, for each discrete period ∆ z , at least one projected point per tile for each tile. Thus, S-RASTER maintains the key property of RASTER of using constant memory.

Related Work
There are several prominent streaming algorithms for density-based clustering. DUC-STREAM performs clustering based on dense unit detection [8]. Its biggest drawback, compared to S-RASTER, is that it has not been designed for handling evolving data streams. D-Stream I is a combined online/offline algorithm [5].
While it also uses a grid, the computations performed are more computationally intensive than the ones S-RASTER performs. Its runtime and memory requirements are sub-optimal. DD-Stream likewise performs density-based clustering in grids and likewise uses computationally expensive methods for clustering [9]. Both D-Stream I and DD-Stream are not suited for handling big data. D-Stream II can handle evolving data and clusters more efficiently than D-Stream [12], but it is not clear if it is suitable for big data. What furthermore distinguishes S-RASTER is that it can be effectively parallelized. This is not the case with some other standard algorithms in this domain. None of the aforementioned clustering algorithms are quite comparable to S-RASTER, however, as they retain their input. Also, their clustering methods are generally more computationally costly. Thus, S-RASTER requires less time and memory. S-RASTER is closer to those algorithms as it retains relevant input data. As it does not retain all input, S-RASTER is only inefficient with regards to memory use in highly artificial scenarios. This is the case where there is at most one point projected to any square in the grid, which implies that the algorithm parameters were poorly chosen as the basic assumption is that many points are surjected to each significant tile. Ignoring pathological cases, it can thus be stated that S-RASTER is very memory efficient. On top, it benefits from a more efficient approach to clustering. S-RASTER and S-RASTER arguably are, for the purpose of identifying density-based clusters in evolving data streams, more memory efficient than competing algorithms, and also less computationally intensive. Yet, these advantages come at the cost of reduced precision.

Future Work
We may release an implementation of S-RASTER for use in Massive Online Analysis (MOA) [3]. In addition, we may release a complete stand-alone implementation of this algorithm that can be fully integrated into a standard stream processing engine such as Apache Flink [4] or Apache Spark [15] [16]. We are keen on performing a comparative evaluation of S-RASTER versus some other algorithms for clustering evolving data streams, preferably in a standard framework such as the aforementioned MOA. This work should also include an evaluation of the parallel speed-up on many-core CPUs.