Classical relational databases show their limitations when facing ever-growing data set sizes, highly connected information and semi-structured incoming data that leads to sparse tables [8]. Cassandra DB is an appropriate candidate to handle data storage with the aforementioned characteristics. The challenge in this choice is the lack of a spatial query feature in the Cassandra Query Language (CQL). An overview of Cassandra and CQL, along with the proposed approach to extend its capabilities, is given in the following subsections.
Cassandra DB and CQL
Cassandra is a fully distributed, shared-nothing and highly scalable database. It was developed at Facebook, open-sourced in 2008 on Google Code and accepted as an Apache Incubator project in 2009 [16]. Cassandra DB builds on the designs of Amazon’s Dynamo and Google’s BigTable [3]. Since then, continuous development efforts have been carried out to enhance and extend its features. DataStax, Inc. and other companies provide customer support services and commercial-grade tools integrated in an enterprise edition of Cassandra DB. Cassandra clusters can run on commodity servers and even span multiple data centers, a property that gives Cassandra linear horizontal scalability. Figure 1 depicts an example of a possible physical deployment of a Cassandra cluster; the logical view of the cluster is much simpler. Indeed, the nodes of the cluster are seen as parts of a ring where each node holds some chunks of data. Rows of data are partitioned based on their primary key, which can be composed of two parts: the first, called the \(partition\ key\), is hashed by the partitioner to pick the node that stores the row; the second part is reserved for clustering and sorting the data within a given partition. A good spreading of data over a cluster should balance two apparently conflicting goals [17]:
Spreading data evenly requires a fairly high cardinality of the partition key and of the hash function output space. However, since data is scattered by partition across the cluster nodes, having a wide range of partition keys may force even simple queries to visit many nodes and, as a result, lengthen query response time. Conversely, reducing the number of partitions too much may unbalance the load between nodes and create hot-spot nodes with degraded responsiveness.
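To make this trade-off concrete, consider a minimal CQL sketch (the table and column names are illustrative, not part of the proposed schema) in which the partition key combines a device identifier and a day, so that partitions stay bounded and well spread while a typical lookup still touches a single partition:

CREATE TABLE readings_by_device_day (
    device_id text,        -- partition key component: spreads rows across nodes
    day       text,        -- partition key component: keeps partitions bounded
    ts        timestamp,   -- clustering column: sorts rows within a partition
    value     double,
    PRIMARY KEY ((device_id, day), ts)
);

Adding day to the partition key increases its cardinality (better spreading), at the price of querying several partitions when a time range spans multiple days.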
The philosophy behind Cassandra differs from that of a traditional relational database. Indeed, fast reads and writes combined with the handling of huge data volumes are paramount in Cassandra, so normalizing the data model by identifying entities and relationships is not a top priority. Instead, designing around the data access patterns and the queries to be executed against the data is what drives the data model design [17]. With this perspective, the data model generally accepts redundancy and denormalization of the stored data, and allows the storage of both simple primitive data types and composite data types such as collections (list, set, map). More recently, CQL gives the user the ability to extend the native data types with customized user defined types (UDT). Even though these UDTs are stored as blobs in the database, they can be formatted, parsed and serialized on the client side using the custom codec interface [18].
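As a minimal sketch of these composite types (the type and table names are ours, not taken from the proposed data model), a UDT and a collection column are declared in CQL as follows:

CREATE TYPE location (
    lat double,
    lon double
);

CREATE TABLE devices (
    device_id text PRIMARY KEY,
    position  frozen<location>,  -- user defined type
    tags      set<text>          -- collection type
);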
CQL version 3 offers a new feature: user defined functions (UDF). A UDF is user defined code that runs inside the Cassandra cluster. Once created, a UDF is added to the cluster schema and gossiped to all cluster nodes. UDFs are executed row by row, on the query result set, in the coordinator node; aggregations over the result set rows are also possible. Leveraging UDFs to filter the results returned by the Cassandra engine at the different nodes remains a challenging task. Indeed, the intention of the UDF feature developers was to delegate a light, per-row treatment at the cluster level on the resulting rows, and so far there is no possibility to filter based on a user defined reference or value. Extending UDFs with this feature would add more flexibility to their usage and enable pushing some computation blocks from the client to the cluster side.
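As a generic illustration (not a function used by the proposed framework, and assuming UDFs are enabled in the cluster configuration), a scalar UDF is declared in CQL 3 and then applied row by row to a query result set:

CREATE OR REPLACE FUNCTION to_fahrenheit (celsius double)
    RETURNS NULL ON NULL INPUT
    RETURNS double
    LANGUAGE java
    AS 'return celsius * 1.8 + 32;';

SELECT device_id, to_fahrenheit(value)
FROM readings_by_device_day
WHERE device_id = 'd1' AND day = '2016-03-21';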
Geohashing and spatial search
Geohash is a geocoding system that maps a latitude/longitude pair to a string by interleaving the bits resulting from an iterative subdivision of the latitude and longitude ranges. The resulting bit string is split into 5-bit groups, each mapped to a 32-character dictionary. The final string, of arbitrary length, represents a rectangular area: the longer the geohash string, the higher the precision of the rectangle. Successive characters represent nested addresses that converge to around 3.7 by 1.8 cm for a 12-character geohash string [19]. To the best of our knowledge, Cassandra DB and its query language CQL do not support spatial queries, even though the literature presents some generic indexing and search libraries, such as the Lucene-based Elasticsearch [20] and Solr [21] Java libraries. In this contribution, we leverage the geohash technique to label Cassandra-stored rows and efficiently retrieve those within a user defined area of interest. This behavioral extension of Cassandra and CQL is illustrated in Fig. 2, which details the spatial data storage and retrieval phases accomplished through the following three steps:
-
First, every row, when stored, is labeled with a numeric geohash value computed from its latitude/longitude values. A numeric geohash is required in the proposed approach because of the limitation of CQL operators on the String type: the WHERE clause of native CQL offers only equality operators for textual types, so range queries over strings are not possible. Handling events associated with a zone rather than a point may be addressed in future work.
-
Second, the queried area is decomposed into geohashes of different precision levels. The biggest geohash box that fits into the queried area is computed first; then the remaining area is filled with smaller geohashes until the area of interest is fully covered. Since range queries are not straightforward within a space-filling Z-order curve [22], a query aggregation algorithm that groups neighboring geohashes is developed so that the number of generated queries is reduced.
-
Finally, the original query, expressed in a new spatial CQL syntax, is decomposed into a number of sub-queries that are executed sequentially or in parallel before the result sets are aggregated and returned to the client.
Let us further explain the above steps. We assume a Cassandra cluster whose stored data takes the form of events generated by distributed connected devices or external systems. The events are time-stamped and location-tagged. During the storage phase, the event’s timestamp and location are parsed to extract the day, month and year and to compute the corresponding numeric geohash (\(gh\_numeric\)). The event is stored with these attributes as primary key components: day as partition key, month and year as clustering keys, and \(gh\_numeric\) as either part of the clustering keys or a secondary indexed column. Such a primary key is efficient in this specific context, where events are generally looked up within time ranges and areas of interest; in a different context the primary key structure may differ while remaining in line with the Cassandra query-driven data model design. The conversion of the geohash value from its string representation to a numeric representation enables querying ranges of events based on their \(gh\_numeric\) attributes. Indeed, since geohash values represent rectangles rather than exact geopoint locations, the binary string of the geohash value is padded on the right with ‘0’ bits to get the \(min\_value\) and with ‘1’ bits to get the \(max\_value\), representing the south-west and north-east rectangle corners, respectively. Every location within this bounding rectangle has a \(gh\_numeric\) between \(min\_value\) and \(max\_value\). Hence, the spatial range query is reduced to a numeric range query, which is allowed by CQL, instead of a string-based range query, which is not.
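Assuming the primary key layout described above (day as partition key; month, year and \(gh\_numeric\) as clustering columns), each geohash box covering the area of interest would translate into one CQL sub-query of roughly the following form (a sketch of ours; the CQL actually generated by the framework may differ):

SELECT * FROM events
WHERE day = ? AND month = ? AND year = ?
  AND gh_numeric >= ?   -- min_value of the covering geohash box
  AND gh_numeric <= ?;  -- max_value of the covering geohash box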
As depicted in Fig. 3, the binary representation of the string geohash is extended with ‘0’ or ‘1’ bits up to a predefined bit depth to get the geohash \(min\_value\) and \(max\_value\), respectively. For example, the geohash ths has a binary representation over 15 bits (each character maps to 5 bits). If the binary extension is done over 52 bits, this geohash has a \(min\_value\) of 3592104487944192 and a \(max\_value\) of 3592241926897663. With this approach, each geohash that fits inside the ths bounding box has a numeric value between \(min\_value\) and \(max\_value\).
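These two bounds follow directly from the base-32 encoding: with the standard geohash alphabet, the characters t, h and s map to 25, 16 and 24, so \(ths = 25 \cdot 2^{10} + 16 \cdot 2^{5} + 24 = 26136\), \(min\_value = 26136 \cdot 2^{52-15} = 3592104487944192\) and \(max\_value = min\_value + 2^{37} - 1 = 3592241926897663\).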
Nevertheless, queries are not always over perfectly adjacent and ordered fences: a queried area can have an arbitrary shape, and the covering geohashes may be of different precisions. Once a spatial query is received from the user, the original query is decomposed into sub-queries based on the bounding boxes of the resulting geohashes. The number of resulting sub-queries might be relatively high, which may increase the query response time. To reduce the number of sub-queries, a query aggregation algorithm is developed (Algorithm 1); its integration in the query path is illustrated in Fig. 4.
The main role of the algorithm is to optimize the sub-queries sent to the cluster and to reduce the search space. The optimization derives from the removal of redundant or nested bounding boxes. For example, if a geohash is contained in another one, the containing geohash is kept and the smaller one is removed, because its results are already covered by the query of the containing geohash, as depicted in Fig. 5a. Likewise, geohashes of the same length can be reduced to a single parent geohash if their union fills its whole content, as illustrated in Fig. 5b. The last step aggregates neighboring geohashes that cannot be reduced to a parent geohash, either because they only partly fill its content or because they belong to different parent geohashes, provided they keep a total order on their global \(min\_value\) and \(max\_value\); this means that no other geohash, belonging to the list of geohashes to be aggregated and not yet grouped, falls in the range limited by this \(min\_value\) and \(max\_value\). This latter case is illustrated in Fig. 5c, where the green area composed of seven geohashes is aggregated into three ordered sets (yellow, orange, and purple). At this stage, the sub-queries can be sent to the coordinator nodes, based on the partition key, in either a parallel or a sequential scheme.
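As a concrete illustration of this aggregation step (the numeric values reuse the ths box from above and its numerically adjacent box, chosen for convenience rather than taken from Fig. 5), two boxes whose numeric ranges are contiguous are served by one merged sub-query instead of two:

-- before aggregation: one range sub-query per geohash box
SELECT * FROM events WHERE day = ? AND month = ? AND year = ?
  AND gh_numeric >= 3592104487944192 AND gh_numeric <= 3592241926897663;
SELECT * FROM events WHERE day = ? AND month = ? AND year = ?
  AND gh_numeric >= 3592241926897664 AND gh_numeric <= 3592379365851135;
-- after aggregation: the two contiguous ranges merged into a single sub-query
SELECT * FROM events WHERE day = ? AND month = ? AND year = ?
  AND gh_numeric >= 3592104487944192 AND gh_numeric <= 3592379365851135;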
Spatial search queries
For analytics purposes, several types of spatial queries can be of interest. Some of them cover simple area shapes; others may target more complicated ones. In the following, we pick three basic queries which can be seen as a base for computing others. But let us first define the table against which the queries are written.
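The listing below is a plausible reconstruction of that table, consistent with the primary key layout discussed in the previous subsection (the exact column names and types used in the implementation may differ):

CREATE TABLE events (
    day        int,
    month      int,
    year       int,
    gh_numeric bigint,
    event_id   timeuuid,
    lat        double,
    lon        double,
    payload    text,
    PRIMARY KEY ((day), month, year, gh_numeric, event_id)
);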
In this schema, \(gh\_numeric\) is part of the clustering key, over which Cassandra creates an index by default. In the following, we consider three examples of spatial queries:
-
\(Around\_me\): looking around a focal point is a recurrent scenario. Events around an important point of interest, with no predefined geometric shape, can be discovered by giving only a geocoded location and a radius of a circle centered at the point of interest. Fine- or coarse-grained lookup can be adjusted through the radius value, and the output is retrieved accordingly. For discovery and update purposes, querying events around a location gives a 360° view of the surrounding area. Below is a spatial query illustration of the \(Around\_me\) type, where SFUNC represents the spatial function taking two parameters: the first is of String type and defines the query type; the second, \(circle\_params\), is a container (list) holding the center coordinates, the radius length and a maximum geohash precision value.
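A plausible rendering of this syntax over the events table sketched above (the query type label, coordinates and parameter order are illustrative) is:

SELECT * FROM events
WHERE day = ? AND month = ? AND year = ?
  AND SFUNC('around_me', [36.8065, 10.1815, 500, 7]);
-- circle_params = [center_lat, center_lon, radius, max_geohash_precision]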
-
\(In\ a\ given\ area\): predefined zones such as countries, districts or municipalities are subject to frequent queries in a spatial database context, and non-predefined areas such as arbitrary polygonal zones can be queried as well. Interested scientists may want to quantify, aggregate and visualize statistics of some criteria over the supervised area. By providing a list of geocoded locations forming a closed polygon, the present framework takes care of the rest. The following query example illustrates this syntax, where \(polyg\_params\) is a container holding a list of latitude/longitude pairs and a maximum geohash precision length.
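Again as an illustration (the query type label and coordinates are ours), such a polygon query could take the form:

SELECT * FROM events
WHERE day = ? AND month = ? AND year = ?
  AND SFUNC('in_area', [36.80, 10.16, 36.84, 10.16, 36.84, 10.22, 36.80, 10.22, 7]);
-- polyg_params = [lat1, lon1, ..., latN, lonN, max_geohash_precision]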
-
\(In\ my\ path\): another type of spatial query of particular importance concerns events within a road segment, along the path of a tracked vehicle or in a traffic stream. Trying to bound the path with a polygon may not be the best option; providing way-points and a precision level of search is more self-explanatory. This kind of query can be very useful for analytics purposes, but also for real-time support of emergency fleets. An example of the \(In\ my\ path\) query syntax follows, where \(path\_params\) is a container holding a list of way-point coordinates and a geohash precision value.
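An illustrative rendering (the query type label and way-points are ours) is:

SELECT * FROM events
WHERE day = ? AND month = ? AND year = ?
  AND SFUNC('in_my_path', [36.80, 10.17, 36.81, 10.20, 36.83, 10.24, 6]);
-- path_params = [wp1_lat, wp1_lon, wp2_lat, wp2_lon, ..., geohash_precision]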
The syntax is simple and intuitive and preserves the native CQL syntax, which means that native non-spatial queries can be executed either directly or through the developed spatial framework.