Table 4 Experimental dataset summary sorted by the dataset’s file size

From: Efficient spatial data partitioning for distributed \(k\)NN joins

| Dataset Name | Short Name | Summary | Observations |
| --- | --- | --- | --- |
| OSM POI [63] | POI | 38.814 MB; 119,319 points | Open Street Map (OSM) points of interest; New York City (NYC) only; GPS locations of buildings, restaurants, shops ... |
| NYC Bus Trip Records [64] | BUS | 22.147 GB; 221.715 Mil. points | Similar format to the TAXI dataset but denser (buses run over fewer city streets); non-uniform distribution (Fig. 1a); good for testing behavior when some locations are significantly more heavily loaded than others |
| NYC Taxi Trip Records [65] | TAXI | 27.738 GB; 165.114 Mil. points | Non-uniform distribution (Fig. 1b); ideal for testing techniques that cannot handle the LARGE dataset |
| TLC TPEP and LPEP [65] | TLC | 141.99 GB; 3.78 Bil. points | Non-uniform distribution (Fig. 1c); 10.9 Mil. duplicate records; 158.9 Mil. unmatchable records |
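
Table 4 is ordered by file size, which is not the same as ordering by point count: BUS holds more points than TAXI in a smaller file. A rough way to see this is to divide each file size by its point count. The sketch below does that with the figures from the table; the decimal unit conversion (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and the names `DATASETS` and `bytes_per_point` are assumptions made for illustration, not part of the paper.

```python
# Figures copied from Table 4 as (file size in bytes, number of points).
# Assumption: decimal units, i.e. 1 MB = 10**6 bytes and 1 GB = 10**9 bytes.
DATASETS = {
    "POI": (38.814e6, 119_319),
    "BUS": (22.147e9, 221.715e6),
    "TAXI": (27.738e9, 165.114e6),
    "TLC": (141.99e9, 3.78e9),
}


def bytes_per_point(size_bytes, points):
    """Return the average number of file bytes per point (a rough ratio only)."""
    return size_bytes / points


if __name__ == "__main__":
    for name, (size, points) in DATASETS.items():
        ratio = bytes_per_point(size, points)
        print(f"{name}: about {ratio:.0f} bytes per point")
```

The ratio is only indicative, since the table does not say how points map to stored records or which file format each dataset uses.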