
Table 4 Experimental dataset summary sorted by the dataset’s file size

From: Efficient spatial data partitioning for distributed \(k\)NN joins

| Dataset Name | Short Name | Summary Observations |
|---|---|---|
| OSM POI [63] | POI | 38.814 MB; 119,319 points; OpenStreetMap (OSM) points of interest; New York City (NYC) only; GPS locations of buildings, restaurants, shops ... |
| NYC Bus Trip Records [64] | BUS | 22.147 GB; 221.715 Mil. points; similar format to the TAXI dataset but denser (buses run over fewer city streets); non-uniform distribution (Fig. 1a); good for testing behavior when some locations are significantly more overloaded than others |
| NYC Taxi Trip Records [65] | TAXI | 27.738 GB; 165.114 Mil. points; non-uniform distribution (Fig. 1b); ideal for testing techniques that cannot handle the LARGE dataset |
| TLC TPEP and LPEP [65] | TLC | 141.99 GB; 3.78 Bil. points; non-uniform distribution (Fig. 1c); 10.9 Mil. duplicate records; 158.9 Mil. unmatchable records |