Efficient spatial data partitioning for distributed $k$NN joins

Journal of Big Data

Table 4 Experimental dataset summary sorted by the dataset’s file size

Dataset Name	Short Name	Summary	Observations
OSM POI [63]	POI	• 38.814 MB • 119,319 points	• OpenStreetMap (OSM) points of interest • New York City (NYC) only • GPS locations of buildings, restaurants, shops, ...
NYC Bus Trip Records [64]	BUS	• 22.147 GB • 221.715 million points	• Similar format to the TAXI dataset but denser (buses run over fewer city streets) • Non-uniform distribution (Fig. 1a) • Good for testing behavior when some locations are significantly more overloaded than others
NYC Taxi Trip Records [65]	TAXI	• 27.738 GB • 165.114 million points	• Non-uniform distribution (Fig. 1b) • Ideal for testing techniques that cannot handle the LARGE dataset
TLC TPEP and LPEP [65]	TLC	• 141.99 GB • 3.78 billion points	• Non-uniform distribution (Fig. 1c) • 10.9 million duplicate records • 158.9 million unmatchable records
