
Table 1 Description of algorithm steps

From: Comparison of sort algorithms in Hadoop and PCJ

Step

Description

Reading pivots

Each thread reads pivots at even intervals from its own portion of the input file. PCJ Thread-0 then performs a reduce operation to gather the pivots from the other threads. The gathered list is sorted using the standard Java sort algorithm [49], and any duplicate records are removed. Evenly spaced pivots are then taken from the list and broadcast to all threads. A thread starts reading the input file once it receives the list
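The merge, sort, deduplicate, and select work that Thread-0 performs can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the paper's code: records are assumed to be `String`s, the name `selectPivots` and the even-spacing formula are hypothetical, and gathering and broadcasting through PCJ are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PivotSelection {
    // Merge the pivot candidates gathered from all threads, sort them,
    // drop duplicates, and keep at most `count` evenly spaced entries.
    // Assumption: records are Strings compared in natural order.
    static List<String> selectPivots(List<List<String>> gathered, int count) {
        List<String> all = new ArrayList<>();
        for (List<String> part : gathered) {
            all.addAll(part);
        }
        Collections.sort(all); // standard Java sort for object types

        // After sorting, duplicates are adjacent, so one pass removes them.
        List<String> unique = new ArrayList<>();
        for (String s : all) {
            if (unique.isEmpty() || !unique.get(unique.size() - 1).equals(s)) {
                unique.add(s);
            }
        }
        if (unique.size() <= count) {
            return unique; // not enough candidates to thin out
        }

        // Hypothetical spacing rule: the i-th pivot sits at i/(count+1) of the list.
        List<String> pivots = new ArrayList<>(count);
        for (int i = 1; i <= count; i++) {
            int idx = (int) ((long) i * unique.size() / (count + 1));
            pivots.add(unique.get(idx));
        }
        return pivots;
    }
}
```

With `count` pivots the data is later split into `count + 1` buckets, so in this sketch `count` would typically be one less than the number of buckets wanted.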

Reading input

Pivots are the records that divide input data into buckets. Each thread keeps its own set of buckets, which are later used to exchange data between threads. Each bucket is a list of records. While reading input, a record's bucket is deduced from the position at which the record would be inserted into the pivot list, found with Java's built-in binary search method. The record is then added to the corresponding bucket
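The bucket lookup can be illustrated with `Collections.binarySearch`, which returns the index of a match or `-(insertionPoint) - 1` when the key is absent. The sketch below is an assumption-laden illustration rather than the paper's implementation: records are `String`s, the names `bucketIndex` and `partition` are hypothetical, and the choice to place a record equal to a pivot in the bucket to the pivot's left is an arbitrary convention.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BucketAssignment {
    // Map a record to a bucket index in [0, pivots.size()].
    // `pivots` must be sorted in ascending order.
    static int bucketIndex(List<String> pivots, String record) {
        int pos = Collections.binarySearch(pivots, record);
        // Found: pos is the pivot's index (assumed convention: same bucket).
        // Not found: pos == -(insertionPoint) - 1, so recover insertionPoint.
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Distribute records into pivots.size() + 1 buckets, in reading order.
    static List<List<String>> partition(List<String> pivots, List<String> records) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= pivots.size(); i++) {
            buckets.add(new ArrayList<>());
        }
        for (String r : records) {
            buckets.get(bucketIndex(pivots, r)).add(r);
        }
        return buckets;
    }
}
```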

Exchanging buckets

After reading the input file, each thread must send the data from its buckets to the threads responsible for them. Responsibility here means sorting the bucket and writing it to the output file. After sending its bucket data to all other threads, a thread must wait until it has received data from all of them

Sorting

Once the data for every bucket has been received, sorting begins. Each bucket arrives split into smaller arrays, one array per source thread. These arrays must be flattened into a single large array of records, which is then sorted as a whole. The standard Java sort algorithm [49] for non-primitive types is used for sorting the array. This algorithm, called timsort, is a stable, adaptive, iterative mergesort whose implementation is adapted from Tim Peters's list sort for Python [50], which uses techniques from [51]
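The flatten-then-sort step can be sketched as below. This is a minimal illustration assuming `String` records and a hypothetical method name `flattenAndSort`; it relies only on `System.arraycopy` and `Arrays.sort(Object[])`, the latter being the timsort-based sort for non-primitive types.

```java
import java.util.Arrays;
import java.util.List;

class FlattenAndSort {
    // `chunks` holds one array per source thread, all belonging to one bucket.
    static String[] flattenAndSort(List<String[]> chunks) {
        int total = 0;
        for (String[] chunk : chunks) {
            total += chunk.length;
        }

        // Copy every chunk into one contiguous array.
        String[] all = new String[total];
        int offset = 0;
        for (String[] chunk : chunks) {
            System.arraycopy(chunk, 0, all, offset, chunk.length);
            offset += chunk.length;
        }

        Arrays.sort(all); // stable sort for object arrays
        return all;
    }
}
```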

Writing output

Writing the bucket data to a single output file in the correct order is the last step. This is the most sequential part of the application: each thread has to wait for its turn to write data to the output file
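The turn-taking behind the final step can be illustrated with ordinary Java threads inside one JVM. This is only a stand-in under stated assumptions: the class `OrderedWriter`, the in-memory `StringBuilder` used in place of the output file, and the monitor-based waiting are all hypothetical and say nothing about how PCJ threads actually coordinate.

```java
class OrderedWriter {
    private int turn = 0;                           // id of the thread allowed to write next
    private final StringBuilder out = new StringBuilder(); // stands in for the output file

    // Block until it is this thread's turn, append its data, then pass the turn on.
    synchronized void write(int threadId, String data) throws InterruptedException {
        while (turn != threadId) {
            wait();
        }
        out.append(data);
        turn++;
        notifyAll();
    }

    synchronized String contents() {
        return out.toString();
    }

    // Start one thread per part, deliberately in reverse order, and return the result.
    static String run(String[] parts) {
        OrderedWriter writer = new OrderedWriter();
        Thread[] threads = new Thread[parts.length];
        for (int i = parts.length - 1; i >= 0; i--) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    writer.write(id, parts[id]);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return writer.contents();
    }
}
```

Because every writer waits for all lower-numbered threads, the writes are fully serialized regardless of the order in which the threads start, which is why this step parallelizes least.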