The performance results have been obtained using the Cray XC40 systems at ICM (University of Warsaw, Poland) and HLRS (University of Stuttgart, Germany). The computing nodes (boards) are equipped with two Intel Xeon E5-2690 v3 (ICM) or Intel Haswell E5-2680 (HLRS) processors; each processor contains 12 cores. In both cases, hyperthreading is available (2 threads per core). Both systems use the Cray Aries interconnect. The PCJ library has also been tested on other architectures such as POWER8 or Intel KNL [27]. However, we decided to present here results obtained using Cray XC40 systems, since one of the first exascale systems will be a continuation of this architecture [28]. We have used Java 1.8.0_51 from Oracle for PCJ and Oracle JDK 10.0.2 for APGAS. For C/MPI we have used the Cray MPICH implementation in versions 8.3 and 8.4 for the ICM and HLRS machines, respectively. We have used OpenMPI in version 4.0.0, which provides Java bindings for MPI, to collect data for the Java/MPI execution.
HPC workloads
2D stencil
As an example of a 2D stencil algorithm we have used the Game of Life, which can be seen as a typical 9-point 2D stencil—the 2D Moore neighborhood. The Game of Life is a cellular automaton devised by John Conway [29]. In our implementation [30] the board is not infinite—it has a maximum width and height. Each thread owns a subboard—a part of the board divided in a uniform way using block distribution. Although there are known fast algorithms and optimizations that can save computational time when generating the next universe state, like Hashlife or remembering the changed cells, we have decided to use a straightforward implementation with a lookup of the state of each cell. However, to save memory, each cell is represented as a single bit, where 0 and 1 mean that the cell is dead or alive, respectively.
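The following sketch illustrates the bit-per-cell representation and the straightforward state lookup described above. It is a simplified, single-threaded fragment: the class name, the array layout, and the board dimensions are illustrative and do not reproduce the actual PCJ implementation.

```java
/** Simplified, single-threaded sketch of the bit-per-cell Game of Life update. */
public class LifeStep {
    // Each row is packed into longs: bit (c & 63) of board[r][c / 64] stores cell (r, c).
    static int cell(long[][] board, int r, int c, int rows, int cols) {
        if (r < 0 || r >= rows || c < 0 || c >= cols) {
            return 0;                       // cells behind the universe edge are dead
        }
        return (int) ((board[r][c >>> 6] >>> (c & 63)) & 1L);
    }

    static long[][] nextState(long[][] board, int rows, int cols) {
        long[][] next = new long[rows][(cols + 63) / 64];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int alive = 0;              // count the 8 Moore neighbours
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        if (dr != 0 || dc != 0) {
                            alive += cell(board, r + dr, c + dc, rows, cols);
                        }
                    }
                }
                boolean live = cell(board, r, c, rows, cols) == 1;
                // Conway's rules: survival with 2 or 3 neighbours, birth with exactly 3.
                if ((live && (alive == 2 || alive == 3)) || (!live && alive == 3)) {
                    next[r][c >>> 6] |= 1L << (c & 63);
                }
            }
        }
        return next;
    }
}
```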
After generating the new universe state, the border cells of the subboards are exchanged asynchronously between neighbouring threads. The threads that own cells in the first and last columns and rows of the universe do not exchange the cell state with the threads on the opposite side. The state of neighbour cells that would lie beyond the universe edge is treated as dead.
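A minimal sketch of such an asynchronous halo exchange with the PCJ library is given below. For brevity only the exchange of the top and bottom border rows of a row-wise decomposition is shown; the shared variable names (ghostNorth, ghostSouth) and the row length are illustrative.

```java
import org.pcj.PCJ;
import org.pcj.RegisterStorage;
import org.pcj.StartPoint;
import org.pcj.Storage;

@RegisterStorage(HaloExchange.Shared.class)
public class HaloExchange implements StartPoint {

    @Storage(HaloExchange.class)
    enum Shared { ghostNorth, ghostSouth }

    // Halo rows filled by the neighbouring threads (illustrative size).
    long[] ghostNorth = new long[1024];
    long[] ghostSouth = new long[1024];

    // Own border rows to be sent to the neighbours.
    long[] firstRow = new long[1024];
    long[] lastRow = new long[1024];

    @Override
    public void main() {
        int me = PCJ.myId();
        int up = me - 1;                       // thread owning the subboard above
        int down = me + 1;                     // thread owning the subboard below

        // Asynchronously place own border rows into the neighbours' halo buffers.
        if (up >= 0) {
            PCJ.asyncPut(firstRow, up, Shared.ghostSouth);
        }
        if (down < PCJ.threadCount()) {
            PCJ.asyncPut(lastRow, down, Shared.ghostNorth);
        }

        // Wait until the local halo buffers have been filled by the neighbours.
        if (up >= 0) {
            PCJ.waitFor(Shared.ghostNorth);
        }
        if (down < PCJ.threadCount()) {
            PCJ.waitFor(Shared.ghostSouth);
        }
    }
}
```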
We have measured the performance as the total number of cells processed per unit of time (\(cells/s\)). For each test, we performed 11 time steps. We warmed up the Java Virtual Machine to allow it to optimize the run with Just-in-Time (JIT) compilation instead of executing in interpreted mode. We also ensured that the Garbage Collector (GC) did not have a significant impact on the measured performance. To do so, we took the peak performance (the maximum over the per-step performance values) of the whole simulation. We have used 48 working threads per node.
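The measurement itself can be summarized by the following fragment, which builds on the LifeStep sketch above (the subboard size is illustrative and the actual benchmarking code differs): the performance of each step is computed separately and the peak value is reported.

```java
int width = 4096, height = 4096;              // illustrative local subboard size
long[][] board = new long[height][(width + 63) / 64];
long totalCells = (long) width * height;      // cells processed in one step
double peakCellsPerSecond = 0.0;

for (int step = 0; step < 11; step++) {
    long start = System.nanoTime();
    board = LifeStep.nextState(board, height, width);   // one Game of Life step
    double seconds = (System.nanoTime() - start) / 1e9;
    // Taking the maximum over steps filters out JIT warm-up and occasional GC pauses.
    peakCellsPerSecond = Math.max(peakCellsPerSecond, totalCells / seconds);
}
```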
Figure 5 presents a performance comparison of the Game of Life applications for a \({604{,}800}\times{604{,}800}\) cells universe. The performance of both implementations (PCJ and Java/MPI) is very similar and shows almost ideal scalability. The C/MPI version presents 3-times higher performance and similar scalability. The performance data shows scalability up to 100,000 threads (on 2048 nodes). For a larger number of threads, the parallel efficiency decreases because the workload run on each processor becomes small compared to the communication time required for the halo exchange. The scaling results obtained in the weak scaling mode (i.e. with a constant amount of work allocated to each thread regardless of the number of threads) show good scalability beyond the 100,000 thread limit [10]. The ideal scaling dashed line for PCJ is plotted for reference. The presented results show the ability to run large-scale HPC applications using Java and the PCJ library.
The inset in Fig. 5 presents the performance statistics calculated based on 11 time steps of the Game of Life application executed on 256 nodes (12,288 threads). The ends of the whiskers are the minimum and maximum values, a cross (\(\times\)) represents the average value, a box represents values between the 1st and 3rd quartiles, and the band inside the box is the median value. In the case of C/MPI, the box and whiskers are not visible, as the execution shows the same performance in all of the execution steps. In the case of the JVM executions (PCJ and Java/MPI), the minimum values come from the very first steps of the execution, when the code was run in interpreted mode. However, the JIT compilation quickly optimized the run and the vast majority of steps were run at the highest performance. It is clearly visible that Java applications, after JIT compilation, have very stable performance, as the maximum, median, and 1st and 3rd quartile values are almost indistinguishable in the figure.
Fast Fourier Transform
The main difficulty in efficient parallelization of the FFT comes from the global character of the algorithm, which involves extensive all-to-all communication. One of the efficient distributed FFT implementations available is based on the algorithm published by Takahashi and Kanada [31]. It is used as the reference MPI implementation in the HPC Challenge Benchmark [32], a well-known suite of tests for assessing the performance of HPC systems. This implementation is treated as the baseline for the tests of the PCJ version described herein (itself based on [33]), with the performance of the all-to-all exchange being the key factor.
In the case of the PCJ code [34] we have chosen, as a starting point, the PGAS implementation developed for Coarray Fortran 2.0 [33]. The original Fortran algorithm uses a radix-2 binary exchange algorithm that aims to reduce interprocess communication and is structured as follows: firstly, a local FFT calculation is performed based on the bit-reversing permutation of the input data; after this step all threads perform a data transposition from block to cyclic layout, thus allowing for subsequent local FFT computations; finally, a reverse transposition restores the data to its original block layout [33]. Similarly to the Random Access implementation, inter-thread communication is therefore localized in the all-to-all routine that is used for the global conversion of the data layout, from block to cyclic and vice versa. Such an implementation allows one to limit the communication, yet makes the implementation of the all-to-all exchange once again central to the overall program's performance.
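The block-to-cyclic conversion that drives the all-to-all exchange boils down to a simple index mapping: in the block layout, thread \(t\) of \(p\) threads owns global indices \(t\cdot n/p, \ldots, (t+1)\cdot n/p - 1\), while in the cyclic layout global index \(g\) belongs to thread \(g \bmod p\) at local position \(\lfloor g/p \rfloor\). The helper below is an illustrative sketch of this mapping; the names are ours and not part of the actual FFT code.

```java
/** Illustrative index mapping used when redistributing data from block to cyclic layout. */
public class BlockToCyclic {

    /** Thread that owns the given global element in the cyclic layout. */
    static int cyclicThread(long globalIndex, int threadCount) {
        return (int) (globalIndex % threadCount);          // round-robin owner
    }

    /** Local position of the given global element on its cyclic-layout owner. */
    static int cyclicLocalIndex(long globalIndex, int threadCount) {
        return (int) (globalIndex / threadCount);
    }

    /** Global index of the i-th local element of thread t in the block layout
        (assuming the number of threads divides the problem size n). */
    static long blockGlobalIndex(int t, int i, long n, int threadCount) {
        long blockSize = n / threadCount;
        return t * blockSize + i;
    }
}
```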
The results for a complex one-dimensional FFT of \(2^{30}\) elements (Fig. 6) show how the three alternative PCJ all-to-all implementations compare in terms of scalability. The blocking and non-blocking ones iterate through all other threads to read data from their shared memory areas (PcjFutures are used in the non-blocking version). The hypercube-based communication utilizes a series of pairwise exchanges to avoid network congestion. While non-blocking communication achieved the best peak performance, the hypercube-based solution exploited the available computational resources to the greatest extent, reaching peak performance for 4096 threads compared to 1024 threads in the case of non-blocking communication. The Java/MPI code uses the same algorithm as PCJ for the calculation, and the all-to-all exchange is implemented using the native MPI primitive. The scalability of the PCJ implementation follows the results of the reference C/MPI code as well as those of Java/MPI. The total execution time for Java is larger when compared to the all-native implementation, irrespective of the underlying communication library. The presented results confirm that the performance and scalability of the PCJ and Java/MPI implementations are similar. The PCJ library is easier to use, less error prone, and does not require libraries external to Java such as MPI. Therefore, it is a good alternative to MPI. The Java implementations are slower than HPCC, which is implemented in C. This comes from the different ways of storing and accessing data.
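A common way to organize such congestion-avoiding pairwise exchanges is to pair threads by XOR-ing their identifiers, so that in every step each thread communicates with exactly one peer. The sketch below illustrates this schedule; it is a simplified stand-in for the actual PCJ all-to-all code, and the shared variable passed in (holding, for every thread, the block destined for each of the other threads) is hypothetical.

```java
import org.pcj.PCJ;
import org.pcj.PcjFuture;

/** Illustrative pairwise-exchange schedule for the all-to-all redistribution. */
public class PairwiseAllToAll {

    static void allToAll(double[][] received, Enum<?> blocksVariable) {
        int me = PCJ.myId();
        int p = PCJ.threadCount();              // assumed to be a power of two

        for (int step = 1; step < p; step++) {
            int partner = me ^ step;            // each thread talks to exactly one peer per step
            // Read the block that the partner prepared for us (one-sided communication).
            PcjFuture<double[]> future = PCJ.asyncGet(partner, blocksVariable, me);
            received[partner] = future.get();   // PcjFuture.get() waits for completion
        }
    }
}
```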
Data analytics
WordCount
WordCount is traditionally used for demonstrative purposes to showcase the basics of the map-reduce programming paradigm. It works by reading an input file on a line-by-line basis and counting individual word occurrences (the map phase). The reduction is performed by summing the partial results calculated by the worker threads. The full source code of the application is available at GitHub [35]. Herein the comparison between PCJ's and APGAS's performance is presented, with the C++/MPI version shown as a baseline. \(\hbox {APGAS}_{stat}\) is the basic implementation, while \(\hbox {APGAS}_{dyn}\) is a version enhanced with dynamic load-balancing capabilities. The APGAS library, as well as its implementation of the WordCount code, is based on prior work [25]. The APGAS code was run using SLURM in Multiple Programs, Multiple Data (MPMD) mode, as the commands used to start the computation and the remote APGAS places differ. For a given number of threads, a range of node counts was tested and the best achieved results are presented. Due to APGAS's requirements, Oracle JDK 10.0.2 was used in all cases. The tests use the 3.3 MB UTF-8 encoded text of the English translation of Tolstoy's War and Peace as the textual corpus for the word counting code. They were performed in the strong scalability regime, with the input file being read 4096 times and all threads reading the same file. The file content is not preloaded into the application memory before the benchmark.
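The map phase can be sketched as follows; this is an illustrative fragment, and the actual benchmark code [35] differs in details such as file handling and the exact regular expression.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

/** Illustrative map phase: count word occurrences in a UTF-8 text file. */
public class WordCountMap {

    static Map<String, Long> countWords(String path) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            // Regular-expression tokenization, as used consistently in all compared codes.
            for (String word : line.split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }
}
```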
The performance of the reduction phase is key for the overall performance [10], and the best results in the case of PCJ are obtained using binary tree communication. The APGAS solution uses the reduction as implemented in [25] (that work reports worse performance of PCJ due to the use of a simpler and therefore less efficient reduction scheme).
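A binary tree reduction of the partial word counts can be sketched with PCJ as below; the shared variable name (childCounts) and the merging details are illustrative rather than the exact benchmark code.

```java
import java.util.HashMap;
import java.util.Map;
import org.pcj.PCJ;
import org.pcj.RegisterStorage;
import org.pcj.StartPoint;
import org.pcj.Storage;

@RegisterStorage(TreeReduce.Shared.class)
public class TreeReduce implements StartPoint {

    @Storage(TreeReduce.class)
    enum Shared { childCounts }

    // Slot 0 receives the partial result of the left child, slot 1 of the right child.
    @SuppressWarnings("unchecked")
    Map<String, Long>[] childCounts = new Map[2];

    Map<String, Long> localCounts = new HashMap<>();   // filled by the map phase

    @Override
    public void main() {
        int me = PCJ.myId();
        int left = 2 * me + 1;
        int right = 2 * me + 2;
        int children = (left < PCJ.threadCount() ? 1 : 0)
                     + (right < PCJ.threadCount() ? 1 : 0);

        // Wait until all existing children have delivered their partial results...
        if (children > 0) {
            PCJ.waitFor(Shared.childCounts, children);
        }
        for (int slot = 0; slot < children; slot++) {
            childCounts[slot].forEach((word, count) ->
                    localCounts.merge(word, count, Long::sum));
        }

        // ...and pass the merged result to the parent; thread 0 ends up with the final counts.
        if (me != 0) {
            int parent = (me - 1) / 2;
            PCJ.put(localCounts, parent, Shared.childCounts, (me - 1) % 2);
        }
    }
}
```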
The results presented in Fig. 7 show good scalability of the PCJ implementation. PCJ's performance was better than APGAS's, which can be traced to PCJ's reduction implementation. Regarding the native code, C++ was chosen as a language better suited to this task than C because of its built-in map primitives and higher-level string manipulation routines. While the C++ code scales ideally, its poor performance when measured in absolute time can be traced back to the implementation of line tokenizing. All the codes (PCJ, APGAS, C++), in line with our earlier works [5], consistently use regular expressions for this task.
One should note that a different set of results, obtained on a Hadoop cluster, shows that the PCJ implementation is at least 3 times faster than the Hadoop [5] and Spark [25] ones.
Artificial Intelligence workloads
AI is currently a vibrant area of research, gaining a lot from advances in the processing capabilities of modern hardware. The PCJ library was tested in the area of artificial intelligence to ensure that it provides AI workloads with sufficient processing potential, able to exploit the future exascale systems. In this respect, two types of workloads were considered. Firstly, stemming from the traditional mode of AI research aimed at discovering the inner workings of real physiological systems, the library was used to aid researchers in the task of modeling the C. Elegans neuronal circuitry. Secondly, it was used to power the training of a modern artificial neural network, distributing the gradient descent calculations.
Neural networks—modeling the connectome of C. Elegans
The nematode C. Elegans is a model organism whose neuronal development has been studied extensively; it remains the only organism with a fully known connectome. There are currently experiments that aim to link its structure with the actual behavior of the worm. In one of those experiments, the worm's motoric neurons were ablated using a laser, causing changes in its movement patterns [36]. The results of those experiments allowed a biophysics expert to create a mathematical model of the relevant connectome fragment. The model was defined by a set of ordinary differential equations with 8 parameters.
The values of those parameters were key to the model's accuracy, yet they were impossible to calculate using traditional numerical or analytical methods. Therefore, a differential evolution algorithm was used to explore the solution space and fit the model's parameters so that its predictions are in line with the empirical data. The mathematical model has been implemented in Java and parallelized with the use of the PCJ library [36, 37]. It should be noted that the library allowed us to rapidly (in ca. 2 months) prototype the connectome model and align it according to the shifting requirements of the biophysics expert.
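For reference, a single differential evolution step for one candidate vector is sketched below. It follows the generic DE/rand/1/bin scheme; the control parameters F and CR and the fitness function are illustrative and not the values used in the connectome study.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Illustrative DE/rand/1/bin step: mutation, binomial crossover and selection. */
public class DifferentialEvolutionStep {

    static final double F = 0.5;     // illustrative differential weight
    static final double CR = 0.9;    // illustrative crossover probability

    static double[] evolve(double[] target, double[] a, double[] b, double[] c,
                           ToDoubleFunction<double[]> fitness, Random rnd) {
        int d = target.length;       // dimensionality (here: 8 model parameters)
        double[] trial = new double[d];
        int forced = rnd.nextInt(d); // at least one component comes from the mutant

        for (int j = 0; j < d; j++) {
            double mutant = a[j] + F * (b[j] - c[j]);            // mutation
            trial[j] = (rnd.nextDouble() < CR || j == forced)    // binomial crossover
                    ? mutant : target[j];
        }
        // Selection: keep the trial vector only if it fits the empirical data better
        // (here: a lower fitness value is better).
        return fitness.applyAsDouble(trial) < fitness.applyAsDouble(target)
                ? trial : target;
    }
}
```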
Regarding the implementation's performance, Fig. 8 can be consulted, where it is expressed as the number of tested configurations per second. The experimental dataset amounted to a population of 5 candidate vectors affiliated with each thread, evaluated through 5 iterations in a weak scaling regime. Scaling close to the ideal was achieved irrespective of the hyperthreading status, as its overhead in this scenario is minimal. The outlier visible in the case of 192 threads is most probably due to the stochastic nature of the differential evolution algorithm and disparities in the model evaluation time for concrete sets of parameters.
Distributed neural network training
The PCJ library was also tested in workloads specific to modern machine learning applications. It was successfully integrated with TensorFlow for the distribution of gradient descent operation for effective training of neural networks [38], performing very well against the Python/C/MPI-based state-of-the-art solution, Horovod [39].
For presentation purposes, a simple network consisting of three fully connected layers (of 300, 100 and 10 neurons, respectively [40]) was trained for handwritten digit recognition for 20 epochs (i.e. for a fixed number of iterations) on the MNIST dataset [41] (composed of 60,000 training images, of which 5000 were set aside for validation purposes in this test), with the mini-batch consisting of 50 images. Two algorithms were tested with PCJ. The first one uses the same general idea for gradient descent calculations as Horovod (i.e. data-parallel calculations are performed process-wise, and the gradients are averaged after each mini-batch). The second one implements asynchronous parallel gradient descent as described in [42].
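The synchronous, data-parallel variant boils down to averaging the locally computed gradients after every mini-batch before the weight update; a simplified sketch is given below. Here allReduceSum is a placeholder for the reduction described in the next paragraph, and the class and method names are illustrative.

```java
/** Illustrative synchronous data-parallel gradient descent step. */
public class SynchronousStep {

    static void applyMiniBatch(double[] weights, double[] localGradient,
                               double learningRate, int workers) {
        // Sum the gradients computed by all workers for this mini-batch
        // (allReduceSum is a stand-in for the hypercube allreduce sketched below).
        double[] globalGradient = allReduceSum(localGradient);

        for (int i = 0; i < weights.length; i++) {
            // Average over the workers, then take a gradient descent step.
            weights[i] -= learningRate * globalGradient[i] / workers;
        }
    }

    static double[] allReduceSum(double[] local) {
        // Placeholder: in the distributed code this performs the allreduce;
        // here it simply returns the local gradient unchanged.
        return local.clone();
    }
}
```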
Implementation-wise, Horovod supplies the user with a simple-to-use Python package with wrappers and hooks that allow enhancing existing code with distributed capabilities; MPI is used for interprocess communication. In the case of PCJ, a special runner was coded in Java with the use of TensorFlow's Java API for the distribution and instrumentation of the training calculations. The relevant changes had to be implemented in the Python code as well. Our code implements the reduction operation based on the hypercube allreduce algorithm [43].
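The hypercube allreduce proceeds in \(\log_2 p\) steps: in step \(d\) every thread exchanges its current partial sum with the partner whose identifier differs only in bit \(d\) and adds the received vector to its own. The sketch below illustrates the pattern; the exchange interface is an illustrative stand-in for the PCJ communication used in the actual code.

```java
/** Illustrative hypercube allreduce: after log2(p) steps every thread
    holds the element-wise sum of all local vectors. */
public class HypercubeAllreduce {

    /** Exchange abstraction: send own buffer to the partner, return the partner's buffer. */
    interface PairwiseExchange {
        double[] exchange(int partner, double[] send);
    }

    static double[] allReduceSum(double[] local, int myId, int threadCount,
                                 PairwiseExchange comm) {
        double[] sum = local.clone();
        // threadCount is assumed to be a power of two.
        for (int mask = 1; mask < threadCount; mask <<= 1) {
            int partner = myId ^ mask;                 // differs from myId in exactly one bit
            double[] received = comm.exchange(partner, sum);
            for (int i = 0; i < sum.length; i++) {
                sum[i] += received[i];                 // accumulate the partner's partial sum
            }
        }
        return sum;
    }
}
```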
The calculations were performed using the Cray XC40 system at ICM with Python 3.6.1 installed alongside TensorFlow v. 1.13.0-rc1. Horovod version 0.16.0 was installed with the use of Python's pip tool. SLURM was used to start the distributed calculations, with one TensorFlow process per node. We have used 48 working threads per node.
The results in the strong scalability regime presented in Fig. 9 show that the PCJ implementation that facilitates asynchronicity is on a par with the MPI-based Horovod. In the case of smaller training data sizes, when a larger number of nodes is used, our implementation is at a disadvantage in terms of accuracy. This is because the overall calculation time is small and the communication routines are not able to finish before the threads finish their local training. The datapoint for 3072 threads (64 nodes) was thus omitted for the asynchronous case in Fig. 9. Achieving the full performance of Horovod on our cluster was only possible after using a non-standard configuration of the available TensorFlow installation. This, in turn, allowed us to fully exploit intra-node parallelism with the use of the Math Kernel Library (MKL). TensorFlow for Java, available as a Maven package, did not exhibit the need for this fine-tuning, as it does not use MKL for computation.
The presented results clearly show that PCJ can be efficiently used for the parallelization of AI workloads. Moreover, the use of the Java language allows for easy integration with existing applications and frameworks. In this case, PCJ allowed for easier deployment of the most efficient configuration of TensorFlow on an HPC cluster.
Future work
From the very beginning, the PCJ library has been using sockets for transferring data between nodes. This design was straightforward; however, it precludes full utilization of novel communication hardware such as the Cray Aries or InfiniBand interconnects. There is ongoing work to use such technologies in PCJ. This is especially important for network-intensive applications. However, we are looking for Java interfaces that can simplify the integration. DiSNI [44] or jVerbs [45] seem to be good choices; however, both are based on a specific implementation of the communication and their usage by the PCJ library is not easy. There are also attempts to speed up data access in Java using Remote Direct Memory Access (RDMA) technology [46, 47]. We are investigating how to use it in the PCJ library.
Another reason for low communication performance is the problem of data copying during the send and receive process. This cannot be avoided due to the design of Java: technologies based on zero-copy and direct memory access do not work in this case. This is an important issue not only for the PCJ library but for Java in general.
As one of the main principles of the PCJ library is not to depend on any additional libraries, PCJ uses the standard Java object serialization mechanism to make a complete copy of an object. There is ongoing work that would allow using external serialization or cloning libraries, like Kryo, which could speed up making a copy of the data.
The current development of the PCJ library is focused on code execution on multiple, multicore processors. Whilst the Cray XC40 is representative of most of the current TOP500 systems, of which only 20% are equipped with Graphics Processing Units (GPUs), the peta- and exascale systems are heterogeneous and, in addition to CPU nodes, contain accelerators such as GPUs, Field-Programmable Gate Arrays (FPGAs), and others. The PCJ library supports accelerators through JNI mechanisms. In particular, one can use JCuda to run Compute Unified Device Architecture (CUDA) kernels on the accelerators. This mechanism has been verified experimentally; the performance results are in preparation. Similarly, nothing precludes the already existing PCJ-TensorFlow code from using TensorFlow's GPU exploitation capabilities.
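As an illustration of this approach, the sketch below launches a precompiled CUDA kernel from Java via the JCuda driver-API bindings. The PTX file name and kernel name are hypothetical; in the PCJ setting, each PCJ thread would drive its own device in this way.

```java
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import static jcuda.driver.JCudaDriver.*;

/** Illustrative JCuda fragment: run a precompiled CUDA kernel from Java. */
public class JCudaSketch {

    public static void main(String[] args) {
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);                      // e.g. the GPU assigned to this PCJ thread
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load a kernel compiled beforehand to PTX (file and kernel names are hypothetical).
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "scale.ptx");
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "scale");

        int n = 1 << 20;
        float[] hostData = new float[n];
        CUdeviceptr deviceData = new CUdeviceptr();
        cuMemAlloc(deviceData, (long) n * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceData, Pointer.to(hostData), (long) n * Sizeof.FLOAT);

        Pointer kernelParams = Pointer.to(
                Pointer.to(new int[]{n}),
                Pointer.to(deviceData));
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        cuLaunchKernel(function,
                gridSize, 1, 1,      // grid dimensions
                blockSize, 1, 1,     // block dimensions
                0, null,             // shared memory size, stream
                kernelParams, null);
        cuCtxSynchronize();

        cuMemcpyDtoH(Pointer.to(hostData), deviceData, (long) n * Sizeof.FLOAT);
        cuMemFree(deviceData);
    }
}
```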