An AppGallery for dataflow computing
© Trifunovic et al. 2016
Received: 11 November 2015
Accepted: 22 December 2015
Published: 18 February 2016
This paper describes the vision behind, and the mission of, the Maxeler Application Gallery (AppGallery.Maxeler.com) project. First, it concentrates on the essence and performance advantages of the Maxeler dataflow approach. Second, it reviews the support technologies that enable the dataflow approach to achieve its maximum. Third, selected examples from the Maxeler Application Gallery are presented; these examples are treated as the final achievement made possible when all the support technologies work together (the internal infrastructure of AppGallery.Maxeler.com is given in a follow-up paper). Finally, the possible impact of the Application Gallery is discussed and the major conclusions are drawn.
Keywords: Dataflow, Supercomputing, Maxeler, Applications
As a rule, every paradigm-shift idea passes through four phases in its lifetime. In phase #1, the idea is ridiculed (people laugh at it). In phase #2, the idea is attacked (some people aggressively try to destroy it). In phase #3, the idea is accepted (most of those who attacked it now keep telling everyone that it was their idea, or at least that they were always supportive of it). In the final phase #4, the idea is considered as something that has existed forever (those who played roles in the initial phases are already dead, physically or professionally, by the time the fourth phase starts).
The main question for each paradigm-shift research effort is how to make the first two phases as short as possible. In the rest of this text, this goal is referred to as the "New Paradigm Acceptance Acceleration" goal, or the NPAA goal for short.
The Maxeler Application Gallery project is an attempt to achieve the NPAA goal in the case of dataflow supercomputing. The dataflow paradigm exists on several levels of computing abstraction. The Maxeler Application Gallery project concentrates on the dataflow approach that accelerates critical loops by forming customized execution graphs that map onto a reconfigurable infrastructure (currently FPGA-based). This approach provides considerable speedups over the existing control-flow approaches, unprecedented power savings, as well as a significant size reduction of the overall supercomputer. This said, however, dataflow supercomputing is still not widely accepted, due to all kinds of barriers, ranging from the NIH syndrome (Not Invented Here), through 2G2BT (Too Good To Be True), to the AOC syndrome (Afraid Of Change), all widely present in the high-performance community.
The technical infrastructure of the Maxeler Application Gallery project is described in a follow-up paper. The paper at hand describes: (1) the essence of the paradigm that the Maxeler Application Gallery project is devoted to, (2) the whereabouts of the applications used to demonstrate the superiority of the dataflow approach over the control-flow approach, and (3) the methodologies used to attract as many educators and researchers as possible to become believers in this "new" and promising paradigm shift.
The major references of relevance for this paper concern: trading off latency and performance; splitting temporal computing (with mostly sequential logic) and spatial computing (with combinational logic only); the Maxeler dataflow programming paradigm; selected killer applications of dataflow computing; the essence of the paradigm; algorithms proven successful in the dataflow paradigm; and the impacts of the theories of Cutler, Kolmogorov, Feynman, and Prigogine. See [10–12] for issues in arithmetic, reconfigurability, and testing.
The dataflow supercomputing paradigm essence
At the time when the von Neumann paradigm for computing was formed, the technology was such that the ratio of arithmetic or logic (ALU) operation latencies over the communication (COMM) delays to memory or another processor, t(ALU)/t(COMM), was extremely large (sometimes argued to be approaching infinity).
In his famous lectures on computation, the Nobel Laureate Richard Feynman observed that, in theory, ALU operations could be done with zero energy, while communication can never reach zero energy levels, and that the speed and the energy of computing can be traded against each other. In other words, this means that, in practice, future technologies will be characterized by an extremely large t(COMM)/t(ALU) (in theory, t(COMM)/t(ALU) approaching infinity), which is exactly the opposite of the situation at the time of von Neumann. That is why a number of pioneers in dataflow supercomputing accepted the term Feynman Paradigm for the approach utilized by Maxeler computing systems. Feynman never worked in dataflow computing, but his observations (along with his involvement in the Connection Machine design) made many believe in the great future of the dataflow computing paradigm.
The main technology-related question now is: "Where is the ratio of t(ALU) to t(COMM) now?"
According to the above-mentioned sources, the energy needed for an IEEE double-precision floating-point multiplication will be only about 10 pJ around the year 2020, while the energy needed to move the result from one core to another core of a multi-core or many-core machine is about 2000 pJ, a factor of 200×. On the other hand, moving the same result over an almost-zero-length edge in a Maxeler dataflow machine may, in some cases, take less than 2 pJ, a factor of 0.2×.
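The quoted figures can be turned into the two ratios directly; the following back-of-the-envelope check in Python uses the projected values from the text, which are estimates rather than measurements:

```python
# Energy figures quoted in the text (projected ~2020 values; illustrative only).
E_FLOP_PJ = 10            # IEEE double-precision floating-point multiplication
E_CORE_TO_CORE_PJ = 2000  # moving the result between cores of a multi-core chip
E_DATAFLOW_EDGE_PJ = 2    # moving the result over a short edge in a dataflow engine

# Communication cost relative to the arithmetic operation itself:
control_flow_factor = E_CORE_TO_CORE_PJ / E_FLOP_PJ   # 200x
dataflow_factor = E_DATAFLOW_EDGE_PJ / E_FLOP_PJ      # 0.2x

print(control_flow_factor, dataflow_factor)  # -> 200.0 0.2
```

In other words, under these projections communication dominates arithmetic by three orders of magnitude in the control-flow case, while in the dataflow case it falls below the cost of the arithmetic itself.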
Therefore, the time has arrived for the technology to enable the dataflow approach to be effective. Here the term technology refers to the combination of: the hardware technology, the programming paradigm and its support technologies, the compiler technology, and the code analysis, development, and testing technology. More light is shed on these technologies in the next section.
The paradigm support technologies
As far as the hardware technology is concerned, a Maxeler dataflow machine is either a conventional computer with a Maxeler PCIe board (e.g., Galava, Isca, Maia, etc., all collectively referred to as DataFlow Engines, or DFEs), or a 1U box that can be stacked into a 40U rack, or into a train of 40U racks.
As far as the programming paradigm and its support technologies are concerned, Maxeler dataflow machines are programmed in space, rather than in time, as is the case with typical control-flow machines. In the case of Maxeler (and the Maxeler Application Gallery project), the general framework is referred to as OpenSPL (Open Spatial Programming Language, http://www.OpenSPL.org), and its specific implementation used for the development of AppGallery.Maxeler.com applications is called MaxJ (short for Maxeler Java).
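MaxJ code itself requires the MaxCompiler toolchain to run, so, as an illustration only, here is a plain-Python model of the kind of streaming kernel one writes in space: a 3-tap moving average. In a dataflow engine the three taps would be separate hardware units wired together in space, producing one output per cycle, rather than iterations of a loop in time (the function name and structure here are illustrative, not Maxeler API):

```python
def moving_average_3(stream):
    """Software model of a 3-tap streaming kernel: each output combines the
    previous, current, and next stream element. In a DFE the three taps are
    laid out in space and one output leaves the pipeline every cycle."""
    out = []
    for i in range(1, len(stream) - 1):
        out.append((stream[i - 1] + stream[i] + stream[i + 1]) / 3)
    return out

print(moving_average_3([1, 2, 3, 4, 5]))  # -> [2.0, 3.0, 4.0]
```

The point of the spatial view is that the loop above disappears: the compiler unrolls the data path into hardware, and time reappears only as data streaming through it.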
As far as compilation is concerned, it consists of two parts: (a) generating the execution graph, and (b) mapping the execution graph onto the underlying reconfigurable infrastructure. The Maxeler compiler (MaxCompiler) is responsible for the first part; the synthesis tools of the FPGA device vendor are responsible for the second. VHDL is used as the interface between these two distinct phases.
As far as the synergy of the compiler and the internal "nano-accelerators" is concerned (small add-ons at the hardware-structure level, invisible at the level of the static dataflow concept and to the system designer, but under the tight control of the compiler), the most illustrative fact is that code compiled for a generic board based on Altera or Xilinx chips is much slower than code compiled for a Maxeler board with the same type and number of Altera or Xilinx chips. The difference stems from the many "nano-accelerators" present on the Maxeler board: they are invisible at the concept level and to the dataflow programmer, but crucial for the implementation and fully visible to the Maxeler compiler, which knows how to utilize them.
As far as the tools for analysis, development, and testing are concerned, they have to take care of the following issues: (a) when the paradigm changes, the algorithms that were previously the best by some criterion may no longer be, so candidate algorithms have to be re-evaluated, or new ones created; (b) when the paradigm changes, the usage of the new paradigm has to be made simple, so that the community accepts it with the shortest possible delay; and (c) when the paradigm changes, the testing tools have to be made compatible with the needs of the new paradigm and with the types of errors that occur most frequently.
The best example of the first is bitonic sort, which is considered among the slowest sorting algorithms in control flow and among the fastest in dataflow. The best example of the second is WebIDE.Maxeler.com, with all its features to support programmers. The most appropriate example of the third is the ongoing research aimed at including a machine-learning assistant that analyses past errors and offers hints for future testing-oriented work.
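Bitonic sort illustrates the first point well: it is a data-independent network of compare-exchange operations, so every stage maps naturally onto hardware laid out in space, even though sequentially it costs O(n log² n) comparisons, more than a good control-flow sort. A minimal Python sketch of the network (power-of-two input lengths only; a software model, not dataflow code):

```python
def bitonic_merge(xs, ascending):
    """Merge a bitonic sequence into a monotonic one."""
    if len(xs) <= 1:
        return list(xs)
    half = len(xs) // 2
    xs = list(xs)
    for i in range(half):  # one "column" of independent compare-exchanges
        if (xs[i] > xs[i + half]) == ascending:
            xs[i], xs[i + half] = xs[i + half], xs[i]
    return bitonic_merge(xs[:half], ascending) + bitonic_merge(xs[half:], ascending)

def bitonic_sort(xs, ascending=True):
    """Sort by recursively building bitonic sequences and merging them.
    Because the comparison pattern never depends on the data, every column
    of compare-exchanges can be laid out in space and run in parallel."""
    if len(xs) <= 1:
        return list(xs)
    half = len(xs) // 2
    first = bitonic_sort(xs[:half], True)    # ascending half
    second = bitonic_sort(xs[half:], False)  # descending half -> bitonic whole
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([7, 3, 6, 2, 8, 1, 5, 4]))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

The fixed, data-independent wiring is exactly what makes the algorithm slow on a sequential machine and fast on a spatial one.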
All the above support technologies have to be aware of the following: if temporal and spatial computing are separated (hybrid computing with only combinational logic in the spatial part), and if latency is traded for better performance (maximal acceleration even for the most sophisticated types of data dependencies between loop iterations), the benefits materialize only if the support technologies are able to exploit them!
Presentation of the Maxeler application gallery project
The next section of this paper contains selected examples from the Maxeler Application Gallery project, visible on the web (AppGallery.Maxeler.com). Each example is presented using the same template.
The presentation template, wherever possible, includes the following elements: (a) the whereabouts, (b) the essence of the algorithm implementation, supported with a figure, (c) the coding changes made necessary by the paradigm change, supported by GitHub details, (d) the major highlights, supported with a figure, (e) the advantages and drawbacks, (f) the future trends, and (g) conclusions related to complexity, performance, energy, and risks, where newly created risks recursively open a new round of all the above.
Discussion and evaluation
Selected Maxeler application gallery examples
The following algorithms are briefly explained: brain network, correlation, classification, dense matrix multiplication, fast Fourier transform 1D, fast Fourier transform 2D, motion estimation, hybrid coin miner, high-speed packet capture, breast mammogram ROI extraction, linear regression, and n-body simulation. These twelve applications were selected based on the speed-ups they provide. All of them were published recently; however, they represent the results of decade-long efforts.
Although various methods help in transforming algorithms from control flow to dataflow, a programmer has to be aware of the main characteristics of the underlying hardware in order to produce optimal code. For example, if one has to process each row of a matrix separately using dataflow engines (DFEs), where each element depends on the previous one in the same row, the programmer should reorder the execution so that other rows are processed before the result of processing one element is fed to the next element of the same row.
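The row-interleaving idea can be sketched in plain Python. A running sum along each row carries a loop dependency; streaming the matrix column-by-column interleaves the rows, so that on a deep pipeline each row's previous result is ready by the time that row comes around again (the function name and the streaming order below are illustrative, not Maxeler API):

```python
def row_prefix_sums_interleaved(matrix):
    """Each output element depends on the previous element of the SAME row.
    Visiting the matrix column-by-column interleaves the rows: consecutive
    'ticks' touch different rows, hiding the per-row dependency latency."""
    rows, cols = len(matrix), len(matrix[0])
    acc = [0] * rows                          # one running value per row
    out = [[0] * cols for _ in range(rows)]
    for c in range(cols):                     # stream order: column-major
        for r in range(rows):                 # = interleaved rows
            acc[r] += matrix[r][c]
            out[r][c] = acc[r]
    return out

print(row_prefix_sums_interleaved([[1, 2, 3], [4, 5, 6]]))
# -> [[1, 3, 6], [4, 9, 15]]
```

The result is identical to processing each row on its own; only the visiting order changes, which is precisely the kind of hardware-aware reordering the text describes.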
Therefore, one can say that changing the paradigm requires changing the hardware in order to produce results more quickly with lower power consumption, but it also requires changing our way of thinking. To do that, we have to think big, like Samuel Slater, known as "The Father of the American Factory System". Instead of thinking about how we would solve a single piece of the problem ourselves, we should think about how a group of specialized workers would solve the same problem, and then focus on the production system as a whole, imagining workers on a factory line. For the state of the art, see [15–16].
Dense matrix multiplication
Fast Fourier transform 1D
Fast Fourier transform 2D
Hybrid coin miner
High speed packet capture
Breast mammogram ROI extraction
The major assumption behind the strategic decision to develop AppGallery.Maxeler.com is that it helps the community understand and adopt the dataflow approach. The best way to achieve this goal is to provide developers with access to many examples that are properly described and that clearly demonstrate the benefits. Of course, the academic community is the one that most readily accepts new ways of thinking, and it is the primary target here.
The impact of AppGallery.Maxeler.com is best judged by the rate at which PhD students all over the world have started submitting their own contributions, after undergoing a proper education and after being able to study the examples generated with the direct involvement of Maxeler. The education of students is now based on the MaxelerWebIDE (described in a follow-up paper) and on the Maxeler Application Gallery. One prominent example is the "Computing in Space with OpenSPL" course at Imperial College, taught in the Fall of 2014 for the first time (http://cc.doc.ic.ac.uk/openspl14/).
Another important contribution of the Maxeler Application Gallery project is to make the community understand that the essence of the term "optimization" in mathematics has changed. Before, the goal of optimization was to minimize the number of iterations until convergence (e.g., by minimizing the number of sequential steps involved) and to minimize the time for one loop iteration (e.g., by eliminating time-consuming arithmetic operations, or by decreasing their number). This was so because, as indicated before, the ratio of t(ALU) over t(COMM) was very large.
Now that the ratio of t(ALU) over t(COMM) is becoming very small, the goal of optimization is to minimize the physical lengths of the edges in the execution graph, which is a challenge (from the mathematics viewpoint) of a different nature. Another challenge is to create execution-graph topologies that map onto FPGA topologies with minimal deterioration. This issue is visible in most of the Maxeler Application Gallery examples.
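As a toy illustration of the changed objective (with hypothetical coordinates, not an actual place-and-route flow), the quantity being minimized is the total physical wire length of the execution graph once its nodes are placed on a 2D fabric:

```python
# Hypothetical 4-node execution graph; edges carry data between operations.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

def total_wire_length(placement):
    """Sum of Manhattan distances over all edges; placement maps each
    node to an (x, y) coordinate on the reconfigurable fabric."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1])
               for a, b in edges)

spread_out = {0: (0, 0), 1: (3, 0), 2: (3, 3), 3: (0, 3)}  # long edges
compact    = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}  # short edges

print(total_wire_length(spread_out), total_wire_length(compact))  # -> 12 4
```

The same graph, differently placed, costs three times more in wire length; in the dataflow regime that translates directly into communication time and energy, which is why placement, not operation count, becomes the mathematical objective.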
Finally, the existence of the Maxeler Application Gallery enables different machines to be compared, as well as different algorithms for the same application to be studied comparatively. Such a possibility creates excellent ground for effective scientific work.
This paper presented the Maxeler Application Gallery project, its vision and mission, as well as a selected number of examples. All of the examples were presented using a uniform template, which enables easy comprehension of the underlying new dataflow paradigm.
This work, and the Maxeler Application Gallery it presents, can be used in education, in research, in the development of new applications, and in demonstrating the advantages of the new dataflow paradigm. Combined with the MaxelerWebIDE, the Maxeler Application Gallery creates a number of synergistic effects (the most important of which is the possibility of learning by trying one's own ideas).
Future research directions include more sophisticated, intuitive, and graphical tools that enable easy development of new applications, as well as their easy incorporation into the Maxeler Application Gallery presentation structure. In addition, a significant improvement of the debugging tools is underway.
1U refers to the minimal height of rack-mount servers; the number (×U) indicates the size of the rack-mount unit.
Abbreviations
NPAA: new paradigm acceptance acceleration
FPGA: field-programmable gate array
NIH: not invented here
2G2BT: too good to be true
AOC: afraid of change
ALU: arithmetic logic unit
COMM: communication
PCIe: peripheral component interconnect express
1U: the minimal size in rack-mount servers
CPU: central processing unit
ROM: read-only memory
FFT: fast Fourier transform
CUDA: compute unified device architecture
JDFE: Juniper dataflow engine
LMem: Maxeler DRAM memory
ROI: region of interest
NT developed the website http://appgallery.maxeler.com/#/ and also developed some of the presented applications. VM coordinated the work and wrote all chapters up to 3.1.1, as well as the impact and conclusions sections. NK described the presented applications. GG participated in the organization of the AppGallery project and of the applications available on the web site. All authors read and approved the final manuscript.
Nemanja Trifunovic is with Maxeler, London, UK. He holds a diploma from the School of Electrical Engineering, University of Belgrade, Serbia, where he was among the fastest-graduating 1 % of students at the faculty. He was an intern at Google, NY, USA, at the Microsoft Development Center, Serbia, and at Maxeler, London, UK.
Prof. Veljko Milutinovic, Fellow of the IEEE, received his PhD from the University of Belgrade, spent about a decade in various faculty positions in the USA (mostly at Purdue University), and was a co-designer of DARPA's first GaAs RISC microprocessor. He now teaches and conducts research at the University of Belgrade, in EE, MATH, and PHY/CHEM. His research is mostly in data-mining algorithms and dataflow computing, with the stress on mapping data analytics algorithms onto fast, energy-efficient architectures. For 7 of his books, forewords were written by 7 different Nobel Laureates with whom he cooperated on his past industry-sponsored projects. He has over 60 IEEE or ACM journal papers and about 4000 Google Scholar citations (including all misspellings of his name). He is a member of Academia Europaea. He has taught dataflow courses at Purdue, Indiana, Harvard, and MIT.
Nenad Korolija is with the faculty of the School of Electrical Engineering, University of Belgrade, Serbia. He received an MSc degree in electrical engineering and computer science in 2009. He was a teaching assistant for the Software Project Management and Business Communication practical courses. His interests and experience include developing software for high-performance computer architectures and dataflow architectures. During 2008, he worked on the HiPEAC FP7 project at the University of Siena, Italy. In 2013, he was an intern at Google Inc., Mountain View, California, USA.
Georgi Gaydadjiev is a VP of Maxeler Technologies and a professor at the Chalmers University of Technology, Sweden. He is a senior member of the IEEE and the ACM, and has served as general and program chair of many IEEE conferences. His awards include the best paper award at the ACM/SIGARCH 24th International Conference on Supercomputing (ICS'10), Tsukuba, Japan, June 2010, and a Ramón y Cajal award.
The authors declare that they have no competing interests.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Trifunovic N, et al. The MaxGallery project. In: Advances in computers, vol 104. Springer; 2016.
- Milutinovic V. Trading latency and performance: a new algorithm for adaptive equalization. IEEE Transactions on Communications. 1985.
- Milutinovic V. Splitting temporal and spatial computing: enabling a combinational dataflow in hardware. The ISCA ACM Tutorial on Advances in SuperComputing, Santa Margherita Ligure, Italy; 1995.
- Mencer O, Gaydadjiev G, Flynn M. OpenSPL: the Maxeler programming model for programming in space. UK: Maxeler Technologies; 2012.
- Flynn M, Mencer O, Milutinovic V, Rakocevic G, Stenstrom P, Valero M, Trobec R. Moving from PetaFlops to PetaData. Communications of the ACM. 2013. pp. 39–43.
- Milutinovic V. The dataflow processing. Amsterdam: Elsevier; 2015.
- Milutinovic V, Trifunovic N, Salom J, Giorgi R. The guide to dataflow supercomputing. USA: Springer; 2015.
- The Maxeler dataflow supercomputing. The IBM TJ Watson Lecture Series. 2015.
- The Maxeler dataflow supercomputing: guest editor introduction. In: Advances in computers, vol 104. Springer; 2016.
- Vassiliadis S, Wong S, Gaydadjiev G, Bertels K, Kuzmanov G. The Molen polymorphic processor. IEEE Trans Comput. 2004;53(11):1363–75.
- Dou Y, Vassiliadis S, Kuzmanov GK, Gaydadjiev GN. 64-bit floating-point FPGA matrix multiplication. In: Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays; 2005. pp. 86–95.
- van de Goor AJ, Gaydadjiev GN, Mikitjuk VG, Yarmolik VN. March LR: a test for realistic linked faults. In: Proceedings of the 14th VLSI Test Symposium; 1996. pp. 272–280.
- Miranker V. Space-time representations of computational structures. IEEE Computer. 1984;32.
- Kuhn P. Transforming algorithms for single-stage and VLSI architectures. In: Proceedings of the Workshop on Interconnection Networks for Parallel and Distributed Processing; 1980.
- Oriato D, Girdlestone S, Mencer O. Dataflow computing in extreme performance conditions. In: Hurson A, Milutinovic V, editors. Dataflow processing. Advances in computers, vol 96. Waltham: Academic Press; 2015. pp. 105–138.
- Stojanović S, Bojić D, Bojović M. An overview of selected heterogeneous and reconfigurable architectures. In: Hurson A, Milutinovic V, editors. Dataflow processing. Advances in computers, vol 96. Waltham: Academic Press; 2015. pp. 1–45.