Paradigm Shift in Big Data SuperComputing: DataFlow vs. ControlFlow
© Trifunovic et al.; licensee Springer. 2015
- Received: 22 April 2014
- Accepted: 13 November 2014
- Published: 10 May 2015
The paper discusses the shift in the computing paradigm and the programming model for Big Data problems and applications. We compare the DataFlow and ControlFlow programming models in their quantitative and qualitative aspects. Big Data problems and applications that are suitable for implementation on DataFlow computers should not be evaluated with the same measures as ControlFlow computers. We propose a new benchmarking methodology that takes into account not only the execution time, but also the power and space needed to complete the task. Recent research shows that if the TOP500 ranking were based on the new performance measures, DataFlow machines would outperform ControlFlow machines. To support these claims, we present eight recent implementations of various algorithms using the DataFlow paradigm, which show considerable speed-ups, power reductions, and space savings over their ControlFlow implementations.
- Power Dissipation
- TOP500 Ranking
- Suffix Array
- Sorting Network
- Reverse Time Migration
Big Data is becoming a reality in more and more research areas every year. Big Data applications are also becoming more visible as they slowly enter areas that concern the general public. In other words, Big Data applications that were until now present mainly in highly specialized areas of research, like geophysics and financial engineering, are making their way into more general areas, like medicine and pharmacy, biology, aviation, politics, acoustics, etc.
In recent years, data volume has been growing at a higher rate than processing power. With the growing adoption of data-collecting technologies, like sensor networks, the Internet of Things, and others, the growth rate of data volume is expected to keep increasing.
Among others, one important question arises: how do we process such quantities of data? One possible answer lies in a shift of the computing paradigm and the programming model. With Big Data problems, it is often more reasonable to concentrate on the data rather than on the process. This can be achieved by employing the DataFlow computing paradigm, programming model, and computers.
The strength of DataFlow computers, compared to ControlFlow computers, lies in the fact that they accelerate data flows and application loops by a factor of 10 to 1000. How many orders of magnitude are gained depends on the amount of data reusability within the loops. This feature is enabled by compiling down to levels much below the machine code, which brings important additional effects: much lower execution time, equipment size, and power dissipation.
The above strengths can prove especially important in Big Data applications that can benefit from one or more of the DataFlow advantages. For instance:
- A daily periodic Big Data application that would not finish in time on a ControlFlow computer finishes in time on a DataFlow computer of the same equipment size and power dissipation,
- A Big Data application with limited space and/or power resources (remote locations such as ships, research stations, etc.) executes in a reasonable amount of time,
- For Big Data applications where execution time is not a prime concern, DataFlow computers can save space and energy.
The previous paper argues that the time has come to redefine TOP500 benchmarking. Concrete measurement data from real applications in geophysics, financial engineering, and some other research fields show that a DataFlow machine (for example, the Maxeler MAX series) rates better than a ControlFlow machine (for example, Cray Titan) if a different benchmark is used (e.g., a Big Data benchmark), together with a different ranking methodology (e.g., the benchmark execution time multiplied by the number of 1U boxes needed to accomplish the given execution time). Here a 1U box represents one rack unit or equivalent; it is assumed that, no matter what technology is inside, a 1U box always has the same size and always uses the same power.
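As a rough illustration of the example ranking methodology mentioned above (benchmark execution time multiplied by the number of 1U boxes), the following Python sketch orders machines by that product. The names `ranking_score` and `rank_machines`, the machine labels, and all numbers are hypothetical placeholders chosen for illustration; they are not measurements from the cited work.

```python
def ranking_score(execution_time_s, num_1u_boxes):
    """Example score: execution time multiplied by the number of 1U boxes used.

    Lower is better. This assumes, as the text does, that every 1U box
    has the same size and uses the same power.
    """
    if execution_time_s <= 0 or num_1u_boxes <= 0:
        raise ValueError("time and box count must be positive")
    return execution_time_s * num_1u_boxes


def rank_machines(machines):
    """Return machine names ordered from best (lowest score) to worst."""
    return sorted(machines, key=lambda name: ranking_score(*machines[name]))


# Hypothetical inputs: name -> (execution time in seconds, number of 1U boxes).
machines = {
    "machine_A": (100.0, 40),  # faster, but needs many boxes
    "machine_B": (250.0, 4),   # slower, but fits in few boxes
}

if __name__ == "__main__":
    for name in rank_machines(machines):
        print(name, ranking_score(*machines[name]))
```

Under such a measure, a machine that is slower in wall-clock terms can still rank higher if it reaches its result with far fewer boxes, which is the point of ranking on more than bare speed.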
In reaction to the previous paper, the scientific community has insisted that more light be shed on two issues: (a) the programming paradigm and (b) the benchmarking methodology. Consequently, this viewpoint concentrates on these two issues.
What is the fastest, the least complex, and the least power consuming way to do (Big Data) computing?
Answer: Rather than writing a program to control the flow of data through the computer, one has to write a program to configure the hardware of the computer, so that input data, when it arrives, can flow through the computer hardware in only one way: the way the hardware has been configured. This is best achieved if the serial part of the application (the transactions) continues to run on the ControlFlow host and the parallel part of the application is migrated into a DataFlow accelerator. The DataFlow part of the application performs the (parallel) Big Data crunching and the execution of loops.
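The difference between the two styles can be sketched in ordinary software. The Python fragment below is only an analogy under stated assumptions: the function names and the computation (x·x + x) are hypothetical, nothing here is a Maxeler API, and no hardware is configured. It contrasts a loop that steps through the data with a chain of stages that is fixed once, after which data simply streams through it.

```python
def control_flow(xs):
    """ControlFlow style: the program steps through the data item by item."""
    results = []
    for x in xs:
        square = x * x
        results.append(square + x)
    return results


def configure_pipeline(*stages):
    """DataFlow style (software analogy): fix the chain of stages once, up front.

    The returned function only streams data through the configured stages.
    """
    def stream_through(data):
        for stage in stages:
            # Each stage consumes the output stream of the previous stage.
            data = map(stage, data)
        return data
    return stream_through


# "Configuration" step: done once, before any data arrives.
pipeline = configure_pipeline(
    lambda x: (x, x * x),            # stage 1: compute the square, keep x with it
    lambda pair: pair[0] + pair[1],  # stage 2: add x to its square
)

if __name__ == "__main__":
    print(control_flow([1, 2, 3]))
    print(list(pipeline([1, 2, 3])))
```

Both calls produce the same values; what changes is where the structure of the computation lives. In the second form it is decided before the data arrives, which is the property a DataFlow accelerator exploits in hardware.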
The early works of Dennis and Arvind proved the concept, but did not result in commercial success, for three reasons: (a) Reconfigurable hardware technology was not yet ready. Contemporary ASICs were fast enough but not reconfigurable, while reconfigurable FPGAs did not yet exist; (b) System software technology was not yet ready. Methodologies for fast creation of system software did exist, but effective tools for large-scale efforts of this sort did not; and (c) Applications of those days were not of the Big Data type, so the streaming capabilities of the DataFlow computing model could not generate performance superiority. Recent measurements show that Maxeler can currently move over 1 TB of data per second internally.
Quantitatively speaking, the complexity of DataFlow programming, in the case of Maxeler, is equal to 2n + 3, where n refers to the number of loops migrated from the ControlFlow host to the DataFlow accelerator. This means that the following programs have to be written:
- One kernel program per loop, to map the loop onto the DataFlow hardware;
- One kernel test program per loop, to test the above;
- One manager program (no matter how many kernels there are) to move data:
  - Into the DataFlow accelerator,
  - In between the kernels (if more than one kernel exists), and
  - Out of the DataFlow accelerator;
- One simulation builder program, to test code without the time consuming migration into the binary level;
- One hardware builder program, to exploit the code on the binary level.
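The count 2n + 3 follows directly from the list above: two programs per migrated loop plus three programs that are written once. A minimal Python sketch of that arithmetic follows; the function and key names are hypothetical and serve only to label the terms.

```python
def programs_to_write(n_loops):
    """Break down the programs needed when n_loops loops are migrated."""
    if n_loops < 1:
        raise ValueError("at least one loop must be migrated")
    return {
        "kernel": n_loops,        # one kernel program per loop
        "kernel_test": n_loops,   # one kernel test program per loop
        "manager": 1,             # one manager, regardless of kernel count
        "simulation_builder": 1,
        "hardware_builder": 1,
    }


def total_programs(n_loops):
    """Total number of programs, which equals 2 * n_loops + 3."""
    return sum(programs_to_write(n_loops).values())


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, total_programs(n))
```

For a single migrated loop this gives five programs, and each further loop adds two.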
In addition, in the host program (initially written in Fortran, Hadoop, MapReduce, MATLAB, Mathematica, C++, or C), each migrated loop has to be replaced by a streaming construct (send data + receive results), represented by the automatically generated C function Calc(x, DATA_SIZE), see Figure 1 (a).
Research design and methodology
Bare speed is definitely neither the only issue of importance nor the most crucial one. Consequently, the TOP500 ranking should not concentrate on only one issue of importance, be it speed, power dissipation, or size (where size includes hardware complexity in the widest sense); it should concentrate on all three issues together, at the same time.
Note that the hardware design cost is not encompassed by the parameter A, which covers only the hardware production cost. As a result, the H formula defined above represents an upper bound for ControlFlow machines and a lower bound for DataFlow machines. This is because ControlFlow machines are built using von Neumann logic, which is complex to design (execution control unit, cache control mechanism, prediction mechanisms, etc.), while DataFlow machines are built using FPGA logic, which is simple to design, mostly because the level of design repetitiveness is extremely high. The latter is beneficial for many Big Data problems, where a large amount of data is continuously processed through the use of relatively simple operations.
As indicated in the previous paper, the performance measure H puts PetaFlops out of date and brings PetaData into focus. Consequently, if the TOP500 ranking were based on the performance measure H, DataFlow machines would outperform ControlFlow machines. This statement is backed up by the performance data presented in the next section.
A survey of recent implementations of various algorithms using the DataFlow paradigm can be found in . Future trends in the development of the DataFlow paradigm can be found in . For comparison purposes, future trends in the ControlFlow paradigm can be found in .
Lindtjorn et al. proved: (T) That one DataFlow node has the performance equivalent to about 70 twin server Nehalem CPU machines and to 14 two card Tesla GPU machines (application: Schlumberger, GeoPhysics), (P) Using 150 MHz FPGAs, and (A) Packaged as 1U.
The algorithm involved was Reverse Time Migration (RTM).
Oriato et al. proved: (T) That two DataFlow nodes have the performance equivalent to more than 1,900 3 GHz X86 CPU cores (application: ENI, the velocity-stress form of the elastic wave equation), (P) Using sixteen 150 MHz FPGAs, and (A) Packaged as 2U.
Mencer et al. proved: (T) That one DataFlow node has the performance equivalent to more than 382 Intel Xeon 2.7 GHz CPU cores (application: ENI, CRS4 Lab, Meteorological Modelling), (P) Using 150 MHz FPGAs, and (A) Packaged as 1U.
Stojanović et al. proved: (T) That one DataFlow node has the performance of about 10 i7 CPU cores, (P) Power reduction of about 17, and (A) Packaged as 1U.
Chow et al. proved: (T) That one DataFlow node has the performance of about 163 quad core CPUs, (P) Power reduction of about 170, and (A) Packaged as 1U.
Arram et al. proved: (T) That one DataFlow node has the performance of about 13 Intel X5650 20 core CPUs and about 4 NVIDIA GTX 580 GPU machines, (P) Using one 150 MHz FPGA, and (A) Packaged as 1U.
The algorithm involved (Genetic Sequence Alignment) was based on the FM-index. This index combines the properties of the suffix array (SA) with the Burrows-Wheeler transform (BWT).
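To make the relationship between the suffix array, the BWT, and FM-index style matching concrete, here is a small Python sketch. It is a naive textbook illustration, not the cited hardware design: the function names are hypothetical, the suffix array is built by plain sorting, and rank queries are answered by rescanning the BWT string instead of using the precomputed tables a real FM-index relies on. The text is assumed to end with a unique sentinel `$` that sorts before every other character.

```python
def suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sa):
    """Burrows-Wheeler transform: the character preceding each sorted suffix."""
    return "".join(text[i - 1] for i in sa)


def count_occurrences(text, pattern):
    """Count matches of `pattern` by backward search over the BWT."""
    last_col = bwt(text, suffix_array(text))
    # first_row[c] = number of characters in text strictly smaller than c.
    first_row = {}
    for idx, ch in enumerate(sorted(text)):
        first_row.setdefault(ch, idx)
    lo, hi = 0, len(text)  # half-open range of sorted suffixes still matching
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + last_col[:lo].count(ch)
        hi = first_row[ch] + last_col[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    print(sa, bwt(text, sa), count_occurrences(text, "ana"))
```

Backward search processes the pattern from its last character to its first, narrowing a range of sorted suffixes at each step. The per-character work is small and regular, which suggests why this kind of matching is a candidate for a streaming implementation.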
Guo et al. proved: (T) That one DataFlow node has the performance of about 517 Intel i3 2.93 GHz CPU cores and of about 28 GPU machines, (P) Using one 150 MHz FPGA, and (A) Packaged as 1U.
Kos et al. proved: (T) That one DataFlow node has the performance of between 100 and 400 Intel Core2 Quad 2.66 GHz CPU cores, (P) Using one 200 MHz FPGA, and (A) Packaged as 1U.
The achieved speed-up depended on N and the bit-size of numbers being sorted (between 8 and 64 bits).
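A sorting network is a fixed sequence of compare-and-swap operations that does not depend on the input values, which is why it maps naturally onto a DataFlow structure. The Python sketch below builds an odd-even transposition network as a simple generic example. This choice is an assumption made for illustration; it is not claimed to be the network used in the cited work, and the function names are hypothetical.

```python
def odd_even_transposition_network(n):
    """Return the network for n inputs as a list of layers.

    Each layer is a list of (i, j) comparator positions. Comparators within
    one layer touch disjoint positions, so they could all run in parallel.
    """
    return [
        [(i, i + 1) for i in range(layer % 2, n - 1, 2)]
        for layer in range(n)
    ]


def apply_network(values, layers):
    """Run the values through the fixed comparator layers."""
    out = list(values)
    for layer in layers:
        for i, j in layer:
            if out[i] > out[j]:  # compare-and-swap element
                out[i], out[j] = out[j], out[i]
    return out


if __name__ == "__main__":
    network = odd_even_transposition_network(6)
    print(apply_network([5, 3, 6, 1, 4, 2], network))
```

The comparator layout is fixed in advance for a given number of inputs; only the data moving through it changes. This simple network needs as many layers as it has inputs, whereas networks used in practice typically need far fewer.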
The viewpoint presented in this paper sheds more light on the recent development of the DataFlow computing concept (more details can be found in -). The DataFlow computing paradigm requires new ways of thinking and new ways of programming. In general it redefines the subordination of program and data; instead of writing a program that controls how the data flows, the data flow defines the way a program is written.
DataFlow computing excels in applications that have highly repetitive operations and some level of data reusability within those operations. The latter is particularly beneficial for many Big Data problems, where a large amount of data is repeatedly processed through the use of relatively simple operations.
The newly presented benchmarking performance measure H (defined as the number of 1U boxes needed to accomplish the desired execution time using a given Big Data benchmark) would considerably reorder the TOP500 list. If the TOP500 ranking were based on the performance measure H, DataFlow machines would outperform ControlFlow machines. This statement is backed up by the presented performance results. The results show that when DataFlow computers are used instead of ControlFlow computers, time, energy, and/or space can be saved.
The above can be of great interest to those who have to make decisions about future developments of their Big Data centers. It also opens up a new important problem: the need to develop a public cloud of ready-to-use Big Data applications.
The only remaining question is: can a Big Data application be broken into a set of tasks and operations that are easily mappable onto a DataFlow execution graph for an FPGA structure? We argue that for most Big Data applications the answer is positive!
This paper was prepared in response to over 100 e-mail messages with questions from the CACM readership inspired by our previous CACM contribution.
- Flynn M, Mencer O, Milutinovic V, Rakocevic G, Stenstrom P, Trobec R, Valero M: Moving from petaflops to petadata. Communications of the ACM 2013, 56(5):39–42.
- Dennis J, Misunas D, “A Preliminary Architecture for a Basic Data-Flow Processor”, Proceedings of the ISCA '75 the 2nd Annual Symposium on Computer Architecture, ACM, New York, NY, USA, 1975, pp. 126–132
- Agerwala T, Arvind: Data flow systems: guest editors’ introduction. IEEE Computer 1982, 15(2):10–13. 10.1109/MC.1982.1653937
- Mencer O, “Multiscale Dataflow Computing”, Proceedings of the Final Conference Supercomputacion y eCiencia, Barcelona Supercomputing Center, SyeC, Barcelona, Catalonia, Spain, May 27 - May 28, 2013
- Resch M, “Future Strategies for Supercomputing”, Proceedings of the Final Conference Supercomputacion y eCiencia, Barcelona Supercomputing Center, SyeC, Barcelona, Catalonia, Spain, May 27 - May 28, 2013
- Flynn M, “Area - Time - Power and Design Effort: The Basic Tradeoffs in Application Specific Systems”, Proceedings of the 2005 IEEE International Conference on Application-Specific Systems and Architecture Processors (ASAP'05), Samos, Greece, July 23-July 25, 2005
- Salom J, Fujii H, “Overview of Acceleration Results of Maxeler FPGA Machines”, IPSI Transactions on Internet Research, July 2013, Volume 5, Number 1, pp. 1–4
- Maxeler Technologies FrontPage, http://www.maxeler.com/content/frontpage/, London, UK, October 20, 2011.
- Patt Y: Future microprocessors: what must we do differently if we are to effectively utilize multi-core and many-core chips? IPSI Transactions on Internet Research 2009, 5(1):1–9.
- Lindtjorn O, Clapp G, Pell O, Mencer O, Flynn M, Fu H, “Beyond Traditional Microprocessors for Geoscience High-Performance Computing Applications”, IEEE Micro, Washington, USA, March/April 2011, Vol. 31, No. 2, pp. 1–9
- Oriato D, Pell O, Andreoletti C, Bienati N, “FD Modeling Beyond 70 Hz with FPGA Acceleration”, Maxeler Summary, http://www.maxeler.com/media/documents/MaxelerSummaryFDModelingBeyond70Hz.pdf, Summary of a talk presented at the SEG 2010 HPC Workshop, Denver, Colorado, USA, October 2010
- Mencer O, “Acceleration of a Meteorological Limited Area Model with Dataflow Engines”, Proceedings of the Final Conference Supercomputacion y eCiencia, Barcelona Supercomputing Center, SyeC, Barcelona, Catalonia, Spain, May 27 - May 28, 2013
- Stojanovic S et al., “One Implementation of the Gross Pitaevskii Algorithm”, Proceedings of Maxeler@SANU, MISANU, Belgrade, Serbia, April 8, 2013
- Chow G C T, Tse A H T, Jin O, Luk W, Leong P H W, Thomas D B, “A Mixed Precision Monte Carlo Methodology for Reconfigurable Accelerator Systems”, Proceedings of ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), ACM, Monterey, CA, USA, February 2012, pp. 57–66
- Arram J, Tsoi K H, Luk W, Jiang P, “Hardware Acceleration of Genetic Sequence Alignment”, Proceedings of 9th International Symposium ARC 2013, ACM, Los Angeles, CA, USA, March 25–27, 2013, pp. 13–24
- Guo C, Fu H, Luk W, “A Fully-Pipelined Expectation-Maximization Engine for Gaussian Mixture Models”, Proceedings of 2012 International Conference on Field-Programmable Technology (FPT), Seoul, S. Korea, 10–12 Dec. 2012, pp. 182–189
- Kos A, Rankovic V, Tomažic S, “Sorting Networks on the Maxeler Data Flow Super Computing Systems”, Advances in Computers, Elsevier, 225 Wyman Street, Waltham, MA 02451, USA, 2014
- Flynn M, Mencer O, Greenspon I, Milutinovic V, Stojanovic S, Sustran Z, “The Current Challenges in DataFlow Supercomputer Programming”, Proceedings of ISCA 2013, ACM, Tel Aviv, Israel, June 2013
- Seebode C, Ort M, Regenbrecht C, Peuker M: BIG DATA infrastructures for pharmaceutical research. IEEE International Conference on Big Data, California; 2013.
- Ayhan S, Pesce J, Comitz P, Sweet D, Bliesner S, Gerberick G, “Predictive analytics with aviation big data”, Integrated Communications, Navigation and Surveillance ICNS Conference, 2013, Edinburgh, UK
- Jinglan Zhang, Kai Huang, Cottman-Fields M, Truskinger A, Roe P, Shufei Duan, Xueyan Dong, Towsey M, Wimmer J, “Managing and Analysing Big Audio Data for Environmental Monitoring”, IEEE 16th International Conference on Computational Science and Engineering CSE Conference, 2013, Sydney, Australia
- Convey Computer Web Page: http://www.conveycomputer.com/technology/, October 31 2014
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.