Large-scale distributed LBFGS
 Maryam M. Najafabadi^{1},
 Taghi M. Khoshgoftaar^{1},
 Flavio Villanustre^{2} and
 John Holt^{2}
Received: 24 April 2017
Accepted: 6 July 2017
Published: 17 July 2017
Abstract
With the increasing demand for examining and extracting patterns from massive amounts of data, it is critical to be able to train large models to fulfill the needs that recent advances in the machine learning area create. LBFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is a numerical optimization method that has been effectively used for parameter estimation to train various machine learning models. As the number of parameters increases, implementing this algorithm on a single machine can be insufficient because of the limited computational resources available. In this paper, we present a parallelized implementation of the LBFGS algorithm on a distributed system consisting of a cluster of commodity computing machines. We use the open source HPCC Systems (High-Performance Computing Cluster) platform as the underlying distributed system to implement the LBFGS algorithm. We first provide an overview of the HPCC Systems framework and how it allows for the parallel and distributed computations important for Big Data analytics, and then explain our implementation of the LBFGS algorithm on this platform. Our experimental results show that our large-scale implementation of the LBFGS algorithm can scale from training models with millions of parameters to models with billions of parameters by simply increasing the number of commodity computational nodes.
Keywords
Large-scale LBFGS implementation, Parallel and distributed processing, HPCC Systems
Introduction
A wide range of machine learning algorithms use optimization methods to train the model parameters [1]. In these algorithms, the training phase is formulated as an optimization problem. An objective function, created based on the parameters, needs to be optimized to train the model. An optimization method finds parameter values which minimize the objective function. New advances in the machine learning area, such as deep learning [2], have made the interplay between the optimization methods and machine learning one of the most important aspects of advanced computational science. Optimization methods are proving to be vital in order to train models which are able to extract information and patterns from huge volumes of data.
With the recent interest in Big Data analytics, it is critical to be able to scale machine learning techniques to train large-scale models [3]. In addition, recent breakthroughs in representation learning and deep learning show that large models dramatically improve performance [4]. As the number of model parameters increases, classic implementations of optimization methods on a single machine are no longer feasible. Many applications require solving optimization problems with a large number of parameters. Problems of this scale are very common in the Big Data era [5–7]. Therefore, it is important to study the problem of large-scale optimization on distributed systems.
One of the optimization methods extensively employed in machine learning is stochastic gradient descent (SGD) [8, 9]. SGD is simple to implement and works fast when the number of training instances is high, as SGD does not use the whole training data in each iteration. However, SGD has its drawbacks: hyperparameters such as the learning rate or the convergence criteria need to be tuned manually. If one is not familiar with the application at hand, it can be very difficult to determine a good learning rate or convergence criteria. A standard approach is to train the model with different hyperparameters and test them on a validation dataset. The hyperparameters which give the best performance on the validation dataset are picked. Considering that the search space for SGD hyperparameters can be large, this approach can be computationally expensive and time consuming, especially for large-scale optimization.
Batch methods such as the LBFGS algorithm, combined with a line search method [10] to automatically find the learning rate, are usually more stable and easier to check for convergence than SGD [11]. LBFGS uses approximate second-order information, which provides faster convergence toward the minimum. It is a popular algorithm for parameter estimation in machine learning and some works have shown its effectiveness over other optimization algorithms [11–13].
In a large-scale model, the parameters, their gradients, and the LBFGS historical vectors are too large to fit in the memory of a single computational machine. The computations also become too expensive for a single processor to handle. Due to this, there is a need for distributed computational platforms which allow parallelized implementations of advanced machine learning algorithms. Consequently, it is important to scale and parallelize LBFGS effectively in a distributed system to train a large-scale model.
In this paper, we explain a parallelized implementation of the LBFGS algorithm on the HPCC Systems platform. HPCC Systems is an open source, massive parallel-processing computing platform for Big Data processing and analytics [14]. The HPCC Systems platform provides a distributed file storage system based on hardware clusters of commodity servers, system software, parallel application processing, and parallel programming development tools in an integrated system.
Another notable existing large-scale tool for distributed implementations is MapReduce [15] and its open source implementation, Hadoop [16]. However, MapReduce was designed for parallel processing and is ill-suited for the iterative computations inherent in optimization algorithms [4, 17]. HPCC Systems allows for parallelized iterative computations without the need to add any new framework over the current platform and without the limitation of adapting the algorithm to a specific platform (such as MapReduce key-value pairs).
Our approach in implementing LBFGS over the HPCC Systems platform distributes the parameter vector over many computing nodes. Therefore, a larger number of parameters can be handled by increasing the number of computational nodes. This makes our approach more scalable than the typical approaches to parallelized implementations of optimization algorithms, where the global gradient is computed by aggregating the local gradients which are computed on many machines [18]. In those approaches, each machine maintains the whole parameter vector in order to calculate the local gradients on a specific subset of data examples. Thus, handling a larger number of parameters requires increasing the memory on each computational node, which makes these approaches harder or even infeasible to scale when the number of parameters is very large. In contrast, our approach can scale to handle a very large number of parameters by simply increasing the number of commodity computational nodes (for example, by increasing the number of instances on an Amazon Web Services cluster).
The remainder of this paper is organized as follows. In section “Related work”, we discuss related work on the topic of distributed implementation of optimization algorithms. Section “HPCC systems platform” explains the HPCC Systems platform and how it provides capabilities for a distributed implementation of the LBFGS algorithm. Section “LBFGS algorithm” provides theoretical details of the LBFGS algorithm. In section “Implementation of LBFGS on HPCC Systems”, we explain our implementation details. In section “Results”, we provide our experimental results. Finally, in section “Conclusion and discussion”, we conclude our work and provide suggestions for future research.
Related work
Optimization algorithms are at the heart of many modern machine learning algorithms [19]. Some works have explored the scaling of optimization algorithms to build large-scale models with numerous parameters through distributed computing and parallelization [9, 18, 20, 21]. These methods focus on linear, convex models where global gradients are obtained by adding up the local gradients which are calculated on each computational node. The main limitation of these solutions is that each computational node needs to store the whole parameter vector to be able to calculate the local gradients. This can be infeasible when the number of parameters is very large. In another study, Niu et al. [22] only focus on optimization problems where the gradient is sparse, meaning that most gradient updates only modify small parts of the parameter vector. Such solutions are not general and can only work for a subset of problems.
The works most related to ours are [18] and [4]. Agarwal et al. [18] present a system for learning linear predictors with convex losses on a cluster of 1000 machines. The key component in their system is a communication infrastructure called AllReduce which accumulates and broadcasts values over all nodes. They developed an implementation that is compatible with Hadoop. Each node maintains a local copy of the parameter vector. The LBFGS algorithm runs locally on each node to accumulate the gradient values locally and the global gradient is obtained by AllReduce. This restricts the parameter vector size to the available memory on a single node. Due to this constraint, their solution only works up to 16 million parameters.
Dean et al. [4] present the Sandblaster batch optimization framework for distributed implementation of LBFGS. The key idea is to have a centralized sharded parameter server where the parameter vector is stored and manipulated in a distributed manner. To implement distributed LBFGS, a coordinator process issues commands which are performed independently on each parameter vector shard. Our approach also utilizes a vector partitioning method to store the parameter vector across multiple machines. Unlike [18], where the number of parameters is limited to the available memory on one machine, the parameter vector is distributed on many machines which increases the number of parameters that can be stored.
The approach presented in [4] requires a new framework with a parameter server and a coordinator to implement batch optimization algorithms. The approach presented in [18] requires the AllReduce platform on top of MapReduce. In contrast, we do not design or add any new framework on top of the HPCC Systems platform for our implementations. The HPCC Systems platform provides a framework for a general solution for large-scale processing which is not limited to a specific implementation. It allows manipulation of the data locally on each node (similar to the parameter server in [4]). The computational commands are sent to all the computational nodes by a master node (similar to the coordinator approach in [4]). It also allows for aggregating and broadcasting the result globally (similar to AllReduce in [18]). Having all these capabilities makes the HPCC Systems platform well suited for parallel and large-scale computations. Since it is an open source platform, it allows practitioners to implement parallelized and distributed computations on large amounts of data without the need to design their own specific distributed platform.
HPCC Systems platform
Parallel relational database technology has proven ineffective in analyzing massive amounts of data [23–25]. As a result, several organizations developed new technologies which utilize large clusters of commodity servers to provide the underlying platform to process and analyze massive data. Some of these technologies include MapReduce [23–25], Hadoop [16] and the open source HPCC Systems.
MapReduce is a basic system architecture designed by Google for processing and analyzing large datasets on commodity computing clusters. The MapReduce programming model allows distributed and parallelized transformations and aggregations over a cluster of machines. The Map function converts the input data to groups according to a key-value pair, and the Reduce function performs aggregation by key-value on the output of the Map function. For more complex computations, multiple MapReduce calls must be linked in a sequence.
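As a rough illustration of the programming model described above, the following Python sketch emulates a Map step that emits key-value pairs and a Reduce step that aggregates the values collected for each key. It runs on a single machine; the function names (map_phase, reduce_phase) and the sample records are hypothetical and do not correspond to the Hadoop or Google MapReduce APIs.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record and group the emitted values by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Aggregate the list of values that was collected for each key."""
    return {key: reduce_fn(values) for key, values in groups.items()}


# Hypothetical input: (key, value) records to be summed per key.
records = [("a", 1.0), ("b", 2.0), ("a", 3.0)]
groups = map_phase(records, lambda rec: [(rec[0], rec[1])])
totals = reduce_phase(groups, sum)
```

Every computation has to be phrased as "emit keys, then aggregate per key", which is the adaptation cost that the surrounding discussion refers to.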
Although MapReduce provides basic functionality for many data processing operations, users are limited since they need to adapt their applications to the MapReduce model to achieve parallelism. This can include the implementation of multiple sequenced operations which can add overhead to the overall processing time. In addition, many processing operations do not naturally fit into the group-by-aggregation model using single key-value pairs. Even simple applications such as selection and projection must fit into this model and users need to provide custom MapReduce functions for such operations, which is more error-prone and limits reusability [24].
The HPCC Systems platform, on the other hand, is an open source integrated system environment which excels at both extract, transform and load (ETL) tasks and complex analytics using a common data-centric parallel processing language called Enterprise Control Language (ECL). The HPCC Systems platform is based on a DataFlow programming model. LexisNexis Risk Solutions^{1} independently developed and implemented this platform as a solution to large-scale data-intensive computing. Similar to Hadoop, the HPCC Systems platform also uses commodity clusters of hardware running on top of the Linux operating system. It also includes additional system software and middleware components to meet the requirements for data-intensive computing such as comprehensive job execution, distributed query and file system support.
The data refinery cluster in HPCC Systems (Thor system cluster) is designed for processing massive volumes of raw data, with tasks ranging from data cleansing and ETL processing to developing machine learning algorithms and building large-scale models. It functions as a distributed file system with parallel processing power spread across the nodes (machines). A Thor cluster can scale from a single node to thousands of nodes. HPCC Systems also provides another type of cluster, called ROXIE [14], for rapid data delivery, which is not in the scope of this paper.
The Thor cluster is implemented using a master/slave topology with a single master and multiple slave processes, which provide a parallel job execution environment for programs coded in ECL. Each slave provides localized data storage and processing power within the distributed file system cluster. The Thor master monitors and coordinates the processing activities of the slave nodes and communicates status information. ECL programs are compiled into optimized C++ source code, which is subsequently linked into executable machine code distributed to the slave processes of a Thor cluster. The distribution of the code is done by the Thor master process. Figure 2 shows a representation of a physical Thor processing cluster.
The distributed file system (DFS) used in the Thor cluster is record oriented which is somewhat different from the block format used in MapReduce clusters. Each record represents one data instance. Records can be fixed or variable length, and support a variety of standard (fixed record size, CSV, XML) and custom formats including nested child datasets. The files are usually transferred to a landing zone and from there they are partitioned and distributed as evenly as possible, with records in sequential order, across the available processes in the cluster.
ECL programming language

Enterprise Control Language incorporates transparent and implicit data parallelism regardless of the size of the computing cluster, which reduces the complexity of parallel programming.

Enterprise Control Language was specifically designed for the manipulation of large amounts of data. It enables the implementation of data-intensive applications with complex dataflows and huge volumes of data.

Since ECL is a higher-level abstraction over C++, it offers productivity improvements for programmers over languages such as Java and C++. The ECL compiler generates highly optimized C++ for execution.
DataFlows defined in ECL are parallelized across the slave nodes which process partitions of the data. ECL includes extensive capabilities for data definition, filtering and data transformations. ECL is compiled into optimized C++ and it allows inline C++ functions to be incorporated into ECL statements. This allows the general data transformation and flow to be represented with ECL code, while the more complex internal manipulations on data records can be implemented as inline C++ functions. This distinguishes ECL from other programming languages for data-centric implementations.
Enterprise Control Language transform functions operate on a single record or a pair of records at a time, depending on the operation. Built-in transform operations in the ECL language which process entire datasets include PROJECT, ITERATE, ROLLUP, JOIN, COMBINE, FETCH, NORMALIZE, DENORMALIZE, and PROCESS. For example, the transform function for the JOIN operation receives two records at a time and performs the join operation on them. The join operation can be as simple as finding the minimum of two values or as complex as a complicated user-defined inline C++ function.
The Thor system allows data transformation operations to be performed either locally on each physical node or globally across all nodes. For example, a global maximum can be found by aggregating all the local maximums obtained on each node. This is similar to the MapReduce approach. However, the big advantage of ECL is that this is done naturally and there is no need to define any key-value pair or any Map or Reduce functions.
LBFGS algorithm
The LBFGS (Limited-memory BFGS) algorithm modifies BFGS to obtain Hessian approximations that can be stored in just a few vectors of length n. Instead of storing a fully dense \(n \times n\) approximation, LBFGS stores just m vectors (\(m \ll n\)) of length n that implicitly represent the approximation. The main idea is that it uses curvature information from the most recent iterations. Curvature information from earlier iterations is considered less likely to be relevant to the Hessian behavior at the current iteration and is discarded to save memory.
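To make the implicit representation concrete, the sketch below shows the standard two-loop recursion as described by Nocedal and Wright [10], which turns the stored \(\{s_i, y_i\}\) pairs and the current gradient into a search direction without ever forming an \(n \times n\) matrix. It is a single-machine, pure-Python illustration; the function names and the choice of initial scaling (the common \(s^T y / y^T y\) heuristic) are assumptions and not taken from the ECL implementation discussed in this paper.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def two_loop_direction(grad, s_list, y_list):
    """Return the LBFGS search direction -H * grad.

    s_list and y_list hold the most recent (s_i, y_i) pairs, oldest first.
    """
    q = list(grad)
    history = []
    # First loop: walk the pairs from newest to oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        history.append((rho, alpha))
    # Initial Hessian scaling (assumed heuristic): gamma = s^T y / y^T y.
    if s_list:
        gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: walk the pairs from oldest to newest.
    for (s, y), (rho, alpha) in zip(zip(s_list, y_list), reversed(history)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]


# With no history the direction reduces to steepest descent.
direction_no_history = two_loop_direction([1.0, -2.0], [], [])
# One stored pair for a toy quadratic with Hessian diag(2, 4).
direction_one_pair = two_loop_direction([2.0, 4.0], [[1.0, 0.0]], [[2.0, 0.0]])
```

Each loop performs a fixed number of dot products and vector updates per stored pair, so the cost of computing a direction grows linearly with m.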
Implementation of LBFGS on HPCC Systems
Main idea
We used the ECL language to implement the main DataFlow in the LBFGS algorithm. We also implemented inline C++ functions as required to perform some local computations. We used the HPCC Systems platform without adding any new framework on top of it or modifying any underlying platform configuration.
To implement a large-scale LBFGS algorithm where the length of the parameter vector x is very large, a natural solution is to store and manipulate the vector x on several computing machines/nodes. If we use N machines, the parameter vector is divided into N non-overlapping partitions. Each partition is stored and manipulated locally on one machine. For example, if there are 100 machines available and the length of the vector x is \(10^{10}\) (80 GB at 8 bytes per value), each machine ends up storing \(\frac{1}{100}\) of the parameter vector, which requires 0.8 GB of memory. By using this approach, the problem of handling a parameter vector of size 80 GB is broken down to handling only 0.8 GB of partial vectors locally on each of the 100 computational nodes. Even a machine with enough memory to store such a large parameter vector will need even more memory for the intermediate computations and will take a significant amount of time to run only one iteration. Distributing the storage and computations over several machines reduces both the memory requirement per machine and the computation time.
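The partitioning arithmetic can be sketched in a few lines of Python. The example assumes \(10^{10}\) double-precision values (8 bytes each), which is what the 80 GB figure corresponds to, and contiguous index ranges per node; the helper name partition_bounds is hypothetical and the actual layout used on the Thor cluster may differ.

```python
BYTES_PER_VALUE = 8  # double precision


def partition_bounds(n, num_nodes):
    """Split indices 0..n-1 into num_nodes contiguous, non-overlapping ranges."""
    base, extra = divmod(n, num_nodes)
    bounds = []
    start = 0
    for node_id in range(num_nodes):
        # The first `extra` nodes take one additional element.
        size = base + (1 if node_id < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


n = 10**10
bounds = partition_bounds(n, 100)
total_gb = n * BYTES_PER_VALUE / 1e9
per_node_gb = (bounds[0][1] - bounds[0][0]) * BYTES_PER_VALUE / 1e9
```

Here total_gb is the size of the full vector and per_node_gb is the share held by a single node.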
The main idea in the implementation of our parallelized LBFGS algorithm is to distribute the parameter vector over many machines. Each machine manipulates its locally assigned portion of the parameter vector. The LBFGS caches (\(\{s_i, y_i\}\) pairs) are also stored locally on the machines. For example, if the \(j \text{th}\) machine stores the \(j \text{th}\) partition of the parameter vector, it also ends up storing the \(j \text{th}\) partition of the \(s_i\) and \(y_i\) vectors by performing all the computations locally. Each machine performs most of the operations independently. For instance, the summation of two vectors that are both distributed over several machines consists of adding up their corresponding partitions locally on each machine.
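A minimal Python sketch of such a purely local operation follows. It models each distributed vector as a dictionary from a node identifier to that node's partition, which is an assumption made for illustration only; on a real cluster each list would live on a different machine and no data would move between nodes.

```python
def add_partitioned(x_parts, y_parts):
    """Add two distributed vectors partition by partition.

    Both arguments map node_id -> list of the values stored on that node.
    Every node only touches its own partition.
    """
    return {
        node_id: [a + b for a, b in zip(x_parts[node_id], y_parts[node_id])]
        for node_id in x_parts
    }


# Two hypothetical vectors of length 4, split over two nodes.
x_parts = {0: [1.0, 2.0], 1: [3.0, 4.0]}
y_parts = {0: [10.0, 20.0], 1: [30.0, 40.0]}
z_parts = add_partitioned(x_parts, y_parts)
```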
ECL implementation
In this subsection, we explain our implementation using the ECL language by providing some examples from the code. The goal is to demonstrate the simplicity of ECL as a language which provides parallelized computations. We refer the interested reader to the ECL manual [30] for a detailed explanation of the ECL language.
The record can include other fields as required by the computations. For simplicity, we only show the records related to the actual data and its distribution over several nodes.
The above statement defines vector x as a dataset of records where each record has the x_record format. The “…” includes the actual parameter vector x values which can be a file that contains the initial parameter values or it can be a predefined dataset which is defined in ECL. We exclude that part for simplicity.
The above statement joins the two datasets x_distributed and y_distributed, both of which are distributed over several machines. The JOIN operation pairs the records from the left dataset (x_distributed) and the right dataset (y_distributed) that have the same node_id values. The LOCAL keyword causes the records to be joined locally on each node. The transform function returns the local dot product value for each node_id. A simple SUM statement then provides the final dot product result. The dot product result can then be used in any operation without any explicit reference to the fact that it is a global value that needs to be broadcast to the local machines. The HPCC Systems platform implicitly broadcasts such global values to the local machines.
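The same two-stage pattern (a local partial result per node, followed by a global sum) can be imitated in Python. This is a single-machine analogy and not ECL: the dictionaries keyed by node_id stand in for the distributed datasets, and the function names are hypothetical.

```python
def local_dot(x_part, y_part):
    """Partial dot product computed from one node's partitions only."""
    return sum(a * b for a, b in zip(x_part, y_part))


def distributed_dot(x_parts, y_parts):
    """Pair partitions by node_id, take local dot products, then sum them."""
    partials = {
        node_id: local_dot(x_parts[node_id], y_parts[node_id])
        for node_id in x_parts
    }
    # The final sum is the only step that needs values from every node.
    return sum(partials.values())


x_parts = {0: [1.0, 2.0], 1: [3.0, 4.0]}
y_parts = {0: [5.0, 6.0], 1: [7.0, 8.0]}
result = distributed_dot(x_parts, y_parts)
```

Only one scalar per node crosses the network in this scheme, regardless of how long the partitions are.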
Results
Table 1 Dataset characteristics
Dataset  # Instances  # Classes  # Features  # Parameters  Parameter size (GB)
lshtc-small  4,463  1,139  51,033  58,126,587  0.5
lshtc-large  93,805  12,294  347,256  4,269,165,264  34
wikipedia-medium  456,886  31,521  346,299  10,915,690,779  87
To showcase the effectiveness of our implementation, we consider three datasets with an increasing number of parameters: lshtc-small, lshtc-large and wikipedia-medium, which are large-scale text classification datasets.^{2} The characteristics of the datasets are shown in Table 1. Each instance in the wikipedia-medium dataset can belong to more than one class. To build the Softmax objective function for the wikipedia-medium dataset, we only considered the first class among the multiple classes listed for each sample as its label.
We used the implemented LBFGS algorithm to optimize the Softmax regression objective function [31] for these datasets. Softmax regression (or multinomial logistic regression) is a generalization of logistic regression for the case where there are multiple classes to be classified. The number of parameters for Softmax regression is equal to the product of the number of classes and the number of features. We use double precision to represent real numbers (8 bytes). The parameter size column in Table 1 approximates the memory needed to store the parameter vector by multiplying the number of parameters by 8 bytes. Since the parameter vector is not sparse, we store it as a dense vector of continuous real values.
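The parameter counts and sizes in Table 1 can be reproduced with the short Python calculation below. The class and feature counts are taken from the table; the function name is hypothetical and sizes are expressed in decimal gigabytes (\(10^9\) bytes), which is an assumption about how the table was rounded.

```python
def softmax_parameter_size(num_classes, num_features, bytes_per_value=8):
    """Return (number of parameters, size in GB) for a dense Softmax model."""
    num_params = num_classes * num_features
    return num_params, num_params * bytes_per_value / 1e9


# (number of classes, number of features) as listed in Table 1.
datasets = {
    "lshtc-small": (1139, 51033),
    "lshtc-large": (12294, 347256),
    "wikipedia-medium": (31521, 346299),
}
sizes = {name: softmax_parameter_size(c, f) for name, (c, f) in datasets.items()}
```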
We used a cluster of 20 machines, each with 4 GB of RAM, for the lshtc-small dataset. We used an AWS (Amazon Web Services) cluster with 16 instances of r3.8xlarge^{3} (each instance runs 25 Thor nodes) for the lshtc-large and wikipedia-medium datasets, where each node has almost 9 GB of RAM.
Figures 4 and 5 show the difference from the optimal solution as the number of iterations increases for different values of m in the LBFGS algorithm for the lshtc-small and lshtc-large datasets, respectively. We chose the regularization parameter λ = 0.0001 for these two datasets. We chose a λ value that causes the LBFGS algorithm not to converge too quickly so that we can show more iterations in our results. Figure 6 shows the difference from the optimal solution as the number of iterations increases for \(m=5\) for the wikipedia-medium dataset. For this dataset, we chose λ = 0.00001 in addition to λ = 0.0001 because the LBFGS algorithm converges very fast in the case of λ = 0.0001. The reason we only considered \(m=5\) for this dataset is that the number of iterations for the LBFGS algorithm is small.
Tables 2, 3, and 4 present the corresponding information for the results shown in Figs. 4, 5, and 6, respectively. The number of iterations is the iteration at which the LBFGS algorithm reached the optimum point. We define the ending criteria for our LBFGS algorithm the same as the ending criteria defined in the minFunc library [32]. Since in each iteration the Wolfe line search might need to calculate the objective function more than once to find the best step length, the overall number of times the objective function is calculated is usually larger than the number of iterations in the LBFGS algorithm. The total memory usage in these tables is the memory required by the LBFGS algorithm. It includes the memory required to store the updated parameter vector, the gradient vector, and \(2 \times m\) LBFGS cache vectors.
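The memory accounting described above amounts to \(2m + 2\) dense vectors of the parameter length. The Python sketch below applies that count to the lshtc-large dataset; the function name is hypothetical, sizes are in decimal gigabytes, and the estimates come out within about 1 GB of the memory values reported for this dataset.

```python
def lbfgs_memory_gb(num_params, m, bytes_per_value=8):
    """Estimate LBFGS memory: parameters, gradient, and 2*m cache vectors."""
    num_vectors = 2 * m + 2
    return num_vectors * num_params * bytes_per_value / 1e9


LSHTC_LARGE_PARAMS = 4_269_165_264  # from the dataset characteristics table
estimates = {
    m: lbfgs_memory_gb(LSHTC_LARGE_PARAMS, m) for m in (5, 10, 15, 20, 25, 30)
}
```

Because the estimate is linear in m, each additional cached pair costs two more full-length vectors spread across the cluster.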
The results indicate that increasing the m value in the LBFGS algorithm causes the algorithm to reach the optimum point in fewer iterations. However, the time it takes for the algorithm to reach the optimum point does not necessarily decrease. The reason is that increasing m causes the calculation of the step direction in the LBFGS two-loop recursion algorithm to take more time.
Table 2 Results description for the lshtc-small dataset
m  Memory (GB)  # Iterations  # Objective function  Duration (s)
5  5  68  77  652 
10  10  49  58  603 
15  15  46  54  668 
20  20  41  48  655 
25  24  40  46  686 
30  29  36  44  641 
Table 3 Results description for the lshtc-large dataset
m  Memory (GB)  # Iterations  # Objective function  Duration (s)
5  410  61  83  3752
10  751  50  69  3371
15  1093  48  64  3360
20  1434  40  57  2869
25  1776  42  58  3076
30  2117  40  57  3012
Table 4 Results description for the wikipedia-medium dataset
λ  Memory (GB)  # Iterations  # Objective function  Duration (s)
0.0001  1048  4  11  202 
0.00001  1048  12  21  1297 
Conclusion and discussion
In this paper, we explained a parallelized distributed implementation of LBFGS which works for training large-scale models with billions of parameters. The LBFGS algorithm is an effective parameter optimization method which can be used for parameter estimation in various machine learning problems. We implemented the LBFGS algorithm on HPCC Systems, which is an open source, data-intensive computing system platform originally developed by LexisNexis Risk Solutions. Our main idea for implementing the LBFGS algorithm for large-scale models, where the number of parameters is very large, is to divide the parameter vector into partitions. Each partition is stored and manipulated locally on one computational node. In the LBFGS algorithm, all the computations can be performed locally on each partition except the dot product computation, which needs different computational nodes to share their information. The ECL language of the HPCC Systems platform simplifies implementing parallel computations which are done locally on each computational node, as well as performing global computations where computational nodes share information. We explained how we used these capabilities to implement the LBFGS algorithm on the HPCC Systems platform. Our experimental results show that our implementation of the LBFGS algorithm can scale from handling millions of parameters on dozens of machines to billions of parameters on hundreds of machines. The implemented LBFGS algorithm can be used for parameter estimation in machine learning problems with a very large number of parameters. Additionally, it can be used in image or text classification applications, where the large number of features and classes naturally increases the number of model parameters, especially for models such as deep neural networks.
Compared to Sandblaster, the parallelized implementation of LBFGS by Google, the HPCC Systems implementation does not require adding any new component such as a parameter server to the framework. HPCC Systems is an open source platform which already provides data-centric parallel computing capabilities. It can be used by practitioners to implement their large-scale models without the need to design a new framework. In future work, we want to use the HPCC Systems parallelization capabilities within each node, which are provided through multithreaded processing, to further speed up our implementations.
Declarations
Authors’ contributions
MMN carried out the conception and design of the research, performed the implementations and drafted the manuscript. TMK, FV and JH provided reviews on the manuscript. JH set up the experimental framework on AWS and provided expert advice on ECL. All authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 Bennett KP, ParradoHernández E. The interplay of optimization and machine learning research. J Mach Learn Res. 2006;7:1265–81.MathSciNetMATHGoogle Scholar
 Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2(1):1–21.View ArticleGoogle Scholar
 Xing EP, Ho Q, Xie P, Wei D. Strategies and principles of distributed machine learning on big data. Engineering. 2016;2(2):179–95.View ArticleGoogle Scholar
 Dean J, Corrado G, Monga R, Chen K, Devin M, Mao M, Senior A, Tucker P, Yang K, Le QV, et al. Large scale distributed deep networks. In: Advances in neural information processing systems. Lake Tahoe, Nevada: Curran Associates Inc.; 2012. p. 1223–31.Google Scholar
 Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in neural information processing systems. Lake Tahoe, Nevada: Curran Associates, Inc.; 2012. p. 1097–05.Google Scholar
 Dong L, Lin Z, Liang Y, He L, Zhang N, Chen Q, Cao X, Izquierdo E. A hierarchical distributed processing framework for big image data. IEEE Trans Big Data. 2016;2(4):297–309.View ArticleGoogle Scholar
 Sliwinski TS, Kang SL. Applying parallel computing techniques to analyze terabyte atmospheric boundary layer model outputs. Big Data Res. 2017;7:31–41.View ArticleGoogle Scholar
 ShalevShwartz S, Singer Y, Srebro N. Pegasos: primal estimated subgradient solver for svm. In: Proceedings of the 24th international conference on machine learning. New York: ACM; 2007. p. 807–14.Google Scholar
 Zinkevich M, Weimer M, Li L, Smola AJ. Parallelized stochastic gradient descent. In: Lafferty JD, Williams CKI, ShaweTaylor J, Zemel RS, Culotta A, editors. Advances in neural information processing systems. Vancouver, British Columbia, Canada: Curran Associates Inc.; 2010. p. 2595–03.Google Scholar
 Nocedal J, Wright SJ. Numerical optimization. 2nd ed. New York: Springer; 2006.
 Ngiam J, Coates A, Lahiri A, Prochnow B, Le QV, Ng AY. On optimization methods for deep learning. In: Proceedings of the 28th international conference on machine learning (ICML-11). 2011. p. 265–72.
 Schraudolph NN, Yu J, Günter S, et al. A stochastic quasi-Newton method for online convex optimization. Artif Intell Stat Conf. 2007;7:436–43.
 Daumé III H. Notes on CG and LM-BFGS optimization of logistic regression. http://www.umiacs.umd.edu/~hal/docs/daume04cgbfgs, implementation http://www.umiacs.umd.edu/~hal/megam/. 2004;198:282.
 Middleton A, Solutions P. HPCC Systems: introduction to HPCC (high-performance computing cluster). White paper, LexisNexis Risk Solutions. 2011. http://cdn.hpccsystems.com/whitepapers/wp_introduction_HPCC.pdf.
 Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13.
 White T. Hadoop: the definitive guide. 3rd ed. 2012.
 Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. University of California, Berkeley.
 Agarwal A, Chapelle O, Dudík M, Langford J. A reliable effective terascale linear learning system. J Mach Learn Res. 2014;15(1):1111–33.
 Sra S, Nowozin S, Wright SJ. Optimization for machine learning. Cambridge: The MIT Press; 2011.
 Dekel O, Gilad-Bachrach R, Shamir O, Xiao L. Optimal distributed online prediction using mini-batches. J Mach Learn Res. 2012;13:165–202.
 Teo CH, Smola A, Vishwanathan S, Le QV. A scalable modular convex solver for regularized risk minimization. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining. New York: ACM; 2007. p. 727–36.
 Recht B, Re C, Wright S, Niu F. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, editors. Advances in neural information processing systems. Granada, Spain: Curran Associates, Inc.; 2011. p. 693–701.
 Dean J, Ghemawat S. MapReduce: a flexible data processing tool. Commun ACM. 2010;53(1):72–7.
 Chaiken R, Jenkins B, Larson PÅ, Ramsey B, Shakib D, Weaver S, Zhou J. SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endow. 2008;1(2):1265–76.
 Stonebraker M, Abadi D, DeWitt DJ, Madden S, Paulson E, Pavlo A, Rasin A. MapReduce and parallel DBMSs: friends or foes? Commun ACM. 2010;53(1):64–71.
 Pike R, Dorward S, Griesemer R, Quinlan S. Interpreting the data: parallel analysis with Sawzall. Sci Program. 2005;13(4):277–98.
 Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U. Building a high-level dataflow system on top of MapReduce: the Pig experience. Proc VLDB Endow. 2009;2(2):1414–25.
 Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint. 2016. arXiv:1603.04467.
 Wolfe P. Convergence conditions for ascent methods. SIAM Rev. 1969;11(2):226–35.
 Team BRD. ECL language reference. White paper, LexisNexis Risk Solutions. 2015. http://cdn.hpccsystems.com/install/docs/3_4_0_1/ECLLanguageReference.pdf.
 Bishop CM. Pattern recognition and machine learning (information science and statistics). Secaucus: Springer; 2006.
 Schmidt M. minFunc: unconstrained differentiable multivariate optimization in Matlab. 2005. http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html.