A comprehensive evaluation of ensemble learning for stock-market prediction

Nti, Isaac Kofi; Adekoya, Adebayo Felix; Weyori, Benjamin Asubam

doi:10.1186/s40537-020-00299-5

Research
Open access
Published: 11 March 2020

Journal of Big Data volume 7, Article number: 20 (2020)

Abstract

Stock-market prediction using machine-learning techniques aims at developing effective and efficient models that provide higher prediction accuracy. Numerous ensemble regressors and classifiers have been applied to stock-market prediction, using different combination techniques. However, three critical issues come to mind when constructing ensemble classifiers and regressors. The first concerns the choice of base regressor or classifier technique. The second concerns the combination technique used to assemble multiple regressors or classifiers, and the third concerns the number of regressors or classifiers to be ensembled. Yet the number of relevant studies scrutinising these concerns is limited. In this study, we performed an extensive comparative analysis of ensemble techniques such as boosting, bagging, blending and super learners (stacking). Using Decision Trees (DT), Support Vector Machine (SVM) and Neural Network (NN), we constructed twenty-five (25) different ensembled regressors and classifiers. We compared their execution times, accuracy, and error metrics over stock data from the Ghana Stock Exchange (GSE), Johannesburg Stock Exchange (JSE), Bombay Stock Exchange (BSE-SENSEX) and New York Stock Exchange (NYSE), from January 2012 to December 2018. The study outcome shows that the stacking and blending ensemble techniques offer higher prediction accuracies (90–100% and 85.7–100%, respectively) than bagging (53–97.78%) and boosting (52.7–96.32%). Furthermore, the root mean square error (RMSE) recorded by stacking (0.0001–0.001) and blending (0.002–0.01) shows that ensemble classifiers and regressors based on these two techniques fit market data better than those based on bagging (0.01–0.11) and boosting (0.01–0.443). Finally, the results suggest that an innovative study in the domain of stock-market direction prediction ought to include ensemble techniques in its set of algorithms.

Introduction

The stock market is considered to be a stochastic and challenging real-world environment, where stock-price movements are affected by a considerable number of factors [1, 2]. Billions of structured and unstructured data records are generated daily by stock markets around the globe, increasing the “volume”, “velocity”, “variety” and “veracity” of stock-market data and making it complex to analyse [1, 3]. Two methods have generally been accepted for analysing this “Big Data” from the stock market, namely fundamental analysis and technical analysis. Fundamental analysis focuses on the economic trends of local and international milieus, public sentiment, financial statements and assets reported by companies, political conditions and company associations worldwide [1, 4]. Technical analysis is based on statistical analysis of the historical movement of stock prices. Technical indicators such as the moving average, dead cross and golden cross are employed for effective stock-trading decisions. Despite the existence of these techniques, market analysis is still a challenging and open problem [1].
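To make the technical indicators named above concrete, the following is a minimal sketch, not part of the study's pipeline, of how a simple moving average and the golden-cross and dead-cross signals are commonly computed. The function names, the toy prices and the window lengths (2 and 4 days) are illustrative assumptions; practitioners typically use much longer windows.

```python
from typing import List, Optional


def simple_moving_average(prices: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; positions without a full window are None."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages: List[Optional[float]] = []
    running_sum = 0.0
    for i, price in enumerate(prices):
        running_sum += price
        if i >= window:
            # Drop the price that just left the window.
            running_sum -= prices[i - window]
        averages.append(running_sum / window if i >= window - 1 else None)
    return averages


def crossover_signals(
    prices: List[float], short_window: int, long_window: int
) -> List[Optional[str]]:
    """Mark days where the short SMA crosses the long SMA.

    'golden': the short SMA moves from at-or-below to above the long SMA.
    'dead':   the short SMA moves from above to at-or-below the long SMA.
    """
    short_sma = simple_moving_average(prices, short_window)
    long_sma = simple_moving_average(prices, long_window)
    signals: List[Optional[str]] = [None] * len(prices)
    for i in range(1, len(prices)):
        values = (short_sma[i - 1], long_sma[i - 1], short_sma[i], long_sma[i])
        if None in values:
            continue  # not enough history yet
        was_above = short_sma[i - 1] > long_sma[i - 1]
        is_above = short_sma[i] > long_sma[i]
        if not was_above and is_above:
            signals[i] = "golden"
        elif was_above and not is_above:
            signals[i] = "dead"
    return signals


# Hypothetical closing prices, for illustration only.
toy_prices = [10.0, 9.0, 8.0, 7.0, 8.0, 10.0, 12.0, 11.0, 8.0, 6.0, 5.0]
toy_signals = crossover_signals(toy_prices, short_window=2, long_window=4)
```

In this toy series the short average rises above the long average during the rally and falls back below it during the subsequent decline, which is what the two signal labels are meant to capture.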

To overcome the challenges in stock-market analysis, several computational models based on soft-computing and machine-learning paradigms have been used for stock-market analysis, prediction, and trading. Techniques like Support Vector Machine (SVM) [2, 5], DTs [6], neural networks [7], Naïve Bayes [8, 9] and artificial neural networks (ANN) [10, 11] were reported to have performed better in stock-market prediction than conventional statistical methods such as logistic regression (LR), with respect to prediction error and accuracy. Nevertheless, ensemble learning (EL), a learning paradigm that combines multiple learning algorithms into committees to improve predictions (stacking and blending), decrease variance (bagging) or decrease bias (boosting), is believed to perform better than single classifiers and regressors [12, 13].
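The committee idea behind bagging can be sketched in a few lines. The example below is an illustration under stated assumptions, not the configuration used in this study: the base learner is a one-feature decision stump standing in for the DT, SVM and NN learners, the data are a toy set of (feature, direction) pairs, and the names fit_stump, fit_bagging and predict_bagging are hypothetical. Each stump is trained on a bootstrap resample, and the committee predicts by majority vote.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Fit a threshold classifier on (x, label) pairs with labels in {0, 1}.

    Returns (threshold, label_if_above) with the fewest training errors.
    """
    best = None
    for threshold in sorted({x for x, _ in samples}):
        for label_if_above in (0, 1):
            errors = sum(
                (label_if_above if x >= threshold else 1 - label_if_above) != y
                for x, y in samples
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, label_if_above)
    return best[1], best[2]


def predict_stump(stump, x):
    threshold, label_if_above = stump
    return label_if_above if x >= threshold else 1 - label_if_above


def fit_bagging(samples, n_estimators, seed=0):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(samples) for _ in samples]
        stumps.append(fit_stump(bootstrap))
    return stumps


def predict_bagging(stumps, x):
    """Majority vote over the committee of stumps."""
    votes = Counter(predict_stump(stump, x) for stump in stumps)
    return votes.most_common(1)[0][0]


# Toy data (assumed): x is a price-change feature, label 1 means "up".
toy_samples = [(-3.0, 0), (-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1), (3.0, 1)]
committee = fit_bagging(toy_samples, n_estimators=25, seed=0)
```

Individual stumps differ because each sees a different resample; averaging their votes is what dampens the variance of any single learner. An odd number of estimators is used so that a binary vote cannot tie.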

EL techniques have been applied in several sectors, such as health [14], agriculture [15], energy [16], oil and gas [17], and finance [12, 18]. In all these applications, the reported accuracies support the argument that ensemble classifiers or regressors are often far more precise than individual classifiers or regressors. For this reason, building better ensemble classification and regression models has become a critical and active research area in supervised learning, with boosting and bagging being the most common combination methods used in the literature [16].

Despite numerous works revealing the dominance of ensemble classifiers over single classifiers, most of these studies ensembled only a specific type of classifier or regressor for stock-market prediction, such as NN [18,19,20], DT [21, 22] and SVM [12, 23]. Also, most previous studies on ensemble methods for stock-market prediction [12, 19, 21, 22, 24,25,26,27,28,29,30] adopted the decrease-variance approach (boosting or bagging) and experimented with data from one country. Furthermore, a comparison between the bagging (BAG) and boosting (BOT) combination techniques in [12, 21] revealed that the BAG technique outperformed the BOT technique. However, these studies concluded that the performance of ensemble classifiers using boosting or bagging in stock-market prediction is territory dependent. Thus, the authors foresee that some ensemble methods may perform better on data from some parts of the globe than on data from other parts. This assumption calls for different ensemble techniques to be benchmarked with stock data from different continents to ascertain their performance.

Besides, little is known about how ensemble classifiers and regressors that use different combination techniques, with the same or diverse base learners, compare in predicting the stock market. Hence, to the best of our knowledge, there is no comprehensive comparative study in stock-market prediction that evaluates the performance of a good pool of diverse ensemble regressors and classifiers on stock data from three or more continents.

Therefore, this study seeks to perform a comprehensive comparative study of ensemble learning methods for classification and regression machine-learning tasks in stock-market prediction. The specific objectives of this study are as follows:

i. To bring together the theory of EL and appreciate the algorithms that use this technique.
ii. To review some of the recently published articles on ensemble techniques for classification and regression machine-learning tasks in stock-market prediction.
iii. To set up ensemble classifiers and regressors with DTs, SVM and NN using the stacking, blending, bagging, and boosting combination techniques.
iv. To examine and compare the execution times, accuracy, and error metrics of the techniques in (iii) over stock data from GSE, JSE, NYSE and BSE-SENSEX.
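The three comparison criteria named in the last objective (execution time, accuracy and error metrics) can be sketched as follows. This is a minimal illustration using the standard definitions of classification accuracy and root mean square error; the helper names accuracy, rmse and timed are assumptions for this sketch and are not taken from the study's code.

```python
import math
import time


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Toy direction labels (1 = up, 0 = down) and toy price forecasts.
toy_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
toy_rmse = rmse([0.0, 0.0], [3.0, 3.0])
toy_result, toy_seconds = timed(rmse, [1.0, 2.0], [1.5, 2.5])
```

Accuracy applies to the classification task (price direction), while RMSE applies to the regression task (price level); timing the fit and predict calls with a monotonic clock such as time.perf_counter keeps the execution-time comparison independent of system clock adjustments.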

We hope that this paper brings more clarity on which ensemble techniques are best suited to machine-learning tasks in stock-market prediction. It also aims to help beginners in the machine-learning field make an informed choice of ensemble methods that quickly offer accurate results in stock-market prediction. Furthermore, we probe the arguments made in [12, 21] about the consistency of ensemble-learning superiority over stock data from different countries. Finally, this paper contributes to the literature in that it is, to the best of our knowledge, the first in stock-market prediction to make such an extensive comparative analysis of ensemble techniques.

The remaining sections of the paper are organised as follows. The “Related works evaluation” section presents a review of related works. In the “Procedure of proposed method” section, we present a quick dive into basic and advanced ensemble methods and the study procedure. The “Empirical analysis and discussion” section discusses the results of the empirical studies. The “Conclusion” section concludes this study and describes avenues for future research.

Related works evaluation

The literature has shown that the application of some powerful ML algorithms has significantly improved the accuracy of stock-price classification and prediction [31, 32]. As such, ML has drawn attention in stock-market prediction, and several ensemble ML techniques have recorded high prediction accuracy in recent studies.

Sohangir et al. [33] examined the ability of deep learning techniques such as LSTM and CNN to improve stock prediction accuracy using public sentiment. The outcome of the study showed that a deep learning technique (CNN) outperformed ML algorithms like logistic regression and Doc2vec. Their simulation results demonstrated the attractiveness of their proposed ensemble method compared with the autoregressive integrated moving average (ARIMA) and generalised autoregressive conditional heteroscedasticity (GARCH). Likewise, Abe et al. [34] applied a deep neural network technique to predict stock prices and reported that deep networks are more accurate than shallow neural networks.

An ensemble of state-of-the-art ML techniques, including deep neural networks, RF and gradient-boosted trees, was proposed in [35] to predict the next-day stock price return on the S&P 500. The experimental findings were promising, signifying that a sustainable short-run profit opportunity is exploitable through ML, even in a developed market. Qiu et al. [36] presented a stock prediction model based on an ensemble ν-Support Vector Regression model.

Similarly, an ensemble of Bayesian model averaging (BMA), weighted-average least squares (WALS), and least absolute shrinkage and selection operator (LASSO) using AdaBagging was proposed in [24] to predict stock prices. Pasupulety et al. [37] proposed an ensemble of an extra-tree regressor and a support vector regressor using stacking to predict stock prices based on public sentiment. Pulido et al. [38] ensembled NNs with fuzzy integration (type-1 and type-2) for predicting the stock market; the proposed model achieved a higher prediction accuracy than a single NN classifier. An ensemble of trees in an RF using LSboost was carried out in [25]; the study achieved a reduced prediction error.

A comparison of single, ensemble and integrated ensemble ML techniques for predicting the stock market was carried out in [39]. The study showed that boosting ensemble classifiers outperformed bagged classifiers. Sun et al. [26] proposed an ensemble LSTM using AdaBoost for stock-market prediction. Their results show that the proposed AdaBoost-LSTM ensemble outperformed some other single forecasting models. A homogenous ensemble of time-series models including SVM, logistic regression, Lasso regression, polynomial regression, naive forecast and more was proposed in [40] for predicting stock price movement. Likewise, Yang et al. [41] ensembled SVM, RF and AdaBoost using voting techniques to predict whether to buy or sell stocks over intraday, weekly and monthly periods. The study shows that the ensemble technique outperformed single classifiers in terms of accuracy. Gan et al. [42] proposed an ensemble of feedforward neural networks for predicting the stock closing price and reported a higher prediction accuracy than single feedforward neural networks.
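The voting combination mentioned above can be illustrated with a short sketch of max (majority) voting. The three lists of buy/sell predictions are invented for illustration and merely stand in for the outputs of three base classifiers; the function names are hypothetical and this is not the implementation from [41].

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(labels).most_common(1)[0][0]


def vote_per_period(per_model_predictions):
    """Combine several models' prediction sequences period by period.

    per_model_predictions holds one list per model, all of equal length.
    """
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Hypothetical outputs of three base classifiers over four trading periods.
model_a = ["buy", "sell", "buy", "sell"]
model_b = ["buy", "buy", "sell", "sell"]
model_c = ["sell", "buy", "buy", "sell"]

combined = vote_per_period([model_a, model_b, model_c])
```

With three voters and two classes every period has a strict majority, so no tie-breaking rule is needed in this example; an even number of voters would require one.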

In another study, a two-phase ensemble framework was proposed for predicting stock prices [43]. It combined several non-classical decomposition models (ensemble empirical mode decomposition, empirical mode decomposition, and complete ensemble empirical mode decomposition with adaptive noise) with ML models (SVM and NN). An implementation and evaluation of the robustness of RF in a stock-selection strategy was carried out in [31]. Using fundamental and technical datasets, the authors concluded that, for sound stock investment, fundamental features and long-term technical features are important to long-term profit. Mehta et al. [44] proposed a weighted ensemble model using weighted SVM, LSTM and multiple regression for predicting the stock market. Their results show that the ensemble learning technique attained maximum accuracy with lower variance in stock prediction.
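A weighted ensemble of regressors can be sketched as a weighted average of the base forecasts. The weighting rule below (weights inversely proportional to each model's validation error) is an assumption chosen for illustration; the source does not state how [44] derived its weights, and the forecasts and errors are invented.

```python
def inverse_error_weights(validation_errors):
    """Weights proportional to 1/error, normalised to sum to one.

    Assumed rule for this sketch: a model with a lower validation error
    receives a larger weight.
    """
    if any(error <= 0 for error in validation_errors):
        raise ValueError("validation errors must be positive")
    inverse = [1.0 / error for error in validation_errors]
    total = sum(inverse)
    return [value / total for value in inverse]


def weighted_average(forecasts, weights):
    """Combine one forecast per model into a single ensemble forecast."""
    if len(forecasts) != len(weights):
        raise ValueError("one weight is required per forecast")
    return sum(f * w for f, w in zip(forecasts, weights))


# Hypothetical validation errors and next-day price forecasts of three models.
toy_errors = [0.5, 0.5, 0.25]
toy_forecasts = [100.0, 104.0, 102.0]

toy_weights = inverse_error_weights(toy_errors)
ensemble_forecast = weighted_average(toy_forecasts, toy_weights)
```

Setting all weights equal reduces this to plain averaging, which is the simplest of the basic ensemble techniques for regression.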

Similarly, Assis et al. [45] proposed an NN ensemble for predicting stock price movement. A deep NN ensemble using bagging for stock-market prediction was proposed in [29]. The study revealed that assembling several neural networks to predict stock price movement is more accurate than using a single deep neural network. Jiang et al. [27] implemented different state-of-the-art ML techniques, including a tree-based and LSTM ensemble using the stacking combination technique, to predict stock price movement based on both macroeconomic conditions and historical transaction data. The authors recorded an accuracy of 60–70% on average. Kohli et al. [46] examined the performance of different ML algorithms (SVM, RF, Gradient Boosting and AdaBoost) in stock-market price prediction. The study showed that AdaBoost outperformed Gradient Boosting in terms of prediction accuracy.

The work in [19] presents an ensemble classifier of NNs using bagging. The results revealed that the ensemble of NNs performs much better than a single NN classifier. Equally, Wang et al. [4] proposed an RNN ensemble framework that combines trade-based features deduced from historical trading records with characteristic features of the listed companies to detect stock-price manipulation activities effectively. Their experimental results reveal that the proposed RNN ensemble outperforms state-of-the-art methods in distinguishing stock-price manipulation by an average of 29.8% in terms of AUC value. Existing studies have thus shown that ensemble classifiers and regressors offer higher prediction accuracy than a single classifier or regressor.

In the same way, Ballings et al. [12] compared LR, NN, K-Nearest Neighbour (K-NN), and SVM ensembles using bagging and boosting. The study results revealed that the bagging algorithm (random forest) outperformed the boosting algorithm (AdaBoost). Nevertheless, the study concluded that the performance of ensemble methods depends on the domain of the dataset used. Therefore, to obtain a generalisation of EL methods, a comprehensive comparison among ensemble methods using datasets from different continents is required.

Table 1 (Appendix A) presents a summary of pertinent studies on stock-market prediction using EL based on different combination techniques. We categorised the relevant literature based on (i) the base (weak) learner and the total number used; (ii) the type of machine-learning task (classification or regression); (iii) the origin of the data used for the experimental analysis; (iv) the combination technique used; and (v) the evaluation metric used to compare relative performance.

Table 1 Comparison of related studies
