- Short Report
- Open Access
Step away from stepwise
© The Author(s) 2018
- Received: 23 May 2018
- Accepted: 5 September 2018
- Published: 15 September 2018
Stepwise regression is a popular data-mining tool that uses statistical significance to select the explanatory variables to be used in a multiple-regression model.
A fundamental problem with stepwise regression is that some real explanatory variables that have causal effects on the dependent variable may happen to not be statistically significant, while nuisance variables may be coincidentally significant. As a result, the model may fit the data well in-sample, but do poorly out-of-sample.
Many Big-Data researchers believe that, the larger the number of possible explanatory variables, the more useful is stepwise regression for selecting explanatory variables. The reality is that stepwise regression is less effective the larger the number of potential explanatory variables. Stepwise regression does not solve the Big-Data problem of too many explanatory variables. Big Data exacerbates the failings of stepwise regression.
- Stepwise regression
- Data mining
- Big Data
Researchers typically do not know with certainty which explanatory variables ought to be included in their multiple regression models. More than 50 years ago, stepwise regression was proposed as an efficient way to select the most useful explanatory variables. Despite widespread criticism, it never disappeared and has enjoyed a revival as a method for analyzing Big Data, where the number of potential explanatory variables can be very large. This paper uses a series of Monte Carlo simulations to demonstrate that stepwise regression is a poor solution to a surfeit of variables. In fact, the larger the number of potential explanatory variables, the more likely stepwise regression is to be misleading.
The stepwise regression method
Efroymson proposed choosing the explanatory variables for a multiple regression model from a group of candidate variables by going through a series of automated steps. At every step, the candidate variables are evaluated, one by one, typically using the t statistics for the coefficients of the variables being considered.
A forward-selection rule starts with no explanatory variables and then adds variables, one by one, based on which variable is the most statistically significant, until there are no remaining statistically significant variables.
A backward-elimination rule starts with all possible explanatory variables and then discards the least statistically significant variables, one by one. The discarding stops when each variable remaining in the equation is statistically significant. Backward elimination is challenging if there is a large number of candidate variables and impossible if the number of candidate variables is larger than the number of observations.
A bi-directional stepwise procedure is a combination of forward selection and backward elimination. As with forward selection, the procedure starts with no variables and adds variables using a pre-specified criterion. The wrinkle is that, at every step, the procedure also considers the statistical consequences of dropping variables that were previously included. So, a variable might be added in Step 2, dropped in Step 5, and added again in Step 9.
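The forward-selection rule described above can be sketched in standard-library Python. This is a minimal illustration under stated assumptions, not Efroymson's original procedure or any package's implementation: the helper names (`invert`, `ols`, `forward_select`, `make_data`) are hypothetical, the data are synthetic, and a fixed cutoff of |t| > 1.98 stands in for an exact two-sided 5% test with roughly 100 observations.

```python
import math
import random


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(a)
    m = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(a)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for r in range(k):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[k:] for row in m]


def ols(X, y):
    """Least squares via the normal equations; returns (coefficients, t statistics)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    sse = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))
    s2 = sse / (n - k)  # estimated residual variance
    t = [beta[i] / math.sqrt(s2 * inv[i][i]) for i in range(k)]
    return beta, t


def forward_select(cols, y, t_crit=1.98):
    """Add the candidate with the largest |t| each step until none exceeds t_crit."""
    chosen = []
    while len(chosen) < len(cols):
        best, best_t = None, t_crit
        for j in range(len(cols)):
            if j in chosen:
                continue
            # Intercept, the variables already chosen, then candidate j last.
            X = [[1.0] + [cols[c][r] for c in chosen] + [cols[j][r]] for r in range(len(y))]
            _, t = ols(X, y)
            if abs(t[-1]) > best_t:
                best, best_t = j, abs(t[-1])
        if best is None:
            break
        chosen.append(best)
    return chosen


def make_data(seed, n_obs=100, n_true=2, n_nuisance=20):
    """Synthetic data (assumed setup): columns 0..n_true-1 affect y; the rest are noise."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_true + n_nuisance)]
    y = [3 * sum(cols[j][r] for j in range(n_true)) + rng.gauss(0, 1) for r in range(n_obs)]
    return cols, y


cols, y = make_data(seed=0)
print("selected columns:", forward_select(cols, y))
```

Any selected index of 2 or higher is a nuisance column that passed the cutoff by chance. Backward elimination and the bi-directional variant would reuse the same `ols` helper with a drop step added.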
Some researchers use stepwise regression to prune a list of plausible explanatory variables down to a parsimonious collection of the “most useful” variables. Others pay little or no attention to plausibility. They let the stepwise procedure choose their variables for them.
False confidence in stepwise results
Several authors [2–10] have pointed out that standard statistical tests assume a single test of a pre-specified model and are not appropriate when a sequence of steps is used to choose the explanatory variables. The standard errors of the coefficient estimates are underestimated, which makes the confidence intervals too narrow, the t statistics too high, and the p values too low, which leads to overfitting and creates a false confidence in the final model. In 1995, one educational psychology journal announced that authors should not submit papers using stepwise regression.
However, stepwise regression remains a popular tool (for example, [11–13]) and most statistical software packages include stepwise regression, which evidently reflects the demand for it and, perversely, may tempt researchers to try it. A survey of papers published in 2004 in three leading ecological and behavioral journals found that 57% of the papers that reported multiple regression results used stepwise regression. A survey of four leading epidemiologic journals found that 20% of the articles published in 2008 used stepwise regression. A study of articles published between 2004 and 2008 in two leading Chinese epidemiology journals found that, of the articles using multiple regression models, 44% used stepwise procedures.
Several textbooks endorse stepwise regression [16, 17], including a handbook explicitly devoted to data mining methods. The Chartered Financial Analyst Level II exam includes stepwise regression.
Other problems with stepwise regression
Similarly, a spending model that uses last year’s income and this year’s income as explanatory variables should be equivalent to a model that uses last year’s income and the change in income from last year to this year. Multiple regression estimates will not be affected; stepwise estimates might.
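The equivalence of the two spending models can be made explicit with one line of algebra. The notation is assumed here for illustration and is not taken from the paper: $C_t$ is spending and $Y_t$ is income in year $t$.

```latex
\begin{aligned}
C_t &= \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_t + \varepsilon_t \\
    &= \beta_0 + (\beta_1 + \beta_2)\, Y_{t-1} + \beta_2\,(Y_t - Y_{t-1}) + \varepsilon_t .
\end{aligned}
```

Both forms give identical fitted values, but the coefficient on last year's income is $\beta_1$ in the first form and $\beta_1 + \beta_2$ in the second, so its t statistic generally differs. A rule that keeps or drops variables by statistical significance can therefore arrive at different models from two specifications that are algebraically the same.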
Thompson argues that another major problem with stepwise regression is that a local optimization obtained by including variables one-by-one is not necessarily a global optimization. For example, selecting a fifth explanatory variable contingent on the four variables that were already chosen does not necessarily select the five variables that give the highest possible R².
However, global maximization is not a goal worth seeking. Choosing a model’s explanatory variables based on R² or statistical significance is treacherous, and this is the most fundamental problem with stepwise regression and the most compelling reason why researchers should stop using it.
The traditional statistical analysis of data follows what has come to be known as the scientific method, which replaced superstition with scientific knowledge. Based on observation or speculation, the researcher poses a question, such as whether vitamin C reduces the incidence and severity of the common cold. The researcher then gathers data, ideally through a controlled experiment, to test the theory. If there are statistically persuasive differences in the outcomes for those taking vitamin C and those taking a placebo, the study concludes that vitamin C has a statistically significant effect. The researcher uses data to test a theory.
Data mining goes in the other direction, analyzing data without being motivated or encumbered by preconceived theories. Data-mining algorithms are programmed to look for trends, correlations, and other patterns in data. When an interesting pattern is found, the researcher may argue that the data speak for themselves and that is all that needs to be said. We don’t need theories—data are sufficient.
In addition to those who believe that theories are unnecessary, some believe that data should be used to discover new theories (for example, [21–23]). The label knowledge discovery emphasizes that the goal is a data-driven discovery of new, heretofore unknown theories. Indeed, committed data-miners view the use of a priori knowledge of the phenomena being modeled as a constraint that limits the possibilities for knowledge discovery.
“If you torture the data long enough, Nature will confess,” said 1991 Nobel-winning economist Ronald Coase. The statement is still true. However, achieving this lofty goal is not easy. First, “long enough” may, in practice, be “too long” in many applications and thus unacceptable. Second, to get “confession” from large data sets one needs to use state-of-the-art “torturing” tools. Third, Nature is very stubborn—not yielding easily or unwilling to reveal its secrets at all.
The author was apparently unaware of the fact that Coase intended his comment not as a lofty goal, but as a succinct criticism of the practice of ransacking data in search of statistical significance.
Variables should be included in a model because, on theoretical grounds, they should be in the model, not based on the size of their t-values. The estimated coefficients of the true explanatory variables are biased if variables that belong in the model are excluded, and have enlarged variances if variables that don’t belong are included [28, 29].
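Both effects have standard textbook expressions. As a sketch for the simplest case (the symbols are assumed here for illustration), suppose the true model is $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ and let $\delta_1$ be the slope from regressing $X_2$ on $X_1$:

```latex
\begin{aligned}
\text{Omitting } X_2:&\quad E[\tilde{\beta}_1] = \beta_1 + \beta_2\,\delta_1 \\
\text{Including } X_j \text{ with other regressors}:&\quad
\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\mathrm{SST}_j\,(1 - R_j^2)}
\end{aligned}
```

Here $\mathrm{SST}_j$ is the total sample variation in $X_j$ and $R_j^2$ is the R² from regressing $X_j$ on the other included explanatory variables; the variance expression assumes homoskedastic errors. The bias from omission disappears only if $\beta_2 = 0$ or $\delta_1 = 0$, and every added regressor that is correlated in-sample with $X_j$ raises $R_j^2$ and so inflates the variance of $\hat{\beta}_j$.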
New life with Big Data
Stepwise regression was born back when computers were much slower than today, but it has become a popular data-mining tool because it is computationally less demanding than a full search over all possible combinations of explanatory variables and, it is hoped, will give a reasonable approximation to the results of a full data-mining search. For instance, Cios et al. recommend stepwise regression as an efficient way of using data mining for knowledge discovery (see also [30–32]).
Suppose that a researcher has 100 possible explanatory variables and wants to choose up to 10 variables to include in a regression model. There are 19.4 trillion possible combinations to choose from. With 1000 possible explanatory variables, there are 2.66 × 10²³ combinations of up to 10 variables. With one million possible explanatory variables, the number of possibilities grows to 2.76 × 10⁵³.
Stepwise regression circumvents the computational burden of trying all possible combinations of explanatory variables, by testing variables, one by one, in each step. The use of forward-selection stepwise regression for identifying the 10 most statistically significant explanatory variables requires only 955 regressions if there are 100 candidate variables, 9955 regressions if there are 1000 candidates, and slightly fewer than 10 million regressions if there are one million candidate variables. This simplification is very appealing, and many researchers working with Big Data have succumbed to the appeal of stepwise regression.
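The counts quoted above can be checked directly. The short sketch below assumes that "up to 10 variables" means subsets of size 1 through 10, and that forward selection fits one regression per remaining candidate at each of 10 steps; the function names are hypothetical.

```python
from math import comb


def subsets_up_to(n_candidates, max_vars=10):
    """Number of ways to choose between 1 and max_vars variables."""
    return sum(comb(n_candidates, k) for k in range(1, max_vars + 1))


def forward_regressions(n_candidates, steps=10):
    """Regressions fitted by forward selection: one per remaining candidate per step."""
    return sum(n_candidates - s for s in range(steps))


for n in (100, 1000, 1_000_000):
    print(n, f"{subsets_up_to(n):.3g}", forward_regressions(n))
```

For 100 candidates the full search covers about 1.94 × 10¹³ subsets against 955 forward-selection regressions (100 + 99 + ... + 91), which is the gap that makes the stepwise shortcut so tempting.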
the more data, the more arbitrary, meaningless and useless (for future action) correlations will be found in them. Thus, paradoxically, the more information we have, the more difficult is to extract meaning from it. Too much information tends to behave like very little information.
If there is a fixed set of true statistical relationships that are useful for making predictions, the data deluge necessarily increases the ratio of meaningless statistical relationships to true relationships.
The fundamental problem with the notion that data come before theory is simple: We think that patterns are unusual and therefore meaningful; in Big Data, patterns are inevitable and therefore meaningless.
Stepwise regression steps—indeed leaps—into this trap. It follows automated rules that only consider statistical correlations, with no regard for whether it makes sense to include a potential explanatory variable. It is data without theory. It is data mining on steroids.
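A back-of-the-envelope calculation shows why chance patterns become inevitable. Assume, purely for illustration, $n$ nuisance variables that have no relationship with the dependent variable, each tested on its own at the 5% level, with the tests treated as independent:

```latex
\begin{aligned}
E[\text{number significant by chance}] &= 0.05\,n \\
P(\text{at least one significant}) &= 1 - 0.95^{\,n}
\end{aligned}
```

With $n = 10$ the probability of at least one spurious finding is about 0.40. With $n = 100$ it is about 0.994, and five nuisance variables are expected to pass the test.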
A Monte Carlo simulation model can be used to demonstrate the core problem with stepwise regression and how the problem is exacerbated in large data sets.
Steyerberg et al. argue that stepwise models do poorly in small data sets, an argument they illustrate by applying stepwise regression to subsets of a data set with 4, 8, or 16 explanatory variables (whose estimated coefficients are assumed to be the “true” values). Derksen and Keselman analyze 250 simulations of a Monte Carlo model with 12, 18, or 24 candidate explanatory variables and conclude that stepwise regression often chooses the wrong explanatory variables. Done decades ago, when computer capabilities were modest, these tests were understandably limited to a small number of explanatory variables and simulations.
After generating the data for the explanatory variables and the dependent variable, a stepwise regression procedure was used with the n candidate variables evaluated in random order. At each step, the potential explanatory variable with the lowest two-sided p value was added to the equation if this p value was less than 0.05. One million simulations were done for each parameterization of the model.
The central question is how effective stepwise regression is at identifying the true variables that determine Y, so that reliable predictions can be made with fresh data. So, in each simulation, 100 observations were used to estimate the stepwise model’s coefficients, and the remaining 100 observations were used to test the model’s reliability.
In practice, a stepwise regression procedure might sometimes select explanatory variables that, although not directly affecting the dependent variable, are systematically related to variables that do affect the dependent variable. For example, consumer spending depends on income, which is related to years of education. Even if education does not directly influence spending, it is a noisy proxy for income and might find its way into a stepwise regression equation. All the explanatory variables in these Monte Carlo simulations were generated independently (so that there are no proxy variables) in order to focus on the fact that stepwise regression might be fooled by purely coincidental correlations.
While they might be fortuitously correlated with the dependent variable during the estimation period, nuisance variables are useless out-of-sample because they are truly independent of the variable being predicted. The selection of nuisance variables by the stepwise regression procedure gives a false confidence in the estimated model because of the high t values and the boost they provide to R².
An extreme case (that did happen in some simulations) is when all of the explanatory variables chosen by the stepwise procedure are nuisance variables. Although there might be a great fit during the estimation period, the prediction errors will be large out-of-sample because the dependent variable will be predicted based solely on the values of irrelevant variables. There are less extreme consequences in less extreme cases, but when nuisance variables are included in the stepwise equation, we should anticipate that the prediction errors will be larger out-of-sample than in-sample.
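The extreme case can be illustrated with a deliberately simplified simulation. This is not the paper's simulation model: it assumes every candidate is a nuisance variable, performs only the first forward-selection step (keeping the single candidate with the best in-sample fit, with no 5% cutoff), and borrows σx = 5, σy = 20, and the split of 100 estimation and 100 test observations from the text. All function names are hypothetical.

```python
import math
import random


def simple_fit(x, y):
    """Intercept and slope of a one-variable least-squares regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def rmse(x, y, intercept, slope):
    """Root mean squared prediction error of the fitted line on (x, y)."""
    return math.sqrt(sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y)) / len(y))


def one_simulation(rng, n_candidates, n_obs=100, sigma_x=5.0, sigma_y=20.0):
    """Pick the best-fitting nuisance variable in-sample, then score it out-of-sample."""
    y = [rng.gauss(0, sigma_y) for _ in range(2 * n_obs)]
    best = None
    for _ in range(n_candidates):
        x = [rng.gauss(0, sigma_x) for _ in range(2 * n_obs)]  # independent of y
        a, b = simple_fit(x[:n_obs], y[:n_obs])
        err_in = rmse(x[:n_obs], y[:n_obs], a, b)
        if best is None or err_in < best[0]:
            best = (err_in, rmse(x[n_obs:], y[n_obs:], a, b))
    return best


def average_rmse(n_candidates, n_sims=100, seed=0):
    """Average (in-sample RMSE, out-of-sample RMSE) over n_sims simulations."""
    rng = random.Random(seed)
    runs = [one_simulation(rng, n_candidates) for _ in range(n_sims)]
    return sum(r[0] for r in runs) / n_sims, sum(r[1] for r in runs) / n_sims


for n_candidates in (5, 50):
    rmse_in, rmse_out = average_rmse(n_candidates)
    print(n_candidates, round(rmse_in, 2), round(rmse_out, 2))
```

Because the chosen variable is picked for its in-sample fit and has no real relationship with the dependent variable, the average out-of-sample error should exceed the in-sample error.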
[Table: Average number of explanatory variables per equation, σx = 5; two column groups, each with σy = 10, 20, and 30]
[Table: Frequencies for an included variable being a nuisance variable, σx = 5; two column groups, each with σy = 10, 20, and 30]
[Table: Frequencies for the number of true variables selected, σx = 5 and σy = 20]
Stepwise enthusiasts often claim that adding variables based on statistical significance will improve the model’s predictions, by which they mean improve the fit for the data used to estimate the model. However, adding variables does not necessarily help, and may hurt, when a stepwise model is used to make predictions with fresh data.
[Table: In-sample and out-of-sample prediction errors, σx = 5 and σy = 20]
The stepwise models consistently did substantially worse out-of-sample than in-sample. As the number of candidate variables increases, the in-sample fit improves, while the out-of-sample fit deteriorates, causing the ratio of the out-of-sample errors to the in-sample errors to balloon.
A model’s weaknesses can be exposed by the deterioration of the model’s fit using fresh data. It is therefore reasonable to hold out part of the available data for testing the estimated model [36, 37]. The two parts of the data are labeled in-sample and out-of-sample or, more recently, training data and validation data.
It is always a good idea to test a model with fresh data. However, choosing a data-mined model by using a repetitive cycle of in-sample estimation and out-of-sample testing does not guarantee that the best model will be chosen.
Tireless data mining guarantees that some models will fit both parts of the data remarkably well, even if none of the models are meaningful. Just as some models are certain to fit the in-sample data by luck alone, so some models are certain to fit the out-of-sample data as well. Uncovering a model that fits both the in-sample data and the out-of-sample data is just another form of data mining. Instead of discovering a model that fits half the data, we discover a model that fits all the data. That doesn’t solve the fundamental problem, which is that models that are chosen solely to fit the data, either half the data or all the data, cannot be expected to fit new data nearly as well.
Figure 1 shows that, although the out-of-sample RMSEs are generally larger than the in-sample RMSEs, there are many simulations in which the out-of-sample RMSE is close enough to the in-sample RMSE to suggest that a good model has been discovered—when, in fact, all the models are just coincidental correlations. Specifically, the out-of-sample RMSE is less than the in-sample RMSE 2% of the time, and within 10% of the in-sample RMSE 8% of the time. For similar simulations with the drift model, the out-of-sample RMSE is less than the in-sample RMSE 5% of the time, and within 10% of the in-sample RMSE 15% of the time.
A persistent data miner would have no trouble finding a model that performs almost as well out-of-sample as in-sample, even though the model is useless because the variable being predicted is only coincidentally related to the explanatory variables.
In addition to stepwise regression, several other feature selection methods have been proposed to deal with the curse of dimensionality, which can be computationally demanding and lead to overfitting and inaccurate out-of-sample predictions because of the inclusion of nuisance variables. For example, good results have been reported with recursive feature elimination with cross-validation [38, 39] and regularized tree ensembles, which are two efficient ways of identifying a parsimonious set of predictors. One of the particular strengths of recursive feature elimination with cross-validation is that a feature selection method is most likely to be successful when it is validated with out-of-sample data.
The stepwise simulations reported here confirm the value of using theoretical arguments or expert opinion to select the initial list of predictors. The stepwise regression models are much more successful when the procedure begins with 5 true variables and 5 nuisance variables than with 5 true variables and hundreds of nuisance variables.
One appealing way to deal with ambiguous theory is to use a Bayesian approach that explicitly allows uncertainty about the relevance of potential predictors and does not force a binary choice between inclusion and exclusion. Bayesian regression combines the data with a prior distribution for the model’s parameters by using Bayes’ theorem to derive a posterior distribution for the parameters and for predictions made with the model. As the amount of data increases, the posterior means converge to the least squares estimates. The computations can be challenging, but have now become practical. Detailed examples can be found in [41–43].
Ridge regression implicitly uses prior distributions for the coefficients of the explanatory variables that are independent and have zero means and identical variances. It seems unlikely that the coefficients of predictors chosen on the basis of expert opinion would have prior means of zero. It is more appealing to use explicit priors instead of implicit priors.
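The link between Bayesian regression, least squares, and ridge regression can be written out for the simplest textbook case. The sketch below assumes a normal linear model $y = X\beta + \varepsilon$ with known error variance $\sigma^2$ and a normal prior $\beta \sim N(\mu_0, \Sigma_0)$; these symbols and the known-variance simplification are assumptions made here, not the paper's notation.

```latex
\begin{aligned}
\text{Posterior mean:}\quad
E[\beta \mid y] &= \left(X^\top X + \sigma^2 \Sigma_0^{-1}\right)^{-1}
                   \left(X^\top y + \sigma^2 \Sigma_0^{-1}\mu_0\right) \\
\text{Ridge special case } (\mu_0 = 0,\ \Sigma_0 = \tau^2 I):\quad
E[\beta \mid y] &= \left(X^\top X + \lambda I\right)^{-1} X^\top y,
\qquad \lambda = \sigma^2/\tau^2
\end{aligned}
```

As observations accumulate, $X^\top X$ grows while the prior terms stay fixed, so the posterior mean approaches the least squares estimate $(X^\top X)^{-1} X^\top y$. Setting $\mu_0 = 0$ shrinks every coefficient toward zero, whereas an explicit prior lets $\mu_0$ and $\Sigma_0$ reflect what expert opinion says about each predictor.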
Stepwise regression selects explanatory variables for multiple regression models based on their statistical significance. Although it has often been criticized for the misapplication of single-step statistical tests to a multi-step procedure, stepwise regression has become popular with Big Data because it is a very efficient way of choosing a relatively small number of explanatory variables from a vast array of possibilities. The assumption is that the larger the number of possible predictors, the more useful is stepwise regression.
This paper uses Monte Carlo simulations to demonstrate that a stepwise procedure may choose nuisance variables rather than true variables and that the out-of-sample accuracy of the model may be far worse than the in-sample fit. These problems are more likely to be serious when there are a large number of potential predictors. Stepwise regression does not solve the problem of Big Data. Big Data exacerbates the problems of stepwise regression.
The author read and approved the final manuscript.
GS received his Ph.D. in Economics from Yale University and was an Assistant Professor there for 7 years. He is now the Fletcher Jones Professor of Economics at Pomona College. He has written (or co-authored) more than 80 academic papers and 13 books. His Standard Deviation: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics (Overlook/Duckworth, 2015) was a London Times Book of the Week and debunks a variety of dubious and misleading statistical practices. The AI Delusion (Oxford University Press, 2018) argues that, in this age of Big Data, the real danger is not that computers are smarter than us, but that we think computers are smarter than us and, so, trust computers to make important decisions for us.
The author declares that there are no competing interests.
Availability of data and materials
Not applicable (all data are from Monte Carlo simulations; the source code is available).
Consent for publication
Ethics approval and consent to participate
Not applicable (no human participants).
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. Efroymson MA. Multiple regression analysis. In: Ralston A, Wilf HS, editors. Mathematical methods for digital computers. New York: Wiley; 1960.
2. Thompson B. Why won’t stepwise methods die? Meas Eval Couns Dev. 1989;21(4):146–8.
3. Hurvich CM, Tsai CL. The impact of model selection on inference in linear regression. Am Stat. 1990;44(3):214–7.
4. Harrell FE Jr. Regression modeling strategies: with applications to linear models, logistic regression and survival analysis. New York: Springer; 2001.
5. Hendry DF, Krolzig HM. Automatic econometric model selection. London: Timberlake Consultants Press; 2001.
6. Babyak MA. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med. 2004;66:411–21.
7. Whittingham MJ, Stephens PA, Bradbury RB, Freckleton RP. Why do we still use stepwise modelling in ecology and behaviour? J Anim Ecol. 2006;75(5):1182–9.
8. Castle JL, Fawcett NWP, Hendry DF. Evaluating automatic model selection, Technical Report 474. Oxford: Department of Economics, University of Oxford; 2010.
9. Flom PL, Cassell DL. Stopping stepwise: why stepwise and similar selection methods are bad, and what you should use. In: NESUG 2007 proceedings. 2007.
10. Thompson B. Stepwise regression and stepwise discriminant analysis need not apply here: a guidelines editorial. Educ Psychol Meas. 1995;55:525–34.
11. Marascuilo LA, Serlin RC. Statistical methods for the social and behavioral sciences. New York: W. H. Freeman; 1988.
12. Huberty CJ. Problems with stepwise methods—better alternatives. In: Thompson B, editor. Advances in social science methodology, vol. 1. Greenwich: JAI Press; 1989.
13. Vlachopoulou M, Ferryman TA, Zhou N, Tong J. A stepwise regression method for forecasting net interchange schedule. https://doi.org/10.1109/pesmg.2013.6672763. 2013.
14. Walter S, Tiemeier H. Variable selection: current practice in epidemiological studies. Eur J Epidemiol. 2009;24(12):733–6.
15. Liao H, Lynn HS. A survey of variable selection methods in two Chinese epidemiology journals. BMC Med Res Methodol. 2010;10:87. https://doi.org/10.1186/1471-2288-10-87.
16. Rachev ST, Mittnik S, Fabozzi FJ, Focardi SM, Jašić T. Financial econometrics: from basics to advanced modeling techniques. New York: Wiley; 2006.
17. McDonald JH. Handbook of biological statistics. 3rd ed. Baltimore: Sparky House Publishing; 2014.
18. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. New York: Springer; 2016.
19. Wiley. Wiley 11th hour study guide for level II CFA exam. 2nd ed. New York: Wiley; 2017. p. 31.
20. Friedman M. The permanent income hypothesis: a theory of the consumption function. Princeton: Princeton University Press; 1957.
21. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Mag. 1996;17(3):37–54.
22. Kecman V. Foreword. In: Cios KJ, Pedrycz W, Swiniarski RW, Kurgan LA, editors. Data mining: a knowledge discovery approach. New York: Springer; 2007.
23. Begoli E, Horsey J. Design principles for effective knowledge discovery from big data. In: Software architecture (WICSA) and European conference on software architecture (ECSA), 2012 joint working IEEE/IFIP conference.
24. Piatetsky-Shapiro G. Knowledge discovery in real databases: a report on the IJCAI-89 workshop. AI Mag. 1991;11(5):68–70.
25. Sagiroglu S, Sinanc D. Big data: a review. In: 2013 international conference on collaboration technologies and systems (CTS). 2013.
26. Kecman V. Foreword. In: Cios KJ, Pedrycz W, Swiniarski RW, Kurgan LA. Data mining: a knowledge discovery approach. New York: Springer; 2007.
27. Tullock G. A comment on Daniel Klein’s “A plea to economists who favor liberty”. East Econ J. 2001;27(2):203–7.
28. Wooldridge JM. Introductory econometrics: a modern approach. 3rd ed. Mason: Thomson; 2006. p. 94–7.
29. Stock JH, Watson MW. Introduction to econometrics. 2nd ed. Boston: Pearson; 2007. p. 316–9.
30. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer; 2009. http://www-stat.stanford.edu/~tibs/ElemStatLearn/download.html.
31. Varian HR. Big data: new tricks for econometrics. J Econ Perspect. 2014;28(2):3–27.
32. Bruce P, Bruce A. Practical statistics for data scientists: 50 essential concepts. Sebastopol: O’Reilly Media; 2017.
33. Calude CS, Longo G. The deluge of spurious correlations in big data. Found Sci. 2016. https://doi.org/10.1007/s10699-016-9489-4.
34. Steyerberg EW, Eijkemans MJC, Habbema JDF. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol. 1999;52(10):935–42.
35. Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psychol. 1992;45(2):265–82.
36. Myers JH, Forgy EW. The development of numerical credit evaluation systems. J Am Stat Assoc. 1963;58(303):799–806.
37. Mark J, Goldberg MA. Multiple regression analysis and mass assessment: a review of the issues. Apprais J. 2001;56:89–109.
38. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46:389–422.
39. Mukherjee T, Duckat M, Kumar P, Paquet JD, Rodriguez D, Haulcomb M, George K, Pasiliao E. RSSI-based supervised learning for uncooperative direction-finding. In: Altun Y, editor. Machine learning and knowledge discovery in databases. ECML PKDD 2017, vol. 10536. Lecture Notes in Computer Science. Cham: Springer; 2015.
40. Deng H, Runger G. Feature selection via regularized trees. In: Proceedings of the 2012 international joint conference on neural networks (IJCNN), IEEE; 2012.
41. Box GEP, Tiao GC. Bayesian inference in statistical analysis. New York: Wiley; 1973.
42. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. 2nd ed. Boca Raton: Chapman and Hall/CRC; 2003.
43. Koehrsen W. Introduction to Bayesian linear regression. Towards Data Science. 2018. https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7.
44. Smith G, Campbell F. A critique of some ridge regression methods. J Am Stat Assoc. 1980;75(369):74–81.