Application of variable selection and dimension reduction on predictors of MSE’s development

Wubetie, Habtamu Tilaye

doi:10.1186/s40537-018-0153-4

Methodology
Open access
Published: 18 February 2019


Journal of Big Data volume 6, Article number: 17 (2019)


Abstract

Nature creates variables from underlying characteristics, and variables share these characteristics to a degree that ranges from very small to relatively large. As a result, variables range from very different to very similar in character, and relationships arise among them. The literature suggests different measures of relationship, depending on the nature of the variables and the type of relationship that exists. Today, because large and highly varied data sets are produced frequently, the currently suggested variable filtering and selection methods fall short of the need. This research aims to fill this gap by comparing the methods suggested in the literature in order to identify better variable selection and dimension reduction methods. Regression analysis using all factors suggested in the literature shows that none of the predictors of the development status of an enterprise is significant, and that only 10 of the 81 factors are significant predictors of the number of employees in an enterprise. Therefore, variable selection and dimension reduction methods are applied to find the predictors of each response by removing variable redundancy and the complexity of incorporating a large number of variables. Based on statistical power, the results from the variable selection methods, especially the association and correlation methods, show that CANOVA more efficiently detects non-linear or non-monotonic correlation between continuous–continuous and continuous–categorical variables. Spearman’s correlation coefficient more efficiently detects monotonic correlation between two continuous variables, and between a continuous and a categorical variable. The Pearson correlation coefficient more efficiently detects linear correlation between continuous variables. MIC efficiently detects non-linear or non-monotonic relations between continuous variables.
The chi-square test of independence efficiently detects relations between two continuous variables and between two categorical variables, but it does not detect non-linear or non-monotonic relations between a continuous and a categorical variable well. On the other hand, the results from the lasso and stepwise methods reveal that predictor–response relations arising from interaction effects, which are not detected by the correlation and association methods, are detected by stepwise variable selection, and that multicollinearity is detected and removed by the lasso. Regressing the response variable “number of employees in an enterprise” on the variables selected by the lasso and stepwise methods gives a better model fit (based on the adjusted R-squared value) than regressing it on the variables selected by the association and correlation methods. In contrast, regressing the response variable “development status of an enterprise” on the variables selected by the association and correlation methods yields 12 significant variables, whereas none of the variables selected by the lasso and stepwise methods is significant. As a result, 51 predictors of the number of employees in an enterprise and 40 predictors of the development status of an enterprise are detected as significantly related variables. The lasso and stepwise methods are therefore preferred for selecting predictors of the continuous response variable “number of employees in an enterprise”, and the association and correlation methods are preferred for selecting predictors of the categorical response variable “development status of an enterprise”. Finally, the reduced regression models reveal that 20 predictors have a causal relation with the number of employees in an enterprise, and 12 predictors have a causal relation with the development status of an enterprise.
For dimension reduction, based on model fit, information loss, and the number of significant factors, principal factor analysis is preferred and applied for the categorical response variable “development status of an enterprise”, and factor score based regression is preferred and applied for the continuous response variable “number of employees in an enterprise”. However, comparing the results of variable selection and dimension reduction indicates that the variable selection methods give a larger gain in model fit than the dimension reduction methods. Hence, the suggested variable selection methods are preferred over the dimension reduction methods and are applied to find the predictors. In general, the suggested variable selection procedure is recommended when a small number of variables is studied, and the suggested dimension reduction methods are recommended for a large number of varied variables (the big data case).

Introduction

Nature creates variables from underlying characteristics, and variables share these characteristics to a degree that ranges from very small to relatively large. As a result, variables range from very different to very similar in character. Variables with a more similar character largely share the same underlying components (they have relatively the same composition), whereas a very small similarity reflects a large difference in composition. Hence, treating variables with a similar character as one variable, or taking one of them as a representative, can remove this natural redundancy, and it helps to manage and analyse the relationships between variables in a world where a large number of variables are inter-related. Because of this inter-relation, variables can have a direct causal relation, an indirect causal relation, or a relation without a causal nature. Statistically, a direct causal relation indicates the presence of dependency between variables, whereas indirect causality is due to the presence of a latent variable. A relation between variables without known causality reflects a relation that is not yet well understood in the real world. The relation between variables can be linear, non-linear, or random. Statistical methods such as variable selection and dimension reduction can be used to reduce the number of variables, either by taking a single variable or by merging statistically similar variables into a component.

Measuring the predictor–predictor and response–predictor relations is important for recognizing the relationships that exist, and for obtaining a short list of influential factors whose effect on the response variable can be determined in further analysis.

However, because the predictor variables are inter-related, they influence the response variable not only individually but also as a group. This natural inter-relation between variables is not captured by a simulation study, or by predictor–response association or correlation measures alone. Accordingly, this study aims to detect the interaction effect in real data, using the micro and small enterprises (MSEs) data set (file name: MSEs.csv), by considering the predictors filtered by association, correlation and regression measures for the predictor–predictor and predictor–response relations. The possible combinations of the selected (filtered) groups of variables are then regressed on the response variable, and the significantly and potentially related variables are re-selected using the stepwise and lasso variable selection methods.

Statistical measures of association, correlation and regression are used to find the relations that exist between variables. In this research, the measures used for variable selection are the Pearson correlation coefficient, Spearman’s rank correlation coefficient, the chi-square test of independence, the maximal information coefficient (MIC), the continuous analysis of variance test (CANOVA), stepwise variable selection and lasso variable selection. The methods used for dimension reduction are principal factor analysis and factor score analysis.
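The difference between the linear and the monotonic measure can be illustrated with a minimal sketch. It uses only the Python standard library, and the function names and the two synthetic series are illustrative assumptions, not data or code from this study. On a monotone but non-linear series, the Spearman coefficient reaches 1 while the Pearson coefficient stays below it; on a U-shaped series, both are close to zero, which is the case the non-monotonic measures are meant to cover.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic illustration only (not the MSE data).
x = [i / 2 for i in range(-10, 11)]      # -5.0, -4.5, ..., 5.0
monotone = [math.exp(v) for v in x]      # monotone, non-linear
u_shape = [v ** 2 for v in x]            # non-monotonic

r_mono_pearson = pearson(x, monotone)
r_mono_spearman = spearman(x, monotone)
r_u_pearson = pearson(x, u_shape)
r_u_spearman = spearman(x, u_shape)

print("monotone:", r_mono_pearson, r_mono_spearman)
print("U-shape: ", r_u_pearson, r_u_spearman)
```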

Wang et al. [20] used simulated and real datasets (a kidney cancer RNA-seq dataset) to compare the false positive rates and statistical power of CANOVA with those of six other methods (distance correlation, Hoeffding’s independence test, the Pearson correlation coefficient, Spearman’s rank correlation coefficient, Kendall’s rank correlation coefficient and the maximal information coefficient). They showed that CANOVA, the Pearson correlation coefficient, Spearman’s rank correlation coefficient, Kendall’s rank correlation coefficient and the MIC gave the expected false positive rates. Hence, these methods can detect the truly significant variables. However, the false positive rate was lower than expected for distance correlation and higher than expected for Hoeffding’s independence test. So truly significant variables may not be detected by distance correlation, and the results of Hoeffding’s independence test may contain falsely significant variables. Hence, the Pearson correlation was recommended when the correlation between two continuous variables is linear, and CANOVA was recommended when the correlation between two continuous variables is non-linear or complicated.
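The idea behind a neighbourhood-based test of this kind can be sketched as follows. This is a simplified, CANOVA-style illustration and not the reference implementation of Wang et al. [20]: the neighbourhood rule (points whose x-ranks differ by less than `k`), the default `k = 2`, the permutation p-value formula, the function names and the synthetic data are all assumptions made for the example, and ties in `x` receive no special treatment. The statistic is small when nearby x-values have similar y-values, whatever the shape of the relation, and its null distribution is obtained by shuffling `y`.

```python
import math
import random

def neighbor_variation(x, y, k=2):
    """Sum of squared y-differences over pairs whose x-ranks differ by less than k."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ys = [y[i] for i in order]
    total = 0.0
    for i in range(len(ys)):
        for j in range(i + 1, min(i + k, len(ys))):
            total += (ys[i] - ys[j]) ** 2
    return total

def canova_like_pvalue(x, y, k=2, permutations=500, seed=0):
    """Permutation p-value: share of shuffles with variation <= the observed one."""
    rng = random.Random(seed)
    observed = neighbor_variation(x, y, k)
    shuffled = list(y)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if neighbor_variation(x, shuffled, k) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (permutations + 1)

# Synthetic, non-monotonic relation (illustration only, not the MSE data).
x = [i / 4 for i in range(-20, 21)]
y = [math.cos(2 * v) for v in x]

p_value = canova_like_pvalue(x, y)
print("permutation p-value:", p_value)
```

Because `cos(2x)` is symmetric around zero on this grid, its Pearson correlation with `x` is essentially zero, yet the neighbourhood statistic still separates it clearly from shuffled data.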

Dimension reduction is a tool for avoiding the complexity of a large number of variables by working with the smallest set of variables that still reflects the needed information. This is possible because some variables are highly correlated with each other or with a latent variable, or because a few of the variables account for a large share of the variability in the data set. For this type of problem, reduction methods such as principal factor analysis and factor score analysis are suggested [1, 2].
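How a few correlated variables collapse into one component can be shown with a minimal sketch. Note the substitution: the code extracts the first principal component of a correlation matrix by power iteration, which is a simpler stand-in for the principal factor and factor score analyses named above, not an implementation of them. The three synthetic variables and all function names are assumptions made for illustration.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]

def leading_component(matrix, iterations=200):
    """Power iteration: returns (largest eigenvalue, unit-length eigenvector)."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient v' M v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v * sum(m * w for m, w in zip(row, vec))
                     for v, row in zip(vec, matrix))
    return eigenvalue, vec

# Synthetic variables (illustration only): x2 is almost a multiple of x1,
# x3 is an alternating pattern only weakly related to the other two.
x1 = [float(i) for i in range(1, 11)]
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2, 0.1, -0.1]
x2 = [2 * a + e for a, e in zip(x1, noise)]
x3 = [1.0 if i % 2 == 0 else -1.0 for i in range(10)]

corr = correlation_matrix([x1, x2, x3])
eigenvalue, loadings = leading_component(corr)
# The trace of a correlation matrix equals the number of variables.
explained_share = eigenvalue / len(corr)

print("share of variance on first component:", explained_share)
print("loadings:", loadings)
```

The two nearly redundant variables load almost equally on the first component, so one component carries roughly the information of both.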

Because large and highly varied data sets are now produced frequently, the variable filtering and selection methods suggested in the literature fall short of the need. Hence, this research aims to fill this gap by identifying better variable filtering, selection and dimension reduction methods using real data. The statistical methods of variable selection and dimension reduction described above are applied to reduce the number of variables, either by taking a single variable or by merging statistically similar variables into a component.

Data and variable

In the literature, entrepreneurs’ development is measured in relation to the success of the individual, the success of society, and firm survival [3, 4]. Bosma et al. [4] measured the development of an enterprise by considering the profits of the entrepreneur, the employment created by the entrepreneur, and the survival period of the firm. The determinants of entrepreneurs’ development are the starting human capital, social capital, financial capital and the strategies applied to the business.

Coduras et al. [5] constructed a measure of an individual’s readiness for entrepreneurship based on three main categories: sociological, psychological and managerial–entrepreneurial. The South African Small Enterprise Development Agency studied the impact of the 2008 and 2009 global financial crisis on South Africa’s SMMEs, based on the literature and current data, and suggested that South Africa’s SMMEs are challenged by access to finance and markets, poor infrastructure, labour laws, crime, skills shortages and inefficient bureaucracy. Assefa et al. [7] studied the factors affecting the success of micro and small-scale enterprises in Addis Ababa and five other major regional towns in Ethiopia. They found that the key success factors are personal qualities, such as having an articulate vision or ambition and innate abilities; working experience in the formal sector as a factory employee or in family businesses; managerial and entrepreneurial skills; and higher equity in the invested money. The constraints on MSEs, in contrast, are the shortage and small size of credit, the shortage of working and sales spaces, the lack of rental machinery and stringent licensing requirements.

The sample data were taken from enterprises in Debre Markos town in 2017. The study units are individuals who started their business between 1994 and 2006 and are currently working in their own enterprise or business. The respondents gave detailed information on their entrepreneurial knowledge, skills and experience, on the business environment, and on their strategies. Additional information on the enterprises was taken from the Trade and Industry Office of Debre Markos town.

The sampling method of a study is determined by the nature of the population under study. The Ethiopian Ministry of Urban Development and Housing (MoUDH) classifies micro and small enterprises into five sectors, namely the manufacturing sector, the service sector, trade, the construction sector, and the mining and quarrying sector. However, the Trade and Industry Office of Debre Markos town currently re-classifies MSEs into the manufacturing sector, the service sector, trade, urban farming and the construction sector, splitting the service sector into service and urban farming. Since enterprises are more heterogeneous across sectors than within a sector, stratified sampling is the appropriate choice. The sample size is determined by stratified optimal allocation, based on the stratum variances calculated from the information (secondary data) obtained from the Trade and Industry Office. The variable of interest is enterprise development status, which is categorical with value 1 (achieved the expected progress stated by MoUDH) or 0 (did not achieve the expected progress). At the 99% confidence level, with the true population proportion within 0.05 of the sample proportion, a sample of 179 enterprises is taken from a total of 2093 enterprises. The study units are allocated to each stratum according to the stratum’s variance rather than its proportion, because the strata differ greatly in size: some have fewer than 20 enterprises and some have more than a thousand [8].
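One common way to allocate a fixed sample across strata by variability is Neyman (optimal) allocation, in which stratum h receives a share proportional to N_h × S_h, the stratum size times its standard deviation. The sketch below assumes this rule; whether it matches the exact allocation used in this study is an assumption. The stratum sizes and success proportions in the code are hypothetical placeholders, chosen only so that the sizes add up to 2093; they are not the figures from the Trade and Industry Office. For a binary variable, S_h is taken as sqrt(p_h(1 − p_h)).

```python
import math

def neyman_allocation(sizes, std_devs, total_sample):
    """Allocate total_sample across strata in proportion to N_h * S_h.

    Fractional shares are rounded with the largest-remainder rule so that
    the allocated counts add up exactly to total_sample.
    """
    weights = {h: sizes[h] * std_devs[h] for h in sizes}
    weight_sum = sum(weights.values())
    raw = {h: total_sample * w / weight_sum for h, w in weights.items()}
    allocation = {h: int(raw[h]) for h in raw}          # floor of each share
    leftover = total_sample - sum(allocation.values())
    by_remainder = sorted(raw, key=lambda h: raw[h] - allocation[h], reverse=True)
    for h in by_remainder[:leftover]:
        allocation[h] += 1
    return allocation

# HYPOTHETICAL stratum sizes (sum to 2093) and proportions of enterprises
# with development status 1; placeholders for illustration only.
stratum_sizes = {
    "Manufacturing": 300,
    "Service": 500,
    "Trade": 1200,
    "Urban farming": 18,
    "Construction": 75,
}
stratum_proportions = {
    "Manufacturing": 0.3,
    "Service": 0.5,
    "Trade": 0.4,
    "Urban farming": 0.2,
    "Construction": 0.1,
}
stratum_sds = {h: math.sqrt(p * (1 - p)) for h, p in stratum_proportions.items()}

allocation = neyman_allocation(stratum_sizes, stratum_sds, total_sample=179)
for sector, n_h in allocation.items():
    print(sector, n_h)
```

A production version would also cap each allocation at the stratum size and redistribute any excess, which matters when a small stratum is highly variable.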

Variable of the study

In this study, two dependent variables and 81 independent variables are considered. The explanatory variables are listed in Appendix: Tables 12, 13, 14, 15, 16 and 17.

Dependent variable

The variable of interest is enterprise development status. Bosma et al. [4] measured entrepreneurs’ development (an individual-level approach to measuring enterprise development status) in relation to the success of the individual, such as profit made and capital growth; the success of society, based on employment capacity; and firm survival. In the Ethiopian context, the Ministry of Urban Development and Housing (MoUDH) measures the development status of micro and small enterprises by the progress an enterprise makes in capital accumulation and in human capital, mainly in terms of the number of employees [3]. The MoUDH definition of micro and small enterprises is given in Table 1.

Table 1 Current definition of MSEs in Ethiopia
