Missing data management and statistical measurement of socioeconomic status: application of big data
Habtamu Tilaye Wubetie^{1}
Received: 1 August 2017
Accepted: 19 October 2017
Published: 19 December 2017
Abstract
Socioeconomic status measurement is an ongoing problem for which researchers have suggested different measures. This work investigates a socioeconomic status measurement derived from the natural correlations among variables, one that can better and more meaningfully cluster African countries by level of status. The researcher used yearly socioeconomic time series data for 48 African countries from 1993 to 2013, taken from the IMF 2013 data set, for data management (i.e., 2737 variables over 21 years); however, the analysis is reasonably based on the most recent 14 years of the time series. In data management, missing values are treated (imputed) using regression estimates, Lagrange interpolation, linear interpolation and linear spline interpolation, choosing the method that best fits the trend of the data with minimum error at each time level. From principal component and factor analysis of the averaged time series data, 7 principal factors, contributed by 84 variables and explaining \(70\%\) of the variation in the data set, are suggested as the components for measuring socioeconomic status. As a result, the clustering methods considered (K-means method, average linkage method, Ward’s method and bootstrap Ward’s method) agree on six clusters of countries that are statistically significant at \(95\%\), whereas three countries, suggested as outlier countries, each form an individual cluster.
Introduction
Socioeconomic status measurement is an ongoing problem: different studies have measured it as a single variable, as several single variables, or as a composite of several variables. Socioeconomic status is defined as one’s access to financial, social, cultural and human capital resources, and it is recommended that family income together with other indicators of home possessions and resources, parental educational attainment and parental occupational status (the “big 3”) serve as the components of a core socioeconomic status measure [1]. It has also been defined as one’s access to collectively desired resources, such as (1) material capital (income, wealth, trust funds, etc.), (2) human capital (skills, abilities, credentials, etc.) and (3) social capital (instrumental relationships such as being friends with lawyers and doctors) [2]. The Duncan Socioeconomic Index, a subjective assessment of occupational prestige based on educational attainment and income, has been used as a measure in the US. In 1974 Peter Rossi et al. also developed a household prestige score as a measure. Currently, the UK uses the National Statistics Socioeconomic Classification (NSSEC) to calculate a measure of socioeconomic status based on one’s job and employment relations.
From a sociological view, the status of a society can be levelled using the status dimensions of power, wealth, prestige and information. American sociologists have used occupational prestige to level status, and they have examined the similarities and differences that exist between levels at different times and places. Some sociologists have defined occupational prestige as the availability of resources (composed of both wealth and power) to each person, whereas others relate it to prestige (composed of power, wealth and prestige). Goldthorpe and Hope define it as social standing, which includes standard of living, power and influence, level of qualification and value to society [3].
Barro [17] studied 100 countries from 1960 to 1990. He found that the growth rate (of real per capita GDP) is enhanced by higher initial schooling and life expectancy, lower fertility, lower government consumption, better maintenance of the rule of law, lower inflation and improvements in the terms of trade. A 2015 study examined the factors affecting economic growth in developing countries using cross-country data for 76 countries from 2010, 2005, 2000 and 1995. The variables used to assess the factors behind GDP per capita growth were volume of exports, government debt (% of GDP), natural resource yield (% of GDP), net foreign aid record (USD), life expectancy (years), investment rate (% of GDP) and FDI inflow (% of GDP). The results showed that a high volume of exports, plentiful natural resources, longer life expectancy and higher investment rates have positive impacts on the growth of per capita gross domestic product in developing countries [4].
In constructing a measurement for socioeconomic status, reliability is of primary importance. Reliability depends on the socioeconomic and statistical significance of the classifications made by applying an appropriate method to representative data. However, the representativeness of some African data has been criticized, and the sources of the problem are diverse. Historically, in the 1980s and 1990s statistical offices did not receive appropriate attention as a source of data, and even today the data are distorted by shifts in the data demanded by donors. In addition, projects on Africa have focused on achieving target development rather than on answering important development questions. Another problem is the lack of data: as an African Development Bank survey noted, nearly one-fifth of the respondent countries had not conducted an industry survey since 2000. Beyond these problems, African data have suffered from sampling error (inappropriate sample size and sampling technique) and nonsampling error (respondent error, nonresponse, recording error, etc.) [5]. Missing data become a problem when there is nonresponse or when the data for a variable are not collected, mainly when the values are missing not at random. The Lagrange interpolation method can be used to interpolate missing values when a value depends on its neighbouring data. Vaseghi [6] formulates the general form of the polynomial interpolators and statistical interpolators applicable to missing data imputation, and considers the special forms of the Lagrange, Newton, Hermite and cubic spline interpolators among the polynomial interpolators. Lokupitiya et al. [7] use NASS data for barley crop yield in 1997, where the ecological variables are spatially correlated, to select a better interpolation method. Among the methods considered (regression, kernel smoothing, universal kriging and multiple imputation), they find regression and multiple imputation to be the better interpolators for Y (the response) based on the target or control variable X (the explanatory variable).
Howell [8] considered the missing data problem for standard experimental studies and observational studies. In observational studies missing values can be treated by hot deck imputation, mean substitution and pairwise deletion, but those methods lead to bias in parameter estimation. The expectation-maximization (EM) algorithm and multiple imputation (MI), by contrast, are the best techniques; they are based on iterative solutions in which the parameter estimates lead to imputed values, which in turn change the parameter estimates. MI is an interesting approach because it uses randomized techniques to do its imputation; an example is regression imputation, which regresses the response variable on the explanatory variable.
The lack of reliable data on African economies limits knowledge of the economic effect of structural adjustment; as a result, the economic growth time series for African economies do not appropriately capture changes in economic development [9]. Currently the better source of African socioeconomic data is the IMF [11]; alternatively, if we trust the AfDB, its data may miss a few base year revisions [10], although the AfDB does not fully agree with the IMF [11] report. Meanwhile, the AfDB concludes that: “Overall, the situation with regard to GDP is not nearly as bad as has recently been suggested” [9]. In working on this problem, time series data are preferable to a single value, since considering the distribution of the data gives more detailed and more general information about the characteristics of interest than one value does. Consequently, the risk of being governed by inappropriate data alone can be reduced.
Previous studies on the measurement of socioeconomic status construct measuring components or variables based on the socioeconomic standing of an individual or community [1–4]. However, socioeconomic status measurement is still an ongoing problem. This paper aims to construct measuring components by investigating the natural correlations that exist between the suggested candidate variables, components that are able to cluster countries by socioeconomic status and to level the status for each component. However, a single measure of status is not constructed, since the concern is to give specific suggestions based on the standing of the clustered countries on each component. Hence, time series data are used to manage the African socioeconomic data problem and to determine the components of socioeconomic status measurement that can classify African countries by status through comparison across the region. Correspondingly, missing values were treated by the method that gives the minimum error from the true value at each time interval. Missing values are imputed using a linear regression model, Lagrange interpolation, linear interpolation and linear spline interpolation. Principal component analysis, factor analysis and cluster analysis are used to determine the principal factors of socioeconomic status measurement and to cluster African countries based on those factors. The results reveal that \(70\%\) of the variation in the data set is explained by the 7 suggested components (principal factors), which are contributed by 84 variables; using those socioeconomic components, 6 clusters of African countries are formed at the \(95\%\) confidence level, with 3 countries considered outliers.
Methodology
Data and variable
Data: The IMF [11] yearly socioeconomic time series data set, containing 2737 variables [File Name: 21yearData.csv] from 1993 to 2013 for 48 African countries, was used for data management; however, the analysis is reasonably based on the data from 2000 to 2013. The reason for using 14 years of data instead of 21 is the recently growing demand for data, which has apparently increased the output of statistical offices and so reduced the number of missing values in recent years. Specifically, almost all of the 44 respondent countries have carried out at least one household survey of income or expenditure since 2000 [9].
Moreover, as indicated in the introduction, the AfDB also agrees with the IMF [11] report.
1. Some data values are missing.
2. Different countries have different base years for their GDP. In response, the IMF updates each country’s GDP based on its base year. Hence, there may be a loss of information, and comparison using single-year data is inappropriate.
3. The data have some discrepancies with, or deviate somewhat from, the AfDB [10] data [the AfDB [10] conclusion being that the IMF GDP report is not nearly as bad as has recently been suggested]. This problem arises mainly in the data collection, processing and distribution phases. It is the duty and responsibility of statistical offices, or of any primary data source organization, to apply appropriate data collection techniques and standardized processing methods, and to distribute the data honestly according to their nature, since data are public property. To manage this problem, distribution-based analysis can reduce the risk in inference and give more relevant results than using a single value, and considering the time series of the data helps to do this. For instance, the time series captures the progress of GDP, which makes comparison more appropriate than using 1 year of GDP.
Missing data management
A missing data value is the absence of a data value either completely at random (if the missing values of a variable do not depend on any value), at random (if the missing values in the response variable do not depend on its own values but do depend on other variables) or not at random (if the missing values follow some structure or model) [8]. The data series of African socioeconomic variables over fixed time intervals have a high number of missing values for some countries compared with others. In the sequence of missing values per country, countries such as Somalia, South Sudan, and Sao Tome and Principe have a high number of missing values compared with Tunisia, Morocco and South Africa. This suggests that the probability of a value being missing depends on the value itself.
In addition, since each socioeconomic variable is indexed by time and the amount of missing data has a strong inverse correlation with time (r = −0.8413), recent years are less likely to have missing values than older ones. Hence, the missing values are not at random and are nonignorable.
Interpolation
Interpolation is the estimation of unknown values using the values of known samples at the neighbourhood points [6].
Interpolation by simple linear regression method
Linear regression estimate imputation is a single imputation method used when the variable with missing values is correlated with an explanatory variable (time) and the series of data values follows a linear trend [7]. However, socioeconomic data are expected to have some trend that may not be exactly linear in time. Hence, applying this method may inflate the correlation and underestimate the standard error of the regression coefficients by underestimating the variance of the imputed variables [8]. With this in mind, the simple linear regression estimate is used for imputation when the missing value is at the beginning (\(t_1\)) and/or at the end (\(t_n\)) of the series. Missing values in the interior of the series, however, were treated by comparing this method, in terms of minimum error, with the exact estimation methods discussed in the “Linear interpolation”, “Linear spline interpolation” and “Lagrange polynomial interpolation” sections.
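As a minimal sketch of the endpoint rule described above, the following Python code fits an ordinary least-squares line to the observed points of a series and uses it to fill a missing first and/or last value. The function names and the five-point series are hypothetical and chosen only for illustration; the paper does not state how its regression estimates were computed.

```python
def fit_line(ts, ys):
    """Ordinary least-squares fit y = a + b*t over the observed points."""
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    b = sxy / sxx
    a = y_bar - b * t_bar
    return a, b


def impute_endpoints(ts, ys):
    """Fill a missing first and/or last value (None) from the fitted trend."""
    observed = [(t, y) for t, y in zip(ts, ys) if y is not None]
    a, b = fit_line([t for t, _ in observed], [y for _, y in observed])
    filled = list(ys)
    for i in (0, len(ys) - 1):
        if filled[i] is None:
            filled[i] = a + b * ts[i]
    return filled


# Hypothetical series that is exactly linear in time, so the result is easy to check.
years = [2000, 2001, 2002, 2003, 2004]
values = [None, 12.0, 14.0, 16.0, None]
filled = impute_endpoints(years, values)
print(filled)
```

Because the fitted line is extrapolated beyond the observed range, this sketch is only sensible when the series is close to linear near its ends, which is the caveat raised in the paragraph above.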
Linear interpolation
Linear spline interpolation
This method works in a similar fashion to linear interpolation, in that a missing value is interpolated using its two nearest neighbours, except that it also works for sequentially missing values. Here, with some adjustment (i.e., using the same upper neighbour), the method is applied to interpolate sequentially missing values. Consider a time series of discrete data points of socioeconomic status given by \(\left\{ \left( t_1, y_{t_1}\right) , \left( t_2, y_{t_2}\right) , \ldots \left( t_n, y_{t_n}\right) \right\}\), in which the sequence of values \(y_{t_2}\), \(y_{t_{3}}\), ..., \(y_{t_{n-1}}\) is missing.
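The idea can be sketched in Python as follows. This is one possible reading, not the paper's exact procedure: every run of interior missing values is filled along the straight line joining the nearest known value on each side, so a single gap reduces to ordinary linear interpolation. The function name is hypothetical, and the six values are the 2000 to 2005 entries of the Y(t)_Actual series with three of them removed artificially.

```python
def interpolate_gaps(ts, ys):
    """Fill interior None values on the line joining the nearest known neighbours."""
    filled = list(ys)
    known = [i for i, y in enumerate(ys) if y is not None]
    for left, right in zip(known, known[1:]):
        slope = (ys[right] - ys[left]) / (ts[right] - ts[left])
        for i in range(left + 1, right):
            filled[i] = ys[left] + slope * (ts[i] - ts[left])
    return filled


years = [2000, 2001, 2002, 2003, 2004, 2005]
# One isolated gap (2001) and one run of two sequential gaps (2003-2004).
values = [41.18, None, 35.50, None, None, 47.21]
filled = interpolate_gaps(years, values)
print([round(v, 2) for v in filled])
```

Leading or trailing gaps are left untouched by this sketch, since they have no neighbour on one side; in the paper those cases are handled by the regression estimate.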
Lagrange polynomial interpolation
Lagrange polynomial interpolation is one type of exact interpolation which uses all given neighbours to estimate missing values.
A theoretical advantage of Lagrange polynomial interpolation is that it considers all the known data values of a variable when estimating a missing value at a point, so the estimate is not governed only by the two nearest neighbouring values. However, because of the complexity of the formula, this method is employed only when a variable has a single missing value, or is left with a single missing value after the other methods have been applied.
Normality assumption: It is commonly assumed that the characteristics of living creatures are normally distributed. In our case this assumption is important for making inferences about a population, because the socioeconomic status of the African population and the variables that determine it are expected to be normally distributed. Moreover, by the central limit theorem, the sampling distribution of a sample statistic approaches the normal distribution as the sample size increases \((n > 30)\), and by the law of large numbers, the sample statistic approaches the population parameter as the sample size increases.
The socioeconomic status data set used for analysis is a multivariate time series data set of 2737 variables from 48 African countries for the years 2000 to 2013.
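For reference, the standard Lagrange estimate at a missing time point \(t\) is \(\hat{y}(t) = \sum _{i} y_{t_i} \prod _{j \ne i} (t - t_j)/(t_i - t_j)\), where the sum and product run over the known points only. A minimal Python sketch of this formula is given below; the function name and the four test points are hypothetical, chosen on the curve \(y = t^2\) so that the estimate can be verified by hand.

```python
def lagrange_estimate(ts, ys, t_missing):
    """Evaluate the Lagrange polynomial through all known points at t_missing."""
    total = 0.0
    for i, (ti, yi) in enumerate(zip(ts, ys)):
        weight = 1.0
        for j, tj in enumerate(ts):
            if j != i:
                weight *= (t_missing - tj) / (ti - tj)
        total += yi * weight
    return total


# Hypothetical check: four known points on y = t**2 with t = 3 missing.
known_t = [1, 2, 4, 5]
known_y = [1.0, 4.0, 16.0, 25.0]
estimate = lagrange_estimate(known_t, known_y, 3)
print(round(estimate, 6))
```

With many known points the polynomial has high degree and can oscillate strongly near the ends of the series, which is consistent with the large Lagrange errors reported for the first and last interior years in the numerical examples.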
1. When some variables are highly correlated with each other, they describe an underlying characteristic that is governed by their correlation, and this characteristic is what is of interest in the group. The characteristic, as a new variable, can be written as a linear combination of the correlated variables that maximizes the variation accounted for out of the total variation in the data set.
2. When some variables are correlated with the same latent (or new) variable in describing the situation of interest (socioeconomic status), the latent or new variable is taken as a linear combination of these variables, chosen so that the linear combination maximizes the variation accounted for.
3. Within the set of variables, some variables may account for a large amount of the variability in the data set. Since these variables express a larger share of the variation, we can take those variables that address the variation that needs to be accounted for.
In other words, we are assessing the variation between the random variables and the variance of each variable. Normally, the variation between random variables is estimated by the distance of each random variable from its mean in units of standard deviation. This distance is a standardized and correlation-free random variable [12]. Statistical distance therefore plays an important role, because a smaller distance between variables implies a higher correlation (as observed in the off-diagonal entries of the correlation or covariance matrix).
Principal component analysis
Principal component analysis describes the correlation or variance–covariance structure between the set of variables through a few uncorrelated latent or new variables, each of which is a linear combination of the original variables that maximizes the variance accounted for. Most often these new variables reveal an interpretation that is not visible in the original variables [13]. The newly created variables are called principal components.
Let the random vector \(\mathbf Y ' = [Y_1, Y_2, \ldots , Y_p]\) have covariance matrix \(\Sigma\) with eigenvalue–eigenvector pairs \((\lambda _1, e_1)\), \((\lambda _2, e_2)\), ..., \((\lambda _p, e_p)\), where \(\lambda _1\ge \lambda _2 \ge \cdots \ge \lambda _p \ge 0\). Then the ith principal component \(Z_i\), for \(i=1, 2, \ \ldots , \ k\) with \(k \le p\), is the linear combination given by:
$$\begin{aligned} Z_i = e_i'\mathbf Y = e_{i1}Y_1 + e_{i2}Y_2 + \cdots + e_{ip}Y_p, \end{aligned}$$
with \(Var(Z_i) = \lambda _i\).
Principal components are arranged in decreasing order of the proportion of variation they explain, so that the first principal component accounts for more variation than any of the others. Therefore, taking the first few components can address most of the variation in the original data (for example, up to 80 or 90% of the population variation). There is as yet no well-established rule for deciding the number of principal components, but it is advisable to consider the size of the eigenvalues and the nature of the components. Most often, principal components with small eigenvalues of roughly equal size are not considered. In general, principal component analysis reduces the number of variables in the data and shows the correlation between the variables [12], and the new variables are used for further analysis such as cluster and regression analysis.
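To make the construction concrete, the sketch below works through the smallest possible case, two variables, where the eigenvalue–eigenvector pairs of the covariance matrix have a closed form. The covariance entries and the function names are hypothetical; the closed-form eigenvector used here assumes a nonzero covariance between the two variables. For the 2737-variable data set a numerical eigen-solver would be needed instead.

```python
import math


def principal_components_2x2(s11, s12, s22):
    """Eigenvalue-eigenvector pairs of [[s11, s12], [s12, s22]], largest first.

    Assumes s12 != 0 so that (s12, lam - s11) is a nonzero eigenvector.
    """
    mean = (s11 + s22) / 2.0
    radius = math.sqrt(((s11 - s22) / 2.0) ** 2 + s12 ** 2)
    pairs = []
    for lam in (mean + radius, mean - radius):
        vx, vy = s12, lam - s11
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


def score(e, y_centered):
    """Principal component score Z = e'Y for one mean-centred observation."""
    return e[0] * y_centered[0] + e[1] * y_centered[1]


# Hypothetical covariance matrix [[5, 2], [2, 2]].
pairs = principal_components_2x2(5.0, 2.0, 2.0)
total_variance = sum(lam for lam, _ in pairs)
proportions = [lam / total_variance for lam, _ in pairs]
z1 = score(pairs[0][1], (1.0, 0.5))
print(pairs, proportions, z1)
```

In this hypothetical case the eigenvalues are 6 and 1, so the first component alone carries 6/7 of the total variance, which illustrates why a few leading components can stand in for the full set of variables.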
Factor analysis
Factor analysis is used to describe the observed correlation (covariance relationship) between the variables in terms of a few new random variables called factors [13]. The method is concerned with grouping highly correlated variables together in such a way that variables in different groups are only slightly correlated. Within a group, the variables address a characteristic that is governed by the underlying correlation, called a factor. Factors are slightly correlated new variables.
1. \(\mathbf F \ \sim \ N(\mathbf 0 , \mathbf I )\).
2. \(\varepsilon \ \sim \ N(\mathbf 0 , \Phi )\), where \(\Phi \ = \ diag( \phi _1, \phi _2, \ldots , \phi _p)\).
3. \(\mathbf F\) and \(\varepsilon\) are independent.
These assumptions lead to the following estimate of the covariance matrix:
$$\begin{aligned} \Sigma = \Lambda \Lambda ' + \Phi , \end{aligned}$$
and \(Cov(\mathbf Y ,\mathbf F )=\mathbf L\) or \(Cov( Y_i, F_j ) = l_{ij}\), where \(\mathbf L = \Lambda\) denotes the matrix of factor loadings.
Comparing the estimated covariance with the original covariance tells us how well the factor model, with the factors considered, fits the covariance matrix of the original variables; a minimal discrepancy indicates a good fit. Moreover, the communality and the uniqueness tell us the variance accounted for by the factors. Specifically, the ith communality is the portion of the variance of \(Y_i\) explained by the k common factors, and the ith uniqueness is the portion of \(Var(Y_i)\) explained by the ith specific factor. Our concern is mainly to find the factor model that explains the covariance structure, without much loss of information, using a small number of common factors.
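The relationship between loadings, communalities, uniquenesses and the implied covariance matrix can be sketched as follows. The loading matrix is hypothetical (three standardized variables on two factors) and is not taken from the paper's results; the function names are likewise illustrative.

```python
def communalities(loadings):
    """h_i^2: sum of squared loadings of variable i over the k common factors."""
    return [sum(l * l for l in row) for row in loadings]


def implied_covariance(loadings, uniquenesses):
    """Return Lambda Lambda' + Phi for a p x k loading matrix and p specific variances."""
    p = len(loadings)
    sigma = [
        [sum(a * b for a, b in zip(loadings[i], loadings[j])) for j in range(p)]
        for i in range(p)
    ]
    for i in range(p):
        sigma[i][i] += uniquenesses[i]
    return sigma


# Hypothetical loadings for p = 3 standardized variables on k = 2 factors.
loadings = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
h2 = communalities(loadings)
# For standardized variables Var(Y_i) = 1, so uniqueness = 1 - communality.
phi = [1.0 - h for h in h2]
sigma = implied_covariance(loadings, phi)
print(h2, phi, sigma)
```

Comparing `sigma` with the sample correlation matrix entry by entry is the kind of discrepancy check described above; small residuals would indicate that two factors reproduce the correlation structure well.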
Cluster analysis
Cluster analysis is a method of grouping objects or variables based on similarity or distance, taking into account the nature of the variables, the scale of measurement and subject matter knowledge, in order to make objects within a group similar and objects in different groups relatively different. Usually objects, units or cases are clustered based on some sort of distance, whereas variables are clustered based on correlation coefficients, with the goal of finding the optimal grouping [14].
1. Hierarchical clustering method: Hierarchical clustering is an unsupervised method of grouping a list of items through successive merging based on similarity or successive division based on dissimilarity. The method falls into two categories, the agglomerative hierarchical method and the divisive hierarchical method. The divisive hierarchical method starts with a single group of all items and proceeds by dividing a group into two subgroups, keeping the most similar items together, until each individual item forms its own cluster. The agglomerative hierarchical method starts with single items and merges the most similar items into groups, and these groups are merged successively based on similarity until the similarity is low; the groups with low similarity are then taken as clusters. The similarity between groups or items can be measured by average linkage, nearest neighbour linkage or farthest neighbour linkage between the points of the groups, or by Ward’s method. The agglomerative hierarchical algorithm is computationally more efficient (running time complexity \(O(n^3 )\)) than the divisive clustering algorithm (running time complexity \(O(2^n)\)) [15, 16]. Hence, the agglomerative hierarchical method is used, specifically with average linkage (average Euclidean distance) and Ward’s similarity measure. As described in the introduction, African socioeconomic data have problems, so working with the characteristics of the distribution can give more relevant information. Average linkage helps to control the impact of a single value, so the result is not fully determined by a possibly misleading nearest point (as in the single linkage method) or farthest point (as in the complete linkage method), e.g., the end points of a chained cluster.
The result of agglomerative hierarchical clustering can be presented by a two-dimensional graph called a dendrogram, or by a scatter plot of the first and second principal factors with \(95\%\) confidence ellipses (which shows the proportion of variance in the data set explained by the first two components in determining the clusters).
2. Ward’s hierarchical clustering and its bootstrap extension: In this approach the focus is on minimizing the information lost due to clustering. Joining dissimilar clusters inflates the error sum of squares (ESS) and leads to a large loss of information. Hence, the merge with the smallest change in ESS results in the minimum loss of information. At the beginning each item is considered a cluster, so the ESS of the ith cluster is zero (ESS\(_i = 0\), for i = 1, 2, ..., K) and the ESS of the data set is \(\sum _{i=1}^{K}ESS_i = 0\). In general, if there are L clusters, \(ESS = ESS_1 + ESS_2 + \cdots + ESS_L\), and finally, if all clusters are in one group, the error sum of squares is given by
$$\begin{aligned} ESS= \sum _{i=1}^{K}(y_i - \overline{y})'(y_i - \overline{y}), \end{aligned}$$
where \(y_i\) is the multivariate measurement associated with the ith item and \(\overline{y}\) is the mean of all items. The resulting multivariate clusters are expected to be roughly elliptical [13]. The question now is with how much confidence a cluster can include the items assigned to it by Ward’s method, or how variable the assigned elements of a cluster are. This can be checked by creating data sets through resampling (either from the empirical distribution of the data or by resampling with replacement from the data) and clustering each data set; if the proportion of times an item is included in the same cluster is greater than or equal to the desired level of confidence, the item is assigned to that cluster at the given confidence level.
3. Nonhierarchical clustering method: Nonhierarchical clustering techniques are designed to group items, rather than variables, into a collection of K clusters, where K is predetermined in our case by the hierarchical clustering techniques [13]. Nonhierarchical clustering starts either from a random partition of the items into K initial clusters or from an initial set of seed points that will form the clusters. This paper uses the popular nonhierarchical clustering method called the K-means method, which starts by randomly partitioning the items into K initial clusters and then goes through the list of items, assigning each item to the cluster whose mean is closest to it.
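The ESS criterion behind Ward’s method, described in item 2 above, can be sketched in a few lines of Python. This is a naive illustration on five hypothetical two-dimensional points, not the implementation used for the 48 countries: at every step it tries all pairs of clusters and merges the pair whose union increases the total ESS the least. The function names are illustrative.

```python
def ess(cluster):
    """Error sum of squares: squared Euclidean distances to the cluster mean."""
    dim = len(cluster[0])
    mean = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
    return sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in cluster)


def best_ward_merge(clusters):
    """Return (increase, i, j) for the pair whose merge raises total ESS least."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            increase = (ess(clusters[i] + clusters[j])
                        - ess(clusters[i]) - ess(clusters[j]))
            if best is None or increase < best[0]:
                best = (increase, i, j)
    return best


def ward(points, n_clusters):
    """Agglomerate singleton clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        _, i, j = best_ward_merge(clusters)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical items: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = ward(points, 3)
print(result)
```

A bootstrap extension along the lines of item 2 would repeat `ward` on resampled data sets and record how often each item lands with the same companions; that step is omitted here to keep the sketch short.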
Numerical examples for missing data management
Example 1
Estimation of an artificially missing value in a known series of data (Y_Actual), when the missing observation is at any point between the first and the last observation of the variable. The result is given in Table 1 and Fig. 2.
Table 1 Estimates of a missing value by Lagrange interpolation, linear interpolation and linear regression estimation: the case of one missing value (error = actual value − estimate)
Year  2000  2001  2002  2003  2004  2005  2006
Y(t)_Actual  41.18  36.69  35.50  38.25  40.05  47.21  48.81
Lagrange interpolation  41.18  85.34  30.58  39.45  42.07  46.31  46.92
Linear interpolation  41.18  38.34  37.47  37.78  42.73  44.43  47.14
Linear regression estimate  41.18  42.75  42.42  41.51  40.99  40.12  39.84
Lagrange error  0.00  −48.65  4.92  −1.20  −2.02  0.90  1.89
Linear interpolation error  0.00  −1.65  −1.96  0.47  −2.67  2.77  1.67
Linear regression error  0.00  −6.06  −6.91  −3.27  −0.94  7.09  8.97
Year  2007  2008  2009  2010  2011  2012  2013
Y(t)_Actual  47.07  47.97  35.37  38.44  38.79  36.89  33.22
Lagrange interpolation  49.51  41.77  44.78  30.86  52.39  31.64  33.22
Linear interpolation  48.39  41.22  43.21  37.08  37.67  36.00  33.22
Linear regression estimate  39.76  39.38  40.37  39.85  39.60  39.87  33.22
Lagrange error  −2.45  6.20  −9.40  7.59  −13.61  5.25  0.00
Linear interpolation error  −1.32  6.75  −7.84  1.36  1.12  0.89  0.00
Linear regression error  7.31  8.59  −5.00  −1.40  −0.81  −2.98  0.00
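The linear interpolation row of Table 1 can be reproduced with a short leave-one-out check, sketched below in Python. Each interior year is removed in turn and re-estimated as the mean of its two neighbours, which is what linear interpolation reduces to for equally spaced years. The variable and function names are illustrative; the series is the Y(t)_Actual row of the table.

```python
actual = [41.18, 36.69, 35.50, 38.25, 40.05, 47.21, 48.81,
          47.07, 47.97, 35.37, 38.44, 38.79, 36.89, 33.22]
years = list(range(2000, 2014))


def leave_one_out_linear(ys):
    """Re-estimate each interior value as the mean of its two neighbours."""
    return [(ys[i - 1] + ys[i + 1]) / 2.0 for i in range(1, len(ys) - 1)]


estimates = leave_one_out_linear(actual)
# Error convention of the table: actual value minus estimate.
errors = [a - e for a, e in zip(actual[1:-1], estimates)]
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

for year, est, err in zip(years[1:-1], estimates, errors):
    print(year, round(est, 2), round(err, 2))
print("mean absolute error:", round(mean_abs_error, 2))
```

The recomputed estimates agree with the tabulated linear interpolation values up to rounding of the published figures, and the mean absolute error gives a single number that can be compared across methods when the same check is run with another estimator.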
Example 2
Estimation of two nonsequential, artificially missing values in a known series of data (Y_Actual), when the missing observations are at any points between the first and the last observation. Table 2 and Fig. 3 show the estimates of the missing values by the linear interpolation, linear spline interpolation and regression estimation methods. The cases considered are those in which the first missing observation is at position i and the second is at position i + 4, for i = 2, 3, ..., n − 5.
Table 2 Estimates of missing values by linear interpolation, linear spline interpolation and linear regression estimation: the case of two nonsequentially missing values (error = actual value − estimate)
Year  2000  2001  2002  2003  2004  2005  2006
Y(t)_Actual  41.18  36.69  35.50  38.25  40.05  47.21  48.81
Linear interpolation  41.18  38.34  37.47  37.78  42.73  44.43  47.14
Linear regression estimate  41.18  41.88  41.60  41.00  40.47  40.39  39.89
Linear spline interpolation  41.18  38.34  37.47  37.78  42.73  44.43  47.14
Linear interpolation error  0.00  −1.65  −1.96  0.47  −2.67  2.77  1.67
Linear regression error  0.00  −5.19  −6.10  −2.75  −0.42  6.81  8.92
Linear spline interpolation error  0.00  −1.65  −1.96  0.47  −2.67  2.77  1.67
Year  2007  2008  2009  2010  2011  2012  2013
Y(t)_Actual  47.07  47.97  35.37  38.44  38.79  36.89  33.22
Linear interpolation  48.39  41.22  43.21  37.08  37.67  36.00  33.22
Linear regression estimate  39.77  39.60  39.96  39.20  38.89  38.74  33.22
Linear spline interpolation  48.39  41.22  43.21  37.08  37.67  36.00  33.22
Linear interpolation error  −1.32  6.75  −7.84  1.36  1.12  0.89  0.00
Linear regression error  7.30  8.37  −4.58  −0.75  −0.10  −1.85  0.00
Linear spline interpolation error  −1.32  6.75  −7.84  1.36  1.12  0.89  0.00
Example 3
Estimation of four sequential, artificially missing values in a known series of data (Y_Actual), when the missing observations are sequential over any interval between the first and the last observation. Table 3 and Fig. 4 show the estimates of the missing values by the linear spline interpolation and linear regression estimation methods. The cases considered are those in which the first missing observation is at position i and the missing observations continue sequentially up to position \(i+3\), for i = 2, 3, ..., n − 4.
Conclusion
1. Figure 3 shows that the error plots for linear interpolation and linear spline interpolation are equally close to the horizontal error-free line, and closer to it than the error plot for linear regression. Hence, the errors due to linear interpolation and linear spline interpolation are smaller than the error due to linear regression. This reveals that, for such time series data with a nonlinear trend, when the missing values are not sequential, linear interpolation and linear spline interpolation give better estimates than linear regression.
2. Figure 4 shows that the error plot for linear spline interpolation is closer to the horizontal error-free line than the error plot for linear regression. This reveals that the linear spline interpolation estimator for two or more sequentially missing values has a smaller error than the linear regression estimator. Therefore, for such time series data with a nonlinear trend, linear spline interpolation gives a better estimate for two or more sequentially missing values than linear regression.
3. Figure 2 shows that the Lagrange interpolation error plot is closer to the horizontal zero-error line than the linear interpolation and regression estimate error plots over the time interval between 5 and 9; this paper therefore uses Lagrange interpolation when the missing value is in the middle of the observations, specifically in the time interval between 5 and 9. Although Lagrange polynomial interpolation gives the minimum-error estimator for a missing value in that interval, the linear interpolation estimator has the minimum error over the rest of the series compared with the Lagrange interpolation and regression estimators. Hence, linear interpolation is applied when a missing value lies in the other intervals (i.e., in the beginning and ending parts) of the series.
Result and discussion
The concern is to formulate and apply a statistical method that can identify the variables contributing most to the total variation and form major components that can significantly and meaningfully level the status of socioeconomic development across African countries. Therefore, once the socioeconomic status measuring variables suggested in the literature have been considered, removing redundancies and variables that have no visible role in the total variance helps to reduce the number of variables needed for measurement without much loss of information. One technique for doing this is to find highly correlated variables (the correlation may be direct or through a latent variable) and replace them with a new variable (component) that is governed by the underlying correlation, formed as the linear combination of those variables that maximizes the variance they account for out of the total variation. Hence, principal component and factor analysis are applied to find the key variables.
Principal component and factor analysis
 (a)
Scree plot of variance: The scree plot in Fig. 5 shows that the bend starts at principal component 5. After this point the plot descends slowly, but there is another slight bend at principal component 8. Hence, two cut-off points can be proposed; however, a rough view of the scree plot suggests 4 principal components.
 (b)
Eigenvalues: The variances (eigenvalues) of the principal components given in Table 4 show that the first 14 principal components have eigenvalues greater than 1, the first 8 have eigenvalues greater than 2 and the first 5 have eigenvalues greater than 3.76. Since an eigenvalue below 1 means that the component accounts for less than one unit of the total variance, the principal components with large eigenvalues (usually > 1) are chosen to explain the variation in the data set. On this basis the first 4, 7 or 13 factors can be taken.
 (c)
Proportion of the total variation: Here the criterion is the proportion of the total variation contributed by the retained factors. From Table 4, the principal components with the largest variances are selected until the desired proportion of total variance is accounted for. Following the suggestions in (a) and (b), the first four components explain only 53.068% of the variation, the first seven explain 67.056% and the first 13 explain up to 83.58%. However, 13 components is still not a small number, the six additional components (8 to 13) account for only \(16.524\%\) of the total variation, and the scree plot does not support this choice.
 (d)
Subject matter consideration: Subject matter consideration is important for obtaining meaningful and interpretable components of socioeconomic status. From this perspective, the factors resulting from an analysis with 4 principal components, particularly factors 3 and 4, are composed of variables from different categories of data, which makes them difficult to interpret and to relate to real socioeconomic data categories. In contrast, the factors obtained from an analysis with 7 principal components are more direct to interpret and easier to relate to the categories of socioeconomic data (education, economic, health, infrastructure and population demographic data). To conclude from the four lines of reasoning above, taking the first 7 principal components has the relative advantage of explaining a larger proportion of the variation (up to 67% of the total variation, and up to 70% after factor analysis; Appendix 1: Table 11) while yielding a small number of factors that are easy and meaningful to interpret.
Result for the estimates of missing values by linear spline interpolation and linear regression estimation: In case of 4 sequentially missing values
Year  2000  2001  2002  2003  2004  2005  2006 

Y(t)_Actual  41.18  36.69  35.50  38.25  40.05  47.21  48.81 
Linear spline interpolation  41.18  42.38  39.11  37.82  40.19  39.12  45.45 
Linear regression Estimate  41.18  47.63  42.67  39.76  38.60  37.75  38.65 
Linear spline interpolation error  0.00  − 5.69  − 3.61  0.43  − 0.14  8.09  3.36 
Linear regression error  0.00  − 10.94  − 7.17  − 1.51  1.45  9.45  10.16 
Year  2007  2008  2009  2010  2011  2012  2013 

Y(t)_Actual  47.07  47.97  35.37  38.44  38.79  36.89  33.22 
Linear spline interpolation  46.81  45.03  45.02  42.07  39.12  36.17  33.22 
Linear regression Estimate  39.35  40.01  42.17  42.30  42.44  42.58  33.22 
Linear spline interpolation error  0.26  2.94  − 9.65  − 3.63  − 0.33  0.72  0.00 
Linear regression error  7.72  7.96  − 6.79  − 3.86  − 3.66  − 5.69  0.00 
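The two error rows in the table above can be condensed into a single number per method. The sketch below recomputes the errors as actual minus estimate from the tabulated values and averages their absolute values over all 14 years. The mean absolute error is used here only as a convenient summary; it is an assumption of this sketch, not a measure reported in the paper.

```python
# Values copied from the table above (years 2000 to 2013).
actual = [41.18, 36.69, 35.50, 38.25, 40.05, 47.21, 48.81,
          47.07, 47.97, 35.37, 38.44, 38.79, 36.89, 33.22]
spline = [41.18, 42.38, 39.11, 37.82, 40.19, 39.12, 45.45,
          46.81, 45.03, 45.02, 42.07, 39.12, 36.17, 33.22]
regression = [41.18, 47.63, 42.67, 39.76, 38.60, 37.75, 38.65,
              39.35, 40.01, 42.17, 42.30, 42.44, 42.58, 33.22]


def mean_absolute_error(truth, estimate):
    """Average of |actual - estimate| over the whole series."""
    return sum(abs(a - e) for a, e in zip(truth, estimate)) / len(truth)


mae_spline = mean_absolute_error(actual, spline)
mae_regression = mean_absolute_error(actual, regression)

print(round(mae_spline, 2), round(mae_regression, 2))
```

On these values the spline estimates have roughly half the mean absolute error of the regression estimates, which agrees with the comparison drawn from Fig. 4.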
Once the number of principal factors is determined, the key variables for each principal factor can be selected based on their loadings and their correlation with the principal factor. A large loading means that the factor contributes strongly to the variable, and a high correlation means that the variable is important in determining the factor. The results show that a variable given a high loading by a principal factor also has a high correlation with it. The communality also supports this.
 1. The key variables in principal factor 1 are related to a sustainable life measure. The correlations between the key variables in factor 1 suggest that:

There is a negative correlation (− 0.646) between infant mortality rate and improved sanitation facilities. Hence, poor sanitation may be a cause of infant mortality.

There is a good direct correlation (0.618) between life expectancy at birth and improved sanitation facilities.

There is a strong direct correlation (0.812) between incidence of tuberculosis and prevalence of HIV.

Cause of death by communicable diseases and maternal, prenatal and nutrition conditions has a strong negative correlation (− 0.862) with cause of death by noncommunicable diseases. This implies that attention has been given to one of them, so attention should be given to communicable diseases too.

There is a direct correlation (0.71) between infant mortality rate and communicable diseases.

There is an indirect correlation (− 0.70) between life expectancy and communicable diseases.

Life expectancy at birth and infant mortality rate have a strong negative correlation (− 0.844). This implies that most of the countries with short life expectancy should decrease infant mortality by improving sanitation.

In general, the results suggest that short life expectancy in Africa stems from poor sanitation and death due to communicable diseases.

 2. Principal factor 2 is related to capital. The correlations between the key variables of principal factor 2 suggest the following results.

Labour force and population are highly correlated (0.932). This reflects that the most populated areas have a large labour force.

There are strong direct correlations among transportation systems. A country with better air transport has better rail lines and container port traffic (0.83 and 0.85, respectively), and a country with better rail lines also has better container port traffic (0.80).

Air transport and rail lines have strong direct correlations with GDP at market price (0.79 and 0.74, respectively). There is also a strong correlation between air transport and gross capital formation (0.78). These correlations suggest that the transportation system has a strong influence on GDP at market price and gross capital formation.

GDP at market price has a strong positive correlation with gross capital formation and foreign direct investment (0.925 and 0.85, respectively). Hence, the GDP at market price of a country can be enhanced by attracting foreign investment and accumulating capital.

In general, foreign investment, capital accumulation and the transportation system have a strong influence on GDP at market price.

 3. Principal factor 3 is a general income related factor. The correlations between the variables of principal factor 3 suggest the following results.

Electric power consumption has a high correlation with GDP per capita, PPP (current international $) (0.753) and with mobile cellular subscriptions (0.761).

GDP per capita, PPP (current international $) and improved sanitation facilities have a strong correlation (0.744).

 4. Principal factor 4 is related to life risk. The correlations between the variables of principal factor 4 suggest the following results.

Prevalence of HIV and incidence of tuberculosis have a moderate negative correlation (− 0.415 and − 0.497, respectively) with life expectancy.

Prevalence of HIV and incidence of tuberculosis have a somewhat visible correlation (0.442 and 0.362, respectively) with manufacturing. This surprising result suggests that manufacturing areas may be associated with a moderate prevalence of HIV and incidence of tuberculosis. Hence, health policies should consider what needs to be done in manufacturing areas to reduce the prevalence of HIV and the incidence of tuberculosis.

 5. Principal factor 5 is mostly related to literacy. The correlations between the variables of principal factor 5 suggest the following results.

Cash surplus/deficit is strongly correlated with adult literacy rate and youth literacy rate (0.704 and 0.725, respectively). Hence, illiteracy reduction plays an important role in achieving a cash surplus.

 6. Principal factor 6 contrasts the rates of water supply and consumption. The correlations between the variables of principal factor 6 suggest the following results.

Annual freshwater withdrawals for agriculture have a strong indirect correlation with annual freshwater withdrawals for domestic use (− 0.922) and for industry (− 0.794).

There is some direct correlation (0.498) between annual freshwater withdrawals for domestic use and annual freshwater withdrawals for industry.

 7. Principal factor 7 reflects the contrast between GDP growth rate and inflation. The correlations between the variables of principal factor 7 suggest the following results.

There is a high correlation between GDP per capita growth and inflation rate (0.954). This suggests that countries with high GDP per capita growth should control inflation. This result agrees with the suggestion of Barro [17].

There is also some direct correlation (0.40) between GDP per capita growth and exports of goods and services. This result agrees with Upreti [4].
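The pairwise coefficients quoted in the list above can be illustrated with a short computation. The sketch below assumes they are Pearson correlation coefficients, which the text does not state explicitly, and uses hypothetical country-level values rather than the paper's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical values for five countries (not taken from the data set).
sanitation = [20.0, 35.0, 50.0, 65.0, 80.0]        # % with improved sanitation
infant_mortality = [90.0, 70.0, 65.0, 40.0, 30.0]  # per 1000 live births

r = pearson_r(sanitation, infant_mortality)
print(round(r, 3))
```

A negative coefficient such as this one describes an indirect relationship, in the same sense as the infant mortality and sanitation result above. A correlation alone does not establish which variable drives the other.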

Summary for variance accounted by principal components
Principal components  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 

Variance  15.03  6.17  4.81  3.73  2.73  2.61  2.54  1.98 
Proportion of variance  0.27  0.11  0.09  0.07  0.05  0.05  0.05  0.04 
Cumulative proportion  0.27  0.38  0.46  0.53  0.58  0.63  0.67  0.71 
Principal components  PC9  PC10  PC11  PC12  PC13  PC14  PC15  PC16 

Variance  1.85  1.70  1.35  1.13  1.12  1.10  0.93  0.77 
Proportion of variance  0.03  0.03  0.02  0.02  0.02  0.02  0.02  0.01 
Cumulative proportion  0.74  0.77  0.79  0.81  0.83  0.85  0.87  0.89 
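The eigenvalue and proportion criteria can be checked against the variances in the table above. The sketch below counts the components with eigenvalue greater than 1 and finds the smallest number of components reaching a target cumulative proportion. The total variance of 56 is an assumption inferred from the reported proportions (for example 15.03 / 0.27); the source does not state it, so the cumulative figures are approximate.

```python
# Variances (eigenvalues) of PC1 to PC16 copied from the table above.
eigenvalues = [15.03, 6.17, 4.81, 3.73, 2.73, 2.61, 2.54, 1.98,
               1.85, 1.70, 1.35, 1.13, 1.12, 1.10, 0.93, 0.77]

# Assumption: total variance inferred from the reported proportions.
ASSUMED_TOTAL_VARIANCE = 56.0


def kaiser_count(values, threshold=1.0):
    """Number of components whose eigenvalue exceeds the threshold."""
    return sum(1 for v in values if v > threshold)


def components_for_target(values, total_variance, target):
    """Smallest k whose first k components reach the target proportion."""
    running = 0.0
    for k, v in enumerate(values, start=1):
        running += v
        if running / total_variance >= target:
            return k
    return None  # target not reached with the listed components


n_kaiser = kaiser_count(eigenvalues)
k_half = components_for_target(eigenvalues, ASSUMED_TOTAL_VARIANCE, 0.50)
k_two_thirds = components_for_target(eigenvalues, ASSUMED_TOTAL_VARIANCE, 0.65)

print(n_kaiser, k_half, k_two_thirds)
```

Under this assumption, 14 components pass the eigenvalue rule, four components are needed to pass 50% of the variation and seven to pass 65%, matching the cumulative proportions in the table.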
Data quality
Before further analysis, it is important to know the quality and nature of the data in order to choose an appropriate method and make inferences. The quality and nature of the data can be studied by checking for outliers and for the distribution type (usually normality), respectively. Since each principal factor is a linear combination of all variables with some loadings, assessing the principal factors reflects an assessment of the variables. Hence, the focus is on the nature of the principal factors.
Q–Q plots are used to check the normality of the principal factors, and a TChart is used to assess outliers in the data set.
The Q–Q plots in Fig. 6 suggest that some of the principal factors are approximately normally distributed (the principal factors in plots 2, 3, 5 and 6), whereas others show some divergence (the principal factors in plots 1, 4 and 7) but are not far from a normal distribution. So, working with them can give relevant inference about the population parameters.
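For readers unfamiliar with the Q–Q check, the sketch below computes the point pairs of a normal Q–Q plot for one set of factor scores. The scores are hypothetical, and the plotting positions (i − 0.5)/n are one common convention; the paper does not say which convention was used for Fig. 6.

```python
from statistics import NormalDist


def qq_points(sample):
    """Pairs (theoretical normal quantile, ordered sample value)."""
    n = len(sample)
    ordered = sorted(sample)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n)
                   for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical principal factor scores for ten countries.
scores = [-1.6, -0.9, -0.5, -0.2, 0.0, 0.1, 0.4, 0.8, 1.1, 1.9]

points = qq_points(scores)
for theoretical_q, observed in points:
    print(round(theoretical_q, 2), observed)
```

If the scores are approximately normal, the pairs fall close to a straight line; systematic curvature or isolated points far from the line indicate the kind of divergence described for plots 1, 4 and 7.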
Summary table for principal factors
Variable code  Loadings  Corr  Com 

Principal component 1 (sustainable life)  
SP.DYN.LE00.IN  1.06  0.8079716  0.9380648 
SH.DTH.NCOM.ZS  1.01  0.823052  0.7976794 
SH.H2O.SAFE.RU.Z  0.78  0.7575063  0.6749383 
SP.POP.1564.TO.ZS  0.73  0.8764492  0.8405396 
IT.NET.USER.P2  0.67  0.7321589  0.6816921 
NV.SRV.TETC.ZS  0.63  0.5728387  0.7281336 
SH.STA.ACSN  0.55  0.77288  0.7755206 
SH.H2O.SAFE.UR.Z  0.54  0.5288412  0.6275978 
SE.PRE.ENRR  0.48  0.5518474  0.394378 
IT.CEL.SETS.P2  0.47  0.7351707  0.8132745 
SE.TER.ENRR  0.42  0.5128509  0.4215655 
SE.SEC.ENRR  0.39  0.552263  0.4368579 
SH.MED.PHYS.ZS  0.38  0.351458  0.4507551 
IT.NET.BBND.P2  0.37  0.51236  0.345433 
SH.DYN.AIDS.ZS  − 0.41  0.0777566  0.8047301 
NV.AGR.TOTL.ZS  − 0.41  − 0.7452145  0.8178983 
SH.TBS.INCD  − 0.53  − 0.0656801  0.7888127 
SP.DYN.CBRT.IN  − 0.74  − 0.8915574  0.8582985 
SH.STA.MMRT  − 0.82  − 0.8261328  0.725329 
SH.DTH.COMM.ZS  − 0.93  − 0.7800013  0.7496144 
SP.DYN.IMRT.IN  − 0.98  − 0.8936826  0.8558577 
SP.DYN.CDRT.IN  − 1.04  − 0.7554875  0.8471443 
Principal component 2 (capital)  
NY.GDP.MKTP.CD  0.97  0.9517201  0.9170857 
IS.RRS.TOTL.KM  0.89  0.8176941  0.7891843 
BX.KLT.DINV.CD.WD  0.87  0.8659704  0.8049668 
NE.GDI.TOTL.CD  0.85  0.8781617  0.8110496 
IS.AIR.DPRT  0.84  0.8702821  0.8434615 
IS.SHP.GOOD.TU  0.74  0.7496073  0.7518197 
SP.POP.TOTL  0.7  0.6570489  0.6306013 
SL.TLF.TOTL.IN  0.66  0.6468067  0.7005713 
EG.USE.ELEC.KH.P  0.45  0.5377256  0.8262583 
SE.TER.ENRR  0.41  0.4621814  0.4215655 
ER.H2O.FWTL.K3  0.34  0.457826  0.4011402 
NE.IMP.GNFS.ZS  − 0.31  − 0.3255368  0.5675932 
Principal component 3 (income related factor)  
NY.GDP.PCAP.PP.C  0.89  0.9446474  0.9213499 
NY.GDP.PCAP.CD  0.85  0.9250355  0.9110524 
NV.IND.TOTL.ZS  0.83  0.7730768  0.6893409 
GC.DOD.TOTL.GD.Z  0.7  0.8077165  0.741978 
NE.EXP.GNFS.ZS  0.59  0.7130032  0.853719 
EG.USE.ELEC.KH.  0.58  0.726317  0.8262583 
IT.CEL.SETS.P2  0.55  0.7422919  0.8132745 
SH.STA.ACSN  0.46  0.6969131  0.7755206 
NE.TRD.GNFS.ZS  0.45  0.5497256  0.7607122 
SH.MED.BEDS.ZS  0.45  0.4365858  0.4086245 
SE.SEC.ENRR  0.35  0.5159998  0.4368579 
SH.H2O.SAFE.UR.Z  − 0.31  − 0.0620451  0.6275978 
NV.AGR.TOTL.ZS  − 0.38  − 0.6669364  0.8178983 
SH.XPD.TOTL.ZS  − 0.46  − 0.4721462  0.5657422 
NV.SRV.TETC.ZS  − 0.5  − 0.0756577  0.7281336 
Principal component 4 (life risk)  
SH.DYN.AIDS.ZS  1.05  0.7965701  0.8047301 
SH.TBS.INCD  1  0.7393787  0.7888127 
NV.IND.MANF.ZS  0.77  0.6267752  0.5451881 
SP.DYN.CDRT.IN  0.52  0.1150051  0.8471443 
IS.SHP.GOOD.TU  0.39  0.4575752  0.7518197 
NE.IMP.GNFS.ZS  0.39  0.5374423  0.5675932 
SE.XPD.TOTL.GD.Z  0.38  0.3719803  0.3151677 
NE.TRD.GNFS.ZS  0.34  0.5519894  0.7607122 
SH.XPD.TOTL.ZS  0.33  0.2408635  0.5657422 
SH.H2O.SAFE.UR.Z  0.31  0.4626387  0.6275978 
NV.AGR.TOTL.ZS  − 0.32  − 0.6058561  0.8178983 
DT.TDS.DECT.EX.Z  − 0.37  − 0.1428698  0.2930229 
SP.DYN.LE00.IN  − 0.57  − 0.1076823  0.9380648 
Principal component 5 (literacy)  
SE.ADT.1524.LT.ZS  0.87  0.8778255  0.8556102 
SE.ADT.LITR.ZS  0.86  0.8747162  0.8454292 
GC.BAL.CASH.GD.Z  0.77  0.7345393  0.61338 
SH.MED.BEDS.ZS  0.37  0.4346196  0.4086245 
SP.MTR.1519.ZS  − 0.38  − 0.3940773  0.263375 
Principal component 6 (Rate of Water supply and consumption contrast)  
ER.H2O.FWDM.ZS  1  0.8787253  0.8500172 
ER.H2O.FWIN.ZS  0.66  0.5969486  0.478074 
NY.GNS.ICTR.ZS  0.39  0.3865365  0.1993376 
DT.TDS.DECT.EX.Z  0.36  0.2825562  0.2930229 
NE.IMP.GNFS.ZS  0.32  0.5611142  0.5675932 
NE.TRD.GNFS.ZS  0.31  0.5644163  0.7607122 
ER.H2O.FWTL.K3  − 0.41  − 0.4826178  0.4011402 
NV.IND.MANF.ZS  − 0.43  − 0.1210153  0.5451881 
ER.H2O.FWAG.ZS  − 0.99  − 0.8804037  0.8455708 
Principal component 7 (GDP growth rate)  
NY.GDP.PCAP.KD.Z  0.88  0.805833  0.7268232 
FP.CPI.TOTL.ZG  0.86  0.7815264  0.7317425 
SH.MED.CMHW.P3  0.46  0.4290214  0.2630565 
NE.EXP.GNFS.ZS  0.4  0.6200583  0.853719 
SH.MED.NUMW.P3  0.4  0.4986513  0.4551595 
SH.MED.PHYS.ZS  0.39  0.4168554  0.4507551 
NV.SRV.TETC.ZS  0.32  0.3575785  0.7281336 
SH.H2O.SAFE.UR.Z  − 0.41  − 0.3178607  0.6275978 
SH.XPD.TOTL.ZS  − 0.46  − 0.4648261  0.5657422 
TChart for countries based on principal components
Country code  BEN  GHA  CMR  TGO  KEN  UGA  SEN 
TChart  0.54  0.78  0.90  0.94  1.17  1.35  1.36 
Country code  ZMB  GNB  GIN  MOZ  RWA  COM  MWI 
TChart  1.62  1.78  1.79  1.86  2.33  2.51  2.63 
Country code  BFA  TCD  SOM  MLI  MRT  ZWE  SDN 
TChart  2.91  2.93  3.13  3.20  3.74  3.93  3.99 
Country code  AGO  BDI  STP  BWA  LBR  NAM  MAR 
TChart  3.99  4.39  4.56  4.84  4.85  4.93  5.07 
Country code  ERI  SLE  DZA  MUS  MDG  CAF  CPV 
TChart  5.22  5.36  5.46  5.63  5.65  5.71  7.39 
Country code  TUN  DJI  GAB  ETH  SYC  LSO  NGA 
TChart  7.68  8.23  8.79  10.29  11.48  11.84  12.21 
Country code  GNQ  LBY  SWZ  NER  SSD  ZAF  
TChart  14.62  16.61  16.83  28.55  30.56  32.87 
Cluster analysis
Cluster analysis is used to group objects or variables without any prior information or hypothesis about the number, elements and structure of the groups.
Cluster analysis is classified into two types: hierarchical cluster analysis and nonhierarchical cluster analysis.
Hierarchical cluster analysis is one of the preferred methods for determining the number of clusters and suggesting the initial elements of each cluster. In this analysis, three agglomerative hierarchical methods are used: average linkage, Ward’s method and its bootstrap extension.
Nonhierarchical clustering techniques, on the other hand, are used to cluster countries or identify the elements of each cluster based on the number of clusters obtained from hierarchical cluster analysis. Here, a nonhierarchical method called K-means clustering is used to determine the elements of the clusters, in addition to the agglomerative hierarchical methods considered.
 1.
Determining appropriate number of clusters.
 2.
Determining elements of the cluster.
Number of clusters
Summary table for within-cluster variability, between-cluster variability and \(F_{ratio}\)
No. of clusters  2.00  3.00  4.00  5.00  6.00  7.00  8.0  9.00  10.00  11.00  12.00  13.00  14.00 

Within-cluster variability  1219.46  1063.91  948.85  811.08  691.45  658.29  565.32  563.71  438.92  374.95  348.20  320.78  330.61 
Between-cluster variability  550.88  661.89  836.55  958.72  1074.51  1112.05  1252.35  1294.21  1351.96  1366.89  1426.89  1447.97  1461.04 
F_ratio  0.45  0.62  0.88  1.18  1.55  1.69  2.22  2.30  3.08  3.65  4.10  4.51  4.42 
 (a)
Average linkage method: Based on the distances at which clusters join in the dendrogram (Fig. 7), the suggestion is 9 clusters, where three countries (South Africa, South Sudan and Niger) each form an individual cluster. The shorter the joining distance, the more similar the clusters are.
 (b)
Clustering based on the scree plot of within-cluster sum of squares and \(F_{ratio}\): One of the assumptions in clustering is that between-cluster variability should be relatively larger than within-cluster variability. Hence, for the given degrees of freedom, comparing the empirical \(F_{ratio}\) with the theoretical \(F_{statistics}\) helps in deciding the minimum statistically significant number of clusters [19], where $$\begin{aligned} F_{ratio} = \dfrac{ \text {Between-cluster variability}}{\text {Within-cluster variability}}. \end{aligned}$$ Here, the decision is made for the number of clusters with \(F_{ratio} > F_{statistics}\). Based on the result in Table 7, proposing 9 clusters is reasonable since \(F_{ratio} = 2.296\ >\ F_{(8,40)}= 2.18\) at \(95\%\) confidence. Roughly, the number of clusters can also be suggested by looking at the bend point on the scree plot of within-cluster sum of squares (the change in within-group sum of squared errors below this point should be negligible) [18–21]. The result in Fig. 8 suggests nine clusters. The two methods agree on nine clusters.
 (c)
Bootstrap resampling of Ward’s method: This method gives a statistically significant number of clusters for the desired level of confidence. If the proportion of times a group of items is assigned together is at least the desired confidence level, the group is considered one cluster at that level of confidence. For example, if a group of items is assigned together with a proportion greater than or equal to 0.95, the group is considered one cluster at the 95% confidence level. Figure 11 reveals that the number of clusters is 6 at the 95% confidence level and that 5 countries have no data support to be clustered. Recall that the TChart outlier detection (Fig. 6 plot 8 and Table 6) already identified these countries, with the exception of Seychelles, as extreme values (outliers) or suspected outliers: South Africa, South Sudan, Niger and Equatorial Guinea. From the results of the above three approaches, I conclude that the appropriate number of clusters is 9, considering that some outlier values form individual clusters: South Africa, South Sudan and Niger each form a cluster.
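The \(F_{ratio}\) criterion in item (b) can be sketched in a few lines. The code below measures between-cluster and within-cluster variability as sums of squared Euclidean distances, which is an assumption of this sketch because the paper does not spell out how the two variabilities are computed. The points and labels are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def f_ratio(points, labels):
    """Return (between, within, between / within) for a labelled partition."""
    grand = centroid(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    within = 0.0
    between = 0.0
    for members in clusters.values():
        center = centroid(members)
        within += sum(squared_distance(p, center) for p in members)
        between += len(members) * squared_distance(center, grand)
    return between, within, between / within


# Hypothetical factor scores of four countries in two clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]

between, within, ratio = f_ratio(points, labels)
print(between, within, ratio)
```

The resulting ratio would then be compared with the theoretical F value for the chosen confidence level, as done with \(F_{(8,40)} = 2.18\) in the text.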
Determining elements of the cluster
Summary for cluster element of each clustering Method
Clusters  Countries assigned for each cluster  

Average linkage  K-means  Ward’s  Bootstrap Ward’s (95% confidence level)  
Cluster 1  NAM, BWA, SWZ, LSO  BWA, NAM, SWZ  NAM, BWA, SWZ, LSO, MUS, DJI  NAM, BWA, SWZ, LSO, MUS, DJI 
Cluster 2  DZA, MAR, MUS, TUN, CPV  MUS, TUN, CPV, STP, DJI, LSO, LBR  DZA, MAR, TUN, CPV  DZA, MAR, TUN, CPV 
Cluster 3  AGO, MRT, SDN, GHA, ZMB, CMR, STP, DJI, KEN, SEN, BEN, ZWE, TCD, ERI, MLI, MDG, UGA, COM, BFA, GNB, SLE, TGO, GIN, RWA, CAF, MOZ, LBR, BDI, MWI, SOM  MRT, CMR, BEN, TCD, MLI, MDG, GIN, MOZ, SOM  TCD, MLI, SOM, MRT, GIN, MOZ, SDN, MDG, GHA, CMR, KEN, SEN, BEN, AGO, ZMB, ZWE  TCD, MLI, SOM, MRT, GIN, MOZ, SDN, MDG, GHA, CMR, KEN, SEN, BEN, AGO, ZMB, ZWE 
Cluster 4  NGA, ETH  NGA, SDN, ETH  NGA, ETH  NGA, ETH 
Cluster 5  LBY, GAB  GHA, ERI, UGA, COM, BFA, GNB, SLE, TGO, RWA, CAF, BDI, MWI  UGA, MWI, ERI, COM, BFA, GNB, SLE, TGO, RWA, CAF, BDI, STP, LBR  UGA, MWI, ERI, COM, BFA, GNB, SLE, TGO, RWA, CAF, BDI, STP, LBR 
Cluster 6  GNQ, SYC  GNQ, LBY, SYC, GAB, DZA, ZAF, MAR  GNQ, LBY, SYC, GAB  LBY, GAB 
Cluster 7  SSD  SSD  SSD  
Cluster 8  NER  NER  NER  
Cluster 9  ZAF  AGO, ZMB, KEN, SEN, ZWE  ZAF 
 (a)
Average linkage method: In this method the nearest clusters are joined based on the average distance between them, where the distance is the Euclidean distance between all pairs of items from the two clusters. Based on the results in Table 8, this method deviates from the others in that most of the countries are grouped in cluster 3, whereas the K-means method splits these countries into three clusters (clusters 3, 5 and 9) and Ward’s method splits them into two clusters (clusters 3 and 5).
 (b)
K-means method: This is a nonhierarchical cluster analysis whose purpose is to assign elements to a predetermined number of clusters, such that each item is assigned to the cluster with the nearest mean on the first two principal factors (in this case these two components explain \(42.36\%\) of the total variation), with distance measured by Euclidean distance. Based on the results in Fig. 9 and Table 8, clustering by this method almost agrees with Ward’s method, with some exceptions: this method merges South Africa into a larger cluster whereas Ward’s method splits it out as its own cluster, and the elements of clusters 3 and 9 of this method are merged into cluster 3 by Ward’s method.
 (c)
Ward’s method and Bootstrap resampling of Ward’s method: The objective of Ward’s method is to minimize the information lost in clustering by joining the clusters whose merger gives the minimum error sum of squares. As described in (b), Ward’s method almost agrees with the K-means method, with some exceptions. This is an indication of the stability of the clusters. Additionally, the bootstrap result in Table 8 confirms the existence of the first 6 clusters and the stability of their elements at the \(95\%\) confidence level. Bootstrap suggests that a cluster with a large p-value is highly supported by the data. Hence, the number of clusters and the selection of elements should be based on the desired p-value. Based on the result in Fig. 11, there is not enough evidence to reject the nonexistence of the clusters formed by South Africa, South Sudan and Niger at the 95% confidence level. Equatorial Guinea and Seychelles are not included in cluster 6 at this level of confidence (their p-value is 77%), which means that both countries are included in cluster 6 in 77% of the bootstrap replicates. So including them in cluster 6 will not have a big influence on the similarity of cluster elements. The dendrograms for Ward’s method and for bootstrap resampling of Ward’s method are given in Figs. 10 and 11, respectively.
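The two hierarchical merging criteria described in (a) and (c) can be compared on a toy example. The sketch below computes the average linkage distance (mean pairwise Euclidean distance) and the Ward merge cost (increase in the error sum of squares caused by the merger) for two hypothetical clusters. The function names and the data are illustrative and are not taken from the paper.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def average_linkage_distance(cluster_a, cluster_b):
    """Mean Euclidean distance over all pairs with one item from each cluster."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def ward_merge_cost(cluster_a, cluster_b):
    """Increase in the error sum of squares if the two clusters are merged."""
    center_a = centroid(cluster_a)
    center_b = centroid(cluster_b)
    gap = sum((x - y) ** 2 for x, y in zip(center_a, center_b))
    size_a, size_b = len(cluster_a), len(cluster_b)
    return size_a * size_b / (size_a + size_b) * gap


# Hypothetical factor scores for two clusters of two countries each.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(4.0, 0.0), (4.0, 2.0)]

avg_distance = average_linkage_distance(cluster_a, cluster_b)
ward_cost = ward_merge_cost(cluster_a, cluster_b)
print(round(avg_distance, 3), ward_cost)
```

At each step an agglomerative procedure would merge the pair of clusters with the smallest value of the chosen criterion. Because the two criteria rank candidate mergers differently, they can produce different partitions, as Table 8 illustrates.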
Conclusion for cluster analysis
Based on the results of the four methods discussed above, Ward’s method finds the most stable and statistically significant clusters, with the exception of clusters 7, 8 and 9. This result strengthens the suggestion given in the “Data quality” section that the three remaining clusters, formed by South Africa, Niger and South Sudan, need further investigation at the source of the data. Meanwhile, since the data source is not easily accessible for further assessment and AfDB [10] accepted the IMF report, the better decision is to keep the three clusters in order to measure their current status. However, further inferences for the population are made based on the statistically significant clusters together with South Africa, because some relatively extreme data values are expected for South Africa given the real situation.
Inference for population
The summary results in Table 9, Appendix 1: Tables and Figs. 12, 13 for the relation between clusters and principal factors suggest that the countries in clusters 2, 9, 1 and 6 have a better sustainable life (variables of PC1, Appendix 1) than the countries in other clusters. Specifically, Tunisia, Mauritius, Seychelles, Cape Verde, Morocco, Algeria and South Africa have a relatively better sustainable life than other African countries, which implies that these countries used relatively more suitable policies on the variables of PC1. In terms of capital (variables of PC2, Appendix 1), the countries in clusters 9, 2 and 4 have a better status. Specifically, South Africa, Nigeria, Algeria and Morocco have relatively better capital than other African countries, which implies that these countries used relatively more suitable policies on the variables of PC2.
Summary for level of clusters based on principal factors
Ward’s cluster  PC1  PC2  PC3  PC4  PC6  PC7  PC5 

Cluster 1  Medium (− 0.045, 2.11)  Low (− 0.75, − 0.39)  Low (− 0.75, 1.05)  High (0.59, 3.07)  High (− 0.56, 2.49)  Low (− 0.54, 0.49)  Low (− 0.40, 1.04) 
Cluster 2  High (1.35, 2.46)  Medium (− 0.16, 1.44)  Medium (− 0.11, 1.10)  Medium (− 0.54, 0.48)  Low (− 0.71, − 0.05)  Low (− 0.37, 1.12)  Low (− 0.32, 0.57) 
Cluster 3  Low (− 1.63, − 0.53)  Low (− 0.69, 0.84)  Low (− 0.96, 0.49)  Medium (− 1.22, 1.24)  Low (− 1.73, − 0.077)  Low (− 0.51, 1.56)  Low (0.65, − 1.13) 
Cluster 4  Low (− 1.28, − 0.79)  Medium (0.81, 2.71)  Low (− 0.58, 0.12)  Low (− 1.36, − 1.21)  Low (− 0.93, − 0.72)  Low (− 0.89, 0.41)  High (0.59, 2.71) 
Cluster 5  Low (− 1.62, 0.71)  Low (− 0.86, − 0.12)  Low (− 0.41, − 0.41)  Medium (− 0.93, 0.32)  Low (− 0.69, 1.96)  Low (− 1.54, 0.45)  Medium (− 0.62, 1.21) 
Cluster 6  Medium (− 0.13, 2.06)  Low (− 0.53, 0.44)  High (2.23, 3.45)  Medium (− 0.79, 0.61)  Medium (− 1.14, 2.29)  Low (− 0.81, 1.26)  Medium (− 0.13, 0.83) 
Cluster 7  Low (− 0.54)  Low (0.47)  Low (0.16)  Medium (− 0.37)  Low (0.35)  High (5.13)  Low (0.03) 
Cluster 8  Low (− 0.88)  Low (− 0.34)  Low (− 0.79)  Medium (− 0.95)  Low (− 0.20)  Low (− 0.93)  Low (− 5.03) 
Cluster 9  Medium (1.34)  High (5.12)  Medium (0.49)  Medium (2.22)  Low (0.29)  Low (0.17)  Low (− 0.43) 
Future work
Previous studies on socioeconomic status measurement construct the measuring components or variables from a theoretical view of the socioeconomic standing of an individual or community [1, 2]. However, socioeconomic status measurement is still an ongoing problem. To contribute to solving it, this study uses a statistical approach that constructs the measuring components by investigating the natural correlations among the suggested variables; these components can cluster countries by socioeconomic status and rank the status component by component. A limitation of this study is that a comprehensive single measure of status is not constructed; rather, the ranking is component-wise, and specific suggestions are given for the variables of each component based on the standing of the cluster countries.
Conclusion
The results of principal component analysis, factor analysis and cluster analysis reveal that 70% of the variation is accounted for by 7 principal factors (Appendix 1: Table 11). Using this variation, countries are grouped into 6 statistically significant (at the 95% confidence level) and stable clusters, with three additional outlier clusters (Table 9). Facts observed from the final output in Fig. 13 (where the black cross shows outlier values of principal factors) and in Appendix 1: Tables suggest that Tunisia, Mauritius and Seychelles have a relatively better sustainable life (specifically on the PC1 variables listed in Appendix 1: Tables), whereas South Africa and Nigeria (with its recent capital boom) account for huge capital in Africa (specifically for the PC2 variables listed in Appendix 1: Tables). In addition, the results indicate that Libya, Equatorial Guinea, Seychelles and Gabon have a better income source (mainly income related factors, or in general the PC3 variables listed in Appendix 1: Tables); however, except for Seychelles, the main source of income of those countries is oil. Furthermore, Ethiopia, Sudan and Nigeria have low life risk or good health policy (specifically on the PC4 variables listed in Appendix 1: Tables). There is also evidence of better performance by Ethiopia, Burundi, Burkina Faso and Botswana on illiteracy reduction (specifically the PC5 variables listed in Appendix 1: Tables). However, water supply for domestic consumption is not in good status across the continent, even though Djibouti, Seychelles, Lesotho and Liberia give greater focus to domestic water consumption than to agricultural use (generally for the PC6 variables listed in Appendix 1: Tables). Angola, Equatorial Guinea and Cape Verde have better GDP per capita growth, but with a high inflation rate (specifically on the PC7 variables listed in Appendix 1: Tables).
However, high inflation is an obstacle to the growth rate [17], so it needs a solution.
The general suggestion is that, for any one country, combining Tunisia’s sustainable life policies (variables of PC1, Appendix 1: Tables), South Africa’s and Nigeria’s strategies for building economic capital, Seychelles’s income source policy (an oil-independent economy), Ethiopia’s health and illiteracy reduction policies, Djibouti’s water supply policy for domestic consumption and Angola’s economic growth strategy, with some intervention policies to control inflation, can help to achieve a better socioeconomic status. Specifically, manufacturing areas are comparatively exposed to HIV and tuberculosis, so control mechanisms should be applied to reduce the prevalence rate. On the other hand, poor sanitation and communicable diseases are correlated with life expectancy and infant mortality rate. Hence, improving sanitation and controlling communicable diseases can improve life expectancy and reduce infant mortality. It is also observed that the economic status of a country, mainly GDP at market price (current US$), is affected by foreign direct investment net inflows (BoP, current US$), gross capital formation (current US$) and the transportation system. Hence, adopting economic policy that attracts foreign direct investment and developing a good saving culture with a better transportation system can help to enhance the GDP of a country (this result agrees with the suggestions of Barro [17] and Upreti [4]). In addition, building a system that generates and uses more electric power and produces large quantities of exports of goods and services can help to enhance GDP per capita, PPP (current international $).
Declarations
Acknowledgements
The author extends his heartfelt gratitude to two anonymous reviewers for their careful reading of the manuscript and their helpful comments that improved the presentation of this work. The author is also grateful to Prof. Dr. Axel Schumann for his valuable comments. The author also thanks the International Monetary Fund for the free data source and AIMS-Cameroon for resources in preparing the paper.
Competing interests
The author declares that he has no competing interests.
Availability of supporting data
All supporting data files are available.
Consent for publication
The author consents to the publication of this research.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Cowan CD, Hauser RM, Kominski RA, Levin HM, Lucas SR, Morgan SL, Spencer MB, Chapman C. Improving the measurement of socioeconomic status for the national assessment of educational progress: a theoretical foundation. 2012.
2. Oakes JM, Rossi PH. The measurement of SES in health research: current practice and steps toward a new approach. Soc Sci Med. 2003;56(4):769–84.
3. Haller AO. The social grading of occupations: a new approach and scale. In: John H, editors. Goldthorpe Keith Hope; 1976.
4. Upreti P. Factors affecting economic growth in developing countries. Major themes in economics. Berlin: Spring; 2015.
5. Cochran WG. Sampling techniques. New York: Wiley; 2007.
6. Vaseghi SV. Advanced digital signal processing and noise reduction. New York: Wiley; 2008.
7. Lokupitiya RS, Lokupitiya E, Paustian K. Comparison of missing value imputation methods for crop yield data. Environmetrics. 2006;17(4):339–49.
8. Howell DC. The treatment of missing data. In: The Sage handbook of social science methodology; 2007. p. 208–224.
9. Jerven M. Why we need to invest in African development statistics: from a diagnosis of Africa’s statistical tragedy towards a statistical renaissance. African Arguments. 2013.
10. African Development Bank. Situational analysis of economic statistics in Africa: special focus on GDP measurement. Abidjan: African Development Bank; 2013. http://www.afdb.org/fileadmin/uploads/afdb/Documents/Publications/Economic%20Brief%20%20Situational%20Analysis%20of%20the%20Reliability%20of%20Economic%20Statistics%20in%20Africa%20Special%20Focus%20on%20GDP%20Measurement.pdf. Accessed 7 Dec 2017.
11. World Bank. World Economic and Financial Surveys, Regional Economic Outlook, Sub-Saharan Africa. 2008.
12. Miguez F. Introduction to R for multivariate data analysis. 2007.
13. Johnson RA, Wichern DW. Applied multivariate statistical analysis. 5th ed. Englewood Cliffs, New Jersey: Prentice Hall; 2002.
14. Rencher AC. Methods of multivariate analysis, vol. 492. New York: Wiley; 2003.
15. Kumar S, Toshniwal D. Analysis of hourly road accident counts using hierarchical clustering and cophenetic correlation coefficient (CPCC). J Big Data. 2016;3(13):1–11.
16. Kumar S, Toshniwal D. A novel framework to analyze road accident time series data. J Big Data. 2016;3(8):1–11.
17. Barro RJ. Determinants of economic growth: a cross-country empirical study (No. w5698). Nat Bureau Econom Res. 1996.
18. Calinski T, Harabasz J. A dendrite method for cluster analysis. Commun Stat. 1974;3(1):1–27.
19. Steinley D, Brusco MJ. Choosing the number of clusters in K-means clustering. Psychol Methods. 2011;16(3):285–97.
20. Steinley D. Validating clusters with the lower bound for sum of squares error. Psychometrika. 2007;72(1):93–106.
21. Steinley D, Brusco MJ. A new variable weighting and selection procedure for K-means cluster analysis. Multivar Behav Res. 2008;43(1):77–108.