Missing data management and statistical measurement of socio-economic status: application of big data

Abstract

Measuring socio-economic status is an ongoing problem, and researchers have proposed a variety of measures. This work investigates a socio-economic status measurement derived from the natural correlations among variables, one that can cluster African countries meaningfully by level of status. The study used yearly socio-economic time series data for 48 African countries from 1993 to 2013, taken from the IMF 2013 data set, for data management (i.e., 2737 variables over 21 years); the analysis, however, is based on the most recent 14 years of data. In data management, missing values are imputed using regression estimates, Lagrange interpolation, linear interpolation and linear spline interpolation, choosing at each time level the method that best fits the trend of the data with minimum error. From principal component and factor analysis of the average time series data, 7 principal factors contributed by 84 variables, which explain \(70\%\) of the variation in the data set, are suggested as components for measuring socio-economic status. Using these components, the clustering methods considered (K-means, average linkage, Ward’s method and bootstrap Ward’s method) agree on six clusters of countries that are statistically significant at the \(95\%\) level, whereas three countries, suggested as outliers, each form an individual cluster.

Introduction

Measuring socio-economic status is an ongoing problem, and different studies have measured it as a single variable, as several separate variables, or as a composite of several variables. Socio-economic status is defined as one’s access to financial, social, cultural and human capital resources, and it is recommended that family income (together with other indicators of home possessions and resources), parental educational attainment and parental occupational status (the “big 3”) form the components of a core socio-economic status measure [1]. It has also been defined as one’s access to collectively desired resources such as (1) material capital (income, wealth, trust funds, etc.), (2) human capital (skills, abilities, credentials, etc.) and (3) social capital (instrumental relationships such as being friends with lawyers and doctors) [2]. The Duncan Socio-economic Index has been used in the US as a measure; it is a subjective assessment of occupational prestige based on educational attainment and income. In 1974 Peter Rossi et al. also developed a household prestige score as a measure. Currently in the UK, the National Statistics Socio-economic Classification (NS-SEC) is used to measure socio-economic status based on one’s job and employment relations.

From a sociological point of view, the status of a society can be levelled using the status dimensions of power, wealth, prestige and information. American sociologists have used occupational prestige to level status, and they have examined the similarities and differences between levels at different times and places. Some sociologists have defined occupational prestige as the availability of resources (composed of both wealth and power) to each person, whereas others relate it to prestige (composed of power, wealth and prestige). Goldthorpe and Hope define it as social standing, which includes the variables standard of living, power and influence, level of qualification and value to society [3].

Barro [17] studied 100 countries from 1960 to 1990. He found that the growth rate of real per capita GDP is enhanced by higher initial schooling and life expectancy, lower fertility, lower government consumption, better maintenance of the rule of law, lower inflation and improvements in the terms of trade. In 2015 a study of the factors affecting economic growth in developing countries used cross-country data for 76 countries from 1995, 2000, 2005 and 2010. The variables used to assess the factors behind GDP per capita growth are volume of exports, government debt (% of GDP), natural resource yield (% of GDP), net foreign aid received (USD), life expectancy (years), investment rate (% of GDP) and FDI inflow (% of GDP). The results show that a high volume of exports, plentiful natural resources, longer life expectancy and higher investment rates have positive impacts on the growth of per capita gross domestic product in developing countries [4].

In constructing a measurement for socio-economic status, reliability is most important. Reliability rests on the socio-economic and statistical significance of the classifications, made using an appropriate method on representative data. However, the representativeness of some African data has been criticised, and the sources of the problem are diverse. Historically, in the 1980s and 1990s statistical offices did not receive appropriate attention as a source of data, and even today the data are distorted by shifts in the data demanded by donors. In addition, projects in Africa have focused on achieving target development rather than on answering important development questions. Another problem is the lack of data: an African Development Bank survey noted that nearly one-fifth of the respondent countries had not conducted an industry survey since 2000. Beyond these problems, African data have suffered from sampling error (inappropriate sample size and sampling technique) and non-sampling error (respondent error, non-response, recording error, etc.) [5]. Missing data become a problem when there is non-response or when data for a variable are not collected, particularly when the values are missing not at random. The Lagrange interpolation method can be used to interpolate missing values when a value depends on its neighbouring data. Vaseghi [6] formulates the general form of polynomial interpolators and statistical interpolators applicable to missing data imputation. He considers the special forms of the Lagrange, Newton, Hermite and cubic spline interpolators among polynomial interpolators. Lokupitiya et al. [7] use NASS data for barley crop yield in 1997, where the ecological variables are spatially correlated, to select a better interpolation method; among the methods considered (regression, kernel smoothing, universal kriging and multiple imputation), they find regression and multiple imputation to be the better interpolators for Y (response) based on the target or control variable X (explanatory).

Howell [8] considered the missing data problem for standard experimental studies and observational studies. In observational studies missing values can be treated by hot deck imputation, mean substitution and pairwise deletion, but those methods lead to bias in parameter estimation. The expectation/maximization (EM) algorithm and multiple imputation (MI), however, are the best techniques; both are based on iterative solutions in which the parameter estimates lead to imputed values, which in turn change the parameter estimates. MI is an interesting approach because it uses randomized techniques for its imputation; one example is regression imputation, which regresses the response variable on the explanatory variable.

The lack of reliable data on African economies limits knowledge of the economic effects of structural adjustment; as a result, the economic growth time series for African economies do not appropriately capture changes in economic development [9]. Currently the better source of African socio-economic data is the IMF [11]; alternatively, if we trust the AfDB, a few base year revisions may be missed [10], although the AfDB does not fully agree with the IMF [11] report. Meanwhile, the AfDB concludes that: “Overall, the situation with regard to GDP is not nearly as bad as has recently been suggested” [9]. In working on this problem, time series data are preferable to single-value data, since the distribution of the data gives more detailed and general information about the characteristics of interest than one value does. Consequently, the risk of being governed by inappropriate data can be reduced.

Previous studies on the measurement of socio-economic status construct measuring components or variables based on the socio-economic standing of an individual or community [1,2,3,4]. However, socio-economic status measurement is still an ongoing problem. This paper aims to construct measuring components by investigating the natural correlations that exist between the suggested variables, components that are able to cluster countries by socio-economic status and to level the status for each component. A single measure of status is not constructed, since the concern is to give specific suggestions based on the standing of the clustered countries on each component. Hence, time series data are used to manage the African socio-economic data problem and to determine the components of socio-economic status measurement that can classify African countries by status through comparison across the region. Correspondingly, missing values were treated by the method that gives the minimum error from the true value at each time interval. Missing values are imputed using the linear regression model, Lagrange interpolation, linear interpolation and linear spline interpolation. Principal component analysis, factor analysis and cluster analysis are used to determine the principal factors of socio-economic status measurement and to cluster African countries based on those factors. The result reveals that \(70\%\) of the variation in the data set is explained by the suggested 7 components (principal factors), which are contributed by 84 variables; using those socio-economic components, 6 clusters of African countries are formed at the \(95\%\) confidence level, where 3 countries are considered outliers.

Methodology

Data and variable

Data The IMF [11] yearly socio-economic time series data set, containing 2737 variables [File Name: 21yearData.csv] from 1993 to 2013 for 48 African countries, was used for data management; the analysis, however, is based on the data from 2000 to 2013. The reason for using 14 years of data instead of 21 is the recently growing demand for data, which has apparently increased the output of statistical offices and so reduced the number of missing values in recent years. Specifically, almost all of the 44 respondent countries have carried out at least one household survey of income or expenditure since 2000 [9].

The IMF [11] data set is preferred over the AfDB [10] data set because of the advantages listed in (1) and (2) below.

  1. The IMF [11] data have better coverage (48 countries) than the AfDB [10] data (44 countries).

  2. Moreover, as indicated in the introduction section, the AfDB does also agree on the [11] report.

Moreover, Morten Jerven [9] also advises the use of [11]: if we use the AfDB data, a few base year revisions may have been missed.

Variables In this study the proposed components are selected based on the results of previous studies, mainly Cowan et al. [1] and Oakes and Rossi [2]. Cowan et al. [1] recommend that the socio-economic status components should include family income, parental educational attainment and parental occupational status. Moreover, an expanded measure of socio-economic status can be constructed by adding home neighbourhood and school socio-economic status. Here family income includes home possessions (internet access, clothes dryer, dishwasher, more than one bathroom, one’s own bedroom), the presence of a household member needing healthcare assistance, and household composition such as the size of the household (total, number of adults). Correspondingly, Oakes and Rossi [2] recommend material capital (income, wealth, trust funds, etc.), human capital (skills, abilities, credentials, etc.) and social capital (instrumental relationships such as being friends with lawyers and doctors). Hence, relative to the IMF data components, the proposed components relate to the education, economy, health, infrastructure and population demographic data categories. However, the current IMF [11] African socio-economic data have three major problems:

  1. Some data values are missing.

  2. Different countries have different base years for their GDP. In response, the IMF updates each country’s GDP based on its base year. Hence, there may be a loss of information, and comparison using single-year data is inappropriate.

  3. The data have some discrepancies, or deviate somewhat, from the AfDB [10] data [the AfDB [10] conclusion being that the IMF GDP report is not nearly as bad as has recently been suggested]. This problem arises mainly in the data collection, processing and distribution phases. It is the duty and responsibility of statistical offices, and of any primary data source organization, to apply appropriate data collection techniques and standardized processing methods, and to distribute data honestly according to their nature, since data are public property. To manage this problem, a distribution-based analysis can reduce the risk in inference and give more relevant results than a single value. Considering the time series of the data helps here: for instance, the time series of GDP captures its progress, which makes comparison more appropriate than using 1 year of GDP.

Missing data management

A data value may be missing completely at random (if the missingness of a variable does not depend on any value), at random (if the missingness of the response variable does not depend on its own value but does depend on other variables) or not at random (if the missing values follow some structure or model) [8]. The fixed-interval data series of African socio-economic variables have a high number of missing values for some countries compared with others. As shown below in the list of missing values per country, countries such as Somalia, South Sudan, and Sao Tome and Principe have a high number of missing values compared with Tunisia, Morocco and South Africa. This suggests that the probability of a value being missing depends on the value itself.

List for the number of missing values per country:

figure a

In addition, since each socio-economic variable is observed over time and the number of missing values has a strong negative correlation with time (r = −0.8413), recent years are less likely to have missing values than earlier ones. Hence, the missing values are not at random and are non-ignorable.

The time series of missing values:

figure b

Interpolation

Interpolation is the estimation of unknown values using the values of known samples at the neighbourhood points [6].

Interpolation by simple linear regression method

Linear regression imputation is a single imputation method, used when the variable with missing values is correlated with the explanatory variable (time) and the series of data values follows a linear trend [7]. However, socio-economic data are expected to have some trend but may not be exactly linear in time. Hence, applying this method may inflate correlation and underestimate the standard errors of the regression coefficients by underestimating the variance of the imputed variables [8]. With this consideration, the simple linear regression estimate is used for imputation when the missing data are at the beginning (\(t_1\)) and/or at the end (\(t_n\)) of the series. Missing values in the interior, however, were treated by comparing this method, for minimum error, with the other exact estimation methods discussed in the “Linear interpolation”, “Linear spline interpolation” and “Lagrange polynomial interpolation” sections.

The simple linear regression model for the given response variable Y (socio-economic variable) and the explanatory variable time (t) is given by:

$$\begin{aligned} Y_{ij}= B_{0i} + B_{1i} t_{j} + \varepsilon _{ij} , \quad \text {for}~i = 1, 2, \, \ldots ,\, 2737, \quad \text{for} ~j = 1,2, \ldots ,~21. \end{aligned}$$
(1)

where \(B_{0i}\) and \(B_{1i}\) are the intercept and slope parameters for the ith variable, respectively, and \(\varepsilon _{ij}\) is the error term, which is normally distributed with mean 0 and variance \(\sigma ^2\), i.e., \(\varepsilon _{ij} \sim N(0 , \sigma ^2)\).
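As a minimal sketch of how the regression estimate of Eq. (1) could fill values missing at the start or end of a single series, the following fits a least-squares line to the observed points and evaluates it at the missing end points. The function names and the toy values are illustrative assumptions, not the authors' code or data.

```python
def fit_line(times, values):
    """Ordinary least squares fit of y = b0 + b1 * t on observed pairs."""
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    b1 = sxy / sxx
    b0 = y_mean - b1 * t_mean
    return b0, b1


def impute_ends(times, series):
    """Fill None values lying before the first or after the last observation."""
    observed = [(t, y) for t, y in zip(times, series) if y is not None]
    b0, b1 = fit_line([t for t, _ in observed], [y for _, y in observed])
    first = min(t for t, _ in observed)
    last = max(t for t, _ in observed)
    filled = list(series)
    for k, (t, y) in enumerate(zip(times, series)):
        if y is None and (t < first or t > last):
            filled[k] = b0 + b1 * t  # fitted value at the missing time point
    return filled


# Made-up series with the first and last year missing (not IMF data).
years = [2000, 2001, 2002, 2003, 2004]
series = [None, 12.0, 14.0, 16.0, None]
filled = impute_ends(years, series)
print(filled)
```

Interior gaps are deliberately left untouched here, since the text assigns those to the interpolation methods.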

Linear interpolation

Linear interpolation is the simplest interpolation technique for missing data imputation, using the two known neighbours. For a time series of discrete data points of a socio-economic variable given by \(\left\{ \left( t_1, y_{t_1}\right) , \left( t_2, y_{t_2}\right) , \ldots \left( t_n, y_{t_n}\right) \right\}\), when the value at \(t_i\) is missing and \(y_{t_{i-1}}\) and \(y_{t_{i+1}}\) are known, linear interpolation estimates \(y_{t_i}\) from its neighbours \(y_{t_{i-1}}\) and \(y_{t_{i+1}}\) using the following formula.

$$\begin{aligned} Y_{t_i}=Y_{t_{i-1}} + \dfrac{t_i - t_{i-1}}{t_{i+1}-t_{i-1}}\left( Y_{t_{i+1}} - Y_{t_{i-1}}\right) . \end{aligned}$$
(2)

This method is employed to interpolate non-sequential missing values. An illustrative example is presented in Fig. 3 and Table 2.
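Eq. (2) can be sketched in a few lines. The example below fills only isolated gaps (a missing value whose two immediate neighbours are known); the helper names and the numbers are assumptions made for illustration.

```python
def linear_interpolate(t_prev, y_prev, t_next, y_next, t_miss):
    """Eq. (2): estimate at t_miss from the two known neighbours."""
    weight = (t_miss - t_prev) / (t_next - t_prev)
    return y_prev + weight * (y_next - y_prev)


def fill_isolated_gaps(times, series):
    """Fill each None whose immediate left and right neighbours are known."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        if (series[i] is None and series[i - 1] is not None
                and series[i + 1] is not None):
            filled[i] = linear_interpolate(
                times[i - 1], series[i - 1],
                times[i + 1], series[i + 1], times[i])
    return filled


# Made-up yearly values with two non-sequential gaps.
years = [2000, 2001, 2002, 2003, 2004]
series = [30.0, None, 34.0, None, 40.0]
filled_linear = fill_isolated_gaps(years, series)
print(filled_linear)
```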

Linear spline interpolation

This method works in a similar fashion to linear interpolation, in that a missing value is interpolated from its two nearest neighbours, except that it also works for sequentially missing values. With an adjustment (i.e., using the same upper neighbour throughout), this method is applied to interpolate sequentially missing values. Consider a time series of discrete data points of socio-economic status given by \(\left\{ \left( t_1, y_{t_1}\right) , \left( t_2, y_{t_2}\right) , \ldots \left( t_n, y_{t_n}\right) \right\}\), in which the sequence of values \(y_{t_2}\), \(y_{t_{3}}\), ..., \(y_{t_{n-1}}\) is missing.

Then for any k, \(2 \le k\le n-1\), \(y_{t_k}\) is interpolated as:

$$\begin{aligned} Y_{t_k}=Y_{t_{k-1}} + \dfrac{t_k - t_{k-1}}{t_{n}-t_{k-1}}\left( Y_{t_{n}} - Y_{t_{k-1}}\right) . \end{aligned}$$
(3)

That is,

$$\begin{aligned} Y_{t_2}&=Y_{t_{1}} + \dfrac{t_2 - t_{1}}{t_{n}-t_{1}}\left( Y_{t_{n}} - Y_{t_{1}}\right) \\ Y_{t_3}&=Y_{t_{2}} + \dfrac{t_3 - t_{2}}{t_{n}-t_{2}}\left( Y_{t_{n}} - Y_{t_{2}}\right) \\&\vdots \\ Y_{t_{n-1}}&=Y_{t_{n-2}} + \dfrac{t_{n-1} - t_{n-2}}{t_{n}-t_{n-2}}\left( Y_{t_{n}} - Y_{t_{n-2}}\right) . \end{aligned}$$

In this paper linear spline interpolation is employed to interpolate sequentially missing values. An illustrative example is presented in Fig. 4 and Table 3.
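The recursion in Eq. (3) can be sketched as follows: each missing value in a run is estimated from the previous value (known or just imputed) as lower neighbour and the first known value after the run as the shared upper neighbour. The function name, the index arguments and the toy series are assumptions for illustration.

```python
def fill_gap_run(times, series, start, end):
    """Fill series[start..end] (inclusive, all None) following Eq. (3).

    Assumes series[start - 1] and series[end + 1] are known values.
    """
    filled = list(series)
    t_up, y_up = times[end + 1], series[end + 1]  # shared upper neighbour
    for k in range(start, end + 1):
        t_lo, y_lo = times[k - 1], filled[k - 1]  # previous (possibly imputed)
        filled[k] = y_lo + (times[k] - t_lo) / (t_up - t_lo) * (y_up - y_lo)
    return filled


# Made-up series with three consecutive missing years.
years = [2000, 2001, 2002, 2003, 2004]
series = [10.0, None, None, None, 18.0]
filled_spline = fill_gap_run(years, series, 1, 3)
print(filled_spline)
```

On this toy series the estimates fall on the straight line joining the two known end points, which is what one would expect from a piecewise-linear fill.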

Lagrange polynomial interpolation

Lagrange polynomial interpolation is one type of exact interpolation which uses all given neighbours to estimate missing values.

For \(Y_{t_i}=f(t_i)\), where \(\left\{ t_1< t_2 < \cdots \right\}\), let the function be given at discrete times for a socio-economic variable by the points \(\left\{ \left( t_1, y_{t_1}\right) , \left( t_2, y_{t_2}\right) , \ldots \left( t_n, y_{t_n}\right) \right\}\). The Lagrange polynomial for the given points (the polynomial of degree at most \(n-1\) through the n points) is used to approximate or estimate the function \(Y_{t_i}=f(t_i)\) at any time point \(t_i\) in the range; this process is called interpolation by Lagrange polynomial. For a missing value \(Y_{t_i}\) in the series of variable values, the Lagrangian estimate was calculated by the following equation [6].

$$\begin{aligned} Y_{t_i}=\sum _{k=1}^{n}\prod _{{\begin{matrix} j=1\\ j\ne k \end{matrix}}}^{n}\left( \dfrac{t_i - t_{j}}{t_{k}-t_{j}}\right) Y_{t_k}. \end{aligned}$$
(4)

In this paper Lagrange interpolation is applied when there is only one missing value among the values of a variable, or when one missing value is left after the other methods have been employed. An illustrative example is presented in Fig. 2 and Table 1.

The theoretical advantage of Lagrange polynomial interpolation is that it considers all known data values of a variable when estimating a missing value at a point, so the estimate is not governed only by the two nearest neighbours. However, because of the complexity of the formula, this method is employed only when there is one missing value among the values of a variable, or when one missing value is left after the other methods have been employed.
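A minimal sketch of Eq. (4) for a single missing value is given below. The sum and product run over the known points only; the function name is hypothetical, and the toy data (times coded as small offsets, values taken from \(y = t^2 + 1\)) are made up so that the estimate can be checked by hand.

```python
def lagrange_estimate(known_t, known_y, t_miss):
    """Eq. (4): Lagrange polynomial through all known points, evaluated at t_miss."""
    total = 0.0
    for k, (t_k, y_k) in enumerate(zip(known_t, known_y)):
        basis = 1.0
        for j, t_j in enumerate(known_t):
            if j != k:
                basis *= (t_miss - t_j) / (t_k - t_j)  # k-th Lagrange basis term
        total += basis * y_k
    return total


# Made-up series: the value at t = 2 is treated as missing.
known_t = [0, 1, 3, 4]
known_y = [1.0, 2.0, 10.0, 17.0]  # generated from y = t^2 + 1
estimate = lagrange_estimate(known_t, known_y, 2)
print(estimate)
```

Because the four known points lie on a quadratic, the interpolating polynomial recovers the hidden value \(2^2 + 1 = 5\) up to rounding.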

Normality assumption The characteristics of living creatures are commonly taken to be normally distributed. In our case this assumption is important for inference about the population, because the socio-economic status of the African population and the variables that determine it are expected to be normally distributed. Moreover, from the central limit theorem, the sampling distribution of a sample statistic approaches the normal distribution as the sample size increases \((n > 30)\), and from the law of large numbers, the sample statistic approaches the population parameter as the sample size increases.

The socio-economic status data set used for analysis is a multivariate time series data set of 2737 variables from 48 African countries for the years 2000 to 2013.

In the analysis of socio-economic status, some variables are expected to contribute comparatively more to socio-economic well-being than others. In addition, some variables may be highly correlated. Therefore, to avoid the complexity of a large number of variables, it is better to consider the smallest possible number of variables that can reflect the needed information. This can be done specifically:

  1. When some variables are highly correlated with each other, they describe an underlying characteristic that is governed by their correlation, so this characteristic is what is of interest in the group. The characteristic, as a new variable, can be written as a linear combination of the correlated variables that maximizes the variation accounted for out of the total variation in the data set.

  2. When some variables are correlated with the same latent (or new) variable in describing the situation of interest (socio-economic status), the latent or new variable is taken as a linear combination of these variables, chosen so that the linear combination maximizes the variation accounted for.

  3. Among the set of variables, some may account for a large amount of the variability in the data set. These variables express a larger share of the variation, so we can take those variables that address the variation that needs to be accounted for.

In general, the above three ideas lead to principal component analysis and exploratory factor analysis.

In other words, we are assessing the variation between the random variables and the variance of each variable. Normally, the variation between random variables is estimated by the distance of each random variable from its mean in units of standard deviation. This distance is a standardized and correlation-free random variable [12]. Statistical distance plays an important role here, because a smaller distance between variables implies higher correlation (observed on the off-diagonal of the correlation or covariance matrix).

For the multivariate normally distributed random variables denoted by the random vector \(\mathbf Y '\) \(= [Y_1, Y_2, \ldots Y_p]\), with p-dimensional normal density given by:

$$\begin{aligned} f(\mathbf y )= \dfrac{1}{(2\pi )^{\frac{p}{2}} \mid \Sigma \mid ^\frac{1}{2}}\exp \left( -\frac{(\mathbf y -\mu )' \Sigma ^{-1} (\mathbf y -\mu )}{2}\right) , \end{aligned}$$

where p is the total number of variables, and \(y_i\), for \(i=1,2, \dots ,\ p~(\text {for}~ p = 2737)\), is an n-component normal random variable with mean \(\mu\) and variance \(\sigma ^2.\)

The squared statistical distance from \(\mathbf Y\) to population mean \(\mu\) for \(p \times 1\) vector \(\mathbf y\) of observations is given by:

$$\begin{aligned} (\mathbf y -\mu )'\Sigma ^{-1} (\mathbf y -\mu ) , \end{aligned}$$

where the \(p \times 1\) vector \(\mu\) represents the expected value of the random vector \(\mathbf Y\) and the \(p\times p\) matrix \(\Sigma\) is the variance–covariance matrix of \(\mathbf Y\).
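The squared statistical distance \((\mathbf y -\mu )'\Sigma ^{-1} (\mathbf y -\mu )\) can be computed without forming \(\Sigma ^{-1}\) explicitly, by solving \(\Sigma \mathbf z = \mathbf y -\mu\) and taking the inner product with \(\mathbf y -\mu\). The sketch below does this with plain Gaussian elimination; the function names and the small covariance matrices are illustrative assumptions, and \(\Sigma\) is assumed to be non-singular.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def squared_statistical_distance(y, mu, sigma):
    """(y - mu)' Sigma^{-1} (y - mu) for a non-singular covariance matrix."""
    diff = [yi - mi for yi, mi in zip(y, mu)]
    z = solve(sigma, diff)
    return sum(d * zi for d, zi in zip(diff, z))


# Made-up two-variable examples.
d2_uncorrelated = squared_statistical_distance(
    [2.0, 1.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
d2_correlated = squared_statistical_distance(
    [1.0, 1.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]])
print(d2_uncorrelated, d2_correlated)
```

In the uncorrelated case the distance reduces to the sum of squared standardized deviations, \(2^2/4 + 1^2/1 = 2\).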

Principal component analysis

Principal component analysis describes the correlation or variance–covariance structure between a set of variables through a few uncorrelated latent or new variables, each of which is a linear combination of the original variables that maximizes the variance accounted for. Most often these new variables reveal an interpretation that is not visible in the original variables [13]. The newly created variables are called principal components.

Let the random vector \(\mathbf Y '\) \(= [Y_1, Y_2, \ldots Y_p]\) have covariance matrix \(\Sigma\) with eigenvalue–eigenvector pairs \((\lambda _1, e_1)\), \((\lambda _2, e_2)\), ..., \((\lambda _p, e_p)\), where \(\lambda _1\ge \lambda _2 \ge \cdots \ge \lambda _p \ge 0\). Then the ith principal component \(Z_i\), for \(i=1, 2, \ \ldots , \ k\) where \(k \le p\), is the linear combination given by:

$$Z_i= \mathbf e _i'{} \mathbf Y =e_{i1}Y_1 + e_{i2}Y_2 + \cdots + e_{ip}Y_p$$

with \(Var(Z_i) = \mathbf e _i' \Sigma \mathbf e _i=\lambda _i\) for \(i=1, 2, \ \ldots , \ k\) and \(Cov(Z_i, Z_j)= \mathbf e _i' \Sigma \mathbf e _j =0\) for \(i \ne j\) which maximizes \(Var(Z_i) = \mathbf e _i' \Sigma \mathbf e _i .\)

Principal components are arranged in decreasing order of the proportion of variation they explain, so that the first principal component accounts for more variation than any of the others. Therefore, the first few components may account for most of the variation in the original data (for example, up to 80 or 90% of the population variation). There is not yet a well-stated rule for deciding the number of principal components, but it is advisable to consider the size of the eigenvalues and the nature of the components. Most often, principal components with comparably small eigenvalues are not considered. In general, principal component analysis helps to reduce the number of variables and shows the correlation between them [12], and the new variables are used for further analysis such as cluster and regression analysis.
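To make the eigenvalue interpretation concrete, the sketch below finds the first principal component of a small covariance matrix by power iteration and reports the share of total variance it explains, \(\lambda _1 / \sum _i \lambda _i\), where the denominator equals the trace of \(\Sigma\). Power iteration is a swapped-in technique chosen to keep the example dependency-free, not the procedure used in the paper; the matrix and function names are assumptions.

```python
import math


def mat_vec(matrix, vec):
    """Matrix-vector product for lists of lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def leading_component(sigma, iterations=200):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix.

    Uses power iteration; assumes the start vector is not orthogonal
    to the leading eigenvector.
    """
    p = len(sigma)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = mat_vec(sigma, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(v * w for v, w in zip(vec, mat_vec(sigma, vec)))
    return eigenvalue, vec


# Made-up 2 x 2 covariance matrix.
sigma = [[4.0, 2.0], [2.0, 3.0]]
lam1, e1 = leading_component(sigma)
total_variance = sum(sigma[i][i] for i in range(len(sigma)))  # trace
share = lam1 / total_variance

# Score of one (hypothetical) observation on the first component: Z_1 = e_1' Y.
y = [1.0, 2.0]
z1 = sum(e * v for e, v in zip(e1, y))
print(lam1, share, z1)
```

For this matrix the eigenvalues are \((7 \pm \sqrt{17})/2\), so the first component carries a little under 80% of the total variance.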

Factor analysis

Factor analysis is used to describe the observed correlation (covariance relation) between the variables in terms of a few new random variables called factors [13]. The method is concerned with grouping highly correlated variables together in such a way that variables in different groups are only slightly correlated. Within a group, the variables address a characteristic that is governed by the underlying correlation, called a factor. Factors are slightly correlated new variables.

Consider the random vector \(\mathbf Y '\) \(= [Y_1, Y_2, \ldots Y_p]\) with mean vector \(\mu\) and covariance matrix \(\Sigma\). The factor model postulates that \(\mathbf Y\) is linearly dependent on a \(k\times 1\) random vector \(\mathbf F\) of common factors and a \(p\times 1\) random vector \(\varepsilon\) of specific factors. The interrelation between the elements of \(\mathbf Y\) is then given by the factor model:

$$\begin{aligned} \mathbf Y =\mu + \Lambda \mathbf F + \varepsilon , \end{aligned}$$

where \(\Lambda\) is a \(p \times k\) matrix of unknown constants called loadings.

Assumptions of the factor model on \(\mathbf F\) and \(\varepsilon\):

  1. \(\mathbf F \ \sim \ N(\mathbf 0 , \mathbf I )\).

  2. \(\varepsilon \ \sim \ N(\mathbf 0 , \Phi )\), where \(\Phi \ = \ diag( \phi _1, \phi _2, \ldots , \phi _p)\).

  3. \(\mathbf F\) and \(\varepsilon\) are independent. This assumption leads to the following structure for the covariance matrix:

    $$\begin{aligned} \Sigma = \Lambda \Lambda ' + \Phi , \end{aligned}$$

    and \(Cov(\mathbf Y ,\mathbf F )=\Lambda\), or \(Cov( Y_i, F_j ) = l_{ij}\), where \(l_{ij}\) is the (i, j)th element of \(\Lambda\),

with \(h_i^2 = \sum _{j=1}^{k}l_{ij}^2\) (communality) and \(\phi _i=Var(Y_i) - h_i^2\) (uniqueness), for \(i=1, 2, 3, \ldots \ p\).

Comparing the estimated covariance with the original covariance tells us how well the factor model, with the factors considered, fits the covariance matrix of the original variables. A small discrepancy indicates a good fit. Moreover, communality and uniqueness tell us the variance accounted for by the factors. Specifically, the ith communality is the portion of the variance of \(Y_i\) explained by the k common factors, and the ith uniqueness is the portion of the variance of \(Y_i\) \((Var (Y_i))\) explained by the ith specific factor. Our main concern is a factor model that explains the covariance structure with a small number of common factors and without much loss of information.
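The quantities above follow directly from a loading matrix. As a small sketch, the code below computes the communalities \(h_i^2\), the uniquenesses \(\phi _i\) and the implied covariance \(\Lambda \Lambda ' + \Phi\) for a hypothetical \(3 \times 2\) loading matrix on standardized variables; the loadings and function names are invented for illustration and are not results from the paper.

```python
def communalities(loadings):
    """h_i^2: sum of squared loadings in row i."""
    return [sum(l * l for l in row) for row in loadings]


def uniquenesses(variances, loadings):
    """phi_i = Var(Y_i) - h_i^2."""
    return [v - h for v, h in zip(variances, communalities(loadings))]


def implied_covariance(loadings, phi):
    """Sigma = Lambda Lambda' + Phi, with Phi diagonal."""
    p = len(loadings)
    return [
        [sum(x * y for x, y in zip(loadings[a], loadings[b]))
         + (phi[a] if a == b else 0.0)
         for b in range(p)]
        for a in range(p)
    ]


# Hypothetical loadings of three standardized variables on two factors.
loadings = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
variances = [1.0, 1.0, 1.0]

h2 = communalities(loadings)
phi = uniquenesses(variances, loadings)
sigma_hat = implied_covariance(loadings, phi)
print(h2, phi, sigma_hat[0][1])
```

Comparing `sigma_hat` with the sample covariance (or correlation) matrix entry by entry gives the discrepancy that the text uses to judge fit.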

Cluster analysis

Cluster analysis is a method of grouping objects or variables based on similarity or distance, taking into account the nature of the variables, the scale of measurement and subject-matter knowledge, so that objects within a group are similar and objects in different groups are relatively different. Usually objects, units or cases are clustered based on some sort of distance, whereas variables are clustered based on correlation coefficients, with the goal of finding an optimal grouping [14].

In this paper a combined method of hierarchical clustering followed by non-hierarchical clustering, including the bootstrap Ward’s method, was used: hierarchical clustering is better at finding the number of groups and the initial cluster members, whereas non-hierarchical clustering gives more accurate members starting from the initial cluster members given by the hierarchical method.

  1. 1.

    Hierarchical clustering method: Hierarchical clustering is an unsupervised method of grouping list of items through successive merging based on similarity or successive division based on dissimilarity. This method fall into two categories, Agglomerative hierarchical method and Divisive hierarchical method. Divisive hierarchical method start with group of items and continues by dividing the group into two subgroups by taking most similar items together in one group till each individual item make its own cluster where as Agglomerative hierarchical method start with a single item and merge most similar items together as a group, and these groups are merged successively based on similarity until the similarity is low. Then, those groups with low similarity are taken as clusters. The choice of similarity between groups or items can be measured based on average linkage or nearest neighbour linkage or the farthest neighbour linkage between the points of the groups or ward’s method. However, Agglomerative hierarchical algorithm is faster due to its computational efficiency (running time complexity \(O(n^3 )\)) than divisive clustering algorithm (running time complexity \(O(2^n)\)) [15, 16]. Hence, Agglomerative hierarchical method specifically average linkage (average euclidean distance) and Ward’s similarity measure are used. As described on the introduction section African socio-economic data have a problem, so working based on the characteristics of the distribution can give relevant information. Hence, Average linkage helps to control the impact of a single value, so the result will not be fully affected by a probably-misleading nearest (due to single linkage method) or farthest point (due to complete linkage method). e.g. End points of Chaining cluster. 
The result of Agglomerative hierarchical clustering can be presented by two dimensional graphs called dendrogram or by the \(95\%\) confidence bounded ellipse scatter plot of the first and the second principal factors (which shows the proportion of variance in the data set explained by the first two components in determining clusters).

  2. 2.

    Ward’s hierarchical clustering and it’s bootstrap extension: In this approach the focus is minimizing the information lost due to clustering. It is clear that joining dissimilar clusters results in inflated error sum of square (ESS) and leads to much information loss. Hence, a merging with smallest change in ESS results in minimum loss of information. At the beginning each item is considered as a cluster and ESS of the i cluster is zero (ESS\(_i\), for i= 1, 2, ..., K) and ESS of the data set is \(\sum _{i=1}^{K}ESS_i = 0\), in general if there are L clusters, \(ESS = ESS_1 + ESS_2 + \cdots+ ESS_L\), and finally if all clusters are in one group, error sum of square is given as;

    $$\begin{aligned} ESS= \sum _{i=1}^{K}(y_i -\overline{y})'(y_i -\overline{y}), \end{aligned}$$

    where \(y_i\) is the multivariate measurement associated with the ith item and \(\overline{y}\) is the mean of all items. The resulting multivariate clusters are expected to be roughly elliptical [13]. The question now is with how much confidence a cluster includes the items assigned to it by Ward’s method, or how variable the assigned elements of a cluster are. This can be checked by creating data sets through re-sampling (either from the empirical distribution of the data or by re-sampling with replacement from the data) and clustering each data set. If the proportion of times an item is included in the same cluster is greater than or equal to the desired level of confidence, then the item is assigned to the cluster at the given confidence level.

  3. 3.

    Non-hierarchical clustering method: Non-hierarchical clustering techniques are designed to group items, rather than variables, into a collection of K clusters, where K is predetermined in our case by hierarchical clustering techniques [13]. Non-hierarchical clustering starts either from a random partition of the items into K initial clusters or from an initial set of points around which clusters will form. This paper uses the popular non-hierarchical clustering method called the K-mean method, which starts by randomly partitioning the items into K initial clusters and then goes through the list of items, assigning each item to the cluster whose mean is closest to it.
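The Ward criterion described in item 2 can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not the software used in this study; the function names and the four two-dimensional items are hypothetical. It computes the ESS of a cluster and picks the pair of clusters whose merge adds the least ESS.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def ess(points):
    """Error sum of squares: squared Euclidean distances to the centroid."""
    c = centroid(points)
    return sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in points)


def ward_merge_cost(a, b):
    """Increase in total ESS caused by merging clusters a and b."""
    return ess(a + b) - ess(a) - ess(b)


def best_ward_merge(clusters):
    """Index pair (i, j) of the two clusters whose merge adds the least ESS."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: ward_merge_cost(clusters[ij[0]], clusters[ij[1]]),
    )


# Hypothetical items, each starting as its own cluster (ESS of each is zero).
clusters = [[(0.0, 0.0)], [(0.0, 1.0)], [(5.0, 5.0)], [(6.0, 5.0)]]
i, j = best_ward_merge(clusters)
print(i, j, ward_merge_cost(clusters[i], clusters[j]))
```

Repeating this step, replacing the chosen pair by its union each time, would give the full agglomerative sequence; ties are broken here by taking the first minimal pair.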

Numerical examples for missing data management

The following examples apply Lagrange interpolation, linear spline interpolation, linear interpolation and linear regression estimation to artificially created missing values taken from the known data values of one variable of the IMF [11] data set, Exports of goods and services (% of GDP) for Algeria. These examples are also used to illustrate and compare the error trends of each method in interpolating or estimating artificially missing values (Fig. 1).

Fig. 1
figure 1

Time series plot for ‘Exports of goods and services (% of GDP)’ for Algeria with mean 40.39 and standard deviation 5.24
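As a rough illustration of how the compared estimators behave, the sketch below implements linear interpolation, Lagrange interpolation and a least-squares regression line with the Python standard library. It is a sketch under stated assumptions, not the procedure used to produce the tables: the function names are hypothetical, and the series is an invented quadratic trend rather than the Algerian export data.

```python
def linear_interpolation(ts, ys, t):
    """Value at t on the straight line joining the nearest known neighbours.

    Assumes ts is sorted and t lies strictly between ts[0] and ts[-1].
    """
    left = max(i for i, ti in enumerate(ts) if ti < t)
    right = min(i for i, ti in enumerate(ts) if ti > t)
    slope = (ys[right] - ys[left]) / (ts[right] - ts[left])
    return ys[left] + slope * (t - ts[left])


def lagrange_interpolation(ts, ys, t):
    """Value at t of the Lagrange polynomial through all known points."""
    total = 0.0
    for i, (ti, yi) in enumerate(zip(ts, ys)):
        weight = 1.0
        for j, tj in enumerate(ts):
            if j != i:
                weight *= (t - tj) / (ti - tj)
        total += yi * weight
    return total


def regression_estimate(ts, ys, t):
    """Value at t of the ordinary least-squares line fitted to the known points."""
    n = len(ts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    sxy = sum((a - t_bar) * (b - y_bar) for a, b in zip(ts, ys))
    sxx = sum((a - t_bar) ** 2 for a in ts)
    return y_bar + (sxy / sxx) * (t - t_bar)


# Hypothetical series y = t^2 with the value at t = 3 (true value 9) removed.
known_t = [1.0, 2.0, 4.0, 5.0]
known_y = [1.0, 4.0, 16.0, 25.0]
for name, method in [
    ("linear", linear_interpolation),
    ("Lagrange", lagrange_interpolation),
    ("regression", regression_estimate),
]:
    estimate = method(known_t, known_y, 3.0)
    print(name, estimate, "error:", estimate - 9.0)
```

On this invented non-linear series the Lagrange estimate recovers the hidden interior value while the regression line has the largest error; the ranking on real data depends on the position of the gap, as the examples below show.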

Example 1

Estimation of an artificially missing value in the known series of data (Y_Actual), when the missing observation is at any point between the first and the last observation of the variable. The result is given in Table 1 and Fig. 2.

Table 1 Result for the estimates of a missing value by Lagrange interpolation, linear interpolation and linear regression estimation: in case of one missing value
Fig. 2
figure 2

The first plot is for estimates of a missing value; the second plot is for the errors in estimating a missing value

Example 2

Estimation of two non-sequential artificially missing values in the known series of data (Y_Actual), when the missing observations are at any point between the first and the last observation. Table 2 and Fig. 3 show the estimates of missing values by linear interpolation, linear spline interpolation and regression estimation. The cases considered are those where the first missing observation is at position i and the second is at position i + 4, for i = 2, 3, ..., n − 5.

Table 2 Result for the estimates of missing values by linear interpolation, linear spline interpolation and linear regression estimation: in case of two non-sequentially missing values
Fig. 3
figure 3

The first plot for estimates of missing values; the second plot for the errors in estimating missing values

Example 3

Estimation of four sequential artificially missing values in the known series of data (Y_Actual), when the missing observations are sequential on any interval between the first and the last observation. Table 3 and Fig. 4 show the estimates of missing values by linear spline interpolation and linear regression estimation. The cases considered are those where the first missing observation is at position i and the missing observations run sequentially up to position \(i+3\), for i = 2, 3, ..., n − 4.

Fig. 4
figure 4

The first plot is for estimates of missing values; the second plot is for the errors in estimating missing values

Conclusion

The above examples of interpolation methods applied to missing value imputation suggest the following results.

  1. 1.

    Figure 3 shows that the error plots for linear interpolation and linear spline interpolation are equally close to the horizontal error-free line, and closer than the error plot for the linear regression line. Hence, the errors due to linear interpolation and linear spline interpolation are smaller than the error due to linear regression. This reveals that, for such time series data with a non-linear trend, when missing values are not sequential, linear interpolation and linear spline interpolation give a better estimate than linear regression.

  2. 2.

    Figure 4 shows that the error plot for linear spline interpolation is closer to the horizontal error-free line than the error plot for linear regression. This reveals that the linear spline interpolation estimator for two or more sequentially missing values has a smaller error than the linear regression estimator. Therefore, for such time series data with a non-linear trend, linear spline interpolation gives a better estimate for two or more sequentially missing values than linear regression.

  3. 3.

    Figure 2 shows that the plot of the Lagrange interpolation error is closer to the horizontal zero-error line than the plots of the linear interpolation error and the regression estimate error on the time interval between 5 and 9, so this paper uses Lagrange interpolation when missing values are in the middle of the observations, specifically on the time interval between 5 and 9. Although Lagrange polynomial interpolation gives the minimum-error estimator for a missing value on that interval, the linear interpolation estimator has the minimum error on the rest of the series compared to the Lagrange interpolation and regression estimators. Hence, linear interpolation is applied when a missing value is on the other intervals (i.e., the beginning and ending parts) of the series.

Result and discussion

The concern is to formulate and apply a statistical method that can capture the variables contributing most to the total variation and form major components which can significantly and meaningfully level the status of socio-economic development across African countries. Therefore, once the socio-economic status measuring variables suggested in the literature are considered, removing redundancies and variables with no visible role in the total variance helps to reduce the number of variables needed for measurement without much loss of information. One technique for doing this is to find highly correlated variables (the correlation may be direct or through a latent variable) and replace them by a new variable (component) that is governed by the underlying correlation through a linear combination of those variables, chosen to maximize the share of the total variation they account for. Hence, principal component and factor analysis are applied in finding key variables.

Principal component and factor analysis

This subsection considers how to determine the number of principal components to be used in constructing factors and in selecting the variables that contribute most to the total variation. The following four considerations help in deciding the number of principal components to use.

  1. (a)

    Scree plot of variance: The scree plot in Fig. 5 shows that the bend point starts at principal factor 5. After this point the plot descends slowly, but at principal component 8 there is another slight bend. Hence, two points can be proposed; however, a rough view of the scree plot suggests 4 principal components.

  2. (b)

    Eigenvalues: The variances or eigenvalues of the principal components given in Table 4 reveal that the first 14 principal components have eigenvalues greater than 1, the first 8 have eigenvalues greater than 2 and the first 5 have eigenvalues greater than 3.76. Since an eigenvalue below 1 means that the component accounts for less than one unit of the total variance, principal components with large eigenvalues (usually eigenvalue > 1) are chosen to explain the variation in the data set. On this basis, the first 4, 7 or 13 factors can be taken.

  3. (c)

    Proportion of the total variation: Here the concern is the proportion of the total variation contributed by the factors. From the result in Table 4, the principal components with the greatest variance are selected until the desired proportion of total variance is accounted for. Based on the suggestions in (a) and (b), if the first four components are taken only 53.068% of the variation is explained; if the first seven components are taken 67.056% of the variation is explained; and taking 13 components explains up to 83.58% of the variation. However, 13 components is still not a small number, the variation accounted for by the components added beyond the first seven is only \(16.524\%\), and the scree plot does not support this choice.

  4. (d)

    Subject matter consideration: Subject matter consideration is important in order to have meaningful and interpretable components for socio-economic status. From this aspect it is observed that, in the factors resulting from the analysis of 4 principal components, factors 3 and 4 in particular are composed of variables from different categories of data, which makes them difficult to interpret and to relate to real socio-economic data categories. On the other hand, the factors obtained from the analysis of 7 principal components are more direct to interpret and easy to relate to categories of socio-economic data (education, economic, health, infrastructure and population demographic data). To conclude from the above four lines of reasoning, taking the first 7 principal components has a relative advantage in explaining a larger proportion of the variation (up to 67% of the total variation, and in factor analysis the factors explain up to 70% of the total variation in the data set; see Appendix 1: Table 11) and in yielding a smaller number of factors that are easy and meaningful to interpret.
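Considerations (b) and (c) are simple to compute once the eigenvalues are available. The sketch below uses only the Python standard library and a hypothetical list of eight eigenvalues (it does not reproduce the values in Table 4); the function names are illustrative. It counts the components with eigenvalue greater than 1 and finds the smallest number of leading components that reaches a target proportion of the total variation.

```python
def kaiser_count(eigenvalues):
    """Number of components whose eigenvalue exceeds 1."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


def cumulative_proportion(eigenvalues):
    """Running share of the total variance explained by the leading components."""
    total = sum(eigenvalues)
    shares, running = [], 0.0
    for ev in eigenvalues:
        running += ev
        shares.append(running / total)
    return shares


def components_for(eigenvalues, target):
    """Smallest number of leading components whose cumulative share reaches target."""
    for k, share in enumerate(cumulative_proportion(eigenvalues), start=1):
        if share >= target:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues in decreasing order (they sum to 8, as they would
# for a correlation matrix of 8 standardized variables).
eigenvalues = [3.2, 1.6, 1.2, 0.8, 0.5, 0.4, 0.2, 0.1]
print(kaiser_count(eigenvalues))
print([round(s, 3) for s in cumulative_proportion(eigenvalues)])
print(components_for(eigenvalues, 0.70))
```

Neither rule settles the choice on its own; in the discussion above they are weighed together with the scree plot and subject matter considerations.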

Fig. 5
figure 5

A scree plot of the principal components for the average of 14 year socio-economic time series data

Table 3 Result for the estimates of missing values by linear spline interpolation and linear regression estimation: In case of 4 sequentially missing values

Once the number of principal factors is determined, key variables for each principal factor can be selected based on loadings and their correlation with the principal factor. A variable with a large loading is highly weighted by the factor, and a high correlation implies that the variable is highly important in determining the factor. From the results it is observed that a variable weighted with a high loading by a principal factor has a high correlation with it. The communality also justifies this implication.

Observing the correlations between the key variables of a principal factor can help to control the principal factor. So, it is important to focus on the highly correlated variables and control them first. The correlations between the key variables of each factor are given in Appendix 1: Table 12. The results of factor analysis using the first 7 principal components (the correlations between principal factors and their key variables, the loadings and the cumulative values of each key variable) are given in Table 5 and reveal that:

  1. 1.

    The key variables in principal factor 1 are related to a sustainable life measure. The correlations between the key variables in factor 1 suggest that:

    • There is a negative correlation (− 0.646) between infant mortality rate and improved sanitation facility. Hence, low sanitation can be a cause of infant mortality.

    • There is a good direct correlation (0.618) between life expectancy at birth and improved sanitation facility.

    • There is a strong direct correlation (0.812) between incidence of tuberculosis and prevalence of HIV.

    • Cause of death by communicable diseases and maternal, prenatal and nutrition conditions has a strong negative correlation (− 0.862) with cause of death by non-communicable diseases. This implies that attention has been given to one of them, so attention should be given to communicable diseases too.

    • There is a direct correlation (0.71) between infant mortality rate and communicable diseases.

    • There is a negative correlation (− 0.70) between life expectancy and communicable diseases.

    • Life expectancy at birth and infant mortality rate have a strong negative correlation (− 0.844). This implies that most of the countries with short life expectancy should decrease infant mortality by improving sanitation.

    • In general, the results point to low sanitation and death due to communicable diseases as the sources of short life expectancy in Africa.

  2. 2.

    Principal factor 2 is related to capital. The correlations between the key variables of principal factor 2 suggest the following results.

    • Labour force and population are highly correlated (0.932). This reflects that the most populated areas have a large labour force.

    • There are strong direct correlations between transportation systems. A country with better air transport has better rail lines and container port traffic (0.83 and 0.85 respectively), and a country with better rail lines also has better container port traffic (0.80).

    • Air transport and rail lines have strong direct correlations with GDP at market price (0.79 and 0.74 respectively). There is also a strong correlation between air transport and gross capital formation (0.78). These correlations suggest that the transportation system has a strong influence on GDP at market price and gross capital formation.

    • GDP at market price has strong positive correlations with gross capital formation and foreign direct investment (0.925 and 0.85 respectively). Hence, the GDP at market price of a country can be enhanced by attracting foreign investment and accumulating capital.

    • In general, foreign investment, capital accumulation and the transportation system have a strong influence on GDP at market price.

  3. 3.

    Principal factor 3 is general income related factor. The correlation between variables of principal factor 3 suggests the following results.

    • Electric power consumption has a high correlation with GDP per capita, PPP (current international $) (0.753) and mobile cellular subscriptions (0.761).

    • GDP per capita, PPP (current international $) and Improved sanitation facilities have strong correlation (0.744).

  4. 4.

    Principal factor 4 is related to life risk. The correlation between variables of principal factor 4 suggests the following results.

    • Prevalence of HIV and incidence of tuberculosis have some negative correlation (− 0.415, − 0.497, respectively) with life expectancy.

    • Prevalence of HIV and incidence of tuberculosis have a somewhat visible correlation (0.442 and 0.362, respectively) with manufacturing. This surprising result suggests that manufacturing areas may be a source of the medium rates of HIV prevalence and tuberculosis incidence. Hence, health policies should consider what has to be done in manufacturing areas to reduce the prevalence of HIV and the incidence of tuberculosis.

  5. 5.

    Principal factor 5 is mostly related to literacy. The correlations between the variables of principal factor 5 suggest the following results.

    • Cash surplus/deficit is strongly correlated with adult literacy rate and youth literacy rate (0.704, 0.725, respectively). Hence, illiteracy reduction plays an important role for cash surplus.

  6. 6.

    Principal factor 6 contrasts rate of water supply and consumption. The correlation between variables of principal factor 6 suggests the following results.

    • Annual freshwater withdrawals in agriculture have a strong negative correlation with annual freshwater withdrawals for domestic use (− 0.922) and annual freshwater withdrawals in industry (− 0.794).

    • There is some direct correlation (0.498) between Annual freshwater withdrawals in Domestic and Annual freshwater withdrawals in industry.

  7. 7.

    Principal factor 7 reflects the contrast between GDP growth rate and inflation. The correlation between variables of principal factor 7 suggests the following results.

    • There is a high correlation between GDP per-capita growth and inflation rate (0.954). This suggests that countries with high GDP per-capita growth should control inflation. This result agrees with the suggestion of Barro [17].

    • There is also some direct correlation (0.40) between GDP per-capita growth and export of goods and services. This result agrees with Upreti [4].

Table 4 Summary for variance accounted by principal components

Data quality

Before doing further analysis, it is important to know the quality and nature of the data in order to find the appropriate method and make inferences. The quality and nature of the data can be studied by checking for outliers and for the distribution type (usually normality), respectively. Since a principal factor is a linear combination of all variables with some loadings, assessing the principal factors reflects an assessment of the variables. Hence, our focus is to know the nature of the principal factors.

Q–Q plots are used to check the normality of the principal factors and a T-chart is used to assess outliers in the data set.

The Q–Q plots in Fig. 6 suggest that some of the principal factors are approximately normally distributed (the principal factors in plots 2, 3, 5 and 6), whereas others show some divergence (the principal factors in plots 1, 4 and 7) but are still not far from a normal distribution. So, working with them can give relevant inference for the population parameters.

From the T-chart result in Table 6 it is observed that Niger (NER), South Africa (ZAF) and South Sudan (SSD) have extraordinary values, and that Equatorial Guinea (GNQ), Libya (LBY) and Swaziland (SWZ) have suspected values, with T-chart values greater than \(\chi_{0.05, 7}^2 =14.067\) at the 95% confidence level. Figure 6 plot 8 shows this result graphically. Therefore, the data from these countries need to be checked, because outliers can occur due to coding errors, respondent errors or large true values.
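The T-chart comparison above can be sketched as follows. This is a simplified illustration using the Python standard library, and it rests on an assumption not stated in the text: the factor scores are treated as uncorrelated, so that the statistic reduces to a sum of squared standardized scores. The function names and the example scores are hypothetical.

```python
from statistics import mean, variance

# Critical value quoted in the text for 7 principal factors at the 5% level.
CHI2_95_DF7 = 14.067


def t_squared(scores):
    """T-squared value per item, assuming uncorrelated factor scores.

    scores: list of rows, one row of factor scores per item (country).
    """
    columns = list(zip(*scores))
    centres = [mean(col) for col in columns]
    spreads = [variance(col) for col in columns]  # sample variance per factor
    return [
        sum((x - m) ** 2 / v for x, m, v in zip(row, centres, spreads))
        for row in scores
    ]


def flag_outliers(scores, threshold=CHI2_95_DF7):
    """Indices of items whose T-squared value exceeds the threshold."""
    return [i for i, t2 in enumerate(t_squared(scores)) if t2 > threshold]


# Hypothetical scores on two factors; 5.991 is the 5% chi-square critical
# value for 2 degrees of freedom, used because this toy example has 2 factors.
scores = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (4.0, 4.0)]
print([round(t2, 3) for t2 in t_squared(scores)])
print(flag_outliers(scores, threshold=5.991))
```

With correlated scores the full inverse covariance matrix would be needed in place of the per-factor variances.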

Table 5 Summary table for principal factors
Table 6 T-Chart for countries based on principal components
Fig. 6
figure 6

Matrix plot: plots 1 to 7 are Q–Q plots of the first 7 principal components, in order, and the last two plots (plots 8 and 9) are T-chart plots for countries based on principal factors

Cluster analysis

Cluster analysis is used for grouping objects or variables without any prior information or hypothesis on the number, elements and structure of the groups.

Cluster analysis is classified into two types, hierarchical cluster analysis and non-hierarchical cluster analysis.

Hierarchical cluster analysis is one of the preferable methods for determining the number of clusters and suggesting the initial elements of each cluster. In this analysis, average linkage, Ward’s method and its bootstrap extension, all of the agglomerative hierarchical type, are used.

Non-hierarchical clustering techniques, on the other hand, are used to cluster countries or identify the elements of each cluster based on the number of clusters obtained from hierarchical cluster analysis. Here a non-hierarchical method called K-mean clustering is used to determine the elements of the clusters, in addition to the agglomerative hierarchical methods considered.

The analysis in this section targets two main problems:

  1. 1.

    Determining appropriate number of clusters.

  2. 2.

    Determining elements of the cluster.

Number of clusters

The concern here is to compare the results obtained from the methods considered for estimating the number of clusters. The results from three different approaches are described and discussed below in (a), (b) and (c): the average linkage method; the scree plot of within-groups sum of squares together with the ratio of between-cluster to within-cluster variability; and the multi-scale bootstrap of Ward’s method.

Fig. 7
figure 7

Average linkage clustering of African countries based on socio-economic status

Table 7 Summary table for within-cluster variability, between-cluster variability and \(F_{ratio}\)
Fig. 8
figure 8

A scree plot of socio-economic principal factors based on within group sum of squares

  1. (a)

    Average linkage method: Based on the distances at which clusters join in the dendrogram of Fig. 7, the suggestion is 9 clusters, where three countries (South Africa, South Sudan and Niger) each form an individual cluster. The shorter the joining distance, the more similar the clusters.

  2. (b)

    Clustering based on scree plot of within-cluster sum of squares and \(F_{ratio}\): One of the assumptions in clustering is that between-cluster variability should be relatively larger than within-cluster variability. Hence, for the given degrees of freedom, comparing the empirical \(F_{ratio}\) with the theoretical \(F_{statistics}\) helps in deciding the minimum statistically significant number of clusters [19]. Here, the decision is made for the number of clusters with \(F_{ratio} > F_{statistics}\). Based on the result in Table 7, proposing 9 clusters is reasonable since \(F_{ratio} = 2.296\ >\ F_{(8,40)}= 2.18\) at \(95\%\) confidence, where

    $$\begin{aligned} F_{ratio} = \dfrac{ \text {Between-cluster variability}}{\text {Within-cluster variability}}. \end{aligned}$$

    Roughly, the number of clusters can be suggested by looking at the bend point on the scree plot of within-cluster sum of squares (the change in within-groups sum of squared errors below this point should be negligible) [1821]. The result shown in Fig. 8 suggests nine clusters. The two methods agree on nine clusters.

  3. (c)

    Bootstrap re-sampling of Ward’s method: This method gives a statistically significant number of clusters for the desired level of confidence. If the proportion of times a group of items is assigned together is at least the desired level of confidence, the group is considered as one cluster at that level of confidence (e.g. if a group of items is assigned together with a proportion greater than or equal to 0.95, the group is considered as one cluster at the 95% confidence level). Figure 11 reveals that the number of clusters is 6 at the 95% confidence level, and that 5 countries have no data support to be clustered. Recall that the T-chart outlier detection in Fig. 6 plot 8 and Table 6 showed these countries (South Africa, South Sudan, Niger and Equatorial Guinea), except Seychelles, as extreme values (outliers) or suspected outliers. From the results of the above three approaches, I conclude that the appropriate number of clusters is 9, considering that some outlier values form individual clusters: South Africa, South Sudan and Niger each form a cluster.
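The \(F_{ratio}\) in (b) can be illustrated with a short standard-library sketch. One assumption is made here that the text does not spell out: between-cluster and within-cluster variability are taken as mean squares, i.e. sums of squares divided by \(k-1\) and \(n-k\) degrees of freedom for \(k\) clusters and \(n\) items. The function names and the toy clusters are hypothetical and are unrelated to Table 7.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def f_ratio(clusters):
    """Between-cluster mean square divided by within-cluster mean square.

    clusters: list of clusters, each a list of equal-length numeric tuples.
    """
    points = [p for c in clusters for p in c]
    n, k = len(points), len(clusters)
    grand = centroid(points)
    within = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    between = sum(len(c) * sq_dist(centroid(c), grand) for c in clusters)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical one-dimensional items in two well-separated clusters.
toy_clusters = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
print(f_ratio(toy_clusters))
```

The resulting value would then be compared with the tabulated \(F\) critical value for \((k-1, n-k)\) degrees of freedom, as done in (b).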

Determining elements of the cluster

The results of cluster analysis from the average linkage method, K-mean method, Ward’s method and Bootstrap Ward’s method were compared. A rough view of the results in Table 8 suggests that almost all three methods agree on clusters 1, 4, 7 and 8, which is an indication of the stability of these clusters. However, some deviations are observed. For instance, the K-mean method did not split out South Africa as the average linkage and Ward’s methods do. The average linkage method grouped most of the countries in cluster 3 of the K-mean method into other clusters, and Ward’s method merged clusters 3 and 9 of the K-mean method into one cluster (cluster 3). More details of the results for each method are discussed in (a), (b) and (c).

Table 8 Summary for cluster element of each clustering Method
Fig. 9
figure 9

Clustering by K-mean method of African countries based on the first two socio-economic principal factors

  1. (a)

    Average linkage method: In this method the nearest clusters are joined based on the average distance between them, where the distance is the Euclidean distance between all pairs of items from the two clusters. Based on the results in Table 8, the deviation of this method is that most of the countries are grouped in cluster 3, whereas the K-mean method splits these countries into three clusters (clusters 3, 5 and 9) and Ward’s method splits them into two clusters (clusters 3 and 5).

  2. (b)

    K-mean method: This is a non-hierarchical cluster analysis method whose purpose is to assign elements to pre-determined clusters, such that each item is assigned to the cluster with the nearest mean on the first two principal factors (in this case these two components explain \(42.36\%\) of the total variation), with distance measured by Euclidean distance. Based on the results in Fig. 9 and Table 8, clustering by this method almost agrees with Ward’s method, with some exceptions: this method merges South Africa into a cluster whereas Ward’s method splits it out as its own cluster, and the elements of clusters 3 and 9 of this method are merged into cluster 3 by Ward’s method.

  3. (c)

    Ward’s method and Bootstrap re-sampling of Ward’s method: The objective of Ward’s method is to minimize the information lost in clustering by joining the clusters that result in the minimum error sum of squares. As described in (b), Ward’s method almost agrees with the K-mean method, with some exceptions. This is an indication of the stability of the clusters. Additionally, the result from the bootstrap, as described in Table 8, assures the existence of the first 6 clusters and the stability of their elements at the \(95\%\) confidence level. The bootstrap suggests that a cluster with a large p-value is highly supported by the data. Hence, the number of clusters and the selection of elements have to be based on the desired p-value. Based on the result in Fig. 11, there is not enough evidence at the 95% confidence level to support the existence of the clusters formed by South Africa, South Sudan and Niger. Equatorial Guinea and Seychelles are not included in cluster 6 at this level of confidence (their p-value is 77%), which implies that both Equatorial Guinea and Seychelles are included in cluster 6 77% of the time. So including them in cluster 6 will not have a big influence on the similarity of the cluster elements. The dendrogram representations of Ward’s method and Bootstrap re-sampling of Ward’s method are given in Figs. 10 and 11 respectively.
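The proportion rule used above (items are kept together when they are co-assigned in at least the desired share of re-sampled clusterings) can be sketched as below. This is a simplified standard-library illustration and does not reproduce the multi-scale bootstrap behind Fig. 11; the function names and the four label vectors are hypothetical, and in practice many more replicates would be used.

```python
def coassignment_proportion(labelings, i, j):
    """Share of replicate clusterings in which items i and j share a cluster.

    labelings: one list of cluster labels per bootstrap replicate; label
    values only need to be comparable within the same replicate.
    """
    together = sum(1 for labels in labelings if labels[i] == labels[j])
    return together / len(labelings)


def supported_pairs(labelings, confidence=0.95):
    """Item pairs co-assigned in at least `confidence` of the replicates."""
    n_items = len(labelings[0])
    return [
        (i, j)
        for i in range(n_items)
        for j in range(i + 1, n_items)
        if coassignment_proportion(labelings, i, j) >= confidence
    ]


# Hypothetical cluster labels for three items over four replicates.
labelings = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [2, 2, 1],
]
print(coassignment_proportion(labelings, 0, 2))
print(supported_pairs(labelings))
```

Here items 0 and 1 are always assigned together and would be kept as a cluster at the 95% level, while item 2 joins them in only a quarter of the replicates.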

Fig. 10
figure 10

Clustering by Ward’s method of African countries based on socio-economic principal factors

Fig. 11
figure 11

Clustering by Bootstrap Ward’s method of African countries based on socio-economic principal factors

Conclusion for cluster analysis

Based on the results of the four methods discussed above, Ward’s method finds the most stable and statistically significant clusters, with the exception of clusters 7, 8 and 9. This result strengthens the suggestion given in the “Data quality” section that the 3 remaining clusters, formed by South Africa, Niger and South Sudan, need further investigation at the source of the data. Meanwhile, since the data source is not easily accessible for further assessment and AfDB [10] accepted the IMF report, the better decision is to keep the three clusters to measure their current status. However, further inferences for the population are made based on the statistically significant clusters together with South Africa, because some relatively extreme data values are expected for South Africa given its real situation.

Inference for population

The summary results in Table 9, Appendix 1: Tables and Figs. 12 and 13 for the relation between clusters and principal factors suggest that the countries in clusters 2, 9, 1 and 6 have a better sustainable life (variables of PC1, Appendix 1) than the countries in other clusters. Specifically, Tunisia, Mauritius, Seychelles, Cape Verde, Morocco, Algeria and South Africa have a relatively better sustainable life than other African countries, which implies that these countries have used relatively more suitable policies on the variables of PC1 than other African countries. In terms of capital (variables of PC2, Appendix 1), the countries in clusters 9, 2 and 4 have a better status. Specifically, South Africa, Nigeria, Algeria and Morocco have relatively better capital than other African countries, which implies that these countries have used relatively more suitable policies on the variables of PC2 than other African countries.

Cluster 6 countries generate high income and score highly on income-related variables (variables of PC3, Appendix 1). So the policies of all cluster 6 countries (Libya, Equatorial Guinea, Seychelles and Gabon) on PC3 variables are relatively preferable. Life risk (variables of PC4, Appendix 1) is low in cluster 4 and 8 countries, so following Ethiopian and Nigerian policy in this aspect (for variables of PC4) can reduce life risk (no generalization based on Niger’s status is given here, even though it scores a medium PC4 value, since its cluster is not statistically significant). Clusters 4 and 5 have good literacy (variables of PC5, Appendix 1) status. Mainly Ethiopia, Burundi and Burkina Faso are doing appreciable work on illiteracy reduction. Djibouti, Seychelles, Lesotho and Liberia have relatively better water supply for domestic consumption in contrast to supply for agriculture (variables of PC6, Appendix 1). Surprisingly, the countries in most clusters do not have a good supply of water; hence providing pure water for domestic consumption needs to be future work for African countries. Angola, Equatorial Guinea and Cape Verde have good economic growth (no generalization based on South Sudan’s status is given here, even though it scores the highest PC7 value, since its cluster is not statistically significant), but inflation is high (variables of PC7, Appendix 1), so attention needs to be given to reducing the inflation rate.

Fig. 12
figure 12

Dot plot for principal factor values of the clusters

Fig. 13
figure 13

Dot plot for principal factor values of the clusters and box plot for principal factors

Table 9 Summary for level of clusters based on principal factors

Future work

Previous studies on socio-economic status measurement construct measuring components or variables based on a theoretical view of the socio-economic standing of an individual or community [1, 2]. However, socio-economic status measurement is still an ongoing problem. To contribute to solving this problem, a statistical approach is used to construct measuring components by investigating the natural correlations that exist between possible suggested variables; these components are able to cluster countries based on their socio-economic status and to level the status by component. A limitation of this study is that a comprehensive single measure of status is not constructed; rather, levelling is component-wise, and specific suggestions are given based on the standing of cluster countries on the variables of each component.

Conclusion

The results of the principal component analysis, factor analysis and cluster analysis reveal that 70% of the variation is accounted for by 7 principal factors (Appendix 1: Table 11). Using this variation, the countries are grouped into 6 stable clusters that are statistically significant at the 95% confidence level, with three additional outlier clusters (Table 9). The final output in Fig. 13 (where the black cross shows outlier values of the principal factors) and in Appendix 1: Tables suggests that Tunisia, Mauritius and Seychelles have a relatively better sustainable life (specifically on the PC1 variables listed in Appendix 1: Tables), whereas South Africa and Nigeria (recently booming capital) account for a large share of the capital in Africa (specifically on the PC2 variables listed in Appendix 1: Tables). In addition, the results indicate that Libya, Equatorial Guinea, Seychelles and Gabon have a better income source (mainly income-related factors, or in general the PC3 variables listed in Appendix 1: Tables); however, except for Seychelles, the main source of income of those countries is oil. Furthermore, Ethiopia, Sudan and Nigeria have low life risk or good health policy (specifically on the PC4 variables listed in Appendix 1: Tables). There is also evidence of better performance by Ethiopia, Burundi, Burkina Faso and Botswana on illiteracy reduction (specifically on the PC5 variables listed in Appendix 1: Tables). However, water supply for domestic consumption is not in good condition across the continent, even though Djibouti, Seychelles, Lesotho and Liberia give greater priority to domestic water consumption than to agricultural use (generally on the PC6 variables listed in Appendix 1: Tables). Finally, Angola, Equatorial Guinea and Cape Verde have better GDP per capita growth, but with a high inflation rate (specifically on the PC7 variables listed in Appendix 1: Tables). However, high inflation hampers growth [17], so it needs to be addressed.
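The 70% criterion used to retain principal factors can be illustrated with a minimal sketch. It assumes the eigenvalues of the correlation matrix are already available; the function name `components_for_variance` and the eigenvalues below are hypothetical and are not taken from the paper’s data or code.

```python
def components_for_variance(eigenvalues, threshold=0.70):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches `threshold`."""
    if not eigenvalues:
        raise ValueError("eigenvalues must be non-empty")
    # Components are ranked by the variance they explain, largest first.
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues for illustration only (not the paper's values).
example = [4.0, 2.0, 1.0, 0.5, 0.5]
print(components_for_variance(example))  # prints 2 (6.0 / 8.0 = 0.75 >= 0.70)
```

Under the same rule, the analysis reported here retains 7 factors, since that is the point at which the cumulative explained variance reaches 70%.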

As a general suggestion, a country could improve its socio-economic status by combining Tunisia’s sustainable-life policies (variables of PC1, Appendix 1: Tables), South Africa’s and Nigeria’s strategies for building economic capital, Seychelles’s income-source policy (an oil-independent economy), Ethiopia’s health and illiteracy-reduction policies, Djibouti’s water supply policy for domestic consumption, and Angola’s economic growth strategy together with intervention policies to control inflation. Specifically, manufacturing areas are comparatively exposed to HIV and tuberculosis, so control mechanisms should be applied to reduce the prevalence rate. On the other hand, poor sanitation and communicable diseases are correlated with life expectancy and the infant mortality rate. Hence, improving sanitation and controlling communicable diseases can raise life expectancy and reduce infant mortality. It is also observed that the economic status of a country, mainly GDP at market price (current US$), is affected by foreign direct investment net inflows (BoP, current US$), gross capital formation (current US$) and the transportation system. Hence, adopting an economic policy that attracts foreign direct investment and developing a good saving culture with a better transportation system can help to enhance the GDP of a country (this result agrees with the suggestions of Barro [17] and Upreti [4]). In addition, building a system that generates and uses more electric power and produces large volumes of exported goods and services can help to enhance GDP per capita, PPP (current international $).

References

  1. Cowan CD, Hauser RM, Kominski RA, Levin HM, Lucas SR, Morgan SL, Spencer MB, Chapman C. Improving the measurement of socioeconomic status for the national assessment of educational progress: a theoretical foundation. 2012.

  2. Oakes JM, Rossi PH. The measurement of SES in health research: current practice and steps toward a new approach. Soc Sci Med. 2003;56(4):769–84.

  3. Haller AO. The social grading of occupations: a new approach and scale. In: John H, editors. Goldthorpe Keith Hope; 1976.

  4. Upreti P. Factors affecting economic growth in developing Countries. Major themes in economics. Berlin: Spring; 2015.

  5. Cochran WG. Sampling techniques. New York: Wiley; 2007.

  6. Vaseghi SV. Advanced digital signal processing and noise reduction. New York: Wiley; 2008.

  7. Lokupitiya RS, Lokupitiya E, Paustian K. Comparison of missing value imputation methods for crop yield data. Environmetrics. 2006;17(4):339–49.

  8. Howell DC. The treatment of missing data. In: The sage handbook of social science methodology; 2007. p. 208–224.

  9. Jerven M. Why We Need to Invest in African Development Statistics: From a Diagnosis of Africa’s Statistical Tragedy Towards a Statistical Renaissance. African Arguments 2013.

  10. African Development Bank. Situational analysis of economic statistics in Africa: Special focus on GDP measurement. Abidjan: African Development Bank; 2013. http://www.afdb.org/fileadmin/uploads/afdb/Documents/Publications/Economic%20Brief%20-%20Situational%20Analysis%20of%20the%20Reliability%20of%20Economic%20Statistics%20in%20Africa-%20Special%20Focus%20on%20GDP%20Measurement.pdf. Accessed 7 Dec 2017.

  11. World Bank. World Economic and Financial Surveys, Regional Economic Outlook, Sub-Saharan Africa. 2008.

  12. Miguez F. Introduction to R for multivariate data analysis. 2007.

  13. Johnson RA, Wichern DW. Applied multivariate statistical analysis. 5th ed. New Jersey: Prentice Hall, Englewood Cliffs; 2002.

  14. Rencher AC. Methods of multivariate analysis, vol. 492. New York: Wiley; 2003.

  15. Kumar S, Toshniwal D. Analysis of hourly road accident counts using hierarchical clustering and cophenetic correlation coefficient (CPCC). J Big Data. 2016;3(13):1–11.

  16. Kumar S, Toshniwal D. A novel framework to analyze road accident time series data. J Big Data. 2016;3(8):1–11.

  17. Barro RJ. Determinants of economic growth: a cross-country empirical study (No. w5698). Nat Bureau Econom Res. 1996.

  18. Calinski T, Harabasz J. A dendrite method for cluster analysis. Commun Stat. 1974;3(1):1–27.

  19. Steinley D, Brusco MJ. Choosing the number of clusters in K-means clustering. Psychol Methods. 2011;16(3):285–97.

  20. Steinley D. Validating clusters with the lower bound for sum of squares error. Psychometrika. 2007;72(1):93–106.

  21. Steinley D, Brusco MJ. A new variable weighting and selection procedure for K-means cluster analysis. Multivar Behav Res. 2008;43(1):77–108.

Acknowledgements

The author expresses his heartfelt gratitude to the two anonymous reviewers for their careful reading of the manuscript and their helpful comments, which improved the presentation of this work. The author is also grateful to Prof. Dr. Axel Schumann for his valuable comments. The author also thanks the International Monetary Fund for the free data source and AIMS-Cameroon for the resources used in preparing this paper.

Competing interests

The author declares that he has no competing interests.

Availability of supporting data

All supporting data files are available.

Consent for publication

The author consents to the publication of this research.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Corresponding author

Correspondence to Habtamu Tilaye Wubetie.

Appendix 1

See Tables 10, 11, 12, 13, 14, 15 and 16.

Table 10. Loadings of 4 principal factors, variance accounted for by the factors, and correlation between factors
Table 11. Loadings of 7 principal components, variance accounted for by the factors, and correlation between factors
Table 12. Correlation between key variables in each factor
Table 13. Summary for Ward cluster principal factors
Table 14. Clusters of countries sorted by principal component
Table 15. Variables of principal components and summary result
Table 16. Variables of principal components and summary result

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Wubetie, H.T. Missing data management and statistical measurement of socio-economic status: application of big data. J Big Data 4, 47 (2017). https://doi.org/10.1186/s40537-017-0099-y

  • DOI: https://doi.org/10.1186/s40537-017-0099-y

Keywords