Missing data management and statistical measurement of socio-economic status: application of big data

Wubetie, Habtamu Tilaye

doi:10.1186/s40537-017-0099-y

Research
Open access
Published: 19 December 2017


Journal of Big Data volume 4, Article number: 47 (2017)


Abstract

Measuring socio-economic status is an ongoing problem for which researchers have suggested different measurements. This work investigates a socio-economic status measurement derived from the natural correlations among variables, one that can cluster African countries by level of status more meaningfully. The researcher used yearly socio-economic time series data for 48 African countries from 1993 to 2013, taken from the IMF 2013 data set, for data management (i.e., 2737 variables for 21 years); the analysis, however, is based on the most recent 14 years of the time series. In data management, missing values are treated (imputed) using regression estimates, Lagrange interpolation, linear interpolation and linear spline interpolation, choosing the method that best fits the trend of the data with minimum error at each time level. From principal component and factor analysis of the averaged time series data, 7 principal factors, contributed by 84 variables and explaining $70\%$ of the variation in the data set, are suggested as the components for measuring socio-economic status. As a result, the clustering methods considered (K-means, average linkage, Ward’s method and bootstrap Ward’s method) agree on six clusters of countries that are statistically significant at the $95\%$ level, whereas three countries, each suggested as an outlier, form individual clusters.

Introduction

Measuring socio-economic status is an ongoing problem: different studies have measured it as a single variable, as several separate variables, or as a composite of several measured variables. Socio-economic status is defined as one’s access to financial, social, cultural and human capital resources, and it is recommended that family income (together with other indicators of home possessions and resources), parental educational attainment and parental occupational status (the “big 3”) be the components of a core socio-economic status measure [1]. It has also been defined as one’s access to collectively desired resources, such as (1) material capital (income, wealth, trust funds, etc.), (2) human capital (skills, abilities, credentials, etc.) and (3) social capital (instrumental relationships such as being friends with lawyers and doctors) [2]. The Duncan Socio-economic Index, a subjective assessment of occupational prestige based on educational attainment and income, has been used as a measure in the US. In 1974 Peter Rossi et al. also developed a household prestige score as a measure. Currently in the UK the National Statistics Socio-economic Classification (NS-SEC) is used to calculate a measure of socio-economic status based on one’s job and employment relations.

From a sociological view, the status of a society can be levelled using status dimensions, namely power, wealth, prestige and information. American sociologists have used occupational prestige to level status, and they have looked for similarities or differences between levels at different times and places. Some sociologists have defined occupational prestige as the availability of resources (composed of both wealth and power) to each person, whereas others relate it to prestige (composed of power, wealth and prestige). Goldthorpe and Hope define it as social standing, which includes the variables standard of living, power and influence, level of qualification and value to society [3].

Barro [17] studied 100 countries from 1960 to 1990. He found that the growth rate (real per capita GDP) is enhanced by higher initial schooling and life expectancy, lower fertility, lower government consumption, better maintenance of the rule of law, lower inflation and improvements in the terms of trade. In 2015 a study of the factors affecting economic growth in developing countries used cross-country data for 76 countries from 2010, 2005, 2000 and 1995. The variables used to assess the factors behind GDP per capita growth are volume of exports, government debt (% of GDP), natural resource yield (% of GDP), net foreign aid received (USD), life expectancy (years), investment rate (% of GDP) and FDI inflow (% of GDP). The results show that a high volume of exports, plentiful natural resources, longer life expectancy and higher investment rates have positive impacts on the growth of per capita gross domestic product in developing countries [4].

In constructing a measurement for socio-economic status, reliability is of primary importance. Reliability rests on the socio-economic and statistical significance of the classifications made, using an appropriate method, from representative data. However, the representativeness of some African data has been criticised, and the sources of the problem are diverse. Historically, in the 1980s and 1990s statistical offices did not receive appropriate attention as a source of data, and even today the data are distorted by shifts in donors’ demand for data. In addition, projects on Africa have focused on achieving development targets rather than on answering important development questions. Another problem is lack of data: as an African Development Bank survey noted, nearly one-fifth of the respondent countries had not conducted an industry survey since 2000. Beyond these problems, African data suffer from sampling error (inappropriate sample size and sampling technique) and non-sampling error (respondent error, non-response, recording error, etc.) [5]. Missing data can be a problem when there is non-response or when the data for a variable are not collected, mainly not at random. The Lagrange interpolation method can be used to interpolate missing values when a value depends on its neighbouring data. Vaseghi [6] formulates the general form of the polynomial interpolators and statistical interpolators applicable to missing data imputation. Among polynomial interpolators he considers the special forms of the Lagrange, Newton, Hermite and cubic spline interpolators. Lokupitiya et al. [7] use NASS data on barley crop yield in 1997, where the ecological variables are spatially correlated, to select a better interpolation method. Among the methods considered (regression, kernel smoothing, universal kriging and multiple imputation), they find regression and multiple imputation to be the better interpolators for Y (response) based on the target or control variable X (explanatory).

Howell [8] considered the missing data problem for standard experimental studies and observational studies. In observational studies missing values can be treated by hot deck imputation, mean substitution and pairwise deletion, but those methods lead to bias in parameter estimation. The expectation/maximization (EM) algorithm and multiple imputation (MI), by contrast, are the best techniques; they are based on iterative solutions in which the parameter estimates lead to imputed values, which in turn change the parameter estimates. MI is an interesting approach because it uses randomized techniques for its imputation. An example is regression imputation, which regresses the response variable on the explanatory variable.

The lack of reliable data on African economies limits knowledge of the economic effect of structural adjustment; as a result, the economic growth time series for African economies do not appropriately capture changes in economic development [9]. Currently the better source of African socio-economic data is the IMF [11]; alternatively, if we trust the AfDB, its data may miss a few base year revisions [10], though the AfDB does not fully agree with the IMF [11] report. Meanwhile, the AfDB concludes that: “Overall, the situation with regard to GDP is not nearly as bad as has recently been suggested” [9]. In working on this problem, time series data are preferable to a single value, since the distribution of the data gives more detailed and general information about the characteristics of interest than one value does. Consequently, the risk of being governed by inappropriate data alone can be reduced.

Previous studies on the measurement of socio-economic status construct measuring components or variables based on the socio-economic standing of an individual or community [1,2,3,4]. However, socio-economic status measurement is still an ongoing problem. This paper aims to construct measuring components by investigating the natural correlation that exists between the candidate variables, components that are able to cluster countries by socio-economic status and to level the status for each component. A single measure of status is not constructed, since the concern is to give specific suggestions based on the standing of the clustered countries on each component. Hence, time series data are used to manage the African socio-economic data problem and to determine the components of socio-economic status measurement that can classify African countries by status through comparison across the region. Correspondingly, missing values were treated by the method that gives the minimum error from the true value at each time interval. Missing values are imputed using a linear regression model, Lagrange interpolation, linear interpolation and linear spline interpolation. Principal component analysis, factor analysis and cluster analysis are used to determine the principal factors of socio-economic status measurement and to cluster African countries based on those factors. The results reveal that $70\%$ of the variation in the data set is explained by the 7 suggested components (principal factors), which are contributed by 84 variables; using those socio-economic components, 6 clusters of African countries are formed at the $95\%$ confidence level, with 3 countries considered as outliers.

Methodology

Data and variable

Data The IMF [11] yearly socio-economic time series data set, containing 2737 variables [File Name: 21yearData.csv] from 1993 to 2013 for 48 African countries, was used for data management; the analysis, however, is based on the data from 2000 to 2013. The reason for using 14 years of data instead of 21 is the recently growing demand for data, which has increased the output of statistical offices and so reduced the number of missing values in recent years. Specifically, almost all of the 44 respondent countries have carried out at least one household survey of income or expenditure since 2000 [9].

The IMF [11] data set is preferred over the AfDB [10] data set because of the advantages listed in (1) and (2) below:

1.
The IMF [11] data have better coverage (48 countries) than the AfDB [10] data (44 countries).
2.
Moreover, as indicated in the introduction section, the AfDB does also agree with the [11] report.

Moreover, Morten Jerven [9] also advises using [11]: if we use the AfDB data, a few base year revisions may have been missed.

Variables In this study the proposed components are selected based on results suggested by previous studies, mainly Cowan et al. [1] and Oakes and Rossi [2]. Cowan et al. [1] recommend that the socio-economic status components include family income, parental educational attainment and parental occupational status. Moreover, an expanded measure of socio-economic status can be constructed by adding home neighbourhood and school socio-economic status. Here family income includes home possessions (internet access, clothes dryer, dishwasher, more than one bathroom, one’s own bedroom), the presence of a household member needing healthcare assistance, and household composition such as the size of the household (total, number of adults). Correspondingly, Oakes and Rossi [2] recommend material capital (income, wealth, trust funds, etc.), human capital (skills, abilities, credentials, etc.) and social capital (instrumental relationships such as being friends with lawyers and doctors). Hence, relative to the components of the IMF data, the proposed components belong to the education, economy, health, infrastructure and population demographic data categories. However, the current IMF [11] African socio-economic data have three major problems:

1.
Some data values are missing.
2.
Different countries have different base years for their GDP. In response, the IMF updates each country’s GDP based on its own base year. Hence, there may be a loss of information, and comparison using single-year data is inappropriate.
3.
The data have some discrepancies or deviations from the AfDB [10] data [per the AfDB [10] conclusion: the IMF GDP report is not nearly as bad as has recently been suggested]. This problem arises mainly in the data collection, processing and distribution phases. It is the duty and responsibility of statistical offices and other primary data source organizations to apply appropriate data collection techniques and standardized processing methods, and to distribute the data honestly according to its nature, since data are public property. To manage this problem, distribution-based analysis can reduce the risk in inference and give more relevant results than using a single value. Considering the time series of the data helps to do this. For instance, the time series captures the progress of GDP and so makes comparison more appropriate than using 1 year of GDP.

Missing data management

A missing data value is the absence of a data value completely at random (if the missing values of a variable do not depend on any value), at random (if the missing values in the response variable do not depend on its own value but depend on other variables) or not at random (if the missing values follow some structure or model) [8]. The series of African socio-economic variables on a fixed time interval have a high number of missing values for some countries compared with others. As listed below in the sequence of missing values per country, countries such as Somalia, South Sudan, and Sao Tome and Principe have a high number of missing values compared with Tunisia, Morocco and South Africa. This suggests that the probability of a value being missing depends on the value itself.

List for the number of missing values per country:

In addition, since each socio-economic variable is indexed by time and the number of missing values has a strong inverse correlation with time (r = −0.8413), recent years are less likely to have missing values than older ones. Hence, the missing values are not at random and are non-ignorable.

The time series of missing values:

Interpolation

Interpolation is the estimation of unknown values using the values of known samples at the neighbourhood points [6].

Interpolation by simple linear regression method

Linear regression estimate imputation is a single imputation method, used when the variable with missing values is correlated with the explanatory variable (time) and the series of data values follows a linear trend [7]. However, socio-economic data are expected to have some trend, but it may not be exactly linear in time. Hence, applying this method may inflate correlation and underestimate the standard errors of the regression coefficients, by underestimating the variance of the imputed variables [8]. With this in mind, the simple linear regression estimate is used for imputation when the missing data are at the beginning ($t_1$) and/or at the end ($t_n$) of the series. Missing values in the interior, however, were treated by comparing this method, for minimum error, with the other exact estimation methods discussed in the “Linear interpolation”, “Linear spline interpolation” and “Lagrange polynomial interpolation” sections.

The simple linear regression model for the given response variable Y (socio-economic variable) and the explanatory variable time (t) is given by:

$$\begin{aligned} Y_{ij}= B_{0i} + B_{1i} t_{j} + \varepsilon _{ij} , \quad \text {for}~i = 1, 2, \, \ldots ,\, 2737, \quad \text{for} ~j = 1,2, \ldots ,~21. \end{aligned}$$

(1)

where $B_{0i}$ and $B_{1i}$ are the intercept and slope parameters for the ith variable, $t_j$ is the jth year, and $\varepsilon _{ij}$ is the error term, which is normally distributed with mean zero and variance $\sigma ^2$, i.e., $\varepsilon _{ij} \sim N(0 , \sigma ^2)$.
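As a rough illustration of how the regression estimate can fill values missing at the start or end of a series, the sketch below fits an ordinary least squares line to the observed points of one variable and evaluates it at the missing end points. The function names and the sample series are hypothetical, chosen only for this example; they are not taken from the paper or the IMF data.

```python
def fit_line(times, values):
    """Ordinary least squares fit of value = b0 + b1 * time on observed points."""
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    b1 = sxy / sxx
    b0 = y_mean - b1 * t_mean
    return b0, b1


def impute_ends_by_regression(times, values):
    """Fill missing (None) values before the first or after the last observation."""
    observed = [(t, y) for t, y in zip(times, values) if y is not None]
    b0, b1 = fit_line([t for t, _ in observed], [y for _, y in observed])
    first = min(t for t, _ in observed)
    last = max(t for t, _ in observed)
    filled = []
    for t, y in zip(times, values):
        if y is None and (t < first or t > last):
            filled.append(b0 + b1 * t)  # value predicted by the fitted line
        else:
            filled.append(y)  # observed values and interior gaps are left untouched
    return filled


years = [2000, 2001, 2002, 2003, 2004]
series = [None, 12.0, 14.0, 16.0, None]  # made-up values, not IMF data
print(impute_ends_by_regression(years, series))  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

Interior gaps are deliberately left alone here, since the text assigns them to the interpolation methods chosen by minimum error.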

Linear interpolation

Linear interpolation is the simplest interpolation technique for missing data imputation, using the two known neighbours. For a time series of discrete data points of a socio-economic variable given by $\left\{ \left( t_1, y_{t_1}\right) , \left( t_2, y_{t_2}\right) , \ldots \left( t_n, y_{t_n}\right) \right\}$, when there is a missing value at $t_i$ and $y_{t_{i-1}}$ and $y_{t_{i+1}}$ are known, linear interpolation can be used to estimate $y_{t_i}$ at $t_i$ from its neighbours $y_{t_{i-1}}$ and $y_{t_{i+1}}$ using the following formula.

$$\begin{aligned} Y_{t_i}=Y_{t_{i-1}} + \dfrac{t_i - t_{i-1}}{t_{i+1}-t_{i-1}}\left( Y_{t_{i+1}} - Y_{t_{i-1}}\right) . \end{aligned}$$

(2)

This method is employed to interpolate non-sequential missing values. An illustrative example is presented in Fig. 3 and Table 2.
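A minimal sketch of Eq. (2) is given below, assuming the series is stored as a list in which a missing value is marked by `None`. The function names and the sample values are hypothetical and serve only to show the arithmetic.

```python
def linear_interpolate(t_prev, y_prev, t_next, y_next, t_missing):
    """Equation (2): value at t_missing on the line through its two known neighbours."""
    weight = (t_missing - t_prev) / (t_next - t_prev)
    return y_prev + weight * (y_next - y_prev)


def fill_isolated_gaps(times, values):
    """Fill each missing (None) value whose immediate neighbours are both observed."""
    filled = list(values)
    for i in range(1, len(values) - 1):
        if values[i] is None and values[i - 1] is not None and values[i + 1] is not None:
            filled[i] = linear_interpolate(
                times[i - 1], values[i - 1], times[i + 1], values[i + 1], times[i]
            )
    return filled


years = [2000, 2001, 2002, 2003, 2004]
series = [10.0, None, 14.0, None, 20.0]  # made-up values, not IMF data
print(fill_isolated_gaps(years, series))  # [10.0, 12.0, 14.0, 17.0, 20.0]
```

With equally spaced years the weight is 1/2, so each isolated gap is simply filled with the average of its two neighbours.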

Linear spline interpolation

This method works in a similar fashion to linear interpolation, in that a missing value is interpolated from its two nearest neighbours, except that it works for sequentially missing values. Here, with some adjustment (i.e., using the same upper neighbour), the method is applied to interpolate sequentially missing values. Consider a time series of discrete data points of a socio-economic variable given by $\left\{ \left( t_1, y_{t_1}\right) , \left( t_2, y_{t_2}\right) , \ldots \left( t_n, y_{t_n}\right) \right\}$, in which the sequence of values $y_{t_2}$, $y_{t_{3}}$, ..., $y_{t_{n-1}}$ is missing.

Then for any k, $2 \le k\le n-1$, $y_{t_k}$ is interpolated as:

$$\begin{aligned} Y_{t_k}=Y_{t_{k-1}} + \dfrac{t_k - t_{k-1}}{t_{n}-t_{k-1}}\left( Y_{t_{n}} - Y_{t_{k-1}}\right) . \end{aligned}$$

(3)

That is,

$$\begin{aligned} Y_{t_2}&=Y_{t_{1}} + \dfrac{t_2 - t_{1}}{t_{n}-t_{1}}\left( Y_{t_{n}} - Y_{t_{1}}\right) \\ Y_{t_3}&=Y_{t_{2}} + \dfrac{t_3 - t_{2}}{t_{n}-t_{2}}\left( Y_{t_{n}} - Y_{t_{2}}\right) \\&\vdots \\ Y_{t_{n-1}}&=Y_{t_{n-2}} + \dfrac{t_{n-1} - t_{n-2}}{t_{n}-t_{n-2}}\left( Y_{t_{n}} - Y_{t_{n-2}}\right) . \end{aligned}$$

In this paper linear spline interpolation is employed to interpolate sequentially missing values. An illustrative example is presented in Fig. 4 and Table 3.
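The recursion in Eq. (3) can be sketched as follows, again assuming a missing value is marked by `None`. Each missing value is interpolated between the value just before it (observed or already filled) and the first observed value after the run, which is the shared upper neighbour. The function name and data are hypothetical.

```python
def fill_run_with_upper_neighbour(times, values):
    """Equation (3): fill a run of missing values one at a time, left to right."""
    filled = list(values)
    n = len(filled)
    for k in range(1, n - 1):
        if filled[k] is not None:
            continue
        # Index of the next observed value: the shared upper neighbour of the run.
        upper = next((j for j in range(k + 1, n) if values[j] is not None), None)
        if upper is None or filled[k - 1] is None:
            continue  # the gap touches an end of the series; not handled here
        weight = (times[k] - times[k - 1]) / (times[upper] - times[k - 1])
        filled[k] = filled[k - 1] + weight * (filled[upper] - filled[k - 1])
    return filled


years = [2000, 2001, 2002, 2003, 2004]
series = [10.0, None, None, None, 18.0]  # made-up values, not IMF data
print(fill_run_with_upper_neighbour(years, series))
```

Because every filled value lies on the segment from the previous value to the same upper neighbour, the filled run ends up on the straight line joining the two observed end points of the gap (here, up to floating-point rounding, 12, 14 and 16).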

Lagrange polynomial interpolation

Lagrange polynomial interpolation is a type of exact interpolation that uses all the given neighbours to estimate missing values.

For $Y_{t_i}=f(t_i)$, where $\left\{ t_1< t_2 < \cdots < t_n \right\}$, the function is given at discrete times for a socio-economic variable by the points $\left\{ \left( t_1, y_{t_1}\right) , \left( t_2, y_{t_2}\right) , \ldots \left( t_n, y_{t_n}\right) \right\}$. The Lagrange polynomial for the given points (a polynomial of degree at most $n-1$ for n points) is used to approximate or estimate the function $Y_{t_i}=f(t_i)$ at any time point $t_i$ in the range; this process is called interpolation by Lagrange polynomial. For a missing value $Y_{t_i}$ in the series of variable values, the Lagrangian estimate was calculated by the following equation [6].

$$\begin{aligned} Y_{t_i}=\sum _{k=1}^{n}\prod _{{\begin{matrix} j=1\\ j\ne k \end{matrix}}}^{n}\left( \dfrac{t_i - t_{j}}{t_{k}-t_{j}}\right) Y_{t_k}. \end{aligned}$$

(4)

In this paper Lagrange interpolation is applied when there is only one missing value in the variable’s values, or when one missing value is left after the other methods have been employed. An illustrative example is presented in Fig. 2 and Table 1.

The theoretical advantage of Lagrange polynomial interpolation is that it uses all known data values of a variable to estimate the missing value at a point, so the estimate is not governed only by its two nearest neighbours. However, because of the complexity of the formula, its use is restricted to the single-missing-value cases described above.
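Equation (4) can be sketched directly as a double loop over the known points. The function name and the test data (points taken from $y = t^2$, so that the correct answer is known) are assumptions made for this illustration and are not drawn from the paper.

```python
def lagrange_estimate(known_times, known_values, t_missing):
    """Equation (4): evaluate the Lagrange polynomial through the known points at t_missing."""
    total = 0.0
    for k, (t_k, y_k) in enumerate(zip(known_times, known_values)):
        basis = 1.0
        for j, t_j in enumerate(known_times):
            if j != k:
                basis *= (t_missing - t_j) / (t_k - t_j)
        total += basis * y_k  # basis is 1 at t_k and 0 at every other known time
    return total


# Known points lie on y = t**2; the value at t = 2 is treated as missing.
known_t = [0, 1, 3, 4]
known_y = [0.0, 1.0, 9.0, 16.0]
print(lagrange_estimate(known_t, known_y, 2))  # close to 4
```

Since the polynomial passes exactly through every known point, the estimate depends on all observations of the variable, not only on the two nearest ones.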

Normality assumption It is known that the characteristics of surviving creatures are normally distributed. In our case this assumption is important for inference about the population, because the socio-economic status of the African population and the variables that determine it are expected to be normally distributed. Moreover, by the central limit theorem, the sampling distribution of a sample statistic approaches the normal distribution as the sample size increases $(n > 30)$, and by the law of large numbers, the sample statistic approaches the population parameter as the sample size increases.

The socio-economic status data set used for analysis is a multivariate time series data set of 2737 variables from 48 African countries for the years 2000 to 2013.

In the analysis of socio-economic status, some variables are expected to contribute comparatively more to socio-economic well-being than others. In addition, some variables may be highly correlated. Therefore, to avoid the complexity of a large number of variables, it is better to consider the smallest possible number of variables that can reflect the needed information. Specifically, this can be done as follows:

1.
When some variables are highly correlated with each other, they describe an underlying characteristic that is governed by their correlation, and this characteristic is the point of interest for the group. The characteristic can be written, as a new variable, as the linear combination of those correlated variables that maximizes the variation accounted for out of the total variation in the data set.
2.
When some variables are correlated with the same latent (or new) variable in describing the situation of interest (socio-economic status), the latent or new variable is taken as a linear combination of these variables, chosen so that the linear combination maximizes the variation accounted for.
3.
Some variables in the set may account for a large amount of the variability in the data set. Since these variables express a larger share of the variation, we can take the variables that address the variation that needs to be accounted for.

In general, the above three ideas lead to principal component analysis and exploratory factor analysis.

In other words, we are assessing the variation between the random variables and the variance of each variable. Normally, the variation between random variables is estimated by the distance of each random variable from its mean in units of standard deviation. This distance is a standardized and correlation-free random variable [12]. Statistical distance plays an important role here, because a smaller distance between variables implies higher correlation (which is observed in the off-diagonal entries of the correlation or covariance matrix).

For the multivariate normally distributed random variables denoted by the random vector $\mathbf Y '$ $= [Y_1, Y_2, \ldots Y_p]$, with p-dimensional normal density given by:

$$\begin{aligned} f(\mathbf y )= \dfrac{1}{(2\pi )^{\frac{p}{2}} \mid \Sigma \mid ^\frac{1}{2}}\exp \left\{ -\frac{(\mathbf y -\mu )' \Sigma ^{-1} (\mathbf y -\mu )}{2}\right\} , \end{aligned}$$

where p is the total number of variables, and $y_i$, for $i=1,2, \dots ,\ p$ (here $p = 2737$), is an n-component normal random variable with mean $\mu$ and variance $\sigma ^2.$

The squared statistical distance from $\mathbf Y$ to population mean $\mu$ for $p \times 1$ vector $\mathbf y$ of observations is given by:

$$\begin{aligned} (\mathbf y -\mu )'\Sigma ^{-1} (\mathbf y -\mu ) , \end{aligned}$$

where the $p \times 1$ vector $\mu$ represents the expected value of the random vector $\mathbf Y$ and the $p\times p$ matrix $\Sigma$ is the variance–covariance matrix of $\mathbf Y$.
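For intuition, the squared statistical distance can be evaluated by hand in the two-variable case, where the inverse of $\Sigma$ has a simple closed form. The sketch below is restricted to $p = 2$ for that reason; the function name and the numbers are hypothetical.

```python
def squared_statistical_distance_2d(y, mu, sigma):
    """(y - mu)' Sigma^{-1} (y - mu) for two variables, using the closed-form 2x2 inverse."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    d1 = y[0] - mu[0]
    d2 = y[1] - mu[1]
    # Sigma^{-1} = (1 / det) * [[d, -b], [-c, a]]
    return (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det


# Uncorrelated variables with variances 4 and 1: the distance reduces to
# the sum of squared standardized deviations, (2/2)**2 + (1/1)**2.
print(squared_statistical_distance_2d((2.0, 1.0), (0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]]))  # 2.0
```

When the off-diagonal covariance is non-zero, the cross term adjusts the distance for the correlation between the two variables.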

Principal component analysis

Principal component analysis describes the correlation or variance–covariance structure of a set of variables through a few uncorrelated latent (new) variables, each of which is a linear combination of the original variables that maximizes the variance accounted for. Most often these new variables reveal an interpretation that is not visible in the original variables [13]. The newly created variables are called principal components.

Let the random vector $\mathbf Y '$ $= [Y_1, Y_2, \ldots Y_p]$ have covariance matrix $\Sigma$ with eigenvalue–eigenvector pairs $(\lambda _1, e_1)$, $(\lambda _2, e_2)$, ..., $(\lambda _p, e_p)$, where $\lambda _1\ge \lambda _2 \ge \cdots \ge \lambda _p \ge 0$. Then the ith principal component $Z_i$, for $i=1, 2, \ \ldots , \ k$ where $k \le p$, is the linear combination given by:

$$Z_i= \mathbf e _i'{} \mathbf Y =e_{i1}Y_1 + e_{i2}Y_2 + \cdots + e_{ip}Y_p$$

with $Var(Z_i) = \mathbf e _i' \Sigma \mathbf e _i=\lambda _i$ for $i=1, 2, \ \ldots , \ k$ and $Cov(Z_i, Z_j)= \mathbf e _i' \Sigma \mathbf e _j =0$ for $i \ne j$, which maximizes $Var(Z_i) = \mathbf e _i' \Sigma \mathbf e _i .$

Principal components are arranged in decreasing order of the proportion of variation they explain, so that the first principal component accounts for more variation than any other. Therefore, taking the first few components may account for most of the variation in the original data (for example, up to 80 or 90% of the population variation). There is not yet a well-established rule for deciding the number of principal components, but it is advisable to consider the size of the eigenvalues and the nature of the components. Most often, principal components with relatively equal, small eigenvalues are not considered. In general, principal component analysis helps to reduce the number of variables and shows the correlation between them [12], and the new variables are used for further analysis such as cluster and regression analysis.
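The selection rule described above, keeping the leading components until a target share of the total variance is reached, can be sketched as follows. This is not a full principal component analysis: eigenvalues are computed in closed form only for a 2 × 2 covariance matrix, and the longer eigenvalue list is invented for illustration. All function names are hypothetical.

```python
import math


def eigenvalues_2x2_symmetric(a, b, c):
    """Eigenvalues, largest first, of the covariance matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius


def components_needed(eigenvalues, target_share):
    """Smallest number of leading components whose eigenvalues reach target_share of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target_share:
            return k, cumulative / total
    return len(ordered), 1.0


print(eigenvalues_2x2_symmetric(2.0, 1.0, 2.0))  # (3.0, 1.0)
# Invented eigenvalues; with a 70% target the first two components suffice.
print(components_needed([4.0, 2.0, 1.0, 0.5, 0.5], 0.70))  # (2, 0.75)
```

Because $Var(Z_i) = \lambda_i$, each eigenvalue divided by the sum of all eigenvalues is the share of total variance carried by that component.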

Factor analysis

Factor analysis is used to describe the observed correlation (covariance relation) between the variables in terms of a few new random variables called factors [13]. The method groups highly correlated variables together in such a way that variables in different groups are only slightly correlated. The variables in a group therefore address a characteristic that is governed by the underlying correlation, called a factor. Factors are slightly correlated new variables.

Consider the random vector $\mathbf Y '$ $= [Y_1, Y_2, \ldots Y_p]$ with mean vector $\mu$ and covariance matrix $\Sigma$. The factor model postulates that $\mathbf Y$ is linearly dependent on a $k\times 1$ random vector $\mathbf F$ of common factors and a $p\times 1$ random vector $\varepsilon$ of specific factors. The interrelation between the elements of $\mathbf Y$ is then given by the factor model:

$$\begin{aligned} \mathbf Y =\mu + \Lambda \mathbf F + \varepsilon , \end{aligned}$$

where $\Lambda$ is $p \times k$ matrix of unknown constants called loadings.

Assumptions of factor model on $\mathbf F$ and $\varepsilon$;

1.
$\mathbf F \ \sim \ N(\mathbf 0 , \mathbf I )$.
2.
$\varepsilon \ \sim \ N(\mathbf 0 , \Phi )$, where $\Phi \ = \ diag( \phi _1, \phi _2, \ldots , \phi _p)$.
3.
$\mathbf F$ and $\varepsilon$ are independent. These assumptions lead to the following estimate of the covariance matrix:
$$\begin{aligned} \Sigma = \Lambda \Lambda ' + \Phi , \end{aligned}$$
and $Cov(\mathbf Y ,\mathbf F )=\Lambda$, or $Cov( Y_i, F_j ) = l_{ij}$, where $l_{ij}$ is the (i, j) element of $\Lambda$,

where $h_i^2 = \sum _{j=1}^{k}l_{ij}^2$ (Communality) and $\phi _i=Var(Y_i) - h_i^2$ (Uniqueness), for $i=1, 2, 3, \ldots \ p$

Comparing the estimated covariance matrix with the original covariance matrix tells us how well the factor model, with the factors considered, fits the covariance matrix of the original variables. A small discrepancy indicates a good fit. Moreover, communality and uniqueness tell us the variance accounted for by the factors. Specifically, the ith communality is the portion of the variance of $Y_i$ explained by the k common factors, and the ith uniqueness is the portion of $Var (Y_i)$ explained by the ith specific factor. Our main concern is a factor model that explains the covariance structure with a small number of common factors and without much loss of information.
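To make the roles of communality, uniqueness and the implied covariance concrete, the sketch below computes them from a small loading matrix. The loadings are invented (three standardized variables, two factors) and the function names are hypothetical; estimating the loadings themselves is outside the scope of this example.

```python
def communalities(loadings):
    """h_i^2: sum of squared loadings in row i of the p x k loading matrix."""
    return [sum(l * l for l in row) for row in loadings]


def uniquenesses(variances, loadings):
    """phi_i = Var(Y_i) - h_i^2."""
    return [v - h for v, h in zip(variances, communalities(loadings))]


def implied_covariance(loadings, phi):
    """Sigma = Lambda Lambda' + Phi, with Phi = diag(phi)."""
    p = len(loadings)
    sigma = [
        [sum(a * b for a, b in zip(loadings[i], loadings[j])) for j in range(p)]
        for i in range(p)
    ]
    for i in range(p):
        sigma[i][i] += phi[i]
    return sigma


loadings = [[0.8, 0.0], [0.6, 0.0], [0.0, 0.5]]  # invented values
variances = [1.0, 1.0, 1.0]  # standardized variables
phi = uniquenesses(variances, loadings)
print(communalities(loadings))
print(phi)
print(implied_covariance(loadings, phi))
```

In this invented case the first two variables load on the same factor, so the model implies a covariance of about 0.48 between them, while the third variable is implied to be uncorrelated with both.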

Cluster analysis

Cluster analysis is a method of grouping objects or variables based on similarity or distance, taking into account the nature of the variables, the scale of measurement and subject-matter knowledge, so that objects within a group are similar and objects in different groups are relatively different. Usually objects, units or cases are clustered based on some sort of distance, whereas variables are clustered based on correlation coefficients, with the goal of finding the optimal grouping [14].

In this paper a combined approach, hierarchical clustering followed by non-hierarchical clustering and including the bootstrap Ward’s method, was used. Hierarchical clustering is better at finding the number of groups and the initial cluster members, whereas non-hierarchical clustering gives more accurate membership starting from the initial cluster members given by the hierarchical method.

1.
Hierarchical clustering method: Hierarchical clustering is an unsupervised method of grouping a list of items through successive merging based on similarity or successive division based on dissimilarity. The method falls into two categories, the agglomerative hierarchical method and the divisive hierarchical method. The divisive hierarchical method starts with a single group of all items and proceeds by dividing a group into two subgroups, keeping the most similar items together, until each individual item forms its own cluster. The agglomerative hierarchical method starts with each item as its own cluster and merges the most similar items into groups; these groups are then merged successively based on similarity until the similarity is low. Then, those groups with low similarity are taken as clusters. The similarity between groups or items can be measured by average linkage, nearest neighbour (single) linkage or farthest neighbour (complete) linkage between the points of the groups, or by Ward’s method. The agglomerative hierarchical algorithm is computationally more efficient (running time complexity $O(n^3 )$) than the divisive clustering algorithm (running time complexity $O(2^n)$) [15, 16]. Hence, the agglomerative hierarchical method, specifically with average linkage (average Euclidean distance) and Ward’s similarity measure, is used. As described in the introduction section, African socio-economic data have problems, so working from the characteristics of the distribution can give relevant information. Average linkage helps to control the impact of a single value, so the result will not be fully determined by a possibly misleading nearest point (as in the single linkage method) or farthest point (as in the complete linkage method), e.g., the end points of a chaining cluster.
The result of agglomerative hierarchical clustering can be presented by a two-dimensional graph called a dendrogram, or by a scatter plot of the first and second principal factors with $95\%$ confidence ellipses (which shows the proportion of variance in the data set explained by the first two components in determining clusters).
2.
Ward’s hierarchical clustering and its bootstrap extension: In this approach the focus is on minimizing the information lost due to clustering. Joining dissimilar clusters results in an inflated error sum of squares (ESS) and leads to much information loss. Hence, the merge with the smallest change in ESS results in the minimum loss of information. At the beginning each item is considered a cluster, so the ESS of the ith cluster is zero (ESS$_i$, for i = 1, 2, ..., K) and the ESS of the data set is $\sum _{i=1}^{K}ESS_i = 0$. In general, if there are L clusters, $ESS = ESS_1 + ESS_2 + \cdots+ ESS_L$, and finally, when all clusters are in one group, the error sum of squares is given as:
$$\begin{aligned} ESS= \sum _{i=1}^{K}(y_i -\overline{y})'(y_i -\overline{y}), \end{aligned}$$
where $y_i$ is the multivariate measurement associated with the ith item and $\overline{y}$ is the mean of all items. The result of multivariate clustering is expected to be roughly elliptical [13]. Now the question is with how much confidence a cluster can include the items assigned by Ward’s method, or how variable the assigned elements of a cluster are. This can be checked by creating data sets through re-sampling (either from the empirical distribution of the data or by re-sampling with replacement from the data) and clustering each data set; if the proportion of times an item is included in the same cluster is greater than or equal to the desired level of confidence, then the item is assigned to the cluster at the given confidence level.
3.
Non-hierarchical clustering method: Non-hierarchical clustering techniques are designed to group items, rather than variables, into a collection of K clusters, where K is predetermined in our case by the hierarchical clustering techniques [13]. Non-hierarchical clustering starts either from a random partition of the items into K initial clusters or from an initial set of points that will form the clusters. This paper uses the popular non-hierarchical clustering method called K-means, which starts by randomly partitioning the items into K initial clusters and then goes through the list of items, assigning each item to the cluster with the closest mean.
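The Ward’s criterion described in item 2 can be sketched as a simple agglomerative loop: at each step, merge the pair of clusters whose union increases the total ESS the least. The code below is a small, unoptimized illustration with hypothetical function names and invented two-dimensional points; it does not include the bootstrap step or the K-means refinement.

```python
def ess(cluster):
    """Error sum of squares: squared distances of the items to their cluster mean."""
    dim = len(cluster[0])
    mean = [sum(item[d] for item in cluster) / len(cluster) for d in range(dim)]
    return sum((item[d] - mean[d]) ** 2 for item in cluster for d in range(dim))


def best_ward_merge(clusters):
    """Pair of cluster indices whose merge increases total ESS the least."""
    best_pair, best_increase = None, None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            increase = ess(clusters[i] + clusters[j]) - ess(clusters[i]) - ess(clusters[j])
            if best_increase is None or increase < best_increase:
                best_pair, best_increase = (i, j), increase
    return best_pair, best_increase


def ward_clusters(items, n_clusters):
    """Start with one cluster per item and merge until n_clusters remain."""
    clusters = [[item] for item in items]
    while len(clusters) > n_clusters:
        (i, j), _ = best_ward_merge(clusters)
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]  # invented items
print(ward_clusters(points, 2))
```

A single-item cluster has ESS equal to zero, matching the starting point described in the text, and the two nearby pairs of points are merged before any merge across the pairs is considered.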

Numerical examples for missing data management

The following examples apply Lagrange interpolation, linear spline interpolation, linear interpolation and linear regression estimation to artificially created missing values in the known data of one IMF [11] variable, Exports of goods and services (% of GDP) for Algeria. These examples are also used to illustrate and compare the error trends of each method in interpolating or estimating the artificially missing values (Fig. 1).

Example 1

Estimation of an artificially missing value in the known series of data (Y_Actual), when the missing observation lies anywhere between the first and the last observation of the variable. The result is given in Table 1 and Fig. 2.

Table 1 Result for the estimates of a missing value by Lagrange interpolation, linear interpolation and linear regression estimation: in case of one missing value
