 Research
 Open Access
Bayesian zero-inflated regression model with application to under-five child mortality
Journal of Big Data volume 8, Article number: 4 (2021)
Abstract
Under-five mortality is defined as the probability that a child born alive dies between birth and the fifth birthday. Mortality under the age of five has been a primary target of public health policies and is a common indicator of mortality levels. This study therefore aimed to assess under-five child mortality and to model its determinants using a Bayesian zero-inflated regression model. A community-based cross-sectional study was conducted using the 2016 Ethiopia Demographic and Health Survey data. The sample was stratified and selected in a two-stage cluster sampling design. The Bayesian analytic approach was applied to model the mixture arrangement inherent in zero-inflated count data by using the Negative Binomial–logit hurdle model. About 71.09% of the mothers had not experienced any under-five death in their lifetime, while 28.91% had experienced the death of at least one under-five child; the data were found to have excess zeros. From the Bayesian Negative Binomial–logit hurdle model it was found that twin birth (OR = 1.56; HPD CrI 1.23, 1.94), primary and secondary education (OR = 0.68; HPD CrI 0.59, 0.79), mother’s age at first birth of 16–25 (OR = 0.83; HPD CrI 0.75, 0.92) and ≥ 26 (OR = 0.71; HPD CrI 0.52, 0.95), use of a contraceptive method (OR = 0.73; HPD CrI 0.64, 0.84) and antenatal visits during pregnancy (OR = 0.83; HPD CrI 0.75, 0.92) were statistically associated with the number of non-zero under-five deaths in Ethiopia. The Bayesian Negative Binomial–logit hurdle model is gaining popularity in data analysis over the classical Negative Binomial–logit hurdle model because the technique is more robust and precise.
Furthermore, using the Bayesian Negative Binomial–logit hurdle model helps in selecting the most significant factors: mother’s education, mother’s age, birth order, type of birth, mother’s age at first birth, use of a contraceptive method, and antenatal visits during pregnancy were the most important determinants of under-five child mortality.
Background
Under-five mortality is defined as the probability that a child born alive dies between birth and the fifth birthday. Mortality under the age of five has been a primary target of public health policies and is a common indicator of mortality levels [1]. Child mortality is a comprehensive indicator of the environmental, socioeconomic, sociocultural, and health care conditions of communities and countries [2]. It also reflects the development status of a country and its quality of life. Child mortality is used for monitoring and evaluating population and health programs and policies [3]. Most child mortality is related to socioeconomic issues, since many deaths are preventable by vaccination [4]. The highest burden of child mortality is observed in low- and middle-income countries, especially in Sub-Saharan Africa and Southeast Asia [5, 6]. The burden remains highest in Sub-Saharan Africa, where 1 in 8 children dies before age five [7]. Child mortality in Ethiopia declined from 166 to 67 and then 55.2 deaths per 1000 live births in 2000, 2016, and 2018, respectively. However, it is still the highest relative to other developing countries [8]. In Ethiopia, under-five mortality is 88 per 1000 live births in rural areas, which is higher than the 66 per 1000 in urban areas [9]. To identify the risk factors of under-five mortality, various studies have been conducted in different countries [10, 11]. Many were small-scale surveys on a specific set of variables. These studies investigated the risk factors of under-five mortality through binary logistic regression and survival analysis [12]. However, binary logistic regression undercounts total mortality, since multiple deaths per mother are collapsed into a single unit to fulfill the requirements of the binary model, and it therefore provides insufficient information for studying the pattern of multiple child deaths. In this study, a count regression model is the preferred model of analysis.
The Poisson regression model is the most common model used for the analysis of count data. One of its assumptions is that the mean and variance are equal, but most data have a variance larger than the mean, i.e. overdispersion. The negative binomial regression model is more flexible than the Poisson model and is frequently used to study count data with overdispersion [13]. However, both the Poisson and the negative binomial models have been found inadequate for handling overdispersion caused by a large number of zeros [14]. Hurdle and zero-inflated count models are the two foremost methods used to deal with count data having excessive zero counts [15]. Both provide a way of modeling the excessive proportion of zero values while allowing for overdispersion. Especially when there is a large number of zeros, these techniques fit better than Poisson or negative binomial regression models [16, 17].
Zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models are often used when a count response has far more zeros than expected under the distributional assumptions of the Poisson and negative binomial models, a situation that results in incorrect parameter estimates as well as biased standard errors [14]. Count data frequently display overdispersion and excess zeros, which motivates zero-inflated count models [18]. Zero-inflated count models offer a way of modeling the excess zeros in addition to allowing for overdispersion in a standard parametric model. The ZINB regression model addresses overdispersion in count data caused by excess zeros. The hurdle model, however, is flexible and can handle underdispersion, overdispersion, and excess zeros. In particular, a hurdle model combines a binary outcome, indicating whether the count is below or above the hurdle, with a truncated count distribution for values above the hurdle. The negative binomial–logit hurdle regression model is better than the Poisson regression model at handling overdispersion and excessive zeros [15].
Negative binomial–logit hurdle (NBLH) models are flexible models for dealing with zero-inflated and overdispersed count data [19]. Parameters can be estimated with Bayesian methods through Markov chain Monte Carlo (MCMC) simulation, which generates random draws using the Gibbs sampling algorithm. The Bayesian method is therefore more flexible for parameter estimation [20], and it was used here to estimate the NBLH model. In many cases, because of the many zeros in the dependent variable, its mean is not equal to its variance, and the Poisson model is no longer suitable for this kind of data. We therefore suggest using an NBLH regression model to overcome the problem of overdispersion [21]. This study aimed to assess the status of under-five child mortality and to model its determinants using a Bayesian zero-inflated regression model.
Method
Study design and source of data
The dataset used for this study was obtained from the 2016 Ethiopia Demographic and Health Survey (EDHS), conducted from January 18 to June 27, 2016, across the country. The survey was a population-based cross-sectional study. The 2016 EDHS sample was stratified and selected in two stages. In the first stage, a total of 645 clusters (202 urban and 443 rural) were randomly selected from the sampling strata with probability proportional to household size; in the second stage, 28 households per cluster were selected using systematic random sampling. A total of 10,641 children under age 5, born to mothers selected from the 645 clusters, were included in this study.
Variables of the study
Dependent variable
The dependent variable for this study was the number of under-five deaths per mother. An under-five death was defined as the death of a child younger than 60 months in the 5 years preceding the survey.
Independent variable
The main predictors explored for under-five mortality were grouped into demographic factors, socioeconomic factors, and utilization of maternal health services. The demographic factors are the mother’s age, birth order number, and mother’s age at first birth. The socioeconomic factors are the mother’s level of education, the mother’s place of residence, and the household wealth index. Frequency of ANC visits, use of a contraceptive method, and type of birth were the variables included under utilization of maternal health services by the mother.
Statistical method
In this study, the variable of interest was a count. When the dependent variable is a count, it is appropriate to use nonlinear models based on non-normal distributions to describe the relationship between the response variable and a set of predictor variables. For count data, the standard framework for explaining this relationship includes the Poisson, negative binomial, ZIP, ZINB, and hurdle regression models. The advanced models used for the count data in this study are the NBLH model and the Bayesian negative binomial–logit hurdle model [22].
Poisson and Negative Binomial Regression Model
Poisson regression has been widely used for fitting count data. It is traditionally conceived as the basic count model upon which a variety of other count models are based [15]. The Poisson probability mass function, with rate parameter \(\mu_i\), is given by:
\[P(Y_i = y_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots\]
where \(y_i\) is the number of under-five deaths experienced by the ith mother in a given period and \(\mu_i\) is the rate parameter; the mean and variance of the Poisson distribution are \(E(Y_i) = \mathrm{Var}(Y_i) = \mu_i\). The Poisson regression model derives from the Poisson distribution and relates \(\mu_i\), \(\beta\), and \(X_i^T\) through the log link:
\[\log(\mu_i) = X_i^T\beta\]
where \(\beta\) is the vector of coefficients of the covariates \(X_i^T\). Unfortunately, in many cases the under-five death counts have a variance greater than the mean, which is known as overdispersion. Overdispersion results from extra variation in the mean number of under-five deaths and can be caused by various factors such as model misspecification, the omission of important covariates, and excess zero counts [23]. In this case, applying a Poisson regression model to the under-five death data would underestimate the standard errors of the regression parameters. Therefore, the negative binomial model is introduced with:
\[P(Y_i = y_i) = \frac{\Gamma(y_i + \phi^{-1})}{\Gamma(y_i + 1)\,\Gamma(\phi^{-1})}\left(\frac{1}{1 + \phi\mu_i}\right)^{\phi^{-1}}\left(\frac{\phi\mu_i}{1 + \phi\mu_i}\right)^{y_i}, \quad y_i = 0, 1, 2, \ldots\]
The mean and variance of the negative binomial distribution are \(E[y \mid \mu, \phi] = \mu\) and \(V[y \mid \mu, \phi] = \mu(1 + \phi\mu)\), where \(\phi\) is the dispersion parameter (\(\phi > 0\) and \(\mu > 0\)). Special cases of the negative binomial include the Poisson (\(\phi = 0\), as a limiting case) and the geometric (\(\phi = 1\)). The method of maximum likelihood is used to estimate the parameters of the negative binomial regression model [24].
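As a rough numerical check of the moments stated above, the following Python sketch evaluates the Poisson and negative binomial probability mass functions and sums them to recover the mean and variance. The parametrization (mean μ, dispersion ϕ, variance μ(1 + ϕμ)) follows the text; the function names and the values μ = 0.5 and ϕ = 2 are illustrative assumptions, not quantities taken from the study.

```python
import math


def poisson_pmf(y, mu):
    """Poisson pmf with rate mu, computed on the log scale for stability."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def nb_pmf(y, mu, phi):
    """Negative binomial pmf with mean mu and dispersion phi > 0.

    Parametrized so that the variance is mu * (1 + phi * mu).
    """
    r = 1.0 / phi
    log_p = (
        math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
        - r * math.log1p(phi * mu)
        + y * (math.log(phi * mu) - math.log1p(phi * mu))
    )
    return math.exp(log_p)


def moments(pmf, upper=500):
    """Mean and variance of a count distribution, summing the pmf over 0..upper-1."""
    probs = [pmf(y) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


# Illustrative values only (not estimates from the EDHS data).
mu, phi = 0.5, 2.0
pois_mean, pois_var = moments(lambda y: poisson_pmf(y, mu))
nb_mean, nb_var = moments(lambda y: nb_pmf(y, mu, phi))
print(f"Poisson: mean={pois_mean:.3f}, variance={pois_var:.3f}")
print(f"NB:      mean={nb_mean:.3f}, variance={nb_var:.3f}")
```

With these values the Poisson variance equals the mean, while the negative binomial variance is μ(1 + ϕμ) = 0.5 × 2 = 1.0, which is the extra variation the dispersion parameter is meant to absorb.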
In some cases, excess zeros exist in the under-five death data and contribute to overdispersion. The NB model cannot handle overdispersion that is due to a high number of zeros. Zero-inflated models, including the ZIP and ZINB models, can be used instead. Both assume that zero counts come from two different processes: a binary process that generates excess zeros, and a count process that generates non-negative counts of under-five deaths, including zeros.
Zero-Inflated Regression Model
Poisson and negative binomial regression models fit poorly when the response variable has many zero outcomes. The ZIP regression model handles many zero outcomes more effectively than Poisson regression, and the ZINB regression model handles them more effectively than negative binomial regression [25].
Zero-Inflated Poisson and Negative Binomial Regression Model
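In ZIP regression, the counts \(Y_i\) equal 0 with probability \(p_i\) and follow a Poisson distribution with mean \(\mu_i\) with probability \(1 - p_i\), where i = 1, 2, ..., n. The ZIP model can thus be seen as a mixture of two component distributions, a zero part and a non-zero part, given by [15]:
\[P(Y_i = y_i) = \begin{cases} p_i + (1 - p_i)\,e^{-\mu_i}, & y_i = 0 \\ (1 - p_i)\,\dfrac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, & y_i = 1, 2, \ldots \end{cases}\]
The ZINB distribution is a mixture distribution assigning a mass of p to ‘extra’ zeros and a mass of (1 − p) to a negative binomial distribution, where 0 ≤ p ≤ 1. Based on the probability function of the zero-modified distribution, the probability mass function for ZINB is:
\[P(Y_i = y_i) = \begin{cases} p_i + (1 - p_i)\left(\dfrac{1}{1 + \phi\mu_i}\right)^{\phi^{-1}}, & y_i = 0 \\ (1 - p_i)\,\dfrac{\Gamma(y_i + \phi^{-1})}{\Gamma(y_i + 1)\,\Gamma(\phi^{-1})}\left(\dfrac{1}{1 + \phi\mu_i}\right)^{\phi^{-1}}\left(\dfrac{\phi\mu_i}{1 + \phi\mu_i}\right)^{y_i}, & y_i = 1, 2, \ldots \end{cases}\]
where \(\phi\), \(\mu_i\) and \(\Gamma(\cdot)\) represent the dispersion parameter, the mean, and the gamma function, respectively. Assume that there are p predictors for the logistic regression function and the negative binomial regression function. Hence, the ZIP or ZINB regression model can be written as follows:
\[\mathrm{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = Z_i^T\gamma, \qquad \log(\mu_i) = X_i^T\beta\]
where \(\beta\) is the vector of coefficients of \(X_i^T\) and \(\gamma\) is the vector of coefficients of \(Z_i^T\).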
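The two-process structure of the ZIP model can be sketched in a few lines of Python. The sketch assumes the usual logit link for the zero-inflation probability and log link for the Poisson mean; the linear predictor values `eta_zero` and `eta_count` are hypothetical stand-ins for \(Z_i^T\gamma\) and \(X_i^T\beta\), and the function names are illustrative.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def zip_pmf(y, mu, p):
    """Zero-inflated Poisson pmf: extra zeros with probability p, Poisson(mu) otherwise."""
    poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
    if y == 0:
        return p + (1.0 - p) * poisson
    return (1.0 - p) * poisson


# Hypothetical linear predictors for one mother (illustrative numbers only).
eta_zero = 0.4    # stands in for Z_i^T gamma
eta_count = -0.2  # stands in for X_i^T beta
p_i = inv_logit(eta_zero)
mu_i = math.exp(eta_count)

prob_zero = zip_pmf(0, mu_i, p_i)
total = sum(zip_pmf(y, mu_i, p_i) for y in range(200))
mean = sum(y * zip_pmf(y, mu_i, p_i) for y in range(200))
print(f"P(Y=0)={prob_zero:.4f}, total mass={total:.6f}, mean={mean:.4f}")
```

The zero probability always exceeds the plain Poisson value \(e^{-\mu_i}\), and the mean shrinks to \((1 - p_i)\mu_i\), which is how the mixture accommodates more zeros than a Poisson model with the same rate.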
Poisson and Negative Binomial logit hurdle model
A hurdle model consists of two components: a point mass at zero and a distribution that generates non-zero counts. The first component is a binary component that generates zeros and ones (here “ones” correspond to non-zero values in the data), and the second component generates non-zero values from a zero-truncated distribution. The most widely used hurdle models are those with the hurdle value at zero [4]. All zeros in the hurdle model are assumed to be “structural” zeros, i.e., they are generated from a single process and are observed because the condition is absent. We explore two zero-truncated count distributions for the hurdle model specification [22]. The hurdle model for count data can be expressed as follows for the Poisson and negative binomial distributions. We consider a Poisson hurdle regression model in which the response variable y has the distribution:
\[P(Y_i = y_i) = \begin{cases} p_i, & y_i = 0 \\ (1 - p_i)\,\dfrac{e^{-\mu_i}\mu_i^{y_i}}{(1 - e^{-\mu_i})\,y_i!}, & y_i = 1, 2, \ldots \end{cases}\]
where \(p_i\) is the probability of a zero count and \(\mu_i\) is the mean of the untruncated Poisson distribution.
A negative binomial hurdle distribution is given by:
\[P(Y_i = y_i) = \begin{cases} p_i, & y_i = 0 \\ \dfrac{1 - p_i}{1 - (1 + \phi\mu_i)^{-\phi^{-1}}}\,\dfrac{\Gamma(y_i + \phi^{-1})}{\Gamma(y_i + 1)\,\Gamma(\phi^{-1})}\left(\dfrac{1}{1 + \phi\mu_i}\right)^{\phi^{-1}}\left(\dfrac{\phi\mu_i}{1 + \phi\mu_i}\right)^{y_i}, & y_i = 1, 2, \ldots \end{cases}\]
where \(\phi\) (≥ 0) is a dispersion parameter that is assumed not to depend on covariates. The zero part and the truncated count part of the hurdle model are linked to covariates through:
\[\mathrm{logit}(p_i) = Z_i^T\gamma, \qquad \log(\mu_i) = X_i^T\beta\]
where \(\beta\) is the vector of coefficients of \(X_i^T\) and \(\gamma\) is the vector of coefficients of \(Z_i^T\). The parameter \(\phi\) is a measure of dispersion.
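To make the two-part construction concrete, here is a minimal Python sketch of a negative binomial hurdle probability mass function: a point mass at zero plus a zero-truncated negative binomial for positive counts. The zero probability 0.7109 is the observed share of mothers with no under-five death reported in this study; the values of `mu` and `phi` and all function names are hypothetical choices made only for illustration.

```python
import math


def nb_pmf(y, mu, phi):
    """Untruncated negative binomial pmf with mean mu and dispersion phi > 0."""
    r = 1.0 / phi
    log_p = (
        math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
        - r * math.log1p(phi * mu)
        + y * (math.log(phi * mu) - math.log1p(phi * mu))
    )
    return math.exp(log_p)


def nb_hurdle_pmf(y, mu, phi, p_zero):
    """Hurdle pmf: P(Y = 0) = p_zero, positive counts from a zero-truncated NB."""
    if y == 0:
        return p_zero
    truncation = 1.0 - nb_pmf(0, mu, phi)  # mass the untruncated NB puts on y >= 1
    return (1.0 - p_zero) * nb_pmf(y, mu, phi) / truncation


p_zero = 0.7109      # observed share of zeros reported in the study
mu, phi = 0.8, 1.5   # hypothetical count-part parameters
probs = [nb_hurdle_pmf(y, mu, phi, p_zero) for y in range(500)]
print(f"P(Y=0)={probs[0]:.4f}, P(Y>=1)={sum(probs[1:]):.4f}")
```

Because the zero probability is modeled on its own, the hurdle part reproduces the share of zeros regardless of the count-part parameters; `mu` and `phi` only shape how the remaining mass is spread over 1, 2, 3, and so on.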
The maximum likelihood estimation (MLE) method is used to estimate the parameters of the count models. This study fits Poisson, negative binomial, ZIP, ZINB, Poisson hurdle, and NBLH models to accommodate the excess zeros in the under-five death counts. Akaike’s information criterion (AIC) and log-likelihood values are used as model selection measures, and dispersion parameters are used to test for overdispersion. The generalized Pearson χ2 statistic, the standard measure of goodness of fit, is used to evaluate the adequacy of the models. AIC and log-likelihood are basic methods for assessing model performance and for model selection [15].
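For reference, AIC is computed as 2k − 2 log L, where k is the number of estimated parameters and log L is the maximized log-likelihood, and the model with the smallest value is preferred. A small Python sketch follows; the log-likelihoods and parameter counts are made-up placeholders and are not the values fitted in this study.

```python
def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical (log-likelihood, number of parameters) pairs, for illustration only.
candidates = {
    "Poisson": (-9100.0, 15),
    "NB": (-8900.0, 16),
    "NBLH": (-8700.0, 31),
}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

The penalty term 2k means that a two-part model such as the NBLH, which roughly doubles the number of coefficients, is only preferred when its gain in log-likelihood outweighs the extra parameters.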
Bayesian Negative Binomial–Logit Hurdle Model
The number of under-five deaths per mother is a count variable. For count data with excessive zeros, two-part models are applied. Therefore, for a better fit, an overdispersed model that incorporates excessive zeros, the negative binomial–logit hurdle (NBLH) regression model, is used. The hurdle model is flexible and can handle both underdispersion and overdispersion. The NBLH model can be used on data with either excessive zero counts in the response or, at times, too few zero counts. When there are too few zero counts, a zero-inflated model cannot be used, and the hurdle model is a good way to deal with such data [22]. The model has two parts. The first part, the zero hurdle model, models whether the dependent variable is zero; the second part, a truncated negative binomial model, models the non-zero (positive integer) values of the dependent variable [26]. The probability mass function of the negative binomial–logit hurdle model is:
\[P(Y_i = y_i) = \begin{cases} p_i, & y_i = 0 \\ \dfrac{1 - p_i}{1 - (1 + \phi\mu_i)^{-\phi^{-1}}}\,\dfrac{\Gamma(y_i + \phi^{-1})}{\Gamma(y_i + 1)\,\Gamma(\phi^{-1})}\left(\dfrac{1}{1 + \phi\mu_i}\right)^{\phi^{-1}}\left(\dfrac{\phi\mu_i}{1 + \phi\mu_i}\right)^{y_i}, & y_i = 1, 2, \ldots \end{cases}\]
where \(\phi\), \(\mu_i\), and \(\Gamma(\cdot)\) denote the dispersion parameter, the mean, and the gamma function, respectively, and \(p_i\) is the probability of a zero count. The most natural choice is to model the probability of a zero count with a zero hurdle model using the logit link function, and the positive counts with a truncated negative binomial model using the log link function:
\[\mathrm{logit}(p_i) = Z_i^T\gamma, \qquad \log(\mu_i) = X_i^T\beta\]
where \(\beta\) is the vector of coefficients of \(X_i^T\) and \(\gamma\) is the vector of coefficients of \(Z_i^T\). The parameter \(\phi\) is a measure of dispersion. When \(\phi\) = 0, the count part of the NBLH model reduces to the Poisson case. For \(\phi\) > 0, the NBLH model can be used to fit overdispersed count data. When \(\phi\) < 0, the NBLH model can be used to fit underdispersed count data. The likelihood function of the negative binomial–logit hurdle distribution is as follows:
\[L(\beta, \gamma, \phi \mid y, X) = \prod_{i:\,y_i = 0} p_i \prod_{i:\,y_i > 0} \frac{1 - p_i}{1 - (1 + \phi\mu_i)^{-\phi^{-1}}}\,\frac{\Gamma(y_i + \phi^{-1})}{\Gamma(y_i + 1)\,\Gamma(\phi^{-1})}\left(\frac{1}{1 + \phi\mu_i}\right)^{\phi^{-1}}\left(\frac{\phi\mu_i}{1 + \phi\mu_i}\right)^{y_i}\]
The first and most important step in the Bayesian approach is choosing appropriate prior distributions. Let \(\beta\) and \(\gamma\) be the sets of regression parameters of the model above. We assume independent priors for these parameters. Since no prior information is available from historical data or previous experiments, all parameters are given conjugate non-informative priors. The prior distributions for \(\beta\) and \(\gamma\) are assumed to be normal, while \(\phi\) is assumed to be gamma-distributed. The joint prior distribution for the NBLH regression parameters is therefore:
\[f(\beta, \gamma, \phi) = f(\beta)\,f(\gamma)\,f(\phi)\]
where \(\phi\) ~ Gamma(a, b) with a = 0.001 and b = 0.001. Our a priori judgment was that knowledge of the slope parameters \(\gamma\) does not provide any information about \(\beta\). Full Bayesian inference was based on the joint posterior distribution of all parameters. Markov chain Monte Carlo techniques were used to draw samples from the full conditional distributions of all parameters, which were then summarized to obtain model estimates in the posterior analysis, that is:
\[f(\beta, \gamma, \phi \mid y, X) \propto L(\beta, \gamma, \phi \mid y, X)\,f(\beta, \gamma, \phi)\]
where \(f(\beta, \gamma, \phi \mid y, X)\) is the joint posterior distribution of all parameters in the observation model, \(L(\beta, \gamma, \phi \mid y, X)\) is the likelihood of the observed data (y, X), and \(f(\beta, \gamma, \phi)\) is the joint prior distribution. The posterior distribution is difficult to derive analytically. Therefore, numerical simulation using Markov chain Monte Carlo with Gibbs sampling is used to update the parameters from given initial values and to sample the parameters once the simulation has converged. Gibbs sampling, the most commonly used of these sampling techniques, is an algorithm that generates a sequence of samples from the joint probability distribution of two or more random variables in order to approximate that joint distribution. It is applicable when the joint distribution is not known explicitly but the conditional distribution of each variable is known. The algorithm generates a draw from the distribution of each variable in turn, conditional on the current values of the other variables [27].
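The update-each-variable-in-turn idea behind Gibbs sampling can be illustrated with a toy target. The Python sketch below samples a standard bivariate normal with correlation ρ, whose full conditionals are known in closed form; it is not the NBLH posterior used in this study, and the function names, seed, and ρ = 0.6 are assumptions chosen only for illustration.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_draws, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is normal: x | y ~ N(rho * y, 1 - rho^2),
    and symmetrically for y | x.
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    x, y = 0.0, 0.0  # arbitrary initial values
    draws = []
    for t in range(burn_in + n_draws):
        x = rng.gauss(rho * y, cond_sd)  # update x given the current y
        y = rng.gauss(rho * x, cond_sd)  # update y given the new x
        if t >= burn_in:                 # discard the burn-in iterations
            draws.append((x, y))
    return draws


def sample_correlation(draws):
    """Pearson correlation of the sampled (x, y) pairs."""
    n = len(draws)
    mx = sum(x for x, _ in draws) / n
    my = sum(y for _, y in draws) / n
    sxy = sum((x - mx) * (y - my) for x, y in draws)
    sxx = sum((x - mx) ** 2 for x, _ in draws)
    syy = sum((y - my) ** 2 for _, y in draws)
    return sxy / math.sqrt(sxx * syy)


draws = gibbs_bivariate_normal(rho=0.6, n_draws=20000, seed=1)
print(round(sample_correlation(draws), 2))
```

The sampler never evaluates the joint density directly; it only uses the two conditional distributions, yet the retained draws recover the joint correlation to within simulation error.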
The convergence of the algorithm
Flexible software for Bayesian analysis of complex statistical models using MCMC methods was used to estimate the NBLH regression models. MCMC combines Markov chains with Monte Carlo estimation and eventually converges to the target distribution (the posterior distribution). If a chain has converged, the samples it produces can be treated as draws from the target distribution. MCMC is a general simulation method for sampling from posterior distributions and computing posterior quantities of interest. MCMC methods sample successively from a target distribution; each sample depends on the previous one, hence the notion of the Markov chain. The Markov chain method has been quite successful in modern Bayesian computing. Only in the simplest Bayesian models can the analytical forms of the posterior distributions be recognized and inferences summarized directly. In moderately complex models, posterior densities are too difficult to work with directly. With MCMC, it is possible to generate samples from an arbitrary posterior density and to use these samples to approximate expectations of quantities of interest. Several other aspects of the Markov chain method also contributed to its success. Most importantly, if the simulation algorithm is implemented correctly, the Markov chain is guaranteed to converge to the target distribution [28,29,30,31]. The approximation improves at each simulation step until convergence to the posterior distribution is achieved.
Appropriate diagnostics can be used to assess convergence, such as the Gelman–Rubin convergence diagnostic test, the Heidelberger–Welch stationarity test, the Heidelberger–Welch half-width test, monitoring the Monte Carlo (MC) error, checking for autocorrelation, and inspecting the trace plots.
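As an illustration of one of these diagnostics, the following Python sketch computes the basic Gelman–Rubin potential scale reduction factor for a single parameter from several equal-length chains. It uses the classic between-chain and within-chain variance formula without later refinements, and the simulated chains are hypothetical, not output from the model in this study.

```python
import math
import random


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equal-length chains of one parameter."""
    m = len(chains)
    n = len(chains[0])
    chain_means = [sum(c) / n for c in chains]
    grand_mean = sum(chain_means) / m
    # Between-chain variance B and average within-chain variance W.
    b = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    w = sum(
        sum((x - cm) ** 2 for x in c) / (n - 1)
        for c, cm in zip(chains, chain_means)
    ) / m
    pooled = (n - 1) / n * w + b / n  # pooled estimate of the posterior variance
    return math.sqrt(pooled / w)


rng = random.Random(42)
# Three hypothetical chains sampling the same target: R-hat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three chains stuck around different values: R-hat should be clearly above 1.
stuck = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (0.0, 1.0, 2.0)]
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values near 1 indicate that the chains agree with each other; values well above 1 indicate that the between-chain spread is large relative to the within-chain spread, so the chains have not yet mixed.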
Results
Information on the number of under-five deaths was obtained from a total of 10,274 women in Ethiopia. Table 1 shows the frequency and percentage distribution of the number of under-five deaths based on these women. In this study, 71.09% of them had never experienced a child death, while the remaining 28.91% had experienced at least one. This indicates that zero outcomes were large in number, whereas large counts (i.e. large numbers of under-five deaths per mother) were observed less frequently. This leads to a positively skewed distribution and indicates that the data could be fitted better by a negative binomial hurdle model, which takes excess zeros into account.
Figure 1 shows the overdispersion of the response variable. Since the histogram is highly peaked at zero, the overdispersion can be attributed to an excess of zeros.
This leads to a positively (right) skewed distribution: the number of under-five deaths has a rapidly decreasing tail and excess zeros. This indicates that the data could be fitted better by count data models that take excess zeros into account.
Test for overdispersion
In the Poisson regression analysis, the deviance and Pearson chi-square goodness-of-fit statistics indicate overdispersion (Table 2). The Pearson chi-square statistic divided by its degrees of freedom is 1.165, which is greater than one, so the goodness-of-fit statistics indicate overdispersion in the data set. Although the deviance and Pearson chi-square statistics drop considerably in the NB regression, to 7552.41 and 9939.28 respectively, significant overdispersion remains, because these values divided by the degrees of freedom should be close to one. Moreover, the ratios of the deviance and Pearson chi-square statistics to their corresponding degrees of freedom are greater than one, indicating overdispersion in the data, and the NB regression model is preferred over the Poisson model.
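The dispersion check described here, the Pearson chi-square statistic divided by its residual degrees of freedom, can be sketched in Python as follows. The toy counts and the intercept-only fit are hypothetical and only show the mechanics; they are not the EDHS data or the fitted models in Table 2.

```python
def pearson_dispersion(y, mu, n_params, variance=lambda m: m):
    """Pearson chi-square divided by residual degrees of freedom.

    `variance` maps a fitted mean to the model variance; the default is the
    Poisson variance function V(mu) = mu.
    """
    chi2 = sum((yi - mi) ** 2 / variance(mi) for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)


# Hypothetical toy counts with many zeros (not EDHS data).
y = [0, 0, 0, 0, 0, 0, 0, 1, 2, 5]
mean_y = sum(y) / len(y)
fitted = [mean_y] * len(y)  # intercept-only Poisson fit: each fitted mean is the sample mean
ratio = pearson_dispersion(y, fitted, n_params=1)
print(round(ratio, 3))
```

A ratio well above one signals overdispersion relative to the assumed variance function. Passing a negative binomial variance function such as `lambda m: m * (1 + phi * m)` for some dispersion value `phi` gives the corresponding check for the NB model.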
Figure 2a shows how well each model predicts the count values by overlaying the predicted probabilities for each under-five death category on the frequency histogram of the observed under-five mortality data. It appears that the typical regression model underpredicts the 0–5 under-five child mortality categories and overpredicts all the other categories. The plots of predicted against observed probabilities show that the Poisson and NB models underestimated the zero counts.
The zero-inflated models captured almost all zero values. Based on predicted probabilities, the differences in fit between the six models were remarkable. The Poisson and NB models do not fit the data reasonably well because of the high zero counts: the Poisson model predicted about 68% zeros and the NB model about 70%, whereas the NB hurdle, ZIP, and ZINB models predicted about 71.09%, matching the observed share of zeros (Fig. 2b, c).
Table 3 summarizes the model comparisons based on Vuong’s statistic for the six regression models explored. The ranking of the models is as follows: Poisson < negative binomial < ZINB = ZIP = NBLH = Poisson–logit hurdle. According to [32], if the p-value is larger than a prespecified significance level such as 0.05, one can conclude that the two models fit the data equally well, with no preference given to either model. If V yields a p-value smaller than 0.05, then one of the models is better. Therefore, the ZINB, ZIP, NBLH, and PLH models were retained as the best-fitting models.
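The decision rule above can be sketched in Python. This is a minimal version of the Vuong statistic, computed from per-observation log-likelihoods of two competing models and compared with a standard normal reference; it omits any correction for the number of parameters, and the log-likelihood values are hypothetical.

```python
import math


def vuong_statistic(loglik_1, loglik_2):
    """Vuong statistic from pointwise log-likelihoods of two non-nested models.

    Positive values favor model 1, negative values favor model 2.
    """
    m = [a - b for a, b in zip(loglik_1, loglik_2)]
    n = len(m)
    mean_m = sum(m) / n
    sd_m = math.sqrt(sum((mi - mean_m) ** 2 for mi in m) / n)
    return math.sqrt(n) * mean_m / sd_m


def two_sided_p_value(v):
    """Two-sided p-value of v under the standard normal reference distribution."""
    return math.erfc(abs(v) / math.sqrt(2.0))


# Hypothetical per-observation log-likelihoods (illustrative only).
ll_a = [-1.0, -2.0, -1.5, -1.0]
ll_b = [-1.5, -1.5, -2.0, -0.5]
v = vuong_statistic(ll_a, ll_b)
print(v, two_sided_p_value(v))
```

In this toy case the pointwise differences cancel out, so the statistic is zero and neither model is preferred; a statistic far from zero with a p-value below 0.05 would favor the model indicated by its sign.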
Model selection criteria
The AIC values for the Poisson (PR), negative binomial (NB), ZIP, ZINB, Poisson–logit hurdle (PLH), and negative binomial–logit hurdle (NBLH) models are given in Table 4. The AIC of the PR model was greater than those of the other regression models. The model with the smallest AIC was the NBLH.
Interpretation of Count Model coefficients (truncated negative binomial with log link)
According to the findings of this study, the household wealth index has a significant influence on the number of under-five deaths. The expected number of non-zero under-five deaths for women in rich households was 0.84 times that for women in poor households. Mother’s age had a significant positive association with under-five mortality. The expected number of non-zero under-five deaths for mothers aged 30–39 was 26.1% higher than for mothers aged 29 or younger, controlling for the other variables in the model. Likewise, the expected number for mothers aged 40–49 was 91.2% higher than for mothers aged 29 or younger, controlling for the other variables in the model.
The results also revealed that the expected number of non-zero under-five deaths for mothers who visited a health institution during pregnancy was 0.915 times that for mothers who received no antenatal care. Mother’s level of education was also a significant factor in the number of under-five deaths: the expected number of non-zero under-five deaths for mothers with primary and secondary education was 0.704 times that for mothers with no education. Contraceptive use was also found to be an important significant predictor of under-five mortality.
The expected number of non-zero under-five deaths for mothers who used contraceptives was 0.731 times that for mothers who did not. The results also showed that the expected number of non-zero under-five deaths for multiple births was 1.571 times that for single births. Regarding mother’s age at first birth, the expected number of non-zero under-five deaths for mothers aged 26 years or older was 37.9% lower than for mothers aged less than 15 years, and for mothers aged 16–25 years it was 16% lower than for mothers aged less than 15 years (see Table 5).
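Because the count part uses a log link, each exponentiated coefficient is a ratio of expected counts, and a ratio r corresponds to a percent change of (r − 1) × 100. The short Python sketch below applies this conversion to two of the ratios quoted above; the function name is illustrative.

```python
import math


def percent_change(ratio):
    """Percent change in the expected count implied by a rate ratio exp(beta)."""
    return (ratio - 1.0) * 100.0


# Rate ratios quoted in the text for the truncated count part.
for label, ratio in [("rich vs poor household", 0.84), ("multiple vs single birth", 1.571)]:
    beta = math.log(ratio)  # coefficient on the log scale
    print(f"{label}: beta={beta:.3f}, change={percent_change(ratio):+.1f}%")
```

Read this way, a ratio of 0.84 corresponds to a 16% lower expected count and a ratio of 1.571 to a 57.1% higher expected count, which is the same information expressed as "times" or as "percent" in the interpretation above.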
Interpretation of Zero hurdle model coefficients (binomial with logit link)
The zero hurdle model indicated that the estimated odds of experiencing a non-zero number of under-five deaths for women living in rural areas were 1.248 times those for women living in urban areas. In addition, under-five mortality increases with birth order: the estimated odds of non-zero under-five deaths for children of birth order 2–4 and 5+ were 3.106 and 1.024 times those for first-order births, respectively. The household wealth index also had a significant influence on under-five mortality. The estimated odds of non-zero under-five deaths for women in rich households were 0.897 times those for women in poor households, holding all other variables in the model constant.
The findings also showed that the estimated odds of non-zero under-five deaths for mothers who used contraceptives were about 0.761 times those for mothers who did not. The probability of under-five death also decreased with increasing educational level of the mother: the estimated odds of non-zero under-five deaths for mothers with primary and secondary education were 16.7% lower than for mothers with no education.
Finally, the results revealed that type of birth and age at first birth were significant factors for the likelihood of under-five death. The estimated odds of non-zero under-five deaths for children born in multiple births were 2.496 times those for children born in single births. The odds of non-zero under-five deaths for mothers whose age at first birth was 26 years or older and 16–25 years were 46.9% and 30% lower, respectively, than for mothers whose age at first birth was less than 16 years. The estimated odds of non-zero under-five deaths for mothers aged 30–39 and 40–49 were 1.25 and 2.233 times, respectively, those for mothers aged ≤ 29.
Results of Bayesian Negative Binomial–Logit Hurdle Model
Before interpreting the results of the Bayesian approach, convergence must be assessed, that is, it must be checked that the chain has converged to, and provides a representative sample from, the posterior distribution. Table 6 shows the Heidelberger and Welch stationarity tests for the Bayesian MCMC.
Time series: The time-series (trace) plot is one of the tools used to diagnose the convergence of a Bayesian analysis. The plot indicates good convergence when the three independently generated chains mix or overlap (Appendix: Fig. 3a and b). Here, the diagnostic graphs indicate that the simulation draws have reasonably converged, and we can therefore be more confident about the accuracy of posterior inference.
Interpretation of Bayesian Count Model coefficients (truncated negative binomial with log link)
The finding of the analysis, it was shown that the most effective variable on the number of underfive child death. Women ages at first birth, the estimated coefficients of age groups of women are statistically significant for the number of underfive death. The results in Table show that the age category of women has a significant impact on the number of underfive death per woman. The expected number of underfive deaths those women aged 30–39 years had decreased by 98.04% as compared to the expected number of underfive deaths in the age group < = 29 while holding all other variables in the model constant. Similarly, the expected number of underfive deaths those women aged 40–49 years had decreased by 25.3% as compared to the expected number of underfive deaths in the age group < = 29 while holding all other variables in the model constant.
The finding of this study also revealed that the mother’s level of education had a significant factor in reducing the number of underfive mortality. The expected number of underfive mortality for women with primary and secondary education was decreased by 31.96% as compared to those with no education controlling other variables in the model. Similarly, the finding of this study, the wealth index of the household has a significant influence on reducing the number of underfive mortality. The expected numbers of underfive deaths for women in the rich households were decreased by 11.57% as compared to the expected number of underfive deaths for women in the poor households while holding all other variables in the model constant.
The finding of this study also revealed that types of birth had a statistically significant impact on the number of underfive mortality. The expected numbers of underfive deaths for multiple births were increased by a factor of 1.556 as compared to the expected number of underfive mortality for the single birth while holding all other variables in the model constant. Besides, the age of mothers at first birth, the expected number of nonzero underfive deaths for mothers aged 16–25, and ≥ 26 years are decreased by 16.64 and 29.25% as compared to mothers aged ≤ 15 years. The result also revealed that the expected number of nonzero underfive death whose mothers visited the health institution during pregnancy was 0.831 times lower compared to mothers who have not received any antenatal. In addition to this, as birth order increases the underfive mortality also increases. The expected number of nonzero underfive deaths with children’s birth order 2–4 and 5 + is 35.52 and 113.86 times more than to the first order; respectively.
Contraceptive use was also found to be a significant predictor of under-five mortality. The estimated number of nonzero under-five deaths for mothers who used contraceptives was about 0.733 times that of mothers who did not.
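The percentage changes and multiplicative factors quoted above are two views of the same quantity. As a minimal sketch, assuming each reported figure is a ratio obtained by exponentiating a model coefficient (the function names are illustrative and are not taken from the original analysis), the conversion can be written as:

```python
import math


def ratio_from_coefficient(beta):
    """Exponentiate a log-link (or logit) coefficient to get a rate (or odds) ratio."""
    return math.exp(beta)


def percent_change(ratio):
    """Percent change in the expected count (or odds) implied by a ratio."""
    return (ratio - 1.0) * 100.0


# Ratios quoted in the text above; the labels are only used for printing.
reported = {
    "multiple vs. single birth": 1.556,
    "antenatal visits vs. none": 0.831,
    "contraceptive use vs. none": 0.733,
}

for label, ratio in reported.items():
    print(f"{label}: ratio {ratio:.3f} -> {percent_change(ratio):+.1f}%")
```

Under this reading, a ratio of 0.733 corresponds to an expected count about 26.7% lower, and a ratio of 1.556 to one about 55.6% higher.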
Interpretation of Bayesian Zero hurdle model coefficients
This implies that convergence and accuracy of the posterior estimates were attained and that the model was appropriate for estimating posterior statistics. Based on the results with a noninformative prior given in Table 3, and considering the credible intervals, the following variables were significant predictors of under-five child death: mother's age, education level, birth order, type of birth, mother's age at first birth, current contraceptive use, and number of antenatal visits during pregnancy. From the Bayesian zero hurdle model, we found that the odds of a nonzero number of under-five deaths for mothers who visited a health institution during pregnancy were 0.831 times lower than for mothers who received no antenatal care (OR = 0.762; HPD CrI 0.690, 0.842). Regarding maternal education, women with higher education and women with primary and secondary education were 0.810 and 0.317 times less likely, respectively, to have a nonzero number of under-five deaths than women with no education. Thus, as the level of education increases, the odds of a nonzero number of under-five deaths decrease.
Regarding the effect of the mother's age at first birth on child mortality, mothers aged 16–25 years were 0.744 times less likely to have a nonzero number of under-five deaths than mothers aged ≤ 15 years. In addition, the estimated odds of a nonzero number of under-five deaths for mothers aged ≥ 26 years were 27.02% lower than for mothers aged ≤ 15 years. Mothers who used contraceptives had lower odds of a nonzero number of under-five deaths than mothers who did not (OR = 0.756; HPD CrI 0.670, 0.851). The estimated odds of a nonzero number of under-five deaths for children born in multiple births were 2.404 times those for children born in single births. The estimated odds for women aged 30–39 years were 89.6% lower than for women aged ≤ 29 years; similarly, the estimated odds for women aged 40–49 years were 20.07% lower than for women aged ≤ 29 years. The estimated odds of a nonzero number of under-five deaths for children of birth order 2–4 and 5+ were 4.031 and 16.151 times more than for first-order births, respectively (Table 7).
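The HPD credible intervals reported above are summaries of posterior draws. A minimal sketch of how such an interval can be computed from MCMC samples follows; it assumes a unimodal posterior and uses only the Python standard library. The function names and the toy draws are hypothetical and are not the authors' code or data; only the interval (0.690, 0.842) is taken from the text.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval that contains at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    k = min(n, max(1, math.ceil(mass * n)))  # number of draws the interval must cover
    # Slide a window over k consecutive order statistics and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def excludes_one(interval):
    """True when an odds-ratio interval lies entirely below or entirely above 1."""
    lower, upper = interval
    return lower > 1.0 or upper < 1.0


# Toy draws (hypothetical) to show the mechanics of the sliding window.
toy_draws = [0, 1, 2, 3, 4, 10, 20, 30, 40, 50]
print(hpd_interval(toy_draws, mass=0.5))

# Interval quoted in the text for antenatal visits.
print(excludes_one((0.690, 0.842)))
```

An odds-ratio interval that excludes 1 is the Bayesian counterpart of the significance statements made in this section.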
Discussion
In this study, we found that the Poisson and NB models are inadequate in the presence of excess zero counts. A previous study of the performance of count models for health data concluded that the ZINB model fits overdispersed and zero-inflated response variables best. However, for data characterized by excess zeros and high variability in the nonzero outcome, the NBLH model fitted better than any of the other models. Because the NBLH model allows for overdispersion and also accommodates excess zeros, it is the most appropriate of the zero-adjusted models and was therefore selected as the most parsimonious model for predicting the number of under-five deaths in Ethiopia [33,34,35].
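To make the structure of a negative binomial–logit hurdle model concrete, the sketch below evaluates its probability mass function in a common textbook form: zeros occur with probability p_zero, and positive counts follow a zero-truncated negative binomial. This is an illustration only, not the authors' implementation. The mean `mu` and dispersion `size` are hypothetical values, and 0.7109 is the share of zero counts quoted in the paper, used here as a stand-in for p_zero. In a full regression model, p_zero and mu would depend on covariates, typically through logit and log links.

```python
import math


def nb_pmf(y, mu, size):
    """Negative binomial pmf parameterized by mean `mu` and dispersion `size`."""
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def hurdle_nb_pmf(y, p_zero, mu, size):
    """Hurdle pmf: point mass at zero plus a zero-truncated NB for y >= 1."""
    if y == 0:
        return p_zero
    truncation = 1.0 - nb_pmf(0, mu, size)  # mass the NB puts on positive counts
    return (1.0 - p_zero) * nb_pmf(y, mu, size) / truncation


# p_zero from the reported share of zeros; mu and size are hypothetical.
for y in range(5):
    print(y, round(hurdle_nb_pmf(y, p_zero=0.7109, mu=1.5, size=2.0), 4))
```

Because the zero part and the count part are separated, the zero probability can be matched exactly while the truncated negative binomial absorbs the overdispersion in the positive counts.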
The results showed a significant difference between the two approaches. Comparing the classical and Bayesian approaches, in order to better understand the determinants of the number of under-five deaths, highlights the lower standard errors of the estimated coefficients in the Bayesian Negative Binomial–Logit Hurdle model. Thus, the Bayesian Negative Binomial–Logit Hurdle model is more stable. On the other hand, the results from the Bayesian Negative Binomial–Logit Hurdle model and the classical NBLH model are difficult to compare because the two use different tools for decision-making. Moreover, when both approaches produce similar results, findings from the Bayesian NBLH model are given preference because the technique is more robust and precise than the classical NBLH. Our results also give some support to previous findings [22, 36].
In addition, the purpose of the priors was to reduce the variance of the model and thereby lead to a better model in the Bayesian approach. Based on the prior definition and the results of our analysis, we concluded that the Bayesian approach gives a better result. Findings from Bayesian and classical inference are not significantly different, which could be due to the covariates or the noninformative prior used in the model. Despite the similarity of the results, it was still difficult to compare the two approaches because classical inference relies on confidence intervals for decisions while Bayesian inference uses credible intervals. Moreover, when both techniques produce similar results, findings from the Bayesian approach are given more attention because it is more robust than the classical approach. It was also possible to assess the convergence of the models under the Bayesian approach, which could also make its results better than those of classical inference [37].
According to the results, the mother's education level was an important socioeconomic predictor of the number of under-five deaths: mortality decreases as the mother's education level increases. The higher the woman's level of education, the lower the risk of mortality. Educated mothers are likely to be better informed about factors such as antenatal care and family planning that lead to a reduction in child mortality. Similar results were obtained in previous studies [38,39,40].
Mother's age at first birth is negatively associated with child mortality: the risk of child mortality decreases as the mother's age at first birth increases. Mothers who gave birth to their first child at a younger age face a higher risk of child mortality, which is consistent with previous studies conducted in developing countries including Ethiopia, Nigeria and Bangladesh [41,42,43,44,45].
The risk of under-five death associated with multiple births was very high relative to single births. This is consistent with previous studies that linked birth type to under-five death, with multiple births associated with a higher risk of child mortality [42]. One explanation is that children from multiple births have a lower weight because of competition for nutritional intake [46]. In the current study, under-five children, including infants, from multiple births had higher odds of mortality than singletons. These findings indicate the importance of careful identification and investigation of high-risk pregnancies, including multiple pregnancies, during the prenatal period so that appropriate action can be taken.
The study found that deaths of under-five children were significantly fewer among mothers who used contraceptives than among mothers who did not [47]. Birth order was another important factor positively associated with under-five child mortality: mortality increased as birth order increased. Children of birth order five or higher (≥ 5) have been reported to experience significantly higher childhood mortality, possibly because of less care, since the mother has more children to attend to [48, 49]. Moreover, as birth order increases, the age of the mother also increases.
The results also showed that the number of under-five deaths was lower for mothers who had antenatal visits during pregnancy than for mothers who received no antenatal check. Hence, increased attendance at antenatal clinics reduced child mortality [31, 35]. According to the results, the risk of under-five mortality is higher for children of poor mothers than for children of medium and rich mothers. In this study, the AIC statistic and the predictive probability curve indicated that the hurdle negative binomial model was the best model for the number of under-five deaths, with about 71.09% zero counts. Several studies have similarly reported that the hurdle negative binomial model was the best model for count outcomes [50].
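Model selection by AIC, as used above, trades goodness of fit against the number of parameters. The sketch below shows the calculation only. The model names, log-likelihoods and parameter counts are invented placeholders and are not results from this study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


# Placeholder fits: (maximized log-likelihood, number of estimated parameters).
# These numbers are invented for illustration and are NOT results from the study.
fits = {
    "model_A": (-1050.0, 8),
    "model_B": (-1000.0, 10),
    "model_C": (-998.0, 20),
}

scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("preferred:", best)
```

In this toy comparison model_C has the highest log-likelihood, but its extra parameters are penalized, so model_B is preferred.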
Conclusion
This article considered several count data models to examine the factors associated with the number of under-five deaths, using a dataset from the 2016 EDHS in Ethiopia that is overdispersed and inflated with zeros. Six count regression models were compared in terms of AIC. The model comparison identified the NBLH model as the best fit for the observed data with excess zeros and overdispersion. In this study, we found a significant difference between the Bayesian and classical NBLH models. Comparing the two approaches, in order to better understand the socioeconomic determinants of the number of under-five deaths, highlights the lower standard errors of the estimated coefficients in the Bayesian NBLH model. Thus, the Bayesian NBLH model was more stable. Moreover, when both approaches produce similar results, findings from the Bayesian model are given preference because the technique is more robust and precise than classical statistics. Furthermore, the Bayesian Negative Binomial–logit hurdle model helps in selecting the most significant factors: mother's education, mother's age, birth order, type of birth, mother's age at first birth, contraceptive use, and antenatal visits during pregnancy were the most important determinants of under-five child mortality.
Availability of data and materials
The data that support the findings of this study are available from the Measure DHS website (www.measuredhs.com), but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Measure DHS.
Abbreviations
CrI: Credible Interval
CI: Confidence Interval
EDHS: Ethiopia Demographic and Health Survey
MLE: Maximum Likelihood Estimation
OR: Odds Ratio
NBLH: Negative Binomial–Logit Hurdle
ZIP: Zero-Inflated Poisson
ZINB: Zero-Inflated Negative Binomial
PLH: Poisson Logit Hurdle
NB: Negative Binomial
HPD: Highest Posterior Density
MCMC: Markov chain Monte Carlo
References
Garenne M, Gakusi E. Health transitions in sub-Saharan Africa: an overview of mortality trends in children under 5 years old (1950–2000). Bull World Health Organ. 2006;84:470–8.
Kumar PP, File G. Infant and child mortality in Ethiopia: a statistical analysis approach. Ethiop J Edu Sci. 2010. https://doi.org/10.4314/ejesc.v5i2.65373.
World Health Organization. World health statistics 2015. Geneva: World Health Organization; 2015.
ESPO. Infant mortality and its underlying determinants in rural Malawi. Tampere: Tampere University Press; 2002.
You D, New J, Wardlaw T. Report on Levels and trends in child mortality, the United Nations Inter-agency Group for Child Mortality Estimation. 2014.
World Health Organization. Neonatal and perinatal mortality: country, regional and global estimates. Geneva: World Health Organization; 2006.
UNICEF. UNICEF Annual Report 2010. New York: UNICEF; 2010.
CSACE. Ethiopia demographic and health survey 2016. Addis Ababa and Rockville, MD: CSA and ICF; 2016.
Fikru C, Getnet M, Shaweno T. Proximate determinants of under-five mortality in Ethiopia: using 2016 Nationwide Survey Data. Pediatric Health Med Ther. 2019;10:169.
Mekonnen D. Infant and child mortality in Ethiopia: the role of socioeconomic, demographic and biological factors in the previous five years period of 2000 and 2005. Lund: Lund University; 2011. p. 68.
Getachew Y. Survival analysis of under-five mortality of children and its associated risk factors in Ethiopia. 2016;7(213):2.
Bedada DT. Determinant of under-five child mortality in Ethiopia. Am J Theor App Stat. 2017;6(4):198–204.
Pudprommarat C, Khamkong M, Bookkamana P. Zero-inflated Poisson regression in road accidents on a major road in the north of Thailand. IRCMSA Proc. 2005:323–330.
Prasetijo J, Musa WZ. Modeling Zero–Inflated Regression of Road Accidents at Johor Federal Road F001. In MATEC web of conferences. 2016. EDP Sciences.
Hilbe JM. Negative binomial regression. Cambridge: Cambridge University Press; 2011.
Hofstetter H, et al. Modeling caries experience: advantages of the use of the hurdle model. Caries Res. 2016;50(6):517–26.
Sarul LS, Sahin S. An application of claim frequency data using zero-inflated and hurdle models in general insurance. J Business Econ Finance. 2015;4(4):732–43.
Greene WH. Accounting for excess zeros and sample selection in Poisson and negative binomial regression models; 1994.
Bhaktha N. Properties of hurdle negative binomial models for zero-inflated and overdispersed count data. Columbus, OH: The Ohio State University; 2018.
Shafira SA, Lestari D. Bayesian zero inflated negative binomial regression model for the parkinson data.
Ehsan Saffari S, Adnan R, Greene W. Hurdle negative binomial regression model with right-censored count data. SORT. 2012;36(2):181–94.
Hilbe JM, De Souza RS, Ishida EE. Bayesian models for astrophysical data: using R, JAGS, Python, and Stan. Cambridge: Cambridge University Press; 2017.
Lam K, Xue H, Cheung YB. Semiparametric analysis of zero-inflated count data. Biometrics. 2006;62(4):996–1003.
Cameron AC, Trivedi PK. Essentials of count data regression. A companion to theoretical econometrics; 2001. p. 331.
Hilbe JM. Modeling count data. Cambridge: Cambridge University Press; 2014.
Rusdiana RY, Zain I, Purnami SW. Censored Hurdle Negative Binomial Regression (Case Study: Neonatorum Tetanus Case in Indonesia). JPHCS. 2017;855(1):012039.
Gelman A, et al. Bayesian data analysis. Boca Raton: CRC Press; 2013.
Chen MH, Shao QM, Ibrahim JG. Monte Carlo methods in Bayesian computation. Berlin: Springer Science & Business Media; 2012.
Congdon P. Bayesian statistical modelling. New York: John Wiley & Sons; 2001.
Congdon P. Bayesian statistical modeling, vol 704. New York: John Wiley & Sons; 2007.
Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell. 1984;6:721–41.
Tang W, He H, Tu XM. Applied categorical and count data analysis. Boca Raton, FL: CRC Press; 2012.
Joe H, Zhu R. Generalized Poisson distribution: the property of a mixture of Poisson and comparison with negative binomial distribution. Biom J. 2005;47(2):219–29.
Gurmu S, Trivedi PK. Excess zeros in count models for recreational trips. J Business Econ Stat. 1996;14(4):469–77.
Kanmiki EW, et al. Socioeconomic and demographic determinants of under-five mortality in rural northern Ghana. BMC Int Health Human Rights. 2014;14(1):24.
Acquah HDG. Bayesian logistic regression modeling via Markov chain Monte Carlo algorithm. J Soc Dev Sci. 2013;4(4):193–7.
Gordóvil-Merino A, et al. Classical and Bayesian estimation in the logistic regression model applied to the diagnosis of child Attention Deficit Hyperactivity Disorder. Psychol Rep. 2010;106(2):519–33.
Gebresilassiea YH, Nyatanga P. Explaining interregional differentials in child mortality in rural Ethiopia: a count data decomposition analysis.
Mondal MNI, Hossain MK, Ali K. Factors influencing infant and child mortality: a case study of Rajshahi District, Bangladesh. J Human Ecol. 2009;26(1):31–9.
Dabral S, Malik SL. Demographic study of Gujjars of Delhi: VI. Factors affecting fertility, infant mortality and use of BCM. J Human Ecol. 2005;17(2):85–92.
Getiye T. Identification of risk factors and regional differentials in under-five mortality in Ethiopia using multilevel count model. 2011, Citeseer.
Gebretsadik S, Gabreyohannes E. Determinants of under-five mortality in high mortality regions of Ethiopia: an analysis of the 2011 Ethiopia Demographic and Health Survey data. Int J Population Res. 2011;2016:2016.
Bereka SG, Habtewold FG. Under-five mortality of children and its determinants in Ethiopian Somali regional state, Eastern Ethiopia. Health Sci J. 2017;11(3):1.
Yaya S, et al. Prevalence and determinants of childhood mortality in Nigeria. BMC Public Health. 2017;17(1):485.
Alam M, et al. Statistical modeling of the number of deaths of children in Bangladesh. 2014;1.
Berhie KA. Statistical analysis on the determinants of under five mortality in Ethiopia. Am J Theor App Stat. 2017;6(1):10–21.
Aheto JMK. Predictive model and determinants of under-five child mortality: evidence from the 2014 Ghana demographic and health survey. BMC Public Health. 2019;19(1):64.
Kaldewei C. Determinants of infant and under-five mortality – the case of Jordan. Technical note, February, 2010.
Adhikari R. Influence of women’s autonomy on infant mortality in Nepal. Reprod Health. 2011;8(1):7.
Kareem Y, Yusuf A. Statistical modeling of fertility experience among women of reproductive age in Nigeria. 2018;8(1):23–33.
Acknowledgements
The authors would like to acknowledge the Measure Demographic and Health Survey for providing online permission to use the Ethiopia Demographic and Health Survey 2016 data set.
Funding
No funding was received for this study.
Author information
Contributions
MSW was responsible for the formulation of the methodology. MSW and AGA made substantial contributions to the conception and design of the study, analyzed and interpreted the data, and were major contributors in writing the manuscript. MSW and AGA drafted the manuscript. Both authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
See Fig. 3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Workie, M.S., Azene, A.G. Bayesian zero-inflated regression model with application to under-five child mortality. J Big Data 8, 4 (2021). https://doi.org/10.1186/s40537-020-00389-4
Keywords
 Under-five death
 Bayesian approach
 Zero-inflated regression
 MCMC
 Ethiopia