Unlocking the potential of Naive Bayes for spatio-temporal classification: a novel approach to feature expansion
Journal of Big Data volume 11, Article number: 106 (2024)
Abstract
Prediction processes in areas ranging from climate and disease spread to disasters and air pollution rely heavily on spatial–temporal data. Understanding and forecasting the distribution patterns of disease cases and climate change phenomena has become a focal point for researchers around the world. Machine learning models for prediction can generally be classified into two groups: those based on previous patterns, such as LSTM, and those based on causal factors, such as Naive Bayes and other classifiers. The main drawback of models such as Naive Bayes is that they cannot predict future trends because they only make predictions for the present time. In this study, we propose a novel approach that makes the Naive Bayes classifier capable of predicting future classifications. The dimension of the feature matrix is expanded based on historical data from several previous time periods to obtain a long-term classification prediction model using Naive Bayes. The case studies are the prediction of the distribution of the annual number of dengue fever cases in Bandung City and the distribution of monthly rainfall in Java Island, Indonesia. Through rigorous testing, we demonstrate the effectiveness of this Time-Based Feature Expansion approach in Naive Bayes in accurately predicting the distribution of annual dengue fever cases in 30 sub-districts in Bandung City and monthly rainfall in Java Island, Indonesia, with both accuracy and F1-score reaching more than 97%.
Introduction
Prediction of future events has attracted the interest of many researchers [1,2,3] because prediction results give many parties information about future events, which is very important in preparing appropriate strategies or mitigation. Most prediction methods come from the statistical sciences, and predictions are usually made based on time and space. Predictive models are used in various fields such as health, business, climate, and transportation. Predictive modeling is a statistical technique that is commonly used to predict future behavior. The resulting solution is a data mining technique that analyzes historical and current data.
Prediction can also be implemented with a machine learning approach. In addition to prediction, machine learning can also be used to solve problems in regression, classification and clustering [4]. This method can be computationally intensive, as it involves large and complex data, so it can play an important role in solving spatial problems in various application areas, from multivariate prediction to image classification to spatial pattern detection [5,6,7]. However, prediction models obtained with machine learning are limited to predicting cases at the present time and cannot be used for future predictions. This makes prediction models based on machine learning inappropriate for predicting future events. It is therefore of great interest to develop a prediction model based on machine learning that can be used to predict future events.
Time-feature-based learning, which represents and analyzes the properties of time elements, including time series properties such as trend, seasonality, and stationarity [3, 5, 8,9,10,11], has applied machine learning methods to various spatial-time data, namely in the fields of geology, epidemiology, health care, climate science, environmental science, precision agriculture, neuroscience, social media, etc.
Several studies [2, 3, 12, 13] applied machine learning methods to spatial-time data with the aim of making predictions for the future. But in all of these studies there is no link or continuity between the machine learning model and the prediction process. Prediction system performance is measured based on classifier model accuracy, while the prediction process uses linear regression whose independent variables are the classification results and time, without involving the features. In these studies, the classification methods used are Artificial Neural Network, Naïve Bayes, K-Nearest Neighbor, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest. The studies in [2, 3] examined classification predictions on spatial-time data using simple linear regression, with predictive time periods of 100 years and 6 days, respectively. In both studies, before predicting using regression, a classification study was carried out using machine learning and Random Forest. The classification model obtained by machine learning uses all the features in the training data, while in the prediction of the target data the regression method involves only the time feature. The regression results predicted for the future therefore do not contain the features used in the earlier machine learning models. Of all the classification methods used, Naïve Bayes is the simplest. Naïve Bayes is one of the most popular classification algorithms, simple but very practical. Its efficiency comes from the assumption of feature independence, although this assumption may be met in some studies using big data [14, 15]. The Naïve Bayes classification method has the advantage of adjusting parameter freedom and is more robust. Naïve Bayes is also still quite reliable for use in small data sets [16].
For prediction purposes, the Naïve Bayes classification method is adaptive for feature weighting and makes feature selection easier, simpler and more efficient [16,17,18]. Empirical results show that the Naive Bayes selection method achieves high classification accuracy [15]. When compared to other feature selection approaches, Naïve Bayes obtains more competitive results regarding accuracy, sparsity, and time for balanced data sets. For class-imbalanced data sets, Naive Bayes still works well, with different levels of classification for different classes still achievable [19]. Meanwhile, [20] uses the Naïve Bayes Classifier with Maximal Time Series Motif to classify ECG abnormalities, with features such as record numbers and discrete sequences. Time Series Motif Detection is proposed as feature extraction, and when combined with the NB classifier it achieves 98% accuracy, outperforming feature selection with a classifier. In Electroencephalogram (EEG) classification research [12] and urban waterlogging classification [21], the use of feature extraction and the weighted Naive Bayes method on spatial-time data has been shown to provide high accuracy. Another technique to increase the accuracy of Naive Bayes is to reduce the problem of interdependence between its features [22].
Based on previous studies, it can be concluded that the use of classifier methods such as Naive Bayes in future predictions has never been done fully based on the training model. In this study, the process of developing the feature expansion method is carried out and combined with the Naive Bayes classifier so that the obtained classifier training model can be used to predict future class events.
The main contributions of this research are as follows: (1) a feature expansion method based on spatial-time data that can be used to build a classification prediction model for some time in the future based on the Naive Bayes classifier, (2) a prediction model for the number of annual dengue fever cases based on the Naive Bayes classifier, and (3) a monthly rainfall level prediction model based on the Naive Bayes classifier. The baseline used in this research is prediction model research using regression on the same or similar data. The data used in this research are the number of dengue fever cases and rainfall; both are spatial-time data and have been used as data sets in previous studies [23,24,25]. Prediction of dengue case classes using Naive Bayes in [23] provided an accuracy of 74%, while the use of a Voting-based Hybrid method for Naïve Bayes, K-Nearest Neighbor, and Artificial Neural Networks resulted in an accuracy of 90%. Meanwhile, two studies have used the rainfall dataset. Prediction of rainfall levels using Logistic Regression [24] resulted in an accuracy of 72%, while the use of Naive Bayes and hybrid Naive Bayes-C4.5 methods [25] resulted in accuracies of 52.98% and 64.95%, respectively.
Related work
Feature expansion
In recent years, the idea of feature engineering has confirmed the outstanding performance of machine learning techniques, which can automate several applications. Feature engineering techniques such as feature extraction, feature selection and feature expansion are often applied to machine learning classification [26]. Feature extraction is the process of selecting the best subset of features from the overall feature set [27]. This approach is the creation of a new feature, which is also known as feature construction. Feature expansion, in contrast, combines additional features from the input data, capturing the different relationships between the original features of two objects. The goal is to extend the original vector or form a new feature, which is related to the distance from each data sample to the number of centroids found by the clustering algorithm. This approach is usually applied to pattern recognition problems [28, 29], and in certain contexts, such as sentence retrieval, intrusion detection and sentiment analysis. Research [28] discusses classification with a combination of feature extraction and feature expansion, performing a transformation of the original features that can increase the classification similarity score independently.
In the feature expansion technique for the classification of one-dimensional time series data proposed by [30], extended features can include temporal, frequency, and statistical characteristics. The study stated that the classification accuracy obtained was higher than the results of conventional machine learning. Feature expansion techniques allow the classifier to consider multiple dimensions that are not feasible in low-dimensional data. Feature expansion works by taking features in the original data and transforming them, then adding additional dimensions, to see if there is an increase in the accuracy of the resulting hyperplane [31]. With feature expansion it is possible to use linear classifiers on some data by creating new features in new dimensions.
Regarding the possibility of irrelevant features appearing in the data set, feature engineering also requires substantial manual effort in designing and selecting features [32]. According to [33,34,35], feature selection provides an effective way, by removing irrelevant and redundant data. Feature selection is the process of selecting certain features that are considered the most influential in the classification process.
Impact of feature expansion
In some cases, [36] states that merging all features into one feature space does not guarantee optimal performance because of dimensionality problems. To overcome these problems, a variational Bayesian approach to the multinomial probit model is used for base expansion and kernel combinations. This model has a solid foundation in a hierarchical Bayesian framework and is capable of instructively combining available sources of information for multinomial classification. Meanwhile, [37] discusses adding features to the classification process with machine learning to increase the accuracy or precision of the original data classification results. The scenario used is to compare the accuracy of the addition of 2-dimensional and 4-dimensional features in the classification process. Five classification methods are used, namely CART, RF, Gradient Boosting, SVM, and Logistic Regression, each with and without feature selection. The result obtained is that classification with the addition of feature dimensions can increase the F1-score. The other result is that classification using feature selection apparently cannot increase the F1-score, although in other research [33,34,35] feature selection can increase accuracy and the validity of extracted information and reduce computational costs (processing time). Adding features automatically adds dimensions, so it is necessary to pay attention to the balance between the problem of dimensions and the addition of new features. The simplest way to add a feature is to add the degree and logarithm function of the original feature. This process has several stages that can increase the degree of new features or the number of multipliers in new features [32]. Meanwhile, to reduce the dimensions, principal component analysis is implemented after the feature addition procedure.
The consequence of applying feature expansion to the classification process is the addition of feature dimensions. High-dimensional data analysis is a challenge in the fields of machine learning and data mining. Complex multidimensional data usually has four types of features [34], namely high-weighted features, moderately weighted features, less-weighted features, and zero-weighted features. With regard to these types of features, feature engineering requires substantial manual effort in designing and selecting features [32].
Naive Bayes classifier
Naive Bayes learning refers to the Bayesian probabilistic model that determines the posterior class probability, namely \(P\left({y}_{j}\mid {x}_{i}\right)\). The simple Naïve Bayes classifier uses these probabilities to assign a sample to a class [37,38,39]. The Bayes theorem is
\[P\left({y}_{j}\mid {x}_{i}\right)=\frac{P\left({x}_{i}\mid {y}_{j}\right)P\left({y}_{j}\right)}{P\left({x}_{i}\right)} \tag{1}\]
where \({x}_{i}\): feature at \(i\); \({y}_{j}\): class at \(j\); \(P\left({y}_{j}\mid {x}_{i}\right)\): probability of event \({y}_{j}\) given \({x}_{i}\) has occurred; \(P\left({x}_{i}\mid {y}_{j}\right)\): probability of event \({x}_{i}\) given \({y}_{j}\) has occurred; \(P\left({y}_{j}\right)\): probability of event \({y}_{j}\); \(P\left({x}_{i}\right)\): probability of event \({x}_{i}\).
Given the feature set \({x}_{i}={x}_{1}, {x}_{2}, {x}_{3}, \dots , {x}_{n}\) and the classes \({y}_{j}={y}_{1}, {y}_{2}, {y}_{3}, \dots , {y}_{m}\), the relationship between class \({y}_{j}\) and attribute \({x}_{i}\) can be described [40] as shown in Fig. 1.
Based on the naive Bayes network structure in Fig. 1, Eq. (1) can be developed into
\[P\left({y}_{j}\mid {x}_{1},\dots ,{x}_{n}\right)=\frac{P\left({y}_{j}\right)\prod_{i=1}^{n}P\left({x}_{i}\mid {y}_{j}\right)}{P\left({x}_{1},\dots ,{x}_{n}\right)} \tag{2}\]
The denominator in Eq. (2) does not depend on the target class, but acts as a scaling factor ensuring that the posterior probabilities \(P\left({y}_{j}\mid {x}_{i}\right)\) are properly scaled. The maximum posterior rule can therefore be used, namely assigning each instance to exactly one class by simply calculating the value of the numerator for each class and then selecting the class with the maximum value [38]. The selected class is referred to as the Maximum A Posteriori (MAP) class, with the following formula.
\[\widehat{y}=\underset{{y}_{j}}{\mathrm{arg\,max}}\;P\left({y}_{j}\right)\prod_{i=1}^{n}P\left({x}_{i}\mid {y}_{j}\right) \tag{3}\]
Maximum A Posteriori (MAP) estimation can also be used to estimate \(P(y)\) and \(P\left({x}_{i}\mid y\right)\) [37, 39].
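As an illustration of the MAP rule, the following minimal Python sketch estimates the class priors and conditional probabilities from categorical data and picks the class with the largest score. The function names, the toy data, and the Laplace smoothing constant `alpha` are assumptions made for this example and do not come from the study.

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels, alpha=1.0):
    """Count classes and per-class feature values (alpha is an assumed smoothing constant)."""
    class_counts = Counter(labels)
    # value_counts[(i, y)][v] = number of class-y rows whose feature i equals v
    value_counts = defaultdict(Counter)
    feature_values = [set() for _ in rows[0]]
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
            feature_values[i].add(v)
    return class_counts, value_counts, feature_values, alpha

def predict_map(model, row):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    class_counts, value_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for y, n_y in class_counts.items():
        score = math.log(n_y / total)  # log prior
        for i, v in enumerate(row):
            num = value_counts[(i, y)][v] + alpha
            den = n_y + alpha * len(feature_values[i])
            score += math.log(num / den)  # smoothed log likelihood
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy data (hypothetical): two categorical features, two classes.
rows = [("high", "wet"), ("high", "wet"), ("low", "dry"), ("low", "wet")]
labels = ["rain", "rain", "clear", "clear"]
model = fit_naive_bayes(rows, labels)
print(predict_map(model, ("high", "wet")))
```

Working in log space avoids numerical underflow when many conditional probabilities are multiplied; the denominator of the posterior is omitted because it is the same for every class.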
The proposed methods
Basically, the machine learning methods used in previous research are limited to predicting classification at the present time. Likewise, the feature expansion method used in previous research was limited to feature expansion for text data. Research on developing the Naive Bayes classification model has so far only been used to predict classification at the present time. Several previous studies carried out a prediction process for some time in the future using a linear regression model with time as the independent variable, after first carrying out classification using machine learning methods. To date, no research on classification prediction models has developed a prediction model for the future classification of spatial-time data directly using a time-based Naive Bayes algorithm.
Therefore, in this research, a Naive Bayes classification prediction model was developed for the future with the scenario of expanding time-based features on spatial-time data, taking into account the stationarity of feature data over time. The proposed procedure is the development of a classification prediction model for time \(t+k\), with the identification of a classification prediction model for time \(t-k\), namely predicting the target class \({Y}_{t}\) using a combination of features from the previous times \(t-k\). The best combination among the previous \(t-k\) classification prediction models, namely the model with the best accuracy, is selected as a candidate classification prediction model for the future time \(t+k\). The data matrix development framework with extended time-based features, the Naive Bayes classification model with expanded time-based features and its architecture will be explained in the algorithm and Fig. 2.
Experiment
This section describes the development of a feature expansion model based on the previous \(t-k\) times, the development of the Naive Bayes Time-Based Feature Expansion model, and experiments using Naive Bayes Time-Based Feature Expansion to build a classification prediction model for some future time. This study used two data sets, namely the Dengue Hemorrhagic Fever and Rainfall data sets, which are secondary data from the Bandung City Health Office and the Meteorology, Climatology and Geophysics Agency (BMKG). These two data sets are used as examples of implementing a time-based classification prediction model to improve the performance of Naïve Bayes classification, which so far has not been able to predict classification in the future. Both data sets share the characteristics needed to test the proposed model. They have been used in previous studies and have time series and spatial characteristics, so they can be used to build time-based classification prediction models.
Time-based feature expansion process
The feature expansion technique developed in this research increases the feature dimension (N) used in the Naïve Bayes model by a factor of k (which depends on the time characteristics of the data). This is very different from traditional feature selection techniques, which produce at most N-dimensional features. Feature expansion in this research is carried out to obtain a feature matrix as input to Naïve Bayes classification as a t + k classification prediction model. Feature selection is still carried out in this study, but it is applied to the expanded feature matrix when determining the combination of features that produces the best classification performance for the t + k prediction model.
In performing feature expansion, one must ensure that the resulting output matches the matrix input format of the classification process using machine learning methods. The feature expansion process has two stages. The first stage determines the shape of the feature matrix for the standard classifier model. The second stage develops new features based on the feature matrix of the first stage. The feature matrix is developed by expanding the feature column partition based on time k from the previous target class.
Table 1 shows that to classify data at a given time, n features from the original dataset at that same time are needed. There are k classifier models for different times that can be obtained from this dataset. In the first scenario, all data from all times are merged and one classifier model for all times is obtained from this dataset.
Table 2 shows that the feature matrix will have a maximum size if it is used to predict the next single period. For example, if the dataset is monthly data, then the size of the matrix will reach the maximum if it is used to predict events in the \(t+k\) month later.
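The column expansion described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `expand_features`, the dictionary layout keyed by `(location, period)`, and the toy values are all assumptions introduced here.

```python
def expand_features(features, targets, k):
    """Build rows whose inputs are the features of the k previous periods.

    features: dict mapping (location, period) -> list of n feature values
    targets:  dict mapping (location, period) -> class label
    Returns (X, y, keys); each row of X has n * k columns ordered from
    lag 1 (period t-1) to lag k (period t-k).
    """
    X, y, keys = [], [], []
    for (loc, t), label in sorted(targets.items()):
        lagged = []
        for lag in range(1, k + 1):
            past = features.get((loc, t - lag))
            if past is None:  # not enough history for this target period
                lagged = None
                break
            lagged.extend(past)
        if lagged is not None:
            X.append(lagged)
            y.append(label)
            keys.append((loc, t))
    return X, y, keys

# Toy data (hypothetical): one location "A", two features per year.
features = {
    ("A", 2015): [1.0, 10.0],
    ("A", 2016): [2.0, 20.0],
    ("A", 2017): [3.0, 30.0],
}
targets = {("A", 2016): "low", ("A", 2017): "moderate", ("A", 2018): "high"}

X, y, keys = expand_features(features, targets, k=2)
for key, row, label in zip(keys, X, y):
    print(key, row, label)
```

With n original features and k lags, every retained row has n × k columns. Target periods without a full k-period history are dropped, which is one possible design choice; imputing the missing lags would be another.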
Naïve Bayes Time-Based Feature Expansion
The data set consists of features \({x}_{i}={x}_{1}, {x}_{2}, {x}_{3}, \dots , {x}_{n}\) and classes \({y}_{j}={y}_{1}, {y}_{2}, {y}_{3}, \dots , {y}_{m}\). A machine learning prediction model generates a numerical score for each feature, so that it can quantify the degree of membership in class \({y}_{j}\). If the dataset only consists of positive and negative classes, then the predictive model can be used as a classifier.
The model in Eq. (2) can only be used as a classification model for the present moment, and cannot be used as a classification model for the future. The contribution of this research is therefore to develop a classification model that can be used for classification in the future if the features are known several years before. The classification prediction model is developed by extending the features based on time.
If the features and records of the data set are expanded over time, there will be an expansion of the dimensions. For example, for attributes based on time \({x}_{i(t-k)}\), where \(t=1, 2, 3, 4, \dots , k\), \(i=1, 2, 3, 4, \dots , n\), and \(j=1, 2, 3, 4, \dots , m\), then
\[\begin{aligned}{x}_{i(t-1)}&={x}_{1(t-1)}, {x}_{2(t-1)}, {x}_{3(t-1)},\dots , {x}_{n(t-1)}\\ {x}_{i(t-2)}&={x}_{1(t-2)}, {x}_{2(t-2)}, {x}_{3(t-2)},\dots , {x}_{n(t-2)}\\ &\;\;\vdots \\ {x}_{i(t-k)}&={x}_{1(t-k)}, {x}_{2(t-k)}, {x}_{3(t-k)},\dots , {x}_{n(t-k)}\end{aligned}\]
and the class of the data set is \({y}_{jt}={y}_{1t}, {y}_{2t}, {y}_{3t}, \dots , {y}_{mt}\). It is assumed that \({x}_{i(t-k)}\) and \({y}_{jt}\) are stationary; then, analogously, the relationship between class \({y}_{jt}\) and attribute \({x}_{i(t-k)}\) can be described as in Fig. 3.
Analogous to the translation of Eq. (1) into Eq. (2), the prediction model of Naïve Bayes classification based on time is a model development obtained by expanding the features and records based on the \(t-k\) times before the target time, and is explained in Eq. (4).
By using the joint probability of the numerator of Eq. (4) and describing Eq. (4) as follows
In addition to fulfilling the stationary assumption, in Eq. (5) it must also be assumed that each feature is independent of each other. So the equation can be written in the form
Analogous to Eq. (3), where the denominator in (4) does not depend on the target class, but acts as a scaling factor, so that the maximum posterior rule which produces the Maximum A Posteriori (MAP) class can be expanded as follows
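One way to write out the steps described for Eqs. (4) to (7) is sketched below, assuming that the expanded feature vector collects all \(n\) features at lags \(1\) to \(k\). The lag index \(l\) and the shorthand \(\mathbf{x}\) are introduced here for illustration only, and the exact form used in the original equations may differ.

```latex
\[
\begin{aligned}
\mathbf{x} &= \left(x_{1(t-1)},\dots,x_{n(t-1)},\;\dots,\;x_{1(t-k)},\dots,x_{n(t-k)}\right)\\[4pt]
P\left(y_{jt}\mid \mathbf{x}\right) &= \frac{P\left(\mathbf{x}\mid y_{jt}\right)P\left(y_{jt}\right)}{P\left(\mathbf{x}\right)}
  && \text{(Bayes theorem on the expanded features)}\\[4pt]
P\left(\mathbf{x}\mid y_{jt}\right)P\left(y_{jt}\right) &= P\left(y_{jt},\,x_{1(t-1)},\dots,x_{n(t-k)}\right)
  && \text{(numerator as a joint probability)}\\[4pt]
P\left(y_{jt}\mid \mathbf{x}\right) &= \frac{P\left(y_{jt}\right)\prod_{l=1}^{k}\prod_{i=1}^{n}P\left(x_{i(t-l)}\mid y_{jt}\right)}{P\left(\mathbf{x}\right)}
  && \text{(feature independence)}\\[4pt]
\widehat{y}_{t} &= \underset{y_{jt}}{\mathrm{arg\,max}}\;P\left(y_{jt}\right)\prod_{l=1}^{k}\prod_{i=1}^{n}P\left(x_{i(t-l)}\mid y_{jt}\right)
  && \text{(MAP class)}
\end{aligned}
\]
```

The only structural change relative to Eqs. (2) and (3) is the additional product over lags, which reflects the \(n \times k\) columns of the expanded feature matrix.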
If the features used are continuous, then the development of a classification prediction model based on time is also carried out on the normal distribution model or Gaussian distribution [41, 42] namely
where \(\mu\) is the mean of all attributes and \(\sigma\) is the standard deviation.
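For continuous features, the Gaussian variant can be sketched as follows. This is a minimal pure-Python illustration under assumptions: the function names, the small variance floor, and the toy rows (two hypothetical features observed at lags 1 and 2) are not taken from the study.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-column mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Population variance plus a small floor (an assumption) to avoid division by zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def gaussian_log_pdf(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gaussian_nb(model, row):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            gaussian_log_pdf(x, m, v) for x, m, v in zip(row, means, variances)
        )
    return max(model, key=score)

# Toy expanded rows (hypothetical): [feat1(t-1), feat2(t-1), feat1(t-2), feat2(t-2)]
X_train = [[80, 25, 78, 26], [82, 24, 81, 25], [60, 30, 62, 31], [58, 31, 61, 30]]
y_train = ["high", "high", "low", "low"]
gnb = fit_gaussian_nb(X_train, y_train)
print(predict_gaussian_nb(gnb, [79, 25, 80, 25]))
```

Each lagged column is treated as its own feature with its own mean and variance per class, which is what the independence assumption implies once the feature matrix has been expanded.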
Data set
The DHF data set is a collection of data on the number of DHF cases from 30 sub-districts in Bandung City, West Java, Indonesia from 2012 to 2018. The features used as factors that accelerate the spread of dengue cases are rainfall (mm), humidity (C), temperature (C), altitude (m above sea level), number of people who are male, total population, number of people who do not have an education certificate, number of people with elementary school education, number of people with high school education, number of people with education high school, and number of people with undergraduate education. Table 3 explains the class index rate (IR) labeling for the number of DHF cases. IR is categorized as low if the number of cases is less than 55 per 100,000 population, moderate if the number of cases is in the range of 55 to 100 per 100,000 population, and high if the number of cases is more than 100 per 100,000 population.
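The IR labeling rule stated above can be expressed as a short Python function. The function names and the example counts are hypothetical; only the thresholds (55 and 100 cases per 100,000 population) come from the text.

```python
def cases_per_100k(cases, population):
    """Number of cases per 100,000 population."""
    return cases * 100_000 / population

def label_ir(cases, population):
    """Map a case count to the low / moderate / high IR class."""
    rate = cases_per_100k(cases, population)
    if rate < 55:
        return "low"
    if rate <= 100:
        return "moderate"
    return "high"

# Hypothetical sub-district: 120 cases among 80,000 residents.
print(label_ir(120, 80_000))
```

The boundaries are taken as inclusive for the moderate class (55 and 100 both count as moderate), following the wording of the paragraph.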
The rainfall dataset in this study was obtained from the Meteorology, Climatology and Geophysics Agency at 28 location points on the island of Java, for the period from June 2021 to March 2022. The features used are the percentage of humidity (%), the length of time the sun shines (hours), wind direction (°), maximum speed, average wind speed (m/s), maximum temperature, minimum temperature, and average temperature. The data set in this study is labeled into 3 classes of rainfall (RR) that fall to the surface. Table 4 describes the distribution of data categories for class labeling.
After the labeling process is carried out, the data is transformed into time-series data, and a classification training data model is developed to predict the class distribution of DHF and rainfall. Model development is carried out in two scenarios. The first scenario develops a model with feature column expansion based on the previous amount of time used to predict the target. The second scenario develops a model by expanding both the rows of records and the columns of features based on the previous amount of time. The purpose of these scenarios is to form a data set with new features based on time and on a combination of time and records. This affects the formation of input data, which is made by providing a time range as input and as a boundary between training data and test data. The time-based features used as inputs in the annual or monthly classification prediction models are expanded based on the time range and characteristics of the data set.
The next process is data separation, which divides the time series data into several parts. It aims to form several data models before they are implemented in the Naive Bayes classification method. Data is separated by dividing all data into several models according to the time range and characteristics of the data set. Meanwhile, feature selection in this study is used to solve multicollinearity problems, that is, conditions where several variables are correlated, and to remove some irrelevant features. Feature selection can reduce feature dimensions, thereby saving resources for storing and processing data and increasing the interpretability of the selected features.
The feature selection method is used to select a combination of features that affect the target column to be predicted. By selecting the relevant features, it will reduce the time complexity and can provide good accuracy for the system [39, 40].
Meanwhile, the implementation of the Naive Bayes classification method is applied to time series data with new features formed from the feature expansion process. In the data set with the new feature, each record contains the location id and the target Cases class of DHF in the DHF data set. Meanwhile, in the rainfall data, the target class is RR.
The two datasets used in this research are characterized as imbalanced data. In the DHF dataset, the "High" class has a proportion of 75%, much larger than the other two classes. In the Rainfall dataset, the "light_rain" class also has a proportion of 88%, much larger than the other 2 classes. To handle the imbalanced data, an oversampling technique (SMOTE) is applied at the preprocessing stage. Figures 4 and 5 describe the class proportions of the DHF and Rainfall data sets.
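The idea behind SMOTE-style oversampling can be illustrated with the simplified sketch below. It is not the full SMOTE algorithm: it interpolates only toward the single nearest same-class neighbour rather than a random one of several neighbours, and the function name, seed, and toy data are assumptions made for this example.

```python
import random

def smote_like_oversample(X, y, seed=0):
    """Balance classes by interpolating between same-class neighbours."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            if others:
                # Nearest same-class neighbour by squared Euclidean distance.
                neighbour = min(
                    others,
                    key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
                )
            else:
                neighbour = base  # single-sample class: the sample is duplicated
            gap = rng.random()
            X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced data (hypothetical): 2 "low" rows versus 4 "high" rows.
X_raw = [[1.0, 1.0], [2.0, 3.0], [5.0, 5.0], [6.0, 5.0], [5.5, 6.0], [6.5, 6.5]]
y_raw = ["low", "low", "high", "high", "high", "high"]
X_bal, y_bal = smote_like_oversample(X_raw, y_raw)
counts = {label: y_bal.count(label) for label in set(y_bal)}
print(counts)
```

Synthetic rows lie on the line segment between two existing minority samples, so they stay inside the region the minority class already occupies. Oversampling would normally be applied to the training split only, so that test data remain untouched.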
Performance of Naïve Bayes Time-Based Feature Expansion
The prediction case in this research is classification-based prediction. The focus of the classification is not tied to the prediction result of a particular class, so the accuracy metric, which is also often used in other data mining research, is quite appropriate in this case. However, because the data used is imbalanced, the F1-Score is also added in this research to complement the model performance measurement. Evaluation of the performance of the feature expansion scenario in developing a classification prediction model based on the previous \(t-k\) times is based on the values of classification accuracy and F1-Score. Accuracy is a measure that describes the system's performance in producing correct predictions. Classification accuracy in this study is calculated with a multiclass confusion matrix, because the number of target classes is more than two. The multiclass confusion matrix is described in Table 5 and has dimensions \(N\times N\), where \(N\) is the number of different class labels \({C}_{1}, {C}_{2}, \dots , {C}_{N}\). Equations (9) and (10) are the formulas for calculating classification accuracy and F1-Score based on a multiclass confusion matrix [41].
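Both metrics can be computed directly from a multiclass confusion matrix, as in the sketch below. Two choices here are assumptions rather than details confirmed by the text: rows are taken as actual classes and columns as predicted classes, and the per-class F1 values are macro-averaged. The matrix values are made up for illustration.

```python
def accuracy(cm):
    """Share of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, actually another class
        fn = sum(cm[c]) - tp                       # actually c, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n

# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
print(round(accuracy(cm), 4), round(macro_f1(cm), 4))
```

Macro-averaging gives every class the same weight regardless of its size, which is why it complements accuracy on imbalanced data.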
To test the effect of the number of feature combinations and of the classification prediction model of the Naive Bayes Time-Based Feature Expansion method, an experiment is carried out to examine the performance of the classification prediction model for each scenario and the number of the best combinations of features. Comparisons were made using two-way analysis of variance [42], namely based on the model and the number of features. The hypotheses are defined as follows: the null hypothesis (\({H}_{0}\)) is that all scenarios give the same response, and the alternative hypothesis (\({H}_{1}\)) is that at least one pair of scenarios gives different responses. Table 6 is a variance analysis table to test the significance of the influence of the number of features in the classification prediction model based on time.
The efficiency of developing a prediction model for the previous \(t-k\) classification in predicting the future \(t+k\) classification is based on a comparison between the prediction accuracy using feature expansion based on time and the linear regression prediction model described in Eq. (11) [43, 44].
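A regression baseline of this kind can be sketched as follows, assuming a simple least-squares line fitted on time with the classes encoded as ordered integers. The encoding (0 = low, 1 = moderate, 2 = high), the rounding step, the function names, and the toy series are assumptions made for this example and may differ from Eq. (11).

```python
def fit_simple_regression(times, values):
    """Ordinary least squares for value = intercept + slope * time."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return intercept, slope

def predict_class_index(intercept, slope, t, n_classes):
    """Round the fitted value and clip it to a valid class index."""
    raw = intercept + slope * t
    return min(n_classes - 1, max(0, round(raw)))

# Hypothetical yearly class indices for one location (0=low, 1=moderate, 2=high).
years = [2012, 2013, 2014, 2015, 2016, 2017]
classes = [0, 0, 1, 1, 2, 2]
intercept, slope = fit_simple_regression(years, classes)
print(predict_class_index(intercept, slope, 2018, n_classes=3))
```

Because time is the only input, this baseline extrapolates the class trend without using any of the causal features, which is the limitation the feature expansion approach is meant to address.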
Visualization of classification prediction
The time-based classification prediction model developed with Naive Bayes Time-Based Feature Expansion is implemented on spatial-time data. The time factor in the data is used in the feature expansion process to develop a classification prediction model based on the previous \(t-k\) times. Spatial factors are used to visualize the prediction results of the future \(t+k\) classification on the spatial location map.
Spatial-time data is measurement data that contains location and time information [45, 46]. This spatial-time data becomes input in the estimation process. Let \({S}_{i}\), where \(i=1, 2, 3, \dots , n\), be a location with coordinates \(({x}_{i}, {y}_{i})\). Then \({Y}_{t}({S}_{i})\) is the prediction data for the classification class \({Y}_{t}\) at location or coordinates \({S}_{i}\). Spatial data is a dependent data model, because spatial data is collected from different spatial locations, which indicates a dependency between measurement data and location. The semivariogram in spatial-time analysis is a tool for measuring variability in distance, direction and time. The semivariogram model is an empirical model obtained from data, and this model is used to estimate the target class of a location. This estimation procedure is called kriging interpolation.
Several theoretical semivariogram models used to fit experimental semivariogram models are Nugget Effect, Spherical, Exponential, Gaussian and Linear. The kriging interpolation method used in this research is ordinary kriging. This method is used to visualize the classification prediction results at each location.
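\[\widehat{{Y}_{t}}\left({S}_{0}\right)=\sum_{b}{\omega }_{b}^{OK}\,{Y}_{t}({S}_{b})\]
where \(\widehat{{Y}_{t}}\left({S}_{0}\right)\) is the estimated classification class at point \({S}_{0}\), \({\omega }_{b}^{OK}\) is the data weight value from the ordinary kriging (OK) system, and \({Y}_{t}({S}_{b})\) is the classification class at the sample location.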
Experiment result
This section describes the results of implementing the data set on the system developed in this study. The results of selection and implementation of the Naive Bayes classification method with new features resulting from feature expansion and feature selection are shown in Figs. 6 and 7. The Table 11 explains the significance of the effect of the number of features and the prediction model. classification with twoway Analysis of Variance. Meanwhile, Tables 12 and 13 shows the selected classification prediction model that is used to predict the classification for some time in the future for each data set. For spatialtime visualization of the classification prediction map shown in Figs. 8 and 9.
The performance of the classification prediction model for the previous \({\varvec{t}}-{\varvec{k}}\) times
The classification prediction model in this study is built with the Naive Bayes classifier under two feature expansion scenarios. Feature expansion increases the number of features, and the added features are not necessarily important. Feature selection is therefore used to keep the growth in dimensionality in check during model development: it narrows the expanded set to the combinations of influential features and increases the accuracy of the Naive Bayes classification prediction model.
For the DHF case-count data set, features are expanded over time in units of years, whereas for the rainfall data set they are expanded in units of months. The expansion scenario is adjusted to the characteristics of each data set. Table 7 shows the model combinations for the DHF data set. The \(t-k\) model in Table 7 denotes the prediction model built from the previous \(k\) years. For example, the DHF3A model predicts dengue cases in 2018 using features from the previous 3 years, namely 2015, 2016 and 2017. The goal is to obtain a prediction model for the next 3 years based on data from the last 3 years.
Table 8 shows the model combinations for the Rainfall data set. The \(t-k\) model in Table 8 denotes the prediction model built from the previous \(k\) months. For example, the RF2A model is the weather prediction model with a 2-month window: it targets March 2022 and is trained using features obtained in January and February 2022.
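The time-based expansion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `expand_time_features`, the nested-dictionary layout, and the toy values are all hypothetical. Each training row concatenates the features of the \(k\) periods preceding the target period, and the target is the class observed in the target period.

```python
def expand_time_features(features, labels, k):
    """Build an expanded training matrix from per-location time series.

    features[loc][t] is the list of feature values at location `loc`, period t.
    labels[loc][t] is the class observed at location `loc`, period t.
    Each output row concatenates the features of the k periods before t
    (oldest first); the target is the class at period t.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    X, y = [], []
    for loc in sorted(features):
        n_periods = len(features[loc])
        for t in range(k, n_periods):
            row = []
            for lag in range(k, 0, -1):  # periods t-k, ..., t-1
                row.extend(features[loc][t - lag])
            X.append(row)
            y.append(labels[loc][t])
    return X, y


# Toy data (illustrative values only): 2 locations, 4 periods, 2 features each.
features = {
    "A": [[1, 10], [2, 20], [3, 30], [4, 40]],
    "B": [[5, 50], [6, 60], [7, 70], [8, 80]],
}
labels = {"A": [0, 0, 1, 2], "B": [1, 1, 2, 2]}

X3, y3 = expand_time_features(features, labels, k=3)  # 6 features per row
X2, y2 = expand_time_features(features, labels, k=2)  # 4 features per row
```

With four periods per location, `k=3` yields one row per location and `k=2` yields two, so widening the window trades rows for columns. The resulting matrix could be passed to any Naive Bayes implementation, for example `GaussianNB` or `CategoricalNB` from scikit-learn.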
The feature and record expansion scenarios are applied to the training data before it is passed to the Naive Bayes classifier. Simple linear regression was also fitted to both data sets as a baseline against which the predictions of the feature expansion method with Naive Bayes are compared.
Figure 6 shows the accuracy and F1-score of the \(t-3\), \(t-4\) and \(t-5\) prediction models using NBTBFE and regression for the DHF data set, while Fig. 7 shows the accuracy and F1-score of the \(t-2\) to \(t-5\) prediction models using NBTBFE and regression for the Rainfall data set.
Table 9 shows the average performance of the previous \(t-k\) time classification prediction models using Naïve Bayes Time-Based Feature Expansion (NBTBFE) and regression on both data sets, and indicates that, in general, the time-based classification prediction models using NBTBFE outperform all regression models. Table 10 shows the performance values of the selected models, taking into account the influence of the number of features produced by feature expansion.
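For reference, the two evaluation metrics can be computed as in the following sketch. It assumes macro averaging of the per-class F1-score over the three classes; the averaging scheme is an assumption made here for illustration, and the label vectors are toy values rather than results from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for a three-class problem (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

On imbalanced data the two metrics can diverge, because accuracy is dominated by the majority class while macro F1 weights every class equally.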
Evaluation of the result
This section discusses the effect of the number of feature combinations and of the classification prediction model in the Naive Bayes Time-Based Feature Expansion method. The experiment examines the performance of the classification prediction model under each scenario. Comparisons were made using Analysis of Variance (ANOVA) with two grouping factors, namely the model and the evaluation metric. Table 11 shows that both the classification prediction model scenario with Naive Bayes Time-Based Feature Expansion and the model performance values (accuracy and F1-score) have a significant influence.
Tests at the 95% confidence level show that the averages of the two scenarios differ significantly on both the dengue fever and the rainfall data. This is indicated by P values below 0.05 and large F statistics, both for the classification prediction model and for model performance.
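To show how an F statistic of this kind is formed, the sketch below computes the between-group and within-group sums of squares and mean squares for a single grouping factor. This is a simplification of the two-way analysis reported in Table 11, and the group values are illustrative numbers, not the scores from the experiments. The variable names follow the abbreviations SSBG, SSWG, MSBG and MSWG.

```python
def one_way_anova_f(groups):
    """Return the ANOVA quantities for one grouping factor."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_bg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_wg = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_bg = len(groups) - 1
    df_wg = n_total - len(groups)
    ms_bg = ss_bg / df_bg
    ms_wg = ss_wg / df_wg
    return {"SSBG": ss_bg, "SSWG": ss_wg, "DF": (df_bg, df_wg),
            "MSBG": ms_bg, "MSWG": ms_wg, "F": ms_bg / ms_wg}


# Illustrative scores for three hypothetical models.
result = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

The P value follows from the F distribution with the two degrees of freedom, for example via `scipy.stats.f.sf(F, df_bg, df_wg)`; a two-way analysis adds a second factor and partitions the sum of squares accordingly.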
Features characteristics of the classification prediction model for the previous \({\varvec{t}}+{\varvec{k}}\) times
Feature selection for each model considers both the accuracy or F1-score obtained and the number of features. A model with too many features tends to overfit, even when it has the highest accuracy or F1-score. Conversely, a model with too few features tends to underfit, that is, it is too simple to capture the patterns in the data. A feature that appears in several models is also a strong candidate for selection.
Tables 12 and 13 show the number of features of the \(t-k\) classification prediction models for the DHF and Rainfall data sets, selected by considering underfitting, overfitting, and how frequently potential features occur.
Tables 14 and 15 show the prediction results from the selected models in Tables 12 and 13. The prediction results in these two tables will be visualized on a classification prediction map for the distribution of the number of dengue cases and changes in rainfall.
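The frequency criterion mentioned above (a feature that recurs across several models is a good candidate) can be sketched as a simple count. The model names, feature names and the `min_models` threshold below are hypothetical and serve only to illustrate the idea; the actual selection also weighs accuracy, F1-score and the number of features.

```python
from collections import Counter


def recurring_features(model_features, min_models=2):
    """Return features that appear in at least `min_models` models."""
    counts = Counter()
    for feats in model_features.values():
        counts.update(set(feats))  # count each feature once per model
    return sorted(f for f, c in counts.items() if c >= min_models)


# Hypothetical feature sets of three candidate models.
model_features = {
    "M1": ["f_lag1", "f_lag2"],
    "M2": ["f_lag1", "f_lag2", "f_lag3"],
    "M3": ["f_lag1", "f_lag4"],
}
shared_by_two = recurring_features(model_features, min_models=2)
shared_by_all = recurring_features(model_features, min_models=3)
```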
Visualization of classification prediction
The predictions of the classification models obtained with the Naive Bayes Time-Based Feature Expansion method on the DHF and rainfall data are presented as maps. The aim is to show visually how the classification at each location changes over time. The maps visualize the predicted classes at the spatial locations of each data set and are obtained by kriging interpolation. The maps were produced with ArcGIS software (https://pro.arcgis.com/).
The predictions of the best models in Table 14 are the basis for the classification prediction maps of the DHF data set for the next 1 to 3 years, namely 2021, 2022 and 2023, which are presented in Fig. 8. The class of each subdistrict is indicated by one of 3 colors: green indicates class 0, yellow indicates class 1, and red indicates class 2. Figure 9 visualizes the predicted rainfall classification in Java from May to August 2022, based on Table 15.
Discussion and conclusions
This research addressed two problems: expanding time series data matrices with the feature expansion method to build Naïve Bayes Time-Based Feature Expansion (NBTBFE) classification prediction models, and determining how significantly the number of features influences those models. The resulting NBTBFE classification prediction model was applied to spatiotemporal data. The DHF and Rainfall data sets were selected for the experiment because they are spatiotemporal and because they have been used in previous studies [23,24,25] with Naive Bayes and other classifiers. Both data sets were imbalanced, so oversampling with SMOTE was performed, and high performance was obtained in almost all models. Because of this imbalance, accuracy and F1-score are used to measure the performance of the NBTBFE model. The baseline is a regression model whose performance is measured by the R-squared value, which is then compared with the accuracy and F1-score of the NBTBFE model.
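The oversampling step mentioned above can be illustrated with a simplified SMOTE-style sketch: each synthetic point is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function name, the neighbour count, the seed and the toy points are assumptions for illustration; the text does not state which SMOTE implementation was used, and a maintained one is available as `SMOTE` in the imbalanced-learn package.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Nearest neighbours of `base` among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class samples (illustrative only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=5)
```

Because every synthetic point is an interpolation, it stays inside the region spanned by the existing minority samples; oversampling should be applied to the training split only so that evaluation data remains untouched.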
Time-based feature matrix expansion is used to develop feature expansion models for the previous times. The feature expansion model for time \(t-k\) with the best performance is then used by NBTBFE for the future time \(t+k\), at observed and unobserved locations. To our knowledge, no previous research has developed Naive Bayes classification prediction models with this kind of time-based expansion of the input features.
In general, the NBTBFE model gives better results on the data sets than the regression baseline. How significant the difference is varies with the type and quality of the data used. On the DHF and rainfall data sets, the average accuracy and F1-score of NBTBFE exceed the R-squared of the linear regression model for the previous \(t-k\) times. For present-time prediction on the same data sets, NBTBFE also outperforms the traditional Naive Bayes classifier, other classifiers and hybrid classifiers [23,24,25].
The accuracy and F1-score obtained after feature selection with Naïve Bayes Time-Based Feature Expansion change markedly across model combinations in each time period. On the DHF data set, the annual classification prediction model performed best in the time-based feature expansion scenario that expands over the previous four years. On the Rainfall data set, the monthly classification prediction model performed best in the scenario that expands over the previous three months. These characteristics are the basis for feature selection in the \(t+k\) prediction model, where optimal features are chosen by considering underfitting, overfitting and how frequently features with the potential to improve the classification target appear [31].
Testing the influence of the model and of the number of features in each model developed from feature expansion over the previous \(t-k\) times shows that the number of features strongly influences the performance of the previous \(t-k\) classification prediction model. Time-based feature expansion methods can therefore help determine the optimal number of factors and the optimal time expansion.
For the two case-study data sets, the following results were obtained. The number of DHF cases is classified with the DHF3C model for 2021, the DHF4C model for 2022 and the DHF5A model for 2023. The combination of features from 2015, 2014 and 2013 is therefore a potential feature set, because each of these models was the best model, with accuracy and F1-score above 97%. However, after further expansion with the 2016 and 2017 features, the accuracy fell by 8.67% and the F1-score fell by 8.68%. For the monthly classification of the rainfall data set, the models used are RF2C, RF3A, RF4C and RF5C. These results indicate that the features of February 2022, January 2022 and December 2021 have high potential, because they dominate all models.
The resulting \(t+k\) classification prediction model can be visualized as classes and distribution maps that reveal the pattern of vulnerability status for the increase in annual dengue cases and for monthly climate changes.
Based on the experimental results and discussion, the Naïve Bayes method, developed with the time-based feature expansion scenario and the concept of building prediction models on time series, improves in its ability to predict future classifications. The predictions obtained with the proposed model also outperform those obtained with regression analysis.
The limitation of the proposed feature expansion method is the reduced amount of data available for training the \(t+k\) prediction model. Although feature expansion can potentially give more accurate results and improve the ability to predict the future, building a robust \(t+k\) future prediction model requires a larger data size (\(k\) times) than building a model that predicts the present. Future research could apply other classification methods and compare the approach with deep learning methods such as CNN, RNN or LSTM. Developing feature expansion formulas for other classifier models and validating the model on various types of data are also left as future work.
Availability of data and materials
The materials used in this study are available at: 1. Naïve Bayes Time-Based Feature Expansion implemented in Python (https://drive.google.com/file/d/1YvkKsZ7yVMancoSBDxKivZsuljpazk/view?usp=sharing). 2. Rainfall data set (https://docs.google.com/spreadsheets/d/1aHLgvCQD5lSqLt_XIeAuAEMPslmZ23/edit?usp=drive_link&ouid=115398213045183747364&rtpof=true&sd=true). 3. DHF data set (https://docs.google.com/spreadsheets/d/1L3rPktXYhwo2reYvp38sN9vmafhJgug/edit?usp=drive_link&ouid=115398213045183747364&rtpof=true&sd=true). 4. ArcGIS software (https://pro.arcgis.com/). Data is provided within the manuscript or supplementary information files.
Abbreviations
NB: Naïve Bayes
ANOVA: Analysis of variance
TB: Time-based
FE: Feature expansion
NBTBFE: Naïve Bayes time-based feature expansion
DHF: Dengue hemorrhagic fever
TP: True positive
P: Probability
F: Fisher
BG: Between groups
WG: Within groups
T: Total
SS: Sum of squares
SSBG: Sum of squares between groups
SSWG: Sum of squares within groups
SST: Total sum of squares
DF: Degrees of freedom
MS: Mean square
MSBG: Mean square between groups
MSWG: Mean square within groups
References
Robnik-Šikonja M. Explanation of prediction models with explain prediction. Inform. 2018;42(1):13–22.
Akhter M, Ahanger MA. Climate modelling using ANN. Int J Hydrol Sci Technol. 2019;9(3):251–65. https://doi.org/10.1504/IJHST.2019.102316.
Yesilkanat CM. Spatiotemporal estimation of the daily cases of COVID-19 in worldwide using random forest machine learning algorithm. Chaos Solitons Fractals. 2020. https://doi.org/10.1016/j.chaos.2020.110210.
Nikparvar B, Thill JC. Machine learning of spatial data. ISPRS Int J Geo-Information. 2021;10(9):1–28. https://doi.org/10.3390/ijgi10090600.
Ahn S, Ryu DW, Lee S. A machine learning-based approach for spatial estimation using the spatial features of coordinate information. ISPRS Int J Geo-Information. 2020. https://doi.org/10.3390/ijgi9100587.
Pourghasemi HR, et al. Spatial modeling, risk mapping, change detection, and outbreak trend analysis of coronavirus (COVID-19) in Iran (days between February 19 and June 14, 2020). Int J Infect Dis. 2020;98:90–108. https://doi.org/10.1016/j.ijid.2020.06.058.
Alkhamis MA, et al. Spatiotemporal dynamics of the COVID-19 pandemic in the State of Kuwait. Int J Infect Dis. 2020;98:153–60. https://doi.org/10.1016/j.ijid.2020.06.078.
Atluri G, Karpatne A, Kumar V. Spatiotemporal data mining: A survey of problems and methods. ACM Comput Surv. 2018;51(4):1–37. https://doi.org/10.1145/3161602.
Kolesnikov AA, Kikin PM, Portnov AM. Diseases spread prediction in tropical areas by machine learning methods ensembling and spatial analysis techniques. Int Arch Photogramm Remote Sens Spatial Inf Sci. 2019;42:221–6.
Mohajane M, et al. Application of remote sensing and machine learning algorithms for forest fire mapping in a Mediterranean area. Ecol Indic. 2021;129:107869. https://doi.org/10.1016/j.ecolind.2021.107869.
Fouedjio F. Classification random forest with exact conditioning for spatial prediction of categorical variables. Artif Intell Geosci. 2021;2(October):82–95. https://doi.org/10.1016/j.aiig.2021.11.003.
MinminMiao F, et al. Discriminative spatial-frequency-temporal feature extraction and classification of motor imagery EEG: a sparse regression and Weighted Naïve Bayesian Classifier-based approach. J Neurosci Methods. 2017;278:13–24.
AbMunag JI, Prasadb VNK, Nickolasa S, Gangadharan GR. Representational primitives using trend based global features for time series classification. Expert Syst Appl. 2021. https://doi.org/10.1016/j.eswa.2020.114376.
Gao CZ, Cheng Q, He P, Susilo W, Li J. Privacy-preserving Naive Bayes classifiers secure against the substitution-then-comparison attack. Inf Sci. 2018. https://doi.org/10.1016/j.ins.2018.02.058.
Chen S, Webb GI, Liu L, Ma X. A novel selective naïve Bayes algorithm. Knowl Based Syst. 2020;192:105361.
Karabatak M. A new classifier for breast cancer detection based on Naïve Bayesian. Measurement. 2015;72:32–6.
Tsangaratos P, Ilia I. Comparison of a logistic regression and Naïve Bayes classifier in landslide susceptibility assessments: the influence of models complexity and training dataset size. CATENA. 2016;145:164–79.
Zhang L, Jiang L, Li C, Kong G. Two feature weighting approaches for naive Bayes text classifiers. Knowl Based Syst. 2016;100:137–44.
Blanquero R, Carrizosa E, Ramírez-Cobo P, Sillero-Denamiel MR. Variable selection for Naïve Bayes classification. Comput Oper Res. 2021;135:105456. https://doi.org/10.1016/j.cor.2021.105456.
Padmavathi S, Ramanujam E. Naïve Bayes Classifier for ECG abnormalities using multivariate maximal time series motif. Procedia Comput Sci. 2015;47:222–8. https://doi.org/10.1016/j.procs.2015.03.201.
Tang X, Shu Y, Lian Y, Zhao Y, Fu Y. A spatial assessment of urban waterlogging risk based on a Weighted Naïve Bayes classifier. Sci Total Environ. 2018;15(630):264–74.
Viet TN, Le Minh H, Hieu LC, Anh TH. The naïve Bayes algorithm for learning data analytics. Indian J Comput Sci Eng. 2021;12(4):1038–43. https://doi.org/10.21817/indjcse/2021/v12i4/211204191.
Inayah FN, Prasetiyowati SS, Sibaroni Y. Classification of Dengue Hemorrhagic Fever (DHF) spread in Bandung using hybrid Naïve Bayes, K-Nearest Neighbor, and Artificial Neural Network methods. Int J Inf Commun Technol. 2021;7(1):10–20. https://doi.org/10.21108/ijoict.v7i1.562.
Gumilar A, Prasetiyowati SS, Sibaroni Y. Performance analysis of hybrid machine learning methods on imbalanced data (rainfall classification). Jurnal RESTI. 2022;6(3):481–90.
Sidik DD, Sen TW. Penggunaan stacking classifier Untuk Prediksi Curah Hujan. IT Soc. 2019;4(1):21–7. https://doi.org/10.33021/itfs.v4i1.1180.
Storcheus D, Rostamizadeh A, Kumar S. A survey of modern questions and challenges in feature extraction. 1st Int Feature Extr Mod Quest Challenges. 2015;44:1–18.
Guyon I. CrossRef List. Deleted. 2000, https://doi.org/10.1162/153244303322753616.
Yao K, Lu W, Zhang S, Xiao H, Li Y. Feature expansion and feature selection for general pattern recognition problems. ICNNSP. 2003. https://doi.org/10.1109/ICNNSP.2003.1279205.
Tsai CF, Lin WY, Hong ZF, Hsieh CY. Distance-based features in pattern classification. EURASIP J Adv Signal Process. 2011;1:2011. https://doi.org/10.1186/16876180201162.
Jung D, Lee J, Park H. Feature expansion of single dimensional time series data for machine learning classification. IEEE Xplore. 2021. https://doi.org/10.1109/ICUFN49451.2021.9528690.
Eden J. Expand Your Horizons. 2021.
Kaul A, Maheshwary S, Pudi V. Autolearn—automated feature generation and selection. Proc IEEE Int Conf Data Mining ICDM. 2017. https://doi.org/10.1109/ICDM.2017.31.
Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: a new perspective. Neurocomputing. 2018. https://doi.org/10.1016/j.neucom.2017.11.077.
Kumar N, Maurya V, Maurya VK. A review on machine learning (feature selection, classification and clustering) approaches of big data mining in different area of research. J Crit Rev. 2020. https://doi.org/10.31838/jcr.07.19.322.
Zhao S, Wang M, Ma S, Cui Q. A feature selection method via relevant-redundant weight. Expert Syst Appl. 2022. https://doi.org/10.1016/j.eswa.2022.117923.
Damoulas T, Girolami MA. Combining feature spaces for classification. Pattern Recognit. 2009;42(11):2671–83. https://doi.org/10.1016/j.patcog.2009.04.002.
Petrusevich DA. Features addition and dimensionality reduction in classification. IOP Conf Ser Mater Sci Eng. 2020. https://doi.org/10.1088/1757899X/919/4/042018.
Berrar D. Bayes’ theorem and naive bayes classifier. Encycl Bioinforma Comput Biol ABC Bioinforma. 2018;1–3(September):403–12. https://doi.org/10.1016/B9780128096338.204731.
Chakrapani HB, Chouraisa S, Saha A, Swathi JN. Predicting performance analysis of system configurations to contrast feature selection methods. Int Conf Emerg Trends Inf Technol Eng ICETITE. 2020. https://doi.org/10.1109/icETITE47903.2020.106.
Le Minh T, Van Tran L, Dao SVT. A feature selection approach for fall detection using various machine learning classifiers. IEEE Access. 2021;9:115895–908. https://doi.org/10.1109/ACCESS.2021.3105581.
Markoulidakis I, Kopsiaftis G, Rallis I, Georgoulas I. Multiclass confusion matrix reduction method and its application on net promoter score classification problem. ACM Int Conf Proceeding Ser. 2021. https://doi.org/10.1145/3453892.3461323.
Sawye S. Analysis of variance: the fundamental concepts. 2017. https://doi.org/10.1179/jmt.2009.17.2.27E.
Hallman J. A comparative study on Linear Regression and Neural Networks for estimating order quantities of powder blends. 2019.
Xiao Y, Jin Z. The forecast research of linear regression forecast model in national economy. OALib. 2021;8:1–17. https://doi.org/10.4236/oalib.1107797.
Chowdhury AI, et al. Analyzing spatial and space-time clustering of facility-based deliveries in Bangladesh. Trop Med Health. 2019;9:1–12.
Cressie N, Moores MT. Spatial statistics. 2021.
Acknowledgements
We would like to thank Telkom University for its full support, which enabled us to complete this research.
Funding
This research was not funded by a specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
The authors declare full responsibility for the creation of this manuscript, which includes conceiving the concept and design, collecting and preprocessing the data sets, analyzing and interpreting the results, and writing the manuscript. The authors have read and approved the final manuscript. A developed the concepts and formulas and wrote the main manuscript text. B developed the algorithm, prepared the images and finalized the text of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Prasetiyowati, S.S., Sibaroni, Y. Unlocking the potential of Naive Bayes for spatio temporal classification: a novel approach to feature expansion. J Big Data 11, 106 (2024). https://doi.org/10.1186/s4053702400958x
DOI: https://doi.org/10.1186/s4053702400958x