Detecting heterogeneity parameters and hybrid models for precision farming

Precision farming (PF) plays a crucial role in the field of agriculture to solve the challenges of food shortages in society. Heterogeneity, multicollinearity, and outliers are problems in PF because they can cause bias and lead to incorrect inferences. However, traditional methods typically assume it to be a homogenous model, and in machine learning, data scientists ignore heterogeneity. In this study, the aim is to identify the heterogeneity parameters and develop hybrid models before and after heterogeneity. Data on seaweed is collected using sensor smart farming technology attached to v-Groove Hybrid Solar Drier (v-GHSD). There are 29 drying parameters, and each parameter has 1914 observations. We considered the highest order up to the second order interaction, and the parameters increased to 435 parameters from 29 parameters. In high-dimensional data, the number of observations is less than the number of parameters. The authors proposed a method using the variance inflation factor to identify the heterogeneity parameters. Seven predictive models such as ridge, random forest, support vector machine, bagging, boosting, LASSO and elastic net are used to select the 15, 25, 35 and 45 significant drying parameters for the moisture content removal of the seaweed, and hybrid models are developed using robust statistical methods. For before heterogeneity, the hybrid model random forest M Hampel with 19 outliers is the best, because it performs better when compared to other models. For after heterogeneity, the hybrid model boosting M Hampel with 19 outliers is the best, because it performs better when compared to other models. These results are vital to seaweed precision farming. The study of heterogeneity will not only help us to comprehend the dynamics of the large number of the drying parameters, but also gives a way to leverage the data for efficient predictive modelling.


Introduction
Farming involves the growing of crops and the rearing of livestock. It is a source of raw materials for industries. The traditional methods used by farmers are not precise, which leads to manual labour and the consumption of time [1]. Precision farming (PF) plays a vital role in the field of agriculture to solve the challenges of food shortages in society. The PF method is a subset of smart farming technologies (SFTs) that deals with information systems, the internet of things (IoT), precision agriculture systems, artificial intelligence, cloud computing, farm management, wireless sensor networks, robotics, and automation of agriculture [2,3]. The merit of the method is that it boosts farm profits and cuts down the cost of production [6].
Seaweeds are also called macroalgae. They are plant-like organisms attached to rocks or rock layers. In addition, they grow in lakes, oceans, rivers, and other water bodies [7,8]. They are a crucial source of fat, carbohydrates, vitamins, fibre, and ash, as well as proteins and beta-carotene [9]. Seaweed is useful in many forms (for example, powder, fresh, salted, canned, dried or extracts) for eating by humans or as feeds, biofuels, medicines, and fertilisers [10]. (See Fig. 1 for the stages involved in the pre-harvest and post-harvest of seaweed.)
One of the post-harvest problems with seaweed is the high moisture content. According to [11], seaweed is easily damaged when it is very fresh. Therefore, seaweed must be dried after harvesting. Drying is used to reduce the moisture content [15]. It also decreases the biomass weight of seaweed during transportation, which makes it available for additional processing [12]. Drying also reduces storage, transportation, and processes to prevent losses and increase value [14]. The types of drying are freeze-drying (direct drying method), conventional drying and microwave-assisted drying (solar). See Fig. 2 for details. A solar drier is the most efficient drying method for seaweed and can dry the water content faster [16]. These authors [13,17-19] have employed solar driers in their studies. The drying parameters using the v-Groove Hybrid Solar Drier (v-GHSD) were monitored effectively by [13,17]. Furthermore, the internet of things (IoT) based solar drying system using the v-GHSD was more effective in monitoring the drying behaviour [13,15]. All the parameters involved in solar drying should be studied to reduce the moisture content of seaweed and improve food quality and quantity. However, the methods Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Clustering Large Applications (CLARA), Partitioning Around Medoids (PAM) and multiple linear regression were used to find the optimal parameters to increase the production of crops [24]. Machine learning (ML) algorithms are used to model complicated problems that humans cannot understand because of their complexity. In addition, these algorithms are useful to detect diseases, predict soil parameters, predict crop yield, and detect species [1,6].
Fig. 2 Types of existing drying methods for seaweed (labels in the figure include a turbine and a two-way solar collector with absorber)
A study on fish drying by [26] investigated the moisture content using ridge regression in conjunction with eight selection criteria. The most significant factors influencing the moisture content and the interaction terms were investigated. From the results, the important drying parameters can be predicted from the moisture content of fish. The drying parameters that determine the moisture content removal of seaweed were investigated by [27]. From the results, bagging performed better than boosting in determining the drying parameters of the seaweed, but heterogeneity was not considered.
Big data analysis comes with many challenges, such as outliers and multicollinearity. Many studies have been conducted on how to handle these problems. Another problem facing big data is heterogeneity, and there is insufficient knowledge about heterogeneity, especially in the field of agriculture using seaweed big data. In addition, the data obtained in big data come from varied sources, and some are structured while others are unstructured [28]. All these complexities make the data complicated to analyse. Heterogeneity refers to variation in the data. This variability needs to be investigated to avoid wrong results and inferences.
Heterogeneity is a problem in the field of agriculture. For example, [29] found that there is substantial heterogeneity driving the forces of the rice ecosystem. The results showed that the adoption of each management method has heterogeneity. According to [30], heterogeneity was based on the spatial characteristics and behaviour of the participants, which influenced decision making. In the study of hydrological response to heterogeneity using a variable infiltration capacity model by [31], accounting for heterogeneity in land use gives better responses to hydrology and evapotranspiration. A study on the effects of ignoring heterogeneity showed that it results in overestimation of the technical efficiency and underestimation of the parameters of the models [32]. The study on farmland heterogeneity revealed that, under different ecosystem services (ES), the changes in heterogeneity are not the same, and there is a need for improvement in the ES to understand the market, especially for pest regulation and crop production [33]. According to [34], the effect of temperature on yield showed significant heterogeneity, and it was an eye opener for adaptation between cooler and warmer counties. The study on bird diversity by [35] revealed that the community is affected by cropland heterogeneity and cropland size.
Additionally, there is little research on the parameters influencing the moisture content removal of seaweed. Even in the literature found, few researchers have worked on seaweed big data. Also, few studies considered the interaction terms in seaweed drying. There is no study that compared the outliers before and after heterogeneity. Finally, there is no study on heterogeneity using big data in agriculture, especially on the moisture content removal of seaweed.
A lot of studies have been done on outliers and multicollinearity, but not on heterogeneity. In fact, we do not find any literature in the agricultural field that addresses heterogeneity using drying parameters. Hence, this study focuses on how to detect the heterogeneity of drying parameters and develop hybrid models to determine the significant parameters of the moisture content removal of seaweed. Interaction effects up to the second order for the seaweed big data are incorporated into the model. In addition, hybrid models using seven supervised ML algorithms with robust estimation are utilised to determine the significant parameters that determine the moisture content removal of the seaweed and reduce the number of outliers. The accuracy of the ML algorithms is also investigated via evaluation metrics. Finally, the impact of the errors is also compared before and after heterogeneity.

Materials and methods
Seven supervised machine learning algorithms, namely ridge, random forest, support vector machine, bagging, boosting, LASSO and elastic net, are used to determine the significant parameters for the moisture content removal of the seaweed before and after heterogeneity. In addition, robust methods are utilised for the development of the hybrid models. The flowchart in Fig. 3 states the procedure and methodology used in this research.

Data description
The data were collected from 8th April 2021 to 12th April 2021, between 8:00 am and 5:00 pm, during the drying of seaweed using the v-Groove Hybrid Solar Drier (v-GHSD) at Semporna, South-Eastern Coast of Sabah, Malaysia. Some of the parameters are temperature, relative humidity ambient, relative humidity chamber, and solar radiation. Table 1 shows the 29 main parameters, and each parameter has 1914 observations in this study, which is equivalent to 536,870,912 equations. Each observation area is evaluated as a parameter and the region is considered to simplify the system. This is not feasible to deal with because of the time and complexity. The addition of the second order interaction to the main 29 seaweed drying parameters increased the total number of parameters to 435. Optimization by selecting the first 15, 25, 35 and 45 high-ranking important variables is performed.
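The parameter counts quoted above can be checked with a few lines of arithmetic. The sketch below assumes that "second order interaction" means all pairwise products of the main parameters, so that k main parameters give k + C(k, 2) terms; the function name is illustrative and does not come from the study.

```python
from math import comb


def terms_with_pairwise_interactions(k: int) -> int:
    """Main effects plus all pairwise (second-order) interaction terms.

    Assumption: one interaction term per unordered pair of main parameters.
    """
    return k + comb(k, 2)


# 29 main drying parameters: 29 + 406 pairwise interactions.
print(terms_with_pairwise_interactions(29))

# The 536,870,912 figure quoted in the text is numerically equal to 2**29.
print(2 ** 29 == 536_870_912)
```

Under the same assumption, the function gives 253 for the 22 main parameters that remain once the seven heterogeneity parameters are removed, which matches the count reported in the results.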

Phase I
This involves the addition of all possible models up to second order and testing of assumptions. According to [15], the total number of models can be calculated by using Eq. 1, where N represents the number of possible models, k is the total number of explanatory variables and j = 1, 2, 3, ..., k. The assumptions of linearity, errors, observations, independent variables, and heterogeneity are checked in the R programming language. Then ridge, random forest (RF), support vector machine (SVM), boosting, bagging, LASSO and elastic net are used to select the significant parameters that determine the moisture content removal. The 15, 25, 35 and 45 parameters are selected because feature selection can only provide the rank of important variables and does not tell us the number of significant factors [36]. Next, the validation metrics are computed using mean absolute percentage error (MAPE), mean squared error (MSE) and coefficient of determination ($R^2$).

Phase II
Next, the VIF is computed with vif from the car library in R using the original data. This gives the range of the values for the variances before we compute the R-squared and 90% confidence interval. If the model has a value that falls below the maximum R-squared, then it exhibits heterogeneity. The models that exhibit heterogeneity are excluded and the models that do not exhibit heterogeneity are included. Then, the ML algorithms in phase I are used to select the 15, 25, 35 and 45 significant parameters.

Phase III
Next, the hybrid models are developed for before and after heterogeneity using robust methods. Data with outliers can be analysed by using robust estimation [37,38]. The robust methods used are M Bi-Square, M Hampel, M Huber, MM and S. Finally, the validation metrics are computed using the 3-sigma limits to identify the number of outliers. The sigma limits are used for quality improvement [41].
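The 3-sigma count used in this phase can be sketched as follows. This is a minimal illustration, assuming the limits are the mean of the residuals plus or minus three population standard deviations; the function name and the toy residuals are made up for the example and are not taken from the study.

```python
from statistics import mean, pstdev


def count_outliers_3sigma(residuals):
    """Count residuals lying outside mean +/- 3 standard deviations."""
    centre = mean(residuals)
    sigma = pstdev(residuals)  # population standard deviation (an assumption)
    lower, upper = centre - 3 * sigma, centre + 3 * sigma
    return sum(1 for r in residuals if r < lower or r > upper)


# Toy residuals (illustrative only): 99 small values and one extreme value.
toy_residuals = [0.1, -0.1] * 49 + [0.0, 25.0]
print(count_outliers_3sigma(toy_residuals))
```

For residuals that have already been standardised, the limits reduce to approximately -3 and +3, so the same count can be read directly from a standardised residual plot.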

The v-Groove Hybrid Solar Drier (v-GHSD)
In this study, the v-Groove Hybrid Solar Drier (v-GHSD) was used for drying the seaweed. A solar drier is used in precision agriculture to dry foods by using solar energy to improve the quality of food and reduce wastage. The v-GHSD (Fig. 4) comprises a solar panel, a v-aluminium roof, a drying chamber solar collector, and sensors using the internet of things to retrieve data. All the parameters are to receive data from different locations of the drier. The sensors are positioned to measure temperature, solar radiation, relative humidity, and moisture content. An IoT cloud database was used to understand the performance and the interaction of the drying parameters during the identified drying period. The data are stored in the cloud database every second and later converted to thirty-minute intervals for analysis, that is, for identifying the heterogeneity parameters and reducing the multicollinearity and outliers, using the proposed model to determine the moisture content removal.

Heterogeneity identification
Heterogeneity refers to variability of observations. This variability leads to inconsistent estimates and distorts conclusions [42]. Suppose we have the multiple linear regression (MLR) model in Eq. 2, where $Y$ is the moisture content, the $\beta$'s are the regression coefficients, the $T$'s are the drying parameters, $a_j$ denotes heterogeneity, that is, the parameters that exhibit heterogeneity, and $\varepsilon$ is the random error. In Eq. 2, a common problem is multicollinearity, which happens when many variables are correlated and significant not only with the dependent variable but also with each other. Our interest in this equation is $a_j$. In Eq. 2, if we estimate the regression equation and omit a crucial variable, then the estimate of $\beta$ will be biased and inconsistent. According to [43], the variance inflation factor (VIF) in multiple regression is used to quantify the level of severity of multicollinearity. It can be computed with $\mathrm{VIF} = \frac{1}{1 - R^2}$, which means that $R^2 = 1 - \frac{1}{\mathrm{VIF}}$. If the $R^2$ satisfies certain conditions, then the parameter is said to exhibit heterogeneity.
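The relationship between the VIF and R-squared stated above can be illustrated with a short sketch. The function names are hypothetical, and the code only converts between the two quantities; it does not reproduce the heterogeneity conditions of the study, which are not spelled out in this section.

```python
def r_squared_from_vif(vif: float) -> float:
    """Invert VIF = 1 / (1 - R^2) to obtain R^2 = 1 - 1 / VIF."""
    if vif < 1:
        raise ValueError("VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


def vif_from_r_squared(r_squared: float) -> float:
    """Compute VIF = 1 / (1 - R^2) for 0 <= R^2 < 1."""
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# The largest VIF reported in the results is 75,337.29, which corresponds
# to an R^2 extremely close to 1.
print(r_squared_from_vif(75337.29))
```

A VIF of 10, for comparison, corresponds to an R-squared of 0.9.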

Evaluation metric
The suitability and accuracy of the models were evaluated using the mean absolute percentage error (MAPE), mean squared error (MSE) and coefficient of determination ($R^2$). The metrics are stated in Table 2, where $y_i$ is the actual value, $\bar{y}$ is the mean of the actual values and $\hat{y}_i$ is the forecast value. The equations are given in their standard forms, with $n$ the number of observations.

Metrics | Equations | Description
MAPE | $\frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$ | It is widely used because it is easy to interpret and is scale-independent [44].
MSE | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | This is good for giving weight to outliers that need to be identified [45].
$R^2$ | $1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$ | This gives the proportion of variance in the dependent variable which can be predicted from the independent variables. $R^2$ lies between 0 and 1 [45,46].
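The three metrics named above can be computed as in the sketch below. It assumes their standard definitions, with MAPE expressed as a percentage; the function names and the toy values are illustrative and are not data from the study.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    n = len(actual)
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / n


def mse(actual, forecast):
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def r_squared(actual, forecast):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values, made up for illustration.
y_actual = [10.0, 20.0, 30.0, 40.0]
y_forecast = [11.0, 19.0, 33.0, 38.0]
print(mape(y_actual, y_forecast), mse(y_actual, y_forecast), r_squared(y_actual, y_forecast))
```

Because MSE squares each error, a single large residual contributes far more to it than to MAPE, which is why the two metrics can rank models differently when outliers are present.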

Statistical power for percentage change and absolute change
Statistical power is the probability that a test rejects a false null hypothesis. Statistical power = $P(\text{reject } H_1 \mid H_1 \text{ is false})$, where $H_1$ is the null hypothesis. For a t-test, the equation becomes $P(|t| > t_{\alpha/2}) = P(P_t < \alpha)$, where $t_{\alpha/2}$ represents the t-value under the level of significance $\alpha$ and $P_t$ is the t-test p-value.
The percentage change and absolute change (Eqs. 5 and 6) are computed from $B_{MAPE}$ and $A_{MAPE}$, the MAPE before and after heterogeneity.
To know the best indicator to use between percentage and absolute change, the statistical power must be compared [47]. Statistical power was compared through simulation [48]. According to [49], absolute change was used to study the weight change. Absolute change was used to investigate change in obesity by [50]. Percentage change was used to study change in loss of fat by [51]. A test statistic that compared the maximum likelihood of an absolute change to a percentage change was developed by [52]. According to [53], the percentage change is not affected by the unit of measurement, but the paper did not explain how to choose between absolute and percentage change.
For the evaluation, if $R = \frac{\text{Statistical power of absolute change}}{\text{Statistical power of percentage change}} > 1$ [47], then absolute change has better statistical power than percentage change and we choose absolute change; otherwise, we choose percentage change.
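The two change indicators of Eqs. 5 and 6 and the selection rule based on the ratio R can be sketched as follows. The function names are hypothetical. The MAPE values in the example are the random forest values for the 45 highest-ranking variables reported in the results; no statistical power values are given in this section, so the selection function is shown without a numerical example.

```python
def percentage_change(b_mape: float, a_mape: float) -> float:
    """Eq. 5: (B_MAPE - A_MAPE) / B_MAPE * 100."""
    return (b_mape - a_mape) / b_mape * 100.0


def absolute_change(b_mape: float, a_mape: float) -> float:
    """Eq. 6: |B_MAPE - A_MAPE|."""
    return abs(b_mape - a_mape)


def preferred_indicator(power_absolute: float, power_percentage: float) -> str:
    """Choose absolute change when R = power_absolute / power_percentage > 1."""
    ratio = power_absolute / power_percentage
    return "absolute change" if ratio > 1 else "percentage change"


# Random forest, 45 variables: MAPE before = 2.125891, MAPE after = 7.588079.
# The percentage change is negative because the MAPE rose after removal.
print(percentage_change(2.125891, 7.588079))
print(absolute_change(2.125891, 7.588079))
```

With this sign convention, a positive percentage change means the MAPE before heterogeneity is higher than the MAPE after heterogeneity, and a negative value means the opposite.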

Results and discussion
In this research, the assumptions of linear regression are verified to understand the data. The heterogeneity parameters among the seaweed drying parameters are identified. To determine the significant factors that determine the moisture content removal of the seaweed, seven popular supervised machine learning algorithms, namely ridge, random forest, support vector machine, bagging, boosting, LASSO, and elastic net, are utilized. Furthermore, metric validations were conducted, and hybrid models were developed.
The variability of the 29 main parameters is shown in Fig. 5. Each box-plot represents one drying parameter for the seaweed and helps to understand the heterogeneity among the main parameters. The points outside the box-plot are the outliers. A box-plot uses the 5-number summary of Q1, Q2, Q3, minimum and maximum value to summarise the data. The assumption of linearity between the dependent and independent variables is checked. No linear relationship exists between them. The assumption of no multicollinearity among the independent variables is not satisfied. The values of the variance inflation factor (VIF) are high; the highest value of the VIF was 75,337.29, which shows a high level of multicollinearity. The assumption that the observations are independent is also checked using the Durbin-Watson test. From the results we obtained, the p-value of 0 is less than the significance level α = 0.05, which shows that the residuals are autocorrelated. It means that the observations are not independent. In addition, the normality assumption is also checked with the Kolmogorov-Smirnov test. The p-value of 2.2e−16 is less than 0.05, so the normality assumption is rejected. Based on the results in Table 3, the parameters T7, T11, H5, T6, T8, H1, and PY exhibit heterogeneity. This is also evident in Fig. 5.
After removing the seven parameters that exhibit heterogeneity and including the second order interaction, there are 253 parameters that determine the moisture content removal of the seaweed. The selection of important features was used by [54,55]. The summary of the assessment results for the ML models is stated in Table 4. Before the heterogeneity parameters are removed, all validation measures reveal that random forest outperforms the other models in predicting the significant parameters. In addition, the evaluation measures MAPE (2.125891), MSE (7.330011) and R-squared (0.9732063) indicate that random forest obtains significantly better results for the 45 highest important variables than the other models do for their 45 highest important variables. After the heterogeneity parameters are removed, all validation measures also reveal that random forest outperforms ridge, support vector machine, bagging, boosting, LASSO, and elastic net in predicting the significant parameters that determine the moisture content removal of the seaweed.
In addition, the evaluation measures MAPE (7.588079), MSE (44.39000) and R-squared (0.8377405) indicate that random forest obtains significantly better results for the 45 highest important variables than ridge, support vector machine, bagging, boosting, LASSO, and elastic net do for their 45 highest important variables. Since the random forest algorithm performed better than the other methods based on the results of the metrics, the 15, 25, 35 and 45 highest important variables for random forest are the most important parameters that accurately forecast the moisture content removal of the seaweed. This also confirms the results of [27,54,59,60], where random forest performed better than the other methods. All the MAPE values for random forest are less than 10, which indicates high prediction accuracy for the predictive model. This is in line with [61], which claims that a MAPE value of less than 10 indicates high prediction accuracy.
Fig. 7 Comparison between the standardized residuals for 45 highest ranking variables for random forest before and after heterogeneity
Fig. 8 Comparison between the standardized residuals for 45 highest ranking variables for support vector machine before and after heterogeneity
Comparing the validation metrics before and after the heterogeneity parameters are removed, generally for ridge, random forest, support vector machine, bagging, boosting, LASSO, and elastic net in Table 4, the MAPE and MSE after the heterogeneity parameters are removed are higher than the MAPE and MSE when the heterogeneity parameters have not been removed from the model. Also, the R-squared values after the heterogeneity parameters are removed are lower than the R-squared values before they are removed. The results show that the removal of some variables can reduce the accuracy of the model.
Fig. 9 Comparison between the standardized residuals for 45 highest ranking variables for bagging before and after heterogeneity
Fig. 10 Comparison between the standardized residuals for 45 highest ranking variables for boosting before and after heterogeneity
The heterogeneity parameters that were removed did not increase the accuracy of the model. According to [62], if an MAPE validation is equal or less after the removal of a parameter, it does not mean that the parameter has no effect on the response variable. It means that the variability level in the data was not enough to be explained by the model. The percentage change for ridge 15, bagging 15, LASSO 15 and elastic net is positive. This represents 14.3% of the total number of models and the few cases where MAPE before heterogeneity is higher than MAPE after heterogeneity. The percentage change of 24 models is negative, which means that the MAPE before heterogeneity is lower than the MAPE after heterogeneity. This represents 85.7% of the total number of models. Random forest 15, 25, 35 and 45 models have the highest negative percentage change compared to other models.
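The two percentages quoted above follow from simple counting. The sketch below assumes that the total number of models is the seven algorithms crossed with the four subset sizes (15, 25, 35 and 45), which gives 28 models and is consistent with the reported 14.3% and 85.7%; the helper name is illustrative.

```python
algorithms = ["ridge", "random forest", "support vector machine",
              "bagging", "boosting", "LASSO", "elastic net"]
subset_sizes = [15, 25, 35, 45]

# Assumption: one model per (algorithm, subset size) pair.
total_models = len(algorithms) * len(subset_sizes)


def share_percent(count: int, total: int) -> float:
    """Share of models as a percentage, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


positive_changes = 4   # models with a positive percentage change (from the text)
negative_changes = 24  # models with a negative percentage change (from the text)

print(total_models)
print(share_percent(positive_changes, total_models))
print(share_percent(negative_changes, total_models))
```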
In summary, the validation metrics are used to evaluate the ability of ridge, random forest, support vector machine, bagging, boosting, elastic net, and LASSO, in order to reach more substantial and significant conclusions. The results are shown in Table 4 for all models. It is observed that random forest shows higher accuracy than the other models. This shows the superiority of random forest over the other models before and after heterogeneity, with higher accuracy and the lowest errors. According to [54], the number of parameters is crucial because fewer parameters reduce the training time and avoid the curse of dimensionality. The comparison of the statistical power is shown in Table 5. The ratio of the test statistic for absolute change to percentage change is less than 1. This shows that percentage change has better statistical power than absolute change to explain the results and draw valid conclusions.
Table 6 shows the results of the hybrid model and the original model before and after heterogeneity for 45 high-ranking variables.The 3-sigma limits are also provided to identify the number of outliers and make comparisons.For the ridge before heterogeneity, the best robust estimator is M Hampel with 16 outliers, while the original has 23 outliers.
For the random forest before heterogeneity, the best robust estimator is M Hampel with 19 outliers, while the original has 45 outliers. For the support vector machine before heterogeneity, the best robust estimator is M Hampel with 23 outliers, while the original has 24 outliers. For the elastic net before heterogeneity, the best robust estimators are M Hampel and M Huber with 33 outliers, while the original has 29 outliers. From these results, before heterogeneity, M Hampel robust estimation performs better than M Bi-Square, M Huber, MM and S.
For the ridge after heterogeneity, the best robust estimators are M Bi-Square and MM with 22 outliers, while the original has 29 outliers. For the random forest after heterogeneity, the best robust estimator is M Hampel with 29 outliers, while the original has 41 outliers. For the support vector machine after heterogeneity, the best robust estimator is M Bi-Square with 27 outliers, while the original has 24 outliers. For the bagging after heterogeneity, the best robust estimator is M Hampel with 21 outliers, while the original has 28 outliers. For the elastic net after heterogeneity, the best robust estimators are M Hampel and M Huber with 23 outliers, while the original has 33 outliers. From these results, after heterogeneity, the ridge performs better with M Bi-Square and MM. Random forest, bagging and boosting perform better with M Hampel. Support vector machine and LASSO perform better with M Bi-Square. The elastic net performs better with M Hampel and M Huber.
Generally, the outliers using the 3-sigma limits for before and after heterogeneity indicate that for the original model, the number of outliers increases from before heterogeneity to after heterogeneity for ridge, LASSO, and elastic net.It is constant for support vector machine.It decreases for random forest, bagging and boosting.

Conclusions and future work
The heterogeneity parameters are identified, and hybrid models were developed to forecast the significant drying parameters that determine the moisture content removal of the seaweed after drying.Seven predictive models, such as ridge, random forest, support vector machine, bagging, boosting, LASSO, and elastic net are used for determining the significant parameters in conjunction with robust methods.These hybrid models are useful for determining the significant parameters that determine the moisture content removal of the seaweed.For before heterogeneity, the hybrid model random forest M Hampel with 19 outliers is the best, because it performs better when compared to other models.For after heterogeneity, the hybrid model boosting M Hampel with 19 outliers is the best, because it performs better when compared to other models.For future studies, the traditional statistical methods and machine learning models for predicting the moisture content removal of seaweed can be compared.The number of selected drying parameters can be increased or all the parameters with interaction can be used.Other robust estimators such as least trimmed squares (LTS), least absolute deviation (LAD) and least median of squares (LMS) estimators can be used to develop a hybrid model.

Fig. 3 Flowchart for the study

Percentage change: $P_c = \frac{B_{MAPE} - A_{MAPE}}{B_{MAPE}} \times 100$ (5)
Absolute change: $A_c = |B_{MAPE} - A_{MAPE}|$ (6)
The residuals do not come from a normal distribution. Figures 6, 7, 8, 9, 10, 11 and 12 show the standardised residual plots for the ridge, RF, SVM, bagging, boosting, LASSO and elastic net before and after heterogeneity.

Fig. 11 Comparison between the standardized residuals for 45 highest ranking variables for LASSO before and after heterogeneity

Table 4 Determination of optimal machine learning models before and after heterogeneity

Table 5 Comparison of statistical power

Table 6 Comparison between the number and percentage of outliers outside the 3-sigma limits for the original and hybrid models for 45 high-ranking variables