Short-term photovoltaic power production forecasting based on novel hybrid data-driven models
Journal of Big Data volume 10, Article number: 26 (2023)
Abstract
The uncertainty associated with photovoltaic (PV) systems is one of the core obstacles that hinder their seamless integration into power systems. The fluctuation, which is influenced by weather conditions, poses significant challenges to local energy management systems. Hence, the accuracy of PV power forecasting is very important, particularly in regions with high PV penetration. This study addresses this issue by presenting a framework of novel forecasting methodologies based on hybrid data-driven models. The proposed forecasting models hybridize Support Vector Regression (SVR) and Artificial Neural Network (ANN) with different Metaheuristic Optimization Algorithms, namely Social Spider Optimization, Particle Swarm Optimization, Cuckoo Search Optimization, and Neural Network Algorithm. These optimization algorithms are utilized to improve the predictive efficacy of SVR and ANN, where the optimal selection of their hyperparameters and architectures plays a significant role in yielding precise forecasting outcomes. In addition, the proposed methodology aims to reduce the burden of random or manual estimation of such parameters and to improve the robustness of models that are prone to under- and overfitting without proper tuning. The results of this study exhibit the superiority of the proposed models. The proposed SVR models improve on the default SVR models, with Root Mean Square Error reductions between 12.001% and 50.079%. Therefore, the outcomes of this research work can support the ongoing efforts in developing accurate data-driven models for PV forecasting.
Introduction
The growing adoption of emission-free energy from renewable energy technologies, such as solar photovoltaic (PV), has required changes in distribution system operation. The resulting operational obstacles are due to the intermittent nature of solar power, which requires additional ancillary services to control the variability in PV system generation [1]. However, these services are economically unfeasible, and adopting them may discourage installing PV systems in distribution networks [2]. Therefore, an accurate prediction of the amount of energy from the PV system would help mitigate the technical issues of these PV systems [3]. There are various forecasting objectives in electrical power systems, such as electrical load consumption [4,5,6], wind power [7,8,9], solar irradiance [10,11,12], and electricity market forecasts [3]. This study focuses on solar PV power forecasting.
PV power output is highly correlated with meteorological variables, such as solar irradiance, wind speed, humidity, and temperature. These variables depend mainly on the geographical location and the climate conditions at the site in question. In terms of the PV power forecasting horizon, four main categories are considered: very short-term forecasting (1 s–< 1 h), short-term forecasting (1 h–24 h), medium-term forecasting (1 week–1 month), and long-term forecasting (1 month–1 year). According to [13, 14], the PV output prediction horizon should be identified before choosing the forecasting technique because the forecasting accuracy decreases as the forecasting time increases. Furthermore, the choice of forecasting time depends on the desired application. For instance, very short-term forecasting can be applied for power smoothing, real-time dispatch and control, and regulation services, while short-term forecasting is primarily focused on load-following and zone-control purposes [15]. Medium-term forecasting is useful for preserving the power system planning and maintenance schedule, whereas long-term forecasting assists in generation planning, energy bidding, and security operation [13].
Concerning the forecasting techniques, physical, statistical, and hybrid-based prediction models can be employed for PV power production. Physical approaches are mathematical models that use weather forecast data obtained from numerical weather prediction (NWP), while statistical methods utilize historical data to predict future behavior without prior knowledge about the system state [16]. The hybrid method combines two independent forecasting methods to overcome each other's drawbacks and strengthen their advantages by adding some optimization algorithms [3]. The statistical methods are divided into \((i)\) time series models, i.e., autoregressive, autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA), and \((ii)\) machine learning methods, i.e., artificial neural network (ANN), support vector regression (SVR), and extreme learning machine.
A systematic literature review of PV power production forecasting can be found in [17]. The authors in [18] compare statistical approaches, namely ARMA, ARIMA, and seasonal ARIMA, with six different ANNs to forecast the output power of a PV plant. Eight time delays in power production from a PV plant are used as the input variables to generate the forecasting results. The analysis shows that ANN performs better than time series models with less computation time. The paper in [19] uses ANN and NWP data to predict the power output of a PV system located in Puglia, Italy. They use temperature and solar irradiation as predictors of the forecasting algorithm. Results show that the proposed model provides good prediction results with a 10% error value.
The authors in [20] present a methodology for PV power forecasts using machine learning algorithms and statistical post-processing. They use ANN as the forecasting method and a linear regression correction to enhance the forecasting accuracy. Results show that the proposed model has good accuracy, as the Mean Absolute Percentage Error (\(MAPE\)) was 4.7% using the historical dataset. The study in [21] examines the performance of different machine learning algorithms to forecast the hourly production of a PV system, including k-nearest neighbor (kNN), multiple regression (MLR), and decision tree regression (DTR). They employ weather data, such as solar irradiance and temperature, as input variables to the forecasting models. Results show that the kNN has superior performance compared to MLR and DTR with a Root Mean Square Error (\(RMSE\)) of 18.68%, Mean Absolute Error (\(MAE\)) of 80.6%, and a normalized \(RMSE\) (\(nRMSE\)) of 13.2%. The recent study in [22] compares 24 machine learning algorithms for a day-ahead power forecast using numerical weather predictions (NWP). The study concludes that the selection of input variables and hyperparameter tuning is more important than the model selection. In their study, the model that considers the sun position angles and irradiance readings after statistical processing results in a 13.1% decrease in \(RMSE\) compared to the basic case (Global Horizontal Irradiance (GHI), temperature, and wind speeds).
Machine learning algorithms have proved their effectiveness in different forecasting objectives as they can capture and deal with the nonlinearity in forecasting problems compared to other forecasting methods. The statistical approaches have the advantage of handling high data volume [23]. The SVR, on the other hand, has superior forecasting performance with small data samples [24]. In addition, ANN can perform any nonlinear mapping through the learning process [25]. This makes SVR and ANN favorable choices. However, the main drawback of applying SVR and ANN algorithms is that they are sensitive to specific parameters. For instance, SVR depends highly on the kernel function hyperparameters, namely the error penalty parameter \(\left(C\right)\) and the width \((\gamma )\). Also, ANN performance is greatly influenced by the number of hidden layers and the number of neurons in each hidden layer. Therefore, hybrid models have recently been investigated in the literature to overcome the aforementioned disadvantages of SVR and ANN.
Metaheuristic optimization algorithms (MOA), such as Simulated Annealing (SA), Particle Swarm Optimization (PSO), Genetic Algorithm (GA), and Grasshopper Optimization Algorithm (GOA), have been used to select the appropriate parameters. For example, the authors in [26] use GA to tune the SVR parameters to forecast the price of electricity in Australia. For the PV power generation forecast, a hybrid model is created in [27] between GA and SVR (GA-SVR) to optimize different kernel function parameters. Study results demonstrate that GA-SVR is more accurate than the conventional SVR, with improvements in \(RMSE\) value of 669.624 and 98.7648% in the \(MAPE\). In addition, the study by Netsanet et al. [28] proposes a hybrid PV power forecasting model using variational mode decomposition with ANN and Ant Colony Optimization (ACO). The role of ACO is to improve the performance of ANN by optimizing its weights and biases during the training phase. The proposed model shows high-accuracy outcomes with a coefficient of determination, \({R}^{2}\), of 0.9768.
Motivation and contributions of the study
From the above discussion, the hybrid forecasting methods have shown a good performance compared to other methods. In addition, different MOA methods have been applied in the literature to improve the SVR and ANN prediction performance. The primary objective of such optimization algorithms is to determine the optimal parameters of SVR and ANN. However, there is no clear consensus on which algorithm should be used to estimate these parameters. Therefore, in this study, hybrid PV forecasting models are proposed based on machine learning algorithms, which utilize SVR and ANN optimized with four MOA, namely Social Spider Optimization (SSO), PSO, Cuckoo Search Optimization (CSO), and Neural Network Algorithm (NNA). These algorithms are used to improve the predictive efficacy of the selected algorithms, where the optimal selection of their hyperparameters and architectures plays a significant role in yielding precise forecasting outcomes. Hence, the following are the primary contributions of this study to the field of PV power forecasting:

1. The SVR and ANN are machine learning algorithms used in this study to exploit the underlying big data patterns and forecast future values of PV power outputs.

2. As the prediction performance of SVR and ANN depends highly on their hyperparameters and architectures, respectively, an intelligent framework is proposed in this study to ease the burden of manual parameter setting and expedite the forecasting process.

3. As the optimal selection of their hyperparameters and architectures plays a significant role in yielding precise forecasting outcomes, this paper uses four MOA, namely SSO, PSO, CSO, and NNA, to improve the predictive efficacy of the selected algorithms.

4. This paper uses different independent combinations of variables as inputs to identify the suitable variables that give the best PV power forecasting outcomes. This will help overcome the computational burden and complexity that may exist in the input features. These variables are time, weather, and historical data of the PV power generation.

5. Although this work aims to forecast the output power of a PV system located in Riyadh city, Saudi Arabia, the proposed framework is useful for determining the best forecasting models in various locations.
The rest of the paper is organized as follows: In “Methodology” section, the study framework is described together with SVR and ANN algorithms and the MOA methods, including SSO, PSO, CSO, and NNA. The main findings and the comparison outcomes among the prediction models and MOA approaches are in “Results and discussion” section. Finally, “Conclusion and future work” section contains the conclusion of this study.
Methodology
This section presents the proposed hybrid forecasting techniques and other forecasting algorithms used in this study to predict the PV power output. The proposed forecasting methods hybridize SVR and the backpropagation neural network (BPNN) with four MOA. These algorithms are SSO, PSO, CSO, and NNA. Initially, the forecasting framework is highlighted. After that, the fundamentals of the BPNN and SVR are explained together with the MOA. Finally, the criteria to evaluate the forecasting models’ accuracy are described.
Framework of the proposed forecasting models
Sixteen hybrid and three default models have been developed to enhance the accuracy of the prediction. In this study, therefore, the forecasting approaches used are as follows:

SVR based on the RB function with SSO, PSO, CSO, and NNA: \(\left(SSO\text{-}SV{R}_{RB}\right)\), \(\left(PSO\text{-}SV{R}_{RB}\right)\), \(\left(CSO\text{-}SV{R}_{RB}\right)\), and \(\left(NNA\text{-}SV{R}_{RB}\right)\).

SVR based on the linear function with SSO, PSO, CSO, and NNA: \(\left(SSO\text{-}SV{R}_{linear}\right)\), \(\left(PSO\text{-}SV{R}_{linear}\right)\), \(\left(CSO\text{-}SV{R}_{linear}\right)\), and \(\left(NNA\text{-}SV{R}_{linear}\right)\).

BPNN model with one hidden layer with SSO, PSO, CSO, and NNA: \(\left(SSO\text{-}BPN{N}^{1}\right)\), \(\left(PSO\text{-}BPN{N}^{1}\right)\), \(\left(CSO\text{-}BPN{N}^{1}\right)\), and \(\left(NNA\text{-}BPN{N}^{1}\right)\).

BPNN model with two hidden layers with SSO, PSO, CSO, and NNA: \(\left(SSO\text{-}BPN{N}^{2}\right)\), \(\left(PSO\text{-}BPN{N}^{2}\right)\), \(\left(CSO\text{-}BPN{N}^{2}\right)\), and \(\left(NNA\text{-}BPN{N}^{2}\right)\).

Default SVR model based on RB function \(\left(SV{R}_{RB}^{D}\right).\)

Default SVR model based on linear function \(\left(SV{R}_{linear}^{D}\right).\)

Default BPNN model \(\left(BPN{N}^{D}\right).\)
This study implements SVR-based kernel functions and BPNN by employing MATLAB R2020a and LIBSVM tools [29]. The framework that explains the proposed PV power output forecast is depicted in Fig. 1. This framework can be used for any forecasting objective in other countries. The process is described as follows:
Step 1: Data preparation: input data are initially collected, checked, cleaned, and normalized to reduce the numerical burden on the forecasting algorithms during the training phase and on the parameter search.

Step 2: Correlation values: the importance of each data feature is investigated against the output feature (PV power). The Pearson Correlation Coefficient is used in this study; see “Feature combinations” section.

Step 3: Data splitting: the input data are divided into training and testing datasets. The training data are used to train the forecasting algorithms, while the testing data are used to test the forecasting models' performance. To validate the stability of the forecasting model, tenfold cross-validation is used. The cross-validation process is described in “Crossvalidation” section.

Step 4: Parameter tuning: SSO, PSO, CSO, and NNA algorithms are applied to determine the best SVR hyperparameters and the best BPNN network configurations for all the considered feature combinations. SVR parameters are \(C\) and \(\gamma \) for the RB function and \(C\) for the linear function. BPNN parameters are the number of neurons in each hidden layer. In this study, one and two hidden layers are assumed.

Step 5: Building the forecasting models: by using the best parameters obtained in Step 4, sixteen hybrid models are generated for each of the considered feature combinations, namely \(SSO\text{-}SV{R}_{RB}\), \(PSO\text{-}SV{R}_{RB}\), \(CSO\text{-}SV{R}_{RB}\), \(NNA\text{-}SV{R}_{RB}\), \(SSO\text{-}SV{R}_{linear}\), \(PSO\text{-}SV{R}_{linear}\), \(CSO\text{-}SV{R}_{linear}\), \(NNA\text{-}SV{R}_{linear}\), \(SSO\text{-}BPN{N}^{1}\), \(PSO\text{-}BPN{N}^{1}\), \(CSO\text{-}BPN{N}^{1}\), \(NNA\text{-}BPN{N}^{1}\), \(SSO\text{-}BPN{N}^{2}\), \(PSO\text{-}BPN{N}^{2}\), \(CSO\text{-}BPN{N}^{2}\), and \(NNA\text{-}BPN{N}^{2}\).

Step 6: Generating results: the hybrid forecasting models created in Step 5 are tested on the testing dataset determined in Step 3. Their output is the prediction results.

Step 7: Results comparison: the outputs of the forecasting models are then compared with the actual values of the PV power output using \(RMSE\), \(nRMSE\), \(MAE\), and normalized \(MAE\) \((nMAE)\). The results are compared and then analyzed.
Each process is described in detail in the subsections below.
Study site and dataset
The datasets of the PV power output are collected from a rooftop PV system placed on a mosque in Riyadh city, Saudi Arabia. Five PV inverters are installed on this site, giving the PV system a capacity of 120 kWp. This location is operated by both King Abdelaziz City for Science and Technology and Saudi Electricity Company. The PV output data gathered from the unit are in 1-h intervals for the period between June 3rd, 2017, and August 31st, 2018. The maximum power production from the system occurred on March 25th, 2018, with a total active power production of 105.09285 kW at 11:00 A.M. The hourly data contain numerical readings from the PV system at night hours when no irradiance is expected. To deal with such data, all readings below 100 W are set to zero, implying that there is no output power from the PV system after sunset and before sunrise. The meteorological weather data used in this study are recorded hourly at the same location as the PV system. They are collected from a solar station operated by the King Abdullah City for Atomic and Renewable Energy (K.A.CARE). Figure 2 shows the solar map of Saudi Arabia and the site.
Data preparation
The weather and PV power data are required to be prepared. Two main steps are necessary for data preparation. These steps are data cleaning and data normalization. Each of these steps is described below:
Data cleaning
Data cleaning is a very significant step in creating a successful forecasting model. Since we are dealing with historical data from different sources, these data could be imprecise, impacting the performance of the forecasting models. This step removes all the missing PV power data with the associated time and weather variables.
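The cleaning rule above, together with the night-hour threshold from the study-site description, can be sketched as follows. This is only an illustration: the record layout and the keys `hour`, `ghi`, and `pv_kw` are hypothetical, since the layout of the authors' MATLAB dataset is not given.

```python
NIGHT_THRESHOLD_KW = 0.1  # 100 W, the threshold stated in the text


def clean_records(records):
    """Drop records with missing PV power and zero out sub-threshold readings."""
    cleaned = []
    for rec in records:
        power = rec.get("pv_kw")
        if power is None:
            # Missing PV power: discard the whole record, including its
            # associated time and weather fields.
            continue
        fixed = dict(rec)  # copy so the input records stay untouched
        if power < NIGHT_THRESHOLD_KW:
            fixed["pv_kw"] = 0.0
        cleaned.append(fixed)
    return cleaned


# Hypothetical hourly records (field names are assumptions).
sample = [
    {"hour": 2, "ghi": 0.0, "pv_kw": 0.04},
    {"hour": 11, "ghi": 950.0, "pv_kw": 98.2},
    {"hour": 12, "ghi": 960.0, "pv_kw": None},
]
cleaned_sample = clean_records(sample)
```

Copying each record keeps the raw data available for auditing which readings were changed.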
Data normalization
Input data normalization is critical for preparing the data before investigating the performance of forecasting models. This step aims to reduce the likelihood that features with high numerical values will dominate those with lower ones [31]. The input data listed in Table 1 are normalized between 0 and 1 using Eq. (1):
\[{x}_{i}^{n}=\frac{{x}_{i}-{x}_{min}}{{x}_{max}-{x}_{min}} \tag{1}\]
where \({x}_{i}\) is the observed value; \({x}_{i}^{n}\) is the reading value after normalization, while \({x}_{max}\) and \({x}_{min}\) are the maximum and minimum values corresponding to the observed dataset, respectively.
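A minimal sketch of min-max normalization to the range 0 to 1, as described for Eq. (1), is given below. The function name, the sample temperatures, and the handling of a constant feature are assumptions made for illustration.

```python
def min_max_normalize(values):
    """Scale a list of readings to [0, 1] using its own minimum and maximum."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    span = x_max - x_min
    return [(x - x_min) / span for x in values]


# Hypothetical air-temperature readings in degrees Celsius.
temps = [18.0, 25.0, 32.0, 46.0]
normalized = min_max_normalize(temps)
```

In a train/test setting, one common choice is to compute the minimum and maximum on the training data only and reuse them for the test data; the text does not state which convention the study follows.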
Forecasting models input variables
In this work, the forecasting objective is a one-hour-ahead forecast of PV power generation from a PV system located in Riyadh city, Saudi Arabia. Therefore, to obtain the best forecasting model of PV power output, the proposed models are trained and tested using three types of variables considered at the study location. As mentioned in the literature, the PV output is greatly influenced by time, weather, and historical data of PV power generation. Table 1 lists the input variables used in this study.
The variable \(({v}_{i}^{4})\), for example, is the air temperature \((^\circ \mathrm{C})\), where \(i\) indexes the hourly readings, so each day has 24 air temperature values. After that, the dataset (\(V\)) is split into two groups, namely the training dataset, \({v}_{train}\), and the test dataset, \({v}_{test}\), such that \(V={v}_{train}\cup {v}_{test}\). In this paper, 80% of the data are used for the training phase, while 20% are used in the testing phase. The cross-validation technique is utilized to tune the hyperparameters of the SVR models and the BPNN network configuration.
Feature combinations
To forecast the PV output (\({P}_{out})\), various independent combinations of variables are used as inputs. As more input variables do not always lead to better forecasting outcomes [32], the primary goal of combining different sets of input variables is to identify the suitable variables that give the best forecasting results at this site. In this study, the Pearson Correlation Coefficient is used to measure the importance of each variable with respect to the observed values of the PV power. The Pearson correlation formula is given in Eq. (2):
\[\rho =\frac{cov\left({x}_{features},{x}_{PVpower}\right)}{{\sigma }_{features}\,{\sigma }_{PVpower}} \tag{2}\]
where \(\rho \) is the correlation coefficient, \(cov\) is the covariance, and \({\sigma }_{features}\) and \({\sigma }_{PVpower}\) are the standard deviations of the input variables, \({x}_{features}\), and the PV power readings, \({x}_{PVpower}\), respectively. Figure 3 displays the correlation results between the input variables and the PV power readings at the study site.
Table 2 contains the variables for each feature combination. Considering the feature combination (12), for example, the \({P}_{out} ({v}_{i}^{1}, {v}_{i}^{2},{v}_{i}^{3},{v}_{i}^{8},{v}_{i}^{10})\) is a function of Month, Day, Hour, global horizontal irradiance, GHI, (Wh/m^{2}), and PV power output at the same hour on the previous day (kW).
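For illustration, the Pearson Correlation Coefficient used to rank the input variables can be computed as in the sketch below. The variable names and sample values are hypothetical and are not taken from the study's dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the two variances cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings: irradiance, PV power, and a variable that falls as power rises.
ghi = [0.0, 200.0, 600.0, 900.0]
pv_kw = [0.0, 20.0, 60.0, 90.0]
inverse_var = [90.0, 70.0, 30.0, 0.0]

r_ghi = pearson(ghi, pv_kw)
r_inverse = pearson(ghi, inverse_var)
```

Values close to 1 or -1 indicate a strong linear relationship with the PV power, while values near 0 suggest a feature adds little linear information.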
Crossvalidation
To evaluate the performance of the proposed forecasting models, it is not ideal to base the evaluation on a single test set. Therefore, to examine the forecasting model performance over different test data, \(k\)-fold cross-validation should be employed. Cross-validation is a procedure in which the data are split into \(k\) subsets [33]; see Fig. 4. These \(k\) subsets are further divided into validation and training groups: a single subset is used as the validation set, while the remaining \(k-1\) subsets are used as training subsets. This procedure is repeated \(k\) times until each of the \(k\) subsets has been used as a validation set. Hence, the overall result does not depend on a single training set, which may otherwise affect the robustness of the forecasting models [34].
It is worth mentioning that the \(k\)-fold cross-validation procedure is conducted without the testing dataset. The primary goal of cross-validation is to examine the generalization of a forecasting model. In this study, tenfold cross-validation is used. In other words, the training dataset is divided into 10 subsets. One subset is held out for validation, while the remaining nine subsets are utilized for training the forecasting model. This process is repeated ten times, resulting in ten training and validation folds, where the \(nRMSE\) is recorded for each of them. The average of the tenfold \(nRMSE\) results is then reported.
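The tenfold procedure can be sketched as follows. The use of contiguous, unshuffled folds is an assumption, because the text does not say how samples are assigned to folds, and `score_fold` is a hypothetical callback that would train a model on the training indices and return the \(nRMSE\) on the held-out indices.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validated_score(n_samples, score_fold, k=10):
    """Average the per-fold score returned by score_fold(train_idx, val_idx)."""
    scores = [score_fold(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

In the tuning loop, the averaged score would serve as the fitness value that the optimization algorithm tries to minimize for each candidate parameter set.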
Backpropagation neural network
The artificial neural network (ANN) has been used in various forecasting applications. ANN is an information computing system that mimics the way the human brain analyzes information [35]. Like the human brain, it is built from a large number of interconnected neuron nodes that work together to solve problems, which is the distinguishing feature of this network. Backpropagation is one of the most widely used ANN learning methods. Figure 5 depicts a multilayer feedforward neural network.
The ANN consists of three main layers, namely the input layer \({\left[{x}_{1}, {x}_{2}, \dots ,{x}_{N}\right]}^{T}\), the hidden layer \({\left[{h}_{1}, {h}_{2}, \dots ,{h}_{N}\right]}^{T}\), and the output layer \({\left[{y}_{1}, {y}_{2}, \dots ,{y}_{N}\right]}^{T}\). The model output, therefore, can be calculated by Eq. (3) [36]:
\[y={\alpha }_{0}+\sum_{j=1}^{n}{\alpha }_{j}\,f\left(\sum_{i=1}^{m}{\beta }_{ij}{x}_{i}+{\beta }_{0j}\right) \tag{3}\]
where \(m\) is the number of nodes at the input layer, while \(n\) is the number of nodes at the hidden layer. \(f\) is a sigmoid transfer function, which is the logistic function in this study, \(f\left(x\right)=\frac{1}{1+\mathrm{exp}(-x)}\). \(\{{\alpha }_{j}, j = 0, 1, \dots ,n\}\) is the vector of weights that link the hidden layer and the output layer, and \(\{{\beta }_{ij}, i = 0, 1, \dots ,m; j = 1, 2, \dots ,n\}\) are the weights that link the input nodes with the hidden nodes. \({\alpha }_{0}\) and \({\beta }_{0j}\) are the weights of the arcs leading from the bias terms, whose values are equal to 1.
The number of nodes in each hidden layer is optimized using SSO, PSO, CSO, and NNA. This study identifies the multilayer perceptron (MLP) for the BPNN model, while the Levenberg–Marquardt method is chosen as the training function.
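As an illustration of how a single-hidden-layer network of this kind produces its output, the sketch below evaluates a logistic hidden layer followed by a linear output node. The linear output node, the function names, and the weight layout (`beta[j][i]` is the weight from input \(i\) to hidden node \(j\)) are assumptions made for this example. Training with the Levenberg–Marquardt method is not shown.

```python
import math


def logistic(z):
    """Logistic sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, beta, beta0, alpha, alpha0):
    """Forward pass: inputs -> logistic hidden layer -> linear output.

    x      : list of m inputs
    beta   : n rows of m input-to-hidden weights
    beta0  : n hidden-node bias weights
    alpha  : n hidden-to-output weights
    alpha0 : output bias weight
    """
    hidden = [
        logistic(b0 + sum(w * xi for w, xi in zip(row, x)))
        for row, b0 in zip(beta, beta0)
    ]
    return alpha0 + sum(a * h for a, h in zip(alpha, hidden))


# With all input weights at zero, every hidden node outputs logistic(0) = 0.5.
y_hat = forward([0.3, 0.7], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [1.0, 1.0], 0.1)
```

The number of rows in `beta` corresponds to the number of hidden nodes, which is the quantity tuned by the optimization algorithms in this study.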
Support vector regression
Support vector machine (SVM) is a supervised learning approach utilized for classification, regression, or outlier detection. When two classes cannot be separated, a kernel function is employed to map the input space to another high-dimensional space [37]. In that new space, the data can be separated linearly. There are three well-known kernel functions for this purpose: linear, polynomial, and radial kernel functions [38]. SVR inherits some of the SVM properties. However, unlike SVM, SVR performs regression, in which errors are measured against a predefined threshold; see Fig. 6 [39].
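The SVR optimization problem is formulated in Eq. (4), while the kernels used with the SVR are provided in Eqs. (5) and (6).
The SVR requires solving the following optimization problem:
\[\underset{w,b,\xi ,{\xi }^{*}}{\mathrm{min}}\ \frac{1}{2}{\Vert w\Vert }^{2}+C\sum_{i=1}^{l}\left({\xi }_{i}+{\xi }_{i}^{*}\right)\quad \text{subject to}\quad \left\{\begin{array}{l}{y}_{i}-{w}^{T}\phi \left({x}_{i}\right)-b\le \varepsilon +{\xi }_{i}\\ {w}^{T}\phi \left({x}_{i}\right)+b-{y}_{i}\le \varepsilon +{\xi }_{i}^{*}\\ {\xi }_{i},{\xi }_{i}^{*}\ge 0,\ i=1,\dots ,l\end{array}\right. \tag{4}\]
where \(w\) and \(b\) are the weight vector and bias of the regression function \(f\left(x\right)={w}^{T}\phi \left(x\right)+b\), \({\xi }_{i}\) and \({\xi }_{i}^{*}\) are slack variables, \(l\) is the number of training samples, and \(C > 0\) is a constant that sets the trade-off between the flatness of \(f\) and the amount up to which deviations larger than \(\varepsilon \) are tolerated.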
As mentioned, the input space represented by the input features, or the training dataset, is mapped into a new high-dimensional space by the function \(\phi \). This is known as the kernel trick, \(K\left({x}_{i},{x}_{j}\right)=\phi {\left({x}_{i}\right)}^{T}\phi \left({x}_{j}\right)\). This research work uses two kernel functions, namely the radial basis (\(RB\)) and the linear (\(linear\)) functions. They can be written as [40]:
\[{K}_{RB}\left({x}_{i},{x}_{j}\right)=\mathrm{exp}\left(-\gamma {\Vert {x}_{i}-{x}_{j}\Vert }^{2}\right) \tag{5}\]
\[{K}_{linear}\left({x}_{i},{x}_{j}\right)={x}_{i}^{T}{x}_{j} \tag{6}\]
where \(\gamma \) (gamma) is the kernel parameter, which is estimated by the optimization algorithms in this study.
The choice of the two hyperparameters, \(C\) and \(\gamma \), is critical in enhancing the accuracy of the forecasting models. The parameter \(C\) governs the empirical risk of SVR, while parameter \(\gamma \) controls the width of the radial basis function [41]. Researchers are accustomed to determining these parameters either by their insights, prior knowledge from other studies [42], or by using approaches such as grid search [39]. Hence, \(C\) and \(\gamma \) are optimally selected by utilizing SSO, PSO, CSO, and NNA to build the hybrid models. This is described in the next Sections.
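To make the role of \(\gamma \) concrete, the sketch below evaluates the radial basis and linear kernels in their standard forms (the forms used by LIBSVM). The function names and sample vectors are illustrative only.

```python
import math


def linear_kernel(xi, xj):
    """Linear kernel: the inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(xi, xj))


def rb_kernel(xi, xj, gamma):
    """Radial basis kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)


# A larger gamma makes the similarity decay faster with distance.
k_wide = rb_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.01)
k_narrow = rb_kernel([0.0, 0.0], [1.0, 1.0], gamma=3.0)
```

The two `gamma` values above are the lower and upper search bounds stated later for the RB function, so they bracket the kernel widths the optimizers can choose from.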
Metaheuristic optimization algorithms
The setup of metaheuristic optimization algorithms, including SSO, PSO, CSO, and NNA, is described in this section. The evaluation function of these algorithms tries to minimize the \(nRMSE\); see Eq. (8).
SSO, PSO, CSO, and NNA are explained in [43,44,45,46], respectively. The considered optimization algorithms are run with a maximum of 50 iterations. For the linear function, the lower and upper bounds for \(C\) are \([1, 10000]\), and for the RB function, the bounds are \([1, 10000]\) and \([0.01, 3]\) for \(C\) and \(\gamma \), respectively. For the BPNN models, the lower and upper bounds of the number of neurons in each hidden layer are set to [1, 50]. For CSO, the following parameters are set: \(h\) = 20 and \(p\) = 0.25.
Figure 7 depicts the hybrid forecasting algorithm that combines ANN with PSO. During the optimization, candidate numbers of nodes for each hidden layer are generated and evaluated, and the search continues until the lowest \(nRMSE\) is attained. For parameter tuning, tenfold cross-validation is used. Similar steps are used with the other optimization algorithms and with SVR.
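The search loop can be illustrated with a basic global-best PSO, sketched below under several assumptions: the inertia weight `w`, the acceleration constants `c1` and `c2`, and the swarm size are illustrative values not given in the text, and `toy_cv_nrmse` is a hypothetical stand-in for the cross-validated \(nRMSE\) of a trained model. Only the 50 iterations and the bounds for \(C\) and \(\gamma \) come from the setup above.

```python
import math
import random


def pso_minimize(objective, bounds, n_particles=20, n_iter=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO over box bounds; returns (best_position, best_cost, history)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    history = [gbest_cost]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp each coordinate to its search bounds.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            cost = objective(pos[i])
            if cost < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], cost
                if cost < gbest_cost:
                    gbest, gbest_cost = pos[i][:], cost
        history.append(gbest_cost)
    return gbest, gbest_cost, history


def toy_cv_nrmse(params):
    """Hypothetical smooth surrogate for the cross-validated nRMSE."""
    c, gamma = params
    return (math.log10(c) - 2.0) ** 2 + (gamma - 0.5) ** 2


# Bounds for C and gamma with the RB function, as stated in the text.
rb_bounds = [(1.0, 10000.0), (0.01, 3.0)]
best, best_cost, history = pso_minimize(toy_cv_nrmse, rb_bounds)
```

For the BPNN models, the same loop would apply with bounds of (1, 50) per hidden layer and positions rounded to integers before evaluating the network, which is one possible way to handle the discrete neuron count.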
Model accuracy criteria
The forecasting methods under consideration are evaluated for accuracy and efficiency using the following statistical indicators: \(RMSE\), \(nRMSE\), \(MAE\), and \(nMAE\). These metrics show how close the predicted PV power output produced by the proposed models is to the measured values. These metrics are defined in the following equations [19]:
\[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-{f}_{i}\right)}^{2}} \tag{7}\]
\[nRMSE=\frac{RMSE}{{y}_{i,max}}\times 100\% \tag{8}\]
\[MAE=\frac{1}{n}\sum_{i=1}^{n}\left|{y}_{i}-{f}_{i}\right| \tag{9}\]
\[nMAE=\frac{MAE}{{y}_{i,max}}\times 100\% \tag{10}\]
where \(n\) is the number of samples in the testing dataset; \({y}_{i}\) is the observed value of the PV power; \({y}_{i, max}\) is the maximum value in the testing dataset, and \({f}_{i}\) is the forecasted value generated by the forecasting models. \(RMSE\) measures the deviation between observed PV power readings and predicted values [47], while the \(MAE\) is the mean of the absolute values of the residuals (forecasting errors).
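The four indicators can be computed as in the sketch below. Normalizing by the maximum observed value in the testing data follows the symbol definitions above, while expressing the normalized metrics in percent and the sample values are assumptions made for illustration.

```python
import math


def rmse(observed, forecast):
    """Root mean square error between observed and forecast values."""
    n = len(observed)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(observed, forecast)) / n)


def mae(observed, forecast):
    """Mean absolute error between observed and forecast values."""
    n = len(observed)
    return sum(abs(y - f) for y, f in zip(observed, forecast)) / n


def nrmse(observed, forecast):
    """RMSE as a percentage of the maximum observed value."""
    return 100.0 * rmse(observed, forecast) / max(observed)


def nmae(observed, forecast):
    """MAE as a percentage of the maximum observed value."""
    return 100.0 * mae(observed, forecast) / max(observed)


# Hypothetical hourly PV power in kW.
y_obs = [0.0, 50.0, 100.0, 50.0]
y_hat = [0.0, 46.0, 97.0, 50.0]
metrics = {
    "RMSE": rmse(y_obs, y_hat),
    "nRMSE": nrmse(y_obs, y_hat),
    "MAE": mae(y_obs, y_hat),
    "nMAE": nmae(y_obs, y_hat),
}
```

Because the squared errors are averaged before the root is taken, \(RMSE\) penalizes large individual errors more heavily than \(MAE\), which is why the two are reported together.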
Results and discussion
The BPNN model was built using a multilayer perceptron (MLP) and the backpropagation algorithm, with the Levenberg–Marquardt method as the training function. Regarding the number of layers, this study assumes three cases for BPNN:

Case 1: \(BPN{N}^{1}\)—one input layer, one hidden layer, and one output layer.

Case 2: \(BPN{N}^{2}\)—one input layer, two hidden layers, and one output layer.

Case 3: \(BPN{N}^{D}\)—default BPNN.
The number of neurons (nodes) in the hidden layers in Case 1 (\(BPN{N}^{1})\) and Case 2 (\(BPN{N}^{2}\)) is obtained from the optimal number of nodes generated by SSO, PSO, CSO, and NNA. Case 3 (\(BPN{N}^{D})\) is assumed to have one input layer, one hidden layer, and one output layer. In \(BPN{N}^{D}\), the number of neurons selected equals the number of input features listed in Table 2. For example, the number of nodes is one with the input feature combination (F1), which consists of one input feature. Similarly, five nodes are set for the feature combination (F12), which has five input features. The input data in the \(BPNN\) are the same as those used in \(SVR\) models. Table 4 summarizes the \(BPNN\) best models’ configuration by using different algorithms.
For SVR models, the default parameters are selected based on the default values used in the LIBSVM tool. The default value of the parameter \(C\) is set to 1 for the radial basis and linear functions, while parameter \(\gamma \) is equal to (1/number of features). After that, the \(SVR\) and \(BPNN\) models with the optimal parameters and network configurations were employed to forecast the PV power generation. To evaluate the level of agreement between the predicted data and measured data, the models are examined based on \(RMSE\), \(nRMSE\), \(MAE\), and \(nMAE\). Table 3 compares the performance indices of the SVR and BPNN models.
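The default settings described above can be written down as a small helper. The function names and the dictionary layout are hypothetical, while the values (\(C\) = 1, \(\gamma \) = 1/number of features, and one hidden neuron per input feature) are those stated in the text.

```python
def default_svr_params(n_features):
    """Default SVR settings as described in the text (LIBSVM defaults)."""
    return {"C": 1.0, "gamma": 1.0 / n_features}


def default_bpnn_hidden_neurons(n_features):
    """Default BPNN: one hidden layer with as many neurons as input features."""
    return n_features


# Feature combination F12 has five input features.
f12_svr = default_svr_params(5)
f12_neurons = default_bpnn_hidden_neurons(5)
```

For the linear function only `C` is relevant, since the linear kernel has no width parameter.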
Analysis of the forecasting models
In this section, the forecasting models are compared according to several criteria to examine their performance in predicting the PV power output from the solar system. Results of the best forecasting models, the corresponding optimized parameters of the \(SV{R}_{RB}\), \(SV{R}_{Linear}\), \(BPN{N}^{1}\), and \(BPN{N}^{2}\) models, and the statistical errors of the forecasting models are shown in Tables 3, 4, and 5, respectively. Figures 8 and 9 display graphical representations of the goodness-of-fit tests of \(RMSE\) (in Fig. 8) and \(MAE\) (in Fig. 9) in the form of heat maps. These figures compare all the 323 models considered at the study site. These are four forecasting models, \(SV{R}_{RB}\), \(SV{R}_{Linear}\), \(BPN{N}^{1}\), and \(BPN{N}^{2}\), which obtain their parameters for each of the 17 feature combinations using four optimization approaches, SSO, PSO, CSO, and NNA, in addition to the three default models \(SV{R}_{RB}^{D}\), \(SV{R}_{Linear}^{D}\), and \(BPN{N}^{D}\).
The statistical errors are reported for the best feature combination, which is F14 in Table 2.
The SVR parameters and BPNN network configurations are reported for the best feature combination, which is F14 in Table 2.
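The count of 323 models follows from enumerating every combination of optimizer, base model, and feature set, plus the default models for each feature set. The short sketch below reproduces that arithmetic. The label strings are hypothetical identifiers chosen for this example.

```python
from itertools import product

optimizers = ["SSO", "PSO", "CSO", "NNA"]
base_models = ["SVR_RB", "SVR_linear", "BPNN1", "BPNN2"]
default_models = ["SVR_RB_D", "SVR_linear_D", "BPNN_D"]
features = [f"F{i}" for i in range(1, 18)]  # 17 feature combinations

# 4 optimizers x 4 base models x 17 feature combinations.
hybrid = [f"{o}-{m}/{f}" for o, m, f in product(optimizers, base_models, features)]
# 3 default models x 17 feature combinations.
default = [f"{m}/{f}" for m, f in product(default_models, features)]
total_models = len(hybrid) + len(default)
```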
Hybrid forecasting models vs. default forecasting models
Tables 3 and 5 and Figs. 8 and 9 show that, overall, the proposed forecasting models optimized by SSO, PSO, CSO, and NNA outperform the default forecasting models in predicting the PV power output, with lower \(RMSE\) and \(MAE\) values. Regarding model fitting accuracy with the \(SV{R}_{RB}\) models, the proposed models with the optimized hyperparameters show improvements compared to the default models, where \(RMSE\) improved by between 12.001% and 50.079% and \(MAE\) improved by between 1.80291% and 59.8847%. Similarly, the \(BPN{N}^{1}\) and \(BPN{N}^{2}\) prediction models with the optimal network configurations perform better than the \(BPN{N}^{D}\) models, with improvements of 1.883–46.964% and 2.0576–47.007% in the \(RMSE\) and \(MAE\) values, respectively.
Using \(SV{R}_{RB}\) models with different optimization algorithms and feature schemes, Table 6 and Fig. 8 show that \(RMSE\) values are \(\le 23.12 \; \mathrm{ kW}\), while \(RMSE\) values with the default models are \(\le 28.24 \; \mathrm{ kW}\). With feature combination (12), for example, the best model (\(SSO SV{R}_{RB}\)) gives an \(RMSE\) of 4.7500 kW and an \(MAE\) of 2.7617 kW, whereas \(SV{R}_{RB}^{D}\) gives an \(RMSE\) of 9.206 kW and an \(MAE\) of 5.269 kW. A similar reduction in the error metrics is attained with the proposed \(BPNN\) models. For instance, for feature combination (9), the \(nRMSE\) values of \(BPN{N}^{1}\) and \(BPN{N}^{2}\) are 5.767% and 5.804%, respectively, while \(BPN{N}^{D}\) yields an error of 7.25%. Nevertheless, the improvement is small for the \(SV{R}_{Linear}\) models with optimized parameters compared with the \(SV{R}_{Linear}^{D}\) models. This can be attributed to the optimized values of the hyperparameter \(C\), which are close to the default one.
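The improvement percentages quoted in this section can be read as relative reductions in error with respect to the default model. The section does not state the formula, so the definition below is an assumption; the sketch applies it to the feature combination (12) values quoted above.

```python
def percent_improvement(default_error: float, tuned_error: float) -> float:
    """Relative error reduction of the tuned model in percent (assumed definition)."""
    if default_error <= 0:
        raise ValueError("default_error must be positive")
    return 100.0 * (default_error - tuned_error) / default_error

# Values quoted in the text for feature combination (12), in kW.
rmse_gain = percent_improvement(default_error=9.206, tuned_error=4.7500)
mae_gain = percent_improvement(default_error=5.269, tuned_error=2.7617)

print(f"RMSE improvement: {rmse_gain:.2f}%")
print(f"MAE improvement: {mae_gain:.2f}%")
```

Under this assumed definition, the two values come out at roughly 48.4% and 47.6%, which lie inside the improvement ranges reported for the \(SV{R}_{RB}\) models.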
In addition, Figs. 8, 9, and Table 5 show that the default BPNN model performs better than the default SVR models with the radial basis and linear functions. This is due to the default parameters selected for the SVR models; hence, these parameters should be chosen appropriately to obtain the best forecasting performance from the SVR models. The default BPNN models, on the other hand, perform well relative to the proposed models. For instance, the best \(SSOBPN{N}^{1}\) and \(CSOBPN{N}^{2}\) models give \(RMSE\) values of 4.8460 kW and 4.5692 kW, respectively, while the \(BPN{N}^{D}\) model gives an error value of 5.2289 kW.
Performance analysis using proposed models
From Tables 3, 6, Figs. 8, and 9, and by comparing the forecasting models built with the proposed approach, the \(SV{R}_{RB}\) models can be considered the best predictors of PV power generation at the study site. \(SV{R}_{RB}\) has lower \(RMSE\) and \(MAE\) values than \(SV{R}_{Linear}\), \(BPN{N}^{1}\), and \(BPN{N}^{2}\). For instance, \(PSO SV{R}_{RB}\) models are better than \(PSO SV{R}_{Linear}\), \(PSOBPN{N}^{1}\), and \(PSOBPN{N}^{2}\) for all the considered feature combinations, and the same holds for the other optimization methods. The \(BPNN\) models, on the other hand, perform better than the \(SVR\) models with the linear function and come close to the RB models. Among the proposed \(BPNN\) models, \(BPN{N}^{2}\) generally predicts better than \(BPN{N}^{1}\). \(PSOBPN{N}^{2}\) with feature combination (11), for example, gave an \(nRMSE\) of 3.251% and an \(nMAE\) of 3.251%, while \(PSOBPN{N}^{1}\) resulted in error values of 6.297% and 3.819%, respectively. This suggests that using more than one hidden layer with optimized node numbers leads to higher forecasting accuracy than a single hidden layer.
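For readers who want to reproduce comparisons of this kind, the error metrics can be sketched in a few lines. The definitions of \(RMSE\) and \(MAE\) are standard; the normalization used for \(nRMSE\) and \(nMAE\) is not restated in this section, so dividing by the range of the observed values is an assumption made here, and the sample values are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def normalized(error, actual):
    """Error as a percentage of the observed range (assumed normalization)."""
    span = max(actual) - min(actual)
    if span == 0:
        raise ValueError("actual values must not all be equal")
    return 100.0 * error / span

# Made-up hourly PV power values in kW, for illustration only.
actual = [0.0, 10.0, 20.0, 30.0]
predicted = [1.0, 9.0, 22.0, 28.0]

sample_rmse = rmse(actual, predicted)
sample_mae = mae(actual, predicted)
print(f"RMSE={sample_rmse:.3f} kW, nRMSE={normalized(sample_rmse, actual):.2f}%")
print(f"MAE={sample_mae:.3f} kW, nMAE={normalized(sample_mae, actual):.2f}%")
```

On any data set, \(RMSE\) is at least as large as \(MAE\), because squaring weights large errors more heavily.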
Performance analysis of optimization algorithms
Comparing the optimization algorithms, several of them achieve comparable accuracy in estimating the parameters of the forecasting models. The \(\mathrm{PSO}\) and \(\mathrm{SSO}\) methods yield similar parameter values, with negligible differences. Nevertheless, the three optimization algorithms show different performances in obtaining the \(C\) parameter of the linear function. Furthermore, linear models performed the worst because of their limited ability to deal with nonlinearity in the input data. Figure 10 shows the performance of all four optimization algorithms with all the forecasting models. This figure shows that the proposed hybrid forecasting models, in which the hyperparameters of \(SV{R}_{Linear}\) and \(SV{R}_{RB}\) and the configuration of \(BPN{N}^{2}\) are selected optimally, track the actual values of the PV power output more closely than the default models.
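To illustrate how a metaheuristic searches for SVR hyperparameters, the following is a minimal particle swarm optimization sketch in plain Python. It is not the implementation used in the study: the objective `toy_validation_rmse` is a hypothetical stand-in for the validation error of an SVR model, and the search bounds, swarm size, and coefficients `w`, `c1`, and `c2` are assumed values chosen only for illustration.

```python
import random

def toy_validation_rmse(log_c: float, log_gamma: float) -> float:
    """Hypothetical stand-in for the validation RMSE of an SVR model.

    A real pipeline would train SVR with C = 10**log_c and
    gamma = 10**log_gamma and return the error on held-out data.
    """
    return 4.0 + (log_c - 2.0) ** 2 + 0.5 * (log_gamma + 1.0) ** 2

def pso(objective, bounds, n_particles=20, n_iter=60, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective over a box with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(*p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia plus pulls toward the personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Keep the particle inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(*pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Assumed search box for (log10 C, log10 gamma).
bounds = [(-2.0, 4.0), (-4.0, 1.0)]
best, best_val = pso(toy_validation_rmse, bounds)
print(f"log10(C)={best[0]:.3f}, log10(gamma)={best[1]:.3f}, objective={best_val:.4f}")
```

Searching in log space is a common choice for \(C\) and \(\gamma\) because useful values span several orders of magnitude. Replacing the toy objective with a cross-validated error turns the sketch into a hybrid tuner of the kind discussed above.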
Best feature combination
From Tables 3, 6, Figs. 8, and 9, and by comparing different feature combinations, we can see that the best forecasting model is attained with the feature combination (F14). This combination includes the month of the year, the day of the month, the hour of the day, air temperature \((\mathrm{^\circ{\rm C} })\), global horizontal irradiance, GHI \((W\mathrm{h}/{m}^{2})\), and PV power output at the same hour on the previous day (kW). Results show that the best forecasting outcomes for all considered models are obtained using this feature combination. Regarding other feature combinations, using only global horizontal irradiance (F3), PV power output at the same hour on the previous day (kW) (F4), or a mix of the two could lead to satisfactory forecasting results. GHI yields accurate results because of its strong influence on PV system production, while the usefulness of the lagged power observation reflects the nature of solar radiation at the study site. In Riyadh city, the weather is less variable, and there are two seasons in the year, summer and winter. Therefore, the power output of the previous day may influence the production of the next day. This is depicted in Fig. 11 through the scatter plots of the measured vs. predicted PV power output values acquired by the \(SV{R}_{RB}\) with the CSO algorithm. The subplot in green displayed in Fig. 12 indicates the best prediction model with the best feature combination (\(CSO SV{R}_{RB}\) with F14).
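As an illustration of how the F14 input vector could be assembled from hourly records, the sketch below builds the six features listed above, including the 24-hour lag of the PV output. The function name `build_f14_rows`, the record layout, and the synthetic values are assumptions made for this example and are not taken from the study's data pipeline.

```python
from datetime import datetime, timedelta

def build_f14_rows(records):
    """Build (features, target) pairs for the F14 combination.

    records maps an hourly datetime to (air_temp_c, ghi_wh_m2, pv_kw).
    Hours without a measurement 24 hours earlier are skipped.
    """
    rows = []
    for ts in sorted(records):
        previous_day = ts - timedelta(hours=24)
        if previous_day not in records:
            continue  # no lag value available, e.g. on the first day
        air_temp, ghi, pv_kw = records[ts]
        lag_pv = records[previous_day][2]
        features = [ts.month, ts.day, ts.hour, air_temp, ghi, lag_pv]
        rows.append((features, pv_kw))
    return rows

# Two days of synthetic hourly data, for illustration only.
start = datetime(2020, 1, 1, 0)
records = {}
for h in range(48):
    hour = h % 24
    records[start + timedelta(hours=h)] = (
        20.0 + hour,                              # air temperature, deg C
        float(max(0, 12 - abs(12 - hour)) * 50),  # GHI, Wh/m^2
        float(h),                                 # PV power, kW
    )

rows = build_f14_rows(records)
print(len(rows), rows[0])
```

Only the second day yields rows, because the first day has no previous-day measurement to use as the lag feature.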
Furthermore, the Decision Tree (DT) algorithm for feature selection is used to validate the conclusion on the combination of the optimal features [48]. Figure 11 displays the scores for input features according to how relevant they are to predicting the PV power output. Figure 11 reveals that the following features have the five highest scores: PV power output at the same hour on the previous day (kW), global horizontal irradiance, GHI \((W\mathrm{h}/{m}^{2})\), direct normal irradiance, DNI \((W\mathrm{h}/{m}^{2})\), the hour of the day, and air temperature \((\mathrm{^\circ{\rm C} })\). This corresponds to the best feature combination obtained in this study.
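The idea behind tree-based feature scoring can be shown with a single split. The sketch below computes, for each feature, the largest variance reduction that one threshold can achieve, which is the criterion a regression tree evaluates at each node. It is a simplified stand-in for a full decision tree and not the procedure of [48]; the helper names and the toy data are assumptions.

```python
def variance(values):
    """Population variance; zero for an empty list."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def best_split_score(feature, target):
    """Largest variance reduction achievable by one threshold on this feature."""
    n = len(target)
    parent = variance(target)
    pairs = sorted(zip(feature, target))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal feature values cannot be separated by a threshold
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        child = (len(left) * variance(left) + len(right) * variance(right)) / n
        best = max(best, parent - child)
    return best

# Toy data: PV power depends on GHI, while "noise" is unrelated.
ghi = [0, 100, 200, 300, 400, 500, 600, 700]
noise = [3, 1, 4, 1, 5, 9, 2, 6]
pv_kw = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0]

scores = {
    "ghi": best_split_score(ghi, pv_kw),
    "noise": best_split_score(noise, pv_kw),
}
print(scores)
```

A full tree sums such reductions over all nodes where a feature is used, so features that explain more of the target variance receive higher scores.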
Conclusion and future work
Support Vector Regression (SVR) with radial basis and linear kernel functions and Backpropagation Neural Network (BPNN) models were investigated in this study to predict the PV output power of a rooftop PV unit placed on a mosque located in Riyadh city, Saudi Arabia. The penalty factor (\(C)\) and kernel parameter (\(\gamma )\) of the SVR models with the radial basis and linear functions and the number of hidden nodes of the artificial neural network were optimized using four optimization algorithms: Social Spider Optimization (SSO), Particle Swarm Optimization (PSO), Cuckoo Search Optimization (CSO), and Neural Network Algorithm (NNA). Different combinations of input variables were used to select the optimal set of input features. Based on the results of the best forecasting model and the performance of the estimation algorithms, the conclusions can be summarized as follows:

1. According to the model accuracy criteria, the proposed hybrid PV power forecasting models outperform the default SVR models (with RB and linear functions) and the default BPNN models. Overall, the results indicate that the proposed SVR models with radial basis and optimized hyperparameters outperform the other models in forecasting PV power output at the study site.

2. Regarding the model fitting accuracy with the \(SV{R}_{RB}\), the proposed models show improvements compared to the default models, where \(RMSE\) improved between 12.001 and 50.079% and \(MAE\) improved between 1.80291 and 50.8847%. Similarly, the prediction models with the \(BPN{N}^{1}\) and \(BPN{N}^{2}\) using the proposed models, with optimal network configurations, have better performance with 1.883–46.964% and 2.0576–47.007% improvement in the \(RMSE\) and \(MAE\) values, respectively, compared to the default \(BPNN\) models.

3. The proposed BPNN models exhibit good forecasting outcomes that are comparable with those of the SVR radial basis models. On the other hand, the SVR based on the linear function showed limited forecasting performance due to its limited capability to capture the nonlinearity in the input dataset.

4. Comparing the estimation algorithms, the four optimization algorithms have almost the same performance, demonstrating their capacity to select SVR parameters and BPNN network configurations.
Finally, the framework proposed in this study can be used to forecast the PV power output in other countries. However, there is still room for further investigation to develop a model that predicts PV power with high accuracy. Furthermore, even though the current work primarily focuses on improving SVR and ANN by optimizing their parameters, the parameters of other algorithms, such as random forests and decision trees, can also be investigated. Another possible direction is to use dimensionality reduction models to select the features of the input vector. In this study, different feature combinations are formulated based on their correlation with the output power. Thus, other researchers can examine the performance of dimensionality reduction models, such as the Monte Carlo algorithm, the Boruta feature selection algorithm, and the grouping genetic algorithm, to obtain the optimal set of features. In addition, further forecasting approaches can be applied, including the currently popular deep learning methods based on neural networks, such as recurrent neural networks and long short-term memory.
Availability of data and materials
The data that support the findings of this study are available from King Abdelaziz City for Science and Technology and Saudi Electricity Company but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of King Abdelaziz City for Science and Technology and Saudi Electricity Company.
Abbreviations
PV: Photovoltaics
NWP: Numerical weather prediction
ARMA: Autoregressive moving average
ARIMA: Autoregressive integrated moving average
ANN: Artificial Neural Network
BPNN: Backpropagation Neural Network
SVR: Support Vector Regression
kNN: K-nearest neighbor
MLR: Multiple linear regression
DTR: Decision tree regression
GHI: Global Horizontal Irradiance
MOA: Metaheuristic optimization algorithms
SSO: Social Spider Optimization
PSO: Particle Swarm Optimization
CSO: Cuckoo Search Optimization
NNA: Neural Network Algorithm
RMSE: Root Mean Square Error
nRMSE: Normalized Root Mean Square Error
MAE: Mean Absolute Error
nMAE: Normalized Mean Absolute Error
References
M. E. I. (MITEI). Managing large-scale penetration of intermittent renewables. 2011.
Haque MM, Wolfs P. A review of high PV penetrations in LV distribution networks: present status, impacts and mitigation measures. Renew Sustain Energy Rev. 2016;62:1195–208.
Antonanzas J, Osorio N, Escobar R, Urraca R, Martinez-de-Pison FJ, Antonanzas-Torres F. Review of photovoltaic power forecasting. Sol Energy. 2016;136:78–111.
Wu J, Wang YG, Tian YC, Burrage K, Cao T. Support vector regression with asymmetric loss for optimal electric load forecasting. Energy. 2021;223:119969.
Fathi S, Srinivasan R, Fenner A, Fathi Rinker Sr SM. Machine learning applications in urban building energy performance forecasting: a systematic review. Renew Sustain Energy Rev. 2020;133:110287.
Cai M, Pipattanasomporn M, Rahman S. Day-ahead building-level load forecasts using deep learning vs. traditional time-series techniques. Appl Energy. 2019;236:1078–88.
Ferreira M, Santos A, Lucio P. Short-term forecast of wind speed through mathematical models. Energy Rep. 2019;5:1172–84.
Dhiman HS, Deb D, Guerrero JM. Hybrid machine intelligent SVR variants for wind forecasting and ramp events. Renew Sustain Energy Rev. 2019;108:369–79.
Doucoure B, Agbossou K, Cardenas A. Time series prediction using artificial wavelet neural network and multiresolution analysis: application to wind speed data. Renew Energy. 2016;92:202–11.
Alrashidi M, Alrashidi M, Rahman S. Global solar radiation prediction: application of novel hybrid data-driven model. Appl Soft Comput. 2021;112:107768.
Alfadda A, Rahman S, Pipattanasomporn M. Solar irradiance forecast using aerosols measurements: a data driven approach. Sol Energy. 2018;170:924–39.
Ghofrani M, Ghayekhloo M, Azimi R. A novel soft computing framework for solar radiation forecasting. Appl Soft Comput. 2016;48:207–16.
Akhter MN, Mekhilef S, Mokhlis H, Shah NM. Review on forecasting of photovoltaic power generation based on machine learning and metaheuristic techniques. IET Renew Power Gener. 2019;13(7):1009–23.
Ahmed R, Sreeram V, Mishra Y, Arif D. A review and evaluation of the state-of-the-art in PV solar power forecasting: techniques and optimization. Renew Sustain Energy Rev. 2020;124:109792.
Sampath Kumar D, Gandhi O, Rodríguez-Gallegos CD, Srinivasan D. Review of power system impacts at high PV penetration Part II: Potential solutions and the way forward. Sol Energy. 2020;210:202–21.
Sobri S, Koohi-Kamali S, Rahim NA. Solar photovoltaic generation forecasting methods: a review. Energy Convers Manag. 2017;156:459–97.
de Freitas Viscondi G, Alves-Souza SN. A systematic literature review on big data for solar photovoltaic electricity generation forecasting. Sustain Energy Technol Assess. 2018;31:54–63.
Sharadga H, Hajimirza S, Balog RS. Time series forecasting of solar power generation for large-scale photovoltaic plants. Renew Energy. 2020;150:797–807.
Gómez JL, Martínez AO, Pastoriza FT, Garrido LF, Álvarez EG, García JAO. Photovoltaic power prediction using artificial neural networks and numerical weather data. Sustainability. 2020;12(10295):1–19.
Theocharides S, Makrides G, Livera A, Theristis M, Kaimakis P, Georghiou GE. Day-ahead photovoltaic power production forecasting methodology based on machine learning and statistical post-processing. Appl Energy. 2020;268:115023.
Abubakar Mas’ud A. Comparison of three machine learning models for the prediction of hourly PV output power in Saudi Arabia. Ain Shams Eng J. 2022;13(4):101648.
Markovics D, Mayer MJ. Comparison of machine learning methods for photovoltaic power forecasting based on numerical weather prediction. Renew Sustain Energy Rev. 2022;161:112364.
Fan GF, Qing S, Wang H, Hong WC, Li HJ. Support vector regression model based on empirical mode decomposition and auto regression for electric load forecasting. Energies. 2013;6(4):1887–901.
Schölkopf B, Smola AJ, Williamson RC, Bartlett PL. New support vector algorithms. Neural Comput. 2000;12:1207–45.
Almeida MP, Muñoz M, de la Parra I, Perpiñán O. Comparative study of PV power forecast using parametric and nonparametric PV models. Sol Energy. 2017;155:854–66.
Saini LM, Aggarwal SK, Kumar A. Parameter optimisation using genetic algorithm for support vector machine-based price-forecasting model in National electricity market. IET Gener Transm Distrib. 2010;4(1):36.
VanDeventer W, et al. Short-term PV power forecasting using hybrid GA-SVM technique. Renew Energy. 2019;140:367–79.
Netsanet S, Zheng D, Zhang W, Teshager G. Short-term PV power forecasting using variational mode decomposition integrated with Ant colony optimization and neural network. Energy Rep. 2022;8:2022–35.
Chang CC, Lin CJ. Libsvm. ACM Trans Intell Syst Technol. 2011;2(3):1–27.
Solar resource maps and GIS data  Solargis. https://solargis.com/mapsandgisdata/download/saudiarabia. Accessed 03 Oct 2020.
Hsu CW, Chang CC, Lin CJ. A practical guide to support vector classification.
Niu D, Wang K, Sun L, Wu J, Xu X. Short-term photovoltaic power generation forecasting based on random forest feature selection and CEEMD: a case study. Appl Soft Comput. 2020;93:106389.
Miraftabzadeh SM, Longo M, Foiadelli F. A day-ahead photovoltaic power prediction based on long short term memory algorithm. In: SEST 2020 - 3rd international conference on smart energy systems and technologies. 2020. p. 1–6.
Konstantinou M, Peratikou S, Charalambides AG. Solar photovoltaic forecasting of power output using LSTM networks. Atmosphere. 2021;12(1):124.
Faraji J, Abazari A, Babaei M, Muyeen SM, Benbouzid M. Day-ahead optimization of prosumer considering battery depreciation and weather prediction for renewable energy sources. Appl Sci. 2020;10(8):1–22.
Leva S, Dolara A, Grimaccia F, Mussetta M, Ogliari E. Analysis and validation of 24 hours ahead neural network forecasting of photovoltaic output power. Math Comput Simul. 2017;131:88–100.
Tesfaye Eseye A, Zhang J, Zheng D. Short-term photovoltaic solar power forecasting using a hybrid wavelet-PSO-SVM model based on SCADA and meteorological information. Renew Energy. 2017;118:357–67.
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Math Intell. 2001;27(2):83–5.
Abuella M, Chowdhury B. Solar power forecasting using support vector regression. In: Proceedings of the American Society for Engineering Management 2016.
Smola AJ, Scholkopf B. A tutorial on support vector regression. Stat Comput. 2004;14:199–222.
Hong WC. Electric load forecasting by support vector model. Appl Math Model. 2009;33:2444–54.
Wang J, Li L, Niu D, Tan Z. An annual load forecasting model based on support vector regression with differential evolution algorithm. Appl Energy. 2012;94:65–70.
Cuevas E, Cienfuegos M, Zaldívar D, Pérez-Cisneros M. A swarm optimization algorithm inspired in the behavior of the social-spider. Expert Syst Appl. 2013;40:6374–84.
Kennedy J, Eberhart R. Particle swarm optimization. In: IEEE international conference on, neural networks, 1995, proceedings. vol. 4, 1995. p. 1942–8.
Yang XS, Deb S. Cuckoo search via levy flights. 2010.
Sadollah A, Sayyaadi H, Yadav A. A dynamic metaheuristic optimization model inspired by biological nervous systems: neural network algorithm. Appl Soft Comput. 2018;71:747–82.
Renno C, Petito F, Gatto A. Artificial neural network models for predicting the solar radiation as input of a concentrating photovoltaic system. Energy Convers Manag. 2015;106:999–1012.
Zhou HF, Zhang JW, Zhou YQ, Guo XJ, Ma YM. A feature selection algorithm of decision tree based on feature weight. Expert Syst Appl. 2021;164:113842.
Acknowledgements
The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education, Saudi Arabia for funding this research work through the project number (QUIF43331464). The authors also thank Qassim University for technical support.
Funding
The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education, Saudi Arabia for funding this research work through the project number (QUIF4–33–31464). The authors also thank Qassim University for technical support.
Author information
Contributions
M.S.: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft, writing—review and editing, visualization. S.R.: validation, formal analysis, supervision, project administration. All authors read and approved the final manuscript.
Authors information
Saifur Rahman (Life Fellow, IEEE) received the B.Sc. degree in electrical engineering from Bangladesh University of Engineering & Technology, Dhaka, Bangladesh, in 1973, the M.S. degree in electrical engineering from State University of New York, New York, NY, USA, in 1975, and the Ph.D. degree in electrical engineering from Virginia Tech, Blacksburg, VA, USA, in 1978. He is the Founding Director of the Advanced Research Institute, Virginia Tech, Arlington, VA, USA, where he is the J. R. Loring Professor of Electrical and Computer Engineering. He also directs the Center for Energy and the Global Environment. He has published over 140 journal papers and has over 400 conference and invited presentations. He is a Distinguished Lecturer for the PES and has lectured on renewable energy, energy efficiency, smart grid, energy Internet, blockchain, and IoT sensor integration in over 30 countries.
Prof. Rahman is the 2022 IEEE President-Elect. He is an IEEE Millennium Medal Winner. He was the Founding Editor-in-Chief of IEEE Electrification Magazine and the IEEE TRANSACTIONS ON SUSTAINABLE ENERGY. He served as the Chair of the U.S. National Science Foundation Advisory Committee for International Science and Engineering from 2010 to 2013. He was the President of the IEEE Power and Energy Society for 2018 and 2019.
Musaed Alrashidi received the Ph.D. degree in electrical and computer engineering from Virginia Tech University, Blacksburg, VA, USA in 2021, and the M.Sc. degree in Electrical Engineering from The School of Engineering & Applied Science at the George Washington University, DC, USA in 2016. He is currently an Assistant Professor at the Electrical Engineering Department, Qassim University, Saudi Arabia. His research interests lie in renewable energy resources, smart grids, machine learning algorithms, operation of distribution networks, and advanced intelligent optimization techniques.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Alrashidi, M., Rahman, S. Short-term photovoltaic power production forecasting based on novel hybrid data-driven models. J Big Data 10, 26 (2023). https://doi.org/10.1186/s40537023007067
DOI: https://doi.org/10.1186/s40537023007067
Keywords
 PV power forecast
 Machine learning
 Metaheuristic Optimization Algorithms
 Hyperparameters and architectures tuning
 Feature selection