Predictive modelling is the process of creating a model to predict an outcome. If the outcome is categorical, the task is called classification; if it is numerical, it is called regression. Descriptive modelling, or clustering, is the assignment of observations to clusters so that observations in the same cluster are similar to one another. Finally, association rules can uncover interesting associations amongst observations. Figure 2 shows the existing predictive analytics techniques classified into four categories.

Predictive analytics determines what is likely to happen in the future. This analysis is based on machine learning and statistical techniques, as well as other more recently developed techniques that fall under the general category of data mining. The objective of these techniques is to provide predictions and forecasts about future business activities.

There have been many studies using machine learning techniques to predict stock market prices. A large number of successful applications have shown that regression algorithms can be very useful tools for financial time series modelling and forecasting [2, 22, 23].

Regression is a data mining function that predicts a number. A regression task begins with a data set in which the target values are known. Regression models are tested by computing statistics that measure the difference between the predicted values and the known target values. The historical data for a regression project is typically divided into two data sets: one for building the model and the other for testing it.
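The build/test workflow just described can be sketched in a few lines of Python; the synthetic data, the 80/20 split and the error statistics (RMSE and MAE) are illustrative choices, not prescribed by the text.

```python
# Sketch of a typical regression project: split historical data with known
# targets into a build (training) set and a test set, fit a model on one,
# and score it on the other. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic historical data: y depends linearly on x plus noise
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# Divide into two data sets: one for building, one for testing
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Build a simple model (ordinary least squares with an intercept)
A = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Test: compare predicted values against the known target values
y_pred = coef[0] + coef[1] * X_test[:, 0]
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
mae = np.mean(np.abs(y_test - y_pred))
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
```

With a noise standard deviation of 1, both error statistics on the held-out set should come out close to 1.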

Regression modelling has many applications in trend analysis, business planning, marketing, financial forecasting, time series prediction, and environmental modelling. There are different families of regression algorithms and different ways of measuring the error [23, 24].

### Decision tree regression

A decision tree builds regression or classification models in the form of a tree structure. It decomposes a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing a value of the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The highest decision node in the tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

The core algorithm for building decision trees is ID3, developed by Quinlan, which employs a top-down, greedy search through the space of possible branches with no backtracking. The ID3 algorithm can be adapted to build regression trees by replacing information gain with standard deviation reduction.

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar (homogeneous) values. The standard deviation is used to measure the homogeneity of a numerical sample: if the sample is completely homogeneous, its standard deviation is zero.

$$S = \sqrt {\frac{{\sum (x - \mu )^{2} }}{n}}$$

(11)

$$S\left( {T,X} \right) = \sum\limits_{c \in X} {P(c)\,S(c)}$$

(12)

Equation (11) gives the standard deviation of the target attribute, and Eq. (12) gives the weighted standard deviation of the target *T* after the data are split on attribute *X*, where *P*(*c*) is the proportion of instances taking value *c*.

The standard deviation reduction is the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches) [24, 25].
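The split criterion of Eqs. (11) and (12) can be sketched in a few lines of Python. The Outlook/Hours Played values below are illustrative, modelled on the classic weather example mentioned above.

```python
# Sketch of standard deviation reduction (SDR) for choosing a split
# attribute, following Eqs. (11)-(12). Data values are illustrative.
import numpy as np

target = np.array([25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30],
                  dtype=float)  # e.g. Hours Played
outlook = np.array(["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy",
                    "Overcast", "Sunny", "Sunny", "Rainy", "Sunny",
                    "Overcast", "Overcast", "Rainy"])

def std(x):
    # Population standard deviation, Eq. (11)
    return np.sqrt(np.mean((x - x.mean()) ** 2))

def sdr(target, attribute):
    # Reduction in standard deviation after splitting on `attribute`:
    # S(T) minus the weighted per-branch deviation S(T, X) of Eq. (12)
    s_before = std(target)
    s_after = sum(
        (attribute == c).mean() * std(target[attribute == c])
        for c in np.unique(attribute)
    )
    return s_before - s_after

print(f"SDR(Hours Played, Outlook) = {sdr(target, outlook):.3f}")
```

At each node the tree builder would compute this quantity for every candidate attribute and split on the one with the largest reduction.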

### Multiple linear regression

A regression model is a compact mathematical representation of the relationship between the response variable and the input parameters in a given design space. Linear regression models are commonly used to obtain estimates of parameter significance as well as forecasting of the response variable at random points in the design space. One of the simpler forms of such models is

$$y = \beta_{0} + \;\sum\limits_{i = 1}^{m} {\beta_{i} x_{i} + \varepsilon }$$

(13)

where *y* is the dependent or response variable, {*x*_{i} | 1 ≤ *i* ≤ *m*} are the independent or regressor variables and *ε* is the residual, the error due to lack of fit. *β*_{0} is interpreted as the intercept of the response surface with the y-axis, and {*β*_{i} | 1 ≤ *i* ≤ *m*} are known as the partial regression coefficients. The coefficient values represent the expected change in the response *y* per unit change in *x*_{i} and indicate the relative significance of the corresponding terms. It is frequently the case that the regressor variables interact, i.e. the effect of a change in *x*_{i} on *y* depends on the value of *x*_{j}.

In such cases, the simple model in Eq. 13 is not sufficient. It is necessary to introduce terms that explicitly model two-factor interactions as shown below.

$$y = \beta_{0} + \;\sum\limits_{i = 1}^{m} {\beta_{i} x_{i} } + \;\sum\limits_{i = 1}^{m} {\sum\limits_{j = i + 1}^{m} {\beta_{i,\,j} x_{i} x_{j} + \varepsilon } }$$

(14)

Equation (15) represents a generic model that includes three-factor, four-factor and all higher-order interactions. There are 2^{m} terms in this model and an equal number of unknown regression coefficients.

$$y = \beta_{0} + \;\sum\limits_{i = 1}^{m} {\beta_{i} x_{i} } + \;\sum\limits_{i = 1}^{m} {\sum\limits_{j = i + 1}^{m} {\beta_{i,\,j} x_{i} x_{j} } } + \;\sum\limits_{i = 1}^{m} {\sum\limits_{j = i + 1}^{m} {\sum\limits_{k = j + 1}^{m} {\beta_{i,\,j,\,k} x_{i} x_{j} x_{k} } } } + \cdots + \beta_{1,\,2, \ldots ,m} x_{1} x_{2} \cdots x_{m} .$$

(15)

The linear regression models we present in this section can be represented as a sum of *k* terms from this complete linear model, written in generic form as

$$y = \beta_{0} + \beta_{1} x_{{i^{1} }} + \beta_{2} x_{{i^{2} }} \cdots + \beta_{k - 1} x_{{i^{k - 1} }} + \varepsilon$$

(16)

where each *x*_{i^{j}} is a distinct term from the generic model and can be a single-factor, two-factor, three-factor or any higher-order term. The collection of terms chosen for a given linear model will be referred to as the model terms.

In matrix terms, Eq. 16 can be written as

$$y\; = \;X\beta \; + \;\varepsilon$$

(17)

where *β* is the vector of regression coefficients and X is the model matrix. The model matrix has columns corresponding to the regressor variables *x*_{1}, *x*_{2}, …, *x*_{m}, columns for interaction terms of any order, and a column of ones defining the intercept [24, 26].
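The matrix form of Eq. (17) can be sketched as follows: a model matrix with an intercept column of ones, two main effects and a two-factor interaction term is assembled, and the coefficient vector *β* is estimated by ordinary least squares. The data and the true coefficients are synthetic, chosen for illustration.

```python
# Sketch of y = X*beta + eps with a two-factor interaction (Eq. 14/17).
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)

# True (illustrative) model: y = 2 + 1.5*x1 - 0.5*x2 + 3*x1*x2 + noise
y = 2 + 1.5 * x1 - 0.5 * x2 + 3 * x1 * x2 + rng.normal(0, 0.1, n)

# Model matrix: column of ones (intercept), main effects, interaction term
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])

# Least-squares estimate of beta in y = X beta + eps
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))  # should be close to [2, 1.5, -0.5, 3]
```

The estimated coefficients recover the true values up to noise, and their magnitudes indicate the relative significance of the intercept, main-effect and interaction terms.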

### Support vector regression

In machine learning, support vector machines (SVM) are supervised learning models with associated learning algorithms that analyse data for classification and regression. The support vector machine is one of the most powerful algorithms in machine learning; its theory has been developed over the last three decades by Vapnik, Chervonenkis and others. When support vector machines are applied to regression problems, the method is usually called support vector regression (SVR).

The fundamental idea of SVR is to map the original data X nonlinearly into a high-dimensional feature space and then perform linear regression in that feature space. Suppose a set of data

$$S = \left\{ {\left( {x_{1} ,y_{1} } \right), \ldots , \left( {x_{i} ,y_{i} } \right), \ldots , \left( {x_{l} ,y_{l} } \right)} \right\} \in \left( {X \times Y} \right)^{l}$$

(18)

is given, where \(x_{i} \in X = R^{n}\) is the input vector, *y*_{i} ∊ *Y* = *R* is the corresponding output value and *l* is the total number of data points; the support vector regression function is

$$f\left( x \right) = w\cdot\varphi \left( x \right) + b$$

(19)

where *φ*(*x*) is the nonlinear mapping function, *w* is the weight vector and *b* is the bias value. They can be estimated by minimizing the regularized risk function

$$R\left( w \right) = \frac{1}{2}\left| {\left| w \right|} \right|^{2} + C\sum\limits_{i = 1}^{l} {L_{e} \left( {y_{i},\,f\left( {x_{i} } \right)} \right)}$$

(20)

where \(\frac{1}{2}\left| {\left| w \right|} \right|^{2}\) is used as a measure of the flatness of the function, C is the penalty parameter, which determines the trade-off between the training error and the generalization performance, and *L*_{e}(*y*_{i}, *f*(*x*_{i})) is the *ɛ*-insensitive loss function, defined as

$$L_{e} (y_{i} ,f\left( {x_{i} } \right)) = \left\{ {\begin{array}{l} {\left| {y_{i} - f\left( {x_{i} } \right)} \right| - \varepsilon , \quad \left| {y_{i} - f\left( {x_{i} } \right)} \right| \ge \varepsilon } \\ {0, \quad \left| {y_{i} - f\left( {x_{i} } \right)} \right| < \varepsilon } \\ \end{array} } \right.$$

(21)

where \(\left| {y_{i} - f\left( {x_{i} } \right)} \right|\) is the prediction error and *ɛ* is the width of the insensitive zone of the loss function.
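The piecewise definition of Eq. (21) is direct to implement; the value ɛ = 0.5 below is an illustrative choice.

```python
# Sketch of the epsilon-insensitive loss of Eq. (21): errors smaller than
# epsilon cost nothing; larger errors are penalized linearly beyond the tube.
def eps_insensitive_loss(y_true, y_pred, eps=0.5):
    err = abs(y_true - y_pred)
    return err - eps if err >= eps else 0.0

print(eps_insensitive_loss(3.0, 3.2))  # |err| = 0.2 < eps, so loss is 0.0
print(eps_insensitive_loss(3.0, 4.0))  # |err| = 1.0 >= eps, so loss is 0.5
```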

When estimation errors larger than *ɛ* are taken into account, two positive slack variables ζ and ζ^{*} are introduced to represent the distance between the actual values and the corresponding boundary values of the *ɛ*-tube.

Different kernel functions give rise to different network structures in support vector machines, so the selection of the kernel function is important to the effectiveness of support vector regression. However, there is no mature theory for selecting the kernel function of SVR [24, 27].
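The effect of the kernel choice can be sketched as follows, assuming scikit-learn's `SVR` implementation is available; the sine-shaped data and the hyperparameter values (C, ɛ) are illustrative.

```python
# Sketch of SVR with different kernels: the kernel determines the implicit
# feature map phi(x) of Eq. (19), and hence the shape the model can fit.
import numpy as np
from sklearn.svm import SVR  # assumes scikit-learn is installed

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)  # nonlinear target

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # C is the penalty parameter; epsilon is the width of the insensitive tube
    model = SVR(kernel=kernel, C=10.0, epsilon=0.1).fit(X, y)
    scores[kernel] = model.score(X, y)  # R^2 on the training data
    print(f"{kernel:>6}: R^2 = {scores[kernel]:.3f}")
```

On this nonlinear target the RBF kernel fits far better than the linear one, illustrating why kernel selection matters in practice.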