The schematic representation of proposed methods is shown in Fig. 1.
Data source and acquisition
The data used in this study comprise 4661 DPP4 inhibitor molecules in the form of Canonical SMILES, which encode the molecular structure of each DPP4 inhibitor, together with biological activity of type IC50. The data were obtained from the https://www.ebi.ac.uk/ site, accessed in July 2019, in several stages: selecting the DPP4 inhibitors and then selecting the IC50 biological activity values. The IC50 values were converted to pIC50, defined as \(-\log_{10}\) IC50 with IC50 expressed in molar units.
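The conversion can be illustrated with a minimal sketch. It assumes the IC50 values are stored in nanomolar, as the ranges quoted in the preprocessing section suggest; the function name `pic50_from_nm` is hypothetical and not taken from the study.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Return pIC50 = -log10(IC50 in mol/L) for an IC50 given in nM.

    Assumption: the input is in nanomolar, so IC50[M] = IC50[nM] * 1e-9
    and -log10(IC50[M]) = 9 - log10(IC50[nM]).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


# A more potent compound (lower IC50) has a higher pIC50.
print(pic50_from_nm(50.0), pic50_from_nm(500.0))
```

Working on the pIC50 scale compresses the wide IC50 range into a few log units, which is usually more convenient for regression.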
Data preprocessing
The data preprocessing stage consists of three steps: data cleaning, data conversion, and feature extraction. The data were cleaned according to the following criteria: (1) selecting only human DPP4 inhibitors; (2) selecting only the DPP4 enzyme; (3) selecting only data with an IC50 biological activity value; (4) eliminating duplicate compounds. The IC50 values range from 0.1 to 98,000 nM. In this study, compounds with an IC50 value between 50 and 500 nM are defined as grey compounds and are removed. Compounds with an IC50 value below 50 nM are defined as active, and those with an IC50 value above 500 nM as inactive.
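The labelling rule above can be sketched as follows. The function names and the record layout are hypothetical, and treating values of exactly 50 or 500 nM as grey is an assumption, since the text does not specify the boundary cases.

```python
def label_activity(ic50_nm, active_below=50.0, inactive_above=500.0):
    """Return 'active', 'inactive', or None for a grey compound.

    Assumption: values exactly equal to a threshold count as grey.
    """
    if ic50_nm < active_below:
        return "active"
    if ic50_nm > inactive_above:
        return "inactive"
    return None


def drop_grey(records):
    """Keep only labelled compounds from (identifier, ic50_nm) pairs."""
    labelled = []
    for identifier, ic50_nm in records:
        label = label_activity(ic50_nm)
        if label is not None:
            labelled.append((identifier, ic50_nm, label))
    return labelled


# Toy records with made-up identifiers and IC50 values (nM).
toy = [("mol_a", 0.1), ("mol_b", 120.0), ("mol_c", 98000.0)]
print(drop_grey(toy))
```

Removing the grey zone leaves a gap between the two classes, which makes the classification target less ambiguous.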
Several data cleaning steps were executed using KNIME version 3.7, as shown in Fig. 2: removing rows with incomplete data in any column, reading the Canonical SMILES column with the OpenBabel library, removing duplicate molecular structures, and removing salts from the Canonical SMILES column using the RDKit Salt Stripper. KNIME nodes cover a wide set of functions for many different tasks, such as reading and writing data files, data processing, statistical analysis, data mining, and graphical visualization. KNIME also provides RDKit and CDK nodes that support cheminformatics applications comprehensively, from reading data in various formats to converting molecules from 2D to 3D [32].
In the data conversion step, the data are converted from CSV to SDF format. Feature extraction consists of calculating molecular descriptors using several fingerprinting methods [11]. In this research, the RDKit package, used from the Python 3.6 programming language, calculates the molecular descriptors. The molecular fingerprints used are ECFP_4, ECFP_6, FCFP_4 and FCFP_6, each with a length of 1024 bits. To build the QSAR regression and QSAR classification models, the dataset is divided into a training dataset (80%) and a test dataset (20%). The training dataset is then randomly divided into k groups (folds), with five folds \((k = 5)\) used here.
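The 80/20 split followed by five folds can be sketched on sample indices as below. This is an illustrative stand-in that uses only the standard library; the function name, the fixed seed, and the round-robin fold assignment are assumptions and do not reproduce the tooling used in the study.

```python
import random


def train_test_kfold(n_samples, test_fraction=0.2, k=5, seed=0):
    """Shuffle indices, hold out a test set, and split the rest into k folds."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test, train = indices[:n_test], indices[n_test:]
    # Round-robin assignment gives k disjoint folds of near-equal size.
    folds = [train[i::k] for i in range(k)]
    return train, test, folds


train, test, folds = train_test_kfold(100)
print(len(train), len(test), [len(fold) for fold in folds])
```

In cross-validation, each fold serves once as the validation set while the remaining four folds are used for fitting.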
Data clustering using K-modes
K-modes clustering follows a procedure similar to K-means clustering but extends the paradigm to categorical data through several modifications: using a simple dissimilarity measure to handle categorical objects, replacing the cluster mean with the cluster mode, and using a frequency-based method to update the modes, which addresses the limitations of the K-means clustering algorithm on such data [16]. The K-modes clustering algorithm used in this study is described as follows:

1. Determine the number of clusters (k).
2. Determine the initial k modes, one for each cluster.
3. Calculate the distance between each object and each mode using the dissimilarity measure.
4. Assign each object to the cluster with the closest mode.
5. After all objects have been grouped into k clusters, recalculate the dissimilarity of each object with respect to the current modes.
6. If an object's closest mode belongs to another cluster, reassign the object to that cluster and update the modes of both clusters.
7. Repeat steps 3 through 6 until no object changes cluster after all the data have been tested.
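The steps above can be sketched for equal-length bit strings as follows. This is a simplified illustration, not the implementation used in the study: it uses the simple matching dissimilarity (number of differing positions) instead of the Levenshtein distance, takes the first k distinct objects as initial modes, and updates all modes once per pass. All function names are hypothetical.

```python
from collections import Counter


def mismatch(a, b):
    """Simple matching dissimilarity: number of positions that differ."""
    return sum(x != y for x, y in zip(a, b))


def column_mode(members):
    """Most frequent symbol at each position across the cluster members."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*members))


def k_modes(objects, k, max_iter=100):
    # Assumption: the first k distinct objects serve as the initial modes.
    modes = list(dict.fromkeys(objects))[:k]
    labels = None
    for _ in range(max_iter):
        # Assign every object to the cluster with the closest mode.
        new_labels = [
            min(range(len(modes)), key=lambda c: mismatch(obj, modes[c]))
            for obj in objects
        ]
        if new_labels == labels:  # no object moved: converged
            break
        labels = new_labels
        # Update each mode from the objects currently in its cluster.
        for c in range(len(modes)):
            members = [o for o, lab in zip(objects, labels) if lab == c]
            if members:
                modes[c] = column_mode(members)
    return labels, modes


bits = ["1100", "0011", "1110", "0111", "1100", "0011"]
print(k_modes(bits, k=2))
```

The mode plays the role of the centroid in K-means: it is itself a bit string, so it can be compared with the objects using the same dissimilarity.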
In this study, the number of clusters was determined by calculating a cluster evaluation value with the Silhouette Coefficient method. The Silhouette coefficient examines the separation distance between the clusters produced by the clustering process; it measures how close each object in a cluster is to objects in other clusters. Silhouette values range between −1 and +1, with values close to +1 indicating the best separation between clusters [33].
The metric used to measure distance (dissimilarity) in the clustering algorithm in this study is the Levenshtein distance. The Levenshtein distance calculation has three main parts: initializing the distance matrix, filling in the distance matrix, and returning the final (bottom-right) entry of the matrix as the Levenshtein distance. In this study, string similarity was used to group DPP4 inhibitor molecules by comparing their bit vector strings, each consisting of 0s and 1s, obtained for each molecule with the ECFP and FCFP fingerprinting methods [17,18,19].
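A minimal sketch of the standard dynamic-programming form of the Levenshtein distance is given below, following the three parts just described. It is a textbook version written for illustration (the function name is hypothetical) and is not taken from the study's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # Part 1: initialize the matrix; first row and column hold prefix lengths.
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    # Part 2: fill the matrix cell by cell.
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    # Part 3: the distance is the last cell of the matrix.
    return dist[-1][-1]


print(levenshtein("1100", "1110"), levenshtein("0101", "1010"))
```

For equal-length bit strings the Levenshtein distance can be smaller than the number of differing positions, because a shift is covered by one deletion and one insertion; "0101" and "1010" differ in all four positions but are only two edits apart.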
Molecular selection
DPP4 inhibitor molecules are selected by taking one molecule from each cluster obtained from the clustering results. Molecules are selected based on the lowest logP value and on ‘Lipinski’s Rule of 5’, i.e. the logP value cannot be more than 5.
In this study, the logP value was calculated with the atom-based approach proposed by Wildman and Crippen (1999), as implemented in the RDKit module. In this method, the logP value is obtained by summing the contributions of each atom, as given in Eq. 1, where \(n_{i}\) is the number of atoms of the ith atomic type and \(a_{i}\) is the contribution coefficient of the ith atomic type [34,35,36,37].
$$\begin{aligned} logP = \sum _{i} n_{i} a_{i} \end{aligned}$$
(1)
Before calculating the logP value, a molecule column can first be added to or displayed in the dataset. After the logP value has been obtained for each molecule, the dataset is sorted by logP. The molecule with the lowest logP value in each cluster produced by the clustering process is then selected, so that one molecule is chosen per cluster.
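The selection rule can be sketched as below, assuming the logP values and cluster labels have already been computed. The function name, the record layout, and the toy values are hypothetical.

```python
def select_per_cluster(molecules, max_logp=5.0):
    """Pick the lowest-logP molecule per cluster, skipping logP > max_logp.

    molecules: iterable of (identifier, cluster_label, logp) tuples.
    Returns a dict mapping cluster_label -> (identifier, logp).
    """
    best = {}
    for identifier, cluster, logp in molecules:
        if logp > max_logp:  # violates the logP criterion of the Rule of 5
            continue
        if cluster not in best or logp < best[cluster][1]:
            best[cluster] = (identifier, logp)
    return best


# Toy data with made-up identifiers and logP values.
toy = [("m1", 0, 3.2), ("m2", 0, 1.7), ("m3", 1, 5.6), ("m4", 1, 4.9)]
print(select_per_cluster(toy))
```

A single pass that keeps the running minimum per cluster gives the same result as sorting the dataset by logP and taking the first row of each cluster.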
Feature selection
CatBoost is an implementation of gradient boosting that uses binary decision trees as base predictors. Two important advances introduced by CatBoost are ordered boosting, a permutation-based alternative to the classic algorithm, and an algorithm for processing categorical features [29]. CatBoost generates random permutations of the dataset and applies ordered boosting to them. An advantage of gradient boosted decision trees is that importance values for each attribute are relatively easy to obtain once the trees are built.
Using the CatBoost library in the Python programming language, the prediction values change measure is used to identify important features. For each feature, the prediction values change shows how much the prediction changes on average when the feature value changes; the larger the importance, the greater the average change in the predicted value. The leaf pairs being compared have different split values in the node on the path to these leaves. If an object meets the splitting criterion, it goes to the left side of the tree; otherwise, it goes to the right. Eqs. 2 and 3 determine the feature importance value, where \(c_1\) and \(c_2\) are the total weights of the objects in the left and right leaves, and \(v_1\) and \(v_2\) are the formula values in the left and right leaves.
$$\begin{aligned} FI= & {} \sum _{trees,leaf_{SF}} (v_1 - avr)^{2} c_{1} + (v_2 - avr)^{2} c_{2} \end{aligned}$$
(2)
$$\begin{aligned} avr= & {} \frac{v_1 c_1 + v_2 c_2}{c_1 + c_2} \end{aligned}$$
(3)
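Eqs. 2 and 3 can be evaluated by hand with the short sketch below. It only reproduces the arithmetic of the two formulas for a given list of leaf pairs; it does not call CatBoost, and the function names and input layout are hypothetical.

```python
def split_importance(v1, c1, v2, c2):
    """Contribution of one leaf pair: weighted squared deviation from avr (Eqs. 2-3)."""
    avr = (v1 * c1 + v2 * c2) / (c1 + c2)
    return (v1 - avr) ** 2 * c1 + (v2 - avr) ** 2 * c2


def feature_importance(leaf_pairs):
    """Sum the contributions over all leaf pairs that split on the feature."""
    return sum(split_importance(v1, c1, v2, c2) for v1, c1, v2, c2 in leaf_pairs)


# Toy leaf pairs (v1, c1, v2, c2) with made-up values.
print(feature_importance([(1.0, 1.0, 3.0, 1.0), (2.0, 5.0, 2.0, 7.0)]))
```

A leaf pair whose two leaves carry the same value contributes nothing, since changing the feature would not change the prediction; the contribution grows with the gap between the leaf values and with the weight of the objects involved.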
Deep learning
Deep learning is a subfield of machine learning that uses artificial neural network (ANN) algorithms, which are inspired by the structure and function of the human brain. The deep neural network (DNN) is a deep learning method that has been used since 2012, when Dahl et al. applied it in the ‘Merck Molecular Activity Challenge’ to predict the biomolecular targets of a drug [38]. The basic neural network model can be described as M linear combinations of the input variables \(x_1,\dots ,x_D\) as follows [39].
Data goes to \(z_{j}\) neuron
$$\begin{aligned} a_{j} = \sum _{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \quad j = 1, \dots , M, \end{aligned}$$
(4)
where \(w_{ji}^{(1)}\) is denoted as the weight parameter and \(w_{j0}^{(1)}\) as the bias parameter.
Data out of \(z_j\) neuron
$$\begin{aligned} z_{j} = h(a_{j}) \end{aligned}$$
(5)
Data goes to \(y_{k}\) neuron
$$\begin{aligned} b_{k} = \sum _{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \quad k = 1, \dots , K, \end{aligned}$$
(6)
where \(w_{kj}^{(2)}\) is denoted as the weight parameter, \(w_{k0}^{(2)}\) as the bias parameter, and K is the number of output units.
Data out of \(y_k\) neuron
$$\begin{aligned} y_k=l(b_k). \end{aligned}$$
(7)
For binary classification problems, each activation unit is transformed using the sigmoid logistic function so that:
$$\begin{aligned} \sigma (x)=\frac{1}{1+\exp (-x)} \end{aligned}$$
(8)
The overall neural network function becomes
$$\begin{aligned} y_k (x,w)=l\left( \sum _{j=1}^{M} w_{kj}^{(2)} h \left( \sum _{i=1}^{D} w_{ji}^{(1)} x_{i} + w_{j0}^{(1)}\right) + w_{k0}^{(2)} \right) \end{aligned}$$
(9)
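The forward pass of Eqs. 4 to 9 can be traced with a small sketch for a network with one hidden layer. ReLU is assumed for the hidden activation h and the logistic sigmoid for the output activation l; the function names and the toy weights are hypothetical.

```python
import math


def relu(a):
    return max(0.0, a)


def sigmoid(b):
    return 1.0 / (1.0 + math.exp(-b))


def forward(x, w1, bias1, w2, bias2, h=relu, l=sigmoid):
    """Compute y_k = l(sum_j w2[k][j] * h(sum_i w1[j][i] * x[i] + bias1[j]) + bias2[k])."""
    # Hidden layer: pre-activations a_j, then outputs z_j = h(a_j).
    z = [h(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, bias1)]
    # Output layer: pre-activations b_k, then outputs y_k = l(b_k).
    return [l(sum(w * zj for w, zj in zip(row, z)) + b) for row, b in zip(w2, bias2)]


# Toy example with D = 2 inputs, M = 2 hidden units, K = 1 output.
y = forward(
    x=[1.0, 2.0],
    w1=[[1.0, -1.0], [0.5, 0.5]],
    bias1=[0.0, 0.0],
    w2=[[1.0, -1.0]],
    bias2=[1.5],
)
print(y)
```

Here the hidden pre-activations are -1 and 1.5, so ReLU gives 0 and 1.5; the output pre-activation is 0 and the sigmoid returns 0.5.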
The classification problem is formulated mathematically as an optimisation problem in which the objective function, or loss function, is the cross-entropy between the target vector and the predicted results. Given the dataset \(\left\{ x_n,t_n \right\} _{n=1}^N\), where \(x_n \in {\mathbb {R}}^m\) is the input vector and \(t_n\) is the target vector, the cross-entropy is defined as [39]
$$\begin{aligned} {\mathscr {L}} = -\frac{1}{N} \sum _{n=1}^{N} [t_n \ln (y_n) +(1-t_n) \ln (1-y_n)]. \end{aligned}$$
(10)
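The loss in Eq. 10 can be computed directly, as in the sketch below. Clipping the predictions away from 0 and 1 with a small `eps` is an added numerical safeguard and an assumption of this sketch, not part of the equation; the function name is hypothetical.

```python
import math


def binary_cross_entropy(targets, predictions, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, y in zip(targets, predictions):
        y = min(max(y, eps), 1.0 - eps)  # avoid log(0); assumption of this sketch
        total += t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return -total / len(targets)


print(binary_cross_entropy([1, 0], [0.5, 0.5]))
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
```

An uninformative prediction of 0.5 for every sample gives a loss of ln 2, and the loss decreases as the predicted probabilities move towards the correct labels.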
Furthermore, the standard training procedure is the backpropagation algorithm, implemented using stochastic gradient descent [40]. The idea of backpropagation is to propagate errors from the output layer back to the input layer using the chain rule. A simple approach to using gradient information is to update the weights by taking a small step in the direction of the negative gradient, that is,
$$\begin{aligned} w^{(\tau + 1)} = w^{(\tau )} - \eta \nabla E(w^{(\tau )}), \end{aligned}$$
(11)
where \(\eta\) is the learning rate, a parameter that determines the speed of the model learning process, \(w^{(\tau + 1)}\) is the new weight parameter and \(w^{(\tau )}\) is the weight before updating. The condition \(\nabla E(w) = 0\) is solved iteratively by optimisation methods such as stochastic gradient descent.
The deep neural network architecture in this study consists of an input layer, 3 hidden layers, and an output layer. The choice of 3 hidden layers follows [41], in which between 2 and 5 hidden layers were used, while [42] used 4 hidden layers for the DNN model architecture. The other hyperparameters used in this study are as follows: the weights are initialized with normally distributed random values; the activation function is ReLU in each hidden layer and the sigmoid function in the output layer [42]; the Adam optimizer is used to update the weights; the dropout rate is 0.2 on the input layer and 0.5 in the hidden layers [43]; and the batch size is 32. The model was trained for up to 30 epochs with early stopping. The architecture of the DNN model in this research is illustrated in Fig. 3.
Rotation forest
Let \(y=[y_1,y_2, \dots ,y_n ]^T\) be the class labels or response variables, with class labels taken from the set \(\{w_1,w_2\}\). The decision trees in the ensemble are denoted by \(D_1,D_2, \dots ,D_L\) and the set of independent variables in X is denoted by F. This method has two parameters: the number of decision trees, denoted by L, and the number of subsets into which the original variables are split, denoted by K. These two parameters play an essential role in the success of the Rotation Forest method. The first step is to choose the number of decision trees (L). Then, following [44], each decision tree \(D_i\) for \(i=1, \dots ,L\) is built by the following steps:

1. Randomly divide the set of independent variables F into K subsets. To increase the probability of high diversity among the trees, select disjoint subsets so that each subset contains \(M = p/K\) features, where p is the total number of features.
2. Let \(F_{i,j}\) be the jth feature subset for training the tree \(D_i\), and let \(X_{i,j}\) be the data set restricted to the variables in \(F_{i,j}\), where \(j = 1,2,3, \dots ,K\). Randomly select a nonempty subset of classes and draw a bootstrap sample of objects of 75% of the total observations; \(X_{i,j}^*\) denotes the bootstrapped data.
3. Apply Principal Component Analysis (PCA) to \(X_{i,j}^*\). Keep all principal component coefficients from PCA and store them as \(a_{i,j}^{(1)}, a_{i,j}^{(2)}, \dots ,a_{i,j}^{(M_j)}\), each of size \(M_j \times 1\), in the matrix \(C_{i,j}\).
4. Arrange the matrices \(C_{i,j}\) into a rotation matrix \(R_i\) of size \(p \times p\) with the coefficients obtained.
5. Rearrange the columns of \(R_i\) so that they correspond to the order of the original variables in F, and denote the rearranged rotation matrix by \(R_{i}^{a}\).
6. Use \(XR_{i}^{a}\) as the training data to build the decision tree \(D_i\).
7. Obtain a prediction from each of the L trees and combine the predictions by majority voting.
This research uses the Rotation Forest classifier with PCA, developed following the Rotation Forest algorithm proposed by Rodriguez et al. (2006), and the Rotation Forest regressor proposed by Zhang et al. (2008) [44, 45].
Rotation Forest models that use PCA for the classification and regression algorithms are called QSAR RFCPCA and QSAR RFRPCA, respectively, while the models that use sparse PCA (SPCA) are called QSAR RFCSPCA and QSAR RFRSPCA. The difference between the PCA and SPCA models lies in the third step of the Rotation Forest algorithm, in which SPCA is applied to \(X_{i,j}^*\) and all the principal component coefficients of the sparse loadings are used.
QSAR model building
The main focus of the current work is to implement efficient modeling and well-defined QSAR models. To accomplish this objective, two challenging issues in QSAR modelling had to be solved. The first is to carry out rational molecular selection, to obtain a representative molecular subset, and molecular descriptor selection, to predict the inhibitory concentration of DPP4 inhibitor molecules. The second is to design the modeling workflow so that it includes model validation and the results can be evaluated without bias. A schematic of the QSAR modeling workflow is shown in Fig. 1. The workflow starts with the acquisition and preprocessing of the DPP4 inhibitor molecule data. Feature selection is then carried out to identify an optimized, non-redundant set of descriptors that can lead to the best models. Finally, once the descriptors are determined, they are used to develop the QSAR Classification and QSAR Regression models. The QSAR Classification models were built using the Rotation Forest Classifier and DNN algorithms. QSAR Classification with the Rotation Forest Classifier uses two rotation matrix methods, PCA and SPCA; the resulting models are called QSAR RFCPCA and QSAR RFCSPCA, respectively. QSAR Classification with the DNN algorithm is called QSAR DNN. The QSAR Regression models were built using the Rotation Forest Regressor algorithm, which also uses PCA and SPCA as the rotation matrix methods; the resulting models are called QSAR RFRPCA and QSAR RFRSPCA, respectively.
Evaluation
To evaluate the QSAR classification model, a confusion matrix is required, which indicates how many observations were predicted correctly or incorrectly [46]. It has four parameters: the true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
Based on these parameters, performance evaluation metrics for the classification model can be calculated, such as Sensitivity, Specificity, Accuracy, and the Matthews correlation coefficient (MCC), as follows.
$$\begin{aligned} sensitivity= & {} \frac{TP}{TP + FN}, \end{aligned}$$
(12)
$$\begin{aligned} specificity= & {} \frac{TN}{TN + FP}, \end{aligned}$$
(13)
$$\begin{aligned} accuracy= & {} \frac{TP + TN}{TP + FP + FN + TN}, \end{aligned}$$
(14)
$$\begin{aligned} MCC= & {} \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \end{aligned}$$
(15)
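The four metrics can be computed from the confusion-matrix counts as in the following sketch. Returning an MCC of 0 when the denominator is zero is a common convention adopted here as an assumption, and the sketch also assumes that both classes occur in the data; the function name and the toy counts are hypothetical.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # Assumption: MCC is reported as 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy counts, not results from the study.
print(classification_metrics(tp=40, fp=10, fn=10, tn=40))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the active and inactive classes are imbalanced.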
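To assess the performance of the QSAR regression model, the root mean square error (RMSE) and the coefficient of determination \(R^2\) can be used. These criteria are calculated using Eqs. 16 and 17, where \(Y_i\) is the observed value, \({\hat{Y}}_i\) the predicted value, \({\bar{Y}}\) the mean of the observed values, and n the number of observations.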
$$\begin{aligned} RMSE= & {} \sqrt{\frac{\sum (Y_{i} - {\hat{Y}}_{i})^{2}}{n}}, \end{aligned}$$
(16)
$$\begin{aligned} R^2= & {} 1 - \frac{\sum (Y_i - {\hat{Y}}_i)^2}{\sum (Y_i - {\bar{Y}})^2}. \end{aligned}$$
(17)
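Both regression criteria can be computed with a few lines of code, as in the sketch below. The function names and toy values are hypothetical, and the sketch assumes the observed values are not all identical, so that the denominator of \(R^2\) is nonzero.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values: predicting the mean for every sample gives R^2 = 0.
observed = [1.0, 2.0, 3.0, 4.0]
print(rmse(observed, [2.5] * 4), r_squared(observed, [2.5] * 4))
```

RMSE is expressed in the units of the response (here pIC50), while \(R^2\) is unitless and compares the model against the baseline of always predicting the mean.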