Artificial intelligence paradigm for ligand-based virtual screening on the drug discovery of type 2 diabetes mellitus

New dipeptidyl peptidase-4 (DPP-4) inhibitors need to be developed to be used as agents with low adverse effects for the treatment of type 2 diabetes mellitus. This study aims to build quantitative structure-activity relationship (QSAR) models using the artificial intelligence paradigm. Rotation Forest and Deep Neural Network (DNN) are used to predict QSAR models. We compared principal component analysis (PCA) with sparse PCA (SPCA) as methods for transforming Rotation Forest. K-modes clustering with Levenshtein distance was used for the selection method of molecules, and CatBoost was used for the feature selection method. The amount of the DPP-4 inhibitor molecules resulting from the selection process of molecules using K-Modes clustering algorithm is 1020 with logP range value of -1.6693 to 4.99044. Several fingerprint methods such as extended connectivity fingerprint and functional class fingerprint with diameters of 4 and 6 were used to construct four fingerprint datasets, ECFP_4, ECFP_6, FCFP_4, and FCFP_6. There are 1024 features from the four fingerprint datasets that are then selected using the CatBoost method. CatBoost can represent QSAR models with good performance for machine learning and deep learning methods respectively with evaluation metrics, such as Sensitivity, Specificity, Accuracy, and Matthew’s correlation coefficient, all valued above 70% with a feature importance level of 60%, 70%, 80%, and 90%. The K-modes clustering algorithm can produce a representative subset of DPP-4 inhibitor molecules. Feature selection in the fingerprint dataset using CatBoost is best used before making QSAR Classification and QSAR Regression models. QSAR Classification using Machine Learning and QSAR Classification using Deep Learning, each of which has an accuracy of above 70%. The QSAR RFC-PCA and QSAR RFR-PCA models performed better than QSAR RFC-SPCA and QSAR RFR-SPCA models because QSAR RFC-PCA and QSAR RFR-PCA models have more effective time than the QSAR RFC-SPCA and QSAR RFR-SPCA models.


Background
Type 2 diabetes is one of the fastest-growing chronic diseases due to decreased insulin function and insulin secretion [1].Recent treatment for type 2 diabetes mellitus has been developed to increase the effect of incretin, which can improve insulin function and secretion.DPP-4 inhibitors are drugs that work as agents that inhibit the dipeptidyl peptidase-4 enzyme, which can increase the effects of incretin [2].However, the registered DPP-4 inhibitor drugs have adverse side effects, such as upper respiratory tract infections, pancreatitis, and risk of heart failure, if taken for a long time.Therefore, development of new DPP-4 inhibitors is needed.
In-silico methods apply the use of computers as a tool in drug discovery that can perform cost-efficiently compared to conventional methods, which are known to be timeconsuming and high cost [3].They offer simulations and calculations that can rationally reduce the number of proposed compounds and assist in studying drug interactions with targets to the toxic properties of compounds and their metabolites [4].QSAR is the ligand-based virtual screening method that studies the relationship between the chemical structures and biological activities of the molecules that can be calculated to derive a model or equation that can be used to predict the activity of a compound [4][5][6].
Several studies have performed QSAR against DPP-4 inhibitors.[2] developed QSAR using naïve Bayesian and recursive partitioning approaches to predict new DPP-4 inhibitors using thousands of available DPP-4 inhibitor molecule data.Hermansyah (2019) developed the QSAR method using XGBoost, Random Forest, Support Vector Machine, Multiple Linear Regression, and Deep Learning techniques to find a new DPP-4 inhibitor hit compound.These studies have relatively high predictive accuracy, which is above 80%.However, both studies still carried out a random selection method of molecules in the data partitioning stage.According to Andrada et al. (2015) [7], a random selection of molecules can lead to a mismatch because all members of the validation data may be members of the same group, thereby resulting in a molecular set that is not representative of the real data.Thus, a method is needed that can produce a representative data set in the data partition stage [2,7,8].
The problems in QSAR, in general, include the problem of transforming molecules into feature vectors that can be used as input data (feature extraction) and building a high-performance QSAR model.Rational selection of molecules in the data preparation stage of QSAR modelling is believed to affect the performance improvement of the QSAR model compared to random molecular selection.One method that can be used to select a subset of molecules rationally is clustering [7,9].Molecules can be uniquely identified from chemical databases by molecular descriptors [10].Molecular descriptors are derived by several algorithms that describe specific aspects of a compound [11].Molecular descriptors of compound molecules can be extracted from the Simplified Molecular Input Line Entry System (SMILES) format into a molecular fingerprint.SMILES is a general method that facilitates the representation and manipulation of molecular structures using computers with text strings and symbols, such as for single bonds, f or double bonds, and # for triple bonds [9,12,13].
A fingerprint is a numerical representation concept of a particular structure or feature of a molecule that combines the presence or absence of different molecular substructures in a molecule into one molecular descriptor.Fingerprints can be classified into non-hashed and hashed.A non-hashed fingerprint, also known as a structural key, is based on a predefined substructure dictionary, so there is a unique mapping between the position of the bit vector and a particular substructure.A hashed fingerprint is defined as a method that generates substructure set for a molecule that is converted into a fixedlength vector of bits.Molecular fingerprint data in the form of vectors containing bit strings with the numbers 0 or 1 can be considered as data with categorical features because they state the presence or absence of a particular molecular substructure in the fingerprint pattern of a molecule [11,14].In this research, the extended circular fingerprint (ECFP) and functional class fingerprint (FCFP) methods were used with diameters 4 and 6 [15], respectively, and a bit vector length of 1024, where the bit vector length represents the features of the data.
Clustering is a method that can be used to group or divide data by distance metrics that have been determined in high dimensional space.K-modes clustering is a clustering method that employs a similar procedure to K-means clustering.This method uses distance measures to handle categorical objects, replaces the average calculation method on the cluster with data mode, and uses frequency-based methods to update the data mode in the clustering process to minimise the clustering cost function [16].Levenshtein distance is a distance method that can be used in the K-modes clustering as a dissimilarity function.It aims at measuring the similarity of two strings in the form of a sequence, where the order of the elements in one string is considered necessary using insertion, deletion, and substitution operations [17][18][19].
The artificial intelligence paradigm has been widely used to model QSAR, especially in the development of machine learning and deep learning methods.These methods have been successful in various applications of bioinformatics, including hypoxia detection in Cardiotocography (CTG) signals [20], prediction of protein interactions using amino acid sequences [21], biclustering method implementation on gene expression data of carcinoma tumour [22], protein interaction networks of schizophrenia's risk factor candidate genes [23], and Ebola virus phylogenetic analysis [24].Rotation Forest is a machine learning method that uses PCA in forming a rotation matrix to rotate data sets that are then compiled into a decision tree.DNN is a deep learning method consisting of an input layer, several hidden layers, and an output layer.Several studies have been conducted to compare machine learning and deep learning methods in the drug discovery process.Similar performances of the two methods have been found with random data selection, but different performances can occur when data selection is made rationally [7,25,26].Feature selection is one of the important steps for processing and analysing machine learning methods effectively.This technique can reduce the number of features of data while maintaining the same or even better learning performance by selecting a feature subset [27].Important features are often useful for studying the relative importance or contribution of each feature in predicting targets [28].CatBoost is a gradient boosting algorithm developed by [29].The benefit of using the gradient boosting method is that it is relatively easy to determine the important score for each feature.
The analysis of the QSAR method is generally categorised into QSAR regression and QSAR classification [30].The QSAR classification method is used to predict the active compound from the database, and then the QSAR regression method is used to study the activity value [30.This study aims to build QSAR models using an artificial intelligence paradigm, especially for the QSAR classification and regression models, to design a new DPP-4 inhibitor candidate for the treatment of type 2 diabetes.Machine learning and deep learning methods are used, including the Rotation Forest and DNN methods, respectively, where in the machine learning methods we compared PCA with sparse PCA (SPCA) as transformation methods in the framework of Rotation Forest.The K-modes clustering with Levenshtein distance method was used for the DPP-4 inhibitor molecular selection method and CatBoost was used for the feature selection method.
There are several novelties in this research.The first novelty in this study is the use of the K-modes clustering method using Levenshtein distance.The clustering analysis was carried on DPP-4 inhibitors through the fingerprints obtained by the ECFP and FCFP methods.The determination of the number of clusters in the K-modes clustering algorithm was evaluated by applying the Silhouette Coefficient method.Selection of DPP-4 inhibitor molecules was carried out based on the clustering results by taking one molecule from each cluster with the lowest logP value that is less than five, according to Lipinski's Rule of 5 [31].The second novelty is the use of SPCA as a rotation matrix in the Rotation Forest method.The SPCA method is often used to perform dimensional reduction.In this study, SPCA was applied in the Rotation Forest method to transform feature data variables into new variables within an independent loading vector-matrix, in which all of the principal components (PCs) in the SPCA loading matrix were retained, to build QSAR models.CatBoost as a feature selection method for ECFP and FCFP fingerprint methods is a new thing in ligand-based virtual screening research.Therefore, this study will compare and analyse the performance of the QSAR model as a ligand-based virtual screening method with and without using the CatBoost feature selection method.

Methodology
The schematic representation of proposed methods is shown in Fig. 1.

Data source and acquisition
The data used in this study is 4661 DPP-4 inhibitor molecular data in the form of Canonical SMILES which contains information on the molecular structure of the DPP-4 inhibitor with the type of biological activity IC50.The data were obtained from https://www.ebi.ac.uk/ site accessed in July 2019 through several stages, such as selecting the DPP-4 inhibitor and then selecting the biological activity value of IC50.The IC50 values have been converted to molar units pIC50 (defined as log 10 IC50).

Data preprocessing
In the data preprocessing stage, there are three things that can be done, including data cleaning, data conversion, and feature extraction.The data were cleaned with the following criteria: (1) only selecting human DPP-4 inhibitor; (2) only selecting DPP-4 enzyme; (3) select data with biological activity value IC50; (4) eliminating duplicate compounds.The IC50 value is in the interval of 0.1 to 98,000 nM.In this study, compounds that have a value between 50 and 500 nM are defined as grey compounds, so they need to be removed.Besides, compounds having an IC50 value below 50 nM are defined as active compounds and an IC50 value above 500 nM as inactive compounds.
Several data cleaning steps, such as removing incomplete data rows in each column, reading Canonical SMILES columns with the OpenBabel library, removing duplicate molecular structures, and removing salt in the Canonical SMILES column using the RDKit Salt Stripper library, were executed using the KNIME software version 3.7, as shown in Fig. 2. KNIME nodes can do a wide set of functions for many different tasks such as read and write data files, data processing, statistical analysis, data mining, and graphical visualization.At KNIME there are RDKit and CDK nodes that are very complete for cheminformatics applications, from reading data in various formats to converting molecules from 2D to 3D [32].
At the data conversion step, the data is converted from CSV to SDF format.The feature extraction process carried out is calculating the molecular descriptors using several fingerprinting methods [11].In this research, the RDKit packages which are implemented in the Python 3.6 programming language are used for the calculation of molecular

Data clustering using K-modes
K-modes clustering is a clustering method that has a similar procedure to K-means clustering by expanding the paradigm for grouping categorical data by making several modifications, such as using simple dissimilarity measures to handle categorical objects, replacing the average calculation method on the cluster with the data mode; and using frequency-based methods to update the data mode to solve problems in the K-means clustering algorithm [16].The K-modes clustering algorithm used in this study is described as follows: 1. Determine the number of clusters (k). 2. Determine the initial k mode for each cluster.3. Calculate the distance between each data to the mode based on the dimension of dissimilarity.4. Group objects against the cluster in the closest mode. 5.After all objects have been grouped into k clusters, recalculate the dimensions of the dissimilarity with respect to the current mode.6.If there are objects whose closest mode belongs to another cluster, group the objects against that cluster and update the mode of both clusters.7. Repeat steps 3 through 6 until no objects have moved clusters after all data has been tested.
In this study, the determination of the number of clusters was carried out by calculating the cluster evaluation value using the Silhouette Coefficient method.Silhouette studies the separation distance between the clusters generated in the clustering process which aims to measure the closeness of each object in a cluster to objects in other clusters.The Silhouette values range between − 1 and +1, with values close to +1 indicating the model with the best separation between clusters [33].
The metric used to measure distance, or a measure of dissimilarity, in the clustering algorithm in this study is the Levenshtein distance.There are three main parts in the Levenshtein distance calculation algorithm: initializing the distance matrix, calculating the distance matrix, and returning the value from the distance matrix with the largest value as a result of the Levenshtein distance.In this study, similarity strings were used in the grouping of DPP-4 inhibitor molecules through a comparison of bit vector strings, each of which contained a string of 0 or 1, from each molecule obtained based on the fingerprinting method, ECFP and FCFP [17][18][19].

Molecular selection
The selection of DPP-4 inhibitor molecules is made by taking one molecule from each cluster obtained from the results of clustering.Molecules are selected based on the lowest logP value and 'Lipinski's Rule of 5' rule, i.e. the logP value cannot be more than 5.
In this study, the calculation of the logP value was carried out based on the atomicbased approach method, as proposed by Crippen and Wildman in the RDKit module (1999).In this proposed method, the logP value is given by adding up the contributions of each atom, as given in Eq. 1, where n i is the number of atoms of the ith atomic type and a i is the contribution coefficient of the ith atomic type [34][35][36][37].
Before calculating the logP value, the molecular column can be added or displayed on the dataset first.After the logP value is obtained for each molecule, the dataset is then sorted according to the logP value.In this study, selection of molecules was carried out based on the lowest logP value of the molecules from each cluster, so that one molecule from each cluster would be selected, which was obtained from the clustering process with the lowest logP value.

Feature selection
CatBoost is an implementation of gradient boosting, which uses a binary decision tree as a basic prediction.Two important things that were introduced by CatBoost were the implementation of ordered boosting, namely permutation-based alternatives to classic algorithms, and algorithms for processing categorical features [29].CatBoost divides the dataset into random permutations and applies ordered boosting to the random permutations.The advantage of using a gradient boosting decision tree is that it is relatively easy to take essential values for each attribute after the tree is built.
Using the CatBoost library in the Python programming language, prediction values change used to obtain essential features.For each feature, the prediction values change shows the average change in predictions if the feature values change-the more significant the importance, the greater the average change to the predicted value.The leaf pairs being compared have split values in different nodes to the leaf path.If it meets the splitting criteria, the object goes to the tree's left side; otherwise, it goes the other way.The following Eqs. 2 and 3 determines the significant feature value.
(1) logP = n i a i

Deep learning
Deep learning is a sub-field of machine learning that uses ANN algorithms, which are inspired by the structure and function of the human brain.DNN is a deep learning method that has been used since 2012 by Dahl et al. in the 'Merck Molecular Activity Challenge' to predict biomolecular targets in one drug [38].The basic neural network model can be described in M liner combination with input variables x 1 , . . ., x D as follows [39].
Data goes to z j neuron where w (1)  ji is denoted as the weight parameter and w (1)  j0 as the bias parameter.Data out of z j neuron Data goes to y k neuron where w (2)  kj is denoted as the weight parameter and w (2)  k0 as the bias parameter.Data out of y k neuron For binary classification problems, each activation unit is transformed using the sigmoid logistic function so that: The overall neural network function becomes The classification problem is mathematically formulated as an optimisation problem where the objective function or loss function is the cross-entropy between the target vector and the predicted results.Given the dataset set {x n , t n } N (i=1) , where x ∈ R m is the input vector and t n is the target vector, cross-entropy is defined as [39] (2) Furthermore, the best training procedure is known as the backpropagation algorithm, which is implemented using stochastic gradient descent [40].The idea of the backpropagation algorithm is to fix errors from the output layer to the input layer with the chain rule.A simple approach to using gradient information is to select an update of the weights to form small steps towards a negative gradient, that is, where the learning rate is used as a parameter to determine the speed of model learning process, w (τ +1) is denoted as the new weight parameter and w (τ ) the weight before updating.At this stage, we evaluate ∇E(x) = 0 , which is used to solve optimisation prob- lems such as stochastic gradient descent iteratively.Deep neural network architecture in this study consists of an input layer, 3 hidden layers, and an output layer.In this study, the selection of 3 hidden layers was carried out based on the research of [41], where the hidden layers are chosen between 2 and 5, while [42] chose 4 hidden layers for the DNN model architecture.Other hyperparameters used in this study include initialization of weights with random values normally distributed, the activation function used in each hidden layer is RELU, and the sigmoid function in the output layer [42], Adam's optimizer is used as an optimization method in updating weights, using a dropout rate of 0.2 on the input layer and 0.5 in the hidden layer [43], the batch size is chosen was 32.The epoch used in the learning model process was 30 by applying the early stopping technique.The architecture of the DNN model in this research is illustrated in Fig. 3.

Rotation forest
Let y = [y 1 , y 2 , . . ., y n ] T be the set of class labels or response variables from the set w 1 , w 2 .The decision tree in the ensemble is denoted by D 1 , D 2 , . . ., D L and the set of independent variables in X is denoted by F. There are two parameters in this method: the number of decision trees denoted by L, and the number of original variables separator denoted by K.These two parameters have an essential role in determining the success of the Rotation Forest method.The first step in this method is to choose the number of (10 (11) w (τ +1) = w (τ ) − η∇E(w (τ ) ), Fig. 3 DNN model architecture decision trees (L) used.Then, based on [44] to build each decision tree D i for i = 1, . . ., L will be determined by the following steps: 1 Randomly divide the set of independent variables F into K subsets.To increase the probability of high diversity in each tree, select disjoint subset so that each subset of feature contains M = p/K features.2 Let F ( i, j) be the j-th feature subset for the training dataset D i .Note X i,j as a data set with the variable set F i,j , where j = 1, 2, 3, . . ., K .Randomly select a nonempty class subset and draw a bootstrap object sample of 75% of the total observations where X * i,j is the bootstrapped data.3 Apply PCA analysis to X * (i,j) .Use all PC coefficients from PCA and save as a i,j , . . ., a M j i,j into matrix C (i,j) of M j × 1. 4 Principal Component Analysis (PCA) analysis on X * i,j .Use all principal component coefficients from PCA and save as a i,j , . . ., a (M j ) i,j into matrix C i,j of M j × 1. 5 Rearrange C i,j into a rotation matrix R i of p × p with the coefficients obtained.6 Rearrange the columns of the matrix R i so that they correspond to the original vari- able subset F to construct the D i tree.Then, state the rotational matrix composed of R a i .7 Use XR a i as the training data cluster to build the D i decision tree.8 Estimate in each D the number of L trees followed by calculating the majority voting.This research will use the Rotation Forest classification (PCA), which is developed using an algorithm that is under the Rotation Forest algorithm proposed by Rodriguez et al. (2006) and the Rotation Forest regressor proposed by Zhang et al. (2008) [44,45].
Rotation Forest models that use PCA for classification and regression algorithms are called QSAR RFC-PCA and QSAR RFR-PCA respectively.Meanwhile, for models that use SPCA, they are called QSAR RFC-SPCA and QSAR RFR-SPCA.The differences between models that use PCA and SPCA is in the third step of Rotation Forest algorithm, in which SPCA analysis on X * i,j is performed, and all the main component coefficients of the sparse loading are used.

QSAR model building
The main focus on this current work is to implement efficient modeling and well-defined models of QSAR.To accomplish this objective, it was necessary to solve two challenging issues in QSAR modelling.The first one is to carry out rational molecular selection to obtain a representative molecular subset and molecular descriptor selection to predict inhibitory concentration of DPP-4 inhibitor molecules.The second one is to make a modeling workflow as model validation, so that the result can be unbiasedly evaluated.A schematic of QSAR modeling workflow is shown in Fig. 1.This workflow starts with DPP-4 inhibitor molecules data acquisition and data preprocessing.After this step, the feature selection is carried out to identify an optimized non redundant descriptor that can lead to best models.Finally, when the descriptors are determined, it can be used to develop the QSAR Classification and QSAR Regression models.The QSAR Classification model building was executed using the Rotation Forest Classifier and DNN algorithms.QSAR Classification with the Rotation Forest Classifier algorithm uses 2 types of matrix rotation methods, namely PCA and SPCA.Each of these models is called QSAR RFC-PCA and QSAR RFC-SPCA.Meanwhile, QSAR Classification with the DNN algorithm is called QSAR DNN.The QSAR Regression model building was executed using the Rotation Forest Regressor algorithm.QSAR Regression with the Rotation Forest Regressor algorithm also uses PCA and SPCA as the rotation matrix methods.Each of these models is called QSAR RFR-PCA and QSAR RFR-SPCA.

Evaluation
To determine the QSAR classification model, a confusion matrix that is used to indicate the number of observations was predicted correctly or not is required [46].There are four parameters in this method, namely the true positive, false positive, false negative, and true negative.
Based on these parameters, the classification model evaluation metrics can calculate performance evaluation; such as Sensitivity, Specificity, Accuracy, and Matthews correlation coefficient (MCC), as explained as follows.
To assess the performance of the QSAR regression model obtained, the coefficient of determination R 2 of root mean square error (RMSE) can be used.This criterion is deter- mined using the formulas ( 15) and (16).

Molecular selection result
Clustering procedure is used to assist in the molecular selection process.Meanwhile, The K-modes clustering algorithm is used to classify DPP-4 inhibitor compounds based on molecular fingerprints that can help the process of selecting molecules rationally.Levenshtein distance, which is used as a measure of dissimilarity in the K-modes clustering algorithm, is used to measure the closeness or similarity of the (12)  ( DPP-4 inhibitor compound's molecules through a string comparison of molecular bit fingerprint vectors.The process of selecting DPP-4 inhibitor molecules was carried out based on the logP value criterion in the Lipinski rule after obtaining the results of grouping the DPP-4 inhibitor molecules in the clustering process using the K-modes clustering algorithm with Levenshtein distance. Based on the results of grouping four datasets of molecular fingerprint for DPP-4 inhibitor compounds with the ECFP and FCFP methods, namely ECFP_4, ECFP_6, FCFP_4, and FCFP_6, the number of clusters from each dataset were obtained from the lowest being FCFP_4 = 1261, FCFP_6 = 1261, ECFP_4 = 1263, and ECFP_6 = 1265.The FCFP_4 dataset obtained the highest silhouette coefficient value (0.2574).
Based on the calculation of the logP value for 2053 DPP-4 inhibitor molecules, the logP value was obtained in the range of − 1.6693 to 4.99044, meaning that the logP value has met the criterion of Lipinski's rule that the logP value of a compound molecule must be less than 5.All fingerprint datasets labelled for each cluster are sorted according to the lowest logP value.After all the datasets are sorted, one molecule with the lowest logP value is taken from each cluster in each fingerprint dataset.1263 molecules are obtained for the ECFP_4, 1265 for the ECFP_6, 1261 for the FCFP_4, and 1261 for the FCFP_6.Based on the selected molecules in all fingerprint datasets, 1020 molecules of the same DPP-4 inhibitor compound were obtained from all datasets.Those representative molecules can be further used for QSAR modelling.

Features selection result
This section will explain the results of selecting features using the CatBoost method.In assigning a value to essential features, each feature extraction dataset is used as an input vector at the learning stage of the CatBoost method.In this study, the CatBoost model learning was performed on each feature extraction data using the Python programming language with the CatBoost library.CatBoost models are generated, then there is an essential value in each feature.The feature that has the highest importance value indicates that it contributes the most in predicting the target.The performance results of the CatBoost model in predicting active and inactive compounds of DPP-4 inhibitors are presented in Table 1.
Table 1 informs that the accuracy of the CatBoost model results is above 0.840.Besides, the CatBoost model also has a balanced value between sensitivity, specificity, and MCC values.As a result, the recommended features as the essential features come from a good model.The following are the number of essential features for each dataset, where the sum of the feature values qualifies the proportion values of 60%, 70%, 80%, and 90%.There are 1024 features from the feature extraction results that are then selected using the CatBoost method.Fig. 2 explains that from the four fingerprint datasets, FCFP_4 received the least number of important features, by 55, with a proportion of 60%.In comparison, ECFP_4 received the most number of important features, by 280, with a ratio of 90%.Furthermore, these features will be used as input feature vectors to build the QSAR classification model using the Rotation Forest and DNN methods.

QSAR classification models using machine learning
This section will explain the QSAR model in the machine learning approach using the Rotation Forest method to predict DPP-4 inhibitors into active and inactive compounds with different numbers of features based on the results of feature selection in the Cat-Boost method using a proportion of feature importance value of 60%, 70%, 80%, and 90% and without doing performing feature selection (100% feature importance).The QSAR classification model building using the machine learning approach is called QSAR-RFC(PCA) and QSAR-RFC(SPCA).Furthermore, it will be observed whether using some of the features in the feature extraction dataset results in an increase or decrease in the performance of the Rotation Forest model.The performance results of the QSAR-RFC(PCA) and QSAR-RFC(SPCA) models by selecting or not selecting features using the CatBoost method are presented in Figs. 5 and 6, respectively.
The feature selection with a proportion of 60% increased the MCC value in the feature extraction datasets ECFP_4, ECFP_6 and FCFP_6.Still, it did not increase the MCC value in the FCFP_4 feature extraction dataset, because there was a difference in the decrease in the performance of the MCC value by 0.018 in the QSAR-RFC(PCA) models.On the other hand, in the QSAR-RFC(SPCA) model, the MCC value increased in the feature extraction dataset of ECFP_4, ECFP_6, FCFP_4 and FCFP_6.Thus, to predict the DPP-4 inhibitors into active or inactive compounds using the QSAR-RFC(PCA) model combined with the FCFP_4 feature extraction, it is better to use all of the features to get a good model.

QSAR classification models using deep learning
This section will explain the QSAR model in the deep learning approach using the DNN method to predict DPP-4 inhibitors into active and inactive compounds with different numbers of features based on the results of feature selection in the CatBoost method utilising a proportion of feature importance value of 60%, 70%, 80%, and 90% and without making feature selection (100% feature importance).QSAR classification model building using a deep learning approach is called QSAR-DNN.Furthermore, it will be observed whether using some of the features in the feature extraction dataset results in an increase or decrease of the performance of the DNN model.The performance results of the QSAR-DNN by selecting or not feature using the CatBoost method are presented in Figs. 5 and 7 .
The feature selection with a proportion of 60%, 70%, 80% and 90% increased the MCC value in the feature extraction dataset using QSAR-DNN models.For example, Fig. 4 shows that only by selecting features with a proportion of 60% in the ECFP_4 feature extraction, the QSAR-DNN can obtain an MCC value of 0.766; this is better than using all the features where the MCC value is 0.704.There are several features in the feature extraction of ECFP and FCFP that do not represent the molecular structure of the DPP-4 inhibitor properly, so these features need to be removed.The best QSAR-DNN in this study is the DNN model combined with ECFP_4 as a feature extraction method with a feature selection proportion of 90%.This model has sensitivity, specificity, accuracy and MCC values of 0.905, 0.903, 0.904 and 0.807, respectively.

Comparison of the QSAR classification model using machine learning and deep learning approach
The machine learning and deep learning methods used in this study are expected to predict each DPP-4 inhibitor molecular structure, whether it is an active or inactive compound.This section will compare the QSAR classification model using the machine learning approach, namely the QSAR-RFC(PCA) and QSAR-RFC(SPCA), with the QSAR classification model using deep learning approach, namely, QSAR-DNN.Based on the results obtained in this study, the QSAR model in the machine learning approach, the QSAR-RFC(SPCA) model combined with the ECFP_6 feature extraction and using the feature important value proportion of 60% of the features obtained an MCC value of 0.824, which is the highest MCC value.Meanwhile, the values for sensitivity, specificity and accuracy were 0.887, 0.938 and 0.911, respectively, and a running time of 11.622 s.Furthermore, the QSAR model, in the deep learning approach, has the best model, namely the QSAR-DNN model combined with ECFP_4.This model obtained the highest MCC value of 0.807 on ECFP_4 with a feature selection proportion of 90%, with the sensitivity, specificity and accuracy values of the QSAR-DNN model being 0.905, 0.903 and 0.904, respectively, and a running time of 2.046 s.
Balanced sensitivity, specificity and accuracy values are good criteria for a model [46], so that the model can distinguish between active and inactive compounds accurately.The QSAR classification model using the machine learning and deep learning approaches has shown a balanced performance between sensitivity, specificity and accuracy values.In this study, the author will not only observe the balance of sensitivity, specificity and accuracy values but also use the MCC value to compare the best performance of the QSAR classification model, because the MCC value is a more representative measurement metric when compared to accuracy [47].The QSAR classification model using machine learning has the highest MCC value of 0.824, but its running time is longer than that of the DNN model.

QSAR regression
This section discusses the QSAR regression model using a Rotation Forest with a PCA rotation matrix and SPCA rotation matrix.This QSAR regression model uses ).The machine learning approach using the Rotation Forest regression method will predict the value of DPP-4 inhibitor activity with several different features, based on the results of feature selection using the CatBoost Regressor method with the proportion of important feature values of 60%, 70%, 80%, and 90% and without making feature selection (100%).Furthermore, it will be observed whether using some of the features in the feature extraction dataset will increase or decrease the performance of the Rotation Forest regression model.In this study, the QSAR Rotation Forest regression model uses a PCA rotation matrix called QSAR-RFR(PCA), and the QSAR Rotation Forest regression model uses an SPCA rotation matrix called QSAR-RFR(SPCA).The performance results of the QSAR-RFR(PCA) model using the CatBoost feature selection method and without using the feature selection are presented in Tables 2, 3, 4 and 5.
Based on the simulation results, on the ECFP_4, ECFP_6, FCFP_4 and FCFP_6 datasets, the optimum features for the QSAR-RFR(PCA) model are 90%, because they have an R 2 value of 0.344, 0.345, 0.306 and 0.362 in each dataset.Meanwhile, without performing feature selection, the values for R 2 are 0.313, 0.278, 0.19 and 0.125 for each dataset, which means that the R 2 value is lower than the highest performance that can be achieved with the feature selection process.The performance results of   the QSAR-RFR(SPCA) model using the CatBoost feature selection method and without using the feature selection are presented in Tables 6, 7, 8 and 9.
Based on the simulation results, on the ECFP_4 and ECFP_6 datasets, the optimum features for the QSAR-RFR (SPCA) model are 90%, because they have R 2 values of 0.366 and 0.408 in each dataset, respectively.In the FCFP_4 dataset, the optimum number of features for the QSAR-RFR(SPCA) model is 60%, because it has an R 2 value of 0.334.
In the FCFP_6 dataset, on the other hand, the optimum number of features for the QSAR-RFR (SPCA) model is 70%, because it has an R 2 value of 0.353.Meanwhile,    without performing feature selection, the values for R 2 are 0.303,0.233,0.174 and 0.273 for each datasets ECFP_4, ECFP_6, FCFP_4 and FCFP_6, meaning that the R 2 value is lower than the highest performance that can be achieved with the feature selection process.Figs. 8 and 9 explain that the more features used, the longer the computation time required.From this, it can be observed that the importance of the feature selection process increases the value of R 2 and reduces the computation time in the QSAR-RFR(PCA) and QSAR-RFR(SPCA) models.

Conclusions
New dipeptidyl peptidase-4 (DPP-4) inhibitors need to be developed to be used as agents with low adverse effects for the treatment of type 2 diabetes mellitus.This study aims to build QSAR classification and regression models using machine  Clustering was carried out on the molecular fingerprint of DPP-4 inhibitors obtained from the SMILES feature.Several methods, such as ECFP diameter 4 and 6 and FCFP diameter 4 and 6, were used to construct four fingerprint datasets.From the selection process of molecules, 1020 representative DPP-4 inhibitor molecules are produced.Before constructing the QSAR classification and regression model, feature selection was carried out using CatBoost.The results of feature selection using CatBoost are proven can improve the performance of the QSAR model, which requires a running time that is not too long.It is indicated by the results of the performance of QSAR-RFC, QSAR-RFR, and QSAR-DNN that have improved compared to using all features.The QSAR-RFC(PCA) method performed best in terms of predicting active and inactive DPP-4 inhibitors using ECFP_4, with an accuracy, sensitivity, specificity and MCC of 0.902, 0.914, 0.887 and 0.802, respectively, at 90% features with a running time of 4.307 s.The QSAR-RFC(SPCA) model using ECFP_6 has good performance, obtaining the corresponding values of 0.911, 0.887, 0.938 and 0.824, respectively, at 60% features with a running time of 11.622 s.The best QSAR-DNN model performance using ECFP_4 resulted in accuracy, sensitivity, specificity and MCC of 0.904, 0.905, 0.903 and 0.807, respectively, at 60% features with a running time of 2.046 s.
From these observations, the PCA rotation matrix's use to predict the activity value of the DPP-4 inhibitor IC50 is good enough.Based on the explanation above, in the QSAR regression model, the use of the PCA rotation matrix on the rotation Rotation Forest method obtains a more efficient computation time than modifying the rotation matrix with SPCA.The ECFP_4 dataset of the QSAR-RFR(PCA) model was able to get an R 2 value of 0.344 with a computation time of 4.28 s, while the QSAR- RFR(SPCA) model obtained an R 2 value of 0.366 with a computation time of 32.11 s.It explains why the SPCA rotation matrix increases the R 2 value by 0.022 but requires a very long computation time.

Fig. 4
Fig.4 The number of important features

Table 1
CatBoost model performance