
Prediction of chemoresistance trait of cancer cell lines using machine learning algorithms and systems biology analysis


Most current cancer treatments are invasive and carry a broad spectrum of side effects. Furthermore, cancer drug resistance, known as chemoresistance, is a major obstacle during treatment. This study aims to predict the resistance of several cancer cell lines to the drug Cisplatin. Data were obtained from the NCBI GEO database, normalized, and corrected for batch effects with the ComBat software. To select appropriate features for machine learning, feature selection/reduction was performed based on the Fisher Score method. Six different machine learning algorithms were then used to detect Cisplatin-resistant and Cisplatin-sensitive samples in cancer cell lines. Moreover, Differentially Expressed Genes (DEGs) between all the sensitive and resistant samples were harvested. The selected genes were enriched in biological pathways using the Enrichr database. Topological analysis was then performed on the constructed networks using Cytoscape software. Finally, the biological description of the output genes from the performed analyses was investigated through literature review. Among the six classifiers trained to distinguish Cisplatin-resistant samples from sensitive ones, the KNN and Naïve Bayes algorithms were proposed as the most suitable machines according to the calculated performance measures. Furthermore, the systems biology analysis identified several potential chemoresistance genes, among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6 and IFNG are topologically more important than others. These predictions pave the way for further experimental research.


Cancer is one of the most lethal and costly diseases in the world. Several common therapeutic procedures exist, such as surgery, cytotoxic chemotherapy, targeted therapy, radiation therapy, endocrine therapy and immunotherapy, which are chosen based on the level of cancer aggressiveness. The majority of these approaches are invasive and have a broad spectrum of side effects [1,2,3,4]. Furthermore, one important challenge that clinical practice faces is drug resistance, which results from the tolerance of cancer cells to anti-cancer agents [5, 6].

The concept of drug resistance was first recognized through bacterial resistance to antibiotics, but the phenomenon was soon attributed to a wider range of disorders, including cancers [7]. Traditional chemotherapeutic agents destroy cancer cells by directly damaging DNA strands. They are therefore non-specific and cause broad side effects. Furthermore, studies show that new drugs developed to target cancer cells are most effective at the beginning of therapy, but as time passes, most patients develop resistance to them. Resistance to new targeted and chemotherapeutic agents is a major challenge in cancer therapy, as it is responsible for the preponderance of relapses. Drug resistance stems from different mechanisms, such as drug efflux, which may or may not be cancer-specific [7, 8].

So far, numerous studies have sought to identify and describe cancer drug resistance. Housman et al. categorized the mechanisms of drug resistance in cancer as drug inactivation, drug target alteration, drug efflux, DNA damage repair, cell death inhibition, and the epithelial-mesenchymal transition [7]. Drug resistance has also been classified into intrinsic resistance, which exists before drug treatment, and acquired resistance, which is induced after therapy. Predicting drug resistance can help to overcome the failure of targeted and chemical therapeutics in clinical anticancer treatment [9, 10].

Chemotherapy resistance prediction methods include cell culture-based chemosensitivity tests; DNA-, RNA- and protein-based chemosensitivity tests; and recently developed computational methods [11, 12]. Cell culture-based tests, which have been used for more than 30 years, have some technical weaknesses: the technique is time-consuming and the primary culture has a low success rate [13, 14].

To resolve these issues, DNA-, RNA- and protein-based chemosensitivity tests emerged [15, 16]. These methods include gene-based tools such as the Oncotype DX® assay, which uses 21 genes to predict the recurrence of breast cancer after treatment. The tool can also be used for other cancer types, including colon and prostate cancers [11]. Another such tool is MammaPrint®, which employs 70 genes to predict the possibility of metastasis in breast cancer [17]. The principal challenge in these tests is identifying the genes that participate in the chemoresistance process. Moreover, with the development of interdisciplinary techniques, computational and statistical methods are now used for predicting chemotherapy responses [11].

Numerous computational approaches have been developed to study drug resistance based on biological mechanisms. These techniques are generally divided into mechanistic modeling methods and data-driven prediction methods [18,19,20]. Molecular dynamics simulation, kinetic models of signaling networks, ordinary differential equation (ODE) models of cellular populations, stochastic models, partial differential equation (PDE) models, agent-based models and pharmacokinetic–pharmacodynamic models are examples of the first class. The second class relies on omics data-based node biomarker screening, static network biomarker prediction and dynamic network biomarker prediction models. Linear models, support vector machines (SVMs), hierarchical clustering, principal components analysis (PCA) and scoring algorithms are other computational methods used to predict cancer responses. These models, which fall under the umbrella of machine learning (ML) algorithms, are being used to predict the resistance of cancer cells to chemotherapy [20, 21]. Huang et al. describe ML as a part of artificial intelligence that can find correlations in cancer-relevant datasets [21]. Conventional analytical approaches for determining treatment are very expensive and are limited by inherent technical issues. ML algorithms are considered cost-effective, time-saving strategies that can evaluate multiple cell lines simultaneously [22, 23]. A further advantage of ML over conventional methods is its ability to uncover biological information that is concealed by tumor cell heterogeneity [24].

This study aims to predict the resistance of several cancer cell lines to Cisplatin. Six algorithms, namely Naïve Bayes, SVM, K-Nearest Neighbors (with k = 3), Decision tree, Random Forest and Neural network, are used to classify Cisplatin-sensitive and Cisplatin-resistant samples. Moreover, the results of our systems biology analysis indicated several potential chemoresistance genes, among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6 and IFNG are topologically more important than others. Our results have been validated against different databases such as UniProt, Enrichr and DIANA mirPath v.3, and against papers extracted from the literature.

The remainder of this paper is organized as follows: the “Materials and methods” section describes the materials and methods. The results are presented in the “Results” section. The discussion is reported in the “Discussion” section and, finally, the “Conclusion” section presents the conclusion of the overall work.

Materials and methods

Data collection

Data were obtained from the NCBI GEO database [25]. Five different microarray platforms, GPL13667, GPL6947, GPL6480, GPL6104 and GPL6244, were used. A total of 85 microarray gene expression samples were selected from the different platforms. The selected datasets relate to various cancer cell lines, such as ovarian, pancreatic and non-small cell lung cancer (NSCLC), both resistant and sensitive to Cisplatin. Sample numbers and cancer types are indicated in Table 1.

Table 1 Samples collected from NCBI GEO database

The data were normalized using the limma package in R [26]. The expression values of repeated probes in each dataset were averaged to obtain a unique expression value for each probe. A total of 7621 genes remained after isolating the genes common to all platforms. The ComBat software was then used to eliminate batch effects between different platforms and experiments [27]. In addition, the replicates of each platform were averaged (sensitive and resistant separately), which reduced the total number of samples from 85 to 14 Cisplatin-sensitive and 14 Cisplatin-resistant samples.
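The averaging of repeated probes and the isolation of genes shared by all platforms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `average_duplicate_probes` and `common_genes`, the choice to key each row by gene symbol, and the expression values are all hypothetical and are not taken from the study's own code or data.

```python
from statistics import mean

def average_duplicate_probes(rows):
    """Collapse repeated entries to one mean expression vector per gene.

    rows: list of (gene_symbol, [expression values, one per sample]).
    """
    grouped = {}
    for gene, values in rows:
        grouped.setdefault(gene, []).append(values)
    # Average column-wise, i.e. sample by sample, across the repeated rows.
    return {
        gene: [mean(column) for column in zip(*vectors)]
        for gene, vectors in grouped.items()
    }

def common_genes(platforms):
    """Return the sorted gene symbols present on every platform."""
    gene_sets = [set(platform) for platform in platforms]
    return sorted(set.intersection(*gene_sets))

# Toy data with hypothetical values (two samples per platform).
platform_a = average_duplicate_probes([
    ("CTNNB1", [2.0, 4.0]),
    ("CTNNB1", [4.0, 6.0]),  # second probe for the same gene
    ("IFNG", [1.0, 1.5]),
])
platform_b = average_duplicate_probes([
    ("CTNNB1", [3.0, 3.5]),
    ("EDNRB", [0.5, 0.7]),
])
shared = common_genes([platform_a, platform_b])
```

Only genes measured on every platform survive the intersection, which is why the feature space shrinks to the genes common to all five platforms.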

Data processing

As the data were collected from various sources, it was necessary to remove the discrepancies between samples known as “batch effects”. The batch effects of the 85 samples, each containing 7620 genes, were corrected using the SVA package. The SVA package comprises functions for removing batch effects and other unwanted variation in high-throughput experiments. Specifically, it contains functions to identify and build surrogate variables for high-dimensional datasets. Surrogate variables are covariates built directly from high-dimensional data (such as gene expression and RNA sequencing data) that can be used in subsequent analyses to adjust for unknown, unmodeled, or latent sources of noise. Moreover, the t-SNE algorithm was used in MATLAB to visualize the data before and after batch effect correction [28].
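To make the idea of a batch effect concrete, the sketch below removes a constant per-batch offset by subtracting each batch's per-gene mean. This is a deliberately simplified stand-in and not the ComBat method used in the study: ComBat applies an empirical Bayes adjustment of both location and scale, whereas this toy version only re-centers each batch. The function name `center_by_batch` and the data are hypothetical.

```python
from statistics import mean

def center_by_batch(expression, batches):
    """Subtract each batch's per-gene mean from the samples of that batch.

    expression: list of samples, each a list of gene values.
    batches: list of batch labels, one per sample.
    """
    n_genes = len(expression[0])
    batch_means = {}
    for label in set(batches):
        members = [s for s, b in zip(expression, batches) if b == label]
        batch_means[label] = [mean(s[g] for s in members) for g in range(n_genes)]
    return [
        [value - batch_means[b][g] for g, value in enumerate(sample)]
        for sample, b in zip(expression, batches)
    ]

# Hypothetical toy data: batch "B" is shifted by +10 relative to batch "A".
expression = [[1.0, 2.0], [3.0, 4.0], [11.0, 12.0], [13.0, 14.0]]
batches = ["A", "A", "B", "B"]
corrected = center_by_batch(expression, batches)
```

After centering, the two batches overlap exactly in this toy case. Plain centering would also erase genuine biological differences whenever a batch contains only one class, which is one reason a dedicated method is preferable in practice.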

Feature selection

High-dimensional DNA microarray data present serious challenges to existing machine learning and classification methods. In many medical and microarray datasets, many genes are irrelevant or redundant for the learning algorithm [29,30,31,32], and the number of features is much higher than the number of samples, which significantly reduces classification accuracy. It is therefore necessary to reduce the initial feature set using dimension reduction techniques [41, 42].

Feature selection (gene selection in DNA microarray datasets) is a popular and powerful approach to this problem in medical datasets [33,34,35]. By eliminating irrelevant and similar genes, gene selection retains only a subset of relevant and dissimilar genes that are strongly related to the objective function [36]. This increases the efficiency of the machine learning algorithm, reduces time complexity, yields more general classifiers, and reduces storage requirements [37, 38]. Gene selection approaches have been successfully employed in many medical applications, including gene expression analysis [39], cancer classification [40] and medical diagnosis [32]. Feature selection attempts to choose a subset of the initial features that satisfies two targets simultaneously: minimum similarity among the selected genes and maximum relevance of these genes to the target class [43, 44]. The main goal of this step is to select appropriate and important features from the original features [45]. For this purpose, the Fisher Score (FS) feature selection algorithm is used to select the features that are most relevant to the target class.

Fisher Score is a supervised filter method that selects a feature subset such that samples in the same class are most similar to each other and samples in different classes are less similar. As a result, this measure assigns higher values to genes with better class separation. The FS of gene \(f_k\) is defined as follows:

$$FS(f_k) = \frac{\sum\nolimits_{\nu \in V} n_\nu \left(\bar f_k^\nu - \bar f_k\right)^2}{\sum\nolimits_{\nu \in V} \left(\sigma_\nu(f_k)\right)^2}$$

where \(\overline{f}_{k}\) is the mean value of feature \(f_k\) over all samples, \(V\) is the set of all classes in the dataset, \(n_\nu\) is the number of samples in class \(\nu\), and \(\sigma_\nu(f_k)\) and \(\overline{f}_{k}^{\nu }\) denote the standard deviation and the mean of feature \(f_k\) within class \(\nu\), respectively, so that \(\sigma_\nu(f_k)^2\) is the within-class variance.
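A minimal Python sketch of this score for a single feature is given here, assuming the reading of the formula in which the numerator is the class-size-weighted squared distance between class means and the overall mean, and the denominator is the sum of the within-class variances. The use of the population variance (`statistics.pvariance`) is an assumption, and the function name `fisher_score` and the toy values are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Fisher Score of one feature.

    Numerator: sum over classes of n_v * (class mean - overall mean)^2.
    Denominator: sum over classes of the within-class (population) variance.
    """
    overall_mean = mean(values)
    numerator = 0.0
    denominator = 0.0
    for cls in set(labels):
        members = [v for v, l in zip(values, labels) if l == cls]
        numerator += len(members) * (mean(members) - overall_mean) ** 2
        denominator += pvariance(members)
    return numerator / denominator

# Hypothetical expression values for three sensitive (S) and three resistant (R) samples.
labels = ["S", "S", "S", "R", "R", "R"]
separating = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]   # classes far apart
overlapping = [1.0, 5.0, 9.0, 2.0, 5.0, 8.0]  # identical class means
```

A gene whose class means are far apart relative to its within-class spread receives a high score, while a gene with identical class means scores zero. A feature that is constant within every class would make the denominator zero, so a real implementation would need to guard against that case.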

The Fisher Score method was performed for 14 folds, and in each fold the 100 features that were most consistent with the normal distribution were selected, giving a total of 1400 features. In the next step, to reduce the dimensionality of the data, the PCA [46] method was performed for 14 folds on the output features of the Fisher Score method, and the reduced set of features was selected for machine training.

Machine learning

To assess the flexibility of the developed method across classifiers, six widely used machine learning algorithms, namely Naïve Bayes [47], SVM [7], KNN [48], Decision tree [49], Random Forest [50] and Neural Network [51], were trained to detect Cisplatin-sensitive and Cisplatin-resistant samples in cancer cell lines. These are among the most well-known machine learning algorithms and are widely used by researchers in prediction and classification problems.

In these experiments, the developed approach and the compared methods were implemented in the Python programming language on an Intel Core-i9 CPU with 16 GB of RAM.

Because of the small number of available samples (14 pairs, i.e. 14 sensitive and 14 resistant samples), the Leave-One-Out (LOO) method was used to train these algorithms [52]. In this method, at each step of the machine training, 13 pairs of samples were used as training data and one pair was used as test data. The training process continued until every pair of samples had been used once as the test data. The training was performed twice, once with the 1400 features extracted by the Fisher Score method and once with the 210 features extracted by the PCA method. Finally, the performance of the trained machines was evaluated according to Accuracy [53], Specificity [54], Sensitivity [54], Precision [53], MCC [55] and F1 score [55].
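The pair-wise leave-one-out scheme and the evaluation measures can be sketched as follows, using a small K-Nearest Neighbors classifier with k = 3 as the example machine. This is an illustrative sketch and not the study's implementation: the helper names (`knn_predict`, `leave_one_pair_out`, `summarize`), the Euclidean distance, the choice of "R" (resistant) as the positive class and the toy data are all assumptions. The sketch also pools the confusion counts over all folds before computing the measures, which may differ from averaging per-fold values.

```python
from collections import Counter
from math import dist, sqrt

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_pair_out(samples, labels, pair_ids, k=3):
    """Hold out one sensitive/resistant pair at a time; return confusion counts."""
    counts = Counter()
    for held_out in sorted(set(pair_ids)):
        train = [(x, y) for x, y, p in zip(samples, labels, pair_ids) if p != held_out]
        test = [(x, y) for x, y, p in zip(samples, labels, pair_ids) if p == held_out]
        train_x, train_y = zip(*train)
        for x, truth in test:
            predicted = knn_predict(train_x, train_y, x, k)
            if truth == "R":  # resistant is treated as the positive class
                counts["TP" if predicted == "R" else "FN"] += 1
            else:
                counts["TN" if predicted == "S" else "FP"] += 1
    return counts

def summarize(c):
    """Accuracy, sensitivity, specificity, precision, F1 and MCC from counts."""
    tp, tn, fp, fn = c["TP"], c["TN"], c["FP"], c["FN"]

    def ratio(a, b):
        return a / b if b else 0.0  # report 0.0 when a measure is undefined

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical toy data: 4 pairs, two features per sample, well separated classes.
samples = [(0.0, 0.1), (1.0, 1.1), (0.2, 0.0), (1.2, 0.9),
           (0.1, 0.3), (0.9, 1.0), (0.3, 0.2), (1.1, 1.2)]
labels = ["S", "R", "S", "R", "S", "R", "S", "R"]
pair_ids = [0, 0, 1, 1, 2, 2, 3, 3]
counts = leave_one_pair_out(samples, labels, pair_ids)
metrics = summarize(counts)
```

Holding out a whole pair rather than a single sample keeps the sensitive and resistant versions of the same cell line out of the training set at the same time, which is the property the 13-pair/1-pair split described above provides.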

Biological system evaluation

73 genes were selected using the CountIF filter in Excel (refer to online resource 1). These genes recurred in 50% or more of the 14 pairs of samples extracted from GEO. The limma package was used to calculate the DEGs between all sensitive and resistant samples. The results indicated that 34 of the 73 genes chosen in the feature selection phase were down-regulated and the rest were up-regulated. The genes were enriched in biological pathways using the Enrichr database [56]. To obtain the mirs related to the 73 harvested genes, the miRCancer [57] and miRDB [58] databases were used, and the mirs common to both databases were selected based on the specific cancer cell line as well as the direction of regulation. The mirs were enriched using the DIANA mirPath v.3 database [59] (refer to online resource 2), and pathways related to chemoresistance were selected among the obtained pathways. Furthermore, the transcription factors that regulate the obtained DE genes were harvested using the TRRUST v2 database [60] and were enriched in pathways related to chemoresistance using the Enrichr database.
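The recurrence filter described above was applied in Excel; an equivalent count can be sketched in Python as below. The sketch assumes that "repeated in 50% or more" means a gene was selected in at least half of the 14 folds. The function name `recurrent_genes` and the fold contents are hypothetical and serve only to illustrate the counting.

```python
from collections import Counter

def recurrent_genes(selected_per_fold, min_fraction=0.5):
    """Genes that were selected in at least min_fraction of the folds."""
    n_folds = len(selected_per_fold)
    # set(fold) ensures a gene is counted at most once per fold.
    counts = Counter(gene for fold in selected_per_fold for gene in set(fold))
    return sorted(g for g, c in counts.items() if c / n_folds >= min_fraction)

# Hypothetical selections from four folds (the study used 14 folds of 100 genes).
folds = [
    ["PTGER3", "YWHAH", "EDNRB"],
    ["PTGER3", "CTNNB1"],
    ["PTGER3", "YWHAH", "ACSL6"],
    ["IFNG", "CTNNB1"],
]
kept = recurrent_genes(folds)
```

Genes that appear in only one fold are dropped, so the surviving list favors features that are selected consistently regardless of which pair is held out.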


Using the obtained transcription factors and mirs, two types of network, a TF-gene network and a mir-gene network, were constructed using Cytoscape software [61]. Topological analyses were then performed using degree and betweenness centralities to report the hub genes of the networks.
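The degree-based part of the topological analysis was carried out in Cytoscape; the underlying computation is simple enough to sketch in Python. The sketch below counts the distinct neighbors of each node in an undirected mir-gene edge list and ranks genes by that raw degree. The function names and the edge list are hypothetical, and the mir names are placeholders rather than results of the study.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Number of distinct neighbours of every node in an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbours.items()}

def hubs(degrees, top_n=2):
    """Nodes ranked by degree, highest first (ties broken alphabetically)."""
    return sorted(degrees, key=lambda n: (-degrees[n], n))[:top_n]

# Hypothetical mir-gene edges; "mir-A" etc. are placeholder names.
edges = [
    ("mir-A", "PTGER3"), ("mir-B", "PTGER3"), ("mir-C", "PTGER3"),
    ("mir-A", "YWHAH"), ("mir-B", "YWHAH"),
    ("mir-C", "EDNRB"),
]
degrees = degree_centrality(edges)
gene_degrees = {n: d for n, d in degrees.items() if not n.startswith("mir-")}
top_genes = hubs(gene_degrees)
```

In a mir-gene network, a gene's degree is the number of mirs that target it, so a high-degree gene is one that many mirs converge on.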

Biological description

The obtained DEGs were studied through literature review. All 73 genes were checked against published articles and tabulated according to their mechanism of chemoresistance. Some of these genes have not been reported as chemoresistance agents in the related articles. Because the literature indicates that they play major roles in cells and are involved in important biological pathways, they were nominated for further resistance studies.


Results

Data processing

The t-SNE algorithm was used to visualize the data before and after batch effect correction (Fig. 1). This algorithm reduces the dimension of the data, making it easier to plot and therefore easier to comprehend.

Fig. 1 a Data before batch effect correction. The data dimension has been reduced using the t-SNE algorithm so that it can be displayed properly. Each batch is shown in a distinct color, and the samples of each batch clearly cluster together. b Data collected from various sources after batch effect correction by the ComBat software. The data from different batches are now intermixed and the boundaries between batches have been removed

Features selected and reduced by Fisher Score and PCA algorithms

To select appropriate features for the machine learning algorithms, feature selection was performed for 14 folds using the Fisher Score method. In each fold, the 100 genes that were most similar to the normal distribution were selected out of 7620 genes. Furthermore, to reduce the dimension of the selected features, the PCA algorithm was performed for 14 folds on the extracted features, and a reduced set of features was obtained in each fold (Table 2).

Table 2 Features selected for machine learning purposes. Selected features are obtained using Fisher Score and PCA algorithms, respectively

A machine learning approach to detect Cisplatin sensitive and resistant samples in cancer cell lines

The performance of the machines trained on the reduced features extracted by the PCA algorithm (Table 2) is shown in Fig. 2. Among the developed machines, the Decision Tree algorithm has the weakest average accuracy (50%), whereas KNN has the highest (67%). According to the box plots, the best accuracy-based performance also belongs to the KNN and Decision Tree algorithms (Fig. 2c).

Fig. 2 Performance of the machines trained using the reduced features extracted by the PCA algorithm. a The diagrams show the averages of the Accuracy, Specificity, Sensitivity and Precision criteria over 14 machine training cycles. b The diagrams show the averages of TP, TN, FP, FN, MCC and F1 Scores over 14 machine training cycles. c The box plots show the performance of the six proposed machines based on the accuracy, sensitivity, specificity and precision criteria

Among the developed machines, the Naïve Bayes algorithm is the weakest at detecting negative samples, with an average specificity of 50%. The Decision Tree algorithm, on the other hand, has the highest average specificity (69%), followed by the Random Forest and KNN algorithms. According to the box plots, the KNN algorithm performs best in terms of specificity (Fig. 2c).

The weakest performance in terms of sensitivity belongs to the Naïve Bayes algorithm, which did not correctly detect any positive samples. On the other hand, the KNN and Decision Tree algorithms have the highest average sensitivities, 78% and 67%, respectively (Fig. 2). According to the box plots, the Decision Tree algorithm performs best in terms of sensitivity (Fig. 2c).

The Decision Tree and Random Forest algorithms have the highest average precision, at 71%. According to the box plots, the Decision Tree algorithm performs best in terms of precision (Fig. 2c).

Similarly, a new set of six machines was trained, this time using the 1400 features extracted by the Fisher Score algorithm. The performance of these machines is shown in Fig. 3. In the new set, the KNN algorithm has the highest average accuracy (67%). The Random Forest, Naïve Bayes and SVM algorithms follow with an average accuracy of 64%. The KNN algorithm is also the best machine in terms of accuracy according to the box plots (Fig. 3c).

Fig. 3 Performance of the machines trained using the features obtained from the Fisher Score algorithm. a The diagrams show the averages of the accuracy, specificity, sensitivity and precision criteria. b The diagrams show the averages of TP, TN, FP, FN, MCC and F1 Scores. c The box plots show the performance of the six proposed machines based on the accuracy, sensitivity, specificity and precision criteria

The Naïve Bayes and Random Forest algorithms, with an average specificity of 67%, are the best machines at correctly detecting negative samples. The Decision Tree algorithm, with an average specificity of 45%, has the weakest performance in this regard (Fig. 3). According to the box plots, however, the KNN algorithm performs best in terms of specificity (Fig. 3c). The KNN algorithm, with 78% average sensitivity, is also the best machine at correctly detecting positive samples, followed by the SVM algorithm with an average sensitivity of 70% (Fig. 3). According to the box plots, the Naïve Bayes and KNN algorithms are, in that order, the best choices in terms of sensitivity. The Naïve Bayes and Random Forest algorithms, with an average precision of 71%, are the most precise machines, and the box plots likewise show that the Naïve Bayes algorithm outperformed the other algorithms in terms of precision. Finally, the KNN algorithm performs best according to the MCC criterion, and the Random Forest and Naïve Bayes algorithms perform best according to the F1 Score criterion (Fig. 3).

Determining specific mirs for extracted DE genes

The specific mirs for the 73 extracted DEGs were harvested from the miRCancer and miRDB databases for the related cancer types (Table 1). The results show that the mirs are downregulated for upregulated genes and upregulated for down regulated genes (Table 3; Fig. 4).

Table 3 mirs related to both upregulated and down regulated genes. In the upregulated genes the mirs are down and in the down regulated genes the mirs are up

Fig. 4 Network of mirs-Targets

mir-target network topology

The mir target network has been topologically analyzed using the degree centrality and the results revealed the hub genes which should be evaluated for their performance in the chemoresistance process (Table 4).

Table 4 The extracted hub genes and mirs nominated for performance evaluation in the chemoresistance process

mir enrichment

The upregulated and down regulated mirs were enriched in the pathways related to chemoresistance using the DIANA mirPath v.3 database. The related pathways were extracted from the KEGG database using a P-value threshold of 0.05 (the extracted pathways are listed in Additional file 1).

TF network topology

Three factors including the in-degree, the out-degree and the betweenness centralities have been noted in TF-gene interaction network for topological analysis. The detected hub genes are related to the in-degree centrality and are listed in Table 5.
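For a directed TF-gene network, the three measures can be computed as sketched below. In-degree and out-degree are simple edge counts, and betweenness is computed with Brandes' algorithm for unweighted graphs, counting ordered source-target pairs without normalization. The function names, node names and edges are hypothetical placeholders, and the unnormalized convention is an assumption that may differ from the values a network tool reports.

```python
from collections import Counter, deque

def build_adjacency(edges):
    """Directed adjacency lists; every node appears as a key."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
        adj.setdefault(dst, [])
    return adj

def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)      # number of shortest paths from s
        sigma[s] = 1
        distance = dict.fromkeys(adj, -1)  # -1 marks unvisited nodes
        distance[s] = 0
        queue = deque([s])
        while queue:                       # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                       # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Hypothetical regulator -> target edges; names are placeholders.
edges = [("TF1", "TF2"), ("TF2", "GeneA"), ("TF1", "GeneB"), ("TF2", "GeneB")]
in_deg = Counter(dst for _, dst in edges)
out_deg = Counter(src for src, _ in edges)
bc = betweenness(build_adjacency(edges))
```

In this toy network, "GeneB" has the highest in-degree because two regulators target it, while "TF2" is the only node with non-zero betweenness because it lies on the only path from "TF1" to "GeneA".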

Table 5 The detected hub genes based on the in-degree centrality in the TF-gene interaction network

The network illustrated in Fig. 5 was constructed by merging the upregulated and down regulated genes with their corresponding regulatory TFs. The topology analysis was performed on this network. According to the results obtained from the TRRUST database, the TFs were down for upregulated genes and up for down regulated genes (Table 6).

Fig. 5 TF-gene network involved in the chemoresistance process. Upregulated genes are colored in pink while down regulated ones are blue

Table 6 The TFs related to both upregulated and down regulated genes. In the upregulated genes the TFs are down and in the down regulated genes the TFs are up

TF enrichment

The transcription factors regulating the genes harvested from the feature selection step were enriched using Enrichr database in the oncogenes and chemoresistance pathways. In the Enrichr database the pathways were harvested from KEGG and the adjusted P-value was significant. The results are shown in Table 7.

Table 7 The annotations of transcription factors regulating the harvested genes from the feature selection step

Biological description

A literature review was performed on the 73 DE genes harvested by the CountIF filter in the feature selection step. A group of these genes has been reported in the literature as chemoresistance genes (Additional file 2). The others were found to have a vital role in the oncogenesis pathway and other important cell functions. Therefore, although they have not been reported as chemoresistance genes, we propose them as potential chemoresistance genes (Table 8). Future studies can be performed to validate these results.

Table 8 Potential chemoresistance genes proposed for further studies


Discussion

The main aim of this study was to train a machine to detect Cisplatin-sensitive and Cisplatin-resistant samples in different cancer cell lines, including ovarian, pancreatic and lung cancers. Six machines were developed based on different algorithms, namely Naïve Bayes, SVM, KNN, Decision tree, Random Forest and Neural network, and the results were evaluated by accuracy, specificity, sensitivity, precision, MCC and F1 Score. Furthermore, a series of systems biology analyses was performed using the DE genes harvested from the feature selection step to further strengthen the study.

The machines trained on the features extracted by the Fisher Score algorithm performed better than those trained on the same set of features after reduction by the PCA algorithm. This is because the features selected by the Fisher Score algorithm carry richer distinguishing information than the PCA-reduced features. The KNN algorithm was an exception and performed similarly in both cases, because the arrangement of the data is preserved in the reduced space after PCA dimension reduction. Since the KNN algorithm classifies a sample based on its K nearest neighbors, preserving the data arrangement after dimension reduction leads to the same results in both cases. Furthermore, the KNN and Naïve Bayes algorithms are proposed as the most appropriate machines for prediction. However, the appropriate machine must be selected based on the specific application under consideration.

Using the extracted classifying features, a mir-target network and a TF-gene network were constructed, and enrichment and topology analyses were performed to detect hub genes and hub TFs. Based on degree centrality in the mir-target network, the target genes PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB and ACSL6 were detected as the chemoresistance hub genes. PTGER3 encodes the PTGER3 protein, a member of the G-protein coupled receptor family. In this study, the degree of PTGER3 was seven, which is higher than that of the other hub genes. Furthermore, this gene has been reported to be a cisplatin-resistance gene acting through the Ras-MAPK/Erk-ETS1-ELK1/CFTR1 pathways [93]. YWHAH and CTNNB1, each with a degree centrality of six, were the next most connected hub genes. It has been reported that after chemotherapy YWHAH is upregulated in prostate cancer cells and down regulated in liposarcoma, indicating the potential role of this gene in chemoresistance [74, 75]. Moreover, it has been shown that CTNNB1 has a vital role in cancer regulatory pathways such as gastric cancer signaling [94]. With a degree centrality of five, ANKRD50 and EDNRB were the next hub genes. Previous studies have reported that EDNRB methylation is a very common phenomenon in NSCLC. Because of the higher rate of EDNRB methylation in squamous cell carcinomas (SCCs), it can be used to distinguish between SCCs and lung adenocarcinoma [87].

The TF-gene network topology analysis was also performed, and the results specified two hub genes, IFNG and CTNNB1. IFNG is a protein-coding gene and is involved in folate metabolism. Furthermore, Yaghoobi et al. evaluated the roles of IFNG and its antisense (IFNG-AS1) in breast cancer and proposed that IFNG and IFNG-AS1 are involved in the pathogenesis of breast cancer [95]. In another study, Gao et al. investigated the role of the IFNG pathway in the mechanism of resistance to anti-CTLA-4 therapy. Anti-CTLA-4 therapy is associated with IFNG production, which enhances T cell responses. Their data revealed that defects in the IFNG signaling pathway lead to resistance to anti-CTLA-4 therapy [96].

With a systems biology approach comprising machine learning methods, feature selection, topological analysis, enrichment analysis and, finally, literature review, we obtained a set of genes that play critical roles in chemoresistance processes. We have also nominated a set of potential chemoresistance genes that could be examined in further studies.


Conclusion

In this study, a machine learning approach and systems biology analysis were used to extract the genes that commonly separate cisplatin-resistant samples from sensitive ones in lung, pancreatic and ovarian cancers. Furthermore, six classifiers were trained to distinguish chemoresistant samples from sensitive ones. As a result, the KNN and Naïve Bayes algorithms were selected as the most practical machines according to a set of calculated measures. Moreover, the results of our systems biology analysis indicated several potential chemoresistance genes, among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6 and IFNG are topologically more important than others. Our results have been validated against different databases such as UniProt, Enrichr and DIANA mirPath v.3, and against papers extracted from the literature. Therefore, this in silico study and its predictions can pave the way for further experimental research.

Availability of data and materials

The dataset used in this study can be obtained from the corresponding author on reasonable request.



Abbreviations

DEGs: Differentially Expressed Genes
ODE: Ordinary differential equation
PDE: Partial differential equation
SVM: Support vector machines
PCA: Principal components analysis
ML: Machine learning
NSCLC: Non-small cell lung cancer
FS: Fisher Score
LOO: Leave one out


  1. Mazumdar M, Glassman J. Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer treatments. Stat Med. 2000;19(1):113–32.

    Article  Google Scholar 

  2. Urruticoechea A, Alemany R, Balart J, Villanueva A, Viñals F, Capella G. Recent advances in cancer therapy: an overview. Curr Pharm Des. 2010;16(1):3–10.

    Article  Google Scholar 

  3. Damin DC, Lazzaron AR. Evolving treatment strategies for colorectal cancer: a critical review of current therapeutic options. World J Gastroenterol: WJG. 2014;20(4):877.

  4. Khalil DN, Smith EL, Brentjens RJ, Wolchok JD. The future of cancer treatment: immunomodulation, CARs and combination immunotherapy. Nat Rev Clin Oncol. 2016;13(5):273.

  5. Raguz S, Yagüe E. Resistance to chemotherapy: new treatments and novel insights into an old problem. Br J Cancer. 2008;99(3):387–91.

  6. Rebucci M, Michiels C. Molecular aspects of cancer cell resistance to chemotherapy. Biochem Pharmacol. 2013;85(9):1219–26.

  7. Housman G, et al. Drug resistance in cancer: an overview. Cancers. 2014;6(3):1769–92.

  8. Uramoto H, Tanaka F. Recurrence after surgery in patients with NSCLC. Transl Lung Cancer Res. 2014;3(4):242.

  9. Lippert TH, Ruoff H-J, Volm M. Intrinsic and acquired drug resistance in malignant tumors. Arzneimittelforschung. 2008;58(06):261–4.

  10. Kelderman S, Schumacher TN, Haanen JB. Acquired and intrinsic resistance in cancer immunotherapy. Mol Oncol. 2014;8(6):1132–9.

  11. Lloyd KL, Cree IA, Savage RS. Prediction of resistance to chemotherapy in ovarian cancer: a systematic review. BMC Cancer. 2015;15(1):117.

  12. Sekine I, Shimizu C, Nishio K, Saijo N, Tamura T. A literature review of molecular markers predictive of clinical response to cytotoxic chemotherapy in patients with breast cancer. Int J Clin Oncol. 2009;14(2):112–9.

  13. Cortazar P, Johnson BE. Review of the efficacy of individualized chemotherapy selected by in vitro drug sensitivity testing for patients with cancer. J Clin Oncol. 1999;17(5):1625.

  14. Fruehauf JP, Alberts DS. Assay-assisted treatment selection for women with breast or ovarian cancer. In: Chemosensitivity testing in oncology. Springer; 2003. p. 126–145.

  15. Sekine I, Minna JD, Nishio K, Saijo N, Tamura T. Genes regulating the sensitivity of solid tumor cell lines to cytotoxic agents: a literature review. Jpn J Clin Oncol. 2007;37(5):329–36.

  16. Sekine I, Minna JD, Nishio K, Tamura T, Saijo N. A literature review of molecular markers predictive of clinical response to cytotoxic chemotherapy in patients with lung cancer. J Thorac Oncol. 2006;1(1):31–7.

  17. Slodkowska EA, Ross JS. MammaPrint™ 70-gene signature: another milestone in personalized medical care for breast cancer patients. Expert Rev Mol Diagn. 2009;9(5):417–22.

  18. Camidge DR, Pao W, Sequist LV. Acquired resistance to TKIs in solid tumours: learning from lung cancer. Nat Rev Clin Oncol. 2014;11(8):473.

  19. Sawyers C. Targeted cancer therapy. Nature. 2004;432(7015):294.

  20. Sun X, Hu B. Mathematical modeling and computational prediction of cancer drug resistance. Brief Bioinform. 2017;19(6):1382–99.

  21. Huang C, et al. Machine learning predicts individual cancer patient responses to therapeutic drugs with high accuracy. Sci Rep. 2018;8(1):16444.

  22. Ali M, Aittokallio T. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys Rev. 2019;11(1):31–9.

  23. Chen R, Liu X, Jin S, Lin J, Liu J. Machine learning for drug-target interaction prediction. Molecules. 2018;23(9):2208.

  24. Liu R, Zhang G, Yang Z. Towards rapid prediction of drug-resistant cancer cell phenotypes: single cell mass spectrometry combined with machine learning. Chem Commun. 2019;55(5):616–9.

  25. Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

  26. R Core Team. R: A language and environment for statistical computing. Vienna; 2013.

  27. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.

  28. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(5):2579–605.

  29. Rostami M, Moradi P. A clustering based genetic algorithm for feature selection. In: Information and Knowledge Technology (IKT). 2014. p. 112–116.

  30. Moradi P, Rostami M. A graph theoretic approach for unsupervised feature selection. Eng Appl Artif Intell. 2015;44:33–45.

  31. Moradi P, Rostami M. Integration of graph clustering with ant colony optimization for feature selection. Knowledge Based Syst. 2015;84:144–61.

  32. Rostami M, Forouzandeh S, Berahmand K, Soltani M. Integration of multi-objective PSO based feature selection and node centrality for medical datasets. Genomics. 2020;112(6):4370–84.

  33. Berahmand K, Haghani S, Rostami M, Li Y. A new attributed graph clustering by using label propagation in complex networks. J King Saud Univ Comput Inf Sci. 2020.

  34. Rostami M, Berahmand K, Forouzandeh S. A novel community detection based genetic algorithm for feature selection. J Big Data. 2021;8(1):2.

  35. Rostami M, Berahmand K, Forouzandeh S. A novel method of constrained feature selection by the measurement of pairwise constraints uncertainty. J Big Data. 2020;7(1):83.

  36. Liu Y, Nie F, Gao Q, Gao X, Han J, Shao L. Flexible unsupervised feature extraction for image classification. Neural Netw. 2019;115:65–71.

  37. Wang H, Zhang Y, Zhang J, Li T, Peng L. A factor graph model for unsupervised feature selection. Inf Sci. 2019;480:144–59.

  38. Tang X, Dai Y, Xiang Y. Feature selection based on feature interactions with application to text categorization. Expert Syst Appl. 2019;120:207–16.

  39. Wahid A, et al. Feature selection and classification for gene expression data using novel correlation based overlapping score method via Chou’s 5-steps rule. Chemom Intell Lab Syst. 2020;199:103958.

  40. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.

  41. Yazdi KM, et al. Prediction optimization of diffusion paths in social networks using integration of ant colony and densest subgraph algorithms. J High Speed Netw. 2020;26:141–53.

  42. Yazdi KM et al. Improving recommender systems accuracy in social networks using popularity. In: 2019 20th international conference on parallel and distributed computing, applications and technologies (PDCAT). 2019. p. 301–307.

  43. Gao W, Hu L, Zhang P, He J. Feature selection considering the composition of feature relevancy. Pattern Recognit Lett. 2018;112:70–4.

  44. Abdulla M, Khasawneh MT. G-Forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays. Artif Intell Med. 2020;108:101941.

  45. Rostami M, Berahmand K, Nasiri E, Forouzande S. Review of swarm intelligence-based feature selection methods. Eng Appl Artif Intell. 2021;100:104210.

  46. Lever J, Krzywinski M, Altman N. Points of significance: principal component analysis. ed: Nature Publishing Group; 2017.

  47. Lewis DD. Naive (Bayes) at forty: the independence assumption in information retrieval. Springer; 1998. p. 4–15.

  48. Zhang Z. Introduction to machine learning: k-nearest neighbors. Ann Transl Med. 2016;4(11):218.

  49. Barros RC, Basgalupp MP, Freitas AA, De Carvalho AC. Evolutionary design of decision-tree algorithms tailored to microarray gene expression data sets. IEEE Trans Evol Comput. 2013;18(6):873–92.

  50. Díaz-Uriarte R, De Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinform. 2006;7(1):3.

  51. Lancashire LJ, Lemetre C, Ball GR. An introduction to artificial neural networks in bioinformatics—application to complex microarray and mass spectrometry datasets in cancer studies. Brief Bioinform. 2009;10(3):315–29.

  52. Elisseeff A, Pontil M. Leave-one-out error and stability of learning algorithms with applications. NATO Sci Ser Sub Ser iii Comput Syst Sci. 2003;190:111–30.

  53. Heidaryan E. A note on model selection based on the percentage of accuracy-precision. J Energy Resour Technol. 2019.

  54. Altman DG, Bland JM. Diagnostic tests. 1: sensitivity and specificity. BMJ Br Med J. 1994;308(6943):1552.

  55. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6.

  56. Chen EY, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform. 2013;14(1):128.

  57. Xie B, Ding Q, Han H, Wu D. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics. 2013;29(5):638–44.

  58. Liu W, Wang X. Prediction of functional microRNA targets by integrative modeling of microRNA binding and target expression data. Genome Biol. 2019;20(1):1–10.

  59. Vlachos IS, et al. DIANA-miRPath v3.0: deciphering microRNA function with experimental support. Nucleic Acids Res. 2015;43(W1):W460–6.

  60. Han H, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res. 2018;46(D1):D380–6.

  61. Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.

  62. Sekine Y, et al. The Kelch repeat protein KLHDC10 regulates oxidative stress-induced ASK1 activation by suppressing PP5. Mol Cell. 2012;48(5):692–704.

  63. Zhong M, et al. Expression of MSP58 in hepatocellular carcinoma. Med Oncol. 2013;30(2):539.

  64. Chae YK, et al. Genomic landscape of DNA repair genes in cancer. Oncotarget. 2016;7(17):23312.

  65. Fischer F. The function of mismatch repair proteins in response to DNA damage caused by chemotherapeutic agents, University of Zurich; 2007.

  66. Zhang D, et al. Regulation of the adaptation to ER stress by KLF4 facilitates melanoma cell metastasis via upregulating NUCB2 expression. J Exp Clin Cancer Res. 2018;37(1):176.

  67. Qu S, et al. MicroRNA-330 is an oncogenic factor in glioblastoma cells by regulating SH3GL2 gene. PLoS ONE. 2012;7(9):e46010.

  68. Yang Z, et al. GRSF1-mediated MIR-G-1 promotes malignant behavior and nuclear autophagy by directly upregulating TMED5 and LMNB1 in cervical cancer cells. Autophagy. 2019;15(4):668–85.

  69. Vert A, Castro J, Ribo M, Vilanova M, Benito A. Transcriptional profiling of NCI/ADR-RES cells unveils a complex network of signaling pathways and molecular mechanisms of drug resistance. Onco Targets Ther. 2018;11:221.

  70. Zheng P, Wang W, Muxi Ji QZ, Feng Y, Zhou F, He Q. TMEM119 promotes gastric cancer cell migration and invasion through STAT3 signaling pathway. OncoTargets Ther. 2018;11:5835.

  71. Zheng P, et al. TMEM119 silencing inhibits cell viability and causes the apoptosis of gastric cancer SGC-7901 cells. Oncol Lett. 2018;15(6):8281–6.

  72. Gheysarzadeh A, Bakhtiari H, Ansari A, Emami Razavi A, Emami MH, Mofid MR. The insulin-like growth factor binding protein-3 and its death receptor in pancreatic ductal adenocarcinoma poor prognosis. J Cell Physiol. 2019;234(12):23537–46.

  73. Lee M, Cheung G, Nair R, Done S. Defining the roles of COIL and WIPI1 in breast cancer metastasis. ed: AACR; 2012.

  74. Daigeler A, et al. Heterogeneous in vitro effects of doxorubicin on gene expression in primary human liposarcoma cultures. BMC Cancer. 2008;8(1):313.

  75. Kibel AS, et al. Genetic variants in cell cycle control pathway confer susceptibility to aggressive prostate carcinoma. Prostate. 2016;76(5):479–90.

  76. Sandhu V. A systems biology approach to integrated molecular analysis in pancreatic and periampullary adenocarcinoma. 2016.

  77. Shih I-M, Nakayama K, Wu G, Nakayama N, Zhang J, Wang T-L. Amplification of the ch19p13.2 NACC1 locus in ovarian high-grade serous carcinoma. Mod Pathol. 2011;24(5):638.

  78. Xia L, et al. ACP5, a direct transcriptional target of FoxM1, promotes tumor metastasis and indicates poor prognosis in hepatocellular carcinoma. Oncogene. 2014;33(11):1395.

  79. Qi H, Liu S, Guo C, Wang J, Greenaway FT, Sun M. Role of annexin A6 in cancer. Oncol Lett. 2015;10(4):1947–52.

  80. O’Sullivan D, et al. A novel inhibitory anti-invasive MAb isolated using phenotypic screening highlights AnxA6 as a functionally relevant target protein in pancreatic cancer. Br J Cancer. 2017;117(9):1326.

  81. Shinmura K, et al. BSND and ATP6V1G3: novel immunohistochemical markers for chromophobe renal cell carcinoma. Medicine. 2015;94(24):e989.

  82. Eo H-S, Heo JY, Choi Y, Hwang Y, Choi H-S. A pathway-based classification of breast cancer integrating data on differentially expressed genes, copy number variations and MicroRNA target genes. Mol Cells. 2012;34(4):393–8.

  83. Khan K, Hardy R, Haq A, Ogunbiyi O, Morton D, Chidgey M. Desmocollin switching in colorectal cancer. Br J Cancer. 2006;95(10):1367.

  84. Cui T, et al. Diagnostic and prognostic impact of desmocollins in human lung cancer. J Clin Pathol. 2012;65(12):1100–6.

  85. Ladner RD. The role of dUTPase and uracil-DNA repair in cancer chemotherapy. Curr Protein Pept Sci. 2001;2(4):361–70.

  86. Schussel J, et al. EDNRB and DCC salivary rinse hypermethylation has a similar performance as expert clinical examination in discrimination of oral cancer/dysplasia versus benign lesions. Clin Cancer Res. 2013;19(12):3268–75.

  87. Chen S-C, et al. Aberrant promoter methylation of EDNRB in lung cancer in Taiwan. Oncol Rep. 2006;15(1):167–72.

  88. Chen F, He B, Yan L, Qiu Y, Lin L, Cai L. FADS1 rs174549 polymorphism may predict a favorable response to chemoradiotherapy in oral cancer patients. J Oral Maxillofac Surg. 2017;75(1):214–20.

  89. Zhang K, Waxman DJ. PC3 prostate tumor-initiating cells with molecular profile FAM65B high/MFI2 low/LEF1 low increase tumor angiogenesis. Mol Cancer. 2010;9(1):319.

  90. Mironova N, Patutina O, Brenner E, Kurilshikov A, Vlassov V, Zenkova M. The systemic tumor response to RNase A treatment affects the expression of genes involved in maintaining cell malignancy. Oncotarget. 2017;8(45):78796.

  91. Raymond JR, Appleton KM, Pierce JY, Peterson YK. Suppression of GNAI2 message in ovarian cancer. J Ovarian Res. 2014;7(1):6.

  92. Jiang J-Y, Liou R-J, Lee S-J. A fuzzy self-constructing feature clustering algorithm for text classification. IEEE Trans Knowl Data Eng. 2011;23(3):335–49.

  93. Rodriguez-Aguayo C, et al. PTGER3 induces ovary tumorigenesis and confers resistance to cisplatin therapy through up-regulation Ras-MAPK/Erk-ETS1-ELK1/CFTR1 axis. EBioMedicine. 2019;40:290–304.

  94. Tanabe S, Kawabata T, Aoyagi K, Yokozaki H, Sasaki H. Gene expression and pathway analysis of CTNNB1 in cancer and stem cells. World J Stem Cells. 2016;8(11):384–95.

  95. Yaghoobi H, Azizi H, Oskooei VK, Taheri M, Ghafouri-Fard S. Assessment of expression of interferon γ (IFN-G) gene and its antisense (IFNG-AS1) in breast cancer. World J Surg Oncol. 2018;16(1):211.

  96. Gao J, et al. Loss of IFN-γ pathway genes in tumor cells as a mechanism of resistance to anti-CTLA-4 therapy. Cell. 2016;167(2):397-404.e9.

Author information

The specific contributions made by each author are as follows: AA: Conceptualization, Methodology, Implementation, Writing – Original Draft, Writing – Review & Editing. NSM: Methodology, Implementation, Validation, Writing – Review & Editing. JZ: Methodology, Implementation, Validation, Writing – Review & Editing. MR: Methodology, Implementation, Validation, Writing – Review & Editing. SSA: Conceptualization, Writing – Review & Editing. AAR: Conceptualization, Writing – Review & Editing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to S. Shahriar Arab.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

The extracted related pathway.

Additional file 2.

The reported chemoresistance genes.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Ataei, A., Majidi, N.S., Zahiri, J. et al. Prediction of chemoresistance trait of cancer cell lines using machine learning algorithms and systems biology analysis. J Big Data 8, 97 (2021).