Prediction of chemoresistance trait of cancer cell lines using machine learning algorithms and systems biology analysis

Ataei, Atousa; Majidi, Niloufar Seyed; Zahiri, Javad; Rostami, Mehrdad; Arab, S. Shahriar; Rizvanov, Albert A.

doi:10.1186/s40537-021-00477-z

Research
Open access
Published: 05 July 2021

Prediction of chemoresistance trait of cancer cell lines using machine learning algorithms and systems biology analysis

Atousa Ataei¹,
Niloufar Seyed Majidi²,
Javad Zahiri³,
Mehrdad Rostami⁴,
S. Shahriar Arab ORCID: orcid.org/0000-0002-2739-1617² &
…
Albert A. Rizvanov¹

Journal of Big Data volume 8, Article number: 97 (2021) Cite this article

3939 Accesses
4 Citations
Metrics details

Abstract

Most of the current cancer treatment approaches are invasive along with a broad spectrum of side effects. Furthermore, cancer drug resistance known as chemoresistance is a huge obstacle during treatment. This study aims to predict the resistance of several cancer cell-lines to a drug known as Cisplatin. In this papers the NCBI GEO database was used to obtain data and then the harvested data was normalized and its batch effects were corrected by the Combat software. In order to select the appropriate features for machine learning, the feature selection/reduction was performed based on the Fisher Score method. Six different algorithms were then used as machine learning algorithms to detect Cisplatin resistant and sensitive samples in cancer cell lines. Moreover, Differentially Expressed Genes (DEGs) between all the sensitive and resistance samples were harvested. The selected genes were enriched in biological pathways by the enrichr database. Topological analysis was then performed on the constructed networks using Cytoscape software. Finally, the biological description of the output genes from the performed analyses was investigated through literature review. Among the six classifiers which were trained to distinguish between cisplatin resistance samples and the sensitive ones, the KNN and the Naïve Bayes algorithms were proposed as the most convenient machines according to some calculated measures. Furthermore, the results of the systems biology analysis determined several potential chemoresistance genes among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6, IFNG and, CTNNB1 are topologically more important than others. These predictions pave the way for further experimental researches.

Introduction

Cancer is one of the most lethal and costly diseases around the world. There exist several common therapeutic procedures such as surgery, cytotoxic chemotherapy, targeted therapy, radiation therapy, endocrine therapy and immunotherapy which are used based on the level of cancer aggressiveness. The majority of the aforementioned approaches are invasive with a broad spectrum of side effects [1,2,3,4]. Furthermore, one important challenge that clinical practice face is drug resistance, which results from tolerance of cancer cells to anti-cancer agents [5, 6].

The concept of drug resistance was first distinguished by observing bacterial resistance to antibiotics, but the phenomenon was also attributed to a wider range of disorders including cancers in no time [7]. Traditional chemotherapeutic agents destroy cancer cells by directly damaging DNA strand. Therefore, not only they are non-specific but also, they result in broad side effects. Furthermore, studies show that new drugs developed for targeting cancer cells are most effective in the beginning of the therapies, but as time passes, most patients show resistance to these drugs. Resistance to new targeted, chemotherapeutic agents is a big challenge in cancer therapies as these agents are responsible for the preponderance of relapses. The drug resistance phenomena root from different mechanisms which can be cancer-specific or not, such as drug efflux [7, 8].

So far, numerous researches have been done to distinguish and describe cancer drug resistance. Housman et al. categorized the mechanisms of drug resistance in cancer as drug inactivation, drug target alteration, drug efflux, DNA damage repair, cell death inhibition, and the epithelial-mesenchymal transition [7]. On the other hand, drug resistance was classified into intrinsic resistance that exists before drug treatment or acquired resistance which is induced after the therapy. The prediction of drug resistance can overcome the inevitable failure of targeted and chemical therapeutics in clinical anticancer treatment [9, 10].

Chemotherapy resistance prediction methods include cell culture-based chemo-sensitivity tests, DNA, RNA, and protein-based chemo-sensitivity tests and recently developed computational methods [11, 12]. Cell culture-based tests which have been used for more than 30 years have some technical weaknesses. The technique is time-consuming and the primary culture has a low potency to success [13, 14].

To resolve the aforementioned issues, DNA, RNA, and protein-based chemosensitivity tests emerged [15, 16]. These methods include gene-based tools such as the Oncotype DX® assay which uses 21 genes to predict the recovery of breast cancer after treatment. The tool can also be used for other cancer types including colon and prostate cancers [11]. Another such tool is the MammaPrint® which employs 70 genes to predict the possibility of metastasis in breast cancer [17]. The principal challenge in these tests is the recognition of participating genes in the chemoresistance process. Moreover, due to the development of interdisciplinary techniques, computational and statistical methods are used for predicting chemotherapy responses [11].

So far, numerous computational approaches are developed to study drug resistance based on biological mechanisms. These computational techniques are generally divided into mechanism-based mechanistic modeling methods and data-driven prediction methods [18,19,20]. Molecular dynamics simulation, Kinetic models of signaling networks, Ordinary differential equation (ODE) model of cellular populations, Stochastic models, Partial differential equation models (PDEs), Agent-based and Pharmacokinetic–pharmacodynamic models are examples of the first class. The second class benefits from Omics data-based node biomarker screening, Static network biomarker prediction and Dynamic network biomarker prediction models. Linear models, support vector machines (SVMs), hierarchical clustering, principal components analysis (PCA) and the formation of a scoring algorithm are other models of computational methods used for the prediction of cancer responses. These models which belong to a concept known as “machine learning algorithms (ML)” are being used to predict resistance of cancer cells to chemotherapy [20, 21]. Huang et al. represent ML as a part of artificial intelligence that can find correlations in the cancer-relevant datasets [21]. Conventional analytical approaches for determining treatment are very expensive and are also limited due to innate technical issues. ML algorithms are considered as cost–benefit, time saving strategies which can evaluate multiple cells line simultaneously [22, 23]. Yet another challenge which ML can overcome compared to conventional methods is the ability to determine biological information which are concealed by tumor cell heterogeneity [24].

This study aims to predict the resistance of several cancer cell-lines to Cisplatin. In this study, different algorithms including Naïve Bayes, K-Nearest Neighbors (with k = 3), Decision tree, Random Forest and Neural network are used to classify Cisplatin sensitive and resistance samples. Moreover, the results of our systems biology analysis indicated several potential chemoresistance genes among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6, IFNG and, CTNNB1 are topologically more important than others. Our results have been validated against different databases such as UniProt, Enrichr and DIANA mirPath v.3 and the papers extracted from the literature. Furthermore, the results of the systems biology analysis determined several potential chemoresistance genes among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6, IFNG and, CTNNB1 are topologically more important than others.

The remainder of this paper is formed as below: “Materilas and methods” section reviews the Materials and Methods. The results are presented in “Results” section. The discussion is reported in “Discussion” section and finally, “Conclusion” section present the conclusion of the overall work.

Materials and methods

Data collection

NCBI GEO database was used to obtain data from datasets [25]. Five different platforms of microarray including GPL13667, GPL6947, GPL6480, GPL6104, and GPL6244 have been used. A total of 85 samples of gene expression datasets of microarray data were selected from different platforms. The selected datasets were related to various cell lines such as ovarian, pancreatic and non-small cell lung cancer (NSCLC) both resistance and sensitive to Cisplatin drug. Sample numbers as well as cancer types are indicated in Table 1.

Table 1 Samples collected from NCBI GEO database

Full size table

The data has been normalized using the LIMA package in the R software [26]. Average has been taken between the expressed values of repetitive probes in each dataset to obtain a unique expression value for each probe. A total 7621 genes were harvested after the isolation of common genes between platforms. The Combat software was then used to eliminate batch effects between different platforms and experiments [27]. Also, an average has been taken between replicates of each platform (sensitive and resistance separately) which reduces the total number of samples from 85 to 14 sensitive and 14 resistances to Cisplatin samples.

Data processing

As the data was collected from various sources, it was necessary to somehow remove the discrepancies known as “Batch effects” between different samples. The batch effects of 85 different samples, each containing 7620 genes, were corrected by the SVA package. The SVA package comprises functions to remove the batch factors and other undesirable conversion in high-throughput examination. Specifically, the SVA package comprises functions to identify and build surrogate variables for high-dimensional datasets. Surrogate variables are covariates built directly from high-dimensional data (such as gene expression and RNA sequencing data) that can be utilized in subsequent analyses to adjust for unknown, unmodeled, or latent sources of noise. Moreover, the t-SNE algorithm was then used in MATLAB software to make the data presentable before and after the batch effect correction [28].

Feature selection

High dimensional DNA microarray has presented serious challenges to the existing machine learning and classification methods. In other words, in many of medical and microarray datasets, it is possible that many genes are irrelevant or redundant for machine learning algorithm [29,30,31,32]. Feature selection or gene selection is a popular and powerful approach in medical datasets to overcome this shortcoming [33,34,35]. In gene selection, to decrease the microarray data dimensions, by eliminating the irrelevant and similar genes, only a subset of relevant and dissimilar genes that are strongly related to the objective function are selected [36]. This is a powerful strategy to increase the efficiency of the machine learning algorithm, reduce time complexity, build more general classification algorithm, and reduce storage requirements [37, 38]. Gene selection approaches have been successfully employed in many medical applications including; gene expression [39], cancer classification [40], medical diagnosis [32], etc. In other words, a fundamental problem in machine learning algorithms is the high dimensional datasets, in which the size of the feature subset is much higher than the size of the patterns. Therefore, the classification accuracy is significantly reduced. As a result, it is necessary to reduce the initial features using dimension reduction techniques [41, 42]. One efficient way to reduce the dimension is feature selection (gene selection in DNA microarray datasets). In feature selection, an attempt is made to choose a set of initial features that satisfy two targets simultaneously: the minimum similarity between the selected genes and the maximum relevance of these genes with the target class [43, 44]. The main goal of this step is to select appropriate and important feature from original features [45]. To do this purpose, Fisher Score (FS) feature selection algorithm is used to select the features that are most relevant to the target class. Fisher Score is a supervised filter method that selects a feature subset such that the samples in the specific class are most similar to each other and the samples in the different classes are less similar. As a result, this measure Scores higher value to genes with higher separation characteristic. The FS of gene f_i is defined as follows:

$$FS({f_k}) = \frac{{{{\sum\nolimits_{\nu \in V} {{n_\nu }(\bar f_k^\nu - {{\bar f}_k})} }^2}}}{{\sum\nolimits_{\nu \in \wedge V} {{{({\sigma _\nu } \wedge V({f_k}))}^2}} }}$$

(1)

where $\overline{f}_{k}$ is the mean value of all the samples regarding the feature f_k, V is a set of all classes in a dataset, n_ν is the number of pattern on the class ν, and σ (f_k) and $\overline{f}_{k}^{\nu }$ shows the variance and average of feature f_k on class ν, correspondingly.

The Fisher Score method was performed for 14 folds and in each step, 100 features that were most consistent with the normal distribution were selected so that a total of 1400 features were obtained. In the next step, in order to reduce the sample size, the PCA [46] method was performed for 14 folds on the output features of the Feature Score method and reduced set of features were selected for machine training.

Machine learning

In order to measure the flexibility of the developed method on different classifiers, in the designed experiments, the efficiency of the various approaches on three widely used six machine learning algorithms including Naïve Bayes [47], SVM [7], KNN [48], Decision tree [49], Random Forest [50] and Neural Network [51] algorithms is examined for machine training to detect Cisplatin sensitive and resistant samples in cancer patients undergoing chemotherapy. These machine learning algorithms are one of the most well-known and widely used machine learning algorithms that are used by researchers in various prediction and classification problems.

In these experiments the developed approach and other compared methods implemented using Python language programming on an Intel Core-i9 CPU with 16 GB of RAM.

To train these algorithms and due to the small number of available samples (14 pairs of samples including 14 sensitive and 14 resistant samples), the Leave one out method (LOO) was used [52]. In this method, at each step of the machine training, 13 pairs of samples were used as training data and one pair was used as test data. The training process continued until all pairs of samples were once used as the test data. The training was performed twice, once by considering 1400 features extracted from Fisher Score method and once by considering 210 features extracted by PCA method as the training samples. Finally, the performance of the trained machines was evaluated according to Accuracy [53], Specificity [54], Sensitivity [54], Precision [53], MCC [55] and F1 scores [55].

Biological system evaluation

73 genes were selected using the CountIF filter in Excel (Refer to online resource 1). These genes were repeated in 50% or more of the 14 pairs sample extracted from the GEO. The LIMA package was used to calculate the DEGs between all sensitive and resistance samples. The result indicated that 34 of 73 gens which were chosen in the feature selection phase were down-regulated and the rest were up-regulated. The genes were enriched in biological pathways using the Enrichr database [56]. To obtain the mirs related to the 73 harvested genes, miRCancer [57] and miRDB [58] databases were used and the common mirs between the harvested mirs from the two aforementioned databases were selected based on the specific cancer cell line as well as the direction of the regulation. mirs were enriched using DIANA mirPath v.3 database [59] (refer to online resource 2) and pathways related to chemoresistance were selected among the obtained pathways. Furthermore, the transcription factors which regulate the obtained DE genes were harvested using the TRRUST v2 database [60] and were enriched in pathways related to chemoresistance by Enricher database.

Networks

Using the obtained transcription factors and mirs, two types of network including a TF-gene network and a mir-gene network were constructed using Cytoscape software [61]. Topological analyses were then performed using degree and betweenness centralities to report the networks hub genes.

Biological description

The obtained DEGs were studied through literature review. All of the 73 genes were checked in articles and are indicated in a table based on the mechanism of chemo resistance. Some of these genes were not mentioned in the related articles as chemo resistance agents and therefore, they were nominated for further resistance studies which indicated that these genes have main roles in cells and are included in important biological pathways.

Results

Data processing

To correct batch effects between different samples, t-SNE algorithm was used (Fig. 1). This algorithm reduces data dimension and makes the data easier to picture and therefore, easier to comprehend.

Features selected and reduced by Fisher Score and PCA algorithms

In order to select appropriate features for machine learning algorithms, feature selection was performed for 14 folds using the Fisher Score method. In each fold, 100 genes which were most similar to the normal distribution were selected out of 7620 genes. Furthermore, to reduce the dimension of the selected features, the PCA algorithm was performed for 14 folds on the extracted features and several genes were selected in each fold (Table 2).

Table 2 Features selected for machine learning purposes. Selected features are obtained using Fisher Score and PCA algorithms, respectively

Full size table

A machine learning approach to detect Cisplatin sensitive and resistant samples in cancer cell lines

Performance of the machines trained by using reduced features extracted from the PCA algorithm (Table 2) are shown in Fig. 2. Among the developed machines, the Decision Tree algorithm with the average Accuracy of 50% has the weakest performance in terms of accuracy. On the other hand, KNN shows the highest accuracy with an average of 67%. The best performance based on accuracy criteria also belong to KNN and Decision Tree algorithms according to the obtained box plots (Fig. 2c).

Among the developed machines, the Naïve Bayes algorithm is the weakest machine in terms of negative sample detection with a 50% Specificity criteria. Decision Tree algorithm, on the other hand, has the highest average Specificity criterion of 69%, followed by Random Forest and KNN algorithms. The KNN algorithm has the best performance based on the specificity criterion based on the extracted box plots (Fig. 2c).

Based on the results, the weakest performance in terms of Sensitivity criteria is related to the Naïve bayes algorithm, which has not correctly detected any positive samples. On the other hand, KNN and Decision Tree algorithms have the highest Sensitivity criteria with an average of 78 and 67%, respectively (Fig. 2). The Decision Tree algorithm has the best performance in terms of sensitivity criteria based on the obtained box plots (Fig. 2C).

Among these algorithms, Decision Tree and Random Forest algorithms with an average precision of 71% have the highest average precision. The Decision Tree algorithm has the best performance in terms of precision according to the extracted box plots (Fig. 2C).

Similarly, a new set of six machines was trained only this time, 1400 features extracted from the Fisher Score algorithm were used in the training process. Performance results of these machines are shown in Fig. 3. In the new set, the KNN algorithm with an average of 67% accuracy has the highest percentage of correct sample detection compared to other algorithms. Random Forest, Naïve Bayes and SVM algorithms come afterward with an average accuracy of 64%. The KNN algorithm is also the best machine in terms of accuracy based on extracted box plots (Fig. 3c).

Naïve Bayes and Random Forest algorithms, with an average specificity criterion of 67%, are the best machines to correctly detect negative samples. On the other hand, the Decision Tree algorithm with the average specificity of 45% has the weakest performance in this regard (Fig. 3). Furthermore, the KNN algorithm has the best performance in terms of specificity based on the obtained box plots (Fig. 3c). In addition, the KNN algorithm with 78% average sensitivity is the best machine to correctly detect positive samples. The SVM algorithm comes afterward with an average sensitivity of 70% (Fig. 3). According to the obtained box plots, the Naïve Bayes and KNN algorithms are the best choices in terms of Sensitivity criteria, respectively. In addition, among the above algorithms, Naïve Bayes and Random Forest algorithms with an average precision of 71% are the most precise machines. Similarly based on the calculated box plots, the Naïve Bayes algorithm performed better than other algorithms in terms of precision criteria. Finally, according to the MCC criteria, the KNN algorithm and according to the F1 Score criterion, the Random Forest and Naïve Bayes algorithms have the best performances (Fig. 3).

Determining specific mirs for extracted DE genes

The specific mirs for the extracted 73 DEGs were harvested from the miRCancer and miRDB databases for related cancer types (Table 1). These results determine that the expression profile of mirs are down for upregulated genes and are up for down regulated ones (Table 3; Fig. 4).

Table 3 mirs related to both upregulated and down regulated genes. In the upregulated genes the mirs are down and in the down regulated genes the mirs are up

Full size table

mir-target network topology

The mir target network has been topologically analyzed using the degree centrality and the results revealed the hub genes which should be evaluated for their performance in the chemoresistance process (Table 4).

Table 4 The extracted hub genes and mirs nominated for performance evaluation in the chemoresistance process

Full size table

mir enrichment

The upregulated and down regulated mirs in the pathways related to chemoresistance were enriched using DIANA mirPath v.3 database. The related pathways were extracted from KEGG database using the standard P-value of 0.05 (The extracted related pathway: Additional file 1).

TF network topology

Three factors including the in-degree, the out-degree and the betweenness centralities have been noted in TF-gene interaction network for topological analysis. The detected hub genes are related to the in-degree centrality and are listed in Table 5.

Table 5 The detected hub genes based on the in-degree centrality in the TF-gene interaction network

Full size table

The network illustrated in Fig. 5 has been constructed by merging upregulated and down regulated genes and their corresponding regulatory TFs. Topology analysis has been performed based on this network. According to the results obtained from the trrust database, the TFs were down for upregulated genes and were up for down regulated genes (Table 6).

Table 6 The TFs related to both upregulated and down regulated genes. In the upregulated genes the TFs are down and in the down regulated genes the TFs are up

Full size table

TF enrichment

The transcription factors regulating the genes harvested from the feature selection step were enriched using Enrichr database in the oncogenes and chemoresistance pathways. In the Enrichr database the pathways were harvested from KEGG and the adjusted P-value was significant. The results are shown in Table 7.

Table 7 The annotations of transcription factors regulating the harvested genes from the feature selection step

Full size table

Biological description

A literature review was performed on the 73 DE genes which were harvested by the CountIF filter in the feature selection step. A group of these genes were reported in the literature as chemoresistance genes (Additional file 2). The other ones were identified to have a vital role in the oncogenesis pathway and other important cell functions. Therefore, although they were not reported as chemoresistance genes we propose that these genes might be potentially chemoresistance (Table 8). Future studies can be performed to validate these results.

Table 8 potential chemoresistance genes proposed for further studies

Full size table

Discussion

The main aim of this study was to train a machine for the detection of sensitive and resistant samples to Cisplatin in different cancer cell lines including ovarian, pancreatic and lung cancers. Six machines were developed based on different algorithms including Naïve bayes, SVM, KNN, Decision tree, Random Forest and Neural network and the results were evaluated by the accuracy, specificity, sensitivity, precision, MCC and F1 Scores. Furthermore, a series of systems biology analyses were performed using the DE genes harvested from the feature selection step to further improve our study.

It was concluded that the machines which were trained using the features extracted from the Fisher Score algorithm performed better than the ones trained by the same set of features reduced using the PCA algorithm. The reason is due to the richer distinguishing information in the features selected by the Feature Score algorithm than the reduced features of the PCA algorithm. Exceptionally, the KNN algorithm performed similarly in both cases. The similarity of the KNN algorithm performance in both cases is due to the preservation of the data arrangement in the reduced space after the implementation of the PCA dimension reduction algorithm. Since the KNN algorithm decides the fate of a data based on its K nearest neighbors, maintenance of the data arrangement after the dimension reduction has led in the same results in both cases. Furthermore, the KNN and the Naïve bayes algorithms are proposed as the most appropriate machines for prediction. However, it should be noted that the appropriate machine must be selected based on the considered specific application.

Using the classifying features extracted, mir-target network and TF-gene network were constructed and enrichment and topology analyses were performed to detect hub genes and hub TFs. Based on the degree centrality, PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB and ACSL6 target genes were detected as the chemoresistance hub genes according to mir-topology. PTGER3 is encoding PTGER3 protein, a member of the G-protein coupled receptor family. In this study, the degree of the PTGER3 was obtained to be seven which is higher than that of other obtained hub genes. Furthermore, this gene is also reported to be a cisplatin-resistant gene through Ras-MAPK/Erk-ETS1-ELK1/CFTR1 pathways [93]. YWHAH and CTNNB with the degree centrality of six were identified as the second robust hub genes in the row. It has been reported that after chemotherapy, YWHAH is upregulated in prostate cancer cells and is down regulated in Liposarcoma, representing the potency of this gene in chemoresistance [74, 75]. Moreover, it has been shown that CTNNB1 has a vital role in cancer regulatory pathways such as Gastric cancer signaling [94]. With a degree centrality of 5, ANKRD50 and EDNRB were the next obtained hub genes. According to previous studies, it has been reported that EDNRB-methylation is a very common phenomenon in NSCLC. Due to the higher rate of EDNRB methylation in Squamous Cell Carcinoma (SCCs), it can be used to distinguish between SCCs and lung Adenocarcinoma [87].

The TF-gene network topology analysis was also performed and the results specified two hub genes including IFNG and CTNNB1. IFNG is a protein coding gene and it is involved in Folate Metabolism. Furthermore, Yaghoobi et al. have evaluated the IFNG and its antisense (IFNG-AS1) roles in breast cancer and have proposed the involvement of IFNG and IFNG-AS1 in the pathogenesis of breast cancer [95]. In another study, Gao et al. investigated the role of IFNG pathway in the anti-CTLA resistance mechanism. Anti-CTLA-4 produces IFNG to enhance T cell responses. Their data revealed that defects in the IFNG signaling pathway leads to resistance to anti-CTLA-4 therapy [96].

With a systems biology approach including machine learning methods, feature selection, topological analysis, enrichment analysis and finally literature review, we managed to obtain a set of genes which play critical roles in chemoresistance processes. We also have nominated a set of potentially chemoresistance genes which could be used in further studies.

Conclusion

In this study, machine learning approach as well as systems biology analysis was used to extract the genes which commonly separated cisplatin resistant samples from the sensitive ones in lung, pancreatic and ovarian cancers. Furthermore, six classifiers were trained to distinguish between chemoresistance samples from the sensitive ones. As a result, KNN and Naïve Bayes algorithms were selected as the most practical machines according to a set of calculated measures. Moreover, the results of our systems biology analysis indicated several potential chemoresistance genes among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6, IFNG and, CTNNB1 are topologically more important than others. Our results have been validated against different databases such as UniProt, Enrichr and DIANA mirPath v.3 and the papers extracted from the literature. Therefore, this in silico study as well as its predictions can pave the way for further experimental researches.

Availability of data and materials

The dataset used in this study can be obtained from the corresponding author on reasonable request.

Abbreviations

DEG:: Differentially Expressed Genes
ODE:: Ordinary differential equation
PDE:: Partial differential equation
SVM:: Support vector machines
PCA:: Principal components analysis
ML:: Machine learning
NSCLC:: Non-small cell lung cancer
FS:: Fisher Score
LOO:: Leave one out

References

Mazumdar M, Glassman J. Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer treatments. Stat Med. 2000;19(1):113–32.
Article Google Scholar
Urruticoechea A, Alemany R, Balart J, Villanueva A, Viñals F, Capella G. Recent advances in cancer therapy: an overview. Curr Pharm Des. 2010;16(1):3–10.
Article Google Scholar
Damin DC, Lazzaron AR. Evolving treatment strategies for colorectal cancer: a critical review of current therapeutic options. World J Gastroenterol: WJG. 2014;20(4):877.
Article Google Scholar
Khalil DN, Smith EL, Brentjens RJ, Wolchok JD. The future of cancer treatment: immunomodulation, CARs and combination immunotherapy. Nat Rev Clin Oncol. 2016;13(5):273.
Article Google Scholar
Raguz S, Yagüe E. Resistance to chemotherapy: new treatments and novel insights into an old problem. Br J Cancer. 2008;99(3):387–91.
Article Google Scholar
Rebucci M, Michiels C. Molecular aspects of cancer cell resistance to chemotherapy. Biochem Pharmacol. 2013;85(9):1219–26.
Article Google Scholar
Housman G, et al. Drug resistance in cancer: an overview. J Cancers. 2014;6(3):1769–92.
Article Google Scholar
Uramoto H, Tanaka F. Recurrence after surgery in patients with NSCLC. Transl Lung Cancer Res. 2014;3(4):242.
Google Scholar
Lippert TH, Ruoff H-J, Volm M. Intrinsic and acquired drug resistance in malignant tumors. Arzneimittelforschung. 2008;58(06):261–4.
Google Scholar
Kelderman S, Schumacher TN, Haanen JB. Acquired and intrinsic resistance in cancer immunotherapy. Mol Oncol. 2014;8(6):1132–9.
Article Google Scholar
Lloyd KL, Cree IA, Savage RS. Prediction of resistance to chemotherapy in ovarian cancer: a systematic review. BMC Cancer. 2015;15(1):117.
Article Google Scholar
Sekine I, Shimizu C, Nishio K, Saijo N, Tamura T. A literature review of molecular markers predictive of clinical response to cytotoxic chemotherapy in patients with breast cancer. Int J Clin Oncol. 2009;14(2):112–9.
Article Google Scholar
Cortazar P, Johnson BE. Review of the efficacy of individualized chemotherapy selected by in vitro drug sensitivity testing for patients with cancer. J Clin Oncol. 1999;17(5):1625–1625.
Article Google Scholar
Fruehauf JP, Alberts DS. Assay-assisted treatment selection for women with breast or ovarian cancer. In: Chemosensitivity testing in oncology. Springer; 2003. p. 126–145.
Sekine I, Minna JD, Nishio K, Saijo N, Tamura T. Genes regulating the sensitivity of solid tumor cell lines to cytotoxic agents: a literature review. Jpn J Clin Oncol. 2007;37(5):329–36.
Article Google Scholar
Sekine I, Minna JD, Nishio K, Tamura T, Saijo N. A literature review of molecular markers predictive of clinical response to cytotoxic chemotherapy in patients with lung cancer. J Thorac Oncol. 2006;1(1):31–7.
Article Google Scholar
Slodkowska EA, Ross JS. MammaPrint^TM 70-gene signature: another milestone in personalized medical care for breast cancer patients. Expert Rev Mol Diagn. 2009;9(5):417–22.
Article Google Scholar
Camidge DR, Pao W, Sequist LV. Acquired resistance to TKIs in solid tumours: learning from lung cancer. Nat Rev Clin Oncol. 2014;11(8):473.
Article Google Scholar
Sawyers C. Targeted cancer therapy. Nature. 2004;432(7015):294.
Article Google Scholar
Sun X, Hu B. Mathematical modeling and computational prediction of cancer drug resistance. Brief Bioinform. 2017;19(6):1382–99.
Article Google Scholar
Huang C, et al. Machine learning predicts individual cancer patient responses to therapeutic drugs with high accuracy. Sci Rep. 2018;8(1):16444.
Article Google Scholar
Ali M, Aittokallio T. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys Rev. 2019;11(1):31–9.
Article Google Scholar
Chen R, Liu X, Jin S, Lin J, Liu J. Machine learning for drug-target interaction prediction. Mol Cells. 2018;23(9):2208.
Google Scholar
Liu R, Zhang G, Yang Z. Towards rapid prediction of drug-resistant cancer cell phenotypes: single cell mass spectrometry combined with machine learning. Chem Commun. 2019;55(5):616–9.
Article Google Scholar
Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
Article Google Scholar
A. Team RC. R: A language and environment for statistical computing. Vienna; 2013.
Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.
Article MATH Google Scholar
van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(5):2579–605.
MATH Google Scholar
Rostami M, Moradi P. A clustering based genetic algorithm for feature selection. In: Information and Knowledge Technology (IKT). 2014. p. 112–116.
Moradi P, Rostami M. A graph theoretic approach for unsupervised feature selection. Eng Appl Artif Intell. 2015;44:33–45.
Article Google Scholar
Moradi P, Rostami M. Integration of graph clustering with ant colony optimization for feature selection. Knowledge Based Syst. 2015;84:144–61.
Article Google Scholar
Rostami M, Forouzandeh S, Berahmand K, Soltani M. Integration of multi-objective PSO based feature selection and node centrality for medical datasets. Genomics. 2020;112(6):4370–84.
Article Google Scholar
Berahmand K, Haghani S, Rostami M, Li Y. A new attributed graph clustering by using label propagation in complex networks. J King Saud Univ Comput Inf Sci. 2020. https://doi.org/10.1016/j.jksuci.2020.08.013.
Article Google Scholar
Rostami M, Berahmand K, Forouzandeh S. A novel community detection based genetic algorithm for feature selection. J Big Data. 2021;8(1):2.
Article Google Scholar
Rostami M, Berahmand K, Forouzandeh S. A novel method of constrained feature selection by the measurement of pairwise constraints uncertainty. J Big Data. 2020;7(1):83.
Article Google Scholar
Liu Y, Nie F, Gao Q, Gao X, Han J, Shao L. Flexible unsupervised feature extraction for image classification. Neural Netw. 2019;115:65–71.
Article MATH Google Scholar
Wang H, Zhang Y, Zhang J, Li T, Peng L. A factor graph model for unsupervised feature selection. Inf Sci. 2019;480:144–59.
Article MathSciNet MATH Google Scholar
Tang X, Dai Y, Xiang Y. Feature selection based on feature interactions with application to text categorization. Expert Syst Appl. 2019;120:207–16.
Article Google Scholar
Wahid A, et al. Feature selection and classification for gene expression data using novel correlation based overlapping score method via Chou’s 5-steps rule. Chemom Intell Lab Syst. 2020;199:103958.
Article Google Scholar
Saeys Y, Inza I, Larrañaga P. review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.
Article Google Scholar
Yazdi KM, et al. Prediction optimization of diffusion paths in social networks using integration of ant colony and densest subgraph algorithms. J High Speed Netw. 2020;26:141–53.
Article Google Scholar
Yazdi KM et al. Improving recommender systems accuracy in social networks using popularity. In: 2019 20th international conference on parallel and distributed computing, applications and technologies (PDCAT). 2019. p. 301–307.
Gao W, Hu L, Zhang P, He J. Feature selection considering the composition of feature relevancy. Pattern Recognit Lett. 2018;112:70–4.
Article Google Scholar
Abdulla M, Khasawneh MT. G-Forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays. Artif Intell Med. 2020;108:101941.
Article Google Scholar
Rostami M, Berahmand K, Nasiri E, Forouzande S. Review of swarm intelligence-based feature selection methods. Eng Appl Artif Intell. 2021;100:104210.
Article Google Scholar
Lever J, Krzywinski M, Altman N. Points of significance: principal component analysis. ed: Nature Publishing Group; 2017.
Lewis DD. Naive (Bayes) at forty: the independence assumption in information retrieval. Springer; 1998. p. 4–15.
Google Scholar
Zhang Z. Introduction to machine learning: k-nearest neighbors. Ann Transl Med. 2016;4(11):218.
Article Google Scholar
Barros RC, Basgalupp MP, Freitas AA, De Carvalho AC. Evolutionary design of decision-tree algorithms tailored to microarray gene expression data sets. IEEE Trans Evol Comput. 2013;18(6):873–92.
Article Google Scholar
Díaz-Uriarte R, De Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinform. 2006;7(1):3.
Article Google Scholar
Lancashire LJ, Lemetre C, Ball GR. An introduction to artificial neural networks in bioinformatics—application to complex microarray and mass spectrometry datasets in cancer studies. Brief Bioinform. 2009;10(3):315–29.
Article Google Scholar
Elisseeff A, Pontil M. Leave-one-out error and stability of learning algorithms with applications. NATO Sci Ser Sub Ser iii Comput Syst Sci. 2003;190:111–30.
Google Scholar
Heidaryan E. A note on model selection based on the percentage of accuracy-precision. J Energy Resour Technol. 2019. https://doi.org/10.1115/1.4041844.
Article Google Scholar
Altman DG, Bland JM. Diagnostic tests. 1: sensitivity and specificity. BMJ Br Med J. 1994;308(6943):1552.
Article Google Scholar
Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6.
Article Google Scholar
Chen EY, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform. 2013;14(1):128.
Article Google Scholar
Xie B, Ding Q, Han H, Wu D. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics. 2013;29(5):638–44.
Article Google Scholar
Liu W, Wang X. Prediction of functional microRNA targets by integrative modeling of microRNA binding and target expression data. Genome Biol. 2019;20(1):1–10.
Article Google Scholar
Vlachos IS, et al. DIANA-miRPath v3. 0: deciphering microRNA function with experimental support. Nucleic Acids Res. 2015;43(W1):W460–6.
Article Google Scholar
Han H, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res. 2018;46(D1):D380–6.
Article Google Scholar
Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
Article Google Scholar
Sekine Y, et al. The Kelch repeat protein KLHDC10 regulates oxidative stress-induced ASK1 activation by suppressing PP5. Mol Cell. 2012;48(5):692–704.
Article Google Scholar
Zhong M, et al. Expression of MSP58 in hepatocellular carcinoma. Med Oncol. 2013;30(2):539.
Article Google Scholar
Chae YK, et al. Genomic landscape of DNA repair genes in cancer. Oncotarget. 2016;7(17):23312.
Article Google Scholar
Fischer F. The function of mismatch repair proteins in response to DNA damage caused by chemotherapeutic agents, University of Zurich; 2007.
Zhang D, et al. Regulation of the adaptation to ER stress by KLF4 facilitates melanoma cell metastasis via upregulating NUCB2 expression. J Exp Clin Cancer Res. 2018;37(1):176.
Article Google Scholar
Qu S, et al. MicroRNA-330 is an oncogenic factor in glioblastoma cells by regulating SH3GL2 gene. PLoS ONE. 2012;7(9):e46010.
Article Google Scholar
Yang Z, et al. GRSF1-mediated MIR-G-1 promotes malignant behavior and nuclear autophagy by directly upregulating TMED5 and LMNB1 in cervical cancer cells. Autophagy. 2019;15(4):668–85.
Article Google Scholar
Vert A, Castro J, Ribo M, Vilanova M, Benito A. Transcriptional profiling of NCI/ADR-RES cells unveils a complex network of signaling pathways and molecular mechanisms of drug resistance. Onco Targets Ther. 2018;11:221.
Article Google Scholar
Zheng P, Wang W, Muxi Ji QZ, Feng Y, Zhou F, He Q. TMEM119 promotes gastric cancer cell migration and invasion through STAT3 signaling pathway. OncoTargets Ther. 2018;11:5835.
Article Google Scholar
Zheng P, et al. TMEM119 silencing inhibits cell viability and causes the apoptosis of gastric cancer SGC-7901 cells. Oncol Lett. 2018;15(6):8281–6.
Google Scholar
Gheysarzadeh A, Bakhtiari H, Ansari A, Emami Razavi A, Emami MH, Mofid MR. The insulin-like growth factor binding protein-3 and its death receptor in pancreatic ductal adenocarcinoma poor prognosis. J Cell Physiol. 2019;234(12):23537–46.
Article Google Scholar
Lee M, Cheung G, Nair R, Done S. Defining the roles of COIL and WIPI1 in breast cancer metastasis. ed: AACR; 2012.
Daigeler A, et al. Heterogeneous in vitro effects of doxorubicin on gene expression in primary human liposarcoma cultures. BMC Cancer. 2008;8(1):313.
Article Google Scholar
Kibel AS, et al. Genetic variants in cell cycle control pathway confer susceptibility to aggressive prostate carcinoma. Prostate. 2016;76(5):479–90.
Article Google Scholar
Sandhu V. A systems biology approach to integrated molecular analysis in pancreatic and periampullary adenocarcinoma. 2016.
Shih I-M, Nakayama K, Wu G, Nakayama N, Zhang J, Wang T-L. Amplification of the ch19p13.2 NACC1 locus in ovarian high-grade serous carcinoma. Mod Pathol. 2011;24(5):638.
Article Google Scholar
Xia L, et al. ACP5, a direct transcriptional target of FoxM1, promotes tumor metastasis and indicates poor prognosis in hepatocellular carcinoma. Oncogene. 2014;33(11):1395.
Article Google Scholar
Qi H, Liu S, Guo C, Wang J, Greenaway FT, Sun M. Role of annexin A6 in cancer. Oncol Lett. 2015;10(4):1947–52.
Article Google Scholar
O’Sullivan D, et al. A novel inhibitory anti-invasive MAb isolated using phenotypic screening highlights AnxA6 as a functionally relevant target protein in pancreatic cancer. Br J Cancer. 2017;117(9):1326.
Article Google Scholar
Shinmura K, et al. BSND and ATP6V1G3: novel immunohistochemical markers for chromophobe renal cell carcinoma. Medicine. 2015;94(24):e989.
Article Google Scholar
Eo H-S, Heo JY, Choi Y, Hwang Y, Choi H-S. A pathway-based classification of breast cancer integrating data on differentially expressed genes, copy number variations and MicroRNA target genes. Mol Cells. 2012;34(4):393–8.
Article Google Scholar
Khan K, Hardy R, Haq A, Ogunbiyi O, Morton D, Chidgey M. Desmocollin switching in colorectal cancer. Br J Cancer. 2006;95(10):1367.
Article Google Scholar
Cui T, et al. Diagnostic and prognostic impact of desmocollins in human lung cancer. J Clin Pathol. 2012;65(12):1100–6.
Article Google Scholar
Ladner RD. The role of dUTPase and uracil-DNA repair in cancer chemotherapy. Curr Protein Pept Sci. 2001;2(4):361–70.
Article Google Scholar
Schussel J, et al. EDNRB and DCC salivary rinse hypermethylation has a similar performance as expert clinical examination in discrimination of oral cancer/dysplasia versus benign lesions. Clin Cancer Res. 2013;19(12):3268–75.
Article Google Scholar
Chen S-C, et al. Aberrant promoter methylation of EDNRB in lung cancer in Taiwan. Oncol Rep. 2006;15(1):167–72.
Google Scholar
Chen F, He B, Yan L, Qiu Y, Lin L, Cai L. FADS1 rs174549 polymorphism may predict a favorable response to chemoradiotherapy in oral cancer patients. J Oral Maxillofac Surg. 2017;75(1):214–20.
Article Google Scholar
Zhang K, Waxman DJ. PC3 prostate tumor-initiating cells with molecular profile FAM65B high/MFI2 low/LEF1 low increase tumor angiogenesis. Mol Cancer. 2010;9(1):319.
Article Google Scholar
Mironova N, Patutina O, Brenner E, Kurilshikov A, Vlassov V, Zenkova M. The systemic tumor response to RNase A treatment affects the expression of genes involved in maintaining cell malignancy. Oncotarget. 2017;8(45):78796.
Article Google Scholar
Raymond JR, Appleton KM, Pierce JY, Peterson YK. Suppression of GNAI2 message in ovarian cancer. J Ovarian Res. 2014;7(1):6.
Article Google Scholar
Jung-Yi Jiang R-JL, Lee S-J. A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification. IEEE Trans Knowl Data Eng. 2011;23(3):335–49.
Article Google Scholar
Rodriguez-Aguayo C, et al. “PTGER3 induces ovary tumorigenesis and confers resistance to cisplatin therapy through up-regulation Ras-MAPK/Erk-ETS1-ELK1/CFTR1 axis,” (in eng). EBioMedicine. 2019;40:290–304.
Article Google Scholar
Tanabe S, Kawabata T, Aoyagi K, Yokozaki H, Sasaki H. “Gene expression and pathway analysis of CTNNB1 in cancer and stem cells,” (in eng). World J Stem Cells. 2016;8(11):384–95.
Article Google Scholar
Yaghoobi H, Azizi H, Oskooei VK, Taheri M, Ghafouri-Fard S. Assessment of expression of interferon γ (IFN-G) gene and its antisense (IFNG-AS1) in breast cancer (in eng). World Journal Surg Oncol. 2018;16(1):211–211.
Article Google Scholar
Gao J, et al. Loss of IFN-γ pathway genes in tumor cells as a mechanism of resistance to anti-CTLA-4 therapy. Cell. 2016;167(2):397-404.e9.
Article Google Scholar

Download references

Acknowledgements

None

Funding

None.

Author information

Authors and Affiliations

Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan, Russia
Atousa Ataei & Albert A. Rizvanov
Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
Niloufar Seyed Majidi & S. Shahriar Arab
Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
Javad Zahiri
Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran
Mehrdad Rostami

Authors

Atousa Ataei
View author publications
You can also search for this author in PubMed Google Scholar
Niloufar Seyed Majidi
View author publications
You can also search for this author in PubMed Google Scholar
Javad Zahiri
View author publications
You can also search for this author in PubMed Google Scholar
Mehrdad Rostami
View author publications
You can also search for this author in PubMed Google Scholar
S. Shahriar Arab
View author publications
You can also search for this author in PubMed Google Scholar
Albert A. Rizvanov
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

The specific contributions made by each author is as follows: AA: Conceptualization, Methodology, Implementation, Writing-Original Draft, Writing—Review & Editing. NSM: Methodology, Implementation, Validation, Writing—Review & Editing. JZ: Methodology, Implementation, Validation, Writing—Review & Editing. MR: Methodology, Implementation, Validation, Writing—Review & Editing. SSA: Conceptualization, Writing—Review & Editing. AAR: Conceptualization, Writing—Review & Editing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to S. Shahriar Arab.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

The extracted related pathway.

Additional file 2.

The reported chemoresistance genes.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Ataei, A., Majidi, N.S., Zahiri, J. et al. Prediction of chemoresistance trait of cancer cell lines using machine learning algorithms and systems biology analysis. J Big Data 8, 97 (2021). https://doi.org/10.1186/s40537-021-00477-z

Download citation

Received: 04 March 2021
Accepted: 24 May 2021
Published: 05 July 2021
DOI: https://doi.org/10.1186/s40537-021-00477-z

Prediction of chemoresistance trait of cancer cell lines using machine learning algorithms and systems biology analysis

Abstract

Introduction

Materials and methods

Data collection

Data processing

Feature selection

Machine learning

Biological system evaluation

Networks

Biological description

Results

Data processing

Features selected and reduced by Fisher Score and PCA algorithms

A machine learning approach to detect Cisplatin sensitive and resistant samples in cancer cell lines

Determining specific mirs for extracted DE genes

mir-target network topology

mir enrichment

TF network topology

TF enrichment

Biological description

Discussion

Conclusion

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Supplementary Information

Additional file 1.

Additional file 2.

Rights and permissions

About this article

Cite this article

Share this article

Keywords