Fig. 3 | Journal of Big Data

Fig. 3

From: Integration of transcriptomic analysis and multiple machine learning approaches identifies NAFLD progression-specific hub genes to reveal distinct genomic patterns and actionable targets

Establishment and validation of a discriminative gene signature for risk stratification in NAFLD. (a) Schematic of the leave-one-out cross-validation (LOOCV) training framework; the number of iterations in our study is 98 (the sample number in the training cohort). The 182 genes related to NAFLD progression were used to train the random forest (RF) and support vector machine (SVM) algorithms, each with recursive feature elimination (RFE) for feature selection, and 8 genes overlapped between the outputs of the two machine learning approaches. Subsequently, Lasso logistic regression (LR) analysis was applied to the 8 genes, and only 4 genes retained nonzero coefficients. (b) Among these four genes, COL1A2 exhibited the highest coefficient. (c) PCA of the four-gene expression matrix showed a clear separation of NAFL and NASH samples. (d) Dot plots showed that all four genes were significantly upregulated in NASH compared with NAFL samples. *** p < 0.001. (e) In the training cohort, ROC analysis indicated that the gene signature-derived score could accurately discriminate NASH from NAFL. (f) In GSE163211, COL1A2 and COL4A2 were significantly upregulated in NASH with fibrosis, and the 2-gene discriminative score was also significantly higher in these samples than in steatosis and NASH samples without fibrosis. (g) ROC analysis demonstrated that the 2-gene (COL1A2 and COL4A2) discriminative score could identify advanced samples with favorable performance (AUC = 0.724, 95% CI = 0.644–0.804). (h) In GSE135251, the “3-gene score” was significantly elevated in a stepwise manner from normal to NAFL to NASH with advanced fibrosis levels, and it performed favorably in discriminating (i) advanced stages (AUC = 0.737, 95% CI = 0.665–0.810) and (j) advanced fibrosis levels (AUC = 0.729, 95% CI = 0.650–0.808).
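
To make the notion of a discriminative score and its AUC concrete, the following is a minimal sketch, not the authors' implementation. It assumes the score is a weighted sum of gene expression values (the caption does not define it), and all coefficients and expression values are hypothetical. Only the gene names COL1A2 and COL4A2 come from the caption. The AUC is computed as the fraction of positive/negative sample pairs in which the positive sample has the higher score.

```python
from typing import Dict, Sequence


def signature_score(expression: Dict[str, float],
                    coefficients: Dict[str, float]) -> float:
    """Weighted sum of expression values (assumed score definition)."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive sample outscores
    a random negative sample; tied pairs count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical coefficients and expression values, for illustration only.
coefficients = {"COL1A2": 0.5, "COL4A2": 0.25}
samples = [
    ({"COL1A2": 2.0, "COL4A2": 4.0}, 0),  # label 0: non-advanced sample
    ({"COL1A2": 4.0, "COL4A2": 4.0}, 0),
    ({"COL1A2": 6.0, "COL4A2": 8.0}, 1),  # label 1: advanced sample
    ({"COL1A2": 4.0, "COL4A2": 2.0}, 1),
]

scores = [signature_score(expr, coefficients) for expr, _ in samples]
labels = [label for _, label in samples]
auc = roc_auc(scores, labels)
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for cohorts of this size; it also makes explicit that an AUC near 0.73 means roughly three out of four positive/negative pairs are ranked correctly by the score.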
