Integration of transcriptomic analysis and multiple machine learning approaches identifies NAFLD progression-specific hub genes to reveal distinct genomic patterns and actionable targets

Background Nonalcoholic fatty liver disease (NAFLD) is a leading public health problem worldwide. Approximately one fourth of patients with nonalcoholic fatty liver (NAFL) progress to nonalcoholic


Introduction
Nonalcoholic fatty liver disease (NAFLD), which ranges from simple steatosis to nonalcoholic steatohepatitis (NASH), is the most common cause of chronic liver disease worldwide and affects approximately one fourth of the global population [1,2].About 25% of patients with NAFL progress to NASH, an advanced stage of NAFLD [3].NAFLD is characterized by hepatic steatosis, necroinflammation, and liver fibrosis, and these pathological features coordinate in the progression of NAFLD and jointly contribute to the development to end-stage liver disease such as cirrhosis and hepatocellular carcinoma (HCC) [4,5].
A "multi-hit" theory has been proposed to explain NAFLD pathogenesis [6].Hepatic steatosis is generally attributed to the dysfunction of intrahepatic lipid metabolism, which acts as the first hit and sensitizes the liver to the subsequent hits including oxidative stress, inflammation, and injury [7,8].A number of key molecules serve a significant role in NAFLD progression.For example, chemokine CCL2 and its receptor CCR2 are abnormally elevated during NAFLD progression, and the therapeutic strategy of targeting CCL2 and CCR2 has been shown as a promising approach for the treatment of NASH [9,10].Important as these molecules are, however, they are not enough to comprehensively explain the initiation and mechanism of NAFLD.NAFLD is a complex heterogeneous disease resulted from both intrinsic susceptibility and environmental background, during which numerous molecules such as CCL2 and CCR2 within the gene regulation network mediate NAFLD progression at different disease stages.Hence, there is an urgent need to identify more molecular drivers and coordinators involved in liver steatosis and inflammation, thus to make a better understanding of NAFLD heterogeneity and facilitate personalized management of high-risk NAFLD patients who may benefit from more intensive surveillance and preventive intervene.
In this study, we aimed to investigate the dysregulated signaling pathways and identify key genes involved in the progression of NAFLD.With a series of bioinformatic approaches, a robust NAFLD progression-specific gene signature was established to classify the NAFL patients into different risk subgroups.Intriguingly, more severe inflammatory response, alteration of extracellular matrix organization and cell-cell adhesion were detected in the high-risk NAFL samples compared to the low-risk subset, and similar infiltrating patterns of fibroblasts and Th1/2 cell populations were observed between high-risk NAFL and NASH samples.Furthermore, an integrative strategy of machine learning was used to screen for the most robust biomarkers, and their capacity of discrimination in progressive stages of NAFLD was validated in different independent cohorts.In addition, distinct immune and stromal patterns were observed among different CTNNB1/COL1A2 groups in HCC.From these comprehensive analyses, we provide evidence for the necessity of molecular classification and its potential clinical utility in the risk stratification of NAFLD, aiming to guide preventive intervene in the early stage of the disease.We hope this work could facilitate the personalized management and treatment of NAFLD patients.

Identification of NAFLD progression-specific genes and pathways
GSE167523 was used as a training cohort.Using Gene Ontology Biological Process (GOBP) enrichment analysis, the transcriptome profiling data of GSE167523 which contains 51 NAFL and 47 NASH samples were used to explore the enriched pathways which represent the alterations of biological processes involved in NAFLD progression.
Two strategies were combined to screen for NAFLD progression-specific genes.Firstly, the weighted gene co-expression network analysis (WGCNA) [23] was used to construct a scale-free co-expression network using R package "WGCNA" and to identify a gene module which is mostly correlated with sample category (NAFL or NASH), and this gene module was considered as "gene module of NAFLD progression".Secondly, using R package "limma", differentially expressed genes (DEGs) were identified between NAFL and NASH samples with a filtering threshold of FDR (false discovery rate) q value less than 0.01 and fold change (FC) > 2 or < 0.5.The overlapping genes in "gene module of NAFLD progression" and "DEGs between NAFL and NASH" were finally considered as NAFLD progression-specific genes.A protein-protein interaction (PPI) network was generated to reveal the functional and physical linkages among these DEGs using the STRING database.

Connectivity map (CMap) analysis
CMap is a resource that uses transcriptional expression data to probe the relationships between disease, cell physiology, and therapeutics [24].The above-mentioned NAFLD progression-specific genes were submitted to CMap for analysis to explore potential targets and applicable drugs for high-risk NAFLD patients.Top 10 compounds with the highest predictive scores and corresponding descriptions of mode-of-actions (MoAs) were displayed in a dot diagram.

Estimation of immune infiltration and tumor purity
Several computational algorithms including TIMER [25], Cibersort [26], quanTIseq [27], MCP-counter [28], xCell [29], and EPIC [30] were used to quantify the infiltrating abundance of different cell types, and the ESTIMATE algorithm [31] was applied to infer the tumor purity and immune score.In detail, the infiltration abundance of cancer-associated fibroblast (CAF) was compared among different clusters using EPIC, MCP-counter and xCell algorithms, respectively.

Establishment of a risk-stratification gene signature
To establish a risk-stratification gene signature to discriminate high-risk NAFLD, we combined multiple machine learning approaches including random forest (RF), support vector machine (SVM), and logistic regression (LR) with Lasso regularization to screen for optimal candidates.Recursive feature elimination (RFE) was applied in the RF and SVM algorithms to remove the weakest features and retain the optimal features.To ensure the high accuracy and stability of the predictive model, we introduced the leaveone-out cross-validation (LOOCV) framework in the training cohort.Subsequently, Lasso regularization adds a penalty parameter (λ) to the LR model, and this action can lead to zero coefficients, i.e. some of the candidate genes are completely neglected for evaluation.In our analysis, 8 candidates remained after the integrative selection of RF-RFE and SVM-RFE, and 4 genes retained their logistic coefficients after Lasso regularization.Finally, the discriminative score for risk stratification was calculated with the relative expression and Lasso logistic coefficient of the four genes as follows:

scRNA-seq analysis
Three scRNA-seq datasets (GSE125449, GSE146409, and GSE166635) were used to reveal the components of HCC tumor microenvironment (TME) and depict the gene expression characteristics in different cell types.GSE125449 contains single-cell transcriptome profiling data of nine HCC specimens, GSE146409 six specimens, and GSE166635 two specimens.The scRNA-seq expression matrix was processed with R package "Seurat".At first, the "NormalizeData" function was applied to normalize the gene expression data, then the "FindVariableFeatures" function was applied to identify the top 2,000 highly variable genes (HVGs).After performing "RunPCA" for dimension reduction, R package "Harmony" was used to eliminate batch effect."FindNeighbors" was used to determine the k-nearest neighbors of each cell, and "FindClusters" was used to determine optimal clusters.UMAP reduction was used for cluster visualization, and "SingleR" package was used for cluster annotation.In addition, "FeaturePlot" and "Vln-Plot" functions were used to visualize the gene expression of COL1A2 in each cell type.

Mutation analysis
Somatic variant data of TCGA-LIHC, which was called using MuTect2, were sorted in a mutation annotation format (MAF) file, and were visualized using R package "maftools" [32].The mutation variants of "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins", "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation", "Splice_Region", "Splice_Site" and "Translation_Start_Site" were defined as non-synonymous, and the other variants were defined as synonymous.Furthermore, R package "sigminer" was used to extract mutational signatures from the whole-exome sequencing (WES) data [33].Bayesian variant of nonnegative matrix factorization (NMF) algorithm was used to decipher mutational signatures, and the optimal factorization of k value is selected when the magnitude of the cophenetic correlation coefficient begins to drop significantly.Mutational signatures were annotated by computing cosine similarity against validated single base substitution (SBS) mutational catalogues retrieved from the COSMIC database as previously reported [34].
In addition, the mutation frequency and mutual exclusivity of TP53 and CTNNB1 in HCC were further shown in the TCGA-HCC cohort and another two cohorts (MSK-HCC [35] and INSERM-HCC [36]) with oncoplots.

Additional bioinformatic and statistical analyses
The heatmaps of gene expression and cell abundance were generated using R package "pheatmap".R packages "GOSemSim" and "ggtree" were used to measure and visualize the semantic similarity among GO terms [37,38].Different risk subgroups of NAFL were identified with the expression matrix of NAFLD progression-specific genes using NMF, and the optimal factorization of k value was selected when the magnitude of the cophenetic correlation coefficient begins to fall significantly.Principal component analysis (PCA) was used to visualize the dissimilarity between NAFL and NASH samples.The activities of some biological features such as "inflammatory response", "ECM organization" and "cell-cell adhesion" in NAFLD samples were quantified using the ssGSEA method based on the transcriptome profiling data and corresponding gene sets retrieved from the MSigDB [39].Pearson correlation analysis was performed to evaluate the correlation between two variables subject to normal distribution.Student's t-test or oneway analysis of variance (ANOVA) was used to analyze differences among groups with variables subject to normal distribution, otherwise Mann-Whitney U test or Kruskal-Wallis test.The receiver operating characteristic (ROC) analysis was used to evaluate the predictive accuracy of the discriminative score established in our study.Categorical variables between the two groups were compared using the chi-square test.Two-sided p value or FDR q value less than 0.05 was considered statistically significant.All analyses were performed in the GraphPad Prism 8 and R 4.1.0software.

Identification of NAFLD progression-specific pathways and genes
Using R package "limma", significantly upregulated genes were identified in NASH compared to NAFL, and then submitted to GO analysis for pathway exploration.R package "GOSemSim" was used to measure the similarity among all the biological processes (BP), and R package "ggtree" was used to visualize the clusters, and similar BPs were clustered into a mutual branch (Fig. 1a).According to the significance, we extracted the top five altered GOBPs and displayed them in a Circos diagram (Fig. 1b).We observed that alterations in two biological features are mainly responsible for the NAFLD progression: extracellular matrix (ECM) organization and cell cycle process.
Two filtering strategies were combined to identify NAFLD progression-specific genes.Firstly, WGCNA was performed with the transcriptome profiling data of 98 NAFLD samples and their sample category (NAFL or NASH) to construct a scale-free co-expression network.A total of 36 gene modules were generated with a power of 6 as the optimal soft threshold (Supplementary Fig. 1).Among these modules, the brown module exhibited a highest correlation with sample category (|r| = 0.61, p = 6e-11) and was considered as "gene module of NAFLD progression" (Fig. 1c).Secondly, the "limma" algorithm was used to identify a total of 378 DEGs between NAFL and NASH samples with a filtering threshold of q value less than 0.01 and fold change (FC) > 2 or < 0.5 (Fig. 1d).Finally, 182 overlapping genes in the intersection of "gene module of NAFLD progression" and "DEGs between NAFL and NASH" were considered as "NAFLD progressionspecific genes" (Fig. 1e) and the detailed information is shown in supplementary Table 2.A PPI network was generated to depict the functional and physical linkages among these DEGs, and we observed that the hub (the red circle) in the network is mainly composed of collagen family members (Fig. 1f ).
To explore potential targets and applicable drugs for high-risk NAFLD patients, the above-mentioned 182 NAFLD progression-specific genes were submitted to CMap for further investigation.Top 10 compounds with the highest predictive scores and corresponding 7 mode-of-actions (MoAs) were displayed in a dot diagram (Fig. 2g).The 7 MoAs were annotated with "Angiotensin receptor antagonist", "Aromatic hydrocarbon derivative", "Aurora kinase inhibitor", "EGFR inhibitor", "HDAC inhibitor", "NFκB pathway inhibitor", and "Sphingosine kinase inhibitor".In particular, four compounds named HDAC3-selective, entinostat, mocetinostat, and Merck60 share a common MoA of HDAC inhibitor (HDACi), which indicates HDACi might be a potentially applicable drug for advanced NAFLD patients.

Different risk subgroups with distinct inflammatory and fibrotic patterns were identified in NAFL
The identified 182 genes concerning NAFLD progression were further analyzed using GO method, and a Circos illustrated that they were mainly enriched in five GOBPs annotated with "ECM organization", "Blood vessel development", "Wnt signaling pathway", "Cell adhesion", and "Cell morphogenesis" (Fig. 2a).We observed that the five GOBPs include the main biological alterations occurred in NASH, and this result further indicated that the 182 genes could commendably represent the pathological development of NAFLD.Based on the expression profile of 182 NAFLD progression genes and using the NMF algorithm, the 51 NAFLs in the training cohort were divided into two subclusters (C1 = 21, C2 = 30; Fig. 2b).Using ssGSEA algorithm, we found that the quantified scores of "Inflammatory response", "ECM organization", and "Cell-cell adhesion" were significantly and progressively elevated from C2 to C1 to NASH (Fig. 2c -e; *** p < 0.001).In particular, positive correlations between "Inflammatory response" and "ECM organization" or "Cell-cell adhesion" were observed in NAFL-C1, C2 and NASH  (d) A volcano plot showed a total of 378 DEGs between NAFL and NASH samples with a filtering threshold of q value less than 0.01 and fold change (FC) > 2 or < 0.5.(e) 182 overlapping genes in the intersection of "WGCNA-identified gene module of NAFLD progression" and "DEGs between NAFL and NASH" are considered as "NAFLD progression-specific genes".(f) A PPI network was generated to depict the functional and physical linkages among these DEGs, and the hub (the red circle) in the network is mainly composed of collagen family members.(g) CMap algorithm showed top 10 compounds with the highest predictive scores and corresponding 7 mode-of-actions (MoAs) in a dot diagram.The 7 MoAs were annotated with "Angiotensin receptor antagonist", "Aromatic hydrocarbon derivative", "Aurora kinase inhibitor", "EGFR inhibitor", "HDAC inhibitor", "NFκB pathway inhibitor", and "Sphingosine kinase inhibitor" samples (Fig. 2f & g).Furthermore, among all correlations, NASH exhibited the weakest correlation between "Inflammatory response" and "ECM organization" (Fig. 2f ).
Considering inflammation as a driver in NAFLD progression, we selected a list of widely acknowledged inflammatory factors (LPAR1, PTPRE, CCR2, CCL20, CLEC5A, CXCL6, ITGB8, PDPN, and GPC3) and showed their expression profiles in NAFL-C1, C2 and NASH (Fig. 2h; *** p < 0.001).Using one-way ANOVA test, we found that all these inflammatory factors were significantly decreased in the NAFL-C2 group compared to The 182 NAFLD progression-specific genes were analyzed using GO method, and the Circos illustrated that they were mainly enriched in five GOBPs annotated with "ECM organization", "Blood vessel development", "Wnt signaling pathway", "Cell adhesion", and "Cell morphogenesis".(b) Based on the expression profile of the 182 NAFLD progression genes and using the NMF algorithm, the 51 NAFLs in the training cohort were divided into two subclusters (C1 = 21, C2 = 30).(c-e) Using ssGSEA quantification, we observed that the ssGSEA scores of "Inflammatory response", "ECM organization", and "Cell-cell adhesion" were progressively elevated from C2 to C1 to NASH.(f & g) Positive correlations between "Inflammatory response" and "ECM organization" or "Cell-cell adhesion" were observed in NAFL-C1, C2, and NASH samples.(h) A group of widely acknowledged inflammatory factors including LPAR1, PTPRE, CCR2, CCL20, CLEC5A, CXCL6, ITGB8, PDPN, and GPC3 were significantly decreased in the NAFL-C2 group compared to either NAFL-C1 or NASH.(i) The fibroblasts abundance was significantly downregulated in NAFL-C2.Th1 cell infiltration was significantly upregulated in NAFL-C2, while no significant difference of Th2 cell infiltration was observed among the three groups.(j) The NAFL-C2 group exhibited the lowest stroma score, while no significant difference of stroma score was observed between NAFL-C1 and NASH.* p < 0.05, ** p < 0.01, *** p < 0.001, # not significant either NAFL-C1 or NASH.In addition, we evaluated the abundance of fibrosis-related cell populations including fibroblast and Th1/2 cells in the three groups using xCell algorithm.The fibroblasts abundance was significantly downregulated in NAFL-C2.Th1 cell infiltration was significantly upregulated in NAFL-C2, while no significant difference of Th2 cell infiltration was observed among the three groups (Fig. 2i; * p < 0.05, *** p < 0.001, # not significant).Furthermore, the NAFL-C2 group exhibited the lowest stroma score, while no significant difference of stroma score was observed between NAFL-C1 and NASH (Fig. 2j; ** p < 0.01, *** p < 0.001, # not significant).

Establishment and validation of the discriminative score for risk stratification
To assess the risk stratification of NAFLD in a quantitative method, we combined different machine learning (ML) approaches to screen for robust biomarkers.The training method of the LOOCV framework was introduced in Fig. 3a, and the iteration number in our study is N = 98 (the sample number in the training cohort).The identified 182 genes regarding NAFLD progression were trained in the RF and SVM algorithms with feature selection of recursive feature elimination (RFE) respectively, and 8 overlapping genes (COL1A1, COL1A2, COL4A1, COL4A2, COL5A1, DTNA, THBS1, and UBD) remained in the outputs of the two ML algorithms.Finally, Lasso logistic regression (LR) analysis was applied on the 8 genes, and only 4 genes (DTNA, COL4A2, UBD, and COL1A2) retained their coefficients.Among them, COL1A2 exhibited the highest coefficient (Fig. 3b).PCA analysis showed a clear separation of NAFL and NASH samples with the expression matrix of the four genes (Fig. 3c).Furthermore, dotplots showed that all four genes were significantly and stepwisely elevated from NAFL-C2 to C1 to NASH samples (Fig. 3d).In the training cohort, the discriminative score was calculated for each sample according to the formula stated in the methods section, and ROC analysis indicated that the score could discriminate NASH from NAFL accurately (Fig. 3e).
A dataset named GSE163211, which contains 88 steatosis samples, 72 NASH with F0, and 82 NASH with F1-4, was used as an external testing cohort.The GSE163211 dataset was produced from the GPL29503 platform of NanoString Technologies and contains RNA-seq data of 800 custom genes.Among the four genes (DTNA, COL4A2, UBD, and COL1A2), only COL1A2 and COL4A2 were detected in the platform.COL1A2 and COL4A2 were significantly upregulated in NASH with fibrosis, and the 2-gene discriminative score was also significantly higher than steatosis and NASH samples without fibrosis (Fig. 3f ).In the total of 154 NASH samples, the ROC analysis demonstrated that the 2-gene discriminative score could identify advanced fibrotic samples with a favorable performance (AUC = 0.724, 95% CI = 0.644-0.804;Fig. 3g).

The risk-stratification gene signature was significantly correlated with malignant progression
To evaluate the biological role of the four discriminative genes in NAFLD and malignant progression, we investigated the expression profile of the four genes in different datasets which consist of normal liver tissues, NASH, and HCC samples.In the GSE164760 microarray dataset, all the four genes were significantly upregulated in HCC samples derived from NASH when compared to healthy liver tissues and NASH samples (Fig. 4a; *** p < 0.001).In the combination of GTEx and TCGA-HCC RNA-seq database, all the Fig. 4 The risk-stratification gene signature was significantly correlated with malignant progression.(a) In the GSE164760 microarray dataset, all the four genes (COL1A2, COL4A2, UBD, and DTNA) were significantly upregulated in HCC samples derived from NASH when compared to healthy liver tissues and NASH samples.(b) In the combination of GTEx and TCGA-HCC RNA-seq database, all the four genes were also significantly upregulated in HCC samples compared to donated normal tissues and adjacent normal tissues.(c, f and i) Using UMAP dimensionality reduction, three scRNA-seq datasets including GSE125449, GSE146409 and GSE166635 were used to reveal the components of HCC tumor microenvironment (TME) and (d, g and j) the expression profile of COL1A2 in different cell types, respectively.(e, h and k) Violin plots showed that COL1A2 is specifically expressed in fibroblasts, almost not expressed in other cells within HCC TME.*** p < 0.001 four genes were also significantly upregulated in HCC samples compared to donated normal tissues and adjacent normal tissues (Fig. 4b; *** p < 0.001).
In addition, three scRNA-seq datasets including GSE125449, GSE146409 and GSE166635 were used to reveal the components of HCC tumor microenvironment (TME) and the expression characteristics of hub genes.UMAP dimensionality reduction was used to show the distribution and dissimilarity of the cell types involved in HCC TME (Fig. 4c, f, and i).Considering COL1A2 possesses the highest discriminative coefficient among the four genes, we assessed the expression characteristics of COL1A2 (Fig. 4d, g, and j) in different cell types, and all the three violin plots showed that COL1A2 is specifically expressed in fibroblasts, almost not expressed in other cell types (Fig. 4e, h, and k).These findings demonstrated that the risk-stratification gene signature was closely relevant to the progression of NAFLD and HCC, and COL1A2 might play a specific role in fibroblast activation and fibrosis severity.

Analysis of mutational patterns and CTNNB1/COL1A2 axis in NAFLD and HCC
Using the WES data of TCGA-HCC and NMF algorithm, we attempted to decipher the mutational patterns of HCC and explore the relationship between COL1A2 and mutation patterns in HCC.The optimal k factorization of 5 was selected (Supplementary Fig. 2), and five mutational signatures were identified and matched in the COSMIC database (Fig. 5a).The three major mutational signatures were annotated with "Defective DNA mismatch repair", "Aflatoxin exposure", and "Aristolochic acid exposure", and the abundance of each mutational signature in TCGA-HCC was shown in a pie chart (Fig. 5b).We reviewed the medical records of the TCGA-HCC cohort and extracted 184 HCC samples with history of alcohol consumption, hepatitis or NAFLD, and the distribution of each mutational signature was depicted in a stacked barplot (Fig. 5c).Using chi-square test, we found that NAFLD-HCC is characterized with high COL1A2 expression (Fig. 5d).With quantiles of COL1A2 mRNA expression, 46 HCC samples were assigned to the COL1A2-lowest and -highest group, respectively.Oncoplots of the two groups demonstrated that CTNNB1 acts as the most frequently mutated gene in the COL1A2-low cohort, with the mutation frequency up to 48% (left panel, Fig. 5e).In contrast, CTNNB1 is rarely mutated in the COL1A2-high cohort (right panel, Fig. 5e).In the integrated analyses of TCGA-HCC, MSK-HCC and INSERM-HCC cohorts, the gene-pair of TP53 and CTNNB1 is shown significantly mutually exclusive (Fig. 5f ).Furthermore, COL1A2 mRNA expression is significantly elevated in CTNNB1-wild type HCC samples compared to CTNNB1-mutated ones (Fig. 5g).Among nine representative oncogenic pathways, the WNT signaling pathway is the most frequently affected one in the COL1A2-low cohort (Fig. 5h).GOBP analysis was used to further explore the altered pathways in CTNNB1-WT/COL1A2-high samples.A functional network showed the five most important pathways in CTNNB1-WT/COL1A2-high HCC samples were termed "vasculature development", "chemotaxis", "ECM organization", "positive regulation of locomotion", and "positive regulation of cell adhesion" (Fig. 5i).These evidences demonstrated that the CTNNB1/COL1A2 axis might play a role in the fibrosis and inflammation severity during NAFLD-HCC progression.

Distinct immune and stromal patterns were observed among different CTNNB1/COL1A2 groups
The infiltrating immune and stromal cells were estimated by TIMER, Cibersort, quan-TIseq and MCP-counter, and the differences of distribution were also investigated in distinct CTNNB1/COL1A2 groups.As shown in a comprehensive heatmap, most of (i) A functional network showed the top five important pathways in CTNNB1-WT/COL1A2-high HCC samples were termed "vasculature development", "chemotaxis", "ECM organization", "positive regulation of locomotion", and "positive regulation of cell adhesion" the immune and stromal cells are enriched in the CTNNB1-WT/COL1A2-high samples, regardless of the TMB level (Fig. 6a).Furthermore, the xCell algorithm inferred the absolute infiltration of a total of 36 cell types involved in the TME, and the chord diagram showed that the infiltrating abundance is significantly higher in the CTNNB1-WT/COL1A2-high and CTNNB1-Mut/COL1A2-high samples than the others (Fig. 2b).Infiltrating scores of CAFs were inferred using EPIC, MCP-counter, and xCell, and we observed that CAFs are significantly enriched in the CTNNB1-WT/COL1A2-high samples, suggesting the distinct fibrosis severity resulted from different classification of CTNNB1-WT and high COL1A2 expression (Fig. 6c -e).On the other hand, we The ESTIMATE algorithm was applied to infer the immune infiltration and tumor purity for each sample, and the CTNNB1-WT/COL1A2-high and CTNNB1-Mut/COL1A2-high groups are labelled with high immune infiltration, and (j) a significantly negative correlation (r = -0.916,p < 0.001) was observed between the immune score and tumor purity across the four groups.Similarly, (k) the inflammatory response activity exhibits distinct distributions among the four groups, and (l) a significantly positive correlation (r = 0.839, p < 0.001) between immune score and inflammatory response was also observed across the four groups.*** p < 0.001 evaluated the levels of representative immune checkpoints including CD274, PDCD1 and TIGIT, and significant differences were observed among different groups, indicating heterogenous tumor immunogenicity and distinct potential response to immunotherapy (Fig. 6f -h).Moreover, the ESTIMATE algorithm was applied to infer the immune infiltration and tumor purity of each sample, and we observed that the CTNNB1-WT/ COL1A2-high and CTNNB1-Mut/COL1A2-high groups are labelled with high immune infiltration (Fig. 6i), and a significantly negative correlation (r = -0.916,p < 0.001) between the immune score and tumor purity across the four groups was shown in Fig. 6j.Similarly, the inflammatory response activity exhibits distinct distributions among the four groups (Fig. 6k), and a significantly positive correlation (r = 0.839, p < 0.001; Fig. 6l) with immune score was also observed across the four groups.

COL1A2 significantly correlates with EMT and angiogenesis in pan-cancer
To investigate the connection between the hub gene COL1A2 and malignant features in pan-cancer, we quantified the ability of ten cancer hallmarks including EMT, angiogenesis, apoptosis, inflammation, hypoxia, glycolysis, cell cycle progression (CCP), senescence, DNA repair and oxidative phosphorylation using ssGSEA algorithm, and calculated the correlations between COL1A2 expression and the ten hallmarks of cancer (Fig. 7a).Among all the cancer hallmarks, COL1A2 mostly correlated with EMT and angiogenesis in 32 malignant solid cancers (EMT: r = 0.86, p < 2.2e-16; Angiogenesis: r = 0.73, p < 2.2e-16) in the overall TCGA pan-cancer cohort (Fig. 7b & c), or in individual tumor types (Fig. 7d & e).Considering EMT and angiogenesis are two critical biological features which contribute to the initial development of cancer, we reasonably speculated that COL1A2 serves a significant role in malignant progression in pan-cancer.

Discussion
NAFLD is a major public health problem worldwide and is becoming a leading cause of chronic liver disease [2].As a complex heterogeneous disease, NAFLD results from both intrinsic susceptibility and environmental factors, and it is characterized by distinctive pathological features including hepatic steatosis, liver fibrosis, and chronic inflammation [3].As the advanced stage of NAFLD, NASH acts as a key step that induces cirrhosis even HCC.However, not every NAFL patient develops to NASH, and the conversion rate is about 25% [3].Therefore, there is an urgent need to classify NAFL patients into different risk subgroups and tailor personalized therapeutic strategies.Obviously, significantly altered signaling pathways and key regulators concerning NAFLD progression determine whether patients with simple NAFL develop to NASH or not.In this study, we attempted to identify pathways and genes closely connected with NAFLD progression and applied them to risk stratification of NAFL samples using a series of bioinformatic methods and machine learning approaches.
Firstly, we investigated the significantly dysregulated pathways in NASH using NAFL as the comparative baseline.Notably, "ECM organization" was identified as the most crucial biological process that drives NAFL to NASH.In a training cohort that contains 51 NAFL and 47 NASH, a total of 182 candidate genes involved in NAFLD progression were identified using an integrated strategy of WGCNA and DEGs screening.GO enrichment analysis confirmed that the 182 genes were mainly enriched in "ECM organization" etc., indicating their explicit role in NAFLD progression.Subsequently, the 182 candidate genes were submitted to the CMap algorithm for drug repositioning, and HDACi exhibited the highest predictive score as a potential drug for the treatment of NAFLD progression.Interestingly, a recent study reported that HDACi givinostat could attenuate NAFLD and liver fibrosis [40], which raises the possibility that HDACi could be an actionable strategy for NAFLD.
Using NMF algorithm with the expression profile of the 182 genes, the 51 NAFL patients were divided into two groups (NAFL-C1 and C2), and the two groups exhibited distinct molecular and pathological features.In detail, significantly higher levels of inflammation, ECM organization and cell-cell adhesion were observed in C1, and the expression patterns of inflammatory factors in C1 were more similar to NASH.No significant difference of Th2 infiltration was observed among NAFL-C1, C2 and NASH.However, the distribution patterns of fibroblast abundance, Th1 infiltration and stroma score of C1 samples were extremely similar to NASH.It is widely acknowledged that Th1/2 balance affects the progression of fibrosis in various tissues and organs including hepatology [41], and activated fibroblasts are classically associated with fibrosis severity in disease progression [42].Thus, we reasonably speculated that although NAFL-C1 samples appear in an early stage of NAFLD and diagnosed as NAFL, they actually have a high similarity of inflammatory response and fibrotic potential with NASH and possess an intrinsic tendency to develop to an advanced stage.Compared with NAFL-C2 patients, C1 patients may benefit from more intensive surveillance and preventive intervene.
To evaluate the risk level of NAFLD in a quantitative method, we combined different machine learning approaches including RF, SVM, and Lasso LR to screen for robust biomarkers to discriminate more severe NAFLD cases.Four genes named DTNA, COL4A2, UBD, and COL1A2 were finally screened out after the rigorous selective procedure, and COL1A2 held the highest logistic regression coefficient among the four genes.COL1A2 belongs to the type I collagen gene family and was reported to mediate ECM deposition and promote fibrotic diseases [43].In independent validation cohorts, the established risk-stratification gene signature could discriminate advanced NAFLD samples.In addition, all the four genes are significantly upregulated in HCC samples compared to normal liver.In particular, COL1A2 is specifically expressed in fibroblasts involved in HCC TME, indicating that COL1A2 might play a critical role in fibroblast activation and fibrosis severity during NAFLD progression and development to HCC.Furthermore, COL1A2 also significantly correlates with EMT and angiogenesis in pan-cancer, suggesting its significant role in the initial development and malignant progression of solid cancers.
In addition, NAFLD-HCC samples are characterized by high expression of COL1A2 with wild-type CTNNB1.By analyzing mutational patterns of TCGA-HCC samples, we found that CTNNB1 is much more frequently mutated in low-COL1A2 HCC samples, while the mutation frequency of TP53 has no significant correlation with COL1A2 expression as a comparison.These findings indicated that Wnt/β-catenin/COL1A2 axis might play an important role in fibrosis severity in NAFLD-HCC progression, while inactivation of β-catenin might attenuate the progression of fibrosis through downregulation of fibrotic progression-related key genes such as COL1A2.Previous bioinformatic studies have shown that COL1A2 has a diagnostic value and may play an important role in NAFLD progression [44,45], which is consistent with our results.Govaere et al. [13] previously reported a 25-gene signature for steatohepatitis and fibrosis in NAFLD using pairwise comparison and logistic regression analysis, and we observed our genes COL1A2 and DTNA were overlapped with Govaere's gene signature.In comparison, we integrated multiple machine learning approaches to screen for robust biomarkers, and our results showed that four genes are enough to discriminate advanced stages and fibrosis levels for NAFLD with favorable performances.We believe the advantage of our gene signature is simpler and more workable in clinical practice.
In summary, our study provided evidence for the necessity of molecular classification for NAFLD, and we established a risk-stratification gene signature to quantify risk assessment, aiming to identify the high-risk subset and to guide personalized treatment.We hope this work could facilitate the personalized therapeutic strategy for NAFLD, especially those patients appear in early stage but actually with high-risk NAFL.

Fig. 1
Fig. 1 Identification of NAFLD progression-specific pathways and genes.(a) The similarities among all the biological processes (BPs) were measured and visualized, and similar terms were clustered into a common branch.(b) The top five altered GOBPs were displayed in a Circos diagram.(c) WGCNA was performed with the transcriptome profiling data of 98 NAFLD samples and a total of 36 gene modules were identified.The brown module which exhibited the highest correlation with sample category (|r| = 0.61, p = 6e-11) and was considered as "WGCNA-identified gene module of NAFLD progression".(d)Avolcano plot showed a total of 378 DEGs between NAFL and NASH samples with a filtering threshold of q value less than 0.01 and fold change (FC) > 2 or < 0.5.(e) 182 overlapping genes in the intersection of "WGCNA-identified gene module of NAFLD progression" and "DEGs between NAFL and NASH" are considered as "NAFLD progression-specific genes".(f) A PPI network was generated to depict the functional and physical linkages among these DEGs, and the hub (the red circle) in the network is mainly composed of collagen family members.(g) CMap algorithm showed top 10 compounds with the highest predictive scores and corresponding 7 mode-of-actions (MoAs) in a dot diagram.The 7 MoAs were annotated with "Angiotensin receptor antagonist", "Aromatic hydrocarbon derivative", "Aurora kinase inhibitor", "EGFR inhibitor", "HDAC inhibitor", "NFκB pathway inhibitor", and "Sphingosine kinase inhibitor"

Fig. 2
Fig.2Different risk subgroups with distinct inflammatory and fibrotic patterns were identified in NAFL.(a) The 182 NAFLD progression-specific genes were analyzed using GO method, and the Circos illustrated that they were mainly enriched in five GOBPs annotated with "ECM organization", "Blood vessel development", "Wnt signaling pathway", "Cell adhesion", and "Cell morphogenesis".(b) Based on the expression profile of the 182 NAFLD progression genes and using the NMF algorithm, the 51 NAFLs in the training cohort were divided into two subclusters (C1 = 21, C2 = 30).(c-e) Using ssGSEA quantification, we observed that the ssGSEA scores of "Inflammatory response", "ECM organization", and "Cell-cell adhesion" were progressively elevated from C2 to C1 to NASH.(f & g) Positive correlations between "Inflammatory response" and "ECM organization" or "Cell-cell adhesion" were observed in NAFL-C1, C2, and NASH samples.(h) A group of widely acknowledged inflammatory factors including LPAR1, PTPRE, CCR2, CCL20, CLEC5A, CXCL6, ITGB8, PDPN, and GPC3 were significantly decreased in the NAFL-C2 group compared to either NAFL-C1 or NASH.(i) The fibroblasts abundance was significantly downregulated in NAFL-C2.Th1 cell infiltration was significantly upregulated in NAFL-C2, while no significant difference of Th2 cell infiltration was observed among the three groups.(j) The NAFL-C2 group exhibited the lowest stroma score, while no significant difference of stroma score was observed between NAFL-C1 and NASH.* p < 0.05, ** p < 0.01, *** p < 0.001, # not significant

Fig. 3
Fig. 3 Establishment and validation of a discriminative gene signature for risk stratification in NAFLD.(a) The training method of the LOOCV framework was introduced, and the iteration number in our study is 98 (the sample number in the training cohort).The 182 genes regarding NAFLD progression were trained in the RF and SVM algorithms with feature selection of recursive feature elimination (RFE) respectively, and 8 overlapping genes remained in the outputs of the two machine learning approaches.Subsequently, Lasso logistic regression (LR) analysis was applied on the 8 genes, and only 4 genes retained their coefficients, (b) and COL1A2 exhibited the highest coefficient.(c) PCA analysis showed a clear separation of NAFL and NASH samples with the expression matrix of the four genes.(d) Dotplots showed that all four genes were significantly upregulated in NASH compared to NAFL samples.*** p < 0.001.(e) In the training cohort, the ROC analysis indicated that the gene signature-derived score could discriminate NASH from NAFL accurately.(f) In GSE163211, COL1A2 and COL4A2 were significantly upregulated in NASH with fibrosis, and the 2-gene discriminative score was also significantly higher than steatosis and NASH samples without fibrosis.(g) The ROC analysis demonstrated that the 2-gene (COL1A2 and COL4A2) discriminative score could identify advanced samples with a favorable performance (AUC = 0.724, 95% CI = 0.644-0.804).(h) In GSE135251, the "3-gene score" was significantly and stepwisely elevated from normal to NAFL to NASH with advanced fibrosis levels, and it exhibited favorable performances in discriminating (i) advanced stages (AUC = 0.737, 95% CI = 0.665-0.810)and (j) advanced fibrosis levels (AUC = 0.729, 95% CI = 0.650-0.808)

Fig. 5
Fig. 5 CTNNB1/COL1A2 axis correlates with fibrosis severity during NAFLD-HCC progression.(a) Five mutational signatures were identified using TCGA-HCC WES data and NMF algorithm.The three major mutational signatures were annotated with "Defective DNA mismatch repair", "Aflatoxin exposure", and "Aristolochic acid exposure".(b) The abundance of each mutational signature in TCGA-HCC was shown in a pie chart.(c) 184 HCC samples with history of alcohol consumption, hepatitis or NAFLD were extracted, and the distribution of each mutational signature was depicted in a stacked barplot.(d) NAFLD-HCC is characterized with high COL1A2 expression.(e) Oncoplot demonstrated that CTNNB1 acts as the most frequently mutated gene in the COL1A2-low cohort, with the mutation frequency up to 48% (left panel).In contrast, CTNNB1 is observed rarely mutated in the COL1A2-high samples (right panel).(f) In the integrated analyses of TCGA-HCC, MSK-HCC and INSERM-HCC cohorts, the genepair of TP53 and CTNNB1 is shown significantly mutually exclusive.(g) COL1A2 mRNA expression is significantly elevated in CTNNB1-wild type HCC samples compared to CTNNB1-mutated ones.(h) Among nine representative oncogenic pathways, the WNT signaling pathway is the most frequently affected one in the COL1A2-low cohort.(i)A functional network showed the top five important pathways in CTNNB1-WT/COL1A2-high HCC samples were termed "vasculature development", "chemotaxis", "ECM organization", "positive regulation of locomotion", and "positive regulation of cell adhesion"

Fig. 6
Fig. 6 Distinct immune and stromal patterns were observed among different CTNNB1/COL1A2 groups.(a) A comprehensive heatmap showed that most of the immune and stromal cells are enriched in the CTNNB1-WT/ COL1A2-high samples, regardless of the TMB level.(b) The xCell algorithm inferred the absolute infiltration of a total of 36 cell types involved in the TME, and the chord diagram showed that the infiltrating abundance is significantly higher in the CTNNB1-WT/COL1A2-high and CTNNB1-Mut/COL1A2-high samples than the other groups.(c -e) CAFs are significantly enriched in the CTNNB1-WT/COL1A2-high samples.(f -h) Different levels of representative immune checkpoints including CD274, PDCD1 and TIGIT indicate heterogenous tumor immunogenicity and distinct potential response to immunotherapy among different groups.(i) The ESTIMATE algorithm was applied to infer the immune infiltration and tumor purity for each sample, and the CTNNB1-WT/COL1A2-high and CTNNB1-Mut/COL1A2-high groups are labelled with high immune infiltration, and (j) a significantly negative correlation (r = -0.916,p < 0.001) was observed between the immune score and tumor purity across the four groups.Similarly, (k) the inflammatory response activity exhibits distinct distributions among the four groups, and (l) a significantly positive correlation (r = 0.839, p < 0.001) between immune score and inflammatory response was also observed across the four groups.*** p < 0.001