Feature reduction for hepatocellular carcinoma prediction using machine learning algorithms

Hepatocellular carcinoma (HCC) is a highly prevalent form of liver cancer that necessitates accurate prediction models for early diagnosis and effective treatment. Machine learning algorithms have demonstrated promising results in various medical domains, including cancer prediction. In this study, we propose a comprehensive approach for HCC prediction by comparing the performance of different machine learning algorithms before and after applying feature reduction methods. We employ popular feature reduction techniques, such as weighting features, hidden feature correlation, feature selection, and optimized selection, to extract a reduced feature subset that captures the most relevant information related to HCC. Subsequently, we apply multiple algorithms, including Naive Bayes, support vector machines (SVM), neural networks, decision trees, and k-nearest neighbors (KNN), to both the original high-dimensional dataset and the reduced feature set. By comparing the predictive accuracy, precision, F-score, recall, and execution time of each algorithm, we assess the effectiveness of feature reduction in enhancing the performance of HCC prediction models. Our experimental results, obtained using a comprehensive dataset comprising clinical features of HCC patients, demonstrate that feature reduction significantly improves the performance of all examined algorithms. Notably, the reduced feature set consistently outperforms the original high-dimensional dataset in terms of prediction accuracy and execution time. After applying feature reduction techniques, the employed algorithms, namely decision trees, Naive Bayes, KNN, neural networks, and SVM, achieved accuracies of 96%, 97.33%, 94.67%, 96%, and 96%, respectively.


Introduction
According to reports from the World Health Organization (WHO), approximately 14.1 million individuals are diagnosed with cancer each year, resulting in 8.2 million deaths globally [1]. Hepatocellular carcinoma (HCC) is a form of liver cancer that arises from chronic liver disease and cirrhosis. Recent studies indicate that HCC is one of the most lethal cancers worldwide, leading to approximately 600,000 deaths annually [2]. Furthermore, liver cancer holds the sixth position among the most frequently diagnosed cancers worldwide [3]. These facts demonstrate the global impact of HCC on human lives. Consequently, it is crucial to reduce the mortality rate associated with HCC, which can only be achieved through early detection. To accomplish this goal, it is imperative to leverage various data mining and machine learning techniques to develop an automated diagnostic system that can accurately predict HCC, ensuring more efficient and timely detection. Data mining is a multidisciplinary domain that employs principles from computer science and statistics to extract valuable information, such as features or rules, from provided data [4]. Conversely, machine learning is a branch of computer science that focuses on techniques and methodologies through which machines acquire knowledge and learn from experience [5]. In the present era, machine learning and data mining techniques are experiencing rapid growth and extensive application in medical diagnostics, where they are used to tackle a variety of challenges [6-13].
Our research began with a focus on acknowledging the importance of normalized data. A clear trend was observed in previous work: better model performance with normalized data. This observation led us to adapt our dataset accordingly. Next, we introduced feature selection methods, starting with recursive feature elimination (RFE). This method tests the model's performance with each candidate feature subset, systematically removing features and re-testing the model to find the best iteration. We then used principal component analysis (PCA), a popular method for feature extraction. Its goal is to reduce the dimensionality of a dataset while preserving as much of the information as possible. PCA accomplishes this by creating new uncorrelated variables, or components, that successively maximize variance. In our study, PCA was utilized to transform the dataset into a set of linearly uncorrelated variables termed principal components. Finally, optimization feature operators were applied. It is well recognized that optimizing the selection of feature subsets can significantly improve the performance of a classifier. To rate the importance of a feature for the classification task, mutual information was utilized. This was followed by executing various machine learning algorithms to assess classification performance.
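As a rough, non-authoritative illustration of this pipeline, the sketch below strings together normalization, RFE, PCA, and mutual-information weighting with scikit-learn; the data, feature counts, and parameter values are placeholders rather than the study's RapidMiner configuration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((376, 59))            # placeholder for the cleaned clinical features
y = rng.integers(0, 2, size=376)     # placeholder tumor-free / with-tumor labels

X_norm = MinMaxScaler().fit_transform(X)                   # normalization step

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20)
X_rfe = rfe.fit_transform(X_norm, y)                       # recursive feature elimination

X_pca = PCA(n_components=10).fit_transform(X_norm)         # principal components

mi = mutual_info_classif(X_norm, y, random_state=0)        # mutual-information weights
top_by_mi = np.argsort(mi)[::-1][:20]                      # indices of most informative features
```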
A clear challenge exists in the form of hepatocellular carcinoma (HCC), a lethal form of cancer cloaked in diagnostic complexity. Accurate, efficient predictive models are crucial for timely diagnosis and optimized treatment. However, conventional predictive models are hindered by the 'dimensionality curse', a common obstacle in the high-dimensional datasets used in HCC diagnosis.

Problem statement
Despite being one of the most lethal forms of cancer, hepatocellular carcinoma (HCC) remains shrouded in diagnostic complexity. The development of accurate and efficient predictive models is a critical facilitator of timely diagnosis and effective treatment. However, traditional predictive models have demonstrated limited proficiency, stunted by the dimensionality curse commonly associated with the high-dimensional datasets acquired in HCC diagnosis.

Research question
Can the application of alternative feature reduction techniques significantly enhance the performance of machine learning algorithms in the prediction of Hepatocellular Carcinoma?

Research gap
Previous studies have noted the positive relationship between reducing feature dimensionality and the predictive accuracy of machine learning algorithms. However, there remains a conspicuous lack of comprehensive approaches that compare the performance of various machine learning algorithms under the influence of different feature reduction techniques in the domain of hepatocellular carcinoma prediction.

Contributions
This study makes an important contribution to the field of computational HCC prediction by comparing the performance of widely used machine learning algorithms before and after the implementation of feature reduction techniques. The main contributions can be summarized as follows:
1. Adoption of data normalization to improve our model's performance, as reinforced by earlier studies.
2. Execution of feature selection methods, including recursive feature elimination (RFE) and principal component analysis (PCA), to boost the effectiveness of our predictive model.
3. Assessment of the influence of various features on the classification task by deploying mutual information.
4. Performance comparison of different machine learning algorithms, gauging their classification results.
5. Addressing existing research shortcomings by performing an extensive comparison of multiple feature reduction techniques and their corresponding impact on the outcomes of a range of machine learning algorithms, particularly for hepatocellular carcinoma (HCC) prediction.
6. Advancing the computational prediction field for HCC by examining performance shifts in a variety of machine learning algorithms both before and after the integration of feature reduction techniques.

Related work
In a research study, Abajian et al. [14] examined 36 patients with HCC who underwent transarterial chemoembolization. They employed machine learning techniques, specifically linear regression and random forest, and achieved an overall accuracy of 78%. In a study by Ioannou et al. [15] focused on predicting the occurrence of hepatocellular carcinoma (HCC) within 3 years, a recurrent neural network (RNN) was trained using data from patients with hepatitis C virus (HCV)-related cirrhosis. The dataset included four variables measured at the beginning of the study and 27 variables measured over time, collected from 48,151 patients receiving healthcare within the US Department of Veterans Affairs system. The findings of the study demonstrated that the RNN model outperformed logistic regression in predicting the development of HCC within the specified timeframe. The RNN achieved an accuracy of 75.9% for all patients and 80.6% for patients who achieved sustained virologic response (SVR) in predicting the onset of HCC.
In a research study conducted by Nam et al. [16], a deep neural network was developed to predict the occurrence of hepatocellular carcinoma (HCC) over 3- and 5-year periods in patients with hepatitis B virus (HBV)-related cirrhosis who were undergoing entecavir therapy. The study examined 424 patients and demonstrated that the deep learning (DL) model outperformed six other previously reported models that utilized older modeling techniques. Additionally, the DL model was tested on a validation cohort consisting of 316 patients, and the results indicated a Harrell's C-index of 0.782, indicating a high level of accuracy in predicting the incidence of HCC in these patients.
Nam et al. [17] built upon their previous work by developing MoRAL-AI, a novel artificial intelligence model utilizing deep learning techniques, to identify liver cancer (HCC) patients at high risk of tumor recurrence after transplantation. The MoRAL-AI model analyzed several prognostic factors, including tumor size, patient age, blood alpha-fetoprotein (AFP) levels, and prothrombin time, to generate risk predictions. Results of the study demonstrated that MoRAL-AI outperformed traditional prediction models such as the Milan, UCSF, up-to-seven, and Kyoto criteria in determining which HCC patients faced elevated recurrence risk post-transplant. Specifically, MoRAL-AI achieved a C-index of 0.75 for prognostic accuracy compared to 0.64, 0.62, 0.50, and 0.50 for the other models respectively, with this difference being statistically significant (p < 0.001). In summary, MoRAL-AI represented an improved approach for identifying HCC patients likely to experience recurrence following liver transplantation.
In their study, Ali et al. [18] evaluated the predictive performance of various machine learning algorithms for hepatocellular carcinoma (HCC), including logistic regression, k-nearest neighbors (KNN), decision tree, random forest, and support vector machine (SVM). Additionally, they proposed and tested a novel combination approach utilizing linear discriminant analysis (LDA), a genetic algorithm (GA), and SVM. When comparing all models, the results demonstrated that the LDA-GA-SVM approach yielded the best overall predictive ability. Specifically, the LDA-GA-SVM achieved the highest accuracy of 0.899, sensitivity of 0.892, and specificity of 0.906. These performance metrics were superior to those obtained when using the other individual algorithms evaluated: logistic regression, KNN, decision tree, random forest, and SVM alone. Therefore, the study findings suggested the LDA-GA-SVM composite model may be the most effective machine learning-based predictive tool for HCC compared to the alternative algorithms analyzed.
Cao et al. [19] evaluated the predictive performance of various machine learning models (logistic regression, k-nearest neighbors (KNN), decision tree (DT), naïve Bayes (NB), and deep neural network (DNN)) using the original dataset. The accuracy of the models ranged from 57.5 to 70.6%. Precision varied between 40.7 and 70.1%, while recall rates were between 20.0 and 67.7%. False positive rates fell between 10.7 and 35.0%, and standard deviation values ranged from 0.026 to 0.058. Among the models trained on the original dataset, KNN exhibited the best overall predictive ability. Specifically, KNN achieved an accuracy of 70.6%, precision of 70.1%, recall rate of 51.9%, and a false positive rate of 16.0% with a standard deviation of 0.042. These results indicate that of the algorithms tested on the unmodified data, KNN provided the most accurate and reliable predictions of disease status.
In a study by Zhang et al. [20] of 237 patients with liver cancer, almost 39% (92 patients) were identified as having a positive marker for microvascular invasion (MVI). This group, with an average age of 52, was predominantly male (86 out of 92). The remaining 61% of patients (145 patients) were MVI-negative, with an average age of 54 and a more balanced male-to-female ratio (124 males to 21 females). Patients with MVI had larger tumors, a higher occurrence of tumor capsules, and elevated levels of certain proteins compared to those without MVI.
In a study [21], after conducting machine learning analysis, the authors identified eight key feature variables (age, intratumoral arteries, alpha-fetoprotein, pre-operative blood glucose, number of tumors, glucose-to-lymphocyte ratio, liver cirrhosis, and pre-operative platelets) to develop six distinct prediction models. Among these models, the XGBoost model exhibited superior performance, as evidenced by area under the receiver operating characteristic curve (AUC-ROC) values of 0.993 (95% confidence interval: 0.982-1.000), 0.734 (0.601-0.867), and 0.706 (0.585-0.827) in the training, validation, and test datasets, respectively. Furthermore, calibration curve analysis and decision curve analysis demonstrated that the XGBoost model exhibited favorable predictive performance and possessed practical value in clinical applications.
Motivated by the development of different diagnostic systems based on machine learning models to improve the precision of decision-making about HCC diagnosis and prediction, we also developed an approach to enhance hepatocellular carcinoma (HCC) prediction through feature reduction methods. This study highlights the effectiveness of feature reduction in boosting the performance of various AI techniques for HCC nodule prediction. By streamlining the data, we were able to significantly improve the accuracy of algorithms such as Naive Bayes, neural networks, decision trees, SVM, and KNN.

Database description
Clinical patient data from The Cancer Genome Atlas (TCGA) database were used in this study. The TCGA LIHC clinical dataset offers a robust resource for investigating the clinical landscape of hepatocellular carcinoma (HCC). These data, encompassing diverse patient demographics, tumor characteristics, treatment details, and clinical outcomes, facilitate a multi-faceted approach to understanding disease progression and informing research avenues [22-24].
• Patient demographics: Age, sex, ethnicity, socioeconomic factors, and medical history provide context for analyzing disease epidemiology and potential risk factors, as shown in Fig. 1. Correlations between these variables and clinical outcomes can inform targeted prevention and early intervention strategies.
• Tumor characteristics: Detailed information on tumor size, stage, grade, location, and presence of underlying liver disease allows for stratification of patient populations and facilitates investigation of tumor progression patterns.
• Treatment details: Data on surgical procedures, radiation protocols, and chemotherapy regimens allow for comparative effectiveness studies and identification of optimal treatment strategies for different patient subgroups.
• Clinical outcomes: Data on overall survival, disease-free survival, time to recurrence, and response to treatment offer important endpoints for evaluating treatment efficacy and informing clinical decision-making.
• Limitations: While the TCGA LIHC clinical dataset is comprehensive, it is important to acknowledge potential limitations due to data collection inconsistencies, missing follow-up data, and selection bias. Careful consideration of these limitations is necessary to ensure accurate interpretation of results and informed research conclusions.
The dataset employed in this study comprised 77 features for each of the 377 patients in total. The label of the dataset denotes tumor status and can assume a value of "tumor free" or "with tumor". The term "tumor free" does not imply a state of normalcy, but instead refers to the absence or persistence of the neoplasm (tumor); it is a statement regarding the progression, or lack thereof, of the initial disease. It is crucial to mention that the dataset contains missing values, as not every patient record has information for all features. Within the existing body of literature, two distinct approaches are commonly employed to address missing values. The first method involves removing all samples that contain missing values, but this approach is not feasible in our case, as it would result in the loss of a significant portion of the samples. Consequently, we opted to employ the imputation method to fill in the missing values. Missing data have been addressed through a diverse range of imputation methods in prior studies [25-28]. We utilized a statistical approach to impute missing values by substituting them with the mean value of the corresponding column, or feature, in which the missing value was found. Detailed information about the clinical features of the TCGA dataset is provided in Table 1.
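For illustration, mean imputation of the kind described above can be sketched in a few lines of pandas; the DataFrame and column names here are hypothetical stand-ins for the TCGA LIHC clinical table.

```python
import pandas as pd

# Hypothetical excerpt of the clinical table with missing numeric entries.
df = pd.DataFrame({
    "age": [52, None, 61, 47],
    "afp_level": [400.0, 12.5, None, 88.0],
})

# Replace each missing value with the mean of its own column (feature).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print(df)
```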

Methodology
The proposed research entails a multi-pronged approach to enhance hepatocellular carcinoma (HCC) prediction through feature reduction methods, including feature importance, hidden feature correlation, and feature selection [29], using different algorithms. The initial phase involved a thorough review of existing literature on deep learning applications in risk assessment, diagnosis, prognosis, and therapy for HCC patients. Subsequently, a meticulous analysis of clinical variables was conducted. Deep learning and machine learning algorithms were then implemented for HCC prediction, incorporating various feature reduction techniques. The overarching objective is to demonstrably validate the superiority of employing alternative feature selection methods compared to using all features within the machine learning models for achieving accurate HCC prediction.
In this study, the workflow for training models on the dataset using feature weighting, feature correlation, normalization, and optimization operators in RapidMiner [30] involves a series of steps designed to enhance the model-building process.
First, the dataset was loaded into RapidMiner, and the relevant operators were added to the process. The weights operator allows assigning importance or significance to individual instances or attributes in the dataset. This is useful when certain instances or attributes carry more weight or relevance in the analysis.
Next, the correlation operator was applied to identify and measure the relationships between different attributes in the dataset. It helps in understanding which attributes are strongly correlated with the target variable or with each other. This information can guide feature selection and eliminate redundant or highly correlated attributes, reducing the dimensionality of the dataset.
After the correlation analysis, the normalization operator was utilized to scale and standardize the numerical attributes in the dataset. This step ensures that all attributes have similar ranges and distributions, preventing any single attribute from dominating the model training process due to differences in scale. Normalization enhances the stability and convergence of various algorithms, leading to improved model performance.
Following normalization, the optimization operator was employed to select the most relevant subset of features from the dataset. It uses optimization algorithms and statistical measures to evaluate the contribution of each attribute to the model's performance. By iteratively evaluating different feature subsets, the optimization operator identified the combination of attributes that maximizes the model's accuracy or other defined performance metrics. This step helped in reducing noise, improving model efficiency, and enhancing interpretability.
Once the optimized feature subset was determined, the dataset was divided into training and testing sets (301 examples for training and 75 examples for testing) using an appropriate sampling technique. In our case, we used stratified sampling, which involves creating random subsets while ensuring that the distribution of classes within those subsets remains consistent with the overall class distribution in the entire example set. Finally, various modeling techniques, such as decision trees, Naive Bayes, KNN, neural networks, and SVM, were applied to train the model using the selected features and the assigned weights.
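A minimal sketch of this 301/75 stratified split, using scikit-learn with placeholder data in place of the actual reduced TCGA LIHC feature matrix, might look as follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_reduced = rng.random((376, 20))        # placeholder for the reduced feature matrix
y = rng.integers(0, 2, size=376)         # placeholder tumor-free / with-tumor labels

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y,
    test_size=75,       # 75 test examples, leaving 301 for training
    stratify=y,         # keep class proportions consistent with the full set
    random_state=42,    # illustrative seed, not a setting from the study
)
```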
Extracting meaningful insights from the TCGA LIHC dataset through classification tasks requires careful consideration of the chosen model. Several factors influence this selection, including data size, feature types, interpretability needs, and computational resources. For datasets of moderate size, similar to what might be encountered within TCGA LIHC, Naive Bayes offers a strong option. Decision trees are particularly well-suited for handling the missing data inherent to real-world datasets, eliminating the need for extra imputation steps. K-nearest neighbors (KNN) stands out for its efficiency, directly comparing new data points to existing TCGA LIHC entries for prediction without a separate training phase. More complex models like neural networks can uncover hidden patterns within the data through automatic feature learning. Finally, support vector machines (SVMs) offer robustness to noise, a common challenge in TCGA LIHC datasets. After carefully weighing these factors, each model's performance on the specific TCGA LIHC subset used is evaluated using measures such as accuracy, precision, F-score, and recall. A summary of the data reduction workflow for predicting hepatocellular carcinoma is depicted in Fig. 2.
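In the same spirit, the sketch below fits the five classifier families named above on a placeholder split and reports accuracy, precision, recall, and F-score; it is a generic scikit-learn illustration, not the exact RapidMiner setup or hyperparameters used in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder data and stratified split, as in the previous sketch.
rng = np.random.default_rng(0)
X = rng.random((376, 20))
y = rng.integers(0, 2, size=376)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=75, stratify=y, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "SVM": SVC(kernel="rbf"),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"prec={precision_score(y_te, pred, zero_division=0):.3f} "
          f"rec={recall_score(y_te, pred, zero_division=0):.3f} "
          f"F1={f1_score(y_te, pred, zero_division=0):.3f}")
```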

Data preprocessing
The dataset initially consisted of 77 features. During the data cleaning process, 28 entries with unknown values in the "tumor status" column were replaced with "with tumor". In addition, two new features were introduced for further analysis: "optimal weight", based on body mass index (BMI) and categorized as Normal, Overweight, or Obesity, and "age stage", categorized as Young Adulthood, Middle Adulthood, or Late Adulthood. Redundant information, such as age, height, weight, and other columns with repeated, unavailable, or inapplicable values, as well as patient IDs, was eliminated. As a result, the final dataset comprises 59 features. Figure 3 illustrates the relationship between patients with obesity and the number of family members with a history of cancer. Our findings indicate that patients with obesity had the highest number of family members with this medical history.
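As a hypothetical pandas sketch of these preprocessing steps, the snippet below fills the unknown tumor status, derives the "optimal weight" and "age stage" categories, and drops the now-redundant raw columns; the column names, BMI cut-offs, and age boundaries are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical excerpt of the raw clinical table.
df = pd.DataFrame({
    "tumor_status": ["TUMOR FREE", None, "WITH TUMOR"],
    "height_cm": [170, 158, 181],
    "weight_kg": [65, 90, 77],
    "age": [38, 55, 71],
})

# Entries with unknown tumor status are treated as "WITH TUMOR".
df["tumor_status"] = df["tumor_status"].fillna("WITH TUMOR")

# Derive BMI-based weight category and coarse age stage (assumed cut-offs).
bmi = df["weight_kg"] / (df["height_cm"] / 100) ** 2
df["optimal_weight"] = pd.cut(bmi, bins=[0, 25, 30, 100],
                              labels=["Normal", "Overweight", "Obesity"])
df["age_stage"] = pd.cut(df["age"], bins=[0, 40, 60, 120],
                         labels=["Young Adulthood", "Middle Adulthood", "Late Adulthood"])

# Remove the raw columns made redundant by the derived features.
df = df.drop(columns=["height_cm", "weight_kg", "age"])
print(df)
```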

Feature importance
After data cleansing, the remaining 59 features were weighted with different types of weight operators, after replacing missing values, using RapidMiner. First, we applied "Weight by Information Gain". To determine how relevant each attribute is to the class attribute, the Weight by Information Gain operator uses a calculation called information gain [31]. Attributes with higher scores are considered more important. While information gain is generally reliable for assessing attribute relevance [32], it does have a potential drawback: it can sometimes overestimate the importance of attributes that have a very large number of possible values. To overcome this limitation, particularly the sensitivity to attributes with numerous unique values, we also used the information gain ratio. By analyzing the information each attribute provides for understanding the target class, this method assigns weights that reflect their relative importance. The more insightful an attribute is for predicting the category, the higher its weight will be.
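For discrete features, information gain with respect to the class equals the mutual information between feature and label, so a rough Python analogue of the "Weight by Information Gain" operator can be sketched with scikit-learn's mutual_info_classif; the feature names below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((376, 5)),
                 columns=["afp", "bmi", "fibrosis_score", "platelet_count", "bilirubin"])
y = rng.integers(0, 2, size=376)      # placeholder tumor-status labels

# Higher weight = the feature carries more information about the class label.
weights = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(weights, index=X.columns).sort_values(ascending=False)
print(ranking)
```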
Secondly, we used the "Weight by Relief" operator. Considered one of the most effective and straightforward algorithms for evaluating feature quality, Relief has gained significant recognition. The fundamental concept behind Relief is to gauge the quality of features based on their ability to differentiate between instances of the same class and instances of different classes that are nearby [33, 34]. By sampling examples and comparing the feature values between the nearest examples of the same class and of different classes, Relief calculates the relevance of features, as described in [35].
Pseudocode of the Relief algorithm:
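The pseudocode figure itself is not reproduced here; the following is a minimal NumPy sketch of basic Relief for binary classes and numeric features scaled to [0, 1], following the standard formulation rather than the exact listing from the paper.

```python
import numpy as np

def relief(X, y, n_samples=100, seed=0):
    """Basic Relief: reward features on which the nearest miss differs,
    penalize features on which the nearest hit differs."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)    # Manhattan distance to every instance
        dist[i] = np.inf                       # exclude the sampled instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest neighbor, same class
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest neighbor, other class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_samples

# Toy example: feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
X = np.column_stack([y + 0.05 * rng.random(200), rng.random(200)])
print(relief(X, y))   # weight of feature 0 should clearly exceed feature 1
```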

Hidden feature
Weight by Correlation is a feature selection methodology within the framework of RapidMiner Studio [36]. This approach focuses on ascertaining the salience of features by quantifying their correlation with the target variable [37]. By assigning weights to individual features, as shown in Fig. 4, based on their correlation coefficients, "Weight by Correlation" prioritizes those features that exhibit stronger correlations. This weighting mechanism [38] facilitates the identification and selection of the most influential features, thereby enhancing the efficacy and precision of data analysis and modeling processes within RapidMiner Studio.
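A simple analogue of "Weight by Correlation", assuming numeric features and a binary 0/1 tumor-status label, is to weight each feature by the absolute Pearson correlation of its values with the label; the feature names here are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((376, 4)),
                 columns=["afp", "albumin", "bilirubin", "platelet_count"])
y = pd.Series(rng.integers(0, 2, size=376), name="tumor_status")

# Each feature's weight is its absolute Pearson correlation with the label.
corr_weights = X.apply(lambda col: abs(col.corr(y)))
print(corr_weights.sort_values(ascending=False))
```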

Feature selection
Normalization is a technique employed to rescale values to fit within a specific range. It is particularly crucial when handling attributes that possess varying units and scales [39, 40]. The significance of data normalization in developing precise predictive models has been investigated across multiple machine learning algorithms [41], including nearest neighbors (NN) [42], artificial neural networks (ANN) [43], and support vector machines (SVM) [44]. Several researchers have confirmed the positive impact of data normalization on enhancing classification performance in various domains [45]. Examples include medical data classification [46, 47], multimodal biometric systems [48], vehicle classification [49], motor detection [50], stock market prediction [51], leaf classification [52], credit approval data classification [53], genomics [54], and other application areas [55, 56]. The purpose of the normalization operator is to perform the normalization process on selected attributes. There are four available normalization methods, with the "range transformation" method being utilized in this case. This method normalizes all attribute values to a specified range [57]. Upon selecting this method, two additional parameters, namely "min" and "max", become visible in the parameters panel. The largest value in the attribute set is assigned to "max", while the smallest value is assigned to "min". All other values are proportionally scaled to fit within the provided range. It is worth noting that this method may be affected by outliers, as the boundaries adjust towards them. However, it retains the original distribution of the data points, making it suitable for data anonymization purposes as well.
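As a sketch of this range transformation, the function below linearly maps each attribute so that its minimum lands on "min" and its maximum on "max" (here the usual [0, 1] range); scikit-learn's MinMaxScaler behaves equivalently, and no exact RapidMiner parameter values are assumed.

```python
import numpy as np

def range_transform(X, new_min=0.0, new_max=1.0):
    """Rescale each column so its smallest value maps to new_min and its
    largest to new_max, preserving the relative spacing of the values."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # guard constant columns
    return new_min + (X - col_min) / span * (new_max - new_min)

X = np.array([[1.0, 200.0],
              [3.0, 400.0],
              [5.0, 1000.0]])
print(range_transform(X))   # each column now spans exactly [0, 1]
```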
Optimized selection is a valuable technique utilized in RapidMiner. This approach plays an essential role in streamlining the model-building process by automatically identifying and selecting the most relevant subset of features from a given dataset [58, 59]. By leveraging optimization algorithms and statistical measures, RapidMiner's optimized selection functionality aims to enhance both the efficiency and efficacy of predictive models. The process of optimized selection involves iteratively evaluating different feature subsets and assessing their impact on the model's performance [60]. The operator, as shown in Fig. 5, implements two deterministic greedy feature selection algorithms: "forward selection" and "backward elimination".
The goal of the forward selection algorithm is to generate the most effective subset of features while disregarding irrelevant and insignificant ones [61-63]. It begins by creating an initial population of n individuals, where n represents the number of attributes in the input example set. Each individual in the population uses only one feature. The attribute sets are then evaluated, and the top k sets are selected based on their performance. For each of the k selected sets, the algorithm proceeds as follows: if there are j unused attributes, j copies of the attribute set are made, and exactly one previously unused attribute is added to each copy of the set. The algorithm continues to the next step as long as there has been an improvement in performance in the last p iterations. The backward elimination technique begins with an attribute set that includes all features [64, 65]. It evaluates all attribute sets and chooses the top k sets based on their performance. For each of the selected k sets, the algorithm proceeds as follows: if there are j attributes currently used, j copies of the attribute set are made, and exactly one previously used attribute is removed from each copy of the set. The algorithm continues to the next step as long as there has been an improvement in performance in the last p iterations. The parameters of the operators used in RapidMiner are listed in Table 2.
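A hedged scikit-learn sketch of greedy wrapper selection in the same spirit (plain forward selection and backward elimination around a decision tree, rather than RapidMiner's population-based variant with k parallel sets) is shown below; all parameters are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((376, 20))                 # placeholder feature matrix
y = rng.integers(0, 2, size=376)          # placeholder tumor-status labels

base = DecisionTreeClassifier(random_state=0)
forward = SequentialFeatureSelector(base, n_features_to_select=8,
                                    direction="forward", scoring="accuracy", cv=5)
backward = SequentialFeatureSelector(base, n_features_to_select=8,
                                     direction="backward", scoring="accuracy", cv=5)

print("forward keeps:", forward.fit(X, y).get_support(indices=True))
print("backward keeps:", backward.fit(X, y).get_support(indices=True))
```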
Before feature reduction, machine learning models often face challenges such as high dimensionality and redundant or irrelevant features [66-68]. These issues can negatively impact both accuracy and execution time. With a large number of features, models may struggle to extract meaningful patterns from the data, leading to overfitting or poor generalization. Additionally, the computational complexity of training and inference increases significantly with the number of features. However, after feature reduction techniques were applied, such as dimensionality reduction or feature selection, the models experienced improved performance in terms of accuracy, as shown in Fig. 6, and execution time, as shown in Fig. 7.
Tables 3 and 4 present a summary of the application of various deep learning and machine learning techniques on the TCGA LIHC clinical variables dataset for predicting hepatocellular carcinoma (HCC). This summary includes the performance of these techniques both before and after feature reduction methods were applied. The algorithms utilized in this study encompassed Naive Bayes, neural network, decision tree, SVM, and KNN. The primary focus of the evaluation was on the prediction of HCC nodules. The results indicate that both the deep learning and machine learning models exhibited outstanding performance after the implementation of feature reduction methods.
Before feature reduction, our neural network model trained slowly, achieving an accuracy of 76.00% at the cost of roughly 5 min of execution time. This sluggishness stemmed from the model struggling to navigate a high-dimensional feature space, often getting caught up in irrelevant or redundant information. After applying feature reduction techniques, however, the model achieved a remarkable 96% accuracy in a mere 1 min and 10 s. This drastic improvement is a testament to the power of feature reduction: by eliminating noisy and superfluous features, we cleared the path for the model to focus on the truly meaningful relationships within the data, resulting in a more accurate and efficient learning process. This optimization paves the way for faster real-time predictions, reduced computational costs, and ultimately a more robust and deployable model. Applying feature reduction techniques to the Naive Bayes model also yields notable enhancements in both accuracy and execution time. Specifically, the model achieves an impressive accuracy rate of 97.33%. Moreover, the execution time is reduced to a mere 49 s, showcasing the model's enhanced efficiency in processing and making predictions. These improvements highlight the effectiveness of feature reduction in optimizing the Naive Bayes model's performance, resulting in superior accuracy and faster execution times.
Before implementing feature reduction, the decision tree model attains a commendable accuracy of 90.67% but requires a relatively lengthy execution duration of 4 min and 12 s. Following the application of feature reduction techniques, however, the model undergoes noteworthy enhancements. It accomplishes an impressive accuracy rate of 96%, demonstrating improved precision when classifying instances. Furthermore, the execution time is significantly reduced to a mere 1 min and 9 s. These enhancements underscore the efficacy of feature reduction in optimizing the performance of the decision tree model, leading to substantially higher accuracy and faster execution. Moreover, both the SVM and KNN models exhibit strong accuracy, with the SVM model achieving 96.00% accuracy and the KNN model achieving 94.67% accuracy. Notably, the execution times for these models are 1 min and 48 s for SVM and 1 min and 2 s for KNN, respectively.

Discussion
A multitude of machine learning algorithms have been developed for the prediction of hepatocellular carcinoma. The study [69] explores using a combination of machine learning techniques (ensemble learning) to predict how long hepatocellular carcinoma (HCC) patients survive. The researchers test fifteen different models, each involving data cleaning, reducing unnecessary features, and then classifying patients based on their predicted survival time. To identify the most important factors, they use four methods: LASSO regression, Ridge regression, a genetic algorithm, and a random forest. Only the most influential factors are used for prediction. The models they build include variations of Nu-Support Vector Classification, Ridge Classification (RCV), and Gradient Boosting Ensemble Learning (GBEL), each combined with either L1 or L2 regularization or optimized by a genetic algorithm or random forest. These models are evaluated based on how accurately they predict survival, using metrics like accuracy, sensitivity, and area under the curve (AUC).
Their findings show that the RFGBEL model (Random Forest combined with Gradient Boosting Ensemble Learning) performs best compared to the others. This model achieves an accuracy of over 93% and a high AUC score of 0.932, indicating strong prediction capabilities. Finally, they compare their RFGBEL model to existing methods and demonstrate its superior ability to predict HCC patient survival. Also, researchers in the study [70] propose a new NCA-GA-SVM model for predicting HCC survival. This model combines known high-performing techniques (NCA, GA) to improve SVM classification. It achieved a high accuracy of 96.36% on a dataset of 165 patients.
The study [71] developed a highly accurate model for diagnosing liver cancer (HCC) that leverages a combination of personalized biological pathways and machine learning. The model achieved exceptional performance in internal testing (AUROC > 0.98) and demonstrated good generalizability to external data. These results suggest the model has great potential for real-world application in HCC diagnosis. Kiani et al. [72] used microscopic images from the TCGA dataset and utilized a convolutional neural network (CNN) tool named the "Liver Cancer Assistant", which accomplished precise discrimination between hepatocellular carcinoma (HCC) and cholangiocarcinoma. Notably, the model achieved a diagnostic accuracy of 0.885, highlighting its efficacy in accurately identifying and distinguishing between these two distinct forms of liver cancer.
In a study conducted by Wang et al. [73], a deep learning technique involving a convolutional neural network (CNN) was utilized to automate the identification and classification of individual nuclei in tissue images. The CNN was trained using H&E-stained tissue sections of hepatocellular carcinoma (HCC) tumors from the TCGA dataset. Subsequently, a process of feature extraction was carried out, resulting in the identification of 246 quantitative image features. Using an unsupervised learning approach, a clustering analysis was performed, which yielded intriguing results: the analysis unveiled the existence of three distinct histologic subtypes within the HCC tumors. Importantly, these subtypes were found to be unrelated to previously established genomic clusters and exhibited different prognoses. This study demonstrated the potential of CNN-based image analysis in revealing unique histologic subtypes, offering valuable insights into the prognosis of HCC tumors. Table 5 summarizes studies of patients with hepatocellular carcinoma based on the TCGA LIHC dataset, listing models proposed by different authors for various HCC-related problems.
In this work, we proposed an approach that aims to improve the prediction of hepatocellular carcinoma (HCC) through a comprehensive strategy involving multiple steps. These include reducing the number of features used in the prediction model through methods such as analyzing feature importance, exploring hidden feature correlations, and employing various algorithms for HCC prediction using clinical variables. We utilized the TCGA LIHC clinical variables, but the data needed to be cleaned to address inconsistencies, missing values, and errors. The data were then formatted and prepared for further analysis, which involved scaling the data to a common range, encoding categorical variables, and performing feature engineering to create new features from existing ones. After identifying the optimized feature subset, the dataset was split into two sets: a training set with 301 examples and a testing set with 75 examples. This division was performed using stratified sampling, which ensures that random subsets are created while maintaining a class distribution consistent with the overall class distribution in the entire dataset. In other words, stratified sampling helps to preserve the proportional representation of different classes during the creation of training and testing sets, which is essential for maintaining the integrity of the dataset and ensuring reliable model evaluation. The application of feature reduction techniques to the Naive Bayes model leads to significant improvements in accuracy and execution time. With these techniques implemented, the model achieves an impressive accuracy rate of 97.33%. Additionally, the execution time is drastically reduced to just 49 s, demonstrating the enhanced efficiency of the model in processing and making predictions. These enhancements clearly illustrate the effectiveness of feature reduction in optimizing the performance of the Naive Bayes model, resulting in higher accuracy and faster execution times.

Limitations
Although machine learning and deep learning have shown promise in various medical applications, including hepatocellular carcinoma (HCC) prediction, there are several limitations associated with their use in this context. One major limitation is the requirement for large and high-quality datasets. Machine learning algorithms, including deep learning models, heavily rely on vast amounts of well-curated data to learn patterns and make accurate predictions. However, acquiring such datasets for HCC prediction can be challenging due to the rarity of the disease and the need for comprehensive clinical and imaging data. The limited availability of annotated HCC datasets hampers the development and evaluation of robust models.
Interpretability and explainability are crucial in medical decision-making, and this is another limitation of deep learning models. While these models have demonstrated remarkable predictive capabilities, they often function as black boxes, making it difficult to understand the underlying reasons behind their predictions. This lack of interpretability raises concerns in medical settings, where clinicians need to have confidence in the decision-making process and understand the factors contributing to a prediction.
The generalizability of machine learning and deep learning models can also be a limitation. Models trained on specific populations or datasets may not perform as well when applied to different patient populations or settings. The heterogeneity of HCC, including variations in tumor characteristics, genetic profiles, and patient demographics, can introduce challenges in developing models that can effectively predict HCC across diverse populations. Furthermore, the potential for bias in machine learning models is another limitation. Biases can be introduced during the data collection process, such as underrepresentation of certain demographic groups or confounding factors. If the models are trained on biased datasets, they may perpetuate or even amplify existing biases, leading to inaccurate predictions and disparities in healthcare outcomes.

Conclusion and future work
In conclusion, this study focused on the prediction of hepatocellular carcinoma (HCC), a prevalent form of liver cancer, using machine learning algorithms. The objective was to assess the effectiveness of feature reduction techniques in enhancing the performance of HCC prediction models. By comparing the performance of various machine learning algorithms on both the original high-dimensional dataset and a reduced feature subset, this study demonstrated that feature reduction significantly improves the accuracy and execution time of HCC prediction models. The employed feature reduction techniques, including weighting features, hidden feature correlation, feature selection, and optimized selection, helped extract a reduced feature set that captured the most relevant information related to HCC. The experimental results obtained from a comprehensive dataset of clinical features of HCC patients showed that the reduced feature set consistently outperformed the original high-dimensional dataset in terms of prediction accuracy. The decision tree, Naive Bayes, k-nearest neighbors, neural network, and support vector machine (SVM) algorithms achieved accuracies of 96%, 97.33%, 94.67%, 96%, and 96%, respectively, after applying feature reduction techniques. These findings suggest that feature reduction methods can be effectively employed in HCC prediction models, leading to improved accuracy and faster execution times. The application of machine learning algorithms, combined with feature reduction techniques, holds great potential for the early diagnosis and effective treatment of HCC, ultimately improving patient outcomes.
While current models using clinical variables for HCC prediction show promise, there are several areas of future work to improve accuracy, personalize risk assessment, and ultimately guide better patient outcomes. One direction is integrating multimodal data by combining clinical data with other modalities such as genetic information, imaging data (MRI, CT scans), and blood-based biomarkers; deep learning models can be particularly adept at handling such diverse data sources. Models should also be trained and validated on large, geographically diverse datasets to ensure generalizability and avoid overfitting to specific populations, and they should account for the presence of other chronic conditions, such as diabetes or hepatitis, that may influence HCC development. Finally, models that incorporate longitudinal data (changes in clinical variables over time) could predict changes in risk and identify high-risk patients earlier. By focusing on these directions, we can improve the accuracy and clinical utility of HCC prediction models using clinical variables, leading to earlier detection, better risk stratification, and ultimately improved patient outcomes.

Fig. 1
Fig. 1 Hepatocellular carcinoma risk factors history in TCGA LIHC data set

Fig. 2
Fig. 2 Outline of data reduction workflow for Hepatocellular carcinoma Prediction

Fig. 3
Fig. 3 Illustration of patients with obesity vs. the number of family relatives with a history of cancer

Fig. 4
Fig. 4 Illustration of assigning weights to individual features based on their correlation coefficients

Fig. 5
Fig. 5 Normalize and Optimize selection operators in RapidMiner

Fig. 6
Fig. 6 Performance of used algorithms for HCC prediction on the TCGA LIHC clinical variables dataset after feature reduction methods

Fig. 7
Fig. 7 Execution time of used algorithms for HCC prediction, before and after feature reduction methods, in seconds

Table 1
Information about the features of the TCGA dataset clinical variables

Table 2
Information about the parameters of used operators in RapidMiner

Table 3
Performance comparison when using each of the deep learning and machine learning algorithms for HCC prediction on the TCGA LIHC clinical variables dataset, before feature reduction methods

Table 4
Performance comparison when using each of the deep learning and machine learning algorithms for HCC prediction on the TCGA LIHC clinical variables dataset, after feature reduction methods

Table 5
Studies of patients with hepatocellular carcinoma based on the TCGA LIHC dataset