Advanced machine learning techniques for cardiovascular disease early detection and diagnosis



Introduction
The heart is the second-most important organ in the human body, after the brain. A failing heart ultimately disrupts the entire body. We are living in the modern era, and the world around us is undergoing significant transformations that affect our day-to-day lives. Heart disease, which claims lives around the world, ranks among the top five deadly diseases [1]. Forecasting this disease is of the utmost importance because it enables us to take the necessary steps at the right time.
Cardiovascular Diseases (CVD) are a group of heterogeneous diseases that affect the heart and circulatory system, causing a variety of ailments that are typically brought on by atherosclerosis. Typically, CVD are chronic in nature and progressively manifest over time, remaining asymptomatic for long periods before becoming advanced and presenting symptoms of varying intensity [2][3][4]. According to reports from the World Health Organization (WHO), CVD has been the leading cause of premature death in the world for decades, and it is expected that by 2030 CVD will be responsible for around 23.6 million deaths annually.
In addition, the cost of treating cardiovascular disease, its future consequences, and early death, as measured by Disability Adjusted Life Years (DALYs), entails a significant economic burden [5][6][7]. Many factors contribute variably to the development of cardiovascular disease; these factors can be classed as modifiable and non-modifiable risk factors [5,8]. Age, gender, and inherited variables are factors that cannot be modified. The other category, referred to as modifiable risk factors, comprises fasting blood sugar, high blood pressure, serum cholesterol, smoking, dietary propensity, obesity, and physical inactivity [9,10].
Individuals can prevent the progression of CVD by identifying modifiable risk factors and attempting to change lifestyle-related risk factors into healthy ones. Chest discomfort, arm pain, shortness of breath, dizziness, weariness, and perspiration are among the early warning signs of a heart attack [11]. Patients with heart disease do not have symptoms in the early stages of the disease, but they do in later stages, when it is sometimes too late to manage or treat [12][13][14]. Therefore, despite the difficulty, rapid recognition and prediction of CVD susceptibility in seemingly healthy people is essential for assessing prognosis. For early diagnosis of CVD, it will be incredibly beneficial and necessary to analyze the significant CVD-related health information contained in the huge databases of hospital records. Thus, machine learning algorithms and other intelligent-system techniques are beneficial in this field, and their findings are reliable and accurate [15][16][17].
The field of machine learning enables the identification of concealed patterns and the establishment of analytical structures, including clustering, classification, regression, and correlation, through the integration and application of various techniques, such as machine learning models, neural networks, and information retrieval [18][19][20]. Consequently, machine learning techniques have demonstrated great potential to support clinical decision-making, aid in the development of clinical guidelines and management algorithms, and promote the establishment of evidence-based clinical practices for the management of Cardiovascular Diseases (CVDs) [21][22][23][24][25][26][27]. Furthermore, the early detection of CVDs using machine learning techniques can reduce the need for extensive and expensive clinical and laboratory investigations, reducing the financial burden on both the healthcare system and individuals [28,29].
Cardiovascular disease is a chronic syndrome that can result in heart failure, a critical condition characterized by impaired heart function, and in symptoms such as compromised blood vessel function and coronary artery infarction [30]. According to the World Health Organization (2021), cardiovascular diseases are a set of heart and blood vessel abnormalities and one of the main causes of death worldwide. Accounting for almost 18 million deaths, cardiovascular disease was responsible for 32% of all deaths worldwide [31]. Heart attacks and strokes accounted for 85% of these deaths, with 38% occurring in individuals younger than 70. In the treatment and management of cardiovascular disorders, early detection is crucial, and machine learning (ML) can be a useful tool for recognizing a probable heart disease diagnosis [17,32].
Heart disease, also identified as cardiovascular disease, is a leading cause of death worldwide. The cardiac muscle is responsible for the circulation of blood around the body [33]. Although machine learning methods have demonstrated intriguing results in forecasting certain medical disorders, they have not been applied to the prediction of individual CVD survival in hypertensive patients using routinely collected big digital electronic administrative health data [34]. If a machine learning algorithm can exploit such large administrative data sets, it may be possible to optimize the use of accumulated data to support predicting patient outcomes, planning individualized patient care, monitoring resource utilization, and improving institutional performance. Comorbidity status, demographic information, laboratory test results, and medication information would improve prognostic evaluation and direct treatment decisions for hypertension patients [35].
In this study, we propose a Gradient Boosting model to predict the existence of cardiovascular disease and to identify the most predictive features based on their Shapley values. Afterward, a number of Machine Learning and Deep Learning techniques are used to analyze cardiovascular disease. Below are the main contributions of this study:
• Utilizing cross-validation and split validation, discover a machine learning algorithm with improved performance that will be applied to the detection of cardiovascular disease.

Related work
Many researchers have examined a number of heart disease prediction frameworks using various data mining techniques. They used different datasets and algorithms, reported test findings and possible future work on each framework, and achieved increasingly productive results. Researchers have completed numerous research attempts to develop efficient techniques with high accuracy for recognizing disorders associated with the heart.
Pattekari [36] created a model using the Naive Bayesian data mining method. It is a computer program in which the user answers predetermined questions. It pulls hidden information from a dataset and compares user values to a preset data set. It can provide answers to difficult questions regarding heart disease diagnosis, allowing medical service providers to make more informed clinical decisions than conventional decision support systems. It also helps reduce treatment expenses by providing effective treatments.
Tran [37] built an intelligent system using the Naive Bayes data mining modeling technique. It is a web application in which the user answers pre-programmed questions. It searches a database for hidden information and compares user values to a trained data set. It can provide answers to difficult questions about cardiac disease diagnosis, allowing healthcare professionals to make more informed clinical decisions than traditional decision support systems. It also lowers treatment costs by delivering effective care.
Gnaneswar [38] demonstrates the significance of monitoring the heart rate when cycling. Cyclists can manage cycling sessions, such as cycling cadence, to identify the level of activity by monitoring their pulse while pedaling. By managing their pedaling exertion, cyclists can avoid overtraining and cardiac failure. The cyclist's pulse can be used to determine the intensity of an exercise, and the pulse can be measured using a wearable sensor. Unfortunately, the sensor does not capture all information at regular intervals, such as every one or two seconds. Consequently, a pulse prediction model is needed to fill in the gaps.
Gnaneswar's [38] work aims to use a feedforward neural network to construct a predictive model for pulse as a function of cycling cadence. At each second, pulse and cadence are the inputs, and the output is the predicted pulse for the following second. Using a feedforward neural network, the relationship between pulse and cycling cadence is represented statistically. Mutijarsa [39] discusses the expansion of medical care services on this basis. Numerous breakthroughs in remote communication have been made in anticipation of cardiac illness, and utilizing data mining (DM) techniques for the detection and localization of coronary disease is highly useful. In their assessment, a comparative analysis of multiple single and hybrid data mining algorithms is conducted to determine which one most accurately predicts coronary disease.
Yeshvendra [40] argues that the use of AI algorithms in forecasting various diseases is growing. This notion is significant and diverse because of the ability of an AI algorithm to adopt a perspective comparable to a human's for improving the accuracy of coronary disease prognosis. Patil [41] notes that a proper diagnosis of cardiac disease is one of the most fundamental biomedical concerns that must be addressed, and applies three data mining techniques: Support Vector Machine, Naive Bayes, and Decision Tree. These techniques were used to create a decision support system. Tripoliti [42] argues that the identification of diseases with large prevalence rates, such as Alzheimer's, Parkinson's, diabetes, breast cancer, and coronary disease, is one of the most fundamental biomedical challenges demanding immediate attention. Gonsalves [43] attempted to forecast coronary CVD using machine learning and historical medical data. Oikonomou [44] provides an overview of the varieties of information encountered in chronic disease settings. Using multiple machine learning methods, they elucidated extreme value theory in order to better measure chronic disease severity and risk.
According to Ibrahim [45], machine learning-based systems can be utilized for predicting and diagnosing heart disease. Active learning (AL) methods enhance the accuracy of classification by integrating user-expert system feedback with sparsely labeled data. Furthermore, Pratiyush et al. [46] explored the role of ensemble classifiers over the XAI framework in predicting heart disease from CVD datasets. The proposed work employed a dataset comprising 303 instances and 14 attributes, with categorical, integer, and real attribute types, and the classification task was based on techniques such as KNN, SVM, Naive Bayes, AdaBoost, bagging, and LR.
The literature has attempted to create strategies for predicting a cardiac disease diagnosis. Because of the high dimensionality of textual input, many traditional machine learning algorithms fail to incorporate it into the prediction process [47][48][49][50][51][52][53]. As a result, this paper investigates and develops a set of robust machine learning algorithms for improving the early prediction of CVD development, allowing for prompt intervention and recovery.

Methodology
This section describes the suggested classification scheme for heart disease instances. Initially, exploratory analysis is conducted. A comprehensive analysis is undertaken on both the target and the features, and categorical variables are converted to numeric values. Various criteria are utilized to compare the models under consideration. The outputs of each model are analyzed, and the optimal model for the problem at hand is selected. The proposed model is thoroughly examined, and the Optuna library is used to tune the model hyperparameters and measure how much they have been enhanced. The suggested model is divided into three phases: (1) pre-processing, (2) training, and (3) classification, as shown in Fig. 1. In the following sections, the authors examine each of these components in further depth.

Pre-processing
Before training the selected models, it is important to address the Cholesterol missing values that were initially input as 0. To accomplish this, the data is separated into groups based on the presence of a verified cardiac condition, and the mean of each group is used to fill in the missing values. To assess whether these variables are influential in predicting heart disease based on their Shapley values, interaction terms were added to the models to capture any possible correlations between the data elements. SHAP (SHapley Additive exPlanations) employs game theory to identify the significance of each characteristic and can be used to explain both individual model predictions and aggregated model results. SHAP determines the magnitude of each predictor's contribution to the model's output by averaging the marginal contributions of each feature over all feasible feature combinations.

Fig. 1 The main steps of the proposed methodology
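The group-wise imputation described above can be sketched as follows. The column names `Cholesterol` and `HeartDisease` match the dataset described later, but the helper itself is an illustrative assumption rather than the authors' exact code:

```python
import numpy as np
import pandas as pd

def impute_cholesterol_by_group(df, value_col="Cholesterol", group_col="HeartDisease"):
    """Replace values recorded as 0 with the mean of the non-zero values
    within each heart-disease group (illustrative helper)."""
    out = df.copy()
    out[value_col] = out[value_col].replace(0, np.nan)   # treat 0 as missing
    out[value_col] = out.groupby(group_col)[value_col].transform(
        lambda s: s.fillna(s.mean()))                    # fill with the group mean
    return out

# Tiny illustrative frame: one missing value per group
demo = pd.DataFrame({"HeartDisease": [0, 0, 1, 1], "Cholesterol": [200, 0, 300, 0]})
imputed = impute_cholesterol_by_group(demo)
```

The key design choice is computing the mean within each diagnosis group rather than globally, so the imputed value reflects the typical cholesterol level of patients with the same outcome.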
Before performing feature selection using Shapley values, a gradient boosting model containing all variables is trained. The final predictors are selected from the characteristics with a Shapley value greater than 0.1 that contribute significantly to the model's prediction. Then, these predictors are used to establish the most effective model. Due to the multicollinearity between the interaction variables, a variety of nonparametric tree-based methods for predicting the risk of CVD are explored to discover the most accurate method.

Training process
The machine learning algorithms are trained after preprocessing and normalizing the datasets. Following this preparation, the data is randomly split into a training set and a test set, with 70% of the rows assigned to the training set and 30% to the test set. K-fold cross-validation, a long-established and common validation method, entails running a number of related experiments to determine the model's typical accuracy metric. To examine the proposed strategy, such machine learning procedures as SVC [54], MultinomialNB [55], K-Neighbors [56], BernoulliNB [55], SGD [57], Random Forest [58], and Decision Tree [59] are deployed and compared.
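As a sketch of this training setup, the 70/30 split and the candidate scikit-learn classifiers can be wired up as follows. A synthetic stand-in for the heart data is used here, and MultinomialNB is omitted because it requires non-negative inputs:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: same row count as the heart dataset (918) and 11 features
X, y = make_classification(n_samples=918, n_features=11, random_state=42)

# 70% training / 30% testing, stratified so class proportions are preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

models = {
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "BernoulliNB": BernoulliNB(),
    "SGD": SGDClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```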
XGBoost (Extreme Gradient Boosting) is a supervised learning method for improving prediction accuracy by combining multiple decision trees. XGBoost iteratively adds decision trees using gradient boosting, with each subsequent tree attempting to correct the errors of the previous trees. The final prediction is the weighted sum of all the individual tree predictions. XGBoost's objective function includes a loss function as well as a regularization term, which helps to prevent overfitting. The XGBoost objective function is:

Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)   (1)

where l is the loss function, y_i is the true label for example i, ŷ_i^{(t−1)} is the predicted value from the previous iteration, f_t(x_i) is the prediction of the t-th tree for example i, and Ω(f_t) is the regularization term.
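A minimal sketch of this additive-trees idea, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (the xgboost package exposes an analogous XGBClassifier interface; data and hyperparameter values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each of the 100 trees fits the gradient of the loss w.r.t. the current
# ensemble prediction; learning_rate shrinks each additive step f_t.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5,
                                max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)
acc = gb.score(X_te, y_te)
```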
AdaBoost (Adaptive Boosting) is another boosting algorithm that also uses decision trees as weak learners. AdaBoost assigns weights to each training example, with higher weights given to examples that were misclassified by the previous weak learner. In each subsequent iteration, a new decision tree is trained on the weighted data, with the weights updated based on the accuracy of the tree. The final prediction is the weighted sum of the predictions of all the individual trees. The AdaBoost prediction function is:

H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)   (2)

where T is the total number of trees, h_t(x) is the prediction of the t-th tree for input x, and α_t is the weight assigned to the t-th tree. The Linear Support Vector Classifier (SVC) employs a linear kernel function to classify data and works very well with large datasets [54]. The Linear SVC has additional restrictions, such as the normalization of the penalty and the loss function. Because Linear SVC is built on a fixed linear kernel, the kernel cannot be changed. A Linear SVC handles the data by returning the "best fit" hyperplane that partitions or sorts it. After obtaining the hyperplane, the features are fed to the classifier, which predicts which class they belong to.
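AdaBoost's reweighting loop is available directly in scikit-learn; the sketch below uses the default depth-1 decision tree (a stump) as the weak learner h_t, with illustrative data and hyperparameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=400, random_state=1)

# Each stump h_t is trained on reweighted data; alpha_t (estimator_weights_)
# is derived from the stump's weighted error rate.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.25, random_state=1)
ada.fit(X, y)
train_acc = ada.score(X, y)
```

Lower `learning_rate` values shrink each α_t, trading slower fitting for better generalization; the value 0.25 mirrors the AdaBoost setting reported later in the experiments.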
The Naive Bayes algorithm assigns equal weight to all features or qualities, and it becomes more efficient because one attribute is assumed to have no effect on another. According to Yasin (2020), the Naive Bayes classifier (NBC) is a simple, effective, and well-known text categorization algorithm. NBC has used the Bayes theorem to classify documents since the 1950s, and it is theoretically sound. A posterior estimate is used to determine the class with the Naive Bayes classifier; instances are assigned to the class with the highest conditional probability.
Bernoulli Naive Bayes is a statistical technique that produces boolean results based on the presence or absence of required text. The discrete Bernoulli distribution is fed into this classifier. When identifying an unwanted keyword or tagging a specific word type within a text, this type of Naive Bayes classifier is useful. It is also distinct from the multinomial approach in that it handles binary values such as 1-0, True-False, or Yes-No. A stochastic system or procedure is one that incorporates randomness. Stochastic Gradient Descent (SGD) samples a few data examples at random rather than using the entire dataset in each iteration. As a consequence, rather than calculating the sum of the gradients over all instances, each iteration calculates the gradient of the cost function for a single example. SGD is an iterative method for optimizing a differentiable or sub-differentiable objective function with suitable smoothness properties.
Decision Tree is a widely known machine learning technique in which data is repeatedly partitioned based on specific parameters. The tree has two traversable entities: nodes and leaves. Leaves represent decisions or outcomes, whereas decision nodes partition the data [59]. Decision trees can be used in combination to solve problems (ensemble learning). The Random Forest algorithm resolves the overfitting issues associated with decision tree algorithms. The algorithm is capable of dealing with regression and classification problems, as well as evaluating a large number of attributes to determine which ones are most important. Random forests can learn from raw data without carefully planned data transformations [58].
The K-Nearest Neighbor (K-NN) algorithm classifies new observations based on their distances from known examples. Based on the majority vote of its neighbors, with a distance function as the measuring tool, the case is assigned to the class with the highest frequency among its k nearest neighbors. In classification problems, k-NN returns the class membership, whereas in regression problems it returns the object's property value; whether k-NN is used for classification or regression thus affects the output. Because this method relies on distance for classification, normalization of the training data can dramatically improve performance. If the features correspond to different physical units or scales, standardization can significantly enhance accuracy [56].
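The scaling point above is usually enforced by fitting the scaler and the k-NN classifier as one pipeline, so the standardization learned on each training fold is reapplied to the corresponding test fold (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=2)

# Standardizing first keeps large-scale features (e.g. cholesterol in mg/dL)
# from dominating the Euclidean distance over small-scale ones (e.g. Oldpeak).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv_scores = cross_val_score(knn, X, y, cv=5)
```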

Classification
The proposed model is based on machine learning with strong generalization capabilities and a high degree of paradigm-specific precision. In this study, we evaluate a number of machine learning algorithms and establish objectively which one delivers the greatest results. A primary purpose of this design is to combat the problem of overfitting that occurs in machine learning; the scheme also incorporates the concept of structural risk minimization. The models can separate the classes, particularly in higher-dimensional space, and suggest a hyperplane with the largest possible separation. In this stage, labeled data is used as input, and the most significant characteristics are extracted using a feature extraction process. Finally, the optimal model is used to categorize new instances of data.

Experimental evaluation
In the experiments of the study, we utilized Google Colab as the implementation platform for the machine learning models. Google Colab is a cloud-based Jupyter notebook environment that offers free access to computing resources. The platform provides a virtual machine that runs on Google's servers and gives users access to a Python environment with popular data science libraries such as TensorFlow, PyTorch, and Scikit-Learn. The virtual machine offers 12 GB of RAM and up to 100 GB of hard disk space by default; the allocated memory can reach 25 GB, and a high-RAM option of up to 52 GB can be enabled for large-scale models or data. The virtual machine is equipped with an NVIDIA Tesla K80 GPU, enabling us to train deep learning models efficiently. Additionally, Google Colab provides a wide range of preinstalled libraries and tools, making it easy to install and use the necessary dependencies. The virtual machine runs Linux Ubuntu, which comes preinstalled with various system libraries and tools commonly used in data science projects, ensuring that the implementation environment is stable and reliable.
The following subsections discuss the dataset and the results of the machine learning models.

Data collection
The heart condition data utilized in this study is a synthesis of data sets from the UCI Machine Learning Repository and contains eleven features that can be used to forecast the existence of heart failure, a prevalent cardiovascular disease that significantly raises the probability of a CV-related mortality [60,61]. The target variable is described in Table 1. Moreover, Table 2 presents the list of variables and the description of the features in the heart disease dataset. The dataset was created by combining a diverse range of datasets that were previously available independently and had not been combined before [60,61]. In this dataset, five heart datasets are combined over 11 common features, which makes it the largest heart disease dataset accessible for research purposes. The specific datasets utilized in the curation of this composite dataset are shown in Table 3.
The Heart Disease dataset has 918 observations and 12 columns [60,61]. Regarding the statistics of the categorical attributes, the ChestPainType attribute has 4 unique values, of which the most frequent is "ASY". Table 6 summarizes the main details for the numeric features. It is clear that the variable Sex has two values, male (M) and female (F), such that the proportion of heart disease for M is 90.2% and for F is 9.8%. Similarly, Table 6 presents the statistics of the ChestPainType attribute: there are 4 values (ASY, NAP, ATA, and TA), and the most frequent is ASY at 77.2%.

Exploratory data analysis
Remarkably, the classes of the heart disease attribute are reasonably well balanced: 508 of the 918 patients who participated in the study have been diagnosed with heart failure, while 410 have not. Patients with heart disease have a median age of 57, whereas those without heart disease have a typical age of 51. As illustrated in Fig. 2, around 63% of males have heart disease, whereas approximately 25% of females have been diagnosed with heart disease; a female has a 25.91% probability of having heart disease, and a male a probability of 63.17%. Figure 3 demonstrates the heart disease ranges for Age, Systolic Blood Pressure, Cholesterol, Heart Rate, and ST Segment Depression. As depicted by the Age boxplot, heart disease patients fall between the ages of 51 and 62, with a few younger outliers below the lower margin. Individuals free of cardiovascular disease have an age range that is slightly more variable but more evenly distributed, with no outliers; the vast majority of patients in this category are quite young, with ages ranging from 43 to 57 [62].

Fig. 2 Prevalence of heart disease among men and women
Fig. 3 The distributions of heart disease for age, systolic blood pressure, cholesterol, heart rate and ST segment depression

Furthermore, the boxplots of the Systolic Blood Pressure variable are extremely similar between the groups. Both have upper and lower outliers, with the vast majority of patients' blood pressure falling between 120 and 145 mmHg. As demonstrated in Fig. 3, the median blood pressure in both groups is roughly 130 mmHg. Also, the distribution of the Cholesterol variable appears to be skewed to the right, particularly among individuals with heart disease, where a substantial number of observations were recorded with cholesterol values of 0. As illustrated in Fig. 3, those without heart disease have a median heart rate of 150 beats per minute, but those with heart disease have a median heart rate of 126 beats per minute.
In the case of the ST Segment Depression (Oldpeak) variable, there is a variance between the distributions of the two groups. ST depression is more variable in patients with heart disease, with numerous larger outliers. The majority of these patients exhibit ST depressions between 0 and 2 mm, with a mean of 1.2 mm. In patients without heart disease, the range is narrower, between 0 and 0.6 mm, with a median ST depression of 0 mm; however, the distribution of this group is more skewed overall, as illustrated in Fig. 3.
Figure 4 displays the correlation matrix associated with the heart disease dataset. According to the matrix, HeartDisease has the strongest positive correlation with Oldpeak (correlation = 0.4) and the strongest negative correlation with MaxHR (correlation = −0.4). Age and MaxHR also have a reasonably strong correlation of −0.38; as seen in Fig. 4, heart rate tends to decrease as age increases. Overall, the matrix shows only weak correlations between the numerical features and the target variable. Oldpeak (the ST-depression value) correlates positively with heart disease, heart disease is negatively correlated with maximal heart rate, and cholesterol has an interestingly negative association with heart disease. Nearly 80% of diabetic persons suffer heart problems. Patients with exercise-induced angina have an even greater incidence of cardiovascular disease, at over 85%. Over 65% of patients diagnosed with cardiac disease had ST-T wave abnormalities in their resting ECGs, the greatest percentage across the categories. Patients with a Flat or Down-sloping ST slope during exercise have the highest frequency of cardiovascular disease, at 82.8% and 77.8%, respectively.
Figure 6 details asymptomatic chest pain in heart disease: at almost 77%, the absence of chest pain (asymptomatic) is the most prevalent presentation in patients with heart disease. In addition, heart disease is roughly nine times more prevalent in males than in females among patients with a cardiovascular diagnosis. The overall insights obtained from the exploratory data analysis are as follows. Data for the target variable are close to balanced. The association between the numerical features and the target variable is weak. Oldpeak (the ST-depression value) correlates positively with heart disease, heart disease is negatively correlated with maximum heart rate, and, interestingly, there is a negative link between cholesterol and heart disease. Males are approximately 2.44 times more likely to suffer from heart disease than females. There are distinct variances between the types of chest pain: patients with asymptomatic chest pain (ASY) are about six times more likely to suffer heart disease than those with Atypical Angina chest pain (ATA). Resting ECG values are comparable, although patients with ST-T wave abnormalities have a higher risk of developing heart disease than those without. People who have exercise-induced angina are nearly 2.4 times more likely to have heart disease than people who do not. The slope of the ST segment at maximum exertion also varies: an upward ST slope carries a considerably lower risk of cardiovascular disease than the other two segments.

Fig. 6 Prevalence of chest pain in heart disease data

Performance evaluation
When dealing with imbalanced datasets, classification accuracy alone may not be the most suitable performance metric; therefore, authors often use additional performance metrics to address this issue [63]. The confusion matrix is frequently employed for expressing a classifier's classification results, with diagonal elements indicating correctly classified positive or negative samples and off-diagonal elements indicating misclassifications. Consequently, performance metrics such as accuracy, precision, recall (sensitivity), F1-score, and the ROC curve are employed. Accuracy, precision, recall, and F1-score can be calculated using Eqs. 3, 4, 5, and 6, respectively, based on the numbers of False Positive (FP), False Negative (FN), True Positive (TP), and True Negative (TN) samples in the test dataset [64]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (3)
Precision = TP / (TP + FP)   (4)
Recall = TP / (TP + FN)   (5)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)   (6)
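Using a hypothetical set of labels and predictions, these metrics can be computed from the confusion-matrix counts and checked against scikit-learn's implementations:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical ground truth and model output (1 = heart disease)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# ravel() on a binary confusion matrix yields (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc  = (tp + tn) / (tp + tn + fp + fn)   # Eq. (3)
prec = tp / (tp + fp)                    # Eq. (4)
rec  = tp / (tp + fn)                    # Eq. (5)
f1   = 2 * prec * rec / (prec + rec)     # Eq. (6)
```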

Machine learning models
Studies are carried out using the collected dataset, which has 918 rows. The final version of the updated data was split into training and testing sets in order to fit the model, with 70% of the data used for the learning set and 30% for the testing set. Table 7 shows the shapes of the three datasets: training, validation, and test. The training set has 504 rows and 19 columns, while the validation and test sets each have 207 rows and 19 columns. AdaBoost, Gradient Boosting, Random Forest (RF), k-nearest neighbor (KNN), Support Vector Machine (SVM), and Decision Tree classifiers are used in this study [64][65][66].
In order to develop a robust classifier with high precision, it is vital to use an appropriate evaluation approach. One such method is k-fold cross-validation, which generates diverse data samples to determine the average correctness of a model. K-fold is a commonly used cross-validation strategy, where a specified value of k is chosen, such as five, and the data is divided into k subsets of equal size.
In each iteration, one of the k subsets is used as the test set, and the remaining k − 1 subsets are used for model learning.This process is repeated until all subsets have been used as the test set once.
The k-fold cross-validation method uses the average of the computed values as a performance metric. This approach provides a reliable estimate of the model's generalization ability, which is particularly useful when the data is limited and cannot be split into separate learning and testing sets.
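The procedure described above can be sketched explicitly with scikit-learn's KFold, here with k = 5 on synthetic data; each fold serves as the test set exactly once, and the fold accuracies are averaged:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=3)

kf = KFold(n_splits=5, shuffle=True, random_state=3)
fold_acc = []
for train_idx, test_idx in kf.split(X):
    # Train on the k-1 remaining subsets, evaluate on the held-out fold
    clf = DecisionTreeClassifier(random_state=3)
    clf.fit(X[train_idx], y[train_idx])
    fold_acc.append(clf.score(X[test_idx], y[test_idx]))

mean_acc = float(np.mean(fold_acc))  # the averaged metric used for comparison
```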
Finally, the best hyperparameter values for each algorithm are determined through experimentation and optimization of the model, often through methods such as grid search and Bayesian optimization. The best hyperparameter values can serve as a starting point for developing new models or improving existing ones, as they provide insight into the values that have yielded the best performance for each algorithm. The results of the hyperparameter optimization of the machine learning models are shown in Table 8.
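The grid-search side of this tuning can be sketched as follows; the parameter grid and data are illustrative, not the paper's exact search space (the Bayesian-optimization side would use Optuna's `study.optimize` interface in an analogous way):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=4)

# Illustrative 2x2 grid; each combination is scored by 3-fold cross-validation
param_grid = {"learning_rate": [0.1, 0.5], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(n_estimators=50, random_state=4),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
best = search.best_params_  # the winning combination, as reported in Table 8
```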
Table 8 presents the results of hyper-parameter optimization for four machine learning models: Extra Trees, Random Forest, AdaBoost, and Gradient Boosting. For each model, a range of hyper-parameters was explored using cross-validation, and the best parameters were selected based on the highest accuracy and AUC scores. The accuracy and AUC scores were calculated using a hold-out test set.
The AdaBoost model achieved an accuracy of 84.06% and an AUC score of 0.897 with the best parameters of learning_rate=0.25 and n_estimators=100. Finally, the Gradient Boosting model achieved the highest accuracy of 88.9% and the highest AUC score of 0.925 with the best parameters of boosting_type='dart', colsample_bytree=1, learning_rate=0.5, max_depth=3, min_child_samples=7, min_split_gain=1e-05, num_leaves=30, and subsample=0.5. Overall, the results indicate that hyper-parameter optimization can significantly improve the performance of machine learning models, and the Gradient Boosting model performed the best on this particular dataset.
The results of the Chi-Squared test are presented in Table 9. Based on the p-values, all of which are less than 0.05, all discrete variables are included in the models as predictors.
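The Pearson chi-squared statistic behind this test can be computed directly from a contingency table of a discrete predictor against the heart-disease outcome. The counts below are hypothetical; in practice the p-value is then read from the chi-squared distribution with (r − 1)(c − 1) degrees of freedom (e.g. via `scipy.stats`).

```python
def chi_square_statistic(table):
    """Pearson chi-squared statistic for an r x c contingency table.

    Compares each observed count to the count expected under
    independence: expected = row_total * col_total / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: predictor level (rows) vs. disease status (columns).
stat = chi_square_statistic([[10, 20], [20, 10]])
```

A large statistic (small p-value) indicates the predictor and the outcome are not independent, which is the criterion the paper uses for keeping a variable in the model.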
The summary plot of Shapley values of feature importance in a machine learning model provides insights into the relative importance of different features in making predictions. The Shapley value is a concept from cooperative game theory that provides a way to fairly attribute the contribution of each feature to the final prediction. In a machine learning setting, the Shapley value of a feature represents the average contribution of that feature to the model output across all possible subsets of features. The calculation of Shapley values requires evaluating the model output for all possible subsets of features, which can be computationally expensive for high-dimensional datasets. However, there are several efficient algorithms for approximating Shapley values, such as the KernelSHAP algorithm, which is based on sampling.
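The exact subset computation can be sketched for a toy model. The feature names and effect sizes below are hypothetical, and the value function is deliberately additive, so each exact Shapley value equals the feature's own effect; real SHAP libraries approximate this sum rather than enumerating every subset.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating all feature subsets.

    phi_f = sum over subsets S not containing f of
            |S|! (n - |S| - 1)! / n! * (v(S + {f}) - v(S)),
    i.e. the weighted average marginal contribution of feature f.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for r in range(n):
            for subset in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value_fn(set(subset) | {f}) - value_fn(set(subset)))
        phi[f] = total
    return phi

# Hypothetical additive model: the output is the sum of per-feature effects.
effects = {"Age": 0.3, "Cholesterol": 0.5, "ST_Slope": -0.2}
phi = shapley_values(list(effects), lambda s: sum(effects[f] for f in s))
```

The enumeration grows as 2^n, which is exactly why sampling-based approximations such as KernelSHAP are used on real feature sets.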
As shown in Fig. 7, the summary plot of Shapley values displays the top 20 predictors of heart disease in order of relevance. Each point on the graph represents an observation in the training set. Points to the right of the zero line suggest a greater risk of being diagnosed with heart disease, whereas points to the left of the zero line indicate a lower likelihood. The values of each feature are represented by the color of the points, with light orange indicating high feature values and dark blue indicating low feature values. The shape of the points in each row is determined by the number of overlapping observations for that feature. Apart from three independent features (Cholesterol, Age, and typical chest pain), nearly all of the variables in the plot are interaction terms that were added to the model. The variable in the first row represents the interaction between SBP at rest and ST Slope Up. According to the Shapley values, people with an upward ST slope and high blood pressure have a lower risk of heart disease. The Shapley values for the second variable, "Sex M ST Slope Flat", show that male patients with a flat ST slope are more likely to develop cardiovascular disease. The fourth variable in the scatter plot, Cholesterol Sex M, indicates that men with high cholesterol are more likely to be diagnosed with cardiovascular disease. In addition, the order of relevance in the summary plot is determined by each feature's average absolute Shapley value, which quantifies the average amount by which the feature affects the predicted probability of heart disease. There are 18 features that contribute at least 0.1 on average to the model's prediction. Table 10 lists the final predictors chosen and the feature importance of each. "RestingBP ST Slope" appears in five of the top 19 most significant predictors.

Model performance on the validation set
The ROC curves (Fig. 8) illustrate the performance of the models at various thresholds. The y-axis indicates the True Positive Rate, or sensitivity, of the models, which measures how well each model identifies patients with heart disease (true positives), while the x-axis indicates the False Positive Rate, the proportion of patients the model incorrectly classifies as positive. A model whose curve lies toward the upper left corner of the graph, with a higher true positive rate and a lower false positive rate, shows a greater capacity to differentiate between the classes. On the test set, all of the models depicted in the figure produce strong results. Overall, Gradient Boosting has the highest Area Under the Curve at 0.927, but at specific thresholds the Random Forest model offers somewhat superior results, since its curve surpasses that of Gradient Boosting.
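The area under the ROC curve can be computed without drawing the curve at all, as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counting half). A minimal sketch of this equivalence, with illustrative labels and scores rather than the paper's data:

```python
def auc_score(labels, scores):
    """AUC via the rank interpretation: fraction of (positive, negative)
    pairs where the positive case is scored higher (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = heart disease) and model scores.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This pairwise definition makes explicit why AUC is threshold-independent: it summarizes ranking quality across all operating points on the curve.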
Using Shapley features with importance greater than 0.1, the Extra Trees classifier achieves an AUC of 0.89. After tweaking the model's hyperparameters, the classifier achieves an average accuracy of 88%, an F1-score of 89.5%, and a standard deviation of 6.7% on the validation set. With an AUC of 0.917, the Random Forest model outperforms the Extra Trees classifier across all three criteria. On the validation set, the model achieves an average precision of 88.7% and an F1-score of almost 90%. Evidently, there is a minor performance reduction in the AdaBoost model: the AUC declined to 0.91, and the overall accuracy and F1-score fell to 86.5% and 88%, respectively, although the model yields a smaller standard deviation than the others. At 0.927, the Gradient Boosting model has the greatest Area Under the Curve among the classifiers. In addition, the model improves the validation set's accuracy to about 87% and the F1-score to 89%.
Comparing the cross-validation results shown in the boxplots of Fig. 9, it is clear that the Gradient Boosting model has the highest median F1-score of 90.3% and the highest median accuracy of 88.5%. It also has the smallest standard deviation of the distribution, at around 3. The Random Forest model comes in a close second with a median F1-score of 89.7% and a median accuracy of 88.2%, albeit with slightly greater score variability.
The recall, precision, and accuracy values for the Catboost model are shown in Table 11. Catboost is a model that can determine whether a patient has heart disease. However, the Catboost algorithm can be biased because it is extremely sensitive to the majority class. When calculating the comprehensive performance measurement, the F1-score is also used to compare the algorithm's precision. Classification of heart disease was significantly improved by the Catboost method. The model achieved 93% accuracy in the "heart illness" category and 88% accuracy in the "No disease" category. The accuracy of the Catboost classification model was 91%.
Table 12 illustrates the classification results of the various classifiers on the dataset, reporting the performance of each in terms of accuracy, precision, recall, and F1 score. Comparing the results of the proposed technique against those of other classifiers such as SVM [54], XGBoost, AdaBoost, RandomForest [58], LinearDiscriminant [67], LightGBM, and GradientBoosting highlights the strength of the proposed approach. The present study employs a confusion matrix (Fig. 10) to report the performance of the models in accurately predicting cardiac disease for a given set of patients, with due consideration to both correctly classified and misclassified instances. Specifically, the Gradient Boosting model is found to exhibit the highest proportion of True Positives (TP) and True Negatives (TN) when evaluated on a test set. The computation of the FN, FP, TN, and TP values for the cardiac disease class is carried out using the Gradient Boosting model, whereby the predicted values are expected to match the actual values. For instance, TP corresponds to the value at cell 1 of the confusion matrix, while FN is computed by adding the relevant row values, excluding TP (i.e., FN = 12). Similarly, FP is calculated as the total of the column values, excluding TP, leading to a value of 11. Lastly, TN is determined by the combination of all columns and rows except the class under consideration (i.e., cardiac disease), which yields a value of 81.
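The per-class metrics reported throughout this section follow directly from these four counts. In the sketch below, FN = 12, FP = 11, and TN = 81 follow the text, while the TP count is a hypothetical placeholder, since the text does not state it.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Precision, recall (sensitivity), F1-score, and accuracy
    derived from confusion-matrix counts for one class."""
    precision = tp / (tp + fp)                       # of predicted positives, how many are real
    recall = tp / (tp + fn)                          # of real positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# FN, FP, TN follow the text; tp=100 is a hypothetical placeholder.
p, r, f1, acc = metrics_from_counts(tp=100, fp=11, fn=12, tn=81)
```

Because F1 ignores true negatives, it is the more informative summary when, as noted for Catboost above, a classifier is sensitive to the majority class.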

Discussion
Despite the vast amount of data produced by healthcare systems, medicine faces unique obstacles in comparison to other data-driven businesses where machine learning has flourished. The Health Insurance Portability and Accountability Act (HIPAA) mandates strict, center-specific Institutional Review Boards (IRBs) to govern the usage of patient data. This significantly preserves patient privacy, but it has unwittingly created data silos across the nation [47]. Consequently, the majority of published healthcare machine learning models rely on locally acquired datasets and lack external validation. According to the Tufts predictive analytics and comparative effectiveness cardiovascular prediction model registry, 58% of cardiovascular prediction models have never been externally validated [69]. Heart-related disorders are one of the leading causes of death and morbidity on a global scale [5][6][7].
It is common for those with heart disease to be unaware of their condition, and it is difficult to predict their health status and diagnose their disease in its early stages in order to save their lives, minimize their complications and suffering, and reduce the global burden of disease and mortality [9]. Machine learning models are capable of accomplishing this difficult task and can be of tremendous assistance in the early diagnosis and prediction of heart disorders [12][13][14]. Medical machine learning offers a vast array of opportunities, including the discovery of hidden patterns that can be utilized to improve diagnostic accuracy on any medical dataset.
Previous research has demonstrated that machine learning can aid in the prediction of cardiovascular illness [15,16]. For the diagnosis of cardiac disorders, this prior research employed various machine learning approaches, such as neural networks, Naive Bayes, Decision Tree, and SVM, and obtained varying degrees of accuracy [18,19]. According to [70], the accuracy of the proposed feature selection methodology algorithm (CFS+Filter Subset Eval), a hybrid method that combines CFS and Bayes' theorem, was 85.5%. (Fig. 10 shows the confusion matrix results for the Extra Trees, RandomForest, AdaBoost, and GradientBoost classifiers.) Shouman et al. [71] presented an integrated k-means clustering with the Naive Bayes approach for enhancing the accuracy of Naive Bayes in diagnosing patients with heart disease, with an accuracy of 84.5%. Using both Naive Bayesian classification and Jelinek-Mercer smoothing techniques, Rupali et al. [72] developed decision support for the Heart Disease Prediction System (HDPS), with Laplacian smoothing for approximating important patterns in the data while avoiding noise; their accuracy was 86%.
Elma et al. [73] created a classifier for predicting heart illness that merged the distance-based K-nearest neighbor approach with a statistically based Naive Bayes classifier (cNK) and achieved an 85.92% accuracy rate. Dulhare et al. [74] improved cardiac disease prediction methods using Naive Bayes and particle swarm optimization, attaining an accuracy of 87.91%.
To accurately predict CVDs in the present study, Shapley values were used to create a Gradient Boosting model with an Area Under the Curve of 0.927 for predicting the risk of a heart disease diagnosis. Using Shapley values, the authors identified critical cardiac disease indicators and their predictive power for a positive diagnosis. Interaction effects between items of a patient's medical information were some of the most relevant predictors in the model, particularly features such as Age, Cholesterol, Blood Pressure, ST Slope, and chest pain type. When selecting the optimal model, the proposed Catboost model offered the strongest results overall, with an overall F1-score of 92.3% and an accuracy of 90.94%, and can be utilized for the early identification and diagnosis of heart disease. Overall, the proposed model is superior to earlier approaches for diagnosing cardiac disease.
However, although this study is important, several limitations exist. First, this research depends solely on secondary data, using the data available at the selected cardiology and internal medicine departments. Hence, there were some missing data, and some variables could not be included in the analysis. The second limitation is the cross-sectional design of the study, which could not examine the longitudinal effects of the risk factors on the development of CVDs.
A possible future direction of this study is to improve prediction techniques by combining various machine learning techniques and to increase the accuracy and precision of CVD prediction and early diagnosis, which has been shown to be superior to the majority of traditional state-of-the-art methods. Based on machine learning techniques, the suggested model for the prediction of heart disorders is a robust, effective, and efficient method for the prediction and early detection of heart ailments. It obtained and maximized classification performance with greater accuracy and precision percentages than other current models. One of the most significant outcomes of our proposed machine learning algorithms is that they achieved good accuracy while using smaller feature sets. This is crucial for clinical medical practice, which requires the most precise and straightforward methods for confirming a diagnosis in order to make a final therapeutic decision. Nonetheless, there are obstacles to the generality of the CVD prediction models reported in this study. Before being incorporated into clinical guidelines, the suggested machine learning algorithm must be evaluated on different population datasets to account for variation in CVD prevalence patterns and to assess its possible impact on physicians' decision making and patient outcomes.

Conclusion
Prediction of cardiovascular diseases is crucial for assisting clinicians with early disease diagnosis. Instead of replacing clinicians, machine learning will be a supplement to the clinical portfolio, enhancing human-led decision-making and clinical practices. Furthermore, by using machine learning techniques, the cost of conducting a long list of expensive clinical and laboratory investigations can be reduced, lessening the financial burden on patients and the healthcare system. This paper proposed new robust, effective, and efficient machine learning algorithms for predicting CVD based on symptoms, signs, and other patient information from hospital records, in order to improve the early prediction of CVD development in its early stages and to ensure early intervention with a warranted recovery. The new technique was more accurate and precise than existing state-of-the-art algorithms for the classification and prediction of heart disease. Future research evaluating the performance of the proposed machine learning algorithms on datasets containing a greater number of modifiable and non-modifiable risk factors will be crucial for the development of a more accurate and robust system for the prediction and early diagnosis of heart diseases.
The target variable indicates a diagnosis of heart failure when HeartDisease = 1, as illustrated in Table

Fig. 4
Fig. 4 The correlation matrix for the Heart Disease dataset

Figure 5
Figure 5 illustrates the correlation between heart disease and categorical variables. Nearly 80% of diabetic persons suffer heart problems. Patients with exercise-induced angina have an even greater incidence of cardiovascular disease, at over 85%. Over 65% of patients diagnosed with cardiac disease had ST-T wave abnormalities in their resting ECGs, the greatest percentage across the categories. Patients with a flat or declining ST slope during exercise have the highest frequency of cardiovascular disease, at 82.8% and 77.8%, respectively. Figure 6 shows that, at almost 77%, the absence of chest pain (asymptomatic) is the most prevalent presentation in patients with heart disease. In addition, heart disease is roughly nine times more prevalent in males than in females among patients with a cardiovascular diagnosis.

Fig. 5
Fig. 5 Prevalence of heart disease by resting ECG. (a) Prevalence of Heart Disease in Patients with Diabetes. (b) Prevalence of Heart Disease in Patients with Exercise Angina. (c) Prevalence of Heart Disease by Resting ECG. (d) Prevalence of Heart Disease by ST Slope

Fig. 7
Fig. 7 The summary plot of Shapley values of features importance

Fig. 9
Fig. 9 Model performance on the validation set

Table 1
A sample of the Heart Failure Dataset

Table 2
Table 4 summarizes the main statistics for the numeric features. It is clear that the mean value of age is 53 and the maximum is 77, as shown in Table 4. Similarly, Table 5 presents the symptoms, signs, and laboratory investigations of the heart disease dataset.

Table 3
The different datasets used to create the dataset of the heart disease

Table 4
Summary statistics of numeric variables

Table 5
Summary statistics of categorical variables

Table 6
The proportion of Heart Disease (bold numbers indicate the highest frequency and percentage)

Table 7
Dataset shapes

Table 8
The results of hyper-parameter optimization of Machine learning models

Table 9
The results of Chi-Squared test

Table 10
The list of the final predictors selected and their feature importance

Table 11
Classification report for the Catboost_tuned model
Comparing the proposed technique against Catboost, ExtraTree, KNeighbors [56], and LogisticRegression [68] demonstrates the method's utility. The results of the classifiers according to various metrics are displayed. The highest-performing classifier on all measures is Catboost_tuned, which achieved an accuracy of 0.9094, a precision of 0.9317, a recall of 0.9146, and an F1 score of 0.9231. Other top-performing classifiers include RandomForest, LogisticRegression, SVM, and KNeighbors, with similar accuracy and precision scores but slightly lower recall and F1 scores. In contrast, lower-performing classifiers such as XGBoost and AdaBoost exhibit moderate accuracy and precision scores but relatively lower recall and F1 scores. Overall, the results suggest that the choice of classifier can have a significant effect on the performance of a predictive model.

Table 12
Comparative results on the Dataset using ML