Examining ALS: reformed PCA and random forest for effective detection of ALS

Alqahtani, Abdullah; Alsubai, Shtwai; Sha, Mohemmed; Dutta, Ashit Kumar

doi:10.1186/s40537-024-00951-4

Research
Open access
Published: 10 July 2024

Journal of Big Data volume 11, Article number: 94 (2024)

Abstract

Amyotrophic Lateral Sclerosis (ALS) is a fatal neurodegenerative disease of the human motor system. It is a group of progressive diseases affecting the nerve cells in the brain and spinal cord that control muscle movement, so timely detection and classification of ALS is vital and can be life-saving. Various studies have applied AI techniques to the detection of ALS, but these methods are considered ineffective at identifying the disease because of the algorithms they employ. The proposed model therefore uses Modified Principal Component Analysis (MPCA) for dimensionality reduction of all the potential features and a Modified Random Forest (MRF) for classifying the presence or absence of an ALS-causing mutation in the corresponding gene. MPCA is adapted to capture the Lower-Importance (LI) data transformation and proceeds in three steps: covariance matrix computation, eigenvector-eigenvalue decomposition, and selection of the desired principal components. Choosing these components without loss of features improves the selection of attributes for ALS-causing gene classification. Classification is then performed by the MRF, which is updated with a clump detector technique. The clump detector applies K-means clustering to group the dimension-reduced data, and the resulting clusters are analyzed as either ALS-causing or not ALS-causing. Finally, the model’s performance is assessed using evaluation metrics such as accuracy, precision, recall, and F1 score, and the proposed model is compared with existing models to assess its efficacy.

Introduction

ALS is a motor neuron disease (MND) [1], also termed Lou Gehrig's disease [2]. It is a progressive neurodegenerative disorder that affects both lower and upper motor neurons [3]. It is considered one of the progressive and fatal adult-onset neurodegenerative diseases and is characterized by motor neuron degeneration in the spinal cord, primary motor cortex, and brainstem [4]. The two distinct forms of ALS are familial ALS (fALS) and sporadic ALS (sALS) [5]. Sporadic ALS occurs in individuals without any family history of the disease [6], whereas familial ALS occurs in at least two people in a family [7]. Around 85–90% of patients suffer from the sporadic form, and the remaining 10–15% suffer from fALS [8]. Although both genders are affected, men are at slightly higher risk than women in the case of sALS, while the risk is similar in both genders for fALS. Knowing the causes of ALS gives better insight into how the disease develops in the body, and both environmental and genetic factors have been shown to play a vital role in its development [9].

According to research, the probable risk factors for ALS include older age and a family history of ALS. Other factors also contribute to the onset and advancement of the disease: environmental factors including pesticides, EMF (Electromagnetic Fields), high BMI, BMAA, and even physical activity can lead to the development of ALS [10]. At very high doses, pesticides induce oxidative stress, $\alpha $-synuclein accumulation, and mitochondrial dysfunction, which likely contributes to the development of ALS. However, pesticides have also been implicated in the pathogenesis of other neurodegenerative diseases, which makes it harder to pinpoint their involvement in ALS [11]. Genetic predisposition combined with the long-term consequences of pesticide exposure is therefore thought to play a major role in the development of ALS [12]. As with pesticides, exposure to heavy metals such as Hg, Fe, and Se is one of the major environmental factors leading to ALS [13]. Exposure to lead was reported to be associated with ALS about 50 years ago. Although environmental factors like pesticides and lead play a large role in the development of ALS, BMI also strongly influences survival outcomes [14]. Weight loss is common in ALS because of the loss of muscle mass and a reduction in body fat [14]. An existing study has demonstrated that patients with a BMI between 30 and 35 have a better chance of survival than patients with a BMI below 30 or above 35.

Like environmental factors, gene mutation is one of the major causes of ALS. Well-documented genetic factors include mutations in the SOD1 gene [15], TAR (Transactive Response) [16], TBK1 (TANK Binding Kinase 1) [17], and C9orf72 [18]. SOD1 is a recurrent gene target in ALS and has been reported to be the most common cause of fALS. The SOD1 gene is located on chromosome 21 and encodes the superoxide dismutase 1 enzyme. The normal function of the SOD1 protein is to eliminate reactive oxygen species in the mitochondria and the cellular cytosol. A mutation in the gene can cause a toxic loss or gain of function, which disrupts normal cellular homeostasis. In ALS, neurodegeneration arises from SOD1 mutation through a combination of mechanisms, namely impaired protein degradation, aggregation of toxic protein, and mitochondrial dysfunction. In addition, certain SOD1 mutations act as predictors of survival in ALS. Like SOD1, C9orf72 is reported as one of the most frequent causes of fALS [19]. Mutation in the C9orf72 gene has been identified as a cause of ALS, and this mutation affects the GGGGCC segment of the gene [20]. C9orf72 is considered the most frequent autosomal dominantly inherited form of fALS, accounting for 35–45% of fALS cases. Likewise, TAR and TBK1 are also common genes prone to ALS-causing mutations [21].

Since ALS is a highly complex disorder with a low life expectancy, proper detection is crucial for understanding the course of the disease and selecting the right treatment for the patient [22]. One study employed FFNN, RNN, and CNN architectures to detect ALS. The FFNN used summarized and static longitudinal features, whereas the CNN and RNN models handled the longitudinal data directly. Together, the three architectures aided in detecting ALS [23].

Similarly, the progression of ALS was identified using stacked autoencoders. In that work, the data were first pre-processed and then subjected to predictive analysis using DL techniques. The primary objective of using DL was to achieve cost-effective computation with higher accuracy using a DNN model, and the DNN showed a fast convergence rate for the classification of ALS. In the experiments, the prediction accuracy on raw data was 66%, whereas the accuracy after cleaning the data using the DL algorithm was 87% [24].

Although existing works delivered reasonable rates for the detection and classification of ALS, they still lack effective models for predicting ALS from the frequency of C9orf72 and SOD1 variants. The proposed model therefore aims to detect ALS accurately using modified PCA and RF. Modified PCA performs dimension reduction and feature prioritization through numerical analysis, and the genotypes are then classified using M-RF, in which a clump detector is employed for effective genotype classification. Finally, the proposed model is assessed using various metrics and compared with existing models to evaluate its efficacy. The objectives of the proposed work are:

1. To implement modified PCA for dimensionality reduction and feature selection by capturing the significant features and simplifying the data visualization.
2. To classify ALS using modified RF, which employs a clump-based classification approach and the LI data transformation technique to distinguish ALS-causing from non-ALS-causing gene mutations.
3. To internally compare the proposed model with ANN, SVM, DT, RF, and XGBoost models.
4. To assess and compare the efficacy of the proposed work using metrics such as accuracy, precision, recall, and F1 score.

Paper Organization: The "Literature review" section discusses traditional approaches applied in comparable domains with assorted methods. The "Proposed methodology" section presents the methodology executed in the proposed framework. The outcomes achieved by the proposed method are shown in the "Results and discussions" section. Finally, the conclusion and future work of the proposed system are given in the "Conclusion" section.

Literature review

Various studies involving AI techniques for the detection and classification of ALS are reviewed in this section.

ALS is a fatal motor neuron disease characterized by progressive deterioration of nerve cells in the brain and spinal cord [25]. It can lead to trouble breathing and affects voluntary control of the arms and legs, but it does not affect intelligence, sight, or hearing. One study therefore introduced an ML model to identify MND and help predict the health impacts of the disease. The model used was the NMLM (Neuro-Machine Learning Model), motivated by the muscle weakness suffered by ALS patients. In the experiments, the NMLM model provided a better accuracy rate than existing ML algorithms [26]. Likewise, another paper employed a multi-class ML strategy to classify 300 patients based on their radiological profiles. PLS, ALS, and PMS were considered, and classification of MND was carried out using an MLP model. ALS achieved better classification accuracy than PLS and PMS [27].

ALS has a complex genetic basis in which a non-additive combination of different variants constitutes the disease, something that linear models cannot capture [28]. DL algorithms are therefore employed to identify such complex relations. One study implemented a DL model to classify healthy individuals versus patients suffering from ALS, using the ProjectMinE dataset and a D-CNN architecture. The process involved two major steps: first, identifying the promoter regions likely to be related to ALS, and second, classifying based on the selected genomic regions. The experiments showed that the model had the potential to predict the occurrence of ALS from genotype [29]. Another paper employed DL-based molecular ALS classification and interpretation techniques. Its framework converted RNA expression values to pixels and trained a CNN model on the resulting images. The SHAP (SHapley Additive exPlanations) method was then used to extract the pixels with the highest relevance for ALS classification, and these pixels were mapped back to the genes that made them up. The genes detected using SHAP interpretation demonstrated the value of employing ML for molecular classification of ALS [30].

In most cases, needle EMG is used to sample data from different muscle parts for the diagnosis of ALS [31]. Although EMG signals from different muscle parts have different diagnostic consequences, certain aspects of neurogenic injury are common across muscle parts. To lessen patients' pain and improve diagnostic effectiveness, one paper employed a DCN (Domain Contrast Network) to extract the common features of neurogenic injury for identifying ALS. Two loss functions were used to decrease the discrepancy between the two domain distributions and to increase the distance between normal and ALS samples. The experiments showed that the model was valuable for discovering sensitive muscle parts for early detection of ALS in practical applications [32]. The use of ML algorithms for identifying ALS helps in several ways, including accelerating the discovery of genetic causes and supporting treatment for patients. One such study used five datasets: DisGeNet, tALSoD, ClinVar, a manually curated set, and the combination of all genes from these datasets as a single dataset. The precision and recall obtained differed for each dataset [33].

As with detection, monitoring the progression of the disease in the body is important [34]. One study simulated disease propagation by analyzing cerebral MRI networks for better identification of ALS, employing a computational model to simulate progressive network degeneration. The experiments showed that the computer-simulated aggregation aided in assessing the disease pattern in ALS patients, and that the computational model helps predict disease progression and shows potential as a prognostic biomarker [35]. The prediction and detection of ALS are considered daunting tasks for brain-computer interfaces (BCI) [36]. Different NN architectures have been used for the identification and classification of ALS, but NNs have limitations that need to be considered, including vector conversion and feature extraction of ALS data. A DL approach based on a Bayesian classifier was therefore used for predicting ALS. It performed feature extraction using the DWT function, which decomposes the EEG (electroencephalogram) signal into various sub-bands for processing. After classification, the model was compared with existing classifiers for ALS detection, and it enhanced the efficacy of the BCI system [37].

Table 1 Summary of existing works

ALS is considered a devastating and incurable disease that affects the motor neurons and leads to progressive paralysis [38] and death, on average within 3 to 5 years. An effective and reliable model is therefore needed for predicting the presence of ALS, although a few challenges need to be addressed. One study employed RF, XGBoost, and MLP to exploit the dynamic hidden information in patients’ clinical records, and the XGBoost model outperformed the other models for ALS detection [39]. Most existing studies emphasize early detection and diagnosis of ALS; however, early detection is extremely challenging because of the large number of overlapping symptoms [40]. Another study used a DL method to automatically assign each patient to the ALS or LMND (Lower Motor Neuron Disease) group and to recognize whether the disease progression is fast or slow. The model obtained a remarkable accuracy rate for the diagnosis and prognosis of LMND and ALS [41].

ALS is mostly characterized by rapid functional decline or ventilatory decline [42]. One study employed sdtDBN (sdt-Dynamic Bayesian Network): the traditional DBN model includes only dynamic variables, whereas the sdtDBN model learns both static and dynamic variables. Data from 1214 patients in the Portuguese ALS dataset were used, and better results were obtained when the model was compared with the existing model [43]. DNNs have become one of the major architectures for disease progression prediction and help unveil temporal phenotypes. One paper employed various DNNs for predicting ALS, including an RNN with LSTM, on a large cohort of Portuguese ALS patients. The RNN with LSTM enabled effective prediction and provided clinical insights into disease progression toward respiratory deficiency, achieving the outcomes and interpretability required for the prediction of ALS [44]. Although various studies have focused on the detection and classification of ALS, fewer have estimated the survival rate. One paper employed a CNN coupled with an FC top layer for estimating patient survival. The CNN and objective function were tested on a large dataset and compared with existing models to analyze the efficacy of the approach [45]. Table 1 shows the summary of the existing works.

Problem identification

From the evaluation of the works reviewed above, the primary concerns are as follows:

1. The precision and recall values obtained by the existing models are considerably lower than those of the proposed model.
2. Most existing studies focused on detecting ALS [26, 29, 35]; however, predicting the genotype based on a particular gene mutation is not addressed in existing studies.

Proposed methodology

ALS is a fatal motor neuron disease characterized by progressive degeneration of nerve cells located in the spinal cord and the brain. It affects voluntary control of the arms and legs and causes trouble breathing. Mutations in ALS-associated genes are potential causes of familial (hereditary) ALS; about 20–40% of familial ALS is due to a mutation in the C9orf72 gene. This mutation results in protein aggregation in motor neurons. The progression of ALS spans 2–5 years, or in some cases 10 years. ALS is a rare disorder, developing in 1.5–3 per 10,000 people.

ALS remains notoriously difficult to understand. Only 5–10% of patients with ALS carry genes known to cause ALS. The limited ability to biopsy brain and spinal tissue and to study their complete working mechanism leads to later diagnosis and lower curability of ALS.

Mutation occurring in ALS genes

Mutation or disruption of ALS genes involved in protein homeostasis pathways, including VCP, Ubiquilin-2, and p62, results in TDP-43 protein aggregation. The downregulation of VCP initiates cytosolic TDP-43 protein aggregation, resulting in autophagic effects. Similarly, a reduction in CHMP2B and the expression of these mutations are linked with progressive neurological deterioration and axonal pathology, with earlier mortality.

ALS mutations in FIG4

FIG4 encodes a lipid phosphatase of about 907 amino acids, which regulates the level of phosphatidylinositol-3,5-bisphosphate (PI(3,5)P2). Mutations result in severe tremors, abnormal gait, and slow degeneration of motor neurons in the brain and spinal cord. Similarly, the variants linked with ALS can disturb the balance of phosphoinositide processing, which affects overall autophagic activity.

ALS mutations in VAPB

Two ALS-linked mutations in the gene encoding VAMP/VAPB have been noted in experimental rats, resulting in attenuation of the UPR (unfolded protein response). The probable interaction with the ATF-6 molecule is one of the key factors initiating the UPR. In addition, about 20% of familial ALS cases are due to mutations in copper/zinc superoxide dismutase 1 (SOD1). This led to the discovery of two key aspects:

1. Mutation in SOD1 causes ALS by gaining a toxic property.
2. Pathogenesis driven by the ubiquitously expressed SOD1 mutant is a non-cell-autonomous process.

This was recognized through gene excision from selected cell types, which indicated that disease onset is driven by the mutation within motor neurons. The primary determinants of ALS disease progression are mutant SOD1 synthesis within additional glial cell types, including microglia.

Future therapeutic strategies for ALS

In affected individuals, the C9orf72 hexanucleotide repeat occurs hundreds to thousands of times, whereas in healthy individuals it occurs only about 30 times. Present studies target the Dipeptide Repeat Proteins (DPRs) produced by mutant C9orf72. Lowering the mutant protein or slowing its progression is one of the potential therapeutic procedures for ALS. Likewise, antibodies (Ab) for DPR detection have been characterized and validated. The ALS development institute has developed human cell line models capable of producing C9orf72 DPRs, which can be used to analyze DPR toxicity in mammalian cells and the cellular response. In addition, ALS has been characterized using transgenic mouse models of C9orf72-mediated ALS, which supports an adequate understanding of the disease for developing therapeutics.

ALS in Saudi Arabia

ALS care requires the contribution of multiple healthcare professionals. About 47.3% of the population in SSA are affected due to ALS. About 63.1% of the ALS care measure items were perceived by more than 50% of ALS. Motorized wheelchairs are used by 76.9% of the ALS population, and 66.7% have respiration monitoring systems. In addition, 74.4% of people are prone to end-of-life discussion. The disease leaves patients disabled in performing daily activities and impairs self-care, which burdens both the economy and society. Hospitalization for ALS incurs high costs, which can be reduced by providing Quality-Of-Life (QOL) care to surviving ALS patients. Numerous guidelines have been established to ensure high-quality care. These guidelines comprise ALS care measures that enhance QOL and survival. Following them can help identify and avert unnecessary suffering and reduce the emotional and financial burden on patients and their caregivers, which in turn reduces the economic impact on the overall healthcare system.

In view of this, the proposed framework is designed to classify patients with and without ALS in Saudi Arabia by incorporating meta-heuristic approaches for better outcomes. Initially, the ALS genotype dataset is loaded for data pre-processing. Pre-processing the raw data enhances its accuracy and reliability by removing inconsistent records, and data cleaning and data integration help unify the dataset. Following this, the Modified PCA is applied to perform the Lower-Importance data transformation. The overall flow of the proposed framework is pictured in Fig. 1.

The MPCA in the proposed framework captures all of the potential data features that are relied on for accurately predicting the presence or absence of ALS. Rather than performing feature selection alone, MPCA is adapted to capture all of the features without discarding any because of increased dimensionality, so that the potentially informative forms of the data are retained for further processing. The complete data is split into training and test sets in the ratio 80:20 and passed on for classification.
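The 80:20 split mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_test_split_80_20`, the fixed seed, and the placeholder records are assumptions made for the example and are not taken from the paper.

```python
import random


def train_test_split_80_20(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of `rows` and split it into train and test parts.

    `train_fraction` and `seed` are illustrative defaults (assumptions).
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Placeholder for 100 pre-processed genotype records (hypothetical data).
samples = list(range(100))
train, test = train_test_split_80_20(samples)
print(len(train), len(test))  # 80 20
```

Shuffling before cutting avoids a split that simply follows the order of the records in the file.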

The Modified RF is then used to classify patients with and without ALS. The MRF performs classification via clump formation, using bootstrapping and rounding of the data weights. K-means clustering is then applied: the clumps are grouped and classified according to the presence or absence of ALS, based on the range of mutations in the ALS-causing gene. The overall prediction of the model is validated using the accuracy, precision, recall, and F1-score of the model. This can support effective and earlier treatment measures for ALS. Figure 2 shows the illustrative architecture of ALS.
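For reference, the evaluation metrics named in the paper (accuracy, precision, recall, and F1 score) can be computed from the confusion counts of a binary classifier. The sketch below assumes a binary labelling in which `1` stands for "ALS-causing"; the label vectors are made-up values for illustration only.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = ALS-causing mutation, 0 = not ALS-causing.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics equal 0.75.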

Principal Component Analysis (PCA)

PCA provides a combination method: it relies on the concept of linear combination of the features. PCA is intended for large-scale datasets with high-dimensional space. The covariance matrix and the dot matrix are computed from the features $X_{1},\ldots ,X_{n}$ with $X_{i} \in \mathbb{R}^{d}$. The data matrix is constructed as $X=[X_{1},\ldots ,X_{n}]$, which is of size $d \times n$. The covariance matrix is given by

$$\begin{aligned} C= \frac{1}{n}\sum _{i=1}^{n}X_{i}X_{i}^{\textrm{T}}= \frac{1}{n}XX^{\textrm{T}} \end{aligned}$$

(1)

The dot matrix is defined as $X^{\textrm{T}}X$, whose $(i, j)$ entry is the dot product $X_{i}^{\textrm{T}}X_{j}$. Its eigen-decomposition, with eigenvalue $\lambda $ and eigenvector $v$, is given by,

$$\begin{aligned} X^{\textrm{T}}Xv=\lambda v \end{aligned}$$

(2)

$$ \left( {XX^{{\text{T}}} } \right)\,\left( {Xv} \right) = \lambda \left( {Xv} \right) $$

(3)

Here, $Xv$ is an eigenvector of $XX^{\textrm{T}}$ with the same eigenvalue $\lambda $, and hence an eigenvector of the covariance matrix $C$ (with eigenvalue $\lambda /n$). The normalized eigenvector is

$$\begin{aligned} \Bigg(\frac{1}{\sqrt{\lambda }}\Bigg)Xv \end{aligned}$$

(4)
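The factor $1/\sqrt{\lambda }$ follows from a one-line norm computation, assuming that $v$ is a unit-norm eigenvector of $X^{\textrm{T}}X$ as in Eq. (2) and that $\lambda > 0$:

$$\begin{aligned} \Vert Xv\Vert ^{2} = v^{\textrm{T}}X^{\textrm{T}}Xv = \lambda \, v^{\textrm{T}}v = \lambda \end{aligned}$$

so dividing $Xv$ by $\sqrt{\lambda }$ yields a vector of unit length.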

Conversely, if $u$ is an eigenvector of $XX^{\textrm{T}}$ with eigenvalue $\lambda $, the corresponding eigenvector of the dot matrix is obtained from,

$$\begin{aligned} XX^{\textrm{T}}u\,= \,& {} \lambda u \end{aligned}$$

(5)

$$\begin{aligned} \left(X^{\textrm{T}}X\right)\left(X^{\textrm{T}}u\right)\,=\, {} \lambda \left(X^{\textrm{T}}u\right) \end{aligned}$$

(6)

Thus, $X^{\textrm{T}}u$ is an eigenvector of the dot matrix with the same eigenvalue $\lambda $. Hence, the normalized form of the eigenvector is,

$$\begin{aligned} \left(\frac{1}{\sqrt{\lambda }}\right)X^{\textrm{T}}u \end{aligned}$$

(7)
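The relationship between the eigenvectors of $X^{\textrm{T}}X$ and $XX^{\textrm{T}}$ can be checked numerically. The sketch below uses only the Python standard library and a plain power iteration to find the leading eigenpair of the dot matrix; the small matrix `X` and all helper names are assumptions chosen for illustration and are not part of the paper.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def top_eigenpair(S, iterations=500):
    """Power iteration for the leading eigenpair of a symmetric PSD matrix."""
    v = [1.0] * len(S)
    for _ in range(iterations):
        w = matvec(S, v)
        length = norm(w)
        v = [x / length for x in w]
    lam = sum(a * b for a, b in zip(v, matvec(S, v)))  # Rayleigh quotient
    return lam, v


# Toy data matrix with d = 2 features (rows) and n = 3 samples (columns).
X = [[2.0, 0.0, 1.0],
     [1.0, 3.0, 1.0]]

dot_matrix = matmul(transpose(X), X)  # X^T X, size n x n
outer = matmul(X, transpose(X))       # X X^T, size d x d

lam, v = top_eigenpair(dot_matrix)
u = [x / math.sqrt(lam) for x in matvec(X, v)]  # normalized Xv, as in Eq. (4)

# u should have unit length and satisfy (X X^T) u = lam * u.
residual = norm([a - lam * b for a, b in zip(matvec(outer, u), u)])
print(round(norm(u), 6), residual < 1e-8)
```

This is why the smaller of the two matrices can be decomposed and the result mapped back: when $n < d$ the dot matrix is cheaper to handle, and when $d < n$ the $d \times d$ matrix is.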

Modified PCA

PCA identifies the principal dimensions of the input data using eigenvectors, by re-expressing the original space and indicating the deviations within it. When all the features are combined, however, PCA loses the ability to capture all the entropy features, and traditional PCA might not capture the features most relevant to the model. PCA is therefore modified to prioritize features based on task relevance, providing more flexible dimensionality reduction. Modified PCA focuses on capturing the informative dimensions for a specific task; related variants include Kernel PCA, which allows non-linear dimensionality reduction by mapping data into higher-dimensional spaces, and robust PCA, which adapts to non-normal distributions to enhance performance. The proposed framework applies Modified PCA for feature selection. This is done through effective dimension reduction, with feature prioritization based on numerical analysis that measures the correlation among the characteristics to determine the principal components of the data. Modified PCA leverages eigenvectors to reconstruct the original data in a lower-dimensional space, thereby reducing data redundancy and preserving important features. By employing MPCA, the research aims to facilitate downstream analysis tasks such as cell clustering and lineage construction, which are crucial in biomedical research and diagnostics. Unlike traditional PCA methods, MPCA offers improved performance in capturing important features and simplifying data visualization, making it suitable for analyzing complex biomedical datasets.

MPCA reconstructs the original n-dimensional features into K-dimensional features, with $K < n$. The K-dimensional features are new orthogonal attributes referred to as principal components. They minimize data redundancy to achieve the goal of dimension reduction. Thus, MPCA is used to determine the important features of the data.
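The paper does not state how K is chosen. One common rule, shown here purely as an assumption, is to keep the smallest number of components whose eigenvalues account for a target share of the total variance. The function name `choose_k`, the threshold, and the eigenvalues below are hypothetical.

```python
def choose_k(eigenvalues, target=0.95):
    """Smallest K whose leading eigenvalues reach `target` of the total.

    The explained-variance rule and its threshold are assumptions for
    illustration, not the selection criterion used in the paper.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= target:
            return k
    return len(ordered)


# Hypothetical eigenvalues of a covariance matrix (they sum to 10).
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.5, 0.5]
print(choose_k(eigenvalues, target=0.85))  # 4
```

With these values the cumulative shares are 0.40, 0.65, 0.80, and 0.90, so four components are the first to pass the 0.85 threshold.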

The modified gradients of the clustering enable the data points to move together. Here, $grad_{\textrm{i}}$ denotes the gradient of $E$ with respect to the i-th latent representation $x_{\textrm{i}}$, given by,

$$\begin{aligned} grad_{\textrm{i}} = \frac{\partial E}{\partial x_{\textrm{i}}} \end{aligned}$$

(8)

whereas the eigen-decomposition of $M$ is

$$\begin{aligned} M= VDV^{\textrm{T}} \end{aligned}$$

(9)

Here, $V$ is composed of the eigenvectors,

$$\begin{aligned} V= (vec_{\textrm{1}},vec_{\textrm{2}},\ldots vec_{\textrm{N}}) \end{aligned}$$

(10)

The eigenvectors are arranged as the columns of $V$ in order of decreasing eigenvalue, that is,

$$\begin{aligned} \lambda _{\textrm{1}} \ge \lambda _{\textrm{2}} \ge \cdots \ge \lambda _{\textrm{N}} \end{aligned}$$

(11)

Each leading eigenvector defines an aggregated gradient, a linear combination of the gradients whose coefficients are the corresponding entries of that eigenvector. Writing $Vec_{\textrm{ji}}$ for the j-th entry of the i-th eigenvector, this is defined using,

$$\begin{aligned} a_{\textrm{i}} = \sum _{j=1}^{N} Vec_{\textrm{ji}}\,grad_{\textrm{j}} \end{aligned}$$

(12)

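In matrix form, with $G = (grad_{\textrm{1}},\ldots ,grad_{\textrm{N}})$ collecting the gradients and $Vec_{\textrm{l}}$ holding the $l$ leading eigenvectors as columns, this reads

$$\begin{aligned} A= (a_{\textrm{1}},\ldots ,a_{\textrm{l}}) = G\,Vec_{\textrm{l}} \end{aligned}$$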

(13)

Each new point, for $i = 1,\ldots ,N$, is then constructed as a linear combination of the aggregated directions, weighted by the original contribution of that data point to each aggregation:

$$\begin{aligned} \bar{grad}_{\textrm{i}} = \sum _{j=1}^{l} Vec_{\textrm{ij}}\,a_{\textrm{j}} \end{aligned}$$

(14)

whereas the construction of $\bar{Grad}$ is given using,

$$\begin{aligned} \bar{Grad} = \left(\bar{grad}_{\textrm{1}},\ldots ,\bar{grad}_{\textrm{N}}\right) = A\,Vec_{\textrm{l}}^{\textrm{T}}= G\,Vec_{\textrm{l}}Vec_{\textrm{l}}^{\textrm{T}} \end{aligned}$$

(15)
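The aggregation and reconstruction steps in Eqs. (9) to (15) can be sketched numerically as follows. The text does not state how M is built, so the sketch assumes M = GᵀG, where the columns of G are the individual gradients; the function name `project_gradients` and the toy matrix are likewise hypothetical.

```python
import numpy as np

def project_gradients(G, l):
    """G is d x N with one gradient per column; keep the l leading directions."""
    M = G.T @ G                          # assumed choice of the N x N symmetric matrix M
    eigvals, V = np.linalg.eigh(M)       # M = V D V^T, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]    # descending eigenvalue order, as in Eq. (11)
    V_l = V[:, order[:l]]                # N x l matrix of leading eigenvectors
    A = G @ V_l                          # aggregated gradients, Eq. (13)
    G_bar = A @ V_l.T                    # reconstructed gradients, Eq. (15)
    return A, G_bar

# Toy example: the three gradients are multiples of one direction (rank 1),
# so a single aggregated direction reconstructs them exactly.
G = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
A, G_bar = project_gradients(G, l=1)
```

When l equals N, the projection keeps every direction and the reconstructed matrix coincides with G; smaller l discards the directions with the smallest Eigenvalues.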

The complete working of the MPCA algorithm is presented in Algorithm-1.

Thus, the MPCA in the proposed approach achieves a higher degree of dimensionality reduction and supports downstream analysis such as cell clustering and lineage construction. This, in turn, makes it easier to measure the frequency ranges of a gene based on unique molecular biomarkers. It also simplifies the data visualization process and preserves the original features of the corresponding expression matrix, and their effects, for later cell analysis tasks. Figure 3 shows the traditional PCA and the modified PCA in dimensionality reduction, and Fig. 4 depicts the process involved in the proposed model.

Classification-modified random forest classifier

RF is an amalgamation of tree predictors in which each tree relies on the values of a random vector that is sampled independently and with the same distribution for all trees in the forest. The generalization error of the classifier depends on the strength of the trees in the forest. The method uses bagging to generate random subsets of features, which results in a lower correlation among the decision trees. However, each tree considers only some of the features. RF can handle huge data sets and deliver precise predictions, but it is slow because every sample is evaluated by every decision tree, and larger data sets require more resources for storage. Moreover, the prediction of a single Decision Tree (DT) is easier to understand than that of a forest, which is more complex.

Thus, the proposed modified random forest algorithm focuses predominantly on reducing overfitting and improving generalization performance by limiting the set of features considered for each split. This ensures that each DT emphasizes the most informative features, thereby increasing tree diversity and reducing correlation. In addition, the Lower Importance (LI) measure used in the model permits the algorithm to capture the contribution of each node in a decision path to the increase in sample purity. This helps in understanding which features are most influential at different stages of the decision-making process within each tree. By considering LI, the algorithm can make more informed decisions about feature importance and node splitting, potentially leading to better classification of ALS.

Likewise, the clump detector employed in the proposed model aids in identifying clustered samples. By evaluating the uncertainty associated with each predicted label within a decision path, the algorithm can assess the reliability of its predictions. Incorporating the probability density function of the feature distributions and propagating probability values through the decision trees allows the algorithm to account for feature uncertainty. By adjusting split decisions according to the reliability level of the features, the algorithm can adapt its structure to handle the data better, which can lead to more effective use of information and improved classification accuracy. The implementation of these functions therefore aids in improving the efficacy and efficiency of the proposed model for ALS classification.

Moreover, the Modified Random Forest (MRF) uses a clump-based classification approach within the Random Forest framework, leveraging K-means clustering for data grouping. The clump-based model permits more granular analysis and classification, which is particularly valuable in discriminating between ALS-causing and non-ALS-causing gene mutations. MRF incorporates confidence scores for each clump, which enhances the classification process and improves prediction accuracy. By combining predictions from multiple clumps using soft voting, MRF aims to provide more robust and accurate classification outcomes than traditional Random Forest algorithms. The parameters used in the proposed MRF are listed in Table 2.

Table 2 lists the parameters and values used in the modified RF to improve the efficacy of the model. Bootstrapping and limiting the leaf split decision to a subset of features are applied to stream data. Figure 5 depicts the working of the modified Random Forest.
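The two randomization steps mentioned here, bootstrapping the samples and restricting each split to a random subset of m out of M features, can be sketched as below. This is a generic illustration of these steps and not the proposed MRF; the function names, the seed, and the sizes used are assumptions.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def feature_subset(n_features, m, rng):
    """Pick m distinct candidate features for one split, with m < M = n_features."""
    if not 0 < m < n_features:
        raise ValueError("m must satisfy 0 < m < M")
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(0)                     # fixed seed so the sketch is repeatable
rows = bootstrap_indices(8, rng)           # hypothetical data set with 8 samples
candidates = feature_subset(6, 2, rng)     # hypothetical M = 6 features, m = 2
```

Because every tree sees a different bootstrap sample and every split a different feature subset, the trees become less correlated with one another.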

Table 2 Modified RF parameter and value


This is attained by changing the base tree induction algorithm, which is done by limiting the set of features considered for further splits to a random subset of size m. Here, $m < M$, where M corresponds to the total number of features present. This is done after the learners are reset once a drift is signaled. Such a reset tends to reduce the ensemble classification performance, which in turn has a negative impact on the overall ensemble predictions. To resolve these issues, the MRF is made to capture the Local Importance (LI) of each node for the predicted label, which is provided using,

$$\begin{aligned} LI_{\textrm{k}}^{\textrm{t}} (y) = r_k^t (y)-r_{\mathrm{parent(t,k)}}^t (y) \end{aligned}$$

(16)

Here, parent(t, k) signifies the parent of the kth node on the decision path of the sample x in tree t. The path of decision is recorded as Clus₁ to Clus_n. The corresponding Decision Path (DP) is represented as

$$\begin{aligned} DP = [Clus_1,Clus_2,Clus_3,..., Clus_n ] \end{aligned}$$

(17)

The finally predicted sample category of the clus₆ is 2, where the LI of each node in the decision path is provided by,

$$\begin{aligned} \begin{aligned} Clus_1: LI_{1}^{\textrm{12}} (y = 2) = r_{1}^{\textrm{12}} \\ Clus_2: LI_{2}^{\textrm{12}} (y = 2) = r_{2}^{\textrm{12}}\\ Clus_3: LI_{3}^{\textrm{12}} (y = 2) = r_{3}^{\textrm{12}}\\ Clus_n: LI_{\textrm{n}}^{\textrm{12}} (y = 2) = r_{\textrm{n}}^{\textrm{12}} \end{aligned} \end{aligned}$$

(18)

Here, LI₁¹² (y = 2) denotes the LI of node k = 1 on the DP of x_n in the Decision Tree t = 12, which has a predicted label of y = 2. For the samples classified in a DT, the LI measures the contribution of a node to the increased sample purity compared with its parent node. Node 2 has the highest importance factor, whereas the nth node has the least importance. This is inferred from the data missing at the preceding node, which results in a substantial error in the predicted outcome.
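Equation (16) can be illustrated with a short sketch. The quantity r is not defined explicitly in the text, so the sketch assumes that r is the proportion of samples of class y at each node on the decision path, starting from the root; the function name and the numbers are hypothetical.

```python
def local_importance(r_path):
    """LI of each non-root node: r at the node minus r at its parent, as in Eq. (16)."""
    return [child - parent for parent, child in zip(r_path, r_path[1:])]

# Hypothetical purity of class y along one decision path, root first.
r_path = [0.5, 0.75, 0.75, 1.0]
li = local_importance(r_path)   # [0.25, 0.0, 0.25]
```

Under this assumption the LI values along a path sum to the purity gained between the root and the leaf, so a node with a large LI accounts for a large share of the final decision.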

The corresponding classification outcome can be attained on each DT, with a similar approach to the procedure done on each sample. This ensemble algorithm aims to attain the concluding outcome from the classification outcomes of each weak classifier, depending upon certain voting rules. Especially in an RF model, both the soft and hard voting rules are carried out.

Depending upon the hard voting rule, each DT can provide a classification outcome of a specific sample. Depending upon the majority voting law, the result having more occurrences will be the final classification result. This is given using,

$$\begin{aligned} H_{\textrm{RF,hard}}(x) = \underset{y \in Y}{\arg \max } \left[ \frac{1}{T}\sum _{t=1}^{T} v_{t}(x,y)\right] \end{aligned}$$

(19)

where

$$\begin{aligned} v_{t}(x,y) = \begin{cases} 1, & f_{t}(x) = y \\ 0, & f_{t}(x) \ne y \end{cases} \end{aligned}$$

(20)

Here, x signifies the test sample, the label value y belongs to Y, and Y denotes the collection of all possible classification results. f_t(x) denotes the classification outcome of the DT (t) for x, v_t (x,y) represents the voting variable, and H_{(RF, hard)}(x) represents the overall classification result based on hard voting. In soft voting, the probabilities with which all the sub-classifiers predict that the sample belongs to a given category are averaged, and the category with the highest average probability is selected as the final classification outcome. This is provided using,

$$\begin{aligned} H_{\textrm{RF,soft}}(x) = \underset{y \in Y}{\arg \max } \left[ \frac{1}{T}\sum _{t=1}^{T} v_{t}(x,y)\right] \end{aligned}$$

(21)

Here, v_t (x,y) indicates the probability with which tree t assigns the prediction label y to x. Moreover, the Clump Detector (CD) of tree t for the sample x is provided by,

$$\begin{aligned} CD_{\textrm{t}}(x,y) = 1-\frac{\sum _{k \in DP_{\textrm{miss}}(t,x)} LI_{k}^{t}(y)}{\sum _{k \in DP(t,x)} LI_{k}^{t}(y)} \end{aligned}$$

(22)

where DP(t,x) and DP_miss(t,x) denote the collection of all the nodes present in the decision path of x in tree t and the collection of the corresponding missing nodes, respectively.

Because the trees t have decision paths of various lengths, all the CD values in the RF are normalized, which is given using,

$$\begin{aligned} CD_{\textrm{t}}^{\textrm{norm}} (x,y) = \frac{CD_{t}(x,y)-CD_{\min }(x,y)}{CD_{\max }(x,y) -CD_{\min }(x,y)} \end{aligned}$$

(23)

Here, CDmin(x,y) and CDmax(x,y) denote the minimum and the maximum values of the CD for all the (t) in an RF. This is separately provided using,

$$\begin{aligned} CD_{\min }(x,y) = \min _{t^{\prime } \in [1,\ldots ,T]} CD_{t^{\prime }}(x,y) \end{aligned}$$

(24)

Similarly, for maximum,

$$\begin{aligned} CD_{\max }(x,y) = \max _{t^{\prime } \in [1,\ldots ,T]} CD_{t^{\prime }}(x,y) \end{aligned}$$

(25)
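Equations (22) to (25) can be sketched as follows. The text does not say how missing nodes are identified, so the sketch simply takes their positions on the decision path as given; the function names, the zero-division guards, and the toy values are assumptions.

```python
def clump_detector(li_path, missing):
    """CD of one tree: 1 minus the share of LI carried by missing nodes, Eq. (22)."""
    total = sum(li_path)
    if total == 0:
        return 0.0                     # assumption: no usable evidence on this path
    lost = sum(li_path[k] for k in missing)
    return 1.0 - lost / total

def normalise(cds):
    """Min-max normalise the CD values of all trees in the forest, Eqs. (23) to (25)."""
    lo, hi = min(cds), max(cds)
    if hi == lo:
        return [1.0 for _ in cds]      # assumption: identical CDs count as equally reliable
    return [(c - lo) / (hi - lo) for c in cds]

# Three hypothetical trees: LI values along the path and positions of missing nodes.
trees = [([0.25, 0.25, 0.5], {2}),
         ([0.5, 0.25, 0.25], {1}),
         ([0.5, 0.5], set())]
cds = [clump_detector(li, miss) for li, miss in trees]   # [0.5, 0.75, 1.0]
cd_norm = normalise(cds)                                 # [0.0, 0.5, 1.0]
```

A tree whose missing nodes carry most of the importance receives a low CD, which makes its vote count less in the weighted voting that follows.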

By summing the CD-weighted votes, the total clump detection score is obtained for each predicted category. The label with the highest CD score across all categories is selected as the final prediction. This is estimated using a Decision Path based MRF. This is given using,

$$\begin{aligned} H_{\textrm{DPRF,hard}}(x) = \underset{y \in Y}{\arg \max } \left[ \sum _{t=1}^{T} v_{t}(x,y) \cdot CD_{\textrm{t}}^{\textrm{norm}}(x,y) \cdot P_{\textrm{Hig}}^{\textrm{b}}\right] \end{aligned}$$

(26)

Similarly, the category with a higher predicted probability is considered for the final classification for the sample of x. This is given by,

$$\begin{aligned} H_{\textrm{DPRF,soft}}(x) = \underset{y \in Y}{\arg \max } \left[ \sum _{t=1}^{T} p_{\textrm{t}}(x,y) \cdot CD_{\textrm{t}}^{\textrm{norm}}(x,y) \cdot P_{\textrm{Hig}}^{\textrm{b}}\right] \end{aligned}$$

(27)
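The plain and weighted voting rules of Eqs. (19) to (21), (26) and (27) can be compared in a small sketch. For brevity it uses one weight per tree that stands in for the product of the normalized CD and the propagated probability, although in the equations the CD also depends on the label y. All names and numbers are hypothetical.

```python
def hard_vote(tree_labels, labels, weights=None):
    """Each tree casts one vote for its predicted label, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(tree_labels)
    score = {y: 0.0 for y in labels}
    for f_t, w in zip(tree_labels, weights):
        score[f_t] += w
    return max(labels, key=lambda y: score[y])

def soft_vote(tree_probs, labels, weights=None):
    """Each tree contributes its class probabilities, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(tree_probs)
    score = {y: 0.0 for y in labels}
    for p_t, w in zip(tree_probs, weights):
        for y in labels:
            score[y] += w * p_t[y]
    return max(labels, key=lambda y: score[y])

labels = [0, 1]
tree_labels = [0, 0, 1]                                   # hard predictions of 3 trees
tree_probs = [{0: 0.75, 1: 0.25}, {0: 0.75, 1: 0.25}, {0: 0.25, 1: 0.75}]
weights = [0.25, 0.25, 1.0]                               # third tree is the most reliable

plain_hard = hard_vote(tree_labels, labels)               # 0
weighted_hard = hard_vote(tree_labels, labels, weights)   # 1
plain_soft = soft_vote(tree_probs, labels)                # 0
weighted_soft = soft_vote(tree_probs, labels, weights)    # 1
```

In this toy case the unweighted rules follow the two less reliable trees, whereas the weighted rules follow the single tree with the highest reliability weight.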

Similarly, the uncertainty of the input features is included as a modifying factor in the splitting criterion. The uncertainty present in each feature is expressed using the standard deviation, $\sigma _{m}$, of its distribution. The Probability Density Function (PDF) of the distribution of the feature f_m is evaluated through two different quantities,

$$\begin{aligned} P^{\mathrm{*}}_{\textrm{Low}}= & {} \int _{-\infty }^{t} CD(x, \sigma _{\textrm{m}})\, dx \end{aligned}$$

(28)

$$\begin{aligned} P^{\mathrm{*}}_{\textrm{High}}= & {} \int _{t}^{\infty } CD(x, \sigma _{\textrm{m}})\, dx \end{aligned}$$

(29)

Here, the probabilities P^*_Low and P^*_High represent the reliability level of the feature f_m, that is, the share of its distribution that lies below (P^*_Low) or above (P^*_High) the threshold value T_n,m. These values are propagated to the next level of the DT using,

$$\begin{aligned} P^{\textrm{b}}_{\textrm{Low}}=\, & {} P^{\mathrm{x-1}} *P_{\textrm{Low}} \end{aligned}$$

(30)

$$\begin{aligned} P^{\textrm{b}}_{\textrm{Hig}}=\, & {} P^{\mathrm{x-1}} *P_{\textrm{Hig}} \end{aligned}$$

(31)
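The reliability probabilities and their propagation can be sketched as below. The text does not fix the form of the feature distribution, so the sketch assumes a Gaussian centred on the measured feature value with standard deviation σ_m, takes P_High as the complement of P_Low, and reads the propagation as multiplying the probability reached at the parent level by the branch probability. All names and values are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def split_probabilities(value, sigma, threshold):
    """Probability mass of an assumed Gaussian feature below and above the threshold."""
    p_low = normal_cdf((threshold - value) / sigma)
    return p_low, 1.0 - p_low

def propagate(p_parent, p_branch):
    """Probability of reaching a child node: parent probability times branch probability."""
    return p_parent * p_branch

p_low, p_high = split_probabilities(value=1.0, sigma=0.5, threshold=1.0)  # (0.5, 0.5)
p_child = propagate(p_high, 0.5)                                          # 0.25
far_low, far_high = split_probabilities(value=0.0, sigma=1.0, threshold=2.0)
```

A value that sits exactly on the threshold is split evenly between the two branches, while a value two standard deviations below the threshold is sent almost entirely to the lower branch.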

Finally, the probability values, treated as independent forecasts of the origin of the data, are combined using,

$$\begin{aligned} P_{\textrm{i}} = \frac{\left[ \prod _{m=1}^{M} \left( \frac{p_{m,i}}{1 - p_{m,i}}\right) ^{\omega _{m}}\right] ^{a}}{1 + \left[ \prod _{m=1}^{M} \left( \frac{p_{m,i}}{1 - p_{m,i}}\right) ^{\omega _{m}}\right] ^{a}} \end{aligned}$$

(32)
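Equation (32) combines the individual probabilities through their odds. A minimal sketch follows, assuming that a acts as an exponent on the weighted product of odds and that every probability lies strictly between 0 and 1; the function name and the example weights are hypothetical.

```python
import math

def combine_probabilities(probs, weights, a=1.0):
    """Weighted product of odds raised to the power a, mapped back to a probability."""
    log_odds = sum(w * math.log(p / (1.0 - p)) for p, w in zip(probs, weights))
    odds = math.exp(a * log_odds)
    return odds / (1.0 + odds)

neutral = combine_probabilities([0.5, 0.5], [0.5, 0.5])            # 0.5
agree = combine_probabilities([0.8, 0.8], [0.5, 0.5])              # about 0.8
sharpened = combine_probabilities([0.8, 0.8], [0.5, 0.5], a=2.0)   # about 16/17
```

With equal weights that sum to one and a = 1, forecasts that agree are returned unchanged, whereas a > 1 pushes the combined probability further from 0.5.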

Thus, the overall working of the Modified RF is presented in Algorithm 2.

Thus, the proposed MRF is adapted for classification through a clumping method that uses K-means clustering analysis. This clump-based classification is used to distinguish ALS-causing from non-ALS-causing gene mutations. Such early prognostication of the disease can support a precautionary approach to treatment and help inhibit the adverse effects of ALS in individuals.

Although the proposed model can exhibit better performance for ALS classification, its outcomes are compared with five models, namely XG-Boost, RF, ANN, SVM and DT, to affirm its effectiveness in classifying ALS-causing and non-ALS-causing genes. Table 3 lists these classifiers along with their hyperparameter values, which vary from classifier to classifier.

Table 3 Hyperparameter for classifiers


Results and discussions

This section presents the performance of the proposed model in detecting the gene mutations that cause ALS in an individual. It describes the dataset used in the framework, the internal comparison, and the experimental results.

Dataset description

The proposed framework uses the ALS Genotype dataset, which contains information on genes and the genetic variants related to ALS. The dataset comprises many variants, of which the c9orf72 and SOD1 mutations are considered here. The data are used for frequency analysis, variant analysis, and prognostic indicators. The frequency analysis is done for gene testing of the c9orf72 and SOD1 variants, whereas the variant analysis identifies the specific type of mutation that occurs in the dataset. Finally, the prognostic indicator test is used to find the predictive factors (attributes) that are based on or closely related to the variants resulting in ALS.

Furthermore, statistical analysis is conducted to ensure the dataset’s quality. The test determines the difference between the observed and expected data and the relationship among the variables in the data. The chi-squared test is an ideal choice for interpreting the categorical variables present in the data, as it is a non-parametric test used to test hypotheses regarding the distribution of a categorical variable. Given these advantages, the dataset is subjected to the chi-squared test, which attains an overall P-value of 0.948. Moreover, an equal class distribution is established in the dataset by the class distribution technique. This affirms the quality of the dataset used in the proposed study.
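The chi-squared statistic compares observed counts with the counts expected under independence. The sketch below computes the statistic and its degrees of freedom for a contingency table; the tables are hypothetical and are not taken from the ALS Genotype dataset, and the conversion to a P-value is left out because it requires the chi-squared distribution function.

```python
def chi_square_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # count under independence
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

balanced_stat, balanced_dof = chi_square_statistic([[10, 10], [10, 10]])   # (0.0, 1)
skewed_stat, skewed_dof = chi_square_statistic([[20, 10], [10, 20]])       # (about 6.67, 1)
```

A statistic close to zero, as in the balanced table, corresponds to a large P-value, meaning that the observed counts differ little from the expected ones.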

Defects occurring in the ALS gene

The mutation in the c9orf72 gene that results in ALS is a hexanucleotide repeat expansion composed of the nucleotides Guanine and Cytosine in the order GGGGCC. This nucleotide set is repeated hundreds to thousands of times, resulting in ALS. The c9orf72 and SOD1 variants contribute to ALS through varying pathological mechanisms. One of them is toxic protein aggregation, in which the SOD1 protein is misfolded. This results in self-aggregating, prion-like domains, altered forms of RNA granules, and overall dysfunction of the protein quality control system. These are potential factors causing protein aggregation in the ALS genotype, which is a common cause of injury and death to the cells and contributes to the severity of the disease. Another mechanism is RNA foci formation, in which abnormal RNA products are formed in the c9orf72 variant. Other pathological dysfunctions are complete cellular dysfunction and oxidative stress. Neuroinflammation is also identified in the data as a response to disease development. To prevent these risks, variant-related risk communication has to be provided, for example through early detection, clinical trials, and personalized therapies.

Contribution of c9orf72 and SOD1 variants in ALS: These variants also contribute to Frontotemporal Dementia (FTD). The c9orf72 variants result in impaired processing of RNA molecules through mechanisms that are complex and difficult to characterize. SOD1 variants, in turn, are specific to hereditary ALS cases. SOD1 encodes an enzyme that protects cells from oxidative stress by neutralizing reactive oxygen species. Mutations in this gene result in increased toxic functioning of the cells, leading to protein misfolding and the aggregation of the ubiquitinated protein within the corresponding motor neurons. In ALS, these proteins replicate themselves without the involvement of DNA or RNA, spread to other parts of the nerves, and affect the corresponding neural function. This protein aggregation disrupts cellular function, impairs axonal transport, and initiates a nuclear stress response, resulting in the degeneration of motor neurons.

Effects of c9orf72 and SOD1 variants in analyzing ALS

Diagnosis and gene testing: These variants are used in diagnosing familial ALS with clear forms of genetic phases. This genetic testing affirms the presence of a mutation in the gene of an individual and their hereditarily passed ALS.

Disease progression prediction: The mutation occurring upon the c9orf72 and SOD1 variants indicates higher progression, earlier onset, and distinct clinical features.

Understanding the mechanism of disease: The mutation occurring in c9orf72 and SOD1 variants aids in offering insights into ALS development.

Therapeutic targets and drug discovery: Mutations in these variants are used in studying the defects caused by ALS in an individual. This research supports the development of the right therapeutic drug and the further understanding of motor neuron function and disease progression.

Clinical trials and personalized therapy: The mutations identified can aid enrollment in clinical trials and enable tailored treatments for ALS. Personalized therapies can be implemented to heal or halt the disease progression. While C9orf72 and SOD1 are significant, other gene variants can also contribute to ALS initiation. Some other vital insights into ALS and their advantages are tabulated in Table 4.

Table 4 Insight analysis and their contribution to ALS


Exploratory data analysis (EDA)

EDA is adapted for analyzing and investigating the dataset and summarizing the potential characteristics. This is done by employing data visualization approaches. This is one of the techniques used in extracting the vital features and the trends that the particular model uses. No model imposition is followed in EDA. This is done in an aspect of affirming the appropriateness of the data used in the study.

Figure 6 represents the correlations of the network nodes, which represent the functional relationships. It is used to find the co-expressed molecules of the gene and to explore the association between the gene network and the phenotype of interest. The graph reduces the overall complexity of interpretation and gives control over each pipeline in the procedure, which provides flexibility in selecting correlation methods. From the figure, it can be inferred that the network graph provides connectivity among the attributes, which can be used in finding the common aspects causing ALS. Some of these attributes are Extract_Date, Genotype, Coriell_Lot, Allele_one_RP_count, and Allele_two_RP_count.

Figure 7 shows a box plot that compares the allele counts and provides insight into their distribution for the various genotypes. It presents the differences in allele counts between the homogeneous and heterogeneous genotypes. Figure 8 represents the correlation among the numerical values present in the dataset, where positive and negative correlations indicate the relationships among the variables. For instance, ’Allele_one_RP_count’ and ’Allele_two_RP_count’ are positively correlated, which suggests a relationship between the two values. Correspondingly, Fig. 9 represents the presence of the minor allele and helps in understanding how often minor alleles occur in the dataset, and hence the overall genetic diversity. The least common allele is referred to as the minor allele. The presence of these alleles is found using dbSNP, and the common variants are distinguished from the rare variants. Likewise, Table 5 shows the values obtained by the model for different k-fold validations, in which better accuracy is obtained when k = 5.

When the number of folds k increases, accuracy also increases, which indicates a reliable and robust estimate of the model performance. A larger training set thus helps the model learn the underlying patterns in the data effectively.
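The relation between k and the size of the training set can be seen from a minimal sketch of k-fold splitting. It partitions the sample indices without shuffling, which is a simplification, and the function name and sample count are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k (train, validation) pairs without shuffling."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, validation))
        start += size
    return folds

folds_k5 = k_fold_indices(10, 5)   # each fold trains on 8 samples, validates on 2
folds_k2 = k_fold_indices(10, 2)   # each fold trains on 5 samples, validates on 5
```

With k = 5 each model is trained on 80% of the samples, compared with 50% for k = 2, which is the larger training set referred to above.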

Table 5 K fold validation


Performance analysis

The performance of the proposed model is evaluated in the subsequent section using metrics such as accuracy, precision, recall, F1-score, and the confusion matrix (see Figs. 10 and 11).
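These metrics can all be derived from a confusion matrix. The sketch below assumes that rows hold the true classes and columns the predicted classes; the three-class matrix is hypothetical and does not reproduce any result reported in this study.

```python
def metrics_from_confusion(cm):
    """Overall accuracy plus per-class (precision, recall, F1); rows are true classes."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp   # predicted as class i but wrong
        fn = sum(cm[i]) - tp                        # class i samples that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
        per_class.append((precision, recall, f1))
    return accuracy, per_class

# Hypothetical counts for three classes, 10 samples each.
cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 0, 10]]
accuracy, per_class = metrics_from_confusion(cm)   # accuracy is 27 / 30 = 0.9
```

The diagonal entries are the correct predictions for each class, and the off-diagonal entries are the mispredictions.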

Figure 12 exhibits the training and validation accuracy, which is used to assess whether the model overfits or underfits. A large gap between the two curves indicates over-fitting, with a higher score on training and a lower score on validation. From the figure, it is evident that the ANN model did not attain a better range of training and validation accuracy rates.

Similarly, Fig. 13 depicts the training and validation loss of the ANN, which is used to assess whether the model is a good fit. A high validation loss that decreases only gradually indicates that adding training samples does not enhance the performance on unseen data. The black line indicates the training data, and the blue line represents the validation data of the model.

Figures 14, 15, 16 and 17 exhibit the Confusion Matrix (CM) of the RF, ANN, XG-Boost, and proposed RF models. The CM of the RF model shows that the model made 357, 380, and 370 correct predictions for the normal, fully mutated, and intermediate stages of the ALS-causing gene, and 27, 9, and 11 mispredictions for the same stages.

Similarly, Fig. 15 represents the CM of the ANN in terms of correct predictions and mispredictions of the ALS gene mutation levels. This model made 359, 362, and 363 correct predictions and 1, 8, and 1 mispredictions for the normal, fully mutated, and intermediate stages of gene mutation resulting in ALS, respectively.

Figures 16 and 17 show the CM for the XG-Boost and proposed RF models in terms of correct predictions and mispredictions of the gene mutation level resulting in ALS. Normal mutation ranges are less capable, or not capable, of causing ALS, whereas intermediate mutation ranges in these variants can result in later ALS damage to an individual. Earlier detection and identification of these intermediate ranges of mutation in gene variants can reduce the rate of ALS progression and the severity of the damage. Fully mutated gene variants causing ALS can have adverse effects such as complete loss of speech, loss of motor neuron response, instability of body organs, and inability to move.

The CM of XG-Boost shows 364, 385, and 341 correct predictions for the normal, fully mutated, and intermediate stages of gene mutation resulting in ALS, and 20, 13, and 26 mispredictions for the same stages.

Table 6 Performance analysis for LDA


Table 7 Internal comparison results for PCA in terms of performance metrics


Table 8 Internal Comparison Results for Modified PCA in terms of Performance Metrics


CM of SVM and DT are portrayed in Figs. 15 and 16. Similarly, Fig. 17 presents the CM of the proposed M-RF, which outperforms the other models in the number of correct predictions. The MRF made 380, 382, and 365 correct predictions for the normal, fully mutated, and intermediate stages of gene mutation, and 1, 8, and 1 mispredictions. Thus, implementing improved and appropriate models can aid in the earlier detection of ALS, reducing adverse disease progression rates and inhibiting further health implications.

In Table 6, the comparative analysis for LDA is presented, in which the proposed model delivers better accuracy, precision, recall and F1-score values for ALS classification, with an accuracy of 96.89% and precision, recall and F1-score values of 95.89%, 94.35% and 94.86%, respectively. Figure 18 shows the graphical representation of the table.

Fig. 19 and Table 7 present the comparison results of the PCA used to select the potential components of the data for effective prediction of the ALS-causing genotype.

From Fig. 19 and Table 7, it is evident that the PCA used for selecting the important features attains a rate of 95.2% under all performance metrics. The PCA is probably more effective in selecting the features than the ANN, XG-Boost, RF, SVM and DT algorithms. On the other hand, Table 8 and Fig. 20 depict the internal comparative outcomes of the Modified PCA with other models.

Table 8 and Fig. 20 show that the proposed MPCA attains an overall accuracy rate of 98.4%, which is better than that of the RF, XG-Boost, ANN, SVM and DT models. The proposed MPCA effectively selects the components needed for better detection of ALS. Moreover, MPCA has overall precision, recall, and F1-score rates of 98%, which affirms the better working of the proposed MPCA model. The experimental outcomes show that the proposed model delivers better results than the existing models owing to the incorporation of M-PCA and M-RF. M-PCA helps reduce the loss of data, which ultimately improves the outcomes of the model, and M-RF supports better prediction and classification of the genotype.

Conclusion

ALS is a rare and progressive neurodegenerative disorder that affects both the upper and lower motor neurons. The molecular basis of ALS is highly elusive and requires high-throughput sequencing technologies. Progressive muscle atrophy and weakness result in paralysis and eventually death. No cognitive or physical behavior is seen. Similarly, disruption in executive body functions and dementia in the frontotemporal area are noted. Clinical and basic research suggest that multiple factors are involved in ALS etiology. About 10% of cases are familial ALS, inherited in an autosomal dominant manner, and 90% are classified as sporadic ALS. Mutations in more than 30 genes are associated with ALS. Thus, earlier detection and classification of the mutations occurring in ALS-causing genes are potential pipelines for keeping ALS from becoming severe. ALS management therapies can be effective in reducing the adversity of ALS. The toxin produced by star-shaped cells, termed astrocytes, affects the nearby neurons and gradually affects the motor response of an individual. Thus, as a contribution to the classification of ALS-causing and non-ALS-causing genes, the mutations in c9orf72 and SOD1 are analyzed. This is done by applying metaheuristic approaches, namely MPCA and MRF, for dimensionality reduction and classification. The MPCA performs three steps: Covariance Matrix Correlation, Eigenvector-Eigenvalue decomposition, and selection of the desired principal components. This is done as part of the Low-Importance Data Transformation. Selecting these potential components without any loss of features ensured better viability of selecting the attributes for ALS-causing gene classification.

This is followed by classification using the Modified RF, in which the clump detector is updated. The classification proceeds with a clustering approach using K-means, and the dimension-reduced data are grouped accordingly. The clustered data are then analyzed as either ALS causing or not ALS causing. The model is evaluated using probabilistic performance metrics. To demonstrate the performance of the proposed model, comparisons with RF, ANN, and XG-Boost are also conducted. The proposed model attains an accuracy rate of 0.98. Although the proposed model has several advantages, its computational time and the increase in model complexity with dataset size remain concerns. As a future application and a contribution towards medical etiology, the proposed model can be used to perform classification at early stages, lessening the effects on ALS-prone individuals and reducing the number of ALS cases in the future. Early prognosis of ALS can reduce its adverse effects and allow medication to begin at the initial stages. The reduced adversity can help lower mortality rates and speed up lab diagnostic sessions. Proper gene sequencing and splicing sessions can be performed early to detect mutations in ALS-causing genes and determine the appropriate cause factor. This can enhance the lab modalities and curative measures and reduce the number of ALS patients in the Saudi Kingdom.

Data availability

No datasets were generated or analysed during the current study.

Abbreviations

$\varDelta $ :: Gradient
$\nabla $ :: Whole gradient matrix
$\textbf{a}$ :: Gradient-subspace
$\textbf{X}$ :: Eigenvector
$\lambda $ :: Eigenvalues
$\textbf{D}$ :: Diagonal
$\textbf{M}$ :: Eigen-decomposition

Acknowledgements

The authors extend their appreciation to the King Salman Center for Disability Research for funding this work through Research Group no. KSRG-2023-395.

Author information

Authors and Affiliations

Department of Software Engineering, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al-Kharj, 11942, Saudi Arabia
Abdullah Alqahtani & Mohemmed Sha
Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al-Kharj, 11942, Saudi Arabia
Shtwai Alsubai
Department of Computer Science and Information System, College of Applied Sciences, Al Maarefa University, Riyadh, Saudi Arabia
Ashit Kumar Dutta

Authors

Abdullah Alqahtani
Shtwai Alsubai
Mohemmed Sha
Ashit Kumar Dutta

Corresponding author

Correspondence to Abdullah Alqahtani.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Alqahtani, A., Alsubai, S., Sha, M. et al. Examining ALS: reformed PCA and random forest for effective detection of ALS. J Big Data 11, 94 (2024). https://doi.org/10.1186/s40537-024-00951-4

Received: 17 March 2024
Accepted: 21 June 2024
Published: 10 July 2024
DOI: https://doi.org/10.1186/s40537-024-00951-4
