Algorithms of the Möbius function by random forests and neural networks

The Möbius function µ(n) is known for containing limited information on the prime factorization of n. Its known algorithms, however, are all based on factorization and hence are exponentially slow in log n. Consequently, a faster algorithm for µ(n) could potentially lead to a fast algorithm for prime factorization, which in turn would throw doubt upon the security of most public-key cryptosystems. This research introduces novel approaches to compute µ(n) using random forests and neural networks, harnessing the additive properties of µ(n). The machine learning models are trained on a substantial dataset with 317,284 observations (80%), comprising five feature variables, including values of n within the range of 4 × 10^9. We implement the Random Forest with Random Inputs (RFRI) and Feedforward Neural Network (FNN) architectures. The RFRI model achieves a predictive accuracy of 0.9493, a recall of 0.5865, and a precision of 0.6626. On the other hand, the FNN model attains a predictive accuracy of 0.7871, a recall of 0.9477, and a precision of 0.2384. These results strongly support the effectiveness and validity of the proposed algorithms.


Introduction
The era of big data has revolutionized various industries and enabled significant advancements in data science, leading to transformative applications in fields such as healthcare, finance, and artificial intelligence. However, along with the tremendous potential of big data comes the paramount concern for data security during transmission and within applications. Despite rapid progress in data science, the theoretical foundation of data security has been somewhat overlooked, leading to a gap between data science development and robust security infrastructure. To address this issue, our study aims to take the first step in proposing machine learning solutions to bolster our understanding of data security in the era of big data.
It is a common belief that the prime factorization of a large integer n cannot be computed in polynomial time in log n. This belief forms the basis for the security of prominent cryptographic systems like the Rivest-Shamir-Adleman (RSA) public-key cryptosystem (Rivest et al. [1]) and has now become the security foundation of much of internet communication, monetary transactions, online commerce, digital financial instruments, and an important part of daily life. This so-called "not in class P" belief, however, has no theoretical or scientific evidence, while an increasing number of number theorists believe that the opposite is actually true (cf. Sarnak [2]). It is our long-term goal to seek scientific evidence to support this disbelief.
The Möbius function is defined as
$$\mu(n) = \begin{cases} 1 & \text{if } n = 1, \\ (-1)^k & \text{if } n \text{ is a product of } k \text{ distinct primes}, \\ 0 & \text{if } p^2 \mid n \text{ for some prime } p. \end{cases}$$
In other words, µ(n) vanishes if n is not square-free, i.e., if there is a prime p with p^2 dividing n. On the other hand, for square-free n's, µ(n) provides the factorization parity of n. Consequently, the Möbius function µ(n) detects two features of n: (1) whether n is square-free, and (2) if n is square-free, what its factorization parity is. Question (1) is well studied, with a known algorithm by Booker, Hiary, and Keating [3] that does not use the factorization of n. Their algorithm runs in deterministic subexponential time in log n. Although its computational speed is not the fastest among known algorithms, it is the first non-factorization algorithm of µ(n) when n is not square-free.
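For reference, the factorization-based baseline that this paper seeks to avoid can be sketched in a few lines. The following is only an illustration by trial division, not the method proposed in this paper; the function name `mobius` is our own choice.

```python
def mobius(n: int) -> int:
    """Return the Möbius function µ(n), computed by trial division."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                # p^2 divides the original n, so n is not square-free.
                return 0
            # One more distinct prime factor flips the parity.
            result = -result
        p += 1
    if n > 1:
        # Whatever remains is a single prime factor.
        result = -result
    return result
```

The loop runs up to the square root of n, which is exponential in log n; this is the cost that a non-factorization algorithm would have to beat.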
This paper attempts to answer Question (2). Since square-free moduli are used in the RSA cryptosystem, Question (2) may have a huge potential impact on cyberspace security. While µ(n) contains much less information than the prime factorization of n, all existing algorithms of µ(n) for square-free n still rely on factorization.
Finding properties, randomness, and/or faster algorithms of µ(n) are seemingly easier problems than finding a factoring algorithm in class P. In recent years, much theoretical effort on µ(n) for square-free n's has been focused on the randomness of µ(n) and its dynamic behaviors. In this paper, we will take a different approach. While current theoretical approaches cannot produce new algorithms of µ(n) for square-free n's, we propose novel algorithms by machine learning.
Training machine learning models for µ(n) and other multiplicative number theoretic functions may take a long time and require large datasets, and hence may only be done for n's of a moderate size. We hope, however, that the machine learning models obtained in this paper are "true" algorithms of µ(n) in the sense that they can be applied to much larger n's of the size currently used in the RSA cryptosystem, say about 10^300. It is in this sense that our machine learning models might provide efficient algorithms of µ(n). This is work in progress by the authors, which requires machine learning techniques for high-precision integers.
The function values of µ(n) constitute an infinite dataset, presenting typical, if not more complex, challenges for big data analysis and algorithms. The current study also offers an illustrative approach to address the broader challenges posed by big data analysis and algorithms.

Related work
Mathematical research has evolved notably since the 1960s, with the integration of computers aiding in pattern discovery and conjecture development, crucial for establishing theorems. A prime example is the Birch and Swinnerton-Dyer conjecture, a Millennium Prize problem, highlighting this blend of mathematics and computational tools.
Although publications in this field are few, existing research showcases the innovation in this interdisciplinary domain. Some research in various pure mathematical fields can be categorized as employing primitive machine learning models. Davies et al. [4] utilize fully connected feed-forward neural networks in Knot Theory to analyze hyperbolic knots, employing various datasets with a focus on predicting algebraic invariants. In the field of Algebraic Geometry, He, Hirst, and Peterken [5] apply deep neural networks to study dessins d'enfants, investigating their connections with modular subgroups and Seiberg-Witten curves. Bao et al. [6] employ machine learning techniques, such as neural networks and a Random Forest Classifier, in Geometry to analyze Hilbert series within the framework of quantum field theory, using datasets derived from geometric sources and the Graded Ring Database. Additionally, Bao et al. [7] implement multilayer perceptrons and convolutional neural networks in their Combinatorial Geometry study to predict the properties of lattice polytopes by analyzing data from 2D and 3D polygons. Moreover, He, Lee, and Oliver [8] use Bayesian classifiers in Number Theory to study the arithmetic of hyperelliptic curves, utilizing datasets comprising elliptic curves and Sato-Tate groups. Finally, Lample and Charton [9] demonstrate that neural networks, traditionally known for handling statistical or approximate problems, can excel in complex mathematical tasks like symbolic integration and solving differential equations, proposing a new syntax for mathematical problem representation and dataset generation methods.
Advanced machine learning models, particularly transformers [10], have shown impressive capabilities in mathematical applications. Notably, transformers excel in tasks like symbolic integration, solving differential equations, and cryptosystem attacks. Charton [11] explores the ability of small transformers to calculate the greatest common divisor of two positive integers, achieving 98% accuracy by optimizing training distribution and representation base. Additionally, Wenger, Chen, Charton, and Lauter [12] train transformers to perform modular arithmetic and combine them with statistical cryptanalysis to develop SALSA, a novel machine learning attack on cryptographic schemes based on the Learning with Errors problem.

Number theoretic background
The RSA public-key cryptosystem [1] uses two distinct primes p and q as private keys. The product n = pq and an exponent k ∈ Z+ are made public. Anyone can encode a message a by computing b ≡ a^k (mod n), which can be done by the method of successive squaring in polynomial time in log n. To decode b, a decoder needs to use p and q to compute the Euler φ-function φ(n) = (p − 1)(q − 1), solve the congruence ku ≡ 1 (mod φ(n)) by the Euclidean algorithm, and compute b^u (mod n) by successive squaring to recover the original message a ≡ b^u (mod n). All these steps are in polynomial time in log n, except the factorization n = pq if one does not know the private keys p and q. The belief that factorization cannot be done in class P makes the RSA crypto scheme secure.
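The encoding and decoding steps described above can be illustrated with a toy example. This is a minimal sketch with tiny, insecure parameters chosen by us for illustration; the function name `rsa_round_trip` is hypothetical. Python's built-in `pow` performs modular exponentiation by repeated squaring and, from Python 3.8 on, modular inversion.

```python
from math import gcd


def rsa_round_trip(p, q, k, a):
    """Encode message a with public (n, k), then decode using the private primes p, q.

    Returns (b, u, recovered): the encoded message, the decoding exponent,
    and the decoded message.
    """
    n = p * q
    phi = (p - 1) * (q - 1)          # Euler phi-function of n = pq
    if gcd(k, phi) != 1:
        raise ValueError("k must be coprime to phi(n)")
    if not 0 <= a < n:
        raise ValueError("message must satisfy 0 <= a < n")
    b = pow(a, k, n)                 # encoding: b = a^k mod n
    u = pow(k, -1, phi)              # solve k*u = 1 (mod phi(n)); Python 3.8+
    recovered = pow(b, u, n)         # decoding: b^u mod n
    return b, u, recovered
```

For example, `rsa_round_trip(61, 53, 17, 65)` returns a triple whose last entry is the original message 65. Every step is fast once p and q are known; only recovering p and q from n is believed to be hard.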
Integer factorization is believed to be slow because all known algorithms use trial division to detect prime factors of n, or use trial checking to find enough suitable smooth numbers. The latter approach is called a sieve method. The most advanced sieve method is the general number field sieve (cf. Pomerance [13]), which can factor n in about exp(c_1 (log n)^{1/3} (log log n)^{2/3}) time, for some constant c_1 > 0, which is much slower than a polynomial time c_2 (log n)^{c_3} = c_2 exp(c_3 log log n) for some constants c_2, c_3 > 0. For comparison, a factorization method by trial division has a computational time around exp(c_4 log n) for some constant c_4 > 0.
For square-free n's, additive relationships among values of µ(n) are identified in Luo and Ye [14]. More precisely, for finite X and 1 ≤ h ≤ 1,000, [14] shows that the conditional expectation of µ(n [19, (341)] (cf. also Matomäki, Radziwiłł, and Tao [20]) as modified for square-free integers, predicts that these two conditional expectations converge to each other when X → ∞. Since for algorithms and computational complexity of µ(n) the integers n are always finite, the former case applies and hence serves as a theoretical foundation of the present study.

General principles
A great variety of random forest methods are currently used in supervised machine-learning classification problems. For the purpose of this study, we implement the random forest algorithm proposed by Breiman [21]. The underlying principle of random forests is to aggregate a collection of random decision trees. To establish a classification model with a decision tree algorithm, the set of all feasible values of the feature variables is first partitioned into distinct, non-overlapping regions. The prediction for a given observation is the class that occurs most frequently among the training observations in the region where it falls. The objective is to find regions that minimize the classification error rate, that is, the fraction of the training observations in a region that do not belong to the most commonly occurring class in that region. Decision trees are easy to explain, can be displayed graphically, and can outperform commonly used linear models such as regression when the decision boundary is non-linear. However, tree models can be non-robust and suffer from high variance. One technique to overcome these disadvantages is the bootstrap: taking repeated samples from the training set, building a separate prediction model using each sample, recording the class predicted by each tree, and taking a majority vote (James et al. [22]). By introducing random perturbations to individual decision trees, the forest can explore a broader spectrum of potential tree predictors, which, in practice, yields enhanced predictive performance.
Specifically, the supervised classifier and random forests can be set up as follows ([21] and Genuer and Poggi [23]). Let $L_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be a learning sample composed of n couples of independent and identically distributed observations, coming from the same common joint unknown distribution $(X, Y)$. Assuming $X \in \mathcal{X}$, a space of dimension p, and $Y \in \mathcal{Y} = \{1, 2, \ldots, C\}$, the classifier $\hat{h} : \mathcal{X} \to \mathcal{Y}$ is a Borel measurable function which associates a prediction $\hat{y}$ of the response variable Y with any given input observation $x \in \mathcal{X}$. Let $(\hat{h}(\cdot, \Theta_1), \ldots, \hat{h}(\cdot, \Theta_q))$ be a collection of classification trees, where $\Theta_1, \Theta_2, \ldots, \Theta_q$ are q independent and identically distributed random variables independent of the learning sample $L_n$. The random forests predictor $\hat{h}_{RF}$ in classification is obtained by aggregating this collection of classification trees as the majority vote among individual trees, i.e.
$$\hat{h}_{RF}(x) = \operatorname*{arg\,max}_{1 \le c \le C} \sum_{l=1}^{q} \mathbf{1}_{\{\hat{h}(x, \Theta_l) = c\}}.$$

Random forest with random inputs (RFRI)
We apply RFRI to our target, the Möbius function. RFRI has two significant characteristics. Firstly, during the construction of each tree, a subset of mtry variables is randomly chosen at each node. This random selection is achieved by uniformly drawing mtry variables, without replacement, from the pool of p available input variables. Among these selected variables, the optimal split is determined by considering all possible splits. Secondly, the RFRI trees are not pruned. To summarize, the algorithm for random forests classification with RFRI trees can be outlined as follows:
1. Draw ntree bootstrap samples from the original data.
2. For each of the bootstrap samples, grow an unpruned classification tree in the following manner: at each node, randomly sample mtry of the predictors and choose the best split from among those variables.
3. Predict new data by aggregating the predictions of the ntree trees, that is, by taking the majority vote for classification.
There are two tuning parameters involved in building the RFRI and predicting values of the Möbius function. The first parameter, ntree, represents the number of trees in the model. A larger number of trees generally leads to better performance, and thus the value of ntree is selected based on the computational cost of the model. It is considered sufficiently large when further increases in the number of trees do not result in significant improvements in prediction accuracy [23]. The second parameter, mtry, determines the number of variables chosen at each node. The tuning process of mtry will be described in "Implementations and model validations" section. It will be shown that different values of mtry may lead to different prediction results.
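The three steps of the RFRI procedure and the roles of ntree and mtry can be made concrete with a small self-contained sketch. It is written in plain Python for illustration and is not the R implementation used in this study; the Gini impurity criterion, the rule that a node becomes a leaf when no candidate split lowers the impurity, and all function names are our assumptions.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Try every threshold on the given features; return (feature, threshold) or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, mtry, rng):
    """Grow an unpruned tree; each node only considers mtry randomly drawn features."""
    if len(set(labels)) == 1:
        return labels[0]                                   # pure leaf
    features = rng.sample(range(len(rows[0])), mtry)       # uniform, without replacement
    split = best_split(rows, labels, features)
    if split is None:                                      # no split helps: majority leaf
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], mtry, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], mtry, rng))


def predict_tree(tree, x):
    """Follow the splits down to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree


def random_forest(rows, labels, ntree, mtry, seed=0):
    """Step 1 and 2: one bootstrap sample and one unpruned tree per forest member."""
    rng = random.Random(seed)
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(len(rows)) for _ in rows]     # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], mtry, rng))
    return forest


def predict_forest(forest, x):
    """Step 3: majority vote over the individual tree predictions."""
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]
```

Rows are tuples of numeric features and labels are class values such as 1 and −1. Setting mtry equal to the number of features removes the random-input step and reduces the procedure to bagging, which is one way to see what mtry contributes.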

Feedforward neural network (FNN)
The non-parametric nature of neural networks makes them attractive choices for learning tasks where the underlying functional form is unknown, such as in the case of the Möbius function. A typical neural network architecture consists of layers of elementary processing units called neurons, interconnected based on the specific type and purpose of the network. The first layer, known as the input layer, consists of neurons equal in number to the distinct features in the input data. Subsequently, the data passes through neurons of custom-sized hidden layers. Finally, the output layer, containing one neuron for each predicted feature, completes the network architecture. Upon entering the input layer of a neural network, data is transformed into output by propagating through neurons via connections (edges) between them. In FNNs, data propagates strictly towards the output layer, i.e. neurons output exclusively to neurons of subsequent layers. Consider the a-th neuron in the L-th layer of a given neural network, which we will denote by $n_a^L$. Based on the topology and architecture of the neural network, $n_a^L$ receives signals from m neurons and transmits signals to q other neurons. Then, the neuron $n_a^L$ can be characterized by two parameters: an m-dimensional weight vector $(w_{1a}, w_{2a}, \ldots, w_{ma})^T$ and a bias term $b_a$. When $n_a^L$ receives signals $x_1, x_2, \ldots, x_m$ from neurons $n_1, n_2, \ldots, n_m$ in the preceding layer, the signal $o_a$ transmitted by $n_a^L$ to each of the q output connections is calculated using the following formula (Warner and Misra [24]):
$$o_a = \sigma_L\Big(\sum_{i=1}^{m} w_{ia} x_i - b_a\Big),$$
where $\sigma_L$ is an activation function specified for all neurons of layer L, and $\sum_{i=1}^{m} w_{ia} x_i - b_a$ is referred to as the input signal. This transformation is depicted in Fig. 1.
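The computation performed by a single neuron, an activation applied to the weighted sum of the incoming signals minus the bias, can be written out directly. The sketch below assumes a sigmoid activation for illustration; the function names are ours.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b, activation=sigmoid):
    """Signal emitted by one neuron: activation(sum_i w_i * x_i - b)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    input_signal = sum(wi * xi for wi, xi in zip(w, x)) - b
    return activation(input_signal)
```

For instance, with inputs (1.0, 2.0), weights (0.5, 0.25), and bias 1.0 the input signal is 0, so the sigmoid neuron emits 0.5.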

Application of machine learning to the Möbius function
The goal of this study is to find the Möbius function value µ(n) for a given large square-free number n.

Construction of a database
Let P be the set of some primes ≤ X, where X is a fractional power of n. Choose positive integers k and ℓ, and pairwise coprime positive integers m_1, ..., m_ℓ. The database consists of records of ℓ + 2 variables. The first variable is m, which is either a prime in P, the product of two distinct primes in P, and so on up to the product of k distinct primes in P. Consequently, the number of records in this database is $\sum_{1 \le j \le k} \binom{|P|}{j}$. The second variable of a record is µ(m) = ±1, which is known by the construction of m. The remaining ℓ variables are the remainders of m divided by the chosen integers m_ν, ν = 1, ..., ℓ.
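The construction above can be sketched as follows. Since m is built as a product of j distinct primes, its Möbius value (−1)^j is known without factoring. The function name `build_records` and the tuple layout are our own choices for illustration.

```python
from itertools import combinations
from math import prod


def build_records(primes, k, moduli):
    """Return records (m, mu(m), m mod m_1, ..., m mod m_l).

    m ranges over all products of j distinct primes from `primes`, 1 <= j <= k.
    """
    records = []
    for j in range(1, k + 1):
        mu = -1 if j % 2 else 1            # mu(m) = (-1)^j by construction
        for combo in combinations(primes, j):
            m = prod(combo)
            records.append((m, mu, *(m % mod for mod in moduli)))
    return records
```

With primes (2, 3, 5, 7), k = 2, and moduli (4, 9), this yields 4 + 6 = 10 records, in agreement with the counting formula above; one of them is (6, 1, 2, 6).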

Data balancing
The data will be imbalanced, with a severe difference between the two classes, Class 1 and Class −1. To address this imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is used ([25]). Instead of simply oversampling the minority class, SMOTE first selects examples from the minority class and finds a certain number of the nearest neighbors for an example in the (ℓ + 2)-dimensional feature space. Then, a randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in the feature space.
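The interpolation step of SMOTE described above can be sketched as follows. This is a simplified illustration that generates one synthetic example at a time with Euclidean distance; it is not the resampling routine used in the experiments, and the function name and default k = 5 are our assumptions.

```python
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic minority example between a point and a near neighbour."""
    rng = rng or random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    base = rng.choice(minority)
    # The k nearest other minority examples, by squared Euclidean distance.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    # A random point on the segment joining base and neighbour.
    gap = rng.random()
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
```

Repeated calls add synthetic minority examples until the two classes are of comparable size.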

Data splitting
In this study, a random selection of 80% of the data was utilized for training purposes, leaving the remaining 20% for testing. Consequently, the training dataset consists of 317,284 observations, while the test dataset contains 79,322 observations.
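A random 80/20 split of record indices can be sketched as below. Rounding the training size down is our assumption; it is consistent with the sizes reported above for 396,606 records. The function name and seed are illustrative.

```python
import random


def train_test_split_indices(n_records, train_fraction=0.8, seed=0):
    """Shuffle the indices 0..n_records-1 and cut them into train and test parts."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = int(n_records * train_fraction)   # rounded down (assumption)
    return indices[:n_train], indices[n_train:]
```

Calling `train_test_split_indices(396606)` gives 317,284 training indices and 79,322 test indices.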

RFRI classification
The experiment employs the RFRI classification model, which is implemented using the programming language R and the R libraries caret (Kuhn [26]) and randomForest (Liaw and Wiener [27]). The dataset used in this study consists of 396,606 observations of six variables, with the response variable µ(n) generated by the algorithm described in "Construction of a database" section. A list of prime numbers up to 263 is obtained and used to compute the products of two, three, and four distinct primes, resulting in integer values that are included in the "n" column. The maximum value recorded in the "n" column is 4,088,647,181, which is the product of prime numbers 241, 251, 257, and 263. The choice of other features is based on several considerations, including relevance, computational efficiency, and training difficulty. The selected features include integer values n from the "n" column and the residues of n modulo 4, modulo 9, modulo 25, and modulo 49.
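The reported dataset size and maximum value can be cross-checked by counting. The sketch below assumes, on our part, that the 56 primes up to 263 themselves are included alongside the products of two, three, and four distinct primes, as in the general construction; under that assumption the binomial counts add up to exactly 396,606.

```python
from math import comb, prod


def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit (limit >= 1)."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            for multiple in range(p * p, limit + 1, p):
                sieve[multiple] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]


primes = primes_up_to(263)
# Number of products of exactly j distinct primes, for j = 1, ..., 4.
counts = {j: comb(len(primes), j) for j in range(1, 5)}
total_records = sum(counts.values())
largest_n = prod(primes[-4:])               # 241 * 251 * 257 * 263
minority_size = counts[1] + counts[3]       # odd j, i.e. mu(n) = -1
```

Under the same assumption, only 27,776 of the records have µ(n) = −1, which illustrates the class imbalance that the resampling step is meant to address.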
To prepare the data for analysis, the feature values are pre-processed using the standardization procedure, which involves subtracting the mean and dividing by the standard deviation. To enhance the robustness and generalization performance of the model, three repeats of 10-fold cross-validation are conducted on the training set with the SMOTE resampling technique. This approach helps to address the issue of class imbalance and improve the model's ability to generalize to new data.
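The standardization step amounts to the following per-feature transformation. The use of the sample standard deviation (with divisor n − 1) is our assumption, and the function name is illustrative.

```python
from statistics import mean, stdev


def standardize(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sd = mean(column), stdev(column)
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sd for v in column]
```

In practice the mean and standard deviation would be estimated on the training data and then reused unchanged on the test data.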
We use a tuning algorithm to aid in managing the training process and improving model outcomes. Recall that mtry refers to the number of variables that are randomly selected to be sampled at each split, while ntree pertains to the number of trees in the random forest model. In this experiment, ntree is set to be 500 and the highest training accuracy is achieved when mtry is set to 3; see Fig. 2.

Single-layer FNN classification
To facilitate comparison and validation, we employed an FNN to analyze the identical training and test data sets utilized in our random forests experiment. The same SMOTE resampling technique and standardization procedures were utilized, and three repeated 10-fold cross-validations were performed.
The FNN was developed using the R package nnet (Venables and Ripley [28]), incorporating a single hidden layer. The results from the repeated cross-validation showed that the model attained its peak ROC value of 0.92 when configured with a single hidden unit, as depicted in Fig. 3. The refined model comprises one hidden layer with a single unit, employing the sigmoid activation function. The optimization process employed the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, coupled with a least-squares loss function, a full batch, and only one epoch. The weight decay is 0.1.

Learning ideas
The database will be constructed in a way to avoid factorization. In particular, we will not seek to have records to cover an entire neighborhood of n. There will be smaller values of m in the database, which we believe is an important feature of the database, because these m's and their Möbius function values may provide easier hints for the model to learn.
Residue classes modulo m_1, ..., m_ℓ are incorporated in the database because values of the Möbius function have hidden additive properties, as pointed out in Luo and Ye [14]. Values of the Möbius function are deterministic and are not random. For a given modulus m_i, the distribution of µ(m) on the arithmetic progression m ≡ a (mod m_i) is presumably different from the distribution of µ(m) on m ≡ b (mod m_i) when a ≢ b (mod m_i). This difference itself manifests an additive property of the Möbius function.
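One simple way to look at how µ(m) is distributed across residue classes is to tally the constructed values m by residue and by Möbius value. The sketch below does this for products of up to k distinct primes; it only describes the constructed database, not all integers, and the function name is our own.

```python
from collections import Counter
from itertools import combinations
from math import prod


def mu_counts_by_residue(primes, k, modulus):
    """Count pairs (m mod modulus, mu(m)) over products m of 1..k distinct primes."""
    counts = Counter()
    for j in range(1, k + 1):
        mu = -1 if j % 2 else 1            # mu(m) = (-1)^j by construction
        for combo in combinations(primes, j):
            counts[(prod(combo) % modulus, mu)] += 1
    return counts
```

Comparing `counts[(a, 1)]` with `counts[(a, -1)]` for each residue a shows whether the two classes are spread differently over the residue classes, which is the kind of signal a residue feature can offer to a classifier.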
The multiplicative properties of µ(m) are easy to understand based on its multiplicative definition. Its additive properties seem to be complicated and beyond human comprehension so far. It is our hope that a machine learning model may discover some of these additive properties and use them to formulate a fast algorithm.
A central issue for an algorithm of µ(n) is its computational complexity, which has two stages: the training time complexity and the run-time complexity. For a random forests model, the run-time complexity is determined by the depth of the trees. The training time complexity, on the other hand, is estimated to be O(NDT log N), where N is the number of points in the training set, D is the dimension of the data, and T is the number of decision trees (Kumar [29]). These complexity bounds will be used in dataset construction and model selection.
On the other hand, it is known (Kearns and Valiant [30]) that for a pseudorandom function f(n), even if it is polynomial-time computable, there is no way to learn it from examples in polynomial time (cf. Arora and Barak [31, 9.5.5]). It is our surmise that the Möbius function is not pseudorandom.
The proposed algorithm is of course a probabilistic algorithm in the sense that its accuracy is based on the metrics to be discussed in "Metrics for prediction performance" section below. We hope that the accuracy may be improved by adjustments to the database structure and model parameters.

Performance metrics for the RFRI classifier
To evaluate the performance of our models comprehensively, we considered essential metrics, including Accuracy, True Positive Rate (TPR), False Positive Rate (FPR), Precision, F1-Score, ROC, and Area Under the ROC Curve (AUC).
For the RFRI classifier, Class −1 was designated as the positive class. The Accuracy metric, which indicates the percentage of correct predictions out of the total number of predictions made, achieved an overall rate of 0.9493. This implies that 94.93% of instances in the test set were correctly classified by the model. However, it is important to note that the test set was imbalanced, with significantly more instances in Class 1 than in Class −1 (see Table 1). In such cases, relying solely on Accuracy may not provide a complete picture of prediction performance. Classifiers that constantly predict the majority class could still achieve high Accuracy, even if their performance in the minority class is poor. Therefore, additional metrics like TPR, FPR, Precision, and F1-Score are crucial, especially in imbalanced datasets. These metrics take into account the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN), providing a more nuanced evaluation of the classifier's performance. The definitions and corresponding results of these metrics for the RFRI model can be found in Table 2.
TPR, also known as Sensitivity or Recall, assesses how well a model can identify true positives. Specifically, it represents the percentage of actual Class −1 instances that are correctly predicted by the model. In this case, the TPR of the classifier is 0.5865, meaning that 58.65% of the actual Class −1 instances in the test set are correctly identified by the model. The TPR metric is particularly important in this case because correctly classifying Class −1 instances is more crucial than correctly classifying Class 1 instances due to the imbalanced test dataset, where Class −1 instances are the minority class.
Precision and F1-Score are alternative metrics that can be used to evaluate the performance of a predictive model, especially in the context of class imbalance. Precision measures the proportion of TP among all positive predictions made by the model. In other words, it is the percentage of positive predictions that are correct. In this experiment, when the trained model predicts Class −1, it is correct 66.26% of the time. High Precision is desirable because it means that the model is highly accurate when predicting positive instances, even if it may miss some positive cases. A model with high TPR, by contrast, succeeds in finding most of the positive cases in the test dataset, even though it may also wrongly predict some negative cases as positive. Both high Precision and high TPR are preferred, but in reality there is often a trade-off between them: increasing one metric often results in a decrease in the other. Therefore, it is crucial to find a balance between Precision and TPR based on the specific requirements of the problem. The F1-Score is computed by taking the harmonic mean of Precision and TPR. High values of the F1-Score are desirable since they indicate both high Precision and high TPR. In this case, the F1-Score is 0.6223 (Fig. 4).
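The metrics discussed above follow from the four confusion-matrix counts by their standard definitions, sketched below. The function names are ours, and any counts passed in are placeholders, since the entries of the confusion matrix are not reproduced here.

```python
def f1_score(precision, recall):
    """Harmonic mean of Precision and Recall (TPR)."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)             # Sensitivity / Recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "f1": f1_score(precision, tpr),
    }
```

As a consistency check, `f1_score(0.6626, 0.5865)` agrees with the reported F1-Score of 0.6223 up to rounding of the inputs.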
The ROC curve is a probability curve with the FPR, from 0 to 1, on the horizontal axis and the TPR, from 0 to 1, on the vertical axis. A perfect classifier would have a TPR of 1 and an FPR of 0, implying it can correctly classify all positives and negatives. The ROC curve of this ideal classifier would closely hug the upper left corner of the plot. In contrast, a random classifier would have a diagonal line from the bottom-left to the top-right corner, indicating that it performs no better than random guessing (cf. Fawcett [32]). The AUC metric measures a binary classifier's ability to distinguish between positive and negative classes. A perfect classifier would have an AUC of 1, signifying complete separation of the two classes.
In our case, we observed a near-perfect ROC curve with an AUC very close to 1, despite some misclassification cases. The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. It is important to note that the AUC only measures the ranking of probabilities and not the actual probability values themselves. As such, it is possible to have a perfect AUC even if the probabilities are poorly calibrated or biased. Additionally, when the AUC is perfect, a suitably chosen decision threshold, not necessarily 0.5, yields a zero error rate. However, the optimal threshold may vary depending on the specific application and the relative costs of FP and FN errors (cf. [32]).
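The ranking interpretation of the AUC can be computed directly by comparing every positive score with every negative score. The sketch below is a quadratic-time illustration with ties counted as one half; the function name is ours.

```python
def auc_by_ranking(positive_scores, negative_scores):
    """Probability that a random positive is scored above a random negative."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5          # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))
```

For example, positive scores (0.3, 0.2) against negative scores (0.1, 0.05) give an AUC of 1 even though every score is below 0.5, which illustrates that the AUC depends only on the ranking and not on calibration.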

Performance metrics for the single-layer FNN
In terms of performance metrics (as shown in Table 2), the FNN model exhibits an overall Accuracy rate of 0.7871, indicating that it accurately classified 78.71% of the instances in the test set. The TPR (Sensitivity/Recall) value is 0.9477, signifying that it correctly identified 94.77% of the actual Class −1 instances in the test set. On the other hand, the FPR value of 0.2248 indicates that 22.48% of Class 1 instances were incorrectly classified as Class −1. The Precision score of 0.2384 indicates that when the model predicts Class −1, it is accurate 23.84% of the time. Comparatively, the F1-Score of 0.3809 is much lower than that achieved by the RFRI model.

Conclusion
The proposed algorithms utilize the neighboring values of the Möbius function to predict µ(n) based on the additive relationships discovered in Luo and Ye [14]. Instead of factorization, the algorithms generate these neighboring values by multiplying a set of primes. To encourage the learning process towards additive structures, the algorithms select the residues of the neighboring integers modulo 4, 9, 25, and 49 as features.
The algorithms yield promising results with satisfactory performance metrics, indicating that the learning model has successfully uncovered hidden additive properties of the Möbius function. This outcome also suggests that a similar approach could be applied to other multiplicative number-theoretic functions and may pave a way towards developing efficient machine learning algorithms for integer factorization.
The novelty of this paper includes (1) application of machine learning technology to the study of the Möbius function, (2) machine learning models as algorithms of µ(n) with satisfactory performance metrics, (3) a roadmap towards efficient algorithms of µ(n), and (4) potential application to cyberspace security.

Limitation
The use of the R programming language imposed a constraint on the scale of this study. To extend the algorithms' applicability to larger integers, a different software solution may be necessary. Additionally, the choice of programming language limited the number of primes that could be used to generate neighboring integers through multiplication. Allowing for products of more primes would provide greater flexibility and reduce data imbalances.
The current study does not address the computational complexity and speed of the algorithms. Since this is the first of its kind, a significant amount of time was dedicated to model research and fine-tuning. However, we believe that with further development, the learning process can be formalized, and a model could be quickly constructed when presented with a large integer n.
Other than the constraints imposed by the programming language R, training of models for multiplicative number theoretic functions may be inherently complex computationally, in both time and memory. The present paper is an attempt to address this difficulty.

Future work
Our ultimate objective is to develop efficient machine learning algorithms for integer factorization. Initially, we experimented with deep learning using feedforward neural networks with multiple hidden layers, but we did not incorporate the resampling and standardization processes. Regrettably, the performance results indicate that the predictions were essentially random, with a chance of only around 50% of being classified as either positive or negative, even after fine-tuning. On the other hand, the implementation of a random forests model and a single-layer neural network model exhibited significantly better predictive performance. In our future studies, we also intend to explore the potential of combining deep learning techniques with other machine learning algorithms. We believe that combining deep learning techniques with approaches such as support vector machines or gradient boosting might yield better results for integer factorization. Additionally, we plan to further refine the current algorithms for the Möbius function and focus on developing algorithms for other multiplicative number-theoretic functions.
The potential of our machine learning models lies in their possible ability to be applied to large integers well beyond the scope of the training datasets. This is work in progress by the authors and might have a huge impact on cyberspace security.

Fig. 1 Feedforward propagation in a single hidden layer neuron.

Fig. 2 Model Tuning Results for RFRI Model: Achieving Peak Accuracy of approximately 0.96 with mtry = 2 after Repeated Cross-Validation.

Fig. 3 Model Tuning for the Single-layer FNN: Receiver Operating Characteristic (ROC) Reaches Maximum Value with One Hidden Unit after Repeated Cross-Validation and Various Weight Decay Options.

Table 1 Confusion matrix of the testing dataset

Table 2 Prediction performance metrics for two classifiers: RFRI and single-layer FNN