A novel feature extraction method based on highly expressed SNPs for tissue-specific gene prediction

Gene expression provides a means for an organism to produce the gene products it needs to live. Variation in significant gene expression levels can distinguish a gene and the tissue in which it is expressed. Tissue-specific gene expression, often determined by single nucleotide polymorphisms (SNPs), provides potential molecular markers or therapeutic targets for disease progression. Therefore, SNPs are good candidates for identifying disease progression. The current bioinformatics literature uses gene network modeling to summarize complex interactions between transcription factors, genes, and gene products. Here, our focus is on the impact of SNPs on tissue-specific gene expression levels. To the best of our knowledge, no previous study distinguishes tissue-specific genes using SNP expression levels. We propose a novel feature extraction method based on highly expressed SNPs that uses k-mers as features, together with optimal k-mer and feature sizes for our approach. Determining the optimal sizes is still an open research question, as it depends on the dataset and the purpose of the analysis. We therefore evaluate our algorithm's performance over a range of k-mer and feature sizes using a multinomial naive Bayes (MNB) classifier on genes from the 49 human tissues in the Genotype-Tissue Expression (GTEx) portal. Our approach achieves practical performance with k-mers of size 3. Depending on the purpose of the analysis and the number of tissue-specific genes under study, feature sizes [7, 8, 9] and [8, 9, 10] are typically optimal for the machine learning model.

Studies of single nucleotide polymorphisms (SNPs) can reveal important biological insights, such as which genes predispose an individual to disease and which rare genetic variants can alter gene function. Moreover, SNP expression levels are powerful indicators that can be used to distinguish genes across tissues. Many studies have associated variants such as SNPs with disease progression, e.g., cardiovascular risks [4], obesity [5], hypoxia (a condition characterized by a limited oxygen supply) [6,7], and coronary artery disease [8]. Furthermore, most existing work has focused on developing disease prediction models based on SNPs associated with a single disease only, for example, breast cancer [9], inflammatory bowel disease [10], and obesity [11]. These studies use each SNP as a feature; only a small number of SNPs are used, as they are associated with a single disease. This is possible because a machine learning model derives its power from its ability to differentiate patterns in the data itself. Reference [12] is a recent survey that provides a detailed review of the different supervised machine learning algorithms for disease prediction.
However, we are not aware of any study that leverages the power of machine learning to identify patterns from a much larger number of SNPs, i.e., SNPs associated with multiple diseases, and then uses them to predict their respective genes. In particular, in this study, we focus on using patterns obtained from the expression levels of SNPs in each gene of a human tissue to predict their respective genes. Gene expression provides a means for an organism to produce the gene products necessary for it to live. Variation in significant gene expression levels can distinguish a gene and the tissue in which it is expressed. Tissue-specific gene expression, often determined by single nucleotide polymorphisms (SNPs), provides potential molecular markers or therapeutic targets for disease progression [13]. Therefore, SNPs are good candidates for identifying disease progression.
The GTEx portal provides essential data from which machine learning algorithms can mine valuable patterns in tissue-specific genes. The GTEx consortium is a community of more than a hundred researchers from various groups and countries working together to increase our understanding of how changes in genes affect human health and disease; this knowledge will improve health care for future generations [14]. The Common Fund started the GTEx Program in 2010, and the GTEx portal providing access to GTEx resources was launched in 2013 [15] and is still active. The GTEx portal catalogs genetic variation, including SNPs, together with gene expression levels. Each gene has a set of SNPs associated with different expression levels, including p-values indicating the degree of significance: the lower the p-value, the more significantly the SNP is associated with the gene. Currently, GTEx contains data on 49 different types of human tissue. Manual processing can be used to access the SNPs per gene, or even the SNPs for a group of genes, but this becomes a labor-intensive task if many SNPs or genes need to be analyzed.
The current bioinformatics literature uses gene regulatory network modeling to summarize complex interactions between transcription factors, genes, and gene products [16]. In our work, we focus solely on SNP expression levels for gene prediction and not on other interacting factors such as transcription factors; thus, we cannot use gene regulatory network modeling to study the effects of the expression levels. In contrast to existing machine learning approaches that use each SNP as a feature for single-disease prediction, we focus on tissue-specific gene expression across all 49 human tissues available in the GTEx portal. Tissue-specific gene expression may also be related to multiple diseases, resulting in a large number of SNPs overlapping among tissues. For example, SNP rs142433332 from the DARS2 gene may be related to leukoencephalopathy with brain stem and spinal cord involvement and lactate elevation; gait imbalance; and hypertensive disorder [17]. Moreover, single-disease studies use multiple samples per gene due to the small number of candidate genes. On the other hand, it is impractical to have multiple samples per gene in our study, as doing so would be too costly. More specifically, our work involves 305,572 tissue-specific genes.
Our initial exploration using each SNP as a feature for machine-learning-based prediction of tissue-specific genes confirms the challenge presented by such a large number of genes. As expected, the machine learning model cannot be used for gene prediction, as it produces very low accuracy (less than 1% for the 1925 genes of the Brain amygdala tissue). To overcome this challenge, we apply natural language processing (NLP) techniques in machine learning to predict tissue-specific genes using their highly expressed SNPs. The most critical task in any NLP application is word segmentation [18]. Nevertheless, it is challenging, especially for text without explicit word boundary delimiters, such as Chinese, Japanese, or even DNA/protein sequences. Moreover, even for space-delimited text like English, relying on white space alone does not generally result in adequate segmentation [19].
Neural networks and many machine learning algorithms cannot handle non-numerical data [20,21]. Thus, word embeddings, a technique that maps words from a vocabulary into a vector of real numbers, are used [22]. The most common word embedding technique is k-mer counting [20,23–25] due to its simplicity and efficiency: memory usage stays low as long as the k-mer size k is small. In this technique, sample data is transformed into a sequence of k-mer words stored in a one-dimensional numerical vector [20]. A dictionary of k-mer words is then used to count the number of times each k-mer word occurs in the sample. The studies in [20,23–25] use 5-mers on DNA barcode, ribosomal, splice, and promoter datasets. In contrast, the study in [26] proposes trimer usage while exploring DNA sequences wrapping around histone proteins in yeast datasets [27] as well as splice and promoter datasets. On the other hand, reference [28] is a survey of current gene prediction tools and recommends using hexamers, as hexamers were shown in 1992 [29] to be the most discriminative size for identifying protein-coding genes in DNA sequences.
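As an illustrative sketch of the k-mer counting technique described above (the function and its parameters are ours, not from the cited studies), a DNA sequence can be mapped to a fixed-length vector that counts each of the 4^k possible k-mers:

```python
from itertools import product

def kmer_count_vector(seq, k=3, alphabet="ACGT"):
    """Count every possible k-mer of `seq` into a fixed-length vector."""
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing ambiguous bases
            counts[index[kmer]] += 1
    return counts

# A length-n sequence yields n - k + 1 overlapping k-mers.
vec = kmer_count_vector("CCTGGGCACC", k=3)  # a 64-dimensional count vector
```

The vector length grows as 4^k, which is why a small k keeps memory usage low.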
Another important factor related to the k-mer size is the number of k-mers to be used in a feature, i.e., the feature size. Each feature should be discriminating enough, i.e., its degree of similarity to the features of other genes must be low. Thus, determining the optimal feature size is also essential for this study.
Thus, we conclude from the above studies that determining the optimal k-mer and feature sizes is still an open research question, with the answer depending on the dataset and the purpose of the analysis. An optimal solution for one dataset or application is not necessarily optimal for another. The contributions of this paper include:
• Proposing a novel feature extraction algorithm based on highly expressed SNPs that selects k-mers as features of varying sizes. This matters because it is computationally expensive for a machine learning model to learn from a large number of features and genes.
• Proposing optimal k-mer and feature sizes for tissue-specific gene prediction.
• Providing a comprehensive analysis of our experimental results on the GTEx tissue-specific datasets. We show that patterns learned from SNP expression levels with the highest and lowest p-values carry similar discriminatory power.

Method
This section presents the datasets we analyze and the necessary preprocessing, feature extraction methods, classification, and evaluation metrics.

Datasets
To test the effectiveness of our approach, we downloaded all 49 tissues from the GTEx portal [30] and selected genes with at least 25 SNPs. Table 1 summarizes the tissues and the number of selected genes in each tissue. For example, thyroid tissue has the most genes (12,805), while kidney cortex tissue has the fewest (615).

Feature-size-based extraction method
If we take each k-mer as a feature, we generate a large number of features that make learning not only challenging but also computationally expensive. To perform dimensionality reduction on the set of k-mers, we propose using the number of k-mers in a particular range, which we call the feature size, as shown in Fig. 1. The example in this figure demonstrates feature creation for the TSPAN6 gene, indicated by its GENCODE Gene ID, i.e., ENSG00000000003.14. First, all of the SNPs indicated by the 'alt' field are collected. Then, the SNPs are sorted based on their expression levels, indicated by the 'pval_nominal' field, before being concatenated into a string. Finally, features are created based on the feature size. While exploring several k-mer and feature sizes, we found that using three consecutive numbers as partition sizes gives enough discriminating power during the classification task. For example, the first feature consists of seven k-mers, i.e., "CCT CTG TGG GGG GGC GCA CAC", the second feature consists of eight k-mers, i.e., "CCT CTG TGG GGG GGC GCA CAC ACC", while the last feature consists of nine k-mers, i.e., "CCT CTG TGG GGG GGC GCA CAC ACC CCC". Both the k-mer size and feature size parameters can either underfit or overfit the machine learning model. Thus, the effects of varying these parameters are explored in the experiments.
The pseudocode of the feature-size-based feature extraction method is presented as Algorithm 1. Pseudocode FSE begins by reading all the SNPs per gene and then sorting them based on the lowest p-value to extract the top x highly expressed SNPs. (During our initial explorations, we were not sure of the exact number of SNPs to use when selecting genes. Taking a larger number, such as 100, reduces the number of genes per tissue that are selected for our experimental datasets. Taking x = 25 yielded the most genes in all tissues.) These SNPs are concatenated into a string. A sequence of k-mers is then generated, where the feature size, [a, b, c], is used to create a feature set consisting of features of lengths a, b, and c, which we call the forward feature set. In contrast, the reverse feature set is created by sorting based on the highest p-value. Note that, because the largest feature size considered in our work is [10, 11, 12], we only need the 14 (c + 2, where c = 12) most highly expressed SNPs for feature creation. Source code is attached as additional supplementary material (Additional file 1: Code S1).

Fig. 1 The overall idea of the feature-size-based extraction method, presented using an example gene indicated by its unique identifier, ENSG00000000003.14, which refers to the TSPAN6 gene. See the text for details.
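The steps above can be sketched in a few lines of Python (a minimal sketch under our reading of Algorithm 1; the tuple representation and function name are ours, and we assume each SNP's 'alt' field contributes a single base, as in the Fig. 1 example):

```python
def fse_features(snps, kmer_size=3, feature_size=(7, 8, 9), top_x=25):
    """Feature-size-based extraction: one string feature per entry in
    feature_size, built from the gene's most significant SNPs.

    snps: list of (alt_allele, pval_nominal) tuples for one gene.
    """
    # Sort by p-value, lowest (most significant) first, and keep the top x.
    ranked = sorted(snps, key=lambda s: s[1])[:top_x]
    # Concatenate the alt alleles into one string.
    seq = "".join(alt for alt, _ in ranked)
    # Generate the overlapping k-mer sequence.
    kmers = [seq[i:i + kmer_size] for i in range(len(seq) - kmer_size + 1)]
    # Each feature is the first a, b, or c k-mers joined as "words".
    return [" ".join(kmers[:n]) for n in feature_size]
```

Sorting by the highest p-value instead would produce the reverse feature set.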

Classification
We chose the multinomial naive Bayes (MNB) classifier due to its computational efficiency and optimality for classification tasks even when the conditional independence assumption is invalid [31,32]. Moreover, it is known to outperform even more sophisticated classification methods, such as decision trees and k-nearest neighbors [33].
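To make the classifier concrete, here is a toy pure-Python multinomial naive Bayes over space-delimited k-mer "words" (an illustrative re-implementation of ours with Laplace smoothing, not the classifier implementation used in our experiments; the sample data below is hypothetical):

```python
import math
from collections import Counter, defaultdict

class TinyMNB:
    """Toy multinomial naive Bayes over space-delimited k-mer 'words',
    with Laplace (add-one) smoothing."""

    def fit(self, samples, labels):
        self.word_counts = defaultdict(Counter)  # class -> word frequencies
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, y in zip(samples, labels):
            words = text.split()
            self.word_counts[y].update(words)
            self.vocab.update(words)
        return self

    def predict(self, texts):
        n = sum(self.class_counts.values())
        preds = []
        for text in texts:
            best, best_lp = None, -math.inf
            for y, counts in self.word_counts.items():
                total = sum(counts.values())
                # log prior + sum of smoothed log likelihoods
                lp = math.log(self.class_counts[y] / n)
                for w in text.split():
                    lp += math.log((counts[w] + 1) / (total + len(self.vocab)))
                if lp > best_lp:
                    best, best_lp = y, lp
            preds.append(best)
        return preds

# Hypothetical k-mer-word features for two genes.
samples = ["CCT CTG TGG GGG GGC GCA CAC",
           "CCT CTG TGG GGG GGC GCA CAC ACC",
           "AAA AAT ATG TGC GCA",
           "AAA AAT ATG TGC GCA GCC"]
labels = ["ENSG00000000003.14", "ENSG00000000003.14", "geneB", "geneB"]
pred = TinyMNB().fit(samples, labels).predict(["CCT CTG TGG GGG GGC"])
```

The model simply multiplies per-word likelihoods under each class, which is why it handles the bag-of-k-mers representation so cheaply.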
For the first experiment, we used the forward and reverse feature sets with the MNB classifier on two tissues: Adipose subcutaneous, representing a large dataset (10,884 genes), which we refer to as Test-1; and a small dataset, Brain amygdala (1925 genes), which we refer to as Test-2. These tissues were also chosen because they are quite distinct: one represents adipose subcutaneous samples from the leg, while the other represents neural tissue sampled from a structure deep within the brain. The purpose of this experiment was twofold. First, it determines the number x of top SNPs required for gene prediction. Second, it determines the optimal k-mer and feature sizes without underfitting or overfitting the machine learning model.
To further gauge the performance of the selected k-mer and feature sizes, we tested the machine learning model on all 49 human tissues shown in Table 1, which we refer to as Test-3, as a second experiment. Finally, the last experiment explores the efficacy of the k-mer and feature sizes on the 175 genes common to all tissues, resulting in a dataset comprising 8575 genes, which we refer to as Test-4. This experiment differs from the other two in that it shows the discriminative power of SNP expression levels for identifying genes across different tissues. In other words, even though different tissues may express the same gene, the SNP expression levels of the gene may differ and may be used to classify the tissues. These distinguishing genes are important as they are also potential biomarkers related to a disease or tissue.

Evaluation metrics
We evaluated our experiments using two machine learning evaluation metrics: accuracy and F1-score. All values lie between 0 and 1. Accuracy is the fraction of predictions that are correct. Precision is the fraction of predicted positives that are truly positive, while recall is the fraction of actual positives that are correctly predicted; high scores on both indicate a classifier that is both accurate and comprehensive. The F1-score considers both precision and recall by taking their harmonic mean. A micro-average weights each sample equally, whereas a macro-average weights each class equally. The former is preferable for imbalanced datasets, and the latter for balanced datasets; both report the same scores when the dataset is balanced. This paper reports macro-average F1-scores, as each gene has the same number of features.
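These metrics can be made concrete in a few lines of Python (a hypothetical helper of ours, shown only to pin down the definitions):

```python
def accuracy_macro_f1(y_true, y_pred):
    """Accuracy and macro-averaged F1 from per-class precision/recall."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # Macro average: each class is weighted equally.
    return accuracy, sum(f1s) / len(f1s)
```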
Machine learning models automatically learn patterns from training data and may adapt too closely to the dataset's peculiarities during training. Thus, it is crucial to evaluate the model on data that is unseen during training. However, the common approach of dividing the dataset into separate training and test subsets does not make the most efficient use of the data. An alternative is the stratified k-fold cross-validation technique. This technique ensures that each fold is a good representative of each class by splitting the dataset into k equal parts known as folds, after which k training/test cycles are performed. In each cycle, one fold is set aside for testing, and the remaining k − 1 folds are used for training. The reported evaluation score is the average of the scores obtained over the cycles. All data is eventually used for both training and testing, but each test set's data is never accessible to the model during training for that test set [34]. Thus, the feature extraction algorithm was evaluated using stratified 3-fold cross-validation, as we have three features per gene. Furthermore, we evaluated the efficiency of our method in terms of the time-space tradeoff using total run time and peak RAM usage.
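The stratified split described above can be sketched as follows (an illustrative round-robin implementation of ours, not the exact routine used in our experiments; the gene labels are hypothetical):

```python
from collections import defaultdict

def stratified_folds(labels, k=3):
    """Split sample indices into k folds, preserving class proportions
    by dealing each class's samples round-robin across the folds."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

# With three samples per gene and k = 3, each fold holds exactly one
# sample of every gene; each cycle tests on one fold, trains on the rest.
labels = ["TSPAN6", "TSPAN6", "TSPAN6", "geneB", "geneB", "geneB"]
folds = stratified_folds(labels, k=3)
```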

Results and discussion
Experiments were conducted on the Test-1, Test-2, Test-3, and Test-4 datasets on an otherwise idle 3.1 GHz Dual-Core Intel Core i5 with 16 GB of main memory. The operating system was macOS Big Sur.
The Test-1 dataset comprises 10,884 genes. Both the forward and reverse feature sets, consisting of 32,652 features each, used about the same run time (< 2 minutes) and peak memory (< 2.2 GB) across varying k-mer and feature sizes. Generally, underfitting occurs with a k-mer size of 2, and overfitting occurs with k-mer sizes of 4 and 5. The fit is optimal with a k-mer size of 3. With regard to feature sizes, underfitting occurs with feature sizes [7, 8, 9] and below. In contrast, overfitting occurs with feature size [10, 11, 12], and possibly [9, 10, 11]. The fit is always optimal with feature size [8, 9, 10]. See Table 2a for details.
Likewise, in Table 2b, we show the same measures for the Test-2 dataset, consisting of 1925 genes. The forward and reverse feature sets, comprising 5,775 features each, used similar run times (≤ 15 seconds) and peak memory (≤ 270 MB) across varying k-mer and feature sizes. It is worth mentioning that even though this is a much smaller dataset, underfitting still occurs with a k-mer size of 2, while overfitting still occurs with k-mer sizes of 4 and 5. The fit is still optimal with a k-mer size of 3. In contrast to the results seen in Test-1, the Test-2 model overfits with feature size [8, 9, 10]. The fit is always optimal with feature size [7, 8, 9]. As a result, we conclude that k-mer size 3 is an optimal fit for the machine learning model, whereas the optimal feature size scales with the number of genes. The forward and reverse feature sets in the two datasets gave similar accuracy and F1-scores, which shows that the SNPs with the highest and lowest p-values carry similar discriminatory power.

Table 3 shows the effects of varying feature sizes on the Test-3 dataset of 49 human tissues when the k-mer size is set to 3. As expected, a smaller feature size is a good fit for a tissue with fewer genes, whereas a larger feature size is a better fit for a tissue with a larger number of genes. Generally, depending on the number of genes and the purpose of the analysis, feature sizes greater than [6, 7, 8] but less than [9, 10, 11] are an optimal fit for the machine learning model. A similar observation was made for the reverse feature set and is attached as Additional file 2: Table S1.
The purpose of the final experiment is different from that of the previous experiments, as it focuses on the ability of SNP expression levels to differentiate the common genes that exist across the 49 human tissues. Table 4 summarizes the results for the Test-4 dataset, comprising 8575 genes, when a k-mer of size 3 is used. In this case, we conclude that feature size [9, 10, 11] is an optimal fit for the machine learning model.

Conclusions
Our results demonstrate that our approach yields practical performance on real-world SNP datasets. A total of 25 SNPs is sufficient to represent the highly expressed SNPs, as both the forward and reverse feature sets gave similar accuracy and F1-scores. Generally, a k-mer of size 3 is a good fit, while other k-mer sizes either underfit or overfit the machine learning model. With regard to feature sizes, depending on the number of genes and the purpose of the analysis, feature sizes [7, 8, 9] and [8, 9, 10] are typically optimal for the machine learning model. We also show that variation in the significant gene expression levels can distinguish genes and the tissues in which they are expressed. These distinguishing genes are potential molecular markers or therapeutic targets for disease progression. Therefore, these genes are good candidates for identifying potential SNPs associated with disease progression in future studies.
In this paper, we have (1) proposed a novel feature extraction algorithm based on feature sizes; (2) proposed an optimal k-mer of size 3 and a feature size of [7, 8, 9] or [8, 9, 10], depending on the number of genes and the application, for tissue-specific gene prediction; and (3) provided a comprehensive analysis of the experimental results. We have

Table 2 (continued) Total run time is reported in seconds and peak memory usage in megabytes. Accuracy is indicated by Acc and F1-score by F1.

Table 4 Effects of varying feature size on a dataset that comprises all the common genes that exist across all 49 human tissues. The k-mer size is set to 3. Accuracy is indicated by Acc and F1-score by F1.