An Artificial Neural Network (ANN) is an algorithm inspired by biological neurons and is used to estimate functions that can depend on a large number of inputs, which are generally unknown [29, 30]. It is presented as an interconnected system of "neurons" that exchange messages. The connections between neurons carry numeric weights that can be adjusted based on experience, making neural networks adaptive to their inputs and able to learn. An ANN is thus a collection of a large number of interconnected processing neurons working together to solve a given problem (Fig. 3).
Like other approaches, an ANN-based POS tagger requires a pre-processing step before the actual tagger is built [11, 14]. The output of this pre-processing task is fed to the input layer of the neural network. From this pre-processed input, the network trains itself by adjusting the numeric weights of the connections between layers until the correct POS tag is produced.
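As a rough illustration of this pipeline (with a made-up two-sentence corpus and hand-picked features), the sketch below shows the kind of pre-processing that turns each word into a numeric feature vector suitable for the input layer of a neural tagger.

```python
# A minimal sketch of pre-processing for a neural POS tagger: each word
# is mapped to a numeric feature vector that the input layer can consume.
# The corpus and the feature choices are illustrative, not from the survey.
from sklearn.feature_extraction import DictVectorizer

tagged_corpus = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("She", "PRON"), ("reads", "VERB"), ("books", "NOUN")],
]

def word_features(sentence, i):
    word = sentence[i][0]
    return {
        "word.lower": word.lower(),
        "suffix2": word[-2:],                                   # last two characters
        "is_capitalized": word[0].isupper(),
        "prev_word": sentence[i - 1][0].lower() if i > 0 else "<s>",
    }

X_dicts, y = [], []
for sent in tagged_corpus:
    for i in range(len(sent)):
        X_dicts.append(word_features(sent, i))
        y.append(sent[i][1])

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(X_dicts)   # numeric matrix for the input layer
print(X.shape, y)
```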
Hidden Markov Model
The hidden Markov model (HMM) is the most widely implemented POS tagging method under the stochastic approach [6, 23, 31]. It follows a statistical Markov model in which the system being modelled is assumed to move from one state to another, with the states themselves unobserved. Unlike the plain Markov model, in an HMM the state is not directly observable to the observer; only the output, which depends on the hidden state, is visible. As stated in [23, 32, 33], the hidden Markov model is a well-known statistical model used to find the most probable tag sequence T = {t1, t2, t3, …, tn} for a word sequence W = {w1, w2, w3, …, wn} [33]. The Viterbi algorithm is a well-known method for decoding the most likely tag sequence for the words of a sentence when using a hidden Markov model.
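A minimal sketch of Viterbi decoding for an HMM tagger is given below; the tag set and the transition, emission, and initial probabilities are toy values invented for illustration, whereas in practice they are estimated from a tagged corpus.

```python
import numpy as np

# Minimal Viterbi decoder for an HMM POS tagger. The tag set, transition
# matrix A, emission matrix B and initial vector pi are toy values;
# in practice they are estimated from a tagged corpus.
tags = ["DET", "NOUN", "VERB"]
vocab = {"the": 0, "dog": 1, "barks": 2}

pi = np.array([0.6, 0.3, 0.1])                     # P(tag at position 0)
A = np.array([[0.1, 0.8, 0.1],                     # P(next tag | current tag)
              [0.1, 0.2, 0.7],
              [0.5, 0.4, 0.1]])
B = np.array([[0.9, 0.05, 0.05],                   # P(word | tag)
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

def viterbi(words):
    obs = [vocab[w] for w in words]
    n, T = len(tags), len(obs)
    delta = np.zeros((T, n))            # best path probability ending in tag j at time t
    back = np.zeros((T, n), dtype=int)  # back-pointers to recover the best path
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(n):
            scores = delta[t - 1] * A[:, j] * B[j, obs[t]]
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores.max()
    path = [int(np.argmax(delta[-1]))]  # best final state, then follow back-pointers
    for t in range(T - 1, 0, -1):
        path.insert(0, back[t, path[0]])
    return [tags[j] for j in path]

print(viterbi(["the", "dog", "barks"]))   # expected: ['DET', 'NOUN', 'VERB']
```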
Maximum Entropy Markov Model
The Maximum Entropy Markov Model (MEMM) is a conditional probabilistic sequence model [12, 34, 35]. Maximum entropy modeling aims to select, among all probability distributions that satisfy a given set of constraints, the one with maximum entropy. The constraints force the model to agree with a set of statistics collected from the training corpus.
The statistics most commonly used for POS tagging are how often a word was annotated with a particular tag and how often tags appeared in sequence. Unlike HMM, the maximum entropy approach makes it easy to define and include much more complex statistics, which are not confined to n-gram sequences [36]. This limitation of HMMs is therefore addressed by the MEMM, because arbitrary feature sets can be included. However, the MEMM approach suffers from the label bias problem, because it normalizes per state rather than over the whole sequence [35].
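Since the MEMM factors the sequence probability into per-state conditional distributions, a rough approximation can be sketched with a maximum-entropy (multinomial logistic regression) classifier conditioned on the current word and the previous tag; the toy corpus and the greedy left-to-right decoding below are simplifications for illustration only.

```python
# Rough MEMM-style sketch: a maximum-entropy (logistic regression) classifier
# estimates P(tag | word features, previous tag). The corpus and the greedy
# decoding are simplifications for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

def features(word, prev_tag):
    return {"word": word, "suffix2": word[-2:], "prev_tag": prev_tag}

X, y = [], []
for sent in tagged_corpus:
    prev = "<s>"
    for word, tag in sent:
        X.append(features(word, prev))
        y.append(tag)
        prev = tag

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

def greedy_tag(words):
    # Tag left to right, feeding each predicted tag back as the previous tag.
    prev, out = "<s>", []
    for w in words:
        tag = model.predict([features(w, prev)])[0]
        out.append(tag)
        prev = tag
    return out

print(greedy_tag(["the", "cat", "barks"]))
```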
Artificial intelligence methods for POS tagging
This section provides the general methodology of AI-based POS tagging, along with details of the most commonly deployed DL and ML algorithms used to implement effective POS tagging. Both DL and ML algorithms are broadly classified into supervised and unsupervised [22, 32, 37, 38]. Supervised learning algorithms extract hidden information from labeled data, whereas unsupervised learning algorithms find useful features and information in unlabeled data.
Machine Learning Algorithms
Machine Learning is a subset of AI comprising the strategies and algorithms that enable machines to learn automatically, using mathematical models to extract relevant knowledge from given datasets [15, 38,39,40,41,42]. The most common ML algorithms used for POS taggers are Artificial Neural Network (ANN), Naïve Bayes, HMM, Support Vector Machine (SVM), Conditional Random Field (CRF), Brill, and TnT.
Naive Bayes
In some circumstances, statistical dependencies exist between system variables. However, it may be hard to express the probabilistic relationships among these variables precisely [43]. A probabilistic graphical model, called a Naïve Bayesian network (NB), can be used to exploit these causal dependencies or relationships between the variables of a problem. The probabilistic model answers the question "What is the probability of a given word occurring before the other words in a given sentence?" using conditional probability [44].
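A minimal sketch of this idea, assuming a tiny made-up tagged corpus and add-one smoothing, picks the tag t that maximizes P(t)·P(word | t), with both probabilities estimated by simple counting:

```python
# Minimal Naive Bayes view of tagging: choose the tag t that maximizes
# P(t) * P(word | t), with probabilities estimated by counting from a
# toy tagged corpus (illustrative only).
from collections import Counter, defaultdict

tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

tag_counts = Counter()
word_given_tag = defaultdict(Counter)
for sent in tagged_corpus:
    for word, tag in sent:
        tag_counts[tag] += 1
        word_given_tag[tag][word] += 1

total = sum(tag_counts.values())

def most_likely_tag(word, alpha=1.0):
    best_tag, best_score = None, 0.0
    for tag, count in tag_counts.items():
        prior = count / total                                    # P(tag)
        vocab_size = len(word_given_tag[tag]) + 1
        likelihood = (word_given_tag[tag][word] + alpha) / (count + alpha * vocab_size)
        score = prior * likelihood                               # P(tag) * P(word | tag)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

print(most_likely_tag("dog"))   # expected: NOUN
```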
Hirpassa et al. [39] proposed automatic prediction of the POS tags of words in the Amharic language to address the POS tagging problem. Several statistical POS taggers, namely a Conditional Random Field (CRF) tagger, a Naive Bayes (NB) tagger, a Trigrams'n'Tags (TnT) tagger, and an HMM-based tagger, were compared on the same training and testing datasets. The empirical results show that the CRF-based tagger outperformed the others, achieving the best accuracy of 94.08%.
Support vector machine
The support vector machine (SVM) was first proposed by Vapnik (1998). SVM is a machine learning algorithm for applications that need binary classification and has been adopted for various kinds of domain problems, including NLP [16, 45]. Basically, an SVM algorithm learns a linear hyperplane that separates the set of positive examples from the set of negative examples with the maximum margin. Surahio and Maha [45] developed a prediction system for Sindhi parts-of-speech tags using the SVM learning algorithm. A rule-based approach (RBA) and SVM were evaluated on the same dataset; based on the experiments, SVM achieved better tagging accuracy than the RBA technique.
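As a rough sketch of SVM-based tagging (not the setup of [45]), the example below treats tagging as multi-class classification over simple contextual features using scikit-learn's LinearSVC; the corpus and features are illustrative placeholders.

```python
# SVM tagger sketch: each word becomes a feature vector and LinearSVC
# learns a linear separator per tag (one-vs-rest). Toy data for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

def features(sent, i):
    word = sent[i][0]
    return {
        "word": word,
        "suffix2": word[-2:],
        "prev_word": sent[i - 1][0] if i > 0 else "<s>",
        "next_word": sent[i + 1][0] if i < len(sent) - 1 else "</s>",
    }

X = [features(s, i) for s in tagged_corpus for i in range(len(s))]
y = [tag for s in tagged_corpus for _, tag in s]

tagger = make_pipeline(DictVectorizer(), LinearSVC())
tagger.fit(X, y)
print(tagger.predict([{"word": "cat", "suffix2": "at",
                       "prev_word": "the", "next_word": "barks"}]))
```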
Conditional random field (CRF)
A conditional random field (CRF) is a method for building discriminative probabilistic models that segment and label sequential data [12, 33, 46,47,48]. A CRF is an undirected graphical model over (X, Y) in which each vertex Yi represents a random variable whose distribution depends on an observation variable X, and each edge encodes a dependency between random variables. The dependency of Yi on X is defined through a set of feature functions f(Yi-1, Yi, X, i). Khan et al. [22] proposed a CRF-based Urdu POS tagger with both language-dependent and language-independent feature sets.
Their work used both deep learning and machine learning approaches with the language-dependent feature set on two datasets to compare the effectiveness of the ML and DL approaches. Likewise, in the Amharic study by Hirpassa et al. [39] described earlier, the CRF-based tagger outperformed the NB, TnT, and HMM-based taggers trained on the same training and testing datasets, achieving the best accuracy of 94.08%.
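A minimal CRF tagger sketch is shown below using the third-party sklearn-crfsuite package (an assumption for illustration, not the toolkit used in [22] or [39]); each sentence is represented as a list of per-word feature dictionaries and a parallel list of tags.

```python
# CRF tagger sketch with sklearn-crfsuite (pip install sklearn-crfsuite).
# X is a list of sentences, each a list of per-word feature dicts;
# y is the corresponding list of tag lists. Toy data for illustration.
import sklearn_crfsuite

tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

def word2features(sent, i):
    word = sent[i][0]
    return {
        "word.lower": word.lower(),
        "suffix2": word[-2:],
        "BOS": i == 0,                              # beginning of sentence
        "EOS": i == len(sent) - 1,                  # end of sentence
        "prev_word": sent[i - 1][0] if i > 0 else "",
    }

X = [[word2features(s, i) for i in range(len(s))] for s in tagged_corpus]
y = [[tag for _, tag in s] for s in tagged_corpus]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)

test_sent = [("the", ""), ("cat", ""), ("barks", "")]
print(crf.predict([[word2features(test_sent, i) for i in range(len(test_sent))]]))
```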
Hidden Markov model (HMM)
The hidden Markov model is among the most commonly used models for part-of-speech tagging [49,50,51,52]. HMM is appropriate in cases where something is hidden while something else is observed; here, the observed elements are the words and the hidden elements are the tags. Demilie [53] proposed an Awngi-language part-of-speech tagger using the hidden Markov model, defining a hand-crafted tag set of 23 tags and collecting 94,000 sentences. A tenfold cross-validation mechanism was used to evaluate the performance of the Awngi HMM POS tagger. The empirical results show that the uni-gram and bi-gram taggers achieve 93.64% and 94.77% tagging accuracy, respectively. In the Amharic comparison by Hirpassa et al. [39] discussed above, the HMM-based tagger was likewise evaluated against CRF, NB, and TnT taggers on the same training and testing datasets, with the CRF-based tagger achieving the highest accuracy of 94.08%.
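As a rough illustration of supervised HMM tagging, the sketch below uses NLTK's HMM trainer (assuming `pip install nltk`); the two-sentence corpus is a made-up placeholder, and in practice the transition and emission probabilities would be estimated from a large tagged corpus.

```python
# Supervised HMM tagger via NLTK. The trainer estimates transition and
# emission probabilities from the tagged corpus; toy data for illustration.
from nltk.tag import hmm

train_data = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)
print(tagger.tag(["the", "dog", "sleeps"]))   # list of (word, predicted tag) pairs
```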
Deep learning algorithms
Currently, deep learning methods are the most prominent approach in machine learning for automatically extracting complex data representations at high levels of abstraction, especially for extremely complex problems. Deep learning is a data-intensive approach that tends to produce better results than traditional methods (Naïve Bayes, SVM, HMM, etc.). For text-based corpora, sequential deep learning models perform better than feed-forward methods. In this paper, some of the common sequential deep learning methods, such as FNN, MLP, GRU, CNN, RNN, LSTM, and BLSTM, are discussed.
Multilayer perceptron (MLP)
The neural network (NN) is a machine learning algorithm that mimics the neurons of the human brain to process information (Haykin, 1999). The multilayer perceptron (MLP) is one of the most widely deployed neural network techniques in NLP and other pattern recognition problems. An MLP neural network consists of three types of layers: an input layer of input nodes, one or more hidden layers, and an output layer of computation nodes. The backpropagation learning algorithm is commonly used to train an MLP, which is therefore also called a backpropagation NN. Randomly assigned weights are set at the beginning of training; the MLP algorithm then adjusts these weights iteratively so that the hidden-layer representation minimizes misclassification [54,55,56]. Besharati et al. [54] proposed a POS tagging model for the Persian language that uses word vectors as the input to MLP and LSTM neural networks. The proposed model is compared with other neural network models and with a second-order HMM tagger, which is used as a benchmark.
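A minimal Keras sketch of such an MLP tagger is shown below; it classifies the tag of the centre word of a fixed word-window and is trained with backpropagation as described above. The dimensions and the randomly generated data are placeholders standing in for a real pre-processed corpus.

```python
# MLP tagger sketch in Keras: embedded word-window features are fed to one
# hidden layer and a softmax output over the tag set, trained with
# backpropagation. Data sizes and values are toy placeholders.
import numpy as np
import tensorflow as tf

vocab_size, n_tags, window = 50, 3, 3                        # toy dimensions
X = np.random.randint(0, vocab_size, size=(200, window))     # word-index windows
y = np.random.randint(0, n_tags, size=(200,))                # tag of the centre word

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window,)),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=16),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),            # hidden layer
    tf.keras.layers.Dense(n_tags, activation="softmax"),     # output layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)
print(model.predict(X[:1]).argmax(axis=-1))                  # predicted tag index
```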
Long short-term memory
A Long Short-Term Memory (LSTM) network is a special kind of RNN architecture capable of learning long-term dependencies. An LSTM can learn to bridge time lags of more than 1000 steps [14, 57, 58].
Bidirectional long short-term memory
A bidirectional LSTM (BLSTM) contains two separate hidden layers that process the information in both directions. The first hidden layer processes the input sequence forward, while the other processes it backward; both are then connected to the same output layer, which thus has access to the past and future context of every point in the sequence. Hence, BLSTMs beat both standard LSTMs and RNNs, and they also provide significantly faster and more accurate models [14, 58].
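The sketch below, assuming padded word-index sequences and toy dimensions, shows a typical Keras BLSTM sequence tagger; replacing the LSTM layer with a GRU layer would give the gated recurrent variant discussed next.

```python
# BLSTM sequence tagger sketch in Keras: forward and backward LSTMs are
# merged and a time-distributed softmax predicts one tag per token.
# Shapes and data are toy placeholders.
import numpy as np
import tensorflow as tf

vocab_size, n_tags, max_len = 100, 5, 10
X = np.random.randint(1, vocab_size, size=(64, max_len))   # padded word indices
y = np.random.randint(0, n_tags, size=(64, max_len))       # one tag per token

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=32, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=2, verbose=0)
print(model.predict(X[:1]).argmax(axis=-1))   # predicted tag index per token
```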
Gated recurrent unit
A gated recurrent unit (GRU) is an extension of the recurrent neural network that processes memories of sequential data by storing the prior state of the network and computing its target vectors based on that prior input [14, 58].
Feed-forward neural network
A feed-forward neural network (FNN) is an artificial neural network in which connections between the neuron units do not form a cycle. In feed-forward neural networks, information flows from the input layer through the network to the output layer [59].
Recurrent neural network (RNN)
A recurrent neural network (RNN), on the other hand, is an artificial neural network model in which the connections between processing units form cyclic paths. It is called recurrent because it receives inputs, updates the hidden layers based on prior computations, and makes predictions for every element of a sequence [33, 46, 60,61,62].
Deep neural network
In a standard recurrent neural network (RNN), information passes through only one hidden layer before reaching the output layer. A deep neural network (DNN), in contrast, stacks multiple processing layers, and deep recurrent architectures combine the DNN and RNN concepts [33, 63].
Convolutional neural network
A convolutional neural network (CNN) is a deep learning network structure best suited to data stored in array-like structures. Like other neural network structures, a CNN comprises an input layer, a stack of convolutional and pooling layers for extracting feature sets, and a fully connected layer with a softmax classifier as the classification layer [64,65,66,67,68].
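A minimal Keras sketch matching this structure is shown below: convolution and pooling layers extract features from a word window, and a fully connected softmax layer classifies the centre word's tag. The dimensions and randomly generated data are placeholders for illustration.

```python
# CNN sketch following the structure described above: an input layer,
# convolutional and pooling layers for feature extraction, and a fully
# connected softmax classification layer. Toy data for illustration.
import numpy as np
import tensorflow as tf

vocab_size, n_tags, window = 100, 5, 7
X = np.random.randint(0, vocab_size, size=(128, window))   # word-index windows
y = np.random.randint(0, n_tags, size=(128,))              # tag of the centre word

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window,)),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=32),
    tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(n_tags, activation="softmax"),   # classification layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=2, verbose=0)
```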
Evaluation metrics
This section describes the most commonly deployed performance metrics for validating ML and DL methods for POS tagging. All the evaluation metrics are derived from the confusion matrix, which summarizes the actual and predicted classes in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) [14, 55, 72]:
i. True Positive (TP): the word is correctly tagged, matching the label assigned by experts.
ii. False Negative (FN): the given word is not assigned any tag from the tag set.
iii. False Positive (FP): the given word is tagged wrongly.
iv. True Negative (TN): the occurrences correctly categorized as normal instances.
In addition to these, the evaluation metrics used in previous works are:
- Precision: the ratio of correctly tagged words to all the words tagged by the system:
$${\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
- Recall: the ratio of correctly tagged words to all the words tagged by the expert (also known as the detection rate):
$${\text{Detection Rate}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
- False alarm rate: the false positive rate, defined as the ratio of wrongly tagged samples to all the actually negative samples:
$${\text{False Alarm Rate = }}\frac{{{\text{FP}}}}{{{\text{FP}} + {\text{TN}}}}$$
- True negative rate: the ratio of correctly identified negative samples to all the negative samples:
$${\text{True Negative Rate}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}$$
- Accuracy: the ratio of correctly tagged words to the total number of instances (also known as detection accuracy):
$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
- F-Measure: the harmonic mean of precision and recall:
$${\text{F - Measure}} = 2\frac{{\left( {{\text{Precision}} \times {\text{Recall}}} \right)}}{{{\text{Precision}} + {\text{Recall}}}}$$
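The formulas above can be computed directly from the confusion-matrix counts; a small sketch with made-up counts is shown below.

```python
# Computing the metrics above from confusion-matrix counts.
# The counts are made-up numbers for illustration only.
TP, FP, FN, TN = 90, 10, 5, 95

precision = TP / (TP + FP)
recall = TP / (TP + FN)                       # a.k.a. detection rate
false_alarm_rate = FP / (FP + TN)
true_negative_rate = TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} F1={f_measure:.3f}")
```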