Survey of transformers and towards ensemble learning using transformers for natural language processing

Zhang, Hongzhi; Shafiq, M. Omair

doi:10.1186/s40537-023-00842-0

Survey
Open access
Published: 04 February 2024

Journal of Big Data volume 11, Article number: 25 (2024)

Abstract

The transformer is a well-known natural language processing model proposed by Google in 2017. With the extensive development of deep learning, many natural language processing tasks can now be solved by deep learning methods. After the BERT model was proposed, many pre-trained models such as XLNet, RoBERTa, and ALBERT were also proposed in the research community. These models perform very well in various natural language processing tasks. In this paper, we describe and compare these well-known models. In addition, we apply five existing and well-known models, namely BERT, XLNet, RoBERTa, GPT2, and ALBERT, to different existing and well-known natural language processing tasks, and analyze each model based on its performance. Few papers comprehensively compare various transformer models. In our paper, we use six types of well-known tasks, namely sentiment analysis, question answering, text generation, text summarization, named entity recognition, and topic modeling, to compare the performance of various transformer models. In addition, using the existing models, we also propose ensemble learning models for the different natural language processing tasks. The results show that our ensemble learning models perform better than a single classifier on specific tasks.

Introduction

In this section, we introduce the motivation of this research, the main research questions, and the structure of the paper.

Motivation

Natural language processing (NLP) stands as a technology that facilitates computer-human interaction through the medium of natural languages. It encompasses the artificial processing of human language, empowering computers to not only read but also comprehend it. There are many applications in the field of NLP, covering machine translation, speech recognition, grammar analysis, semantics, and pragmatics. The core of NLP is to segment text corpora for processing, employing tools like ontology dictionaries, word frequency analytics, and contextual semantic scrutiny to isolate the smallest units of meaning.

The key to NLP is to enable natural language communication between humans and computers, where computers not only grasp the meaning of textual language but also express intentions and thoughts in a similar manner. These two capabilities, 'natural language understanding' and 'natural language generation', form the two pillars of NLP, and both present formidable challenges. Even in the current theoretical and technological environment, creating a high-quality general NLP system remains a challenging long-term goal. Nevertheless, practical systems with substantial NLP capabilities have emerged for specific applications, and some have achieved commercial or industrial success. Examples include multilingual database interfaces, expert systems, machine translation platforms, full-text information retrieval systems, and automatic abstracting tools. Solving NLP tasks in general remains a significant open challenge.

The transformer model, proposed by Google in 2017, ushered in a wave of transformer-based models, including BERT, XLNet, and RoBERTa. However, few studies have comprehensively examined how these models perform on different NLP tasks. Consequently, this paper compares five transformer-based models across six distinct NLP tasks through a series of experiments. We then use the experimental findings to analyze the differences in performance among these models and the underlying factors that contribute to them. Beyond model comparison, we aim to combine the strengths of these models through ensemble learning techniques to obtain a more robust and better-performing collective model.

Research questions

The research questions of this work are as follows:

1. What is the role of transformer models in advanced NLP and text analytics?
2. How does the performance of different transformer-based models differ across NLP tasks, and why do these differences arise?
3. Is there a classification method that can combine the advantages of different models?

Contributions

This paper makes the following main contributions:

1. We read and organized papers on NLP tasks published in the past few years. Building on this literature, we analyzed the strengths of these papers and what remains to be added.
2. We use five existing and well-known transformer-based models from the literature to experiment on six types of existing and well-known NLP tasks. We compare the performance of the models based on the experimental results, and analyze the results from the perspective of model structure and training methods.
3. By analyzing the advantages and disadvantages of different models on different tasks, we propose two ensemble learning models based on the existing models and apply them to three NLP tasks. Experiments demonstrate that our proposed ensemble learning models outperform a single transformer-based model.

Structure of the paper

The “Introduction” section presents the motivation and the research questions. The “Background” section introduces the research background and describes the tasks and the models used. The “Review of related works” section surveys the related literature. The “Experimental setup” section briefly describes the experimental process and conditions. The “Results and discussion” section analyzes the results and discusses the reasons behind them. The “Using the ensemble learning methods” section proposes ensemble learning models and reports experiments on three natural language processing tasks. The “Gap analysis and next steps” section discusses the limitations of this work and the next stage of work. The “Conclusions” section summarizes the paper.

Background

When it comes to models for solving NLP tasks, many people may think of long short-term memory (LSTM) and other recurrent neural networks (RNNs). At present, however, LSTM is becoming less popular in NLP because it processes tokens sequentially and therefore parallelizes poorly. In addition, the transformer model [1] proposed by Google in 2017 has a stronger ability to extract features. On the Stanford Question Answering Dataset (SQuAD2.0) leaderboard, machine performance has exceeded human performance, which is largely due to the pre-trained model BERT, which is built on the encoder of the transformer model. In this paper, we review and compare five transformer models: BERT, XLNet, RoBERTa, GPT2, and ALBERT.

“Attention Is All You Need” [1] is a paper from Google that relies on attention alone and dispenses with recurrence. It proposed a new model called the transformer. Like the seq2seq model, the transformer is an encoder–decoder architecture; it consists of 6 encoder blocks and 6 decoder blocks, as shown in Fig. 1. The encoder maps the input text to an intermediate vector representation that carries the information in the text. The decoder then generates the output from this intermediate representation, and many NLP problems can be solved through this process.

Key natural language processing tasks

In this section, we briefly introduce some of the well-known NLP tasks in the literature.

Sentiment analysis

With the growth of the internet, people have become more likely to express themselves online, for example through product reviews on e-commerce sites and through social media posts about the products and quality of specific brands. These reviews have a huge impact on the product. For instance, brand companies can promptly respond to shifts in public sentiment on social media, especially when negative feedback increases. Sentiment analysis is a core application designed to measure how positive or negative a piece of text is.

Sentiment analysis is a common task in natural language processing, frequently found in shopping platform reviews. It is the key to product improvement. Through this analysis, businesses gain comprehensive insights into product attributes, enabling improvements across various dimensions.

Question answering

Question Answering Systems (QAS) [15] represent an advanced evolution of information retrieval systems. They possess the capability to provide accurate and concise responses in natural language to users’ queries, also expressed in natural language.

Within the area of natural language processing, QAS is an important topic. Its purpose is to address inquiries posed by individuals in natural language form, encompassing a wide array of practical applications. These applications span from intelligent voice interactions and online customer support to knowledge acquisition and empathetic chat services. QAS can be categorized into generative and retrieval-based systems, single-round and multi-round systems, and open-domain and domain-specific systems.

Translation

In today’s era of accelerated communication and internet advancements, the exponential growth of information and increasing global interconnectivity have accentuated the challenges of language barriers. Consequently, the demand for machine translation is on the rise. Within the ongoing wave of artificial intelligence, machine translation theory, technology, and future prospects have attracted high interest.

Machine translation involves the transformation of grammatical structures to align with the target language, followed by the translation of individual words from the source language to the target language. This process ensures effective cross-lingual communication in an increasingly interconnected world.

Text generation

In NLP, text generation is an important application area. Keyword extraction and text summarization are both applications in this area. The main techniques of text generation are synonym-based rewriting, template-based rewriting, rewriting based on statistical models and semantic analysis, and neural network-based methods.

Text summarization

Text summarization tasks revolve around creating a concise and coherent summary that retains the essence and core meaning of key information.

The procedure of utilizing computers to process extensive textual content to generate refined and succinct summaries is what defines text summarization. Summaries serve as efficient means for individuals to grasp a text’s primary content, enhancing both time savings and reading effectiveness. However, manual summarization is labor-intensive and time-consuming, making it inadequate to meet the ever-growing information demands. Hence, the emergence of computer-assisted automatic summarization became imperative. Automatic summarization primarily employs two methods: Extraction and Abstraction.

1. Extraction: Extraction, as an automatic summarization technique, generates summaries by extracting existing keywords and phrases from the source document. These extracted elements form the basis of the summary.
2. Abstraction: Abstraction, on the other hand, takes a generative approach to automatic summarization. It creates summaries by constructing abstract semantic representations and utilizing natural language generation technology to produce coherent summaries.
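
As a rough illustration of the extraction approach, the following sketch scores each sentence by the average document-level frequency of its words and keeps the top-scoring sentences. The scoring rule, the function name `extractive_summary`, and the sample text are assumptions made for illustration; they are not the method evaluated in this paper.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document (lower-cased).
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so that long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

document = (
    "Transformers rely on attention. "
    "Attention lets transformers weigh every token. "
    "The weather was pleasant."
)
summary = extractive_summary(document, num_sentences=1)
```

An abstractive system would instead generate new sentences, which typically requires a trained sequence-to-sequence model rather than a scoring rule of this kind.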

Named entity recognition

Named entity recognition (NER) [2] serves as a fundamental tool across various application domains, including information extraction, QAS, and machine translation. In essence, NER is tasked with identifying three primary categories of textual items: named entities, temporal expressions, and numerical expressions.

An NER system extracts these items from unstructured text. It can be extended to cover a wide range of entity types, such as product name, model number, and price, based on specific business requirements. Therefore, the concept of "entity" is broadly defined to include any text fragment that is relevant to a business need, and the goal of NER is to extract these specific entities from the text. To achieve this, NER systems typically use both rule-based and model-based approaches.

1. Rule-based methods: Rule-based approaches offer a straightforward means of entity extraction. They are particularly effective for entities with distinct contextual cues or with features that are explicit in the text.
2. Model-based methods: From a modeling perspective, named entity recognition is a sequence labeling task. The input to the model is a sequence of units, such as words and time expressions. The model’s output is also a sequence, in which each unit of the input sequence is assigned a label corresponding to its entity type.

In summary, NER plays a key role in the identification and extraction of relevant textual entities. It is a generalized approach adapted to specific business needs.
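
The rule-based approach can be sketched with a handful of regular expressions. The three entity types below (price, date, model number) and their patterns are hypothetical rules chosen for illustration; a production system would need many more rules or a trained sequence labeling model.

```python
import re

# Hypothetical rules: each label maps to one regular expression.
RULES = {
    "PRICE": re.compile(r"\$\d+(?:\.\d{2})?"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MODEL_NUMBER": re.compile(r"\b[A-Z]{2,}-\d+\b"),
}

def extract_entities(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])

entities = extract_entities("The XR-200 costs $149.99 and ships on 2024-02-04.")
```

Rules of this kind work well when the entity has a fixed surface form, which matches the observation above that rule-based methods suit entities with distinct cues in the text.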

Topic modeling

Topic modeling is a key technique for identifying the topics and central concepts in a given text.

At its core, topic modeling is a statistical approach that uncovers abstract topics by analyzing a collection of documents. It operates under the premise that if an article touches upon a specific topic, certain distinctive words related to that topic will recur frequently within the text. Typically, an article encompasses multiple topics, each with varying proportions.

From a structural perspective, topic modeling offers a method to reveal underlying themes in textual content. Each topic manifests as a probability distribution over words in the vocabulary. This framework, known as a topic model, assumes a generative role. In other words, it assumes that each word within an article arises from a dual probability selection process: one for choosing a topic and another for selecting a word from the distribution of that topic.
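
The two-step selection process can be sketched as follows. The two topics, their word distributions, and the topic mixture are made-up values for illustration only; in practice these distributions are inferred from a document collection.

```python
import random

# Hypothetical topics: each is a probability distribution over a tiny vocabulary.
TOPICS = {
    "sports": {"match": 0.5, "team": 0.4, "bank": 0.1},
    "finance": {"bank": 0.6, "stock": 0.3, "team": 0.1},
}

def generate_document(topic_mixture, length, seed=0):
    """Draw each word by first sampling a topic, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        # Step 1: choose a topic according to the document's topic proportions.
        topic = rng.choices(list(topic_mixture), weights=list(topic_mixture.values()))[0]
        # Step 2: choose a word from the word distribution of that topic.
        word_dist = TOPICS[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words

doc = generate_document({"sports": 0.7, "finance": 0.3}, length=10)
```

Fitting a topic model runs this process in reverse: given only the observed words, it estimates the topic proportions and the per-topic word distributions.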

In summary, topic modeling empowers us to uncover the underlying structure of textual data by identifying and characterizing distinct themes, providing valuable insights into the content and main ideas contained within.

Machine learning and deep learning approaches for NLP before transformer

In this section, we describe some of the well-known machine learning and deep learning methods for NLP tasks in the literature.

Rule-based methods

Establishing systems for vocabulary analysis, syntactic and semantic interpretation, question-answering, chatbots, and machine translation predominantly relies on rule-based frameworks. This approach leverages human introspective knowledge, does not require heavy reliance on data, and facilitates rapid system deployment. However, it is not without its drawbacks, notably in terms of limited robustness and generalization capabilities.

The rule-based methodology in NLP often involves abstracting extensive sets of sentences, tailored for specific human-computer interactions, into rules via grammar productions. These rules incorporate key information markers. Subsequently, the system can employ finite state automata generated from these rule sets to convert linguistic input into a parameter sequence. This sequence then guides the corresponding information processing methods. This approach not only enhances the efficiency of natural language understanding but also underscores the rule set’s scalability.

An NLP system rooted in grammar rule matching mainly focuses on the transformation of natural language input into machine-understandable parameter data. It achieves its functions primarily through three core modules: word segmentation, parameter labeling, and grammar rule matching.

Methods based on machine learning

The concept underlying machine learning-based approaches involves harnessing annotated data to construct a machine learning system, predicated on manually defined features. This system employs learning techniques to determine its parameters, which are then utilized during runtime for data processing and output generation. Notable successes in employing statistical methods have been witnessed in applications such as machine translation and search engines.

NLP encompasses a multitude of subtasks. Traditional machine learning methods, including support vector machines (SVM), Markov models, and conditional random fields (CRF), have been effectively employed to address these subtasks. Nonetheless, practical applications reveal certain limitations:

1. Dependency on training set quality: Traditional machine learning models rely heavily on the quality of their training data. The training set must be labeled manually, which slows down training.
2. Field-specific variations: Traditional machine learning models may exhibit varying performance across different domains, weakening their adaptability and underscoring the limitations of a singular learning approach. Creating a training dataset that suits multiple domains requires significant human resources for manual labeling.
3. Complex language features: High-level, abstract language characteristics are difficult to label or define manually. Consequently, traditional machine learning is limited to learning predefined features and cannot capture nuanced language features beyond them.

Methods based on deep learning

Deep learning models are increasingly applied to NLP tasks, utilizing architectures like Convolutional Neural Networks (CNNs) and RNNs.

When applying a fully connected network to NLP tasks, several challenges arise:

1. Variable input length: Different input samples can have varying input and output lengths, making it impossible to fix the number of neurons in the input and output layers.
2. Inability to share features: Features learned from different positions in the input text cannot be shared, leading to inefficiencies.
3. Model complexity: The model tends to have a large number of parameters and requires extensive computations.

To address these issues, RNNs come into play. RNNs scan input data, enabling parameter sharing across all time steps. At each time step, they not only receive input from the current moment but also incorporate information from the previous step, allowing past information to influence current decisions effectively.
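
A minimal sketch of this recurrence, assuming a single hidden unit and made-up weights, shows how the same parameters are reused at every time step and how sequences of different lengths are handled by the same loop.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Scan a sequence of any length with the same three parameters."""
    h = 0.0  # initial hidden state
    states = []
    for x_t in inputs:
        # The current input and the previous hidden state both feed the update.
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

short_states = run_rnn([1.0, 0.0])
long_states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
```

In `long_states` the effect of the first input is still present at the last step but shrinks at every step, which hints at why long-range dependencies are hard for a plain RNN.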

Traditional RNNs, however, face limitations. They tend to simply pass along all learned knowledge to the next time step without any processing. Consequently, early knowledge may be overwritten by more recent information, and long-range dependencies are challenging to capture.

LSTM models introduce a gating mechanism to mitigate the vanishing gradient problem in training with long sequences.
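
The gating mechanism can be sketched for a single LSTM unit as below. The parameter layout (a dictionary `p` with one `(w_x, w_h, b)` triple per gate) and the numeric values are assumptions for illustration; real implementations use weight matrices and learn the values during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM; `p` holds (w_x, w_h, b) for each gate."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c_t = f * c_prev + i * g          # additive cell-state update
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Hypothetical parameters: forget gate almost fully open, input gate almost closed.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h_t, c_t = lstm_step(x_t=0.5, h_prev=0.0, c_prev=1.0, p=params)
```

With these settings the cell state passes through the step almost unchanged, which is the behaviour that lets an LSTM carry information across long sequences.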

By learning word embeddings, deep learning enables the completion of natural language classification and understanding. Compared to traditional machine learning, deep learning-based NLP offers several advantages:

1. Continuous learning: Deep learning continuously learns language features based on word or sentence vectorization, grasping higher-level and more abstract language features to accommodate a broad range of NLP tasks.
2. Automatic feature learning: Deep learning eliminates the need to define features manually by automatically acquiring high-level features through neural networks.

Different models based on transformer

In this section, we introduce several models based on the transformer model available in the literature.

BERT [3]

Google AI’s introduction of the Bidirectional Encoder Representations from Transformers (BERT) model in 2018 was a milestone in the evolution of NLP. BERT exhibited exceptional performance on reading comprehension datasets, but its impact extended far beyond them. What set BERT apart from contemporaneous language models was that its representations are jointly conditioned on both left and right context in all layers. As a result, the pre-trained BERT model is a versatile tool for intricate NLP tasks, and tasks such as sentiment analysis and question answering often require only minor structural adjustments.

BERT’s training regimen comprises two main phases: pre-training and fine-tuning. During the pre-training phase, the model undergoes training on diverse unlabeled data, engaging in various pre-training tasks. The subsequent fine-tuning phase initializes the pre-trained BERT model and updates its parameters using task-specific datasets. Despite their shared architectural foundation, fine-tuned BERT models exhibit distinct parameterizations, underscoring their individuality. This nuanced difference between pre-trained and final models stands as a hallmark of the BERT framework.

GPT2 [4]

OpenAI introduced the Generative Pre-trained Transformer (GPT) model in the paper “Improving Language Understanding by Generative Pre-Training” and subsequently presented the GPT-2 model in the paper “Language Models are Unsupervised Multitask Learners.” These models have contributed significantly to natural language processing and have attracted substantial attention for their capabilities in language understanding and generation. The structures of GPT-2 and GPT are not much different, but GPT-2 is trained on a much larger dataset. BERT, which has attracted widespread attention, has roughly 340 million parameters in its largest configuration and set new state-of-the-art results on 11 NLP tasks. GPT-2 has as many as 1.5 billion parameters. It is trained on a dataset of 8 million web pages that covers a wide variety of topics. Both BERT and GPT-2 are built on the transformer. The difference is that BERT uses a bidirectional language model for pre-training, while GPT-2 uses a unidirectional (left-to-right) language model. Each GPT-2 prediction can therefore draw only on the preceding context, so the model cannot fully integrate context from both directions.

XLNet [5]

The XLNet paper divides current pre-training models into two types: autoregressive (AR) and autoencoding (AE). GPT is an AR method that uses the tokens seen so far to predict the next one. BERT is an AE method that masks some words in the input sentence and then reconstructs them. This process is similar to a denoising autoencoder. XLNet combines the advantages of the AR and AE methods by using a permutation language model (PLM). XLNet permutes the order in which the tokens are predicted and then predicts them autoregressively in that order. With this method, the prediction of a token can use context from both sides of it, and the dependence between tokens can be learned, as shown in Fig. 2.

To realize PLM, XLNet proposed Two-Stream Self-Attention and Partial Prediction. In addition, XLNet also uses the Segment Recurrence Mechanism and Relative Positional Encoding from Transformer-XL.
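
The idea behind the permutation language model can be sketched as follows: one factorization order is sampled, and each token is predicted from the tokens that come earlier in that order, wherever they sit in the sentence. The function name and the example sentence are illustrative assumptions; two-stream self-attention and partial prediction are not modeled here.

```python
import random

def permutation_contexts(tokens, seed=0):
    """For one sampled factorization order, list which positions each target may see."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # the prediction order is shuffled, not the sentence itself
    contexts = {}
    for step, position in enumerate(order):
        # Positions predicted earlier in the order are visible; they may lie
        # to the left or to the right of the target in the original sentence.
        contexts[position] = sorted(order[:step])
    return order, contexts

tokens = ["New", "York", "is", "a", "city"]
order, contexts = permutation_contexts(tokens)
```

Averaged over many sampled orders, every token is predicted with context from both sides, while each individual prediction remains autoregressive.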

RoBERTa [6]

After the XLNet model outperformed the BERT model on NLP tasks, Facebook proposed the Robustly Optimized BERT Pretraining Approach (RoBERTa). The RoBERTa model differs little from BERT structurally, but the methods used in the pre-training phase have changed. Compared with the BERT model, the RoBERTa model has more model parameters, a larger batch size, and more training data. The pre-training methods also differ in two ways. First, RoBERTa removes the next sentence prediction (NSP) task. Second, it uses dynamic masking. BERT generates a static mask once during data preprocessing, so a sequence is masked in the same way every time it is seen. RoBERTa instead generates a new masking pattern each time a sequence is fed to the model, so that over a large amount of training the model sees the same sequence under different masks.
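
The difference between static and dynamic masking can be sketched as below, under the assumption that dynamic masking means drawing a fresh pattern each time a sequence is used. The masking rule is simplified (each token is independently replaced by `[MASK]`), and the mask probability of 0.3 is chosen only to make the effect visible on a short sentence.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with [MASK] independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking: the pattern is chosen once and reused in every epoch.
static_masked = mask_tokens(tokens, random.Random(0), mask_prob=0.3)
static_epochs = [static_masked for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is used.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng, mask_prob=0.3) for _ in range(3)]
```

With dynamic masking the model is asked to recover different tokens of the same sentence on different passes, which makes better use of a fixed corpus.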

ALBERT [7]

That the size of a model can have an impact on its performance is a lesson learned from the ongoing advances in language representation learning. Surprisingly, experiments have shown that merely augmenting the number of hidden layers in a model does not necessarily improve performance. To address these challenges, Google researchers proposed a lightweight variant of BERT known as ALBERT, boasting significantly fewer parameters than the original BERT model.

ALBERT accomplishes parameter reduction through two distinctive strategies. First, it factorizes the embedding parameterization. The aim is to allow the hidden size to grow without expanding the parameter count of the word embeddings. To achieve this, ALBERT decomposes the large word embedding matrix into two smaller matrices, effectively decoupling the hidden layer size from the word embedding size. Second, ALBERT shares parameters across layers, which prevents the parameter count from growing with the depth of the network. By applying these methods, the model achieves a reduction in parameters with the least possible impact on performance.
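
The effect of both strategies on the parameter count can be checked with simple arithmetic. The vocabulary size, hidden size, embedding size, and per-layer count below are illustrative assumptions, not the configuration reported for ALBERT.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the embedding block, with or without factorization."""
    if embedding_size is None:
        return vocab_size * hidden_size  # a single V x H matrix
    # Factorized: a V x E lookup table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size

def encoder_params(per_layer, num_layers, shared):
    """Cross-layer sharing stores one set of layer weights instead of num_layers sets."""
    return per_layer if shared else per_layer * num_layers

V, H, E = 30000, 768, 128  # illustrative sizes
unfactorized = embedding_params(V, H)    # 23,040,000
factorized = embedding_params(V, H, E)   # 3,938,304
```

Because the V x E term does not depend on H, the hidden size can be increased while the embedding table stays the same size; only the small E x H projection grows.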

Word vectors in transformer models

Word vectors, commonly referred to as word embeddings, are the foundation of NLP. These embeddings provide a dense vector representation for words, capturing semantic relationships and nuances in meaning. For instance, words with similar meanings tend to be closer in the vector space, enabling models to understand semantic similarities and differences between words.

In the context of transformer models, word vectors play a pivotal role. The initial input to transformer models is the word embeddings of the text. These embeddings are then processed through multiple self-attention mechanisms and feed-forward networks present in the transformer architecture [1]. As the information flows through the layers of a transformer, the model refines these embeddings, capturing contextual information from surrounding words. This ability to understand context is a significant advancement over traditional word embeddings, which are static and lack contextual awareness.

The application of word vectors in transformer models, such as BERT, has led to breakthroughs in various NLP tasks [3]. For instance, models like BERT utilize these embeddings to understand the context around a word, enabling superior performance in tasks like question answering, sentiment analysis, and more. The dynamic nature of transformers, combined with the foundational knowledge captured in word vectors, has made them the state-of-the-art choice for many NLP applications.

What makes transformer model better for NLP tasks

In this section, we introduce several special mechanisms in the transformer model.

Self-attention [1]

Self-attention has much in common with attention in general. The transformer uses it to incorporate information from other relevant words in the sentence into the representation of the word currently being processed. Consider an example: “A wolf does not want to eat a rabbit because it is too thin.” Does “it” refer to the wolf or the rabbit? For a human reader this is easy to judge, but for a machine it is difficult. The self-attention mechanism allows the model to associate “it” with the rabbit.

First, the self-attention mechanism computes three new vectors for each token: a query, a key, and a value. The dot product of each query with every key is computed and divided by the square root of d_k, the dimension of the key vectors. The scaled scores are then normalized with softmax. Finally, the normalized weights are multiplied by the values, and the weighted sum is the attention output. The formula, taken from [1], is shown in formula 1.

$$\begin{aligned} \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}}\right) V \end{aligned} \tag{1}$$
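
Formula 1 can be sketched in plain Python as follows. The helper names and the small example matrices are assumptions for illustration; practical implementations use batched tensor operations, and the learned projections that produce Q, K, and V are omitted.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to one, and the first query places more weight on the first key, so the first output row is pulled toward the first value vector.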

Multi-head attention [1]

Multi-head attention combines several self-attention operations that run side by side. The transformer uses 8 attention heads, each operating on a lower-dimensional part of the representation, and concatenates their outputs. This enhances the expressiveness of each attention layer without substantially changing the number of parameters. The heads can also be computed in parallel, which makes the calculation more efficient.
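
A simplified sketch of the head-splitting idea is given below. It assumes self-attention on an input `X` and omits the learned projection matrices that [1] applies before and after the heads, so it only shows the split, per-head attention, and concatenation. In [1] the model dimension is 512 and there are 8 heads, so each head works on 64 dimensions; the sizes used here are smaller for readability.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Cut every row into num_heads contiguous slices of equal width."""
    d_head = len(X[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(num_heads)]

def multi_head_self_attention(X, num_heads):
    # Each head attends independently over its own slice of the features.
    heads = [attention(Xh, Xh, Xh) for Xh in split_heads(X, num_heads)]
    # Concatenate the head outputs back to the original width.
    return [[value for head in heads for value in head[t]] for t in range(len(X))]

X = [[0.1 * (t + c) for c in range(8)] for t in range(3)]  # 3 tokens, width 8
Y = multi_head_self_attention(X, num_heads=4)
```

Because the heads do not depend on one another, the list comprehension over heads could be executed in parallel, which is the source of the efficiency noted above.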

Positional encoding [1]

Positional encoding is a very important part of the transformer model. Self-attention can extract the dependency relationships between words, but it cannot capture their absolute or relative positions: if the keys and values are permuted together, the attention result is still the same. In NLP tasks, word order is very important, so the transformer model uses positional encoding to retain word order information. Each position in the sequence is assigned a unique index, and each index corresponds to a specific vector. These vectors are added to the word embeddings, thereby incorporating distinct positional information into each word representation.
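
One concrete way to build the position vectors is the fixed sinusoidal encoding used in [1], sketched below; learned position embeddings, as used in BERT, are an alternative. The function names are illustrative assumptions.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal table: even columns use sine, odd columns use cosine."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Columns 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position vector to the word embedding at each position."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(erow, prow)] for erow, prow in zip(embeddings, pe)]

pe = positional_encoding(3, 4)
same_word_twice = add_positions([[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]])
```

In `same_word_twice` the same word embedding appears at positions 0 and 1, yet the two resulting vectors differ, which is what allows attention to distinguish word order.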

Mask [1]

A mask hides certain values so that they have no effect when the parameters are updated. In the transformer model, two types of masks play pivotal roles: padding masks and sequence masks. Since input sequences in a batch can have varying lengths, shorter sequences are padded to a common length. Padding masks then mark the padded positions so that attention ignores them.

On the other hand, sequence masks are designed to prevent the decoder from accessing future information during the decoding process. At any given time step t, the output should depend only on the information up to time t. Sequence masks conceal the information occurring after time t, ensuring that the decoder remains unaware of future context. This mechanism is integral to the model’s autoregressive nature and its ability to generate output one step at a time, maintaining coherence and adherence to the order of the sequence.
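
Both kinds of mask can be sketched as Boolean tables. The padding id `PAD = 0` and the function names are assumptions for illustration; in practice the masks are applied by setting the masked attention scores to a large negative value before the softmax.

```python
PAD = 0  # hypothetical id reserved for padding

def pad_batch(sequences):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [PAD] * (max_len - len(s)) for s in sequences]

def padding_mask(batch):
    """True where a real token is present, False at padded positions."""
    return [[tok != PAD for tok in seq] for seq in batch]

def sequence_mask(length):
    """Lower-triangular mask: position i may attend to positions j <= i only."""
    return [[j <= i for j in range(length)] for i in range(length)]

batch = pad_batch([[5, 6, 7], [8]])
pad_mask = padding_mask(batch)
causal = sequence_mask(3)
```

In the decoder the two masks are typically combined, so that a position is visible only if it is both a real token and not in the future.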

Review of related works

Financial practitioners often pay attention to economic news because they can learn about stock trends from it. Past stock prices reflect past information, and the latest news contributes to changes in the stock price. Therefore, financial practitioners need to obtain positive or negative information from the latest news in a timely manner to make decisions, and sentiment analysis models can be used to analyze the information in the news. However, because of the domain-specific language and the unavailability of large-scale labeled datasets, financial sentiment analysis is challenging. Mishev et al. [8] conducted comprehensive research on NLP-based financial sentiment analysis methods. Their study covers multiple natural language processing methods, ranging from dictionary-based methods through word and sentence encoders to transformer models. Compared with the other methods evaluated, the transformer shows excellent performance. The main driver of the improvement in sentiment analysis accuracy is the text representation method, which feeds the semantics of words and sentences into the model.

Kaliyar et al. [9] studied the bidirectional model BERT. Unlike other word embedding models, BERT is based on a bidirectional idea and uses a transformer encoder-based architecture to compute word embeddings. Compared with a BiLSTM model, BERT with its transformer encoder is a more powerful feature extractor. On larger corpora, however, BERT has longer training and inference times and large memory requirements. Designing fine-tuned BERT models in future research may alleviate these practical problems. For small datasets, the performance improvement of BERT is more noticeable, which suggests that pre-trained networks like BERT may be critical to achieving good performance in such context-dependent tasks.

As mentioned earlier, the BERT model performs very well in natural language processing tasks, but in practice it requires a lot of computing power. Sun et al. [10] proposed a patient knowledge distillation (Patient-KD) method that compresses a large BERT model into a smaller one, which addresses the shortage of computing resources for running the full-scale model. Patient-KD distills from multiple layers, so the student model absorbs the knowledge embedded across the layers of the teacher network. The authors validated the method on a range of natural language processing tasks.

Pre-trained models have gained substantial traction in natural language processing. However, the adoption of large-scale models has also brought challenges for real-time processing and for settings with limited computing resources. To address these concerns, Sanh et al. [11] introduced DistilBERT, a distilled version of the BERT model. DistilBERT has fewer parameters, expedites training, and preserves most of the model’s performance. Their work demonstrated that a general-purpose language model can be trained through distillation, and it analyzed the contribution of the different components through ablation studies.

Bi-directional attention learning can greatly help self-attention networks (SAN) such as the BERT and XLNet models. Song et al. [12] proposed “Speech-XLNet”, an XLNet-like pre-training scheme for unsupervised acoustic model pre-training that learns speech representations with a SAN. The parameters of the pre-trained SAN are then fine-tuned under the hybrid SAN/HMM framework. They conjecture that, by shuffling the order of speech frames, the permutation in Speech-XLNet acts as a strong regularizer that encourages the SAN to use its attention weights to focus on global structure. In addition, Speech-XLNet learns speech representations by exploring the context. Various experiments show that Speech-XLNet improves both training efficiency and performance.

Effectively identifying emotional trends on social media can play an important role in maintaining personal emotional health and collective social health. Alshahrani et al. [13] fine-tuned the XLNet model to predict the sentiment of Twitter messages at both the message level and the user level. Because XLNet captures context jointly and uses multi-head attention to compute contextual representations, their method substantially improves on the state of the art on the benchmark dataset. Using deep consensus algorithms, they further improve accuracy significantly.

Contextualized word embeddings perform better than static word embeddings in many NLP tasks. But how contextual are the representations produced by models such as BERT? Ethayarajh [14] studied how words are represented in their natural context. The investigation revealed that the upper layers of BERT and similar models produce notably more context-specific representations than the lower layers, and that this greater context specificity consistently coincides with a higher degree of anisotropy.

Klein and Nabi [15] proposed a simple and effective question generation method. They couple GPT-2 and BERT and train them end to end, which supports semi-supervised learning. The questions generated by this method are of high quality and have a high degree of semantic similarity. The experiments also show that the method makes it possible to generate questions while greatly reducing the burden of full annotation. Word embeddings computed from bi-directional context help the pre-trained BERT perform better in question answering tasks. Finally, because BLEU and similar scores are weak metrics for evaluating generation ability, they recommend BertQA as an alternative metric for evaluating the quality of question generation.

The BERT model performs very well on many NLP tasks, and it is available not only in English but also in many other languages. Studies have found that a BERT model trained on a single language outperforms a BERT model trained on multiple languages, so training a language-specific BERT model greatly benefits NLP tasks in that language. Delobelle et al. [16] proposed RobBERT, a new Dutch model based on RoBERTa. Across different NLP tasks, it outperforms other BERT-based language models, and it performs especially well on smaller datasets.

Chernyavskiy et al. [17] proposed a system developed for SemEval-2020 Task 11 on news articles. Their model is based on the RoBERTa architecture, and the final model is an ensemble of the models obtained after training on the subtasks.

Polignano et al. [18] proposed ALBERTo, an Italian language understanding model trained on hundreds of millions of Italian tweets. After pre-training, the model is fine-tuned on specific Italian tasks, and its final results are better than those of other models.

The research of Moradshahi et al. [19] shows that BERT does not transfer knowledge effectively across different NLP tasks. They therefore proposed HUBERT, a modified version of BERT that adds a decomposition layer on top of it to separate symbols from their roles in the BERT representation. The HUBERT architecture uses tensor product representations, in which the representation of each word is constructed by binding two separate attributes together. In extensive empirical studies, HUBERT shows consistent improvement in knowledge transfer across various language tasks.
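The binding of two attributes can be illustrated with the outer product, the standard binding operation in tensor product representations. The sketch below is a toy example and not the HUBERT architecture: the vector values, the dimensions, and the helper names `bind` and `unbind` are assumptions.

```python
def bind(symbol, role):
    """Outer product of a symbol vector and a role vector (a matrix)."""
    return [[s * r for r in role] for s in symbol]

def unbind(binding, role):
    """Recover the symbol by multiplying the binding with the role vector.

    The recovery is exact when the role vector has unit norm.
    """
    return [sum(b * r for b, r in zip(row, role)) for row in binding]

symbol = [2.0, -1.0, 0.5]   # hypothetical content ("symbol") vector
role = [0.6, 0.8]           # hypothetical role vector with unit norm

binding = bind(symbol, role)        # 3 x 2 matrix representing the word
recovered = unbind(binding, role)   # approximately equal to `symbol`
print(binding)
print(recovered)
```

Because the two attributes live in separate vectors, either one can in principle be changed or reused without touching the other.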

Wu et al. [20] proposed two methods to identify offensive language on social media. The first uses supervised classification, and the second uses different pre-training methods to generate different data. They also preprocessed the data carefully, translating emoji into words, and then used the BERT model for identification.

Gao et al. [21] studied feature engineering models, building on related work with embedding-based neural networks, and tried combining the BERT model with deep neural networks. They then proposed several variants of the TD-BERT model and compared their performance with other methods on different NLP tasks. The results show that the TD-BERT model performs best. The experiments also show that complex neural networks that used to perform well with embeddings do not pair well with BERT, whereas incorporating target information into BERT stably improves performance.

González-Carvajal et al. [22] compared the BERT model with traditional machine learning methods in several respects. The traditional machine learning approach uses TF-IDF features to train the model. The article compares and analyzes four text classification experiments, each of which uses two kinds of classifiers: BERT and traditional classifiers.
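For readers unfamiliar with TF-IDF, the weighting can be sketched in a few lines. This is the plain textbook form (term frequency times log(N / document frequency)), not necessarily the variant used in [22]; libraries often apply smoothed versions. The function name `tf_idf` and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the market rallied", "the market fell", "the team won"]  # toy corpus
w = tf_idf(docs)
print(w[0])
```

A term that appears in every document, such as "the", receives zero weight, while a term unique to one document receives the highest weight. A traditional classifier is then trained on these sparse vectors.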

Baruah et al. [23] used classifiers based on BERT, RoBERTa, and SVM to detect aggression in English, Hindi, and Bengali texts. Their SVM classifier performed very well on the test set, ranking second in the official results in three of the six tests and fourth in another. A closer analysis, however, shows that the SVM classifier scores better because it is better at predicting the majority class, whereas the BERT-based classifiers are better at predicting the minority classes. The authors also found that their classifiers did not correctly handle accidental and deliberate misspellings, and that FastText word embeddings work better when dealing with orthographic variation.

Lee et al. [24] trained KR-BERT, a Korean version of the BERT model, on a small corpus. Because of the particular characteristics of Korean and the scarcity of corpora, a language-specific BERT model is important for representing the language. They therefore compared different tokenizers and progressively narrowed the minimal span of tokens to build a better vocabulary for their model. With these modifications, KR-BERT achieves better results even with a small corpus.

Li et al. [25] compared the BERT and XLNet models, focusing on their computational characteristics. They report two findings. First, the two models have similar computational characteristics. Second, XLNet’s relative position encoding costs 1.2 times the arithmetic operations and 1.5 times the execution time on modern CPUs; at this cost, XLNet obtains a better benchmark score.

Social media data spans multiple geographic locations and is therefore inherently multilingual, which leads to frequent code-mixing. Sentiment analysis of code-mixed text can provide insights into popular trends in different regions, but it is challenging because inferring the semantics of such data is non-trivial. The authors of [26] use the XLNet framework to address these challenges: they fine-tune the pre-trained XLNet model on the available data without any additional pre-processing.

Ekta et al. [27] propose a method for studying machine reading comprehension that uses eye-tracking data and examines the connection between human visual attention and neural attention. They show that this connection does not hold for the XLNet model, even though XLNet performs best on this difficult task. Their results indicate that the neural attention strategies learned by different architectures are very different, and that similarity between neural and human attention does not guarantee the best performance.

Natural language processing technology is widely used in real life, but models such as BERT and RoBERTa consume a lot of computing resources. Iandola et al. [28] observed that grouped convolutions have improved the efficiency of computer vision networks, so they applied this technique in SqueezeBERT. Experiments show that it runs 4.3 times faster than BERT.

The BERT model performs well in several NLP tasks, but its performance in certain specialized domains is limited. Chalkidis et al. [29] found that applying BERT in a specialized domain calls for one of two strategies: further pre-train BERT on a domain-specific corpus, or pre-train BERT from scratch on a domain-specific corpus.

Lee et al. [30] use the BERT model to produce word embeddings and then process the text by combining a bi-directional LSTM model with an attention mechanism. This combined method reaches an accuracy of 0.84.

Bashmal et al. [31] also used an ensemble learning method based on the BERT model. After preprocessing Arabic tweets, they encode the emoticons. By combining the BERT model with an improved BERT model for processing sentences, they obtained a high accuracy.

The transformer model has achieved great results in many NLP tasks, but it has many parameters and requires a lot of memory and computing resources, so obtaining a smaller and faster model has become an open problem. Nagarajan et al. [32] proposed a new method to reduce the size of the transformer model. It uses approximate computing to run on modest computing resources and prunes less important weights, which makes the model faster with only a small loss of accuracy.

Neural networks generally use one of two normalization methods: layer normalization or batch normalization. Shen et al. [33] explained why the transformer model uses layer normalization and then proposed a power normalization method that achieves better results.
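The difference between the two normalization methods is the axis over which the statistics are computed. The following sketch is illustrative only: it omits the learned scale and shift parameters and the running statistics used in practice, and the helper names and toy batch are assumptions. Power normalization [33] is not shown.

```python
import math

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """Normalize each example (row) across its own features."""
    return [normalize(row) for row in batch]

def batch_norm(batch):
    """Normalize each feature (column) across the examples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

# Toy batch: two examples with three features each.
batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 30.0]]
ln = layer_norm(batch)
bn = batch_norm(batch)
print(ln)
print(bn)
```

Layer normalization does not depend on the other examples in the batch, which is convenient for variable-length text sequences and small batches.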

While the transformer model has demonstrated proficiency in addressing numerous natural language processing challenges, fine-tuning the model remains an intricate endeavor. In their work, Li et al. [34] introduced a visualization framework aimed at providing researchers with an intuitive means of obtaining feedback during parameter adjustments. This framework enhances clarity during the model’s fine-tuning phase by offering researchers a more transparent view of its behavior and performance.

The BERT model is also applied in the medical field, where electronic health records are often combined with deep learning models to predict patient conditions. Inspired by this, Rasmy et al. [35] proposed Med-BERT, a model pre-trained on patient electronic health record data. Their experiments show that Med-BERT achieves higher accuracy on clinical prediction tasks.

With the development of the Internet, it has become easier for people to obtain news and information, and false information and fake news have become more common. In response, Schütz et al. [36] used multiple pre-trained transformer-based models to identify fake news. Their empirical findings underscore the robust capability of transformer models to detect fake news.

As the Internet continues to evolve, the prevalence of social media platforms has surged, with a substantial portion of content comprising satire. Identifying satirical language poses a unique challenge due to its distinctive nature. In response, Potamias et al. [37] introduced an approach that amalgamates a recurrent neural network and a transformer model to discern satirical language. Empirical results from their study highlight the enhanced performance of their proposed model when applied to the dataset.

The BERT model released by Google is trained on an English corpus. Applying BERT to other languages requires training the model on corpora in those languages. Souza et al. [38] trained the BERT model on a Brazilian Portuguese corpus and obtained good results on downstream tasks. They called the trained model BERTimbau.

The BERT model, which is based on the transformer model, performs well on many NLP tasks. González-Carvajal et al. [39] described why the BERT model performs better than traditional machine learning methods on natural language processing tasks and demonstrated the superiority of BERT through different experiments.

The BERT model is a pre-trained model based on the transformer model, while the ALBERT model is a lightweight BERT model. Choi et al. [40] compared the BERT and ALBERT models and then proposed improved versions of both, the Sentence-BERT model and the Sentence-ALBERT model. Experiments show that the proposed models perform better than BERT and ALBERT.

Koutsikakis et al. [41] trained the BERT model on Greek corpora and obtained GREEK-BERT, a model suited to Greek NLP tasks. In tests on natural language processing tasks, they found that their monolingual GREEK-BERT model outperforms the multilingual M-BERT and XLM-R models.

In their research, Hall et al. [42] conducted an extensive review of NLP models and their applications in the context of COVID-19 research. Their focus was primarily on transformer-based biomedical pretrained language models (T-BPLMs) and on sentiment analysis related to COVID-19 vaccination. The review analyzed 27 papers and found that the T-BPLM BioLinkBERT, which integrates document link knowledge and hyperlinks into the pretraining process, performed strongly on the BLURB benchmark. The study also covered sentiment analysis based on various Twitter API tools; these analyses consistently showed positive sentiment among the general public toward vaccination against COVID-19. The paper also outlines the limitations encountered during the research and suggests avenues for future work on using T-BPLMs in NLP tasks related to the pandemic.

Casola et al. [43] conducted an extensive study on the increasingly popular pre-trained transformers within the NLP community. While these models have showcased remarkable performance across various NLP tasks, their fine-tuning process poses challenges due to a multitude of hyper-parameters. This complexity often complicates model selection and the accurate assessment of experimental outcomes. The authors commence by introducing and detailing five prominent transformer models, along with their typical applications in prior literature, with a keen focus on issues related to reproducibility. One noteworthy observation was the limited reporting of multiple runs, standard deviation, or statistical significance in recent NLP papers. This shortfall could potentially hinder the replicability and reproducibility of research findings. To address these concerns, the authors conducted an extensive array of NLP tasks, systematically comparing the performance of these models under controlled conditions. Their analysis brought to light the profound impact of hyper-parameters and initial seeds on model results, highlighting the models’ relative fragility. In sum, this study underscores the critical importance of transparently reporting experimental details and advocates for more comprehensive and standardized evaluations of pre-trained transformers in the NLP domain.

In a separate vein, Friedman et al. [44] introduce a transformer-based NLP architecture designed to extract qualitative causal relationships from unstructured text. They underscore the significance of capturing diverse causal relations for cognitive systems operating across various domains, ranging from scientific discovery to social science. Their paper presents an innovative joint extraction approach encompassing variables, qualitative causal relationships, qualifiers, magnitudes, and word senses, all of which are instrumental in localizing each extracted node within a comprehensive ontology. The authors demonstrate their approach’s effectiveness by presenting promising outcomes in two distinct use cases involving textual inputs from academic publications, news articles, and social media.

In the realm of actuarial classification and regression tasks, Troxler et al. [45] delve into the utilization of transformer-based models to integrate text data effectively. They offer compelling case studies involving datasets comprising car accident descriptions and concise property insurance claims descriptions. These case studies underscore the potency of transfer learning and the advantages associated with domain-specific pre-training and task-specific fine-tuning. Moreover, the paper explores unsupervised techniques, including extractive question answering and zero-shot classification, shedding light on their potential applications. Overall, the results eloquently demonstrate that transformer models can seamlessly incorporate text features into actuarial tasks with minimal preprocessing and fine-tuning requirements.

Singh and Mahmood [46] offer a comprehensive overview of the current landscape of state-of-the-art NLP models employed across various NLP tasks to achieve optimal performance and efficiency. While acknowledging the remarkable success of NLP models like BERT and GPT in linguistic and semantic tasks, the authors underscore the significant computational costs associated with these models. To mitigate these computational challenges, recent NLP architectures have strategically incorporated techniques such as transfer learning, pruning, quantization, and knowledge distillation. These approaches have enabled the development of more compact model sizes, which, remarkably, yield nearly comparable performance to their larger counterparts. Additionally, the paper delves into the emergence of Knowledge Retrievers, a critical development for efficient knowledge extraction from vast corpora. The authors also explore ongoing research efforts aimed at enhancing inference capabilities for longer input sequences. In sum, this paper provides a comprehensive synthesis of current NLP research, encompassing diverse architectural approaches, a taxonomy of NLP designs, comparative evaluations, and insightful glimpses into the future directions of the field.

In a separate domain, Khare et al. [47] present an innovative application of transformer models for predicting the thermal stability of collagen triple helices directly from their primary amino acid sequences. Their work involves a comparative analysis between a small transformer model trained from scratch and a fine-tuned large pretrained transformer model, ProtBERT. Interestingly, both models achieve comparable R² values when tested on the dataset. However, the small transformer model stands out by requiring significantly fewer parameters. The authors also validate their models against recently published sequences, revealing that ProtBERT surpasses the performance of the small transformer model. This study marks a pioneering endeavor in demonstrating the utility of transformer models in handling small datasets and predicting specific biophysical properties. It serves as a promising stepping stone for the broader application of transformer models to address various biophysical challenges.

Discussion

One year after the transformer model was proposed, the BERT model, which uses the encoder part of the transformer, became widely known and was applied to many NLP tasks. Although BERT performs well in various NLP tasks, it is computationally intensive and slow. Papers [10, 11] therefore proposed knowledge distillation methods to compress the model. Paper [10] proposes two strategies: in one, the student model learns from the last k layers of the teacher model, and in the other, it learns from every k layers of the teacher model. The method described in [11] takes advantage of the shared dimensionality between the teacher and student networks: it initializes the student network from the teacher network by taking one layer out of every two.
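The layer-selection strategies discussed above can be sketched as simple index computations. This is one reading of the strategies, not the code of [10] or [11]: the 12-layer teacher, the interpretation of the first strategy as "the last few layers", the choice of which layer to keep from each pair, and all function names are assumptions.

```python
def last_layers(num_teacher_layers, count):
    """1-based indices of the last `count` teacher layers."""
    start = num_teacher_layers - count + 1
    return list(range(start, num_teacher_layers + 1))

def every_kth_layer(num_teacher_layers, stride):
    """1-based indices of every `stride`-th teacher layer."""
    return list(range(stride, num_teacher_layers + 1, stride))

def one_of_every_two(num_teacher_layers):
    """Keep one layer from each consecutive pair.

    Assumption: the first layer of each pair is kept; the text does not
    say which one is taken.
    """
    return list(range(1, num_teacher_layers + 1, 2))

TEACHER_LAYERS = 12  # assumption: a 12-layer teacher model

last = last_layers(TEACHER_LAYERS, 6)       # layers the student imitates
skip = every_kth_layer(TEACHER_LAYERS, 2)   # alternative: evenly spaced layers
init = one_of_every_two(TEACHER_LAYERS)     # layers copied to initialize a student
print(last, skip, init)
```

In all three cases a 12-layer teacher yields a 6-layer student; the strategies differ only in which teacher layers supply the knowledge or the initial weights.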

Not only is the encoder part of the transformer widely used in NLP tasks; the GPT model, which is based on the decoder part of the transformer, also performs well. In addition, the RoBERTa model, which builds on BERT, and the XLNet model, which improves on BERT, also perform well. Papers [8, 14, 22, 23] compare several models. Paper [8] makes two contributions. The first is the use of these models for sentiment analysis of financial news, for which there had been very little prior work. The second is an extensive set of comparison experiments covering many text representation methods and machine-learning classifiers. Paper [14] mainly compares the geometry of BERT, ELMo, and GPT-2 embeddings. By analyzing the vectors corresponding to words in different layers, it shows how the embedding representations of the three models differ. In [22], the authors compared, over several experiments, the classification performance of traditional machine learning using vocabulary extracted with a TF-IDF model against that of the BERT model. Their four experiments add empirical evidence of BERT’s superiority over classical methodologies in typical NLP problems. In [23], the authors compared BERT, RoBERTa, and SVM models in three languages. Interestingly, the best performing model in that article is the SVM, which shows that traditional machine learning methods can still surpass transformer models. That article also highlights the importance of data preprocessing, because misspelled words cause errors in the word embeddings, which lead to incorrect predictions.

A single model such as BERT or XLNet cannot solve every problem, so papers [13, 15, 17, 21] proposed solutions that combine transformer models with other machine learning methods. Paper [13] combines the XLNet model with deep consensus for sentiment analysis, which improves the accuracy of the model. Paper [15] studies question generation and question answering by combining the GPT-2 model with the BERT model, so that the two tasks support each other. Paper [17] uses RoBERTa as the main model but adds CRF layers and trains on two tasks; the results show that this combination is better than RoBERTa alone. In [21], the authors propose the TD-BERT model, which resembles a BERT network with a fully connected classification layer on top. The difference is that TD-BERT adds a max pooling layer after BERT, which allows the classifier to make better use of position information.
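The pooling step attributed to TD-BERT can be sketched as an element-wise maximum over selected token vectors. The sketch assumes that pooling is applied to the encoder outputs at the positions of the target words, which the text above does not state explicitly; the token vectors, the example sentence, and the function name are made up for illustration.

```python
def max_pool_over_target(token_vectors, target_positions):
    """Element-wise maximum over the vectors at the target positions."""
    selected = [token_vectors[i] for i in target_positions]
    return [max(values) for values in zip(*selected)]

# Toy encoder outputs (3 dimensions) for the sentence "the battery life is short".
token_vectors = [
    [0.1, 0.9, -0.3],   # "the"
    [0.7, 0.2, 0.4],    # "battery"
    [0.5, 0.6, -0.1],   # "life"
    [-0.2, 0.3, 0.8],   # "is"
    [0.0, -0.5, 0.2],   # "short"
]
target_positions = [1, 2]  # assumed target term: "battery life"

pooled = max_pool_over_target(token_vectors, target_positions)
print(pooled)
```

The pooled vector has the same dimensionality as a single token vector regardless of how many words the target spans, so it can be fed directly to a fully connected classification layer.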

The BERT model is trained on a very large English corpus, so it is not well suited to NLP tasks in other languages. Papers [16, 18, 24, 26] address other languages. Paper [16] trained on a large Dutch corpus to obtain the RobBERT model, and paper [18] built the ALBERTo model for Italian NLP tasks. Both models perform very well on NLP tasks in their respective languages. Korean is a morphologically rich, low-resource language that uses a non-Latin alphabet, so capturing its language-specific phenomena is also important. In [24], the authors proposed the KR-BERT model for Korean NLP tasks. It is trained on a smaller corpus, which shortens training time and makes training more efficient.

People use a lot of informal language on social media, which produces a lot of code-mixed text, such as sentences that mix two languages. Such sentences are a major obstacle to sentiment analysis. In [26], the XLNet model was used to address this problem, but it did not perform well. We think that the pre-training corpus is of little help for such code-mixed text, so traditional machine learning methods or other data preprocessing methods may be worth trying.

The attention mechanism is a very important part of the transformer model. Is it similar to human attention? Paper [27] gives an answer. The authors found that higher similarity to human attention is significantly correlated with performance for the LSTM and CNN models, but not for the XLNet model. XLNet achieved the best performance, which shows that similarity to human attention does not guarantee the best performance. It also suggests that a model may process text in a way that differs from human attention.

Several papers focus on making the model more efficient, including Iandola et al. [28] with their SqueezeBERT approach and Nagarajan et al. [32] with their use of approximate computing and pruning. Papers on fine-tuning and pre-training the BERT model for specific tasks or domains include Chalkidis et al. [29] with their domain-specific pre-training approach and Rasmy et al. [35] with their work on medical text. Other papers apply the BERT model to tasks such as sentiment analysis, fake news detection, and satire detection: Lee et al. [30] use a bidirectional LSTM with attention, Bashmal et al. [31] work on Arabic sentiment analysis, Schütz et al. [36] use a transformer-based approach to fake news detection, and Potamias et al. [37] use recurrent and convolutional neural networks for satire detection. A further group of papers focuses on comparative analysis and model development, including Souza et al. [38] with their BERTimbau approach for Brazilian Portuguese, González-Carvajal et al. [39] with their comparison of BERT to traditional machine learning models, Choi et al. [40] with their comparative study of BERT variants, and Koutsikakis et al. [41] with their work on GREEK-BERT for Greek NLP tasks.

Hall et al. [42] review the use of transformer-based biomedical pretrained language models in COVID-19 research, and Casola et al. [43] stress the importance of reporting experimental details and of standardization for reproducibility. Friedman et al. [44] present a joint extraction approach to extracting qualitative causal relationships from unstructured text, which has important implications for cognitive systems and graph-based reasoning. Troxler et al. [45] explore the use of transformer-based models to incorporate text data into actuarial classification and regression tasks. Singh and Mahmood [46] provide a comprehensive overview of current NLP research, including different architectures, a taxonomy of NLP designs, comparative evaluations, and future directions in the field. Khare et al. [47] demonstrate the potential of transformer models in predicting the thermal stability of collagen triple helices directly from their primary amino acid sequences.

Experimental setup

In our experiment, we mainly use six tasks to compare five different models. In this section, we introduce the datasets and various parameters used by the six tasks. Then we provide the results of the five models on these six tasks and explain the results.

Benchmark datasets

In this part, we briefly introduce the datasets available in the literature used in different tasks.

Coronavirus tweets dataset [48] for sentiment analysis task

For the sentiment analysis task, we use a dataset from the Kaggle website [48]. The data were collected from Twitter and then manually labeled. The dataset includes six columns: UserName, ScreenName, Location, Tweet At (time of the tweet), Original Tweet (content of the tweet), and Label (sentiment label). The Original Tweet column contains the raw, uncleaned tweet text, and Label takes one of five categories: Neutral, Positive, Extremely Positive, Negative, and Extremely Negative. Table 1 shows the dataset we used.
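A dataset with this layout can be read and mapped to integer class labels as sketched below. The column names follow the description above, but the two sample rows are invented placeholders rather than rows from the dataset [48], and the ordering of the label ids is an assumption; the headers of the actual file may also differ.

```python
import csv
import io

# Assumed ordering of the five sentiment categories, from most negative to most positive.
LABELS = ["Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive"]
LABEL_TO_ID = {name: index for index, name in enumerate(LABELS)}

# Invented placeholder rows that only mimic the six-column layout.
SAMPLE = """UserName,ScreenName,Location,Tweet At,Original Tweet,Label
1,101,London,16-03-2020,"Stocked up on groceries, feeling fine",Positive
2,102,,16-03-2020,Shelves are empty again,Negative
"""

def load_examples(handle):
    """Return (tweet text, label id) pairs from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [(row["Original Tweet"], LABEL_TO_ID[row["Label"]]) for row in reader]

examples = load_examples(io.StringIO(SAMPLE))
print(examples)
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an opened CSV file, and the raw tweet text would typically be cleaned before tokenization.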

Table 1 Coronavirus tweets dataset [48]
