
Transforming the generative pretrained transformer into augmented business text writer


This study uses the transformer architecture of artificial neural networks to generate artificial business text for a given topic or theme. The aim is to augment business report writing, and general business writing, with the help of generative pretrained transformer (GPT) networks. The main focus is to provide a practical use case for GPT models with the help of big data. Our model has 355 million parameters and was trained for three months on GPU-enabled devices using 2.3 billion text tokens (now available as open-source data). The text tokens were collected through rigorous preprocessing, which included shortlisting the Subreddits of Fortune 500 companies and industries listed on “Reddit”, a US-based social news aggregation portal. After shortlisting, millions of user submissions made over five years were parsed to collect the URLs they contain, and 1.8 million working URLs were scrutinized. Business text was parsed from these uniform resource locators (URLs), cleaned, and converted into word embeddings. The results show that both sampling modes, interactive conditional and random, generate text paragraphs that are grammatically accurate and stick to the given topic.


Over time, the fields of artificial intelligence (AI) and machine learning (ML) have progressed by leaps and bounds. Nearly all fields benefit from cutting-edge technologies to improve their processes, and deep learning is one of these technologies. Big tech companies are reformulating their strategies to align with AI and ML. Deep learning is a branch of machine learning that enhances the model learning process with its deep layered architecture. As in many other areas, deep learning has proven itself a very effective and efficient technique for natural language processing tasks. Since computers cannot understand natural language on their own, enabling them to understand it and to process the information in a useful fashion has long been a focus of researchers and practitioners.

This study is inspired by the method implemented by the Google Brain team [47] and the work of OpenAI [36]. Before introducing the transformers implemented in that work, it is important to shed light on the recent past of natural language processing (NLP). Although NLP has deep roots, and its first breakthrough was Alan Turing's well-known paper ‘Computing Machinery and Intelligence’ [46], real progress in the field was made in the late 1980s, when machine learning algorithms came into the picture. The machine learning revolution permanently changed the approaches to NLP problems. At first, most effort went into rich text feature embeddings, which enable artificial neural networks (ANNs) to process text in numerical form. These embeddings were later fed to end-to-end neural networks that essentially map inputs to outputs, e.g. [32]. Later, seminal work on recurrent neural networks was published [40]. Recurrent models are very important for NLP because natural language carries lexical, syntactic, and semantic context, so previous words or characters matter greatly for machine translation and text prediction tasks. In 2002, Jürgen Schmidhuber and his students [18] came up with a better idea for neural network applications that involve long-term dependencies, named Long Short-Term Memory (LSTM). LSTM uses gating and state mechanisms that keep important information from the previous sequence and memorize the previous state, which is accumulated into the current state to predict the next element of the sequence. The research community has made many enhancements to the recurrent neural network model. The most prominent are sequence-to-sequence (seq2seq) models [24, 44]. Seq2seq models essentially work with recurrent encoders and decoders, encoding the output of the previous step and combining it with the current input. The next enhancement to recurrent models was the attention mechanism, see [55, 56]. The attention mechanism has proven very effective in machine translation, where pairs of sentences in two languages are mapped together with encoders and decoders.

Looking back at this short history of NLP techniques, one common limitation of all these models is that they are hungry for computational resources and very slow. NLP corpora normally involve an enormous amount of training data and long-term dependencies, and the models are recurrent in nature. These factors make training very slow to reach the desired result. To address this problem, the research community came up with multilayered attention heads and encoder-decoders, formally called Transformers [47]. The current study uses a similar approach to generate domain-specific text; the detailed methodology is discussed in "Methods". We use the recently developed transformer neural network architecture. This architecture, primarily used for Google's translation work, consists of two blocks, namely encoders and decoders. We use only the decoder part. The model was given 2.3 billion text tokens during training. It has 355 million parameters and was trained for 3 months to reach a training loss of 2.6. The 2.3 billion text tokens were collected after rigorous data preprocessing steps. Reddit, a US-based social news aggregation and discussion forum, was selected for data collection. Almost 700 Subreddits were shortlisted as sources of URLs. Millions of submissions made over five years were considered, where a submission means any post, comment, or reply by a user. Users often point to URLs for clarification, so 1.8 million URLs were collected from the submissions, and each URL was checked to confirm that it is valid and working. These URLs were then parsed and cleaned with a parser to extract the text. Finally, 2.3 billion tokens were generated as word embeddings ready to feed to the model.
The rest of the paper presents, in order, the literature review, the methodology of the study and the model, the results, and the limitations and suggestions for future work.

Research gap

Having reviewed the evolution of NLP and its recent developments, we can see one common problem in all natural language understanding tasks: creating a relationship matrix between words or characters and giving importance to a specific word at a specific place. Solving this problem is very important for all NLP-related niches, for example natural language understanding, natural language generation (NLG), and machine translation. In this connection, there are mainly two problems to solve. The first is, again, giving importance to words and their specific places in the sentence and creating a correlation, or context, for each word embedding based on its usage. The second is supplying a lot of data, in other words a lot of instances, for the model to learn the placement and relational patterns of characters or words. A lot of data requires very large word embedding matrices, which leads to extremely slow model training and heavy use of computational resources. The computational and efficiency problem is therefore more serious than it seems, because it blocks a breakthrough on the first problem. The research community could either wait for computational resources to become efficient and fast enough to solve the problem at hand, or come up with a better solution. The solution was the attention mechanism [47] and, more specifically, the transformer architecture of neural networks, formally called encoder-decoders [36]. Transformers can, theoretically, overcome the above-mentioned problems and open a new horizon for NLP and NLG, but many real-life use cases and proofs of concept are needed to support this new ANN architecture. After this conceptual breakthrough, the next challenge is to gather a large amount of data and preprocess it so that it can be supplied to these new models to prove the concept.
Our paper fills this gap by taking up the challenge of developing a proof of concept and showing the practicality of this advancement in NLP and deep learning. The most important step in this journey is to find a use case, and we have chosen business-related reports and text writing. The next subsection gives precise details on where and how this concept can be used in a commercial setting and what benefits it promises. Getting a lot of business-related data is very important but also very hard, because much of the available text is irrelevant or cannot be verified as authentic business text. Involving human effort to tag data is very costly and not feasible. We therefore decided to use “reddit”, a widely used platform on which each post is voted on by the community. In this way, we could obtain human-checked data in huge volume, related to business problems. It is also relevant to mention that we did not parse data from “reddit” directly; rather, we collected only the URL links from the posts and then parsed the complete text behind those URLs. Our main contribution is thus less on the theoretical side and more on the practical side: we have retuned and adopted an existing theoretical concept in a practical setting to provide its proof of concept. Following this discussion, it is relevant to provide one hypothetical application instance and the possible commercial usage of this study. The next subsection therefore describes a hypothetical ideal use case and the overall generic use cases of the study.

Hypothetical use case

Let us create a practical scenario. Office and business management involve a lot of report and text writing. For example, Manager X has to place a job advertisement for a consultancy firm, or write an advertisement, or write a short report about his product and its competitors in his industry to obtain external funding. In such cases grammar is not the only important factor; using the words that other people in the industry use, in order to be more persuasive or clearer, may be more important. Suppose a software application helps Manager X in two ways. First, it suggests context-appropriate word replacements based on millions of instances in which other people have written in similar situations. Second, if he writes “Apple Inc.”, the application suggests a completion such as “Apple has launched iPhone pro max. in 2020 that gave them xxx hundred thousand $ annual revenue”. Manager X can now save a lot of time and energy otherwise spent searching Google for facts and figures. Assistance with paraphrasing any keyword could also improve business writing greatly. We recognize that this requires a lot of front-end development work too, but the black-box part would be NLG.

Practical implication

The study has great potential for real-world practical uses, for example next-word prediction, topic modeling to extract text out of scanned images, checking the contextual soundness of business writing, and judging the suitability of word usage even when the text is already grammatically correct. Subject-specific knowledge, language usage, and vocabulary always differ from generic language. Many companies and start-ups have software applications that use a similar approach but rely on general-language text. Some examples follow. Gmail autofills salutations and common words while an email is being written [20]. Grammarly [21] gives word-context suggestions and content-clarity feedback based on the text it was trained on, and asks for the purpose of use at registration; something like a Grammarly business writer could be a very practical use of this study. Reverso Translator gives translations based on the frequency of usage of a word in the literature, along with text excerpts in which the looked-up words have been used. There is potential for a similar tool that gives accurate context for business-related text only. Lastly, we did not know of it at the time of conducting this research, but an online platform has since emerged that uses an augmented writing approach with great success and has top-level firms in its customer portfolio, see [54]. This would be a true practical usage of such a study. Language generation models have applications not only in business but also in many other fields; for example, van Deursen [15] introduced Generative Examination Networks (GEN) to generate chemical space.

How does deep learning integrate into the corporate sector?

The literature on natural language processing has its roots in the 1940s. From a review of the literature, the evolution of NLP can be divided into different phases. The journey started with machine translation problems, followed by the computer and information technology revolution, which brought AI applications into this area. After AI and machine learning came into the picture, the ability to solve complex tasks in less time improved, and more focus was placed on grammatical structure. With advancements such as deep learning and reinforcement learning, NLP has now entered the era of artificial text generation, and generated text can hardly be differentiated from human-written text.

Though the research community of that time had been working on NLP, the first scientific paper was published by William N. Locke, head of the MIT language department, and A. Donald Booth of Birkbeck College [28]. Machine translation (MT) started with three dominant languages of that time: English, Russian, and a little Chinese. Computational resources were very scarce, and much effort had to be spent on converting data into bits [1]. Early researchers in this area focused on the syntactic computational processing of language, since it was important to first draw up the basic structure of a language [35]. In the work of [11], researchers tried to shift the focus from syntactic to semantics-oriented language processing. Ceccato attempted a correlational analysis between matching patterns in a pair of languages in order to achieve semantics-driven language processing. Winograd [52] and Woods [53] regarded the transformational grammar theory of the 1960s as a misfit for computational grammar and analysis that offered little in terms of semantics. The computational confidence approach proposed by Woods and Winograd enriched the previous work along the semantic path.

Later, in the 1980s, AI came into the picture and the community shifted its focus toward machine learning based approaches to solve the existing dilemmas of NLP in a purely semantic way [41]. In this decade, researchers realized that NLP tasks such as building word representations for use in AI-related networks and pinning down context are very hard. Some notable work of the 1980s is as follows. Briscoe et al. [9] built a general-purpose grammatical formalism, including a syntactic analyzer for the English language, with the help of suboptimal software named the Grammar Development Environment (GDE). They also programmed software to build and manage a large grammar base. In the direction of speech recognition, Young et al. [57] led to major US speech recognition projects called continuous speech recognition (CSR) and large vocabulary continuous speech recognition (LVCSR). The paper includes tools and methods for news transcription and text dictation.

The next phase of NLP development was the 1990s, which mostly focused on combining lexical and syntactic approaches to natural language processing. After many twists and almost two decades of struggle, statistical and probabilistic approaches were adopted for classification tasks in NLP [43]. These models later became the raw material for machine learning techniques that address the complexities of NLP. For example, Manning and Schuetze [29] worked on information retrieval, feature extraction, and the analysis of textual information with statistical models. Mani and Maybury [30] used terminological logic to build a knowledge base for automatic information extraction and text summarization. By the end of the 1990s, dialogue speech systems and language processing had expanded their horizon to multilingual machine translation of text and speaker-independent speech-to-speech dialogue systems. Wahlster [50] worked on the project Foundations of Speech-to-Speech Translation, known as ‘Verbmobil’. This multilingual system (German, English, and Japanese) takes input in a speaker-independent manner and translates it into the desired language. It also handles domain-specific spoken business dialogues and translates them into other languages with approximately 80 percent accuracy. The struggle of many years made NLP researchers, practitioners, and industry realize that linguistic resources are indispensable for further development in this field; thus two resources, the “British National Corpus” [8] and “WordNet” [17], came into being. The next era of natural language processing started after 2001. Though researchers have proposed many models other than neural networks, we discuss only the important neural network oriented models in this paper.

Bengio et al. [7] proposed a state-of-the-art tri-gram neural probabilistic model, in which a neural network serves as the probability function. The idea is based on the conjecture that unseen words receive a higher predicted probability when they are similar to the words on which the network was trained. The next-word prediction approach has many practical commercial uses; see, for example, the work of [26], which can generate a short semantic reply to an email.

The next advancement in the field of NLP was multitask learning. Of course, this method is not confined to NLP but is a general enhancement in the neural network world. Collobert and Weston [12] applied this technique to transfer learning. Vector representations of words were fed as input to a model for word prediction, and the learning of that model was then transferred to another, independent model to achieve a similar but not identical task. The multi-task learning approach was first introduced by Caruana [10]. Once the so-called word vector representations are fed to the neural network, the network starts learning the context and association of each word with the others. Transfer learning makes it possible to share learned weights across models for generalization and incremental learning. During optimization, the choice of which parameters to transfer is very important. Ruder [39] proposed that the sharing parameters can also be learned during the learning process. See also similar research [31]. The next milestone in this connection was the “vector representation” of text, the so-called word embeddings. The basic word embedding idea was first floated by Mikolov [33], who proposed that removing the hidden layer while training the word embeddings gives more promising outcomes. This idea later paved the way for ‘word2vec’, which was originally adapted to two popular approaches, namely continuous bag-of-words and skip-gram. This triggered research interest in this direction, and many researchers have enriched the concept, see [2, 3, 34, 51]. The current direction of word embedding research is to train on a very large corpus and to use pre-trained embeddings for multilingual models in an independent and unsupervised fashion; for example, see [4, 13, 42].

In 2013 and 2014, neural network architectures began to be applied to NLP; the most obvious choices were recurrent, recursive, and convolutional neural networks. First, simple Elman [16] RNNs were replaced with LSTMs by [23] because of the long-term context dependencies in input text. Second, convolutional networks, which originally dealt with computer vision, were also applied to NLP; for example, see the work of [25, 27]. The obvious advantage of convolutional networks is that they are more parallelizable and rely on local context through their layers rather than on past state, in contrast to LSTMs.

Concerning recurrent neural networks, the next enhancement was sequence-to-sequence modeling (seq2seq). A seq2seq model uses the same recurrent neural network architecture, but the key lies in its encoding and decoding procedures. The input sentence is first encoded into a vector representation. The decoder then sequentially decodes the predicted symbols based on the encoder state. The sequence-to-sequence model was proposed by Sutskever et al. [44]. Later, in 2016, Google [19] decided to change its monolithic sentence-based machine translation to a completely neural network based system. Seq2seq models are now the foundation of language generation models and of further developments, e.g. transformer-based neural network architectures. Similarly, image captioning [48] uses the same technique to generate image captions automatically. The seq2seq model led toward attention mechanisms and transformer-based approaches. The basic limitation of the seq2seq network is that it tries to compress the whole sequence of the sentence into a fixed-length vector, so the model cannot look into the hidden states. The attention mechanism, by contrast, looks into the hidden states of the model and combines them to determine how much stress should be given to a specific word. Attention [6] was the core innovation in neural machine translation that permanently replaced the traditional methods of machine translation. Different flavors of attention-based networks and their applications include reading comprehension [22], entity parsing [49], and image captioning [55].

Pretrained models have gained popularity among the NLP research community. The main advantage of a pretrained model is that it is a context-agnostic and unsupervised model. Labeling for NLP tasks can be very costly and challenging. A pretrained model captures the meaning and context of one language, and its learnings can be transferred to another language for meaning and context generation or translation. The pretrained model was first proposed by Dai and Le [14]. The current study is also based on a pretrained multi-head attention based model.


In this section, we describe in detail how the data is preprocessed and how the processed data is then fed to the model. The completely preprocessed data will be available as open-source data for further research and development.

Data preprocessing

In this section, we describe the process of data preparation for model training. Everything else about the neural network model is similar to many other applications of ANNs; the main idea here is to strengthen the training process with an enormous amount of training data. Websites are a potential source of large and diverse textual data, but the bottleneck with website data is its validity and the amount of unnecessary information it contains. Following the research by Vaswani [47], we adopted a similar approach and chose ‘Reddit’ [37], a USA-based social news aggregation and discussion platform with 330 million users [37], to collect the website URLs from which to parse the data. To ensure the validity and usefulness of the web URLs, only links with more than 3 ‘karma’ were taken. ‘Karma’ is an assurance given by other users about the validity of comments and discussion. In this way, we obtained a human-level quality check on the data. Once the data quality mechanism was in place, the next filter was to keep only URLs related to business and Fortune 500 companies. Most of the top 500 companies have their own discussion and news profile on ‘Reddit’, called a ‘Subreddit’. ‘Reddit’ has a very large community, and thousands of submissions are made daily. The raw data, ranging from 2005 to 2017, was first collected programmatically with the help of the ‘Reddit’ programming interface [38] and stored in a ‘BigQuery’ database. In the next step, we extracted all URLs with a ‘karma’ ranking of more than 3 from the users' daily submissions. These URLs were checked to verify whether they were working, and a final list of 1,852,482 working URLs was prepared, from which textual data was parsed out of the ‘HyperText Markup Language (HTML)’ tags. With the help of parallel computing and a computer grid, 20 GB of text files was collected from all working URLs. These 20 GB of text files were again filtered for unnecessary characters and symbols. Finally, 2,302,554,291 text tokens were collected to be converted into word embeddings. Fig. 1a depicts the flow of data preprocessing as a schematic diagram. In summary, preprocessing involves shortlisting business-related Subreddits, extracting the URLs with more than 3 ‘karma’, verifying that the URLs work, parsing the text from their HTML, filtering out unnecessary characters and symbols, and converting the resulting tokens into word embeddings.

Fig. 1 Data preprocessing and network architecture
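The filtering stages of the preprocessing flow can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the pipeline used in the study: the submission records, their field names (`karma`, `text`), the helper functions, and the whitespace tokenizer are all hypothetical, and the network steps (checking that a URL works, downloading its HTML) are left out.

```python
import re
from urllib.parse import urlparse

KARMA_THRESHOLD = 3  # the study keeps links with more than 3 karma

URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(submissions, threshold=KARMA_THRESHOLD):
    """Return unique, well-formed URLs from submissions above the karma threshold."""
    seen, urls = set(), []
    for sub in submissions:
        if sub["karma"] <= threshold:
            continue  # drop low-karma submissions
        for url in URL_PATTERN.findall(sub["text"]):
            url = url.rstrip(".,;)")  # trim trailing punctuation
            if urlparse(url).netloc and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def clean_text(raw_html):
    """Strip HTML tags and collapse whitespace (a crude stand-in for a real parser)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", no_tags).strip()


def tokenize(text):
    """Hypothetical whitespace tokenizer, used here only to count tokens."""
    return text.split()


# Toy records standing in for Reddit submissions (hypothetical data).
submissions = [
    {"karma": 5, "text": "Earnings summary: https://example.com/q3-report."},
    {"karma": 2, "text": "Low-voted link https://example.com/spam"},
    {"karma": 9, "text": "Same story https://example.com/q3-report and https://example.org/analysis"},
]
urls = extract_urls(submissions)
tokens = tokenize(clean_text("<p>Revenue  grew <b>12%</b> this quarter.</p>"))
print(urls)
print(len(tokens), "tokens")
```

In a full pipeline each surviving URL would additionally be requested and parsed, which is where the parallel computing mentioned above would come in.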

Fig. 2 Transformers general network architecture [47]


Next, the transformer neural network model is applied to the preprocessed data. The Transformer model takes word tokens encoded as word embeddings, which are simply numbers that represent each word. Normally, transformers have two parts, encoders and decoders, but we use only the decoder part, because the combination of encoder and decoder is suited to machine translation, which is not the task in this study. Fig. 2 shows how a general transformer, originally designed for machine translation problems, works. This architecture was later adopted and modified by many researchers and labs to improve NLP and translation-related tasks. A close reading of the paper [48] shows that transformers are basically a form of transfer learning. Sentences of the first language pass through many self-attention and feed-forward layers, which update the training weights while keeping track of the relationship between the words within the sentence and the position of each word. The learned weights of the first language are then transferred to the feed-forward layer of the decoder, so that the relationships, positions, and grammatical aspects are taken into account when the model predicts words in the second language. That is how the essence and context of a sentence are translated correctly. Our case differs from machine translation, so weights from second-language inputs are not available here, and we keep the decoder part as the main model architecture. Returning to data processing, the word embeddings are stored in the NumPy zip format for simplicity. We first look at the high-level representation of the model and then at how the self-attention layer works. The model receives the word embeddings as input and assigns a positional encoding to each word. The positional encoding records the position of the word in the sentence, so that context is captured efficiently rather than treating words as randomly ordered. Each word embedding, along with its positional information, passes through the self-attention layer. The self-attention layer is replicated twelvefold.
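To illustrate how positional information can be attached to word embeddings, the following minimal sketch uses the sinusoidal encoding of the original transformer [47]. This is an assumption made for illustration: the text above does not state which positional scheme the study's model uses, and decoder-only models may instead learn their position embeddings. The sizes and the function name are likewise hypothetical.

```python
import numpy as np


def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding; d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)             # even columns
    encoding[:, 1::2] = np.cos(angles)             # odd columns
    return encoding


rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                            # toy sizes, not the study's
embeddings = rng.normal(size=(seq_len, d_model))    # stand-in word embeddings
positional = sinusoidal_positions(seq_len, d_model)
model_input = embeddings + positional               # what the attention layers receive
print(model_input.shape)
```

Because the encoding is added element-wise, two occurrences of the same word at different positions reach the self-attention layer as different vectors.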

By way of analogy, we can say that this layer creates many copies of the sentence and maps the relationship and importance of each word in the sentence to work out how much attention should be given to specific words. That is why it is called a multi-head self-attention layer. We can now look inside the self-attention layer to see how it works. Each input vector \({\mathbf {X}}_{1}.. {\mathbf {X}}_{N}\) is multiplied by three weight matrices, \({\mathbf {W}}^{Q},{\mathbf {W}}^{k},{\mathbf {W}}^{v}\), which are randomly initialized weights of dimension 64, and the outputs of these multiplications are the query vector (\({\mathbf {q}}_{1}\)), the key vector (\({\mathbf {K}}_{1}\)), and the value vector (\({\mathbf {V}}_{1}\)). In the next step, we take the dot product (\({\mathbf {q}}_{1} \cdot {\mathbf {K}}_{1}....{\mathbf {K}}_{N}\)) for a sentence of words \(1....N\). To stabilize the gradients, each result is then divided by \(\sqrt{d_k}\), where \(d_k\) is the dimension of the key vector. This operation gives a score for each word; a higher score means that more attention should be given to that word. In the next step, the scores of one word relative to all other words are normalized with a softmax and used to weight the value vectors, and the weighted sums are collected into a variable \({\mathbf {Z}}\):

$$\begin{aligned} {\mathbf {Z}} = softmax\left( \frac{{\mathbf {Q}}\times {\mathbf {K}}^{T}}{\sqrt{d_{k}}}\right) \times {\mathbf {V}} \end{aligned}$$
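The computation of \({\mathbf {Z}}\) can be sketched in NumPy as follows. This is a minimal illustration rather than the study's implementation: the random matrices stand in for learned projections, and the optional causal mask is an assumption based on the model being decoder-only, where each word may attend only to itself and to earlier words.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # one score per word pair
    if causal:
        # Assumed decoder-style mask: hide positions to the right of each word.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
n_words, d_k = 4, 64                                # 64 matches the dimension in the text
Q = rng.normal(size=(n_words, d_k))
K = rng.normal(size=(n_words, d_k))
V = rng.normal(size=(n_words, d_k))
Z, weights = attention(Q, K, V, causal=True)
print(Z.shape)
```

With the mask enabled, the first word can attend only to itself, so the first row of `Z` equals the first row of `V`.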

The matrix \({\mathbf {Z}}\) is the final output of one of the many self-attention heads, and it is fed, in matrix form, to the feed-forward neural network. To attend to different positions of the words in the sentence, we need multiple representational subspaces, which are obtained with multiple heads, or copies, of the attention layer, so:

$$\begin{aligned}&{\mathbf {Q}}_{i}= {\mathbf {W}}^{Q}_{i}{\mathbf {X}}\\&{\mathbf {K}}_{i}= {\mathbf {W}}^{k}_{i}{\mathbf {X}}\\&{\mathbf {V}}_{i}= {\mathbf {W}}^{v}_{i}{\mathbf {X}}, \qquad i = 1,\ldots ,n \end{aligned}$$

where \(i = 1,\ldots ,n\) indexes the attention layers, \({\mathbf {Q}},{\mathbf {K}}, {\mathbf {V}}\) are the query, key, and value matrices, and \({\mathbf {X}}\) is the word embedding input matrix. Every attention layer thus produces its own \({\mathbf {Z}}\) matrix, and the number of such matrices depends on how many attention layers are chosen, 12 in our case. The attention output matrices \({\mathbf {Z}}_{1}....{\mathbf {Z}}_{12}\) are concatenated and multiplied by a single weight matrix shared across all layers, called \({\mathbf {W}}_{O}\). The resulting matrix is the input to a fully connected feed-forward network. The final output of the feed-forward network is then decoded back into words to generate the sequence of the sentence. For the dimensions of the different matrices, please refer to Table 1.
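The multi-head computation described above can be sketched as follows. The sketch is hedged in several ways: the model width of 768 is an assumed toy value (only the count of 12 and the dimension 64 come from the text), the random weights stand in for trained ones, and the code multiplies row vectors on the right (`X @ W`), which is the transpose of the \({\mathbf {W}}{\mathbf {X}}\) convention used in the equations.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q, W_k, W_v: (heads, d_model, d_k); W_o: (heads * d_k, d_model)."""
    head_outputs = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i      # per-head projections
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        head_outputs.append(softmax(scores) @ V)    # Z_i for this head
    Z = np.concatenate(head_outputs, axis=-1)       # Z_1 ... Z_12 side by side
    return Z @ W_o                                  # joint output projection


rng = np.random.default_rng(0)
n_words, n_heads, d_k = 5, 12, 64
d_model = n_heads * d_k                             # 768, an assumed width
X = rng.normal(size=(n_words, d_model))
W_q = rng.normal(scale=0.02, size=(n_heads, d_model, d_k))
W_k = rng.normal(scale=0.02, size=(n_heads, d_model, d_k))
W_v = rng.normal(scale=0.02, size=(n_heads, d_model, d_k))
W_o = rng.normal(scale=0.02, size=(n_heads * d_k, d_model))
output = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(output.shape)
```

The output has the same shape as the input, which is what allows it to be passed on to the fully connected feed-forward network.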

Table 1 The table gives the dimension of the different matrices


In this section, we describe the results of our study and present text samples generated by our trained model. The results include both conditional and unconditional samples. Conditional sampling means that we provide a certain keyword to the model as input and the model returns a text paragraph related to that keyword, whereas unconditional sampling means random samples generated by the trained model. The training loss summary from ‘Tensorboard’ is given in the Appendix section. To show that the model is accurate and that the samples did not arise by chance, 100 samples randomly generated by the model are also given in the Appendix section.
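The difference between the two sampling modes can be illustrated with a toy example. The sketch below is not the study's model: the bigram table `BIGRAMS`, the helper names, and the `<start>` and `<end>` markers are hypothetical stand-ins for a trained network that returns next-word probabilities.

```python
import random

# Hypothetical next-word probabilities standing in for a trained language model.
BIGRAMS = {
    "<start>": {"apple": 0.5, "revenue": 0.5},
    "apple": {"reported": 1.0},
    "reported": {"revenue": 1.0},
    "revenue": {"grew": 1.0},
    "grew": {"<end>": 1.0},
}


def next_token(previous, rng):
    """Sample the next word given the previous one."""
    candidates = BIGRAMS.get(previous, {"<end>": 1.0})
    words, probs = zip(*candidates.items())
    return rng.choices(words, weights=probs, k=1)[0]


def generate(prompt=None, max_new_tokens=10, seed=0):
    """Conditional sampling if a prompt is given, unconditional otherwise."""
    rng = random.Random(seed)
    context = list(prompt) if prompt else ["<start>"]
    output = list(prompt) if prompt else []
    for _ in range(max_new_tokens):
        token = next_token(context[-1], rng)
        if token == "<end>":
            break
        context.append(token)
        output.append(token)
    return output


conditional = generate(prompt=["apple"])   # the user supplies a keyword
unconditional = generate()                 # the model starts from scratch
print(" ".join(conditional))
print(" ".join(unconditional))
```

In the conditional call the keyword fixes the start of the text, while the unconditional call lets the model draw even its first word at random.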

We trained the model for 460,000 steps. Since the model has almost 355 million parameters and was trained on more than 2.3 billion text tokens, it requires considerable computational power and time. The model was trained for 3 months on a single GPU and settled at a loss value of 2.6. This loss is quite reasonable for a text-based model, because language modeling always involves complex grammatical chains, such as dependencies and structures, that are not easy to capture. The next two subsections provide text generated by the model, from both conditional and unconditional random sampling.
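One way to read the loss value is to convert it into a perplexity. The short calculation below assumes that the reported loss is the mean per-token cross-entropy measured in nats (natural logarithm), which the text does not state explicitly; under a different definition the conversion would not apply.

```python
import math

training_loss = 2.6                      # reported training loss
perplexity = math.exp(training_loss)     # valid only for cross-entropy in nats
print(f"perplexity = {perplexity:.2f}")  # prints "perplexity = 13.46"
```

Under that assumption, the model is on average about as uncertain as if it were choosing uniformly among roughly 13 tokens at each step.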

Interactive conditional outputs of the model

This subsection provides five output samples from the interactive conditional sampling method of the study model. In this interactive mode, the model communicates with the user: the user gives an input or keyword, and the model generates a text paragraph that mostly talks about the given keyword or topic. Tables 2, 3, 4, 5 and 6 show the output for five different user-given inputs.

Table 2 Results of interactive conditional samples
Table 3 Results of interactive conditional samples
Table 4 Results of interactive conditional samples
Table 5 Results of interactive conditional samples
Table 6 Results of interactive conditional samples

Unconditional outputs of the model

Tables 7, 8, 9, and 10 below show random sample outputs generated by the model. This is artificial text written by the model. The generated paragraphs mostly follow grammatical rules, and the topics of the samples point towards business-related text. An enormous number of samples can be produced on demand; for the sake of brevity, we give only a few samples here.

Table 7 Results of non-conditional samples
Table 8 Results of non-conditional samples
Table 9 Results of non-conditional samples
Table 10 Results of non-conditional samples


In this section, we discuss the results of the study and how they address the research problem. The main focus of this study is to test the validity and usability of current theoretical developments in natural language generation, and in natural language processing more generally. For data reliability, we used subreddits so that the URLs pass a human-level quality check. Robustness is ensured with a KARMA threshold of 3 points. The choice of 3 KARMA points is based on the average KARMA typically given in subreddits. Raising the KARMA threshold gives more human-level quality, but it reduces the amount of data dramatically, which means losing a lot of quality information and application context.

As we stressed previously, the long chains of dependencies between words, the placement of a word in a given sentence, and the relational space of words and characters are the big challenges of language generation. These problems were very difficult for recurrent neural network models to cope with, so researchers came up with different theoretical concepts. In this connection, our study provides practicality, usability, and a proof of concept of the model. For this purpose, we provide two types of results: interactive conditional and non-interactive random samples. The training procedure, iterations, and loss graph can be found in the Appendix section.

The main objective of the study was to generate text that, first, sticks to the overall topic (formally called topic modeling), second, is grammatically correct, and third, relates to business. For the interactive conditional results (Tables 2 to 6), we provided the model with diverse business-related keywords from different business genres. The model is not only able to produce closely related text, but also supplies some facts and figures. Moreover, the linkage of sentences and the storytelling are decent. This serves the hypothetical use case and the highlighted research gap well. Additionally, for the robustness of the model, we created a random sample of text generations with thousands of instances. For brevity, we provide only some samples here, along with a link to cloud storage for access to the remaining thousands of samples. In both modes, interactive and non-interactive, the model achieves the initial goals of context, relatedness, and topic modeling. Of course, this is just a building block for a meaningful commercial software application. We still need to assemble other pieces of the puzzle, namely front-end development, scripting, and the mapping of words and PACs of words to the counts and statistics of the whole database, with many more bumps to expect along the way.


The current study focuses on the application of natural language processing in the field of business writing. In the recent past, the deep learning research community has come up with a new architectural style of deep-layered AI models aligned with the specific needs of natural language and text generation. The transformer is one of those models and has proven to be very accurate and effective at capturing context and grammar in text.

What, very briefly, is the purpose of the study? The study uses a generative pretrained neural network model. The model is fed a large amount of preprocessed business-related text obtained by parsing the 1.8 million URLs collected from Reddit. With the trained model, a user can give keywords or a topic, and the model produces a paragraph that sticks to the given topic, provided that the topic or keyword is in the domain of business or management sciences. These results can be used for automatic paragraph prediction to assist business report writers or anyone else involved in the writing process. Many applications are available for next-word prediction, but paragraph prediction is lacking.

Now, let us give a few more details on how the data is preprocessed and how the model is trained. A large amount of quality data is very important for language processing models. To address the quality issue we chose ‘Reddit’, a news aggregation and content sharing platform. Although Reddit covers many different topics, we shortlisted ‘subreddits’, which are topic-specific Reddit communities. A huge number of submissions are posted on Reddit every day, and ‘KARMA’ votes are given to posts that are helpful for the community. We therefore collected 1.8 million URLs from submissions with a ‘KARMA’ vote greater than three. In a separate next step, we collected and cleaned all the text available at those URLs. In the end, 2.3 billion text tokens were fed to the model, which has 355 million parameters. After three months of training, the model can generate text that is grammatically correct and aligned with business topics. In the following subsections, we discuss practical applications of the model, suggestions for future work, and some limitations of the study.
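
The URL collection step can be sketched as follows. This is a minimal illustration, not the study's preprocessing code: the field names `score`, `url`, and `selftext`, the regular expression, and the example records are assumptions about how Reddit submissions might be stored.

```python
import re

# Simple URL matcher (an assumption; real pipelines often need stricter rules).
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")
KARMA_THRESHOLD = 3  # keep submissions with karma greater than three


def collect_urls(submissions, threshold=KARMA_THRESHOLD):
    """Return unique URLs from submissions whose karma exceeds the threshold."""
    seen = set()
    urls = []
    for post in submissions:
        if post.get("score", 0) <= threshold:
            continue  # too few karma points, skip this submission
        candidates = [post.get("url", "")]
        candidates += URL_PATTERN.findall(post.get("selftext", ""))
        for url in candidates:
            url = url.rstrip(".,;")  # drop trailing sentence punctuation
            if URL_PATTERN.fullmatch(url) and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Hypothetical submissions used only to exercise the function.
submissions = [
    {"score": 12, "url": "https://example.com/earnings", "selftext": ""},
    {"score": 2, "url": "https://example.com/low-karma", "selftext": ""},
    {"score": 5, "url": "",
     "selftext": "See https://example.org/report, and https://example.com/earnings."},
]
print(collect_urls(submissions))
# ['https://example.com/earnings', 'https://example.org/report']
```

Raising `threshold` shrinks the result, which mirrors the trade-off between quality and data volume discussed earlier.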

Implications and suggestions for future work

There are many possible implications of this study. One possible use is market intelligence report writing: a piece of software could be developed to auto-complete paragraphs for business intelligence reports. Any business-related industry can benefit from paragraph prediction instead of just word prediction. In this way, the speed and efficiency of the user can be enhanced significantly. As far as future suggestions are concerned, we think that prefixing the text tokens with the theme or topic of the text can make this model more useful. For example, at the start of each training text, we can state what that piece of text is about. In this way, we gain greater control over the output of the model and can generate long reports in real time based on specific keywords. The report is just one example; the model can be used in much more effective ways. Additionally, a study could focus on different KARMA thresholds and how changing the KARMA selection criteria affects the quality of model predictions. We hope that the research community may already be working in that direction.
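
The topic-prefix idea can be sketched in a few lines. The markers `<|topic|>` and `<|text|>` and both helper functions are hypothetical and chosen only for illustration; `<|endoftext|>` is the end-of-text marker used by GPT-2 style tokenizers. The study does not prescribe a particular format.

```python
TOPIC_TAG = "<|topic|>"      # hypothetical marker introducing the topic
TEXT_TAG = "<|text|>"        # hypothetical marker introducing the body text
END_TAG = "<|endoftext|>"    # end-of-text marker used by GPT-2 style models


def make_prompt(topic):
    """Build the prefix a user would supply at generation time."""
    return f"{TOPIC_TAG} {topic.strip()} {TEXT_TAG}"


def add_topic_prefix(topic, text):
    """Build one training example whose text is preceded by its topic."""
    return f"{make_prompt(topic)} {text.strip()} {END_TAG}"


example = add_topic_prefix(
    "supply chain",
    "Shipping delays raised inventory costs for retailers this quarter.",
)
prompt = make_prompt("supply chain")
```

Because every training example starts with the same prefix format, the prompt built at generation time matches what the model has seen during training, which is the source of the additional control described above.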

Limitations of the study

We have done the study to the best of our ability, but it has certain technical limitations. Text generation models usually rely on an enormous amount of training data, which is a very important factor in capturing grammatical structure and topic relatedness, whereas this study relies only on text from the web URLs shared in business-related subreddits. The study could be improved significantly with more sources of training data and more computational power.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.



Abbreviations

NLP: Natural language processing
GPT: Generative pretrained transformer
URL: Uniform resource locator
ANN: Artificial neural network
LSTM: Long short-term memory
MT: Machine translation
GDE: Grammar development environment
CSR: Continuous speech recognition
LVCSR: Large vocabulary continuous speech recognition
HTML: Hypertext Markup Language


  1. ALPAC. Language and machines computers in translation and linguistics. 1966

  2. Antoniak M, Mimno D. Evaluating the stability of embedding-based word similarities. Trans Assoc Comput Linguist. 2018;6:107–19.


  3. Arora S, Li Y, Liang Y, Ma T, Risteski A. A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguistics. 2016;4:385–99.


  4. Artetxe M, Labaka G, Agirre E. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. 2018. arXiv preprint arXiv:1805.06297

  5. Bagal V, Aggarwal R, Vinod P, Priyakumar UD. Molgpt: molecular generation using a transformer-decoder model. J Chem Inf Model. 2021;62(9):2064–76.


  6. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. 2014. arXiv preprint arXiv:1409.0473

  7. Bengio Y, Ducharme R, Vincent P, Jauvin C. A neural probabilistic language model. J Mach Learn Res. 2003;3:1137–55.


  8. BNC. British national corpus. 2020. Accessed 4 Apr 2020.

  9. Briscoe T, Grover C, Boguraev B, Carroll JA. A formalism and environment for the development of a large grammar of English. IJCAI, Citeseer. 1987;87:703–8.


  10. Caruana R. Multitask learning. autonomous agents and multi-agent systems. 1998

  11. Ceccato S. Correlational analysis and mechanical translation. 1967

  12. Collobert R, Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th international conference on Machine learning 2008; pp 160–167

  13. Conneau A, Lample G, Ranzato M, Denoyer L, Jégou H. Word translation without parallel data. 2017. arXiv preprint arXiv:1710.04087

  14. Dai AM, Le QV. Semi-supervised sequence learning. In: Advances in neural information processing systems. 2015; pp 3079–3087

  15. van Deursen R, Ertl P, Tetko IV, Godin G. Gen: highly efficient smiles explorer using autodidactic generative examination networks. J Cheminform. 2020;12(1):1–14.


  16. Elman JL. Finding structure in time. Cogn Sci. 1990;14(2):179–211.


  17. Fellbaum C. Towards a representation of idioms in wordnet. In: Usage of WordNet in Natural Language Processing Systems. 1998

  18. Gers FA, Schraudolph NN, Schmidhuber J. Learning precise timing with lstm recurrent networks. J Mach Learn Res. 2002;3:115–43.


  19. Google. Alphabet Inc. 2020. Accessed 4 Apr 2020.

  20. GoogleEMail. Gmail 2021. Accessed 15 Nov 2021.

  21. Grammarly i. Grammarly. 2021. Accessed 15 Nov 2021.

  22. Hermann KM, Kocisky T, Grefenstette E, Espeholt L, Kay W, Suleyman M, Blunsom P. Teaching machines to read and comprehend. In: Advances in neural information processing systems. 2015;pp 1693–1701

  23. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.


  24. Jacovi A, Shalom OS, Goldberg Y. Understanding convolutional neural networks for text classification. 2018. arXiv preprint arXiv:1809.08037

  25. Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for modelling sentences. 2014. arXiv preprint arXiv:1404.2188

  26. Kannan A, Kurach K, Ravi S, Kaufmann T, Tomkins A, Miklos B, Corrado G, Lukacs L, Ganea M, Young P, et al. Smart reply: Automated response suggestion for email. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016;pp 955–964

  27. Kim Y. Convolutional neural networks for sentence classification. 2014. arXiv preprint arXiv:1408.5882

  28. Locke WN, Booth AD. Machine translation of languages. Am Document. 1956;7(2):135.


  29. Manning CD, Schütze H. Foundations of statistical natural language processing. 1999

  30. Maybury M. Advances in automatic text summarization. Cambridge: MIT press; 1999.


  31. McCann B, Keskar NS, Xiong C, Socher R. The natural language decathlon: Multitask learning as question answering. 2018. arXiv preprint arXiv:1806.08730

  32. McClelland JL, Rumelhart DE. Explorations in parallel distributed processing: a handbook of models, programs, and exercises. Cambridge: MIT press; 1989.


  33. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. 2013;pp 3111–3119

  34. Mimno D, Thompson L. The strange geometry of skip-gram with negative sampling. In: Empirical Methods in Natural Language Processing. 2017

  35. Plath W. Multiple path analysis and automatic translation. Amsterdam: North-Holland; 1967.


  36. Radford A, Wu J, Amodei D, Amodei D, Clark J, Brundage M, Sutskever I. Better language models and their implications. OpenAI Blog. 2019. https://openai.com/blog/better-language-models

  37. reddit. Reddit. 2021a. Accessed 15 July 2020.

  38. reddit. Reddit. 2021b. Accessed 15 July 2020.

  39. Ruder S, Bingel J, Augenstein I, Søgaard A. Latent multi-task architecture learning. Proc AAAI Confer Artif Intell. 2019;33:4822–9.


  40. Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. Tech. rep.: California Univ San Diego La Jolla Inst for Cognitive Science; 1985.

  41. Schank RC. Language and memory. Cogn Sci. 1980;4(3):243–84.


  42. Søgaard A, Ruder S, Vulić I. On the limitations of unsupervised bilingual dictionary induction. 2018. arXiv preprint arXiv:1805.03620.

  43. Sparck Jones K. Thesaurus. Encyclopedia of artificial intelligence. 1992;2:1605–13.

  44. Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. 2014;p. 3104–3112.

  45. Tensorboard. Google Tensorboard. 2020. Accessed 15 Oct 2020.

  46. Turing AM. Computing machinery and intelligence. In: Parsing the turing test, Springer. 2009;p. 23–65.

  47. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Advances in neural information processing systems. 2017;p. 5998–6008.

  48. Vinyals O, Kaiser Ł, Koo T, Petrov S, Sutskever I, Hinton G. Grammar as a foreign language. In: Advances in neural information processing systems. 2015;p. 2773–2781.

  49. Vinyals O, Blundell C, Lillicrap T, Wierstra D, et al. Matching networks for one shot learning. In: Advances in neural information processing systems. 2016;p. 3630–3638

  50. Wahlster W. Mobile speech-to-speech translation of spontaneous dialogs: an overview of the final verbmobil system. In: Verbmobil: Foundations of speech-to-speech translation, Springer. 2000;p. 3–21.

  51. Wendlandt L, Kummerfeld JK, Mihalcea R. Factors influencing the surprising instability of word embeddings. 2018. arXiv preprint arXiv:1804.09692

  52. Winograd T. Understanding natural language. Cogn Psychol. 1972;3(1):1–191.


  53. Woods WA. Semantics and quantification in natural language question answering. In: Advances in computers. 1978;vol 17, Elsevier, p. 1–87.

  54. Writing TA. 2021. Textio augmented writing. Accessed 15 Nov 2021.

  55. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y. Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning. 2015;p. 2048–2057.

  56. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E. Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. 2016;pp 1480–1489.

  57. Young SJ, Chase LL. Speech recognition evaluation: a review of the us csr and lvcsr programmes. Comput Speech Lang. 1998;12(4):263–79.




No acknowledgements.


Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations



The research study was written by FK during his PhD at the University of Osnabrueck, Germany. GP supervised the study and improved it in many respects.

Authors’ information

Mr. Faisal Khalil is pursuing a PhD at the University of Osnabrueck, Department of Cognitive Sciences. His major areas are artificial intelligence and deep learning. His main focus is the application of AI in the areas of business and corporate finance.

Corresponding author

Correspondence to Faisal Khalil.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Figure 3 shows the training loss and Fig. 4 shows the test loss in a model training summary graph. The graphs were produced with the “Tensorflow” tool called “Tensorboard”. Tensorboard is a tool for viewing the hidden layers and mechanisms of ANN models, written by Google to increase the efficiency of the “Tensorflow” library [45].

Fig. 3 Tensor-board training loss summary

Fig. 4 Tensor-board test loss summary

Random samples

In this section, we give the ‘Microsoft’ ‘OneDrive’ shared folder link, which contains 2284 samples generated by the model during the training process. A random sample was generated roughly every 200 training steps. Samples can be accessed via the following link:

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Khalil, F., Pipa, G. Transforming the generative pretrained transformer into augmented business text writer. J Big Data 9, 112 (2022).

