In this section, we present our solution to the entity detection task. In essence, it consists of a Transformer model trained to recognize the token-based positions of entities in a given NLQ and to determine whether those entities are head or tail entities.
We use the Transformer because of its attention mechanism. Unlike encoder-decoder architectures based on vanilla Recurrent Neural Networks (RNNs), which attend only to the last encoder state, the Transformer looks at all encoder states at each decoder step and computes the most appropriate encoder states to use as context for generating the decoder’s output at that step. This allows the model to retain relevant information from earlier elements of the sequence, which is beneficial for understanding long sequences such as text-based paragraphs. In the context of KGQA, such long sequences can appear in complex questions whose corresponding queries involve multiple triple patterns.
We start with an overview of the Transformer model we use. Next, we describe how we construct the target labels for the model by employing both the training questions and their corresponding ground truth triples. Finally, we describe how we train the model to predict those labels. Our second contribution, an expansion of the current benchmark QALD datasets, is discussed in the Experimental Evaluation section.
Transformer
Transformer is an encoder-decoder network that relies on attention mechanisms rather than recurrence and convolutions [34]. Figure 3 shows the Transformer architecture. In its most basic form, a Transformer learns to generate an output sequence from a given input sequence. The encoding component works on an embedding of the input sequence through a series of encoder networks, each consisting of a self-attention layer and a feed-forward neural network. This results in a representation of the input sequence as a set of attention vectors. Meanwhile, the decoding component operates on an embedding of the output sequence via a series of decoder networks. Each decoder also has a self-attention layer and a feed-forward neural network, but with an additional encoder-decoder attention layer in between. This additional layer uses the attention vectors from the encoding component to guide the decoder to relevant parts of the input. At the end, the decoding component is topped with a linear and softmax layer that returns probability values for each word in the context of the other words in the sentence.
Transformer can be used for a variety of NLP tasks. For text classification, the input and output of the transformer are the same input sentence, so the transformer’s weights form a feature representation of the data. Classification is then achieved by adding a classification layer on top of the transformer, where the actual target labels of the task come into play. In essence, entity detection in this paper is a text classification task. Our implementation makes use of BERT, a pre-trained transformer obtained from HuggingFace’s Transformers library [35], with the help of the SimpleTransformers wrapper library.
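To make this setup concrete, the following minimal sketch shows the generic pattern of placing a classification layer on top of a pre-trained BERT encoder via HuggingFace's Transformers; the model name and the number of classes are illustrative placeholders, not the exact configuration used in our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CLASSES = 4  # illustrative placeholder; in our setting, one class per question type

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# AutoModelForSequenceClassification adds a classification layer on top of BERT
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=NUM_CLASSES
)

inputs = tokenizer(
    "Where in the United States was John Morris Russell born?",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits          # one score per class
predicted_class = int(torch.argmax(logits, dim=-1))
```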
Position-based pattern
We describe in this section the notions of position-based pattern and position-based pattern set, as motivated by the Candice Bergen example in the Introduction section. Recall that a position-based pattern needs to contain three types of information: (i) in which triple pattern of the query (first, second, etc.) the entity appears, (ii) whether the entity appears in the head or tail position, and (iii) the token position of the entity in the NLQ. The token position allows us to locate the occurrence of entities in the input NLQ. Each set of such patterns represents a particular question type into which the Transformer model learns to classify questions.
Definition 1
A triple pattern is of the form \(p = (h,r,t)\), where \(h = \texttt {head} (p)\) and \(t = \texttt {tail} (p)\) are called the head and the tail of p, respectively, and r is the predicate of p. Each of h, r, and t can be a word or phrase in a natural language or an entity/relation identifier with respect to a KG. In addition, h and t can be variables. A triple pattern (h, r, t) is called a head pattern if h is not a variable, and a tail pattern if t is not a variable. Note that a triple pattern can be both a head and a tail pattern.
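As a minimal illustration of Definition 1, a triple pattern and the head/tail-pattern checks can be encoded as follows; the Var marker and class names are our own choices for this sketch, not identifiers from our implementation.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Var:
    """A variable occurring in a triple pattern."""
    name: str

# A term is either a variable or a word/phrase/KG identifier.
Term = Union[str, Var]

@dataclass(frozen=True)
class TriplePattern:
    head: Term
    relation: Term
    tail: Term

    def is_head_pattern(self) -> bool:
        return not isinstance(self.head, Var)   # head is not a variable

    def is_tail_pattern(self) -> bool:
        return not isinstance(self.tail, Var)   # tail is not a variable

# Example: (wd:Q43332, wdt:P4345, ?x) is a head pattern but not a tail pattern.
p = TriplePattern("wd:Q43332", "wdt:P4345", Var("x"))
assert p.is_head_pattern() and not p.is_tail_pattern()
```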
Definition 2
We write an NL sentence q consisting of n words/tokens as a zero-indexed list \(q = [w_0,w_1,\dots ,w_{n-1}]\). A token-based position is a list of consecutive non-negative integers.
Given a phrase \(s = [v_0,v_1,\dots ,v_{k-1}]\) containing k words/tokens and an NL sentence q, one can associate with them the token-based position POS(s, q) at which s first occurs in q. That is, POS(s, q) is a list of the form \([m, m+1,\dots , m+k-1]\) where m is the smallest index such that \(v_0 \equiv w_m, v_1 \equiv w_{m+1}, \dots , v_{k-1} \equiv w_{m+k-1}\), with \(\equiv \) denoting case-insensitive string equality. If s does not occur in q, POS(s, q) is undefined.
POS(s, q) provides the leftmost occurrence of s in q. For example, in the question “Where in the United States was John Morris Russell born?”, the token-based position of the phrase “John Morris Russell” with respect to the question is the list “[6,7,8]”.
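A direct way to compute POS(s, q) is to scan q for the leftmost token-level match of s. The sketch below assumes simple whitespace tokenization and is only an illustration of the definition, not our actual implementation.

```python
from typing import List, Optional

def pos(s: str, q: str) -> Optional[List[int]]:
    """POS(s, q): token positions of the leftmost occurrence of phrase s in
    sentence q (case-insensitive), or None if s does not occur in q."""
    s_tokens = s.lower().split()
    q_tokens = q.lower().split()
    k = len(s_tokens)
    for m in range(len(q_tokens) - k + 1):
        if q_tokens[m:m + k] == s_tokens:
            return list(range(m, m + k))
    return None

# Example from the text (trailing '?' removed for simple whitespace tokenization):
q = "Where in the United States was John Morris Russell born"
print(pos("John Morris Russell", q))  # -> [6, 7, 8]
```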
Our aim is to classify a given question according to whether each of its anchor entities, i.e., those that are necessary for obtaining a correct answer, potentially occurs in the KG as the head or the tail of a triple. Moreover, questions whose anchor entities occur at different positions should be distinguished from each other. Thus, different combinations of token-based positions and head/tail tags form the different target class labels. The head/tail tag for a question in the training set can be constructed from the ground truth triples for that question. Assuming the training questions sufficiently represent all types of questions, these combinations of head/tail tags and token-based positions capture any type of unseen question.
Definition 3
A position-based pattern is a pair \((x,\ell )\) where x is either head or tail and \(\ell \) is a token-based position.
Let \((q,T_q)\) be a pair of a question q and a non-empty list of triple patterns \(T_q = [p_0,p_1,\dots , p_{m-1}]\). We call such a pair a question-answer pair, and a KGQA dataset can be viewed as a finite set of such pairs. The triple patterns represent the ground truth answer for q and can be given either explicitly as triples or as part of a logical query (e.g., SPARQL) that yields an answer to q when executed on the KG.
Definition 4
Given a question-answer pair \((q,T_q)\) with respect to a KG \({\mathcal {G}}\) where \(T_q = [p_0,p_1,\dots ,p_{m-1}]\), the position-based pattern set for q is a set \(C_q\) of indexed position-based patterns such that for each \(i=0,1,\dots ,m-1\):
(i) if \(p_i\) is a head pattern and \(\texttt {head} (p_i)\) has \(s_i\) as its NL label in \({\mathcal {G}}\), then \(i\mathord {:}(\texttt {head}, POS(s_i,q)) \in C_q\), provided that \(POS(s_i,q)\) is defined; and
(ii) if \(p_i\) is a tail pattern and \(\texttt {tail} (p_i)\) has \(s_i\) as its NL label in \({\mathcal {G}}\), then \(i\mathord {:}(\texttt {tail}, POS(s_i,q)) \in C_q\), provided that \(POS(s_i,q)\) is defined.
Each unique position-based pattern set forms a class label for the question classification task. The head/tail tag is based on the ground truth triple patterns, not on the occurrence(s) of s in q. Note that even if s appears more than once in q, we only use the leftmost occurrence of s in q for POS(s, q) because once s at POS(s, q) is linked to a KG entity (in the subsequent step after entity detection), all occurrences of s in q will also be linked to it.
To illustrate the definition of position-based pattern set, consider the question q = “Is Amedeo Maiuri and Ettore Pais excavation directors of Pompeii?” from the LC-QuAD 2.0 dataset. The ground truth triple patterns \(T_q\) for q are given inside the following query that can be executed over Wikidata:
$$\begin{aligned}&\texttt {ASK WHERE} \; \{ \; \texttt {wd:Q43332 wdt:P4345 wd:Q442340} \; . \\ &\qquad \qquad \qquad \;\, \texttt {wd:Q43332 wdt:P4345 wd:Q981427} \; . \; \} \end{aligned}$$
Thus, \(T_q = [p_0, p_1]\) where \(p_0 = (\text {wd:Q43332}, \text {wdt:P4345}, \text {wd:Q442340})\) and \(p_1 = (\text {wd:Q43332}, \text {wdt:P4345}, \text {wd:Q981427})\). Note that \(p_0\) and \(p_1\) are both head and tail patterns. Moreover, “Amedeo Maiuri”, “Ettore Pais”, and “Pompeii” are the labels of wd:Q442340, wd:Q981427, and wd:Q43332 in Wikidata, and all of them appear in q. That is, the token-based positions are \(POS(\text {Amedeo Maiuri}, q) = [1,2]\), \(POS(\text {Ettore Pais}, q) = [4,5]\), and \(POS(\text {Pompeii}, q) = [9]\). Thus, the class label of the question q is:
\(C_q = \{ 0\mathord {:}(\texttt {head}, [9]), 0\mathord {:}(\texttt {tail},[1,2]), 1\mathord {:}(\texttt {head},[9]), 1\mathord {:}(\texttt {tail},[4,5])\}\).
Position-based pattern set construction
Our proposed approach for entity detection based on position-based patterns consists of two main phases, as shown in Fig. 4. The first phase is the position-based pattern set construction, discussed in this section. The second phase is the classification of the input question into one of the position-based pattern sets using a transformer-based multi-class classifier, discussed in the subsequent section.
The position-based pattern set construction is illustrated in Fig. 4. The input is a question-answer pair from the dataset. The answer component of the pair is either a SPARQL query or a ground triple without variables; in both cases, we can obtain a set of triple patterns expressing the answer. Next, we perform diacritics removal and case normalization on the input question as part of the data pre-processing step.
Let \((q,T_q)\) be the question-answer pair with q already pre-processed. We construct the position-based pattern set for q by following the steps in Algorithm 1. We illustrate the process via a running example given in Fig. 5.
The first part of the algorithm spans lines 1 to 14. In this part, we go through all triple patterns in \(T_q\) and extract any occurrence of entities in either the head or the tail position of the triple pattern. For a triple pattern (s, p, o), s and o can each be an entity or a variable. If s is not a variable, we encounter a head entity, and if o is not a variable, we obtain a tail entity. In both cases, we mark the occurrence by adding \(i\mathord {:}\{pos\mathord {:}\ell \}\) to the set ps, where i is the index of the triple pattern (0th, 1st, etc.), pos is either head or tail depending on whether we encounter a head or a tail entity, and \(\ell \) is a text representation of the entity obtained by querying its label or description in the KG. For example, using the NLQ in Fig. 5, we first obtain “empire of japan” and “sovereign state” as the entity labels of ’Q188712’ in the first triple and ’Q3624078’ in the second triple, respectively. By the algorithm (lines 7 and 12), we add \(0\mathord {:}\{\texttt {head} :\text {`empire of japan'}\}\) and \(1\mathord {:}\{\texttt {tail} :\text {`sovereign state'}\}\) to ps.
In the second part of the algorithm, spanning lines 16 to 32, we perform N-gram matching of ‘empire of japan’ and ‘sovereign state’ against the question string q. That is, we determine whether ‘empire of japan’ and ‘sovereign state’, or any of their n-gram substrings, occur in q. Line 17 starts a loop over the elements of ps. For the element of ps containing the KG label ‘empire of japan’, lines 20-30 determine whether ‘empire of japan’ or any of its n-gram substrings occurs in q. If so, tokenpos stores the token position where it occurs, which is [8, 9, 10] for ‘empire of japan’. When this is found, we add \(0\mathord {:}\{\texttt {head},[8,9,10]\}\) to \(C_q\), as ‘empire of japan’ is the head entity of the 0th triple pattern. Similarly, in the case of ‘sovereign state’, we add \(1\mathord {:}\{\texttt {tail},[1,2]\}\) to \(C_q\). Thus, in the end, we obtain \(C_q = \{0\mathord {:}\{\texttt {head},[8,9,10]\},1\mathord {:}\{\texttt {tail},[1,2]\}\}\).
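The following simplified sketch mirrors the two parts described above, but it is not the paper's Algorithm 1 verbatim: get_label stands in for the KG label/description lookup, variables are assumed to start with '?', and the triple patterns in the example are illustrative reconstructions of the Fig. 5 query rather than the dataset's exact ground truth.

```python
def is_variable(term: str) -> bool:
    return term.startswith("?")        # SPARQL-style variables (assumption)

def find_positions(gram, q_tokens):
    """Leftmost token positions of the token list `gram` in `q_tokens`, or None."""
    k = len(gram)
    for m in range(len(q_tokens) - k + 1):
        if q_tokens[m:m + k] == gram:
            return list(range(m, m + k))
    return None

def build_pattern_set(q_tokens, triple_patterns, get_label):
    """Sketch of the two parts: entity extraction, then n-gram matching."""
    # Part 1: collect (triple index, head/tail tag, entity label) for every
    # non-variable head and tail.
    ps = []
    for i, (s, p, o) in enumerate(triple_patterns):
        if not is_variable(s):
            ps.append((i, "head", get_label(s)))
        if not is_variable(o):
            ps.append((i, "tail", get_label(o)))

    # Part 2: match each label (or its longest matching n-gram) against q.
    c_q = set()
    for i, tag, label in ps:
        tokens = label.lower().split()
        hit = None
        for n in range(len(tokens), 0, -1):           # longest n-grams first
            for j in range(len(tokens) - n + 1):
                hit = find_positions(tokens[j:j + n], q_tokens)
                if hit is not None:
                    break
            if hit is not None:
                break
        if hit is not None:
            c_q.add((i, tag, tuple(hit)))
    return c_q

# Running example (labels and triple patterns hard-coded for illustration only):
labels = {"wd:Q188712": "empire of japan", "wd:Q3624078": "sovereign state"}
q = "which sovereign state is in diplomatic relation of empire of japan".split()
triples = [("wd:Q188712", "wdt:P530", "?x"), ("?x", "wdt:P31", "wd:Q3624078")]
print(build_pattern_set(q, triples, labels.get))
# -> {(0, 'head', (8, 9, 10)), (1, 'tail', (1, 2))}
```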
As illustrated in Fig. 4, we store \(C_q\) in a dictionary of position-based pattern sets using an ad-hoc syntax, which could ease the downstream task of query construction. The syntax uses the following simple rules:
(1) A list of integers \([x_1,\dots ,x_n]\) is written as x_1_x_2_..._x_n. For example, [8, 9, 10] becomes 8_9_10.
(2) A position-based pattern of the form \(i\mathord {:}\{\texttt {head} \mathord {:}tokenpos\}\) is written as i:head:ent:L, where L is the representation of tokenpos according to rule (1). For example, \(0\mathord {:}\{\texttt {head},[8,9,10]\}\) is encoded as 0:head:ent:8_9_10.
(3) A position-based pattern of the form \(i\mathord {:}\{\texttt {tail} \mathord {:}tokenpos\}\) is written as i:tail:ent:L, analogously to rule (2). For example, \(1\mathord {:}\{\texttt {tail},[1,2]\}\) is encoded as 1:tail:ent:1_2.
(4) If \(C_q\) contains two position-based patterns of the form \(i\mathord {:}\{\texttt {head} \mathord {:}tokenpos_h\}\) and \(i\mathord {:}\{\texttt {tail} \mathord {:}tokenpos_t\}\), then they are encoded as a single representation i:head:ent:\(L_h\)[AND]i:tail:ent:\(L_t\), where \(L_h\) and \(L_t\) are the representations of \(tokenpos_h\) and \(tokenpos_t\), respectively, according to rule (1). For example, if \(C_q\) contains both \(1\mathord {:}\{\texttt {tail},[1,2]\}\) and \(1\mathord {:}\{\texttt {head},[5,6]\}\), then we obtain the single encoding 1:head:ent:5_6[AND]1:tail:ent:1_2.
(5) The representation of \(C_q\) is \(E_1\)[SEP]\(E_{2}\)[SEP] ...[SEP]\(E_{m}\), where each \(E_j\) is the representation of the position-based patterns according to rules (2)-(4), sorted by triple-pattern index. Thus, if \(C_q = \{0\mathord {:}\{\texttt {head},[8,9,10]\},1\mathord {:}\{\texttt {tail},[1,2]\}\}\), then its encoding is 0:head:ent:8_9_10[SEP]1:tail:ent:1_2, as seen in Fig. 5.
Question classification using position-based pattern set
Having constructed position-based pattern sets according to the previous section, we can now use those sets as distinct class labels for our multi-class classification problem. This section describes the steps we take to train the model.
Data augmentation
To improve the model's ability to recognize alternative wordings of a question, we augment the data using a synonym-based approach. WordNet is a lexical database of English in which nouns, verbs, adjectives, and adverbs are grouped into sets of synonyms (synsets). We prefer a WordNet-based approach because it preserves the positions of words in a question. We adopt the WordNet-based augmentation proposed by Marivate and Sefara [36]; Fig. 6 illustrates how it works.
In the augmentation step, we focus on synonyms of words carrying verb and noun POS tags in a question, while words with other POS tags are kept unchanged. This approach enriches the range of expressions people use to formulate a query. To generate synonymous questions that have the same structure and intent as the original question, we only replace nouns and verbs with words of the same POS tag; for instance, the words “film” and “movie” in Fig. 6 share the noun tag NN, and the word “produce” is replaced by another word with the same verb tag (VB), such as “lead”. Figure 7 illustrates how word selection works.
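A minimal sketch of such WordNet-based substitution using NLTK is shown below; the tokenization, the restriction to single-word synonyms, and the deterministic choice of one synonym are our simplifications, so this is not the exact procedure of Marivate and Sefara [36].

```python
import nltk
from nltk.corpus import wordnet

# First run only: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("wordnet")

def augment_question(question: str) -> str:
    """Replace nouns (NN*) and verbs (VB*) with single-word WordNet synonyms of the
    same POS, leaving all other tokens (and hence all token positions) unchanged."""
    tagged = nltk.pos_tag(nltk.word_tokenize(question))
    out = []
    for word, tag in tagged:
        if tag.startswith("NN") or tag.startswith("VB"):
            wn_pos = wordnet.NOUN if tag.startswith("NN") else wordnet.VERB
            synonyms = {lemma.name()
                        for syn in wordnet.synsets(word, pos=wn_pos)
                        for lemma in syn.lemmas()
                        if "_" not in lemma.name()} - {word}   # keep single tokens only
            # deterministic pick for illustration; random sampling is equally valid
            out.append(sorted(synonyms)[0] if synonyms else word)
        else:
            out.append(word)
    return " ".join(out)

print(augment_question("who produced the film"))  # prints an augmented variant
```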
Position-based pattern set prediction
To predict the position-based pattern set of a question, we employ a multi-class classifier with an underlying transformer model. For the transformer model, we use a pre-trained BERT-base model, on top of which we place a classification layer with n output neurons, where n is the number of distinct position-based pattern sets.
As shown in Fig. 4, the position-based pattern set construction phase outputs a list of pairs, each consisting of a question and the ID of its position-based pattern set. In the example of Fig. 5, one such pair is (“which sovereign state is in diplomatic relation of empire of japan”, 1861), where the left column (0) holds the question and the right column (1) holds the pattern ID. Here, ID 1861 refers to the pattern 0:head:ent:8_9_10[SEP]1:tail:ent:1_2 in the pattern dictionary. The question column serves as the input to the model.
For the classifier, we use the pre-trained BERT-base-cased model coupled with a simple softmax layer and fine-tune it on our data for the multi-class classification task. The actual implementation of the model is taken from the SimpleTransformers library, which simplifies the Transformers API from the HuggingFace library and allows us to train and evaluate the model quickly.
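A minimal fine-tuning sketch with SimpleTransformers is given below; the hyperparameters, the placeholder NUM_PATTERN_SETS, and the single-row training frame are illustrative only and do not reflect our actual training configuration.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

NUM_PATTERN_SETS = 2000   # placeholder for the size of the pattern dictionary

# One row per question; "labels" holds the integer ID of the question's
# position-based pattern set in the pattern dictionary (e.g., 1861 in Fig. 5).
train_df = pd.DataFrame(
    [["which sovereign state is in diplomatic relation of empire of japan", 1861]],
    columns=["text", "labels"],
)

args = ClassificationArgs(num_train_epochs=3, overwrite_output_dir=True)
model = ClassificationModel(
    "bert", "bert-base-cased",
    num_labels=NUM_PATTERN_SETS,
    args=args,
    use_cuda=False,       # set True when a GPU is available
)

model.train_model(train_df)
predictions, raw_outputs = model.predict(
    ["which sovereign state is in diplomatic relation of empire of japan"]
)
# predictions[0] is a pattern-set ID, mapped back to its encoding
# (e.g., 0:head:ent:8_9_10[SEP]1:tail:ent:1_2) via the pattern dictionary.
```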