Here we describe the details of our proposed QA system. In particular, our system translates natural language questions to SPARQL queries in five steps (see Fig. 1 in the "Introduction" section). At each step, a relevant task is solved independently by one individual software component. First, the input question is processed by the question analysis component, based solely on syntactic features. Afterwards, the type of the question is identified and phrases in the question are mapped to corresponding resources and properties in the underlying RDF knowledge graph. Next, a number of candidate SPARQL queries are generated from the mapped resources and properties. A ranking model based on Tree-structured Long Short-Term Memory (Tree-LSTM) [12] then sorts the candidate queries according to the similarity of their syntactic and semantic structure to that of the input question. Finally, answers are returned to the user by executing the top-ranked query against the underlying knowledge graph.
In the proposed architecture, only the Phrase Mapping component depends on the specific underlying knowledge graph, because it requires its concrete resources, properties and classes. All other components are independent of the underlying knowledge graph and can therefore be applied to another knowledge domain without modification.
Question analysis
The first component of our QA system analyzes natural language questions based solely on syntactic features. In particular, our system uses syntactic features to tokenize the question, to determine the proper part-of-speech tags of these tokens, to recognize the named entities, to identify the relations between the tokens and, finally, to determine the dependency label of each question component [2].
Moreover, the questions are lemmatized and a dependency parse tree is generated. The resulting lemma representation and the dependency parse tree are used later for question classification and query ranking.
The goal of lemmatization is to reduce the inflectional forms of a word to a common base form. For instance, a question “Who is the mayor of the capital of French Polynesia?” can be converted to the lemma representation as “Who be the mayor of the capital of French Polynesia?”.
Dependency parsing is the process of analyzing the syntactic structure of a sentence to establish semantic relationships between its components. The dependency parser generates a dependency parse tree [25] that contains typed labels denoting the grammatical relationships for each word in the sentence (see Fig. 2 for an example).
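To make these analysis steps concrete, the following sketch performs tokenization, part-of-speech tagging, lemmatization, named entity recognition and dependency parsing for an input question. The choice of spaCy and its small English model is an assumption for illustration only; the paper does not prescribe a particular NLP toolkit.

```python
# Illustrative question analysis with spaCy (toolkit choice is an assumption, not the
# system's actual implementation).
import spacy

nlp = spacy.load("en_core_web_sm")

def analyze_question(question: str):
    doc = nlp(question)
    tokens = [t.text for t in doc]                               # tokenization
    lemmas = " ".join(t.lemma_ for t in doc)                     # lemma representation
    pos_tags = [(t.text, t.pos_) for t in doc]                   # part-of-speech tags
    entities = [(e.text, e.label_) for e in doc.ents]            # named entities
    dependencies = [(t.text, t.dep_, t.head.text) for t in doc]  # dependency labels
    return tokens, lemmas, pos_tags, entities, dependencies

print(analyze_question("Who is the mayor of the capital of French Polynesia?"))
```

The lemma string and the dependency triples correspond to the lemma representation and the dependency parse tree used in the later steps.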
Question type classification
In order to process various kinds of questions, such as ‘Yes/No’ questions or ‘Count’ questions, the proposed QA system first identifies the type of the question and then constructs the corresponding SELECT (or ASK) clause of the SPARQL query. Our system currently distinguishes between three question types.
The first is the ‘List’ question type, which covers the most common questions according to our analysis of the available datasets (see "Results" section for details). ‘List’ questions usually start with a WH-word or a verb such as “list” or “show”. An example is ‘Who is the wife of Obama?’. The expected answer to a ‘List’ question is a list of resources from the underlying knowledge graph.
The second type is the ‘Count’ question type, for which the keyword ‘COUNT’ appears in the corresponding SPARQL query. Such questions usually start with a particular phrase such as “how many”. An example is ‘How many companies were founded in the same year as Google?’. The expected answer to a ‘Count’ question is a number.
Note that sometimes the expected answer to a ‘Count’-style question can be extracted directly as the value of a property in the underlying knowledge graph instead of being calculated by the ‘COUNT’ SPARQL set function. For example, the answer to the question ‘How many people live in the capital of Australia?’ is already stored as the value of http://dbpedia.org/ontology/populationTotal. As a result, this question is treated as a ‘List’ question instead of a ‘Count’ question.
Finally, for the ‘Boolean’ question type, the corresponding SPARQL query contains the keyword “ASK”. For example: ‘Is there a video game called Battle Chess?’. The expected answer is a Boolean value, either True or False.
We use a machine learning method instead of heuristic rules to classify question types because it is hard to correctly capture all the various question formulations with rules. For example, the question ‘How many people live in Zurich?’ starts with ‘How many’ but, as in the example above, belongs to the question type ‘List’ rather than ‘Count’, since its answer is stored directly as a property value. Similar questions include ‘How high is Mount Everest?’, which also belongs to the question type ‘List’. In order to capture such special questions, many specific cases must be considered when hand-crafting heuristic rules. Instead, using a machine learning algorithm for question type classification saves this tedious manual work and can automatically capture such questions as long as the training data is large and sufficiently diverse.
To automatically derive the question type, we first convert each word of the original question into its lemma representation. Then we use term frequency-inverse document frequency (TF-IDF) to convert the resulting questions into a numeric feature vector [26]. Afterwards, we train a Random Forest model [27] on these numeric feature vectors to classify questions into ‘List’, ‘Count’ and ‘Boolean’ questions. Our experimental results demonstrate that this simple model is good enough for this classification task (see "Results" section). Consequently, a SPARQL query will be constructed based on the derived question type. For instance, ‘ASK WHERE’ is used in the SPARQL query of a ‘Boolean’ question rather than ‘SELECT * WHERE’.
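The classification pipeline itself is small; a minimal sketch with scikit-learn is shown below. The library choice and the toy training sample are assumptions for illustration; in practice the model is trained on the lemmatized questions of the benchmark training data.

```python
# Minimal question type classifier: TF-IDF features + Random Forest (scikit-learn).
# The four training examples below are a toy sample, not the real training data.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

questions = [
    "who be the wife of Obama ?",                               # List
    "how many company be found in the same year as Google ?",   # Count
    "be there a video game call Battle Chess ?",                # Boolean
    "how many people live in Zurich ?",                         # List (see discussion above)
]
labels = ["List", "Count", "Boolean", "List"]

classifier = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))
classifier.fit(questions, labels)

print(classifier.predict(["who be the mayor of the capital of French Polynesia ?"]))
```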
Phrase mapping
After the question type has been identified, our QA system builds the final queries using information about the underlying knowledge graph. According to RDF Schema 1.1, there are mainly three types of information needed to write SPARQL queries: resources, properties and classes [28].
For phrase mapping our QA system uses an ensemble method, combining the results of several widely used phrase mapping systems. The ensemble method allows us to overcome the weaknesses of the individual systems while at the same time exploiting their strengths, so as to produce the best possible results.
In order to identify Resources in a natural language question, we use DBpedia Spotlight [29], TagMe [30], EARL [31] and Falcon [32]. In order to identify Properties we use EARL [31], Falcon [32] and RNLIWOD [10]. Finally, in order to identify Classes we also use RNLIWOD [10]. Below we discuss these systems in more detail.
DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in natural language text [29]. It first detects possible phrases that are later linked to DBpedia resources. A generative probabilistic model is then applied to disambiguate the detected phrases. Finally, an indexing process is applied to the detected phrases to efficiently extract the corresponding entities from DBpedia [29]. DBpedia Spotlight also allows users to tune important parameters such as the confidence level and the support value to adjust the trade-off between the coverage and the accuracy of the detected resources.
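As an illustration, the sketch below queries the public DBpedia Spotlight web service for the running example. The endpoint URL, parameter names and JSON fields reflect the public demo service at the time of writing and are an assumption about the deployment; a self-hosted Spotlight instance would expose the same interface under a different URL.

```python
# Sketch of resource mapping via the public DBpedia Spotlight web service.
import requests

def spotlight_annotate(text: str, confidence: float = 0.5, support: int = 20):
    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence, "support": support},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    # Each annotation links a surface form in the question to a DBpedia resource URI.
    return [(r["@surfaceForm"], r["@URI"])
            for r in response.json().get("Resources", [])]

print(spotlight_annotate("Who is the mayor of the capital of French Polynesia?"))
```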
TagMe is a tool that identifies meaningful substrings in unstructured text on the fly and links each of them to a pertinent Wikipedia page in an efficient and effective way [30]. TagMe performs particularly well when annotating texts that are short and poorly composed, which makes it well suited for question answering tasks. Moreover, TagMe was shown to achieve the best performance on the LC-QuAD dataset among all the available tools used for entity mapping tasks [7].
EARL is a tool for resource and property mapping as a joint task. EARL uses two strategies. The first one is based on reducing the problem to an instance of the Generalized Travelling Salesman problem and the second one uses machine learning in order to exploit the connection density between nodes in the knowledge graph [31]. Both strategies are shown to produce good results for entity and relationship mapping tasks.
Falcon also performs joint resource and property mapping. Falcon shows good performance especially on short texts because it uses a light-weight linguistic approach relying on a background knowledge graph. It uses the context of resources for finding properties and it utilizes an extended knowledge graph created by merging entities and relationships from various knowledge sources [32]. Falcon outperforms other tools and does not require training data, which makes it ideal for a QA system.
RNLIWOD is a tool for mapping properties and classes in a given text. It is shown to have the best performance for this task on the LC-QuAD dataset, although its absolute macro performance remains low. Therefore, RNLIWOD is augmented with a dictionary of predicates and classes, along with their label information, in the question analysis step. As a result, the coverage of predicates and classes by RNLIWOD increases, which ultimately leads to an improvement of the overall performance [10].
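The paper does not spell out how the outputs of the individual systems are merged; one straightforward realization of the ensemble, sketched below, is to take the union of the candidates that each system returns for resources, properties and classes. The mapper callables and the toy candidates are placeholders, not the actual implementation.

```python
# Illustrative ensemble combination: union of the candidates from each mapping system.
def combine_candidates(question, resource_mappers, property_mappers, class_mappers):
    resources = {r for mapper in resource_mappers for r in mapper(question)}
    properties = {p for mapper in property_mappers for p in mapper(question)}
    classes = {c for mapper in class_mappers for c in mapper(question)}
    return resources, properties, classes

# Toy usage with stub mappers; real wrappers around DBpedia Spotlight, TagMe, EARL,
# Falcon and RNLIWOD would take their place.
question = "Who is the mayor of the capital of French Polynesia?"
resources, properties, classes = combine_candidates(
    question,
    resource_mappers=[lambda q: {"dbr:French_Polynesia"}, lambda q: {"dbr:France"}],
    property_mappers=[lambda q: {"dbo:capital", "dbo:country"}, lambda q: {"dbo:mayor"}],
    class_mappers=[lambda q: set()],
)
print(resources, properties, classes)
```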
Fig. 3 shows the mapped resources, properties and classes for the example question: “Who is the mayor of the capital of French Polynesia?”
Query generation
As discussed in the "Question type classification" section, the Question Type Classification component is responsible for determining whether a question falls into the category ‘List’, ‘Count’ or ‘Boolean’. This component determines the ‘SELECT’ clause of the SPARQL query. The next step in constructing a SPARQL query is to determine the ‘WHERE’ clause, which is the goal of the Query Generation component that we discuss next.
Recall that a SPARQL query consists of graph patterns in the form of <subject, predicate, object> triples, where each subject, predicate and object may be a variable. The purpose of the query generation step is therefore to construct a set of such triples. These triples are generated from the mapped resources, properties and classes provided by the Phrase Mapping component. Finally, the ‘WHERE’ clause of the SPARQL query is constructed from the selected triples.
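The assembly of the final query string from the question type and a chosen set of triple patterns is mechanical; a minimal sketch is given below. The string templates mirror the ‘SELECT * WHERE’ and ‘ASK WHERE’ forms mentioned earlier, while the exact COUNT projection used by the system is an assumption.

```python
# Minimal sketch of assembling a SPARQL query from the question type and triple patterns.
def build_sparql(question_type: str, triples: list) -> str:
    where = " . ".join(triples)
    if question_type == "Boolean":
        return f"ASK WHERE {{ {where} }}"
    if question_type == "Count":
        # The exact projection variable is an assumption for illustration.
        return f"SELECT (COUNT(DISTINCT ?uri) AS ?c) WHERE {{ {where} }}"
    return f"SELECT * WHERE {{ {where} }}"  # 'List' questions

print(build_sparql("List", [
    "dbr:French_Polynesia dbo:capital ?uri",
    "?uri dbo:mayor ?uri2",
]))
```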
In order to find the desired RDF triples, all possible combinations of mapped resources, properties and classes are examined [9]. For instance, dbr:French_Polynesia is a mapped resource and dbo:capital is a mapped property in the example question “Who is the mayor of the capital of French Polynesia?”. The corresponding triple pattern \(\texttt {<dbr:French\_Polynesia}\,\texttt {dbo:capital}\,\texttt {?uri>}\) is added to the set S of all possible triples because it exists in the underlying knowledge graph. Since dbr:France is another mapped resource and dbo:country is another mapped property, the corresponding triple pattern \(\texttt {<?uri}\,\texttt {dbo:country}\,\texttt {dbr:France>}\) is also added to the set S for the same reason.
In more complex SPARQL queries, more than one variable may be involved. Therefore, the set S is extended by adding relationships to a new variable [9]. For example, the triple pattern \(\texttt {<dbr:French\_Polynesia}\,\texttt {dbo:capital}\,\texttt {?uri>}\) in S can be extended by adding another triple pattern \(\texttt {<?uri}\,\texttt {dbo:mayor}\,\texttt {?uri'>}\), because dbo:mayor is one of the mapped properties in the example question and such a relationship exists in the underlying knowledge graph. The triple pattern \(\texttt {<?uri}\,\texttt {dbo:country}\,\texttt {dbr:France>}\) can be extended by adding \(\texttt {<?uri'}\,\texttt {dbo:mayor}\,\texttt {?uri>}\) to S for the same reason.
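The check whether a candidate triple pattern exists in the underlying knowledge graph can be implemented as a lightweight ASK query against the SPARQL endpoint. The sketch below uses SPARQLWrapper and the public DBpedia endpoint; both choices are assumptions for illustration.

```python
# Sketch of the existence check for a candidate triple pattern via a SPARQL ASK query.
from SPARQLWrapper import SPARQLWrapper, JSON

PREFIXES = ("PREFIX dbr: <http://dbpedia.org/resource/> "
            "PREFIX dbo: <http://dbpedia.org/ontology/>")

def triple_exists(pattern: str, endpoint: str = "https://dbpedia.org/sparql") -> bool:
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"{PREFIXES} ASK WHERE {{ {pattern} }}")
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["boolean"]

print(triple_exists("dbr:French_Polynesia dbo:capital ?uri"))  # expected True if the relation is present
```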
We choose to examine only the subgraph containing the mapped resources and properties instead of traversing the whole underlying knowledge graph. As a result, our approach dramatically decreases the computation time compared to [9]. Considering the whole knowledge graph instead would hurt both precision and execution time: the time needed to check all possible entity-property combinations increases significantly with the number of properties, the number of plausible queries to be considered grows accordingly, and consequently so does the time needed to compute the similarity between questions and SPARQL queries.
A list of triples needs to be selected from the set S to build the ‘WHERE’ clause of the SPARQL query. However, some of the mapped resources, properties and classes produced by the phrase mapping step may be incorrect or unnecessary. Therefore, instead of only choosing the combination of triples that contains all the mapped resources and properties and has the maximum number of triples, combinations of any size are constructed from all triples in S, as long as the corresponding relationships exist in the underlying knowledge graph. For example, (\(\texttt {<dbr:French\_Polynesia}\, \texttt {dbo:capital}\,\texttt {?uri>}\), \(\texttt {<?uri}\,\texttt {dbo:mayor}\,\texttt {?uri'>}\)) is one possible combination and (\(\texttt {<?uri}\,\texttt {dbo:country}\,\texttt {dbr:France>}\), \(\texttt {<?uri'}\,\texttt {dbo:mayor}\,\texttt {?uri>}\)) is another. Given the question type information, each possible combination can be used to build one SPARQL query. As a result, many candidate queries are generated for each input question.
Algorithm 1 summarizes the process of constructing the set S of all possible triples and the set Q of all possible queries, where \(E'\) is the set of all mapped resources, \(P'\) is the set of all mapped properties and K is the underlying knowledge graph. The basic idea of generating all possible triple patterns is taken from previous research [9]. However, we improve that approach so that it can generate more possible WHERE clauses and thus handle more complex queries (see lines 15–24 of the algorithm below).
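Algorithm 1 itself is not reproduced here; the sketch below captures our simplified reading of its idea, reusing the triple_exists and build_sparql helpers sketched above: generate triple patterns from the mapped resources \(E'\) and properties \(P'\), extend them with a second variable, and build one candidate query per combination that exists in K. Variable naming and pruning are deliberately simplified.

```python
# Simplified sketch of the candidate generation behind Algorithm 1 (not the exact algorithm).
from itertools import combinations

def generate_candidates(resources, properties, question_type, exists, max_size=2):
    # Step 1: triple patterns connecting a mapped resource and property to a variable.
    triples = set()
    for e in resources:
        for p in properties:
            for pattern in (f"{e} {p} ?uri", f"?uri {p} {e}"):
                if exists(pattern):
                    triples.add(pattern)
    # Step 2: extend existing patterns with a second variable via another mapped property.
    for base in list(triples):
        for p in properties:
            for pattern in (f"?uri {p} ?uri2", f"?uri2 {p} ?uri"):
                if exists(f"{base} . {pattern}"):
                    triples.add(pattern)
    # Step 3: one candidate query per combination of triple patterns that exists in the graph.
    queries = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(triples), size):
            # Skip combinations that mention no mapped resource at all.
            if not any(term in resources for t in combo for term in t.split()):
                continue
            if exists(" . ".join(combo)):
                queries.append(build_sparql(question_type, list(combo)))
    return queries

# Usage for the running example (issues several ASK queries against the endpoint).
candidates = generate_candidates(
    resources={"dbr:French_Polynesia", "dbr:France"},
    properties={"dbo:capital", "dbo:country", "dbo:mayor"},
    question_type="List",
    exists=triple_exists,
)
```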
Query ranking
In the previous step, Query Generation, we generated a number of candidate queries for each natural language question. The next step is to rank the candidates and to select the most plausible queries. We follow the approach proposed in [9], which relies on Tree-structured Long Short-Term Memory (Tree-LSTM) [12]. In the following we give a high-level account of the method; for technical details we refer the reader to the original publications.
Basic idea for ranking
There is an intrinsic tree-like structure in both SPARQL queries and natural language questions. We adopt the basic assumption that the syntactic similarity between the queries and the input question can be used for ranking. Since the desired query should capture the intention of the input question, the candidate queries that are syntactically most similar to the input question should rank highest.
As an example, let us revisit the query processing phase with the question: “Who is the mayor of the capital of French Polynesia?”. In the preprocessing phase for the input question, the words corresponding to the mapped resources in the question are substituted with a placeholder. Subsequently, the dependency parse tree of the input question is created (depicted in Fig. 2 for our example). Fig. 4 shows the tree representation of four possible candidate queries for the example question. According to our ranking approach, the first query has the highest similarity among all possible candidate queries.
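A small sketch of this substitution step is shown below; the placeholder token and the string-matching strategy are assumptions for illustration, not the system's actual preprocessing.

```python
# Sketch of the preprocessing step: surface forms of mapped resources are replaced with a
# placeholder before the dependency parse tree is built.
import re

def substitute_resources(question: str, surface_forms, placeholder: str = "<e>") -> str:
    for form in sorted(surface_forms, key=len, reverse=True):  # longest surface forms first
        question = re.sub(re.escape(form), placeholder, question, flags=re.IGNORECASE)
    return question

print(substitute_resources(
    "Who is the mayor of the capital of French Polynesia?",
    ["French Polynesia"],
))  # -> "Who is the mayor of the capital of <e>?"
```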
Ranking with tree-LSTM
LSTM augments the vanilla Recurrent Neural Network (RNN) structure with memory cells and thus preserves sequence information over longer time periods. We measure the similarity between candidate queries and the input question based on Tree-LSTM [12]. While a standard LSTM operates on the sequential order of its input, a Tree-LSTM takes the tree representation into account. More specifically, a Tree-LSTM unit incorporates information not only from an input vector but also from the hidden states of arbitrarily many child units, whereas a standard LSTM unit uses only the hidden state of the previous time step. Thus, Tree-LSTM accommodates sentence structure better. Indeed, Tree-LSTM has been shown empirically to outperform strong LSTM baselines in tasks such as predicting semantic relatedness [12].
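For reference, the transition equations of the Child-Sum Tree-LSTM variant introduced in [12] are, for a node \(j\) with children \(C(j)\), input vector \(x_j\) and child hidden states \(h_k\):

\[
\begin{aligned}
\tilde{h}_{j} &= \sum_{k \in C(j)} h_{k},\\
i_{j} &= \sigma\left(W^{(i)} x_{j} + U^{(i)} \tilde{h}_{j} + b^{(i)}\right),\\
f_{jk} &= \sigma\left(W^{(f)} x_{j} + U^{(f)} h_{k} + b^{(f)}\right),\\
o_{j} &= \sigma\left(W^{(o)} x_{j} + U^{(o)} \tilde{h}_{j} + b^{(o)}\right),\\
u_{j} &= \tanh\left(W^{(u)} x_{j} + U^{(u)} \tilde{h}_{j} + b^{(u)}\right),\\
c_{j} &= i_{j} \odot u_{j} + \sum_{k \in C(j)} f_{jk} \odot c_{k},\\
h_{j} &= o_{j} \odot \tanh(c_{j}),
\end{aligned}
\]

where \(\sigma\) is the logistic sigmoid and \(\odot\) denotes element-wise multiplication. A separate forget gate \(f_{jk}\) is computed for each child, which lets a node selectively keep or discard information from each of its subtrees. Which Tree-LSTM variant the system uses is not stated here; the Child-Sum variant is the natural choice for dependency trees [12].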
We use Tree-LSTM to map the input question and the candidate queries to a latent space (i.e. numerical vectors), and then compute the similarity between the resulting vectors. More specifically, the dependency parse tree of the natural language question is mapped to the latent space via one Tree-LSTM, denoted Question Tree-LSTM in [9], while the tree representations of the candidate queries are mapped to the latent space via a different Tree-LSTM, denoted Query Tree-LSTM. For each question-query pair, the similarity score is computed using a neural network that considers both the distance and the angle between the two vectors in the latent space. As a cost function, we use the regularized Kullback–Leibler (KL) divergence between the predicted and the target distributions. Since the goal is to select the candidate query which is most similar to the original natural language question, we pick the pair with the highest similarity score. For technical details we refer to the original article [12].
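Concretely, letting \(h_{q}\) and \(h_{Q}\) denote the latent vectors produced by the Question Tree-LSTM and the Query Tree-LSTM respectively, the similarity module of [12] computes

\[
\begin{aligned}
h_{\times} &= h_{q} \odot h_{Q}, \qquad h_{+} = \left| h_{q} - h_{Q} \right|,\\
h_{s} &= \sigma\left(W^{(\times)} h_{\times} + W^{(+)} h_{+} + b^{(h)}\right),\\
\hat{p}_{\theta} &= \operatorname{softmax}\left(W^{(p)} h_{s} + b^{(p)}\right),\\
J(\theta) &= \frac{1}{m} \sum_{k=1}^{m} \operatorname{KL}\left(p^{(k)} \,\middle\|\, \hat{p}^{(k)}_{\theta}\right) + \frac{\lambda}{2} \lVert \theta \rVert_{2}^{2},
\end{aligned}
\]

where \(h_{\times}\) and \(h_{+}\) capture the angle and the distance between the two vectors, \(p^{(k)}\) is the target similarity distribution for the k-th training pair, m is the number of training pairs and \(\lambda\) is an L2 regularization hyperparameter. The notation follows [12]; in our setting, the "sentence pair" of the original model is a question-query pair.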