We first describe the following terms used in our system:
- Semantic features: The noun and verb terms extracted from the input short text, the concepts representing them, and their co-occurring terms constitute the semantic features. For any given short text, they are the result of the co-occurrence phase.
- Semantic vocabulary: A list of the top-k (e.g., 3000) semantic features that occur most frequently in our training dataset.
- Semantic feature vector: For each enriched input short text, a fixed-size vector (equal to the size of the vocabulary) in which each position holds the count of the corresponding semantic feature.
Now let us consider a dataset containing n training examples (short texts). After enrichment with noun and verb terms, we represent those short texts as \(X^{E}=\{x^1,x^2,x^3,\ldots,x^n\} \in \mathbb{R}^{k\times n}\), where k is the feature-vector dimension. Each row of a feature vector is identified by the position of the corresponding semantic feature, and \(x_j^i\) denotes the number of times feature j occurs in the enriched short text i. Our main aim is to learn the correct category of the given input short text and to output the label Y corresponding to that category.
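As an illustration of these definitions, the vocabulary and the count vectors could be built as in the following Python sketch; `enriched_texts` is an assumed placeholder for the token lists produced by the enrichment pipeline, not part of our system's interface:

```python
from collections import Counter

def build_vocabulary(enriched_texts, k=3000):
    """Top-k most frequent semantic features across the training data."""
    counts = Counter(feature for text in enriched_texts for feature in text)
    return [feature for feature, _ in counts.most_common(k)]

def to_feature_vector(enriched_text, vocabulary):
    """Fixed-size count vector: position j holds the count of feature j."""
    counts = Counter(enriched_text)
    return [counts.get(feature, 0) for feature in vocabulary]
```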
To obtain the enriched text \(X^E\) from the input text X, the text first goes through the pre-processing, conceptualization, and co-occurrence phases.
Our proposed solution to the problem defined above is sketched in the algorithms below:
Algorithm 1 outlines the pre-processing of the input text. The first step is to remove punctuation from the input text. It then extracts the noun terms and stores them in a list, and similarly extracts the verb terms into a separate list. For each noun in the noun list, if PROBASE contains it, the term is added to the pre-processed list. The same is done for each verb term, except that the look-up is performed in VERBNET instead of PROBASE. This algorithm has a runtime complexity of O(n), as removing punctuation and extracting the noun and verb terms each depend linearly on the size of the input.
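For concreteness, a rough Python rendering of Algorithm 1 is given below; it assumes NLTK for tokenization and POS tagging, and `probase_terms` / `verbnet_terms` are placeholder sets standing in for look-ups into PROBASE and VERBNET:

```python
import string
import nltk  # assumes the tokenizer and POS-tagger resources are installed

def preprocess(text, probase_terms, verbnet_terms):
    # Remove punctuation, then POS-tag to separate nouns and verbs.
    text = text.translate(str.maketrans('', '', string.punctuation))
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    nouns = [w for w, tag in tagged if tag.startswith('NN')]
    verbs = [w for w, tag in tagged if tag.startswith('VB')]
    # Keep only nouns known to Probase and verbs known to VerbNet.
    preprocessed = [n for n in nouns if n.lower() in probase_terms]
    preprocessed += [v for v in verbs if v.lower() in verbnet_terms]
    return preprocessed
```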
Algorithm 2 takes the output of Algorithm 1 as input and performs conceptualization. For each term in its input, it looks up PROBASE to find the n most relevant concepts for the term and adds those concepts, along with the terms themselves, to the conceptualization list. Algorithm 2 also prunes irrelevant concepts gathered from PROBASE; we discuss this in detail in the upcoming section. The runtime complexity of this algorithm is O(n), as the loop runs once for each of the n input terms.
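A corresponding sketch of Algorithm 2 follows, with `probase_concepts` as a placeholder mapping from a term to its PROBASE concepts ranked by relevance (the pruning of irrelevant concepts is described later and is not shown here):

```python
def conceptualize(terms, probase_concepts, n=5):
    # Keep the original terms and append the top-n concepts for each one.
    conceptualized = list(terms)
    for term in terms:
        conceptualized.extend(probase_concepts.get(term, [])[:n])
    return conceptualized
```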
Algorithm 3 finds the co-occurring terms of the output of the conceptualization algorithm. For each term in its input, it finds the top m relevant co-occurring terms, again using PROBASE. The generated co-occurring terms are stored in a list along with the terms from the previous algorithms. Some filtering of irrelevant terms is performed, as described in the next section. This algorithm also runs in O(n).
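Algorithm 3 can be sketched in the same style, with `probase_cooccur` as a placeholder mapping from a term to its co-occurring terms ranked by frequency:

```python
def add_cooccurring_terms(terms, probase_cooccur, m=5):
    # Keep the terms gathered so far and append the top-m co-occurring
    # terms for each of them.
    enriched = list(terms)
    for term in terms:
        enriched.extend(probase_cooccur.get(term, [])[:m])
    return enriched
```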
Algorithm 4 predicts the correct category of the input short text. After an input text has gone through all of the previous algorithms, it is represented in vector form, normalized, and then fed to our deep neural network model. The model returns the prediction for the input text.
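Chaining the helpers sketched above, Algorithm 4 could look roughly as follows; `model` is assumed to expose a Keras-style `predict()` returning class probabilities, and the other arguments are the placeholder resources introduced earlier:

```python
import numpy as np

def predict_category(text, vocabulary, model, probase_terms, verbnet_terms,
                     probase_concepts, probase_cooccur):
    # Enrich the raw text with the helpers sketched above.
    terms = preprocess(text, probase_terms, verbnet_terms)
    terms = conceptualize(terms, probase_concepts)
    terms = add_cooccurring_terms(terms, probase_cooccur)
    # Vectorize, normalize, and query the trained model.
    x = np.array(to_feature_vector(terms, vocabulary), dtype=float)
    norm = np.linalg.norm(x)
    if norm > 0:
        x /= norm
    probs = model.predict(x[np.newaxis, :])  # shape (1, num_classes)
    return int(np.argmax(probs, axis=1)[0])
```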
Enrichment and classification flow
Understanding a short text boils down to two major phases: the enrichment phase, which adds context to the original short text, and the classification phase, which assigns the enriched text to a category. Hence, our proposed system contains two main modules: the enrichment module and the classification module. We describe the enrichment module first, and the classification module in a later sub-section.
Let X be the input short text which first goes through the enrichment phase.
Pre-processing
We extract noun and verb terms from X, so the new representation of X is \(X_P=\{x_1, x_2, x_3,\ldots, x_k\}\), where \(x_i, 1\le i\le k\), is an extracted noun or verb term of X. While pre-processing the given input text, we consider the longest phrase that is contained in PROBASE. For example, in the short text “New York Times Bestseller”, although “New York” is a noun term in itself, we take “New York Times” as a single term. Stopwords are removed and word stemming is performed before parsing such terms.
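One simple way to realize this longest-phrase preference is a greedy left-to-right match, sketched below with `probase_phrases` as a placeholder set of known multi-word terms:

```python
def longest_phrase_match(tokens, probase_phrases, max_len=4):
    # Prefer the longest token span found in Probase, so that
    # ["new", "york", "times", "bestseller"] yields "new york times"
    # rather than "new york"; single tokens are kept as a fallback.
    i, matched = 0, []
    while i < len(tokens):
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = ' '.join(tokens[i:i + span]).lower()
            if span == 1 or phrase in probase_phrases:
                matched.append(phrase)
                i += span
                break
    return matched
```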
Conceptualization
We enrich \(X_P\) obtained from pre-processing by finding the concepts of all the \(x_i\) in \(X_P\) with the help of PROBASE and our manually built verb network, which adds further features to \(X_P\). This is just a look-up operation in the networks, which generates a ranked list of concepts related to each \(x_i\) in \(X_P\). We denote the result of this phase by \(X_C=\{x_1,x_2,x_3,\ldots,x_m\}\),
where \(x_i, 1\le i \le m\), are the union of the terms from \(X_P\) and the terms obtained from conceptualization, with m > k. Irrelevant concepts of the given terms can appear in this phase. For instance, given the short text “apple and microsoft”, a concept like “fruit” may appear, which is irrelevant in this context. We therefore apply the following mechanism to remove such ambiguity.
Mechanism In case of ambiguity between two or more concepts given the terms, we employ the simple Naive Bayesian mechanism proposed by Song et al. [10]. It estimates the posterior probability of each concept given the instances, so the concepts most commonly associated with the instances receive higher posterior probability. Formally, the probability of a concept c is given as:
$$\begin{aligned} p(c|E)=\frac{p(E|c)p(c)}{p(E)} \end{aligned}$$
(1)
which is proportional to \(p(c)\prod_{i=1}^{M}p(e_i|c)\).
Using this, for the short text “apple and microsoft” we obtain higher posterior probabilities for concepts like “corporation” and “IT firm”, whereas an irrelevant concept like “fruit” gets a lower probability and is hence pruned.
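In log space, this pruning step can be sketched as follows; `prior` and `likelihood` are placeholder probability tables that would in practice be estimated from PROBASE statistics:

```python
import math

def rank_concepts(instances, prior, likelihood):
    # Score each candidate concept c by log p(c) + sum_i log p(e_i | c),
    # i.e. the right-hand side of Eq. (1) up to the constant p(E).
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for e in instances:
            score += math.log(likelihood.get((e, c), 1e-9))  # smoothing floor
        scores[c] = score
    return sorted(scores, key=scores.get, reverse=True)
```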
Co-occurring terms
We take the result of the previous phase and enrich it using co-occurring terms. The result is the final enriched version of the original short text, given by:
$$\begin{aligned} X_E = \{x_1,x_2,x_3,\ldots,x_n\} \end{aligned}$$
(2)
where \(x_i, 1 \le i \le n\), are the union of the terms from \(X_C\) and the terms obtained from this phase, with n > m.
Co-occurrences provide important information for understanding the meaning of words. Co-occurring terms are the words that frequently appear together with the original terms. The distributional hypothesis [29] states that “words that occur in the same context tend to have similar meanings”.
To find the co-occurring words, we take the terms obtained from the pre-processing phase and enrich them with terms that frequently co-occur with them. For each original term, we look into PROBASE to find its co-occurring terms.
DNN model
We designed a four-layer deep neural network as the learning element for our input text. It was trained in a semi-supervised fashion, as discussed earlier, and is shown in Fig. 2.
Each layer except the input layer shown in the figure is pre-trained with an auto-encoder, in a layer-by-layer fashion. To pre-train the first hidden layer, we train an auto-encoder whose input layer and hidden layer are the same as in Fig. 2 and whose output layer tries to reconstruct the original input. Once this is done, we pre-train the second hidden layer with a separate auto-encoder, which takes the previously trained hidden layer's output as its input and tries to generate a code for it. All subsequent layers are pre-trained in the same manner. We use the sigmoid activation function and the cross-entropy error defined above as the loss function. Through this pre-training, the parameters of our network reach nearly optimal values.
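A minimal Keras-style sketch of this greedy layer-wise pre-training is given below; the framework choice and hyper-parameters are illustrative assumptions, not the exact implementation:

```python
import numpy as np
import tensorflow as tf

def pretrain_layers(x_train, hidden_sizes, epochs=10):
    # Each hidden layer is trained inside an auto-encoder that reconstructs
    # the representation produced by the layers trained before it, and is
    # then kept (with its learned weights) for the growing stack.
    pretrained = []
    current = np.asarray(x_train, dtype="float32")
    for size in hidden_sizes:
        encoder = tf.keras.layers.Dense(size, activation="sigmoid")
        decoder = tf.keras.layers.Dense(current.shape[1], activation="sigmoid")
        autoencoder = tf.keras.Sequential([encoder, decoder])
        autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
        autoencoder.fit(current, current, epochs=epochs, verbose=0)
        pretrained.append(encoder)
        current = encoder(current).numpy()  # becomes the next layer's input
    return pretrained
```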
The pre-training phase alone cannot fully optimize the network parameters, although it brings them to good values. We therefore perform supervised fine-tuning on our pre-trained network. The fine-tuning process is shown in Fig. 3.
We feed our DNN model, through the input layer, with the feature vector that represents the short text. To predict the correct category, we add a softmax layer to our pre-trained network. The output \(o_i\) of any node i is computed as:
$$\begin{aligned} o_i=\frac{\exp (a_i)}{\sum _{m=1}^{n}\exp (a_m)}, \end{aligned}$$
where n is the total number of nodes in the softmax layer. The output of this network is the predicted class for the given input vector. As shown above, our DNN model goes through two training stages: unsupervised pre-training first, followed by supervised fine-tuning [30].
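Continuing the sketch above, the supervised fine-tuning stage could be written as follows; integer class labels and the stated hyper-parameters are assumptions of this illustration:

```python
import tensorflow as tf

def fine_tune(pretrained_layers, x_train, y_train, num_classes, epochs=10):
    # Stack the pre-trained hidden layers, add a softmax output layer, and
    # train the whole network end-to-end with cross-entropy on labelled data
    # (y_train is assumed to contain integer class labels).
    model = tf.keras.Sequential(
        list(pretrained_layers)
        + [tf.keras.layers.Dense(num_classes, activation="softmax")]
    )
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=epochs, verbose=0)
    return model
```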