The steps taken for our experiments can be summarised as an activity pipeline, which is depicted in Fig. 1.
Corpus generation
The first step is the creation of the corpus. We obtain tweets from a basic Twitter account via a Python script that accesses Twitter's Streaming API through the tweepy library (Footnote 1), applying simple keyword filters to extract the desired tweets. The tweets we download are those marked as public, which is the default security level; tweets are only marked private if expressly indicated by users. For the training data we employ a data set of 13,004 English tweets that mention the flu, the common cold or Listeria.
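A minimal sketch of this collection step is shown below, assuming the tweepy v3-style StreamListener interface; the keywords and credentials are illustrative placeholders, not the actual filters used.

```python
import tweepy

# Placeholder keywords and credentials (illustrative only).
KEYWORDS = ["flu", "common cold", "listeria"]

class TweetCollector(tweepy.StreamListener):
    def on_status(self, status):
        if not status.text.startswith("RT"):  # skip obvious retweets early
            print(status.text)

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

stream = tweepy.Stream(auth=auth, listener=TweetCollector())
stream.filter(track=KEYWORDS, languages=["en"])
```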
For the test data we extract tweets that mention the flu in French, German, Spanish, Arabic and Japanese. We translate these tweets, eliminate those that are incomprehensible, and annotate the remainder. At annotation we label as positive those tweets that mention a recent (less than a month old) or ongoing case of disease. Within this period we expect the disease to still be within its communicable period, that is, the period during which new infections are still possible as a result of transmission from sick individuals to susceptible healthy individuals. We treat all other mentions as irrelevant and label these messages negative.
In addition, we eliminate duplicates by removing retweets (tweets with the "RT" tag) and manually checking for duplicates that may not be marked as "RT". Finally, we remove all punctuation except the "#" and "@" symbols where they appear at the beginning of tokens, where they denote hashtags and user mentions respectively. Table 1 below summarizes the composition of our corpus. We do not perform any preprocessing prior to translation. We found the translation API to be quite robust, but in many cases it returns an unspecified system error, and in a few cases the translation contains only inconsequential elements such as URLs or is incomprehensible, leaving us unable to competently annotate the tweet. We remove these tweets, in addition to tweets whose contents are repeated in several other tweets and those that have been retweeted. For this reason, fewer tweets are retained after translation than the actual number of tweets contained in the corresponding datasets. The yield column in Table 1 gives the percentage of tweets successfully translated and retained.
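A sketch of the de-duplication and punctuation cleaning described above, assuming tweets are available as plain strings (manual duplicate checking is still carried out separately):

```python
import string

def clean_tweet(text):
    """Remove punctuation except '#' and '@' at the start of tokens."""
    cleaned = []
    for tok in text.split():
        prefix = tok[0] if tok.startswith(("#", "@")) else ""
        body = tok[len(prefix):].translate(str.maketrans("", "", string.punctuation))
        if prefix or body:
            cleaned.append(prefix + body)
    return " ".join(cleaned)

def deduplicate(tweets):
    """Drop retweets ('RT' tag) and exact duplicates after cleaning."""
    seen, kept = set(), []
    for t in tweets:
        if t.strip().startswith("RT"):
            continue
        c = clean_tweet(t)
        if c and c not in seen:
            seen.add(c)
            kept.append(c)
    return kept
```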
Pre-processing
The next step is pre-processing. We start with tokenization, followed by part-of-speech tagging. For part-of-speech tagging we employ the GATE (General Architecture for Text Engineering) TwitIE tagger application [18]. The tagger uses the Penn Treebank tag set [19] plus three additional tags, "HT", "USR" and "URL", corresponding to the Twitter-specific phenomena of hashtags, user mentions and URLs respectively. We run these experiments both with and without stemming. Stemming further reduces lexical diversity by collapsing different forms of the same word into a single stem.
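Purely as an illustration of the tokenization, tagging and optional stemming steps, a rough Python equivalent using NLTK (an assumed stand-in; the paper itself uses the GATE TwitIE application) is:

```python
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "I have never had the flu!"
tokens = nltk.word_tokenize(text)
tags = nltk.pos_tag(tokens)                        # Penn Treebank tags
stems = [PorterStemmer().stem(t) for t in tokens]  # stemming variant
print(tags, stems)
```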
Feature generation
We try out four different input representations: one-hot word-level representations, one-hot concept-level representations, distributed representations over words, and distributed representations over concepts. For the conceptual representations we employ two different ontologies: the ontology previously developed by Magumba et al. [20] and SNOMED-CT (Systematized Nomenclature of Medicine-Clinical Terms) [21]. Whereas several other ontologies exist, such as OBO (Open Biomedical Ontologies) [22] and the BioCaster ontology [8], we elect to employ these two because they have a broader conceptual coverage, which we consider an advantage in a general language modeling task. For instance, the BioCaster ontology's partially defined word lists contain terms such as "human ehrlichiosis" and "enzootic bovine leukosis", which are generally too technical to appear in casual texts like Twitter posts with any significant regularity. The Magumba et al. [20] ontology, on the other hand, was created specifically for Twitter disease event detection, and SNOMED-CT was created specifically to harmonise divergent medical terminology and is marketed by SNOMED International as the most comprehensive and precise clinical health terminology product in the world.
For the ontology-based input representations we transform each tweet into a vector of features as follows. Firstly, we flatten our ontology into a list of its constituent concepts. In the Magumba et al. [20] ontology each concept is associated with a group of words or tokens referred to as the concept dictionary. Each concept is effectively a list of words and the full ontology is basically a list of lists; in this sense it is a heavily redacted English dictionary containing only words considered to be of epidemiological relevance. To obtain the feature vector we tokenize each tweet and, for each token, do a dictionary lookup in the flattened ontology. If the token exists in the ontology, we simply replace it with the concept in which it occurs. As an example, the sentence "I have never had the flu" is encoded as "SELF_REF HAVE FREQUENCY HAVE OOV OOV".
SELF_REF refers to "Self references", the concept class for terms that persons use to refer to themselves, such as "I", "We" and "Us", used as indicators of speaking in the first person. "HAVE" is the concept class for "have" or "had"; it is a special concept class since the verb "to have" is conceptually ambiguous, as it can legitimately indicate two senses, namely falling sick or possession. The "FREQUENCY" term denotes a reference to the frequency concept, which captures temporal periodicity. The "OOV" terms at the end of the CNF representation stand for "Out of Vocabulary". The current version of this ontology has 136 concepts corresponding to 1531 tokens, versus a vocabulary of about 59,000 tokens for our full corpus (or several billion words in English); needless to say, most words are out of vocabulary. To obtain the final so-called CNF representation, the "OOV" terms are replaced with their part-of-speech tags; therefore our previous example, "I have never had the flu", becomes "SELF_REF HAVE FREQUENCY HAVE DT NN". Figure 2 below depicts the transformations for the message "I have never had the flu!".
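A toy sketch of this CNF encoding is given below, using a tiny illustrative fragment of the concept dictionary and NLTK's tagger as an assumed stand-in for the TwitIE tagger:

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Illustrative fragment only; the real ontology has 136 concepts / 1531 tokens.
CONCEPT_DICT = {
    "SELF_REF": {"i", "we", "us"},
    "HAVE": {"have", "had", "has"},
    "FREQUENCY": {"never", "always", "often"},
}
TOKEN_TO_CONCEPT = {tok: c for c, toks in CONCEPT_DICT.items() for tok in toks}

def to_cnf(text):
    tokens = nltk.word_tokenize(text)
    tags = nltk.pos_tag(tokens)
    # In-vocabulary tokens become their concept; OOV tokens keep their POS tag.
    return [TOKEN_TO_CONCEPT.get(tok.lower(), tag) for tok, tag in tags]

print(to_cnf("I have never had the flu"))
# expected along the lines of: ['SELF_REF', 'HAVE', 'FREQUENCY', 'HAVE', 'DT', 'NN']
```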
For SNOMED-CT the ontology is organized differently: each concept has a corresponding description, but there is no notion of a concept dictionary. For the SNOMED-CT experiments we employ an SQLite3 implementation; for each word we search for the concept in whose description it appears using the FTS4 (full-text search) engine. We deal with out-of-vocabulary tokens in the same way, by replacing them with their part-of-speech tags. For both ontology representations the input may take the form of a simple one-hot vector or of a distributed representation obtained by applying neural embeddings over concepts in the same way they are applied to words.
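A minimal sketch of such a lookup with SQLite's FTS4 engine follows; the table layout and example row are assumptions rather than the actual SNOMED-CT release format:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE descriptions USING fts4(concept_id, term)")
conn.execute("INSERT INTO descriptions VALUES (?, ?)",
             ("C0001", "Influenza (disorder)"))  # illustrative row only

def lookup_concept(token):
    row = conn.execute(
        "SELECT concept_id FROM descriptions WHERE term MATCH ?", (token,)
    ).fetchone()
    return row[0] if row else None  # None -> fall back to the POS tag

print(lookup_concept("influenza"))  # 'C0001'
```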
Word2vec/Doc2Vec settings
For training distributed embeddings we use the word2vec/doc2vec models by Mikolov et al. [23, 24]. For the ontology-based CNF word2vec/doc2vec model we find optimal performance with a 200-dimensional representation, 8 noise words, a context window of 5, 20 training epochs, the distributed memory architecture and a minimum word count of 2. For the word-level word2vec model we employ Google's gold-standard pretrained 300-dimensional model with its 3 million word vocabulary.
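Assuming the CNF-encoded tweets are available as lists of tokens, the doc2vec configuration above maps onto gensim roughly as follows (the corpus here is a placeholder):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cnf_corpus = [["SELF_REF", "HAVE", "FREQUENCY", "HAVE", "DT", "NN"],
              ["NN", "IN", "DT", "NN"]]  # placeholder corpus
documents = [TaggedDocument(words=toks, tags=[i])
             for i, toks in enumerate(cnf_corpus)]

# Settings reported above: 200 dimensions, 8 noise words, window of 5,
# 20 epochs, distributed memory (dm=1), minimum word count of 2.
model = Doc2Vec(documents, vector_size=200, window=5, negative=8,
                dm=1, min_count=2, epochs=20)
vector = model.infer_vector(["SELF_REF", "HAVE", "DT", "NN"])
```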
Experimental setup
We try out a variety of models, including deep neural networks using Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. We specify a maximum message length of 20 tokens for these experiments. Consequently, each tweet takes the form of a 20 × 200 vector representation; messages shorter than 20 words are zero padded. For the CNN-CNF model we use an architecture similar to that of Kim [25]. We first pass the embeddings through a dropout layer, then employ three sets of filters of widths 3, 5 and 7, each spanning the full concept embedding length of 200, with rectified linear unit (ReLU) activations and a stride of 1. The output of the feature maps is passed through a max-pooling layer and the full output concatenated into a final feature vector that is fed to a dropout layer, which then feeds into a sigmoid output layer with one neuron. The model architecture is depicted in Fig. 3. The architecture is similar for the word-level experiment that employs Google's gold-standard model, except that, since it is a 300-dimensional representation, the input shape is 20 × 300 instead.
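A minimal Keras sketch of this multi-width CNN, assuming 100 filters per width (the filter count is not stated above), the 0.25 dropout probability used throughout, and an Adam optimizer (an assumption):

```python
from tensorflow.keras import layers, models

SEQ_LEN, EMB_DIM, N_FILTERS = 20, 200, 100  # N_FILTERS is an assumption

inputs = layers.Input(shape=(SEQ_LEN, EMB_DIM))
x = layers.Dropout(0.25)(inputs)

# One convolutional branch per filter width (3, 5, 7), each max-pooled.
branches = []
for width in (3, 5, 7):
    conv = layers.Conv1D(N_FILTERS, width, strides=1, activation="relu")(x)
    branches.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.Concatenate()(branches)
merged = layers.Dropout(0.25)(merged)
outputs = layers.Dense(1, activation="sigmoid")(merged)

cnn_cnf = models.Model(inputs, outputs)
cnn_cnf.compile(optimizer="adam", loss="binary_crossentropy",
                metrics=["accuracy"])
```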
For the CNN-LSTM-CNF model, which stacks an RNN on top of a CNN, we employ a single filter width of 3 and a stride of 1, then apply max-pooling with a pool width of 2 to the resulting feature maps. We then concatenate the output into a single feature vector, which we pass into an LSTM layer that feeds directly into a sigmoid output layer containing one neuron. Figure 4 below depicts the model architecture.
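A corresponding Keras sketch of the CNN-LSTM stack; the filter count and LSTM width are illustrative assumptions:

```python
from tensorflow.keras import layers, models

cnn_lstm = models.Sequential([
    layers.Input(shape=(20, 200)),
    layers.Conv1D(100, 3, strides=1, activation="relu"),  # single filter width of 3
    layers.MaxPooling1D(pool_size=2),                     # pool width of 2
    layers.LSTM(100),
    layers.Dense(1, activation="sigmoid"),
])
```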
For the stack-of-two-LSTMs-CNF model, which stacks one RNN layer on top of another, we first apply dropout to the input array, then feed the output to an LSTM layer 200 neurons wide for the CNF model, followed by a dropout layer, another LSTM layer and another dropout layer, which finally feeds into a sigmoid output layer with a single neuron. Figure 5 depicts the architecture of the stack-of-two-LSTMs model for the input message "I THINK AM GOING TO GET SICK" in the CNF form described by Magumba et al. [20].
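A sketch of the stack of two LSTMs with the 0.25 dropout probability used for all models; the width of the second LSTM layer is an assumption:

```python
from tensorflow.keras import layers, models

stacked_lstm = models.Sequential([
    layers.Input(shape=(20, 200)),
    layers.Dropout(0.25),
    layers.LSTM(200, return_sequences=True),
    layers.Dropout(0.25),
    layers.LSTM(200),
    layers.Dropout(0.25),
    layers.Dense(1, activation="sigmoid"),
])
```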
For the Bi-directional LSTM-CNF model we first apply dropout to the input layer, then feed the result into two parallel stacks of LSTM layers, with the second stack processing the sequence in reverse order. Each stack comprises a first LSTM layer 200 neurons wide whose output is fed to a dropout layer, which then feeds its output to a second LSTM layer. At this point the output from both stacks is concatenated and fed into a single dropout layer and then into a final sigmoid output layer with a single neuron. Figure 6 below depicts the Bi-directional LSTM model for the input sentence "SICK WITH THE FLU!". For all deep learning experiments we find optimal performance after a small number of iterations. We use five epochs for model training; beyond this the models tend to overfit due to the small size of the effective vocabulary. All coding is done in Python, and for the neural models we employ the Keras package. For all CNN and LSTM models the dropout layers use a dropout probability of 0.25.
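A sketch of the bi-directional variant built as two parallel LSTM stacks, with the backward stack reading the sequence in reverse; layer widths and dropout follow the description above, while the optimizer and other details are assumptions:

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(20, 200))
x = layers.Dropout(0.25)(inputs)

def lstm_stack(tensor, backwards=False):
    """Two LSTM layers with dropout in between; backwards reverses the sequence."""
    h = layers.LSTM(200, return_sequences=True, go_backwards=backwards)(tensor)
    h = layers.Dropout(0.25)(h)
    return layers.LSTM(200)(h)

merged = layers.Concatenate()([lstm_stack(x), lstm_stack(x, backwards=True)])
merged = layers.Dropout(0.25)(merged)
outputs = layers.Dense(1, activation="sigmoid")(merged)

bilstm = models.Model(inputs, outputs)
bilstm.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# bilstm.fit(X_train, y_train, epochs=5)  # five training epochs as noted above
```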
For word2vec and doc2vec vectors we employ the Python gensim package [26]. For the unigram bag-of-words models we use scikit-learn's SGD classifier, which is a support vector machine classifier trained with a stochastic gradient descent procedure. We also employ the Python scikit-learn package [27] for the logistic regression classifier. For the logistic regression and SGD models we use the scikit-learn hyperparameter defaults, except that we employ 10,000 iterations for the SGD model.
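The baseline configuration described maps onto scikit-learn roughly as follows; the bag-of-words vectorization step via CountVectorizer is an assumed implementation detail:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline

# SGDClassifier's default hinge loss gives a linear SVM trained with SGD;
# all other hyperparameters are left at their scikit-learn defaults.
svm_sgd = make_pipeline(CountVectorizer(), SGDClassifier(max_iter=10000))
logreg = make_pipeline(CountVectorizer(), LogisticRegression())

# svm_sgd.fit(train_texts, train_labels); svm_sgd.predict(test_texts)
```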
For the BERT experiments we employ two pretrained multilingual BERT models, namely bert-base-multilingual-uncased and bert-base-multilingual-cased, from the HuggingFace library [28]. The BERT output representations are then fed into the same CNN architecture employed for the word2vec embeddings above.
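A sketch of how the pretrained multilingual BERT representations can be obtained with the HuggingFace transformers library before being passed to the CNN; the 20-token limit follows the earlier experiments, while the exact integration details are assumptions:

```python
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = TFAutoModel.from_pretrained("bert-base-multilingual-cased")

enc = tokenizer(["Sick with the flu!"], padding="max_length",
                truncation=True, max_length=20, return_tensors="tf")
hidden = bert(**enc).last_hidden_state  # shape (batch, 20, 768)
# `hidden` is then fed to the same CNN architecture sketched earlier,
# with an input shape of 20 x 768 instead of 20 x 200.
```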