Evaluation of different machine learning approaches and input text representations for multilingual classification of tweets for disease surveillance in the social web

Twitter and social media as a whole have great potential as a source of disease surveillance data; however, the general messiness of tweets presents several challenges for standard information extraction methods. Most deployed systems rely on simple keyword matching and do not distinguish between relevant and irrelevant keyword mentions, which makes them susceptible to false positives because keyword volume can be influenced by several social phenomena that may be unrelated to disease occurrence. Furthermore, most solutions are intended for a single language, and those meant for multilingual scenarios do not incorporate semantic context. In this paper we experimentally examine different approaches for classifying text for epidemiological surveillance on the social web. In addition, we offer a systematic comparison of the impact of different input representations on performance. Specifically, we compare continuous representations against one-hot encoding for word-based, class-based (ontology-based) and subword units in the form of byte pair encodings. We also establish the desirable performance characteristics for multi-lingual semantic filtering approaches and offer an in-depth discussion of the implications for end-to-end surveillance.

In many cases this assumption is too strong, as the volume of disease-related messages can be influenced by panic and other factors. It is therefore important to incorporate the semantic orientation of tweets to discriminate between relevant and irrelevant mentions of given keywords, because even messages that explicitly mention diseases may do so in contexts that are unrelated to disease occurrence or that are spatio-temporally irrelevant. For instance, the post "I remember when the Challenger went down, I was home sick with the flu!" is an actual reference to an occurrence of the flu, but the Challenger disaster occurred in 1986, so it would be incorrect to count this mention to model an outbreak in 2021. There is some experimental evidence to suggest that incorporating the semantic orientation of tweets improves the end-to-end performance of prediction models for applications like nowcasting [6,7]. In spite of this, we are not currently aware of any large-scale automated surveillance systems actively using semantic filtering techniques to classify messages.

Related work
For multi-lingual applications the dominant approach has been the use of multi-lingual ontologies and taxonomies, as in the BioCaster [8] and HealthMap [9] systems. These are still essentially keyword volume systems at the core. Furthermore, these systems are really just secondary aggregators, as they themselves rely on human-mediated systems like Promed-Mail [10], which may make the semantic check redundant. In general, multilingual text classification for the purpose of semantic filtering of texts for disease surveillance has not been a very active area of research. We could find only one recent study, by Mutuvi et al. [11], which systematically investigates the issue. In terms of motivation and technical approach this work is very similar; where we differ is that we investigate the issue for Twitter messages, whereas Mutuvi et al. [11] employed Promed-Mail messages. Social media messages generated on platforms like Twitter present unique challenges for conventional text processing techniques. For instance, Twitter text is generated by millions of users, each with their own individual writing style and vocabulary. In addition, tweets are short and colloquial in nature and are characterized by slang, misspellings, poor grammar and additional artifacts like hashtags, emoticons and URLs (Uniform Resource Locators). Furthermore, they are less purposeful than messages on platforms like Promed-Mail, which have been created with the explicit goal of communicating about disease outbreaks. When people report disease events on Twitter it is often inadvertent, meaning parts of the message will be redundant and even misleading. Another key difference is the error analysis. Previous work has not delved into a detailed discussion of the error types and their implications for the use case. Here we dedicate a significant portion of the discussion to characterizing the errors and discussing their implications for end-to-end performance and the desirability of different models.
Regarding multi-lingual message classification, there are two possible approaches. The first entails creating different models for each language under consideration. This is potentially resource intensive as it requires multi-lingual expertise. The second is to create a single model in a so-called "resource-rich" language and then employ it to classify related "resource-poor" languages. This requires the "resource-poor" languages to be translated to the "resource-rich" language. In this work we investigate the latter approach. There are two variations of this approach: one is to fully translate the resource-poor language to the resource-rich language, and the other is to partially translate by projecting mono-lingual word embeddings to a common embedding space to obtain multi-lingual word embeddings [12][13][14]. Typically, embeddings in resource-poor languages are projected to those in the resource-rich language, which is usually English. Multi-lingual classification with multi-lingual word embeddings provides performance that is on par with mono-lingual classification on certain tasks such as document classification.
State-of-the-art machine translation systems employ neural machine translation approaches, which rely on recurrent auto encoder-decoder neural networks. It also helps to have in-domain training data, such that the style and vocabulary of the source data are similar to those of the target, as words may have different meanings in different situations. As input, these systems take distributed representations of the source language, and they output distributed representations of the target language. In this work we employ Google's Neural Machine Translation system (GNMT) to obtain full translations of tweets. The GNMT models are trained using publicly available data from sources like European Union communications. These are generally formal in style, as opposed to the colloquial style of Twitter messages, and as a consequence the translation performance of systems like GNMT is unreliable on tweets. However, as already stated, it is improving and in some cases, like GNMT, freely available. To deal with rare words, GNMT employs word pieces as the lexical unit rather than actual words. This is quite useful for Twitter as it makes GNMT somewhat robust to certain types of errors like slight misspellings and enables better handling of rare words.
GNMT internally employs a hybrid architecture that combines Transformers [15] and Recurrent auto encoder-decoder neural networks. Transformer architectures like BERT (Bidirectional Encoder Representations from Transformers) have been employed to obtain neural representations of words in both monolingual and multilingual use cases and currently account for the state of the art results on several tasks such as translation, classification and named entity recognition [16,17]. BERT internally employs a vocabulary of byte pair encodings.

Materials and methods
We can summarise the steps taken for our experiments as an activity pipeline. The pipeline is summarised in Fig. 1.

Corpus generation
The first step is the creation of the corpus. We obtain tweets from a basic Twitter account using specific keywords, via a Python script that accesses Twitter's Streaming API through the Python tweepy package. The tweets we download are those that are marked as public, which is the default security level; tweets are only marked private if expressly indicated by users. We employ simple keyword filters to extract the desired tweets. For the training data we employ a data set of 13,004 English tweets that mention the flu, common cold or Listeria.
For the test data we extract tweets that mention the flu in French, German, Spanish, Arabic and Japanese. We translate the tweets, eliminate those that are incomprehensible and then annotate the remainder. At annotation we label as positive those tweets that mention a recent (less than a month old) or ongoing case of disease. Within this period we expect the disease to still be within its communicable period, which is the period within which new infections are still possible as a result of transmission from sick individuals to susceptible healthy individuals. We treat all other mentions as irrelevant and label these messages as negative.
In addition, we eliminate duplicates by removing retweets (tweets with the "RT" tag) and by manually checking for duplicates that may not be marked as "RT". Finally, we remove all punctuation except the "#" and "@" symbols where they appear at the beginning of tokens, where they denote hashtags and users respectively. Table 1 below summarizes the composition of our corpus. We do not perform any preprocessing prior to translation. We found the translation API to be quite robust, but in many cases it returns an unspecified system error, and in a few cases the translation contains only inconsequential elements like URLs or is incomprehensible, leaving us unable to competently annotate the tweet. We remove these tweets, in addition to tweets whose contents are repeated in several other tweets and those that have been retweeted. For this reason fewer tweets are retained after translation than are contained in the corresponding datasets. The yield column in Table 1 is the percentage of tweets successfully translated and retained.
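The cleaning rules described above can be illustrated with a minimal Python sketch. It is not the script used for the experiments: the function names are hypothetical, retweets are detected simply by a standalone "RT" token, and duplicate detection is reduced to a case-insensitive exact match on the cleaned text, whereas the procedure described above also relies on a manual check.

```python
import re

def is_retweet(text):
    """Treat a tweet as a retweet if it contains "RT" as a standalone token (assumption)."""
    return re.search(r"\bRT\b", text) is not None

def strip_punctuation(text):
    """Remove punctuation, keeping '#' and '@' only where they start a token."""
    cleaned = []
    for token in text.split():
        prefix = token[0] if token[0] in "#@" else ""
        # Drop every non-word character, including any '#' or '@' inside the token.
        body = re.sub(r"[^\w\s]", "", token)
        if body:
            cleaned.append(prefix + body)
    return " ".join(cleaned)

def deduplicate(tweets):
    """Drop retweets and repeated messages, preserving the original order."""
    seen = set()
    kept = []
    for text in tweets:
        if is_retweet(text):
            continue
        cleaned = strip_punctuation(text)
        key = cleaned.lower()  # exact-match duplicate key (assumption)
        if key and key not in seen:
            seen.add(key)
            kept.append(cleaned)
    return kept

tweets = [
    "RT @bob: I have the flu!",
    "Home sick with the #flu, again...",
    "home sick with the #flu again",
    "@anna hope you feel better!!",
]
print(deduplicate(tweets))
```

Near-duplicates that differ by more than case and punctuation would pass this filter, which is why a manual check remains useful.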

Pre-processing
The next step is pre-processing. We start with tokenization, followed by part of speech tagging. For part of speech tagging we employ the GATE (General Architecture for Text Engineering) TwitIE tagger application [18]. The tagger uses the Penn Treebank tag set [19] plus three additional tags, "HT", "USR" and "URL", corresponding to the Twitter-specific phenomena of hashtags, users and URLs respectively. We also run these experiments both with and without stemming. Stemming further reduces lexical diversity by reducing different forms of the same word to a single stem.

Feature generation
We try out four different input representations: one-hot word-level representations, one-hot concept-level representations, distributed representations over words and distributed representations over concepts. For the conceptual representations we employ two different ontologies: the ontology previously developed by Magumba et al. [20] and SNOMED-CT (Systematized Nomenclature of Medicine-Clinical Terms) [21]. Whereas several other ontologies exist, such as OBO (Open Biomedical Ontologies) [22] and the BioCaster ontology [8], we elect to employ these two because they have a broader conceptual coverage, which we consider an advantage in a general language modeling task. For instance, BioCaster relies on partially defined word lists that contain terms such as "human ehrlichiosis" and "enzootic bovine leukosis", which are generally too technical to appear in casual texts like Twitter posts with any significant regularity. The Magumba et al. [20] ontology, on the other hand, was created specifically for Twitter disease event detection, and SNOMED-CT was created specifically to harmonise divergent medical terminology and is marketed by SNOMED International as the most comprehensive and precise clinical health terminology product in the world.
For the ontology-based input representations we transform each tweet into a vector of features as follows. Firstly, we flatten out our ontology into a list of its constituent concepts. For the Magumba et al. [20] ontology each concept is associated with a group of words or tokens referred to as the concept dictionary. Each concept is effectively a list of words and the full ontology is basically a list of lists. In this sense it is a heavily redacted English dictionary containing only words considered to be of epidemiological relevance. To obtain the feature vector we simply tokenize each tweet and for each token we do a dictionary look-up in the flattened ontology.
If the token exists in the ontology, we simply replace it with the concept in which it occurs. As an example, the sentence "I have never had the flu" is encoded as "SELF_REF HAVE FREQUENCY HAVE OOV OOV". "SELF_REF" refers to "Self references", which is the concept class for terms that persons use to refer to themselves, such as "I", "We" and "Us", used as indicators of speaking in the first person. "HAVE" is the concept class for "have" or "had"; it is a special concept class because the verb "to have" is conceptually ambiguous, as it can legitimately indicate two senses, namely falling sick or possession. The "FREQUENCY" term refers to the reference-to-frequency concept, which denotes temporal periodicity. The "OOV" terms at the end of the CNF representation stand for "Out of Vocabulary". The current version of this ontology has 136 concepts corresponding to 1531 tokens versus a vocabulary of about 59,000 tokens for our full corpus (or several billion words in English). Needless to say, most words are out of vocabulary. To obtain the final so-called CNF representation, the "OOV" terms are replaced with their part of speech tags; our previous example, "I have never had the flu", therefore becomes "SELF_REF HAVE FREQUENCY HAVE DT NN". Figure 2 below depicts the transformations for the message "I have never had the flu!".
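The look-up and part of speech fallback can be sketched in a few lines of Python. The concept dictionary below is a hypothetical three-concept fragment built only from the words mentioned in the example, not the actual 136-concept ontology of [20], and the part of speech tags are supplied by hand rather than produced by the tagger.

```python
# Toy concept dictionary (hypothetical fragment, not the actual ontology of [20]).
CONCEPT_DICTIONARY = {
    "SELF_REF": {"i", "we", "us"},
    "HAVE": {"have", "had"},
    "FREQUENCY": {"never"},
}

# Flatten the ontology into a single token -> concept look-up table.
TOKEN_TO_CONCEPT = {
    token: concept
    for concept, tokens in CONCEPT_DICTIONARY.items()
    for token in tokens
}

def encode_cnf(tagged_tokens):
    """Replace each token with its concept, or with its POS tag if it is out of vocabulary."""
    return [
        TOKEN_TO_CONCEPT.get(token.lower(), pos_tag)
        for token, pos_tag in tagged_tokens
    ]

# (token, POS tag) pairs; the tags are written by hand for this illustration.
tagged = [
    ("I", "PRP"), ("have", "VBP"), ("never", "RB"),
    ("had", "VBN"), ("the", "DT"), ("flu", "NN"),
]
print(" ".join(encode_cnf(tagged)))  # SELF_REF HAVE FREQUENCY HAVE DT NN
```

Because the out-of-vocabulary fallback is the tag itself, the encoded sequence never contains a raw word, which keeps the effective vocabulary small.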
For SNOMED-CT the ontology is organized differently: each concept has a corresponding description, but there is no notion of a concept dictionary. For the SNOMED-CT experiments we employ an SQLite3 implementation; for each word we search for the concept in whose description it appears using the FTS4 (Full-Text Search) engine. We deal with out-of-vocabulary tokens in the same way, by replacing them with their part of speech tags. For both ontology representations the input may take the form of a simple one-hot vector or of a distributed representation, obtained by applying neural embeddings over concepts in the same way they are applied to words.
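A first-match look-up of this kind might look as follows with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table name, the two-column schema and the three rows are hypothetical toy data rather than the SNOMED-CT release format, and it assumes an SQLite build with FTS4 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per description, stored with the concept it belongs to.
conn.execute("CREATE VIRTUAL TABLE descriptions USING fts4(concept, term)")
conn.executemany(
    "INSERT INTO descriptions (concept, term) VALUES (?, ?)",
    [
        ("Stiff-man syndrome", "Stiff-man syndrome (disorder)"),
        ("Person", "Man (person)"),
        ("Influenza", "Influenza (disorder)"),
    ],
)

def lookup_concept(token, pos_tag):
    """Return the first concept whose description matches the token, else the POS tag."""
    if not token.isalnum():
        return pos_tag  # avoid passing punctuation to the full-text query parser
    row = conn.execute(
        "SELECT concept FROM descriptions WHERE term MATCH ? ORDER BY rowid LIMIT 1",
        (token.lower(),),
    ).fetchone()
    return row[0] if row else pos_tag

print(lookup_concept("Man", "NN"))  # first match wins, even if it is not the preferred sense
print(lookup_concept("flu", "NN"))  # no description contains "flu", so the tag is returned
```

Returning the first row is the simplest policy, but as the toy data shows, it can pick a disorder over the more general concept when both descriptions contain the token.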

Word2vec/ Doc2Vec settings
For training distributed embeddings we use the word2vec/doc2vec model by Mikolov et al. [23,24]. For the ontology-based CNF word2vec/doc2vec model we find optimal performance with a 200-dimensional representation, 8 noise words, a context window of 5, 20 training epochs, the distributed memory architecture and a minimum word count of 2. For the word-level word2vec model we employ Google's gold standard 300-dimensional pre-trained model, which covers 3 million words.

Experimental setup
We try out a variety of models, including deep neural networks using Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. We specify a maximum message length of 20 tokens for [25]. We first pass the embedding to a dropout layer and then employ three filters of widths 3, 5 and 7, each with a length equal to the concept embedding length of 200, rectified linear unit (ReLU) activations and a stride of 1. The output from the feature maps is passed through a max-pooling layer and the full output is concatenated into a final feature vector, which is fed to a dropout layer that in turn feeds into a sigmoid output layer with one neuron. The model architecture is depicted in Fig. 3. The architecture is similar for the word-level experiment that employs Google's gold standard model, except that, since it is a 300-dimensional representation, the input shape is 20 × 300 instead.
For the CNN-LSTM-CNF model, which stacks an RNN on top of a CNN, we employ a single filter of width 3 and a stride of 1, then apply max-pooling with a pool width of 2 to the resulting feature maps. We then concatenate the output into a single feature vector which we pass into an LSTM layer that feeds directly to a sigmoid output layer containing one neuron. Figure 4 below depicts the model architecture.
For the stack of two LSTMs-CNF model, which stacks one RNN layer on top of another, we first apply dropout to the input array and then feed the output to an LSTM layer 200 neurons wide for the CNF model, followed by a dropout layer, then another LSTM layer followed by another dropout layer, which finally feeds into a sigmoid output layer with a single neuron. Figure 5 depicts the architecture of the stack of two LSTMs model for [20].
For the Bi-directional LSTM-CNF model we first apply dropout to the input layer and then feed the result into two parallel stacks of LSTM layers. The output is ordered front to back in the second stack. Each stack comprises a first layer 200 neurons wide whose output is fed to a dropout layer, which then feeds its output to a second LSTM layer. At this point the output from both stacks is concatenated and fed into a single dropout layer and then into a final sigmoid output layer with a single neuron. Figure 6 below depicts the Bi-directional LSTM model for the input sentence "SICK WITH THE FLU!". For all deep learning experiments we find the optimal performance after a small number of iterations. We use five epochs for model training; beyond this the models tend to overfit due to the small size of the effective vocabulary. All coding is done in Python, and for the neural models we employ the Python Keras package. For all CNN and LSTM models the dropout layers use a dropout probability of 0.25.
For word2vec and Doc2Vec vectors we employ the Python gensim package [26]. For the unigram bag of words models we use scikit-learn's SGD classifier, which is a support vector machine classifier trained with a stochastic gradient descent procedure. We also employ the Python scikit-learn package [27] for the logistic regression classifier. For the logistic regression and SGD models we use the scikit-learn hyperparameter defaults, except that we employ 10,000 iterations for the SGD model.
For the BERT experiments we employ two pretrained multilingual BERT models, namely bert-base-multilingual-uncased and bert-base-multilingual-cased, from the HuggingFace library [28]. The BERT layers are then fed into the same CNN architecture employed for the word2vec embeddings above.

Results and discussion
We employ the precision, recall and F1 score as our base performance metrics. They are given by the following equations:
$$P = \frac{TP}{TP + FP} \tag{1}$$
$$R = \frac{TP}{TP + FN} \tag{2}$$
$$F_1 = \frac{2PR}{P + R} \tag{3}$$
The results are presented in Tables 2 and 3 below. For clarity we separate the results into two tables: the first covers one-hot vector encoded representations whilst the second covers distributed (continuous) neural embeddings. To measure the model performance we employ the un-weighted average performance in addition to the performance variance. The ideal situation is to have a high average performance coupled with a small performance variance. A small variance means the classifier's performance on different datasets is stable; a high variance means the performance characteristics of the approach vary greatly from language to language. To obtain a single overall measure we combine the average performance with the quantity 1 - Norm(Var), which we refer to as the "invariance". The quantity Norm(Var) is the "normalized variance", which is the performance variance expressed as a fraction of the maximal possible variance. Since the base performance metrics (precision, recall and F1 score) are bounded between 0 and 1, the variance also has an upper bound, which can be calculated from Popoviciu's inequality as
$$\mathrm{Var} \le \frac{(M - m)^2}{4}$$
where M is the largest possible value and m is the smallest possible value. In our case M = 1 and m = 0, so the maximal possible variance is 0.25.
This ensures that the invariance ranges from 0 to 1 and that our overall performance also ranges from 0 to 1. An overall score of 0 therefore implies either that the performance of the model for some metric is 0 for all datasets or that the performance variance is maximal, for instance where precisely half of the datasets have the maximal score for the metric and the other half have the minimal score. An overall score of 1, on the other hand, implies that the model obtains a perfect score for the given metric on all of the language datasets.
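As a numerical illustration of the invariance, the following Python sketch computes the un-weighted average, the normalized variance and the invariance for a list of per-language scores. The scores are made up for illustration. The sketch assumes the population variance, which is the form bounded by (M - m)^2 / 4, and it stops at the invariance rather than guessing how the average and invariance are combined into the overall score.

```python
from statistics import mean, pvariance

def max_variance(m=0.0, M=1.0):
    """Popoviciu's upper bound on the variance of values confined to [m, M]."""
    return (M - m) ** 2 / 4

def normalized_variance(scores, m=0.0, M=1.0):
    """Population variance expressed as a fraction of the maximal possible variance."""
    return pvariance(scores) / max_variance(m, M)

def invariance(scores, m=0.0, M=1.0):
    """1 - Norm(Var): 1 for perfectly stable scores, 0 for maximally unstable ones."""
    return 1.0 - normalized_variance(scores, m, M)

# Hypothetical per-language scores for a single metric.
stable = [0.6, 0.6, 0.6, 0.6]  # identical performance on every dataset
split = [1.0, 0.0, 1.0, 0.0]   # half maximal, half minimal

for name, scores in [("stable", stable), ("split", split)]:
    print(name, mean(scores), invariance(scores))
```

The second list reproduces the extreme case described above: half of the datasets at the maximal score and half at the minimal score give a normalized variance of 1 and hence an invariance of 0.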
We have also excluded the results of the validation dataset from the overall model performance conveyed in the aggregate score columns (last three columns); their inclusion does not provide any additional information since it does not change the order of aggregate model performance. The first row indicates the performance of the unigram bag of words baseline, whilst the remaining rows indicate the performance of different approaches to mitigating the performance divergence that may occur between the training data and real world data, mainly due to cross-lingual lexical divergence and sub-optimal translation.
Table 2 Results of experiments for different methods, representations and data sets with categorical one-hot vectors
We find only modest variations between the best performing approaches for different representations (indicated in bold). The highest average performance for categorical one-hot vector representations is obtained with the bigram SNOMED-CT model, with an un-weighted average precision, recall and F1 score of 0.59, 0.66 and 0.62 respectively. Its overall scores are 0.72, 0.77 and 0.75 for precision, recall and F1 score respectively. The best performing distributional model is the CNN + word2vec model, with an un-weighted average precision, recall and F1 score of 0.55, 0.72 and 0.63 respectively and an overall performance of 0.69, 0.81 and 0.75 for precision, recall and F1 score respectively. Crucially, their overall F1 score performance is tied at 0.75. In fact, across methods there are only very slight differences between the performance of different methods and representations. For instance, both best performing approaches are only 1.3% better than the CNF model in terms of overall performance by F1 score. In addition, they are only 4.2% better than the baseline model in terms of overall performance by F1 score.
So, there isn't really a huge performance pay-off for the additional technical complexity of using conceptual representations and deep neural approaches. We also score consistently higher recall than precision across all methods.
As to why we notice only very slight differences between different representations and very slight gains in performance over the unigram word-level baseline, it may be because we employ messages about the same disease, which results in smaller lexical differences. This would imply that the problem these approaches are designed to solve did not exist in the data as selected.
It is also noteworthy that the good performance of the SNOMED-CT model is somewhat counterintuitive, because SNOMED-CT is not organized for this sort of task. We employ an SQLite3 implementation, and in order to obtain the concept referred to by a token we rely on a full-text search via the FTS4 (Full-Text Search) engine. SNOMED places these terms in the description of the concept. As an example, a search for the terms "Man" and "Woman" produces the truncated output depicted in Figs. 7 and 8 respectively. As can be seen from Figs. 7 and 8, several candidate concepts are returned. We would prefer both terms to map to the "person" concept, but there is no way of automatically enforcing this in our setup. In our case we simply return the first match, which means that "Man" and "Woman" get mapped to different concepts. Moreover, these are not the preferred meanings, as "Man" and "Woman" return "Stiff-man syndrome" and "Achard-Thiers syndrome" respectively, which are disorders. From a text processing point of view this is no better than a word-level model, as the two words are mapped to different atoms: for the model, having encountered "Man" yields no useful information on "Woman". Given this, the good performance of the SNOMED-CT based conceptual representation relative to the baseline is hard to explain. It could be a result of its very wide coverage: even though the mappings are frequently meaningless, they are consistent enough to normalize the text somewhat effectively, since most words are mapped to some concept (even though it may be the wrong one) and there are fewer out-of-vocabulary terms than with the CNF representation. The SNOMED-CT encoder returns a result 98.28% of the time versus 90% of the time for the CNF encoder; it therefore effectively acts as a sort of text normalizer.
Finally, it is evident that we obtain poorer performance on languages that are more distant from the model language: results are consistently poorer on the Japanese and Arabic datasets, Japanese and Arabic being from the Japonic and Afro-Asiatic language families respectively, as opposed to French, German and Spanish which, like English, are from the Indo-European family. For the BERT models we report results only for the bert-base-multilingual-uncased model because the results for the bert-base-multilingual-cased model are basically identical.

Conclusion and future work
The results are promising, particularly for languages closely related to the model language. The performance is significantly stronger on languages in the Indo-European language family, to which English, the model language, belongs. This is probably due to a greater abundance of parallel datasets for model development for closely related languages than for more distant languages like Arabic. For a globally deployed live implementation a divide and rule strategy could be applied in which different models are created for different groups of languages. This would still be far cheaper than creating language-specific models for the thousands of languages that exist. To get a mathematical impression of the significance of this work we can take the example of a nowcasting application. As already stated, nowcasting approaches that rely on textual web data typically model disease intensity as a function of keyword volume. If we consider the Spanish dataset, for which we obtain the highest performance and in which 67% of tweets are positive, a nowcasting model would have an input error of 50% without message classification. That is, if the nowcasting model assumed that all messages containing the keyword "flu" were relevant, it would incorrectly assume disease activity to be 50% more intense than it actually is. Ideally the model should only accept those messages which report actual occurrences of disease; in actuality it will accept any messages labeled as relevant. The ideal and actual model inputs and the model input error are given by the expressions below:
$$\text{Ideal input} = TP + FN$$
$$\text{Actual input} = TP + FP$$
$$\text{Input error} = \frac{\text{Actual input} - \text{Ideal input}}{\text{Ideal input}} = \frac{FP - FN}{TP + FN} \tag{8}$$
where TP, FP, FN represent True Positives, False Positives and False Negatives respectively.
Any precision performance less than 1 means that false positives exist, which increases the model input error; any recall performance less than 1 means that false negatives exist, which reduces the model input error. In general, a negative model error implies that the precision is better than the recall, and a positive error that the recall is better than the precision. Crucially, by combining Eq. 8 with Eqs. 1 and 2 we get the following equivalent expression for the error rate:
$$\text{Input error} = \frac{TP + FP}{TP + FN} - 1 = \frac{R}{P} - 1 = \frac{R - P}{P}$$
where P, R, TP, FP, FN represent Precision, Recall, True Positives, False Positives and False Negatives respectively.
Therefore the error rate is proportional to the difference between the recall and the precision. This is because any false negatives are offset by false positives. Consequently, if the precision and recall are equal, as with the Spanish dataset using the unigram/bigram CNF model (in bold italics), then, assuming these errors are not geo-spatially localized (for example if they are a product of phenomena like dialect differences), the error would effectively be zero.
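The relationship between the input error and the precision and recall can be checked numerically. The sketch below takes the ideal input to be all truly relevant messages (TP + FN) and the actual input to be all messages labeled relevant (TP + FP), and measures the error relative to the ideal input; this normalization is an assumption, chosen because it reproduces the roughly 50% error quoted for the Spanish dataset without classification. The counts are illustrative, not taken from the results tables.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def input_error_from_counts(tp, fp, fn):
    """Relative difference between the actual and ideal model input."""
    ideal = tp + fn   # messages that truly report disease occurrence
    actual = tp + fp  # messages the classifier labels as relevant
    return (actual - ideal) / ideal

def input_error_from_scores(p, r):
    """The same error expressed through precision and recall: (R - P) / P."""
    return (r - p) / p

# No classification: per 100 Spanish tweets, all 67 relevant and 33 irrelevant are accepted.
tp, fp, fn = 67, 33, 0
print(input_error_from_counts(tp, fp, fn))
print(input_error_from_scores(precision(tp, fp), recall(tp, fn)))

# A balanced classifier (precision equal to recall): false positives offset false negatives.
print(input_error_from_counts(50, 10, 10))
```

Both formulations return the same value for the unclassified case, and the balanced classifier yields zero error even though it misclassifies 20 messages.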
As a consequence, for end-to-end surveillance performance it is not even necessary to have a perfect classifier at the linguistic step, but rather one which has a balanced performance in terms of precision and recall. In this respect the bigram-SNOMED model is doubly desirable, as it not only has the highest overall performance but also the smallest difference between recall and precision overall. Where a balanced precision and recall performance is not achievable, a higher precision is preferred in order to avoid false positives. Ultimately, reliable performance (i.e. performance with a low variance as the data changes) is preferred in order to avoid unreliable results. The raw performance may not be that important as long as the performance is predictable and can therefore be adjusted for in the surveillance models.