Task-agnostic representation learning of multimodal Twitter data for downstream applications

Twitter is a frequent target for machine learning research and applications. Many problems, such as sentiment analysis, image tagging, and location prediction have been studied on Twitter data. Much of the prior work that addresses these problems within the context of Twitter focuses on a subset of the types of data available, e.g. only text, or text and image. However, a tweet can have several additional components, such as the location and the author, that can also provide useful information for machine learning tasks. In this work, we explore the problem of jointly modeling several tweet components in a common embedding space via task-agnostic representation learning, which can then be used to tackle various machine learning applications. To address this problem, we propose a deep neural network framework that combines text, image, and graph representations to learn joint embeddings for 5 tweet components: body, hashtags, images, user, and location. In our experiments, we use a large dataset of tweets to learn a joint embedding model and use it in multiple tasks to evaluate its performance vs. state-of-the-art baselines specific to each task. Our results show that our proposed generic method has similar or superior performance to specialized application-specific approaches, including accuracy of 52.43% vs. 48.88% for location prediction and recall of up to 15.93% vs. 12.12% for hashtag recommendation.


Introduction
Twitter produces a wealth of information for analysis of trends, opinions, and interactions with 500 million tweets per day generated by its users [1]. As such, the microblogging service is a popular target for research involving machine learning. Several problem settings in the field of machine learning, or variants thereof, can focus on Twitter data. For example, sentiment analysis, spam detection, and location prediction, all well-established problem settings in their own right, are all applicable to tweets [2][3][4]. Much of the work in applying machine learning to Twitter data focuses on only one component, or a few components, of a tweet. A sentiment analysis model [5] might only use the text of a tweet, while a hashtag recommendation system [6] might use both text and image. A tweet can contain additional components that may be informative to a machine learning model. Examples of these components include the author of the tweet, whose interactions with other users can be modeled by a graph [7], and the location, which can link the tweet to others at the same location [8]. Incorporating several of these tweet components into a machine learning framework can potentially create a model that is better informed than others at a given task. One approach to accomplishing this is by creating a joint embedding framework.
Joint embeddings [9] are used in several machine learning tasks that handle different modalities, e.g. text and images, in order to leverage the relationship between them. The intuition behind a joint embedding space is that inputs from different modalities mapped into the space should be close if they are semantically related. For example, in the problem of image-text retrieval [10], the image captions or tags closest to an image in an embedding space should be those that best describe the visual content of the image. However, these models are generally limited to 2 or 3 modalities (components), like the aforementioned text and image. Introducing additional modalities has two potential benefits. First, it can better inform existing applications by taking these additional modalities into account. In the image-text retrieval task, also considering the author and the location of a post from an image sharing or microblogging service might achieve better results. Second, introducing additional modalities can open up a joint embedding space to new applications. Recent work in hashtag recommendation uses the text and image of a social media post as input to a neural network model [11][12][13], but a joint embedding model that includes hashtags as well as text, images, and other modalities might also perform well at this task.
Our motivation for this paper is thus to address the problem of creating a joint embedding framework that incorporates several tweet components and using an embedding model trained with such a framework to address multiple machine learning problems involving Twitter data. Specifically, the first question we ask in this paper, within the context of Twitter, is: can additional modalities in a joint embedding space improve its performance in typical applications and/or enable it to perform well in new ones? We show in our experiments that additional modalities can indeed allow a joint embedding model to perform better than a similar model with fewer modalities, and can also allow it to perform well at new tasks. The second question is: can we build a single, task-agnostic joint embedding model for tweets and use it in several diverse applications? Our results show that a single trained joint embedding model can be successfully applied to multiple different tasks.

Overview of the proposed approach
Our approach builds on VSE++, the framework proposed by Faghri et al. [14]. We do so by extending the framework to incorporate 3 more tweet components in addition to text and images: hashtags, considered separately from tweet text; the author, as represented by a graph embedding [15] learned from a graph of Twitter user mentions; and location, to represent the context of a tweet in terms of what other Twitter users are discussing at the same place. Extending VSE++ is not trivial, as adding 3 additional modalities complicates training. The loss function involved in training the model must account for how one modality interacts with four others rather than just one. We accomplish this with a novel approach that learns representations for each component using triplet loss [16] calculated using an embedding for one type of tweet component and the average embedding across all components of a tweet. Like VSE++, this loss incorporates hard negatives, which have been shown to be effective in several tasks [17][18][19][20].
Our proposed model is applicable to several tasks, including:
• Image retrieval/text retrieval: Given a tweet t without an image (or text), retrieve text (or an image) relevant to t.
• Hashtag recommendation: Given a tweet t without hashtags, predict one or more hashtags relevant to t.
• Bot detection: Given a user u and several tweets written by u, predict whether or not u is a bot, i.e. an automated Twitter account.
• Location prediction: Given a tweet t without geolocation data, predict where the author of t was when it was posted.
In our experiments, we show the performance of our proposed model compared to baselines from these domains using Twitter data.
Contributions: We make the following contributions.
• We introduce the problem of developing a task-agnostic representation learning framework for tweets that incorporates several tweet components. To our knowledge, this is the first work to address this problem.
• We develop a novel framework with pairwise ranking loss to learn a robust joint embedding with 5 tweet components.
• We showcase the usefulness of additional tweet components by applying the learned embeddings to different tasks and comparing their performance to task-specific baselines.
The remainder of the paper is organized as follows: "Related work" discusses prior work. "Approach" describes our proposed method. "Experiments" explains the experimental evaluation of our proposed method and presents the results of our experiments. We conclude in "Conclusion".

Joint embedding
Joint embedding models have been proposed for image-text retrieval [14][21][22][23], video-sentence retrieval [24][25][26][27][28], video-paragraph retrieval [29,30], temporal localization of moments [31][32][33], and a variety of other tasks [34][35][36][37]. The general idea behind a joint embedding model is to place vector representations of different media, such as text and images, into the same embedding space such that the distance between semantically similar vectors (e.g. an image and its captions or tags) is minimized. For the image-text retrieval task, Faghri et al. [14] projected images and text in a visual-semantic embedding space learned with a loss function that utilizes hard negatives. In [21], Mithun et al. used images and noisy text from the Web to improve a joint embedding model. Lee et al. [23] captured the fine-grained interplay between objects present in an image and text to better align images and text in a joint embedding space. For the task of video-sentence retrieval, Mithun et al. [24] employed multimodal cues such as image, motion, and audio for video encoding. In [25], Dong et al. used multi-level encodings for video and text to perform zero-example video retrieval. Wray et al. [26] enriched embedding learning by disentangling the parts-of-speech of captions. For the temporal localization task, moment-sentence pairs [31,33] or clip-sentence pairs [32] are aligned in the joint embedding space.

Machine learning on Twitter data
Several studies have created machine learning methods specifically for use with Twitter data [38][39][40]. However, these works typically focus on text instead of also including other parts of a tweet. Other work incorporates additional tweet data, e.g. images and tweet author metadata, to accomplish specific tasks such as hashtag recommendation, bot detection, and location prediction. We explore some of these studies below.

Hashtag recommendation
Many studies have focused on hashtag recommendation for Twitter and other microblogging platforms. Rawat and Kankanhalli [11] proposed a deep neural framework to recommend descriptive tags for an image that combined image features from a convolutional neural network (CNN) [41] with features from a "ContextNet" neural network using the image's associated location and time data as input to provide context for the image. Their proposed method outperformed a model that considered only image data. Zhang et al. [12] used an image's features from a VGG network [42] and text features from a long short-term memory (LSTM) network [43] representing the image's caption as input to a co-attention mechanism [44] to recommend hashtags. They compared their proposed framework to state-of-the-art methods that used only the image caption and found that their method performed the best. The method proposed by Ma et al. [13] uses a framework similar to [12] to recommend hashtags, but it also incorporates images and text associated with previous uses of a candidate hashtag to achieve better performance than state-of-the-art methods.

Bot detection
Bot detection on Twitter is another area of active research. Botometer (formerly BotOrNot), originally proposed by Davis et al. [45], extracts over 1000 features from a Twitter user's metadata, interaction patterns, and content, then uses them as input to a random forest classifier [46] to predict the likelihood of that user being a bot. Botometer's most recent version, v4 [47], is an ensemble classifier that combines Botometer v3 [48] with random forest classifiers that are each trained on a specific class of Twitter bot. This method outperformed baselines on several datasets. Another approach by Kudugunta and Ferrara [49] uses a deep learning method to determine whether an individual tweet was made by a bot, rather than the more conventional approach of determining whether or not a Twitter user is a bot. It uses word embeddings from tweet text as input to an LSTM combined with tweet metadata to make its predictions and achieved high accuracy in testing compared to baselines.

Location prediction
Previous work has examined location prediction for tweets. Matsuo et al. [50] combined a text-based location estimator and a CNN for image features to perform grid-based location prediction and showed that combining image and text outperformed using only one modality. Kumar and Nezhurina [51] proposed a method to predict the location of a Twitter user's next tweet based on the user's past tweets. Their method used tweet geo-coordinates, mentions of predefined location categories, and tweet personality traits as input to an ensemble of popular classification methods, which outperformed individual classifiers used in the ensemble. The method proposed by Lau et al., Deepgeo [52], takes a tweet's text and creation time as well as the author's UTC offset, time zone, location (i.e. the free text location listed in a user's profile), and account creation time as input to a deep learning framework to predict location as a city class, i.e. the city in which a tweet was written. The authors of [52] found that Deepgeo outperformed the state-of-the-art in this task.

Big Data analysis on Twitter
As one of the largest social media platforms, Twitter presents a prime target for studies involving analysis of big data. Linell et al. [53] estimated the prevalence of sleep loss incurred by the beginning of Daylight Saving Time (DST) by analyzing a dataset of 13.1 million tweets. They found that the beginning of DST causes changes in sleep behavior as indicated by a change in the time of peak Twitter activity. Feizollah et al. [54] analyzed tweets related to halal tourism by identifying related topics and performing sentiment analysis. They found that the word "halal" was primarily associated with food and hotels on Twitter, that many non-Muslim countries are popular halal tourist destinations, and that the majority of tweets expressed positive sentiment. Piña-García and Ramírez-Ramírez [55] examined data from Twitter and other sources to predict the most frequent crimes in Mexico City. Although they showed that their methods can successfully estimate the occurrences of crime, they note that more work is necessary to develop an accurate model. Our proposed method can support future big data studies similar to these, as researchers can use it to analyze Twitter data they have collected according to their goals.

Approach
In this section, we describe our proposed model to represent tweet components in an embedding space. We first provide an overview of the VSE++ framework on which our proposed method is based ("Description of VSE++ framework"). Next, we describe the structure of the neural network framework and how we represent tweet components ("Network structure and input features"). Then, we present our approach to training a joint embedding model with this framework using pairwise ranking loss ("Training joint embedding").

Description of VSE++ framework
VSE++, proposed by Faghri et al. [14], is a joint embedding framework designed for image-text retrieval. Given a dataset of images and captions, a trained VSE++ model attempts to pair each image with its corresponding caption. It does so by projecting them in a joint embedding space; image embeddings are generated via a ResNet [56] or VGG [42] model, while the embeddings for image captions are generated by a gated recurrent unit (GRU)-based text encoder, which is a commonly-used method to represent sentences [21]. The most notable contribution of [14] is the incorporation of hard negatives during training. In the context of image-text retrieval, a hard negative is the closest non-matching caption to an image (or vice-versa). More generally, a hard negative is the closest negative (i.e. non-matching embedding) to a training query. A loss function incorporating hard negatives will assign higher loss to a query with a negative that is particularly close to the query compared to a more conventional "sum of hinges" loss function that would be more influenced by the average distance of the negatives. Faghri et al. explain that one advantage of this approach is avoiding local minima created by the sum of hinges loss, leading to a better performing model. Their loss function for an image i and an image caption c is thus

$$\ell(i, c) = [\alpha - s(i, c) + s(i, \hat{c})]_+ + [\alpha - s(i, c) + s(\hat{i}, c)]_+ \tag{1}$$

where s(·,·) is the similarity between two embeddings, [x]_+ = max(x, 0), ĉ is the hardest negative caption with respect to i, î is the hardest negative image with respect to c, and α is the margin parameter. Experimental evaluation with the Microsoft COCO [57] and Flickr30K [58] datasets showed that this approach was superior to other image-text retrieval methods. VSE++ is written in Python and uses the PyTorch deep learning library [59]. Using this as our basis, we extend the concepts presented by Faghri et al. to account for additional modalities present in tweets.
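To make the role of hard negatives concrete, the following is a minimal sketch of a max-of-hinges loss over a minibatch. It is written in plain Python rather than PyTorch and is not the VSE++ code; the function name `max_hinge_loss` and the similarity-matrix input are assumptions made for illustration.

```python
def max_hinge_loss(sim, margin=0.2):
    """Hardest-negative hinge loss summed over the matching pairs of a batch.

    sim[i][j] is the similarity between image i and caption j, so the
    matching (positive) pairs lie on the diagonal of the square matrix.
    """
    n = len(sim)
    total = 0.0
    for k in range(n):
        pos = sim[k][k]
        # Hardest negative caption for image k: largest off-diagonal entry in row k.
        hard_c = max((sim[k][j] for j in range(n) if j != k), default=None)
        # Hardest negative image for caption k: largest off-diagonal entry in column k.
        hard_i = max((sim[j][k] for j in range(n) if j != k), default=None)
        if hard_c is not None:
            total += max(0.0, margin - pos + hard_c)
        if hard_i is not None:
            total += max(0.0, margin - pos + hard_i)
    return total
```

Only the single closest negative in each direction contributes, so one very confusable caption dominates the loss for that image instead of being averaged away.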

Network architecture
We learn a joint embedding model using a deep neural network framework. Our framework, shown in Fig. 1, has 5 branches for tweet text, an image, hashtags, the location of the tweet, and the user who wrote the tweet. Each of these branches uses a different network. The goal of this design is for the individual branch networks to focus on component-specific features while the fully connected layers convert these to embeddings in the joint space with a dimensionality of 1024.

Text representation
For encoding tweet text (i.e. the text of the tweet without hashtags), we use an embedding layer with weights initialized with word embeddings from a fastText [60] model trained on Twitter data. The dimensionality of this layer is 300. The word embeddings are then input to a GRU. The GRU maps the text features to the joint embedding space.

Image representation
To encode an image contained in a tweet, we use a 152-layer ResNet model [56] trained on the ImageNet dataset [61]. The dimensionality of the image embedding is 2048; this is mapped to the joint space via a fully connected layer. Similar frameworks have also evaluated a 19-layer VGG model [42] as an alternative; however, ResNet has been shown to perform better, at least for the task of image-text retrieval [14,21], so we limit our experiments to the ResNet model.

Hashtag representation
To represent hashtags, which are a form of metadata used to denote keywords or topics within tweet text [13], we first separate hashtags from the text of the tweet. We then average over the fastText word embeddings of the extracted hashtags and map this to the joint space with a fully connected layer.
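The hashtag branch's input can be sketched as follows. This is an illustration only: the regular expression used to separate hashtags and the dictionary `vectors`, which stands in for a trained fastText model, are assumptions. A real fastText model can also produce vectors for out-of-vocabulary tokens, which this lookup table cannot.

```python
import re

HASHTAG = re.compile(r"#(\w+)")

def split_hashtags(text):
    """Return (body_without_hashtags, list_of_hashtags) for a raw tweet string."""
    hashtags = HASHTAG.findall(text)
    # Remove the hashtags and collapse the whitespace they leave behind.
    body = " ".join(HASHTAG.sub(" ", text).split())
    return body, hashtags

def average_embedding(tokens, vectors):
    """Mean of the vectors of the tokens found in `vectors`; None if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    return [sum(col) / len(found) for col in zip(*found)]
```

The averaged vector would then be passed to the fully connected layer that maps it to the joint space.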

Location representation
To emphasize the context of a tweet, we consider location in terms of tweets from the same place. This is a two-step process: first, we collect the text of "neighbor" tweets from the same place as an input tweet. For simplicity, we do this by grouping tweets from our collected Twitter data with the same Twitter-assigned place ID as the input tweet. The texts of these tweets are then encoded through a network branch identical to that of the input tweet's text, but this is followed by averaging the encodings of the texts.
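The grouping and averaging steps can be sketched as below. The field names `place_id` and `text` and the `encode_text` callable (a stand-in for the text branch) are hypothetical, and whether the input tweet itself counts among its own neighbors is not specified above; this sketch includes it.

```python
from collections import defaultdict

def location_embeddings(tweets, encode_text):
    """Average the text encodings of all tweets that share a place ID."""
    by_place = defaultdict(list)
    for tweet in tweets:
        by_place[tweet["place_id"]].append(encode_text(tweet["text"]))
    return {
        place: [sum(col) / len(vecs) for col in zip(*vecs)]
        for place, vecs in by_place.items()
    }
```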

User representation
We represent the author of a tweet by using a trained graph embedding model. Specifically, we use the fastnode2vec [62] implementation of node2vec [63]. The weighted graph used to train this model was constructed with users as vertices and mentions as edges, i.e. an edge (u, v, w) represents user u mentioning user v in w tweets. The dimensionality of the graph embedding is 300; this is mapped to the joint space via a fully connected layer.
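As an illustration of the graph construction, the weighted edge list can be built as sketched below. The input format, a list of (author, mentioned users) pairs, is an assumption, as is treating the edges as directed. The resulting (u, v, w) triples could then be handed to a node2vec implementation.

```python
from collections import Counter

def mention_edges(tweets):
    """Build (u, v, w) edges: user u mentions user v in w tweets.

    tweets: iterable of (author, list_of_mentioned_users) pairs.
    """
    counts = Counter()
    for author, mentions in tweets:
        # A user mentioned twice in one tweet still counts as one tweet.
        for mentioned in set(mentions):
            counts[(author, mentioned)] += 1
    return [(u, v, w) for (u, v), w in sorted(counts.items())]
```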

Training joint embedding
For a pair of dissimilar tweets t1 and t2, t1 should have embedding vectors for its components that are similar to each other, but are not similar to the embedding vectors of the components of t2. Conversely, t2 should have embedding vectors for its components that are similar to each other, but are not similar to the embedding vectors of the components of t1. With that intuition in mind, our goal is to learn a joint embedding characterized by the weights of the fully connected layers, the text and location word embedding layers, and the GRUs.
We base our approach on previous work that uses hinge-based bi-directional ranking loss for visual-semantic embeddings [14,21]. These approaches maximize the similarity between corresponding image and text embeddings and minimize similarity to non-matching embeddings. They also focus on hard negatives, i.e. given a pair (i, t) of image and text embedding vectors, the corresponding hard negatives are the image vector î ≠ i and the text vector t̂ ≠ t closest to t and i, respectively.
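Our approach must also account for hashtags, user, and location. To accomplish this, we first calculate the loss using each pair (c, a) in a minibatch, where c is the embedding for one component from a tweet (e.g. text or image) and a is the averaged tweet component embeddings from the same tweet. This can be written as follows:

$$\ell(c, a) = [\alpha - s(c, a) + s(c, \hat{a})]_+ + [\alpha - s(c, a) + s(\hat{c}, a)]_+ \tag{2}$$

where s(·,·) is the similarity between two embeddings, [x]_+ = max(x, 0), â is the hardest negative averaged embedding with respect to c, ĉ is the hardest negative component embedding with respect to a, and α is the margin parameter.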
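The pairing of each component embedding with its tweet's averaged embedding can be sketched numerically as follows. This is an illustration under stated assumptions, not the training code: it uses cosine similarity, assumes the per-component losses are summed over the minibatch, and represents each tweet as a hypothetical dictionary from component name to an embedding already projected into the joint space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def component_average_loss(batch, margin=0.2):
    """Hard-negative hinge loss between each component and the tweet average.

    batch: list of tweets, each a dict mapping component name -> embedding.
    """
    averages = [mean_vector(list(tweet.values())) for tweet in batch]
    total = 0.0
    for name in batch[0]:
        for k, tweet in enumerate(batch):
            others = [j for j in range(len(batch)) if j != k]
            if not others:
                continue
            pos = cosine(tweet[name], averages[k])
            # Closest averaged embedding belonging to a different tweet.
            hard_a = max(cosine(tweet[name], averages[j]) for j in others)
            # Closest same-type component belonging to a different tweet.
            hard_c = max(cosine(batch[j][name], averages[k]) for j in others)
            total += max(0.0, margin - pos + hard_a)
            total += max(0.0, margin - pos + hard_c)
    return total
```

Comparing each component against the average, rather than against every other component, keeps the number of pairs linear in the number of components.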

Experiments
Though the tweet component representations/embeddings generated by our framework are learned in a task-agnostic way, our experiments demonstrate the proposed model's effectiveness on several machine learning applications involving Twitter data by comparing the results of experiments versus baselines designed specifically for those applications. In each of these experiments, we generate tweet component embeddings from our trained model and use them in an application-specific framework as shown in Fig. 2. Note that the tweet component embedding model is only trained once, then used in all of the applications evaluated in our experiments.

Dataset
We trained our model on a dataset of tweets before applying the model to the machine learning tasks studied in our experiments. We created this dataset by collecting tweets from the Twitter Streaming API, which streams tweets in real time. To accomplish this, we used the Tweepy library for Python. We collected all tweets from the API within the bounds established by the API's rate limits. From these tweets, we filtered out tweets that do not contain all of the components necessary to generate embedding vectors with our proposed model (i.e. text, image, hashtags, geolocation data, and author ID). 100,000 tweets in the dataset from March 1-8, 2020 were used for training, while 5000 tweets from March 9, 2020 were used for validation. An additional 5000 tweets from March 10, 2020 were used for testing. The tweets used in our experiments were limited to those with a location that falls within a bounding box that encompasses most of North America.
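The filtering step can be illustrated with the sketch below. The field names and the flat dictionary layout are hypothetical and do not reflect the Twitter API's actual payload. The bounding box used for the dataset is not given here, so the caller supplies one.

```python
# Hypothetical field names for the five required tweet components.
REQUIRED_FIELDS = ("text", "image_url", "hashtags", "place_id", "author_id")

def in_bounding_box(lon, lat, box):
    """box = (west, south, east, north) in degrees."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north

def keep_tweet(tweet, box):
    """Keep a tweet only if every component is present and it lies inside box."""
    if not all(tweet.get(field) for field in REQUIRED_FIELDS):
        return False
    lon, lat = tweet.get("lon"), tweet.get("lat")
    if lon is None or lat is None:
        return False
    return in_bounding_box(lon, lat, box)
```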

Training details
The embedding networks in our model were trained with an Adam optimizer [64] over a total of 30 epochs. We set the initial learning rate to 0.0002 and decreased it by a factor of 10 after 15 epochs. We set the gradient L2 norm threshold for clipping gradients to 2, the margin to 0.2, and the mini-batch size to 128. The model was evaluated on the validation set every 500 training iterations. The trained model used for evaluation on the test data was selected based on the sum of recalls (recall @ 1, 5, and 10) on the validation set to mitigate overfitting. The fastText and node2vec models used in our proposed model were trained on tweets from March 1-7, 2020, which were collected from the Twitter Streaming API as described in "Dataset".
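The learning rate schedule and the checkpoint selection rule can be sketched as follows. The zero-based epoch index, the function names, and the format of the validation log are assumptions for illustration.

```python
def learning_rate(epoch, base_lr=0.0002, decay_epoch=15, factor=10):
    """Step schedule: divide base_lr by `factor` once `decay_epoch` epochs are done.

    `epoch` is zero-based, so epochs 0-14 use base_lr and epochs 15-29 the reduced rate.
    """
    return base_lr / factor if epoch >= decay_epoch else base_lr

def select_checkpoint(validation_log):
    """Return the ID of the checkpoint with the largest sum of validation recalls.

    validation_log: list of (checkpoint_id, {"r1": ..., "r5": ..., "r10": ...}).
    """
    return max(validation_log, key=lambda item: sum(item[1].values()))[0]
```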

Image/text retrieval
For image retrieval and text retrieval, we compare our proposed model to VSE++ [14], which was the basis for our method. We also include results from using hashtags as the text for VSE++ because many hashtags have significant descriptive information of their associated images [65], which the text of a tweet might not. In our image retrieval experiments, which are based on the experiments in [14], each model is given a test tweet minus the image and attempts to retrieve a relevant image from the test data. Similarly, our text retrieval experiments involve attempting to retrieve relevant tweet text given a test tweet minus its text. While retrieval with VSE++ is simply a matter of finding the most similar image to the input text (or vice versa), this becomes somewhat more complex with the additional embeddings in our proposed model's joint embedding space. To determine which image (or text) embedding to retrieve, we retrieve the embedding corresponding to the highest similarity score from any of the input tweet's component embeddings. This is shown with our proposed model in Fig. 3, where the image corresponding to the image embedding closest to the input tweet's component embeddings is shown. Our results are shown in Table 2 (image retrieval) and Table 3 (text retrieval), which show that our method has higher recall (@ 1, 5, and 10) and better median rank than VSE++ in both image retrieval and text retrieval of Twitter data.
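The retrieval rule and the two evaluation measures can be sketched as follows. The sketch assumes cosine similarity and that each query has exactly one correct candidate; the function names are hypothetical.

```python
import math
from statistics import median

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def rank_candidates(query_components, candidates):
    """Candidate indices sorted by their best similarity to any query component."""
    scores = [max(cosine(q, c) for q in query_components) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def evaluate(rankings, correct, ks=(1, 5, 10)):
    """Recall@k for each k in ks, and the median rank of the correct candidate."""
    ranks = [ranking.index(c) + 1 for ranking, c in zip(rankings, correct)]
    recalls = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recalls, median(ranks)
```

A lower median rank is better, since it means the correct item tends to appear nearer the top of the ranking.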

Hashtag recommendation
Our experiments on hashtag recommendation are performed in the context of recommending hashtags for images and their associated text. Our baseline for this is the Co-Attention model proposed in [12], which uses both the image and text from a tweet or a post on a photo sharing service such as Instagram. We trained the baseline on the same dataset used to train our model described in "Dataset".
To recommend hashtags for a test tweet t with our model, we use a nearest neighbor-style approach by calculating the embeddings of the components of t and scoring each training hashtag embedding h according to the maximum cosine similarity between h and the component embeddings of t. The recommendation of k hashtags for t is thus the top k hashtags according to these scores. This approach is illustrated in Fig. 4 for k = 3, in which the 3 hashtags from the training data corresponding to the 3 hashtag embeddings closest to any component embedding from the input tweet are returned. Figure 5 shows the average precision, recall, and F1 scores for k = 1, ..., 5 for both our method and the Co-Attention baseline. Our method performs better in all three measures for all values of k evaluated.
Fig. 4 Hashtag recommendation task. A tweet is given as input to the hashtag recommendation framework, which then outputs a number of recommended hashtags
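The nearest neighbor-style scoring and the per-tweet measures can be sketched as follows. The dictionary of training hashtag embeddings and the function names are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def recommend_hashtags(tweet_components, hashtag_embeddings, k=3):
    """Top-k hashtags by maximum cosine similarity to any tweet component."""
    scores = {
        tag: max(cosine(vec, comp) for comp in tweet_components)
        for tag, vec in hashtag_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def precision_recall_f1(recommended, relevant):
    """Precision, recall, and F1 of one recommendation list against the true hashtags."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```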

Bot detection
As the state-of-the-art for bot detection on Twitter, Botometer v4 [47] serves as our baseline in these experiments. Because we use the same datasets in our experiments as those used in [47], we compare the results of our method to those presented there. Specifically, we compare to their cross-domain experiments, which combine several annotated bot detection datasets [47][48][66][67][68][69][70][71][72][73] for a training dataset of 43,576 bots and 32,849 humans and a test dataset of 9432 bots and 8862 humans. These datasets consist of a Twitter user ID combined with a binary class label indicating whether the user is a bot or a human.
We evaluated two approaches using our trained tweet component embedding model. However, the input for each approach is similar. For each user u in the bot detection datasets, we first retrieved up to 200 tweets written by u. We represent u by up to n of the most recent of these tweets, where n is a hyperparameter of our bot detection frameworks. In our experiments, we use n = 10. The component embeddings of each of these tweets are then computed by our model and averaged; each instance is thus an n_u × m matrix composed of n_u ≤ n averaged component embeddings of length m of tweets from user u.
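Construction of the per-user input can be sketched as follows. The (timestamp, component embeddings) input format and the function name are hypothetical.

```python
def user_matrix(tweets, n=10):
    """Build the n_u x m matrix for one user from their most recent tweets.

    tweets: list of (timestamp, list_of_component_embeddings) pairs.
    Returns at most n rows, one averaged embedding per tweet, newest first.
    """
    recent = sorted(tweets, key=lambda t: t[0], reverse=True)[:n]
    return [
        [sum(col) / len(components) for col in zip(*components)]
        for _, components in recent
    ]
```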
The first approach uses tweet component embeddings in a CNN classifier. The network's structure is based on the text CNN proposed by Kim [74], except the input is matrices of averaged embedding vectors as described above rather than matrices of word embedding vectors. In addition to representing each user by up to n consecutive tweets, we also attempt to balance the training data by adding duplicates of users of the less frequent class (humans) with a different set of up to n tweets drawn from that user's retrieved set of tweets. Training the CNN is performed via 5-fold cross-validation of the training data. This approach is demonstrated in Fig. 6, which shows the conversion of a tweet's components to embedding vectors, which are then averaged. The user's averaged tweet embeddings are used as input to a CNN classifier, which predicts the user's class label.
Our second approach is a simpler k-nearest neighbors (kNN) method. For each user u, we average their embeddings into a single vector (i.e. the average of the averaged tweet component embeddings from u's tweets) and calculate the cosine similarity between u and the users in the training data. We collect the class labels (bot or human) of the users with the k highest similarity scores; the predicted label for u is thus the most common label among the k labels collected by this method. This approach is similar to the movie recommendation system proposed by Singh et al. [75], except we are using vectors to represent Twitter users rather than movies. Figure 6 illustrates our kNN-based approach, in which tweet components are converted to embedding vectors, which are then averaged into a single vector for each of that user's tweets. Those vectors are again averaged into a single vector representing the user, whose class label is predicted according to other nearby user vectors.
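The kNN step can be sketched as follows, assuming each user has already been reduced to a single vector as described above. The function names are hypothetical. With an even k, ties are resolved in favor of the label encountered first among the nearest neighbors, which is a choice of this sketch and not something specified above.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def knn_predict(user_vec, train_vecs, train_labels, k=1):
    """Most common label among the k training users most similar to user_vec."""
    scored = sorted(
        ((cosine(user_vec, vec), label) for vec, label in zip(train_vecs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```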
We further refined our approaches by using a validation dataset taken from the training data to tune some of their hyperparameters using a small set of values for each one. Based on these results, we set the CNN's filter window sizes to 2, 3, and 4, and number of feature maps to 300. For the kNN approach, we set k = 1.
The results of our bot detection experiments are shown in Fig. 7, which shows the F1 and AUC scores for the baseline and our proposed approaches for the combined test dataset as well as individual bot detection datasets. The baseline's performance is better for all datasets; however, our proposed methods come close in some cases.

Location prediction
We evaluate our proposed model on the task of location prediction in terms of cities, i.e. given a tweet without location data, predict the city in or near which that tweet was posted. As a baseline, we use Deepgeo [52], which combines information from a tweet as well as the author's profile in a deep learning framework. Notably, Deepgeo is designed for the classification setting of predicting a city class rather than predicting a location's latitude and longitude. To train and test this baseline, we used subsets of the datasets described in "Dataset" that include only tweets with a Twitter place ID corresponding to a city (i.e. we omitted tweets with a place ID corresponding to other place types such as administrative regions and points of interest) that is present in all datasets (training, validation, and test). This left a total of 41,849 tweets in the training data, 2841 tweets in the validation data, and 2594 tweets in the test data, with a set of 522 classes (cities) between them.
Our approach for using our proposed tweet embedding model uses concatenated embeddings, i.e. each tweet is represented by a single feature vector that contains each of the tweet's non-location component embeddings. Noting that Lau et al. [52] found that a user's location (i.e. the location listed in a user's profile) contributed substantially to their tweet location prediction model's accuracy, we also represent user location in the tweet's feature vector. We do so by using the user location text of each tweet's author as input to the text branch of our trained tweet component embedding model and concatenating the resulting embedding with its corresponding tweet's component embeddings. We then use these feature vectors and their corresponding class labels as input to a random forest classifier. We use the default hyperparameter values defined in the implementation in [76] with the exception of using "balanced_subsample" class weights. This setting calculates class weights such that they are inversely proportional to class frequencies for every decision tree's bootstrap sample. Balancing class weights in this manner accounts for any differences in class frequencies within the training data. Our tweet location prediction approach is summarized in Fig. 8, where the embeddings of a tweet's non-location components and its author's location text are computed and concatenated to a single feature vector, which is then passed to a random forest classifier to predict the tweet's location.
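The feature construction and the idea behind balanced class weights can be sketched as follows. The component names and function names are hypothetical. The weight formula, n_samples / (n_classes × class count), is the usual "balanced" heuristic; in the subsample variant it would be recomputed on each tree's bootstrap sample rather than once on the full training set as done here.

```python
from collections import Counter

def concat_features(component_embeddings, user_location_embedding):
    """Concatenate the non-location component embeddings with the embedding of
    the author's profile location text, in a fixed (alphabetical) component order."""
    features = []
    for name in sorted(component_embeddings):
        if name != "location":
            features.extend(component_embeddings[name])
    return features + list(user_location_embedding)

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency in `labels`."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}
```

For example, with three tweets from one city and one from another, the rarer city receives a weight three times larger.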
The results of our location prediction experiments are shown in Table 4. Our method achieves higher accuracy than the Deepgeo baseline.

Image/text retrieval
In our image and text retrieval tasks, we found that our method outperforms VSE++ in all metrics evaluated. This is particularly true for text retrieval, in which additional tweet components are more effective in retrieving text than image alone. However, we note that these findings do not extend to the more general problem of image-text retrieval, as a tweet's text and other non-image components may not be as semantically similar to its image as a caption written specifically for that image. This is supported by the image retrieval results, where we observe that both methods perform very poorly.

Hashtag recommendation
We used our trained tweet component embedding model to recommend the k most similar hashtags to any of the component embeddings of an input tweet. For k = 1, ..., 5, our method had better average precision, recall, and F1 than the Co-Attention baseline. As with our text retrieval experiments, this suggests that the addition of other tweet components along with images improves performance in this task. Further work might refine this approach, either by eliminating tweet components that do not improve performance or by adding new components not evaluated in this study.
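The recommendation step can be sketched as a nearest-neighbor lookup in the joint embedding space. The sketch below assumes cosine similarity as the similarity measure and scores each hashtag by its best match against any component of the input tweet. Both choices are illustrative assumptions, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend_hashtags(component_embeddings, hashtag_embeddings, k):
    """Return the k hashtags with the highest similarity to any tweet component."""
    scores = {
        tag: max(cosine_similarity(vec, comp) for comp in component_embeddings)
        for tag, vec in hashtag_embeddings.items()
    }
    ranked = sorted(scores, key=lambda tag: scores[tag], reverse=True)
    return ranked[:k]


# Toy joint-space embeddings (hypothetical values).
hashtag_embeddings = {
    "#sunset": [0.9, 0.1, 0.0],
    "#beach": [0.7, 0.7, 0.0],
    "#coding": [0.0, 0.1, 0.9],
}
tweet_components = [
    [1.0, 0.0, 0.0],  # e.g. image embedding
    [0.6, 0.8, 0.0],  # e.g. body text embedding
]
top_two = recommend_hashtags(tweet_components, hashtag_embeddings, k=2)
```

Taking the maximum over components lets a hashtag be recommended when it matches only one modality, e.g. the image but not the text.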

Bot detection
Our bot detection experiments evaluated two different methods: one that used a CNN classifier and another that used a kNN approach. In both cases, each tweet was represented by the average of all of its component embeddings. However, the Botometer v4 baseline outperformed our methods with all datasets evaluated. One limitation of our methods is the data used to train the tweet component embedding model. Our methods may perform better in the bot detection task if the model were trained with tweets from users in the bot detection training data. In the dataset described in "Dataset", bots may be underrepresented compared to the bot detection data because bots are estimated to make up only 9-15% of active Twitter accounts [67]. Another possible reason for the poor performance of our method compared to the baseline is that the baseline benefits from some manual work in separating bots in the training datasets according to distinct bot classes; their model takes advantage of this additional information while our method only considers the binary classes of "bot" and "human." One more possibility is that the poor performance of our proposed bot detection methods is caused by one or more of the tweet components, e.g. by overfitting. This may suggest that additional components do not necessarily contribute to model performance. If this is indeed the case, it could be overcome by validating models trained with different subsets of tweet components. Addressing these issues may improve the performance of our bot detection methods.
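For reference, the averaged-embedding representation and the kNN variant can be sketched as follows. The sketch assumes Euclidean distance and a simple majority vote, which are illustrative choices rather than the exact configuration of our experiments. The function names, the value of k, and the toy vectors are hypothetical.

```python
import math
from collections import Counter


def average_embeddings(component_embeddings):
    """Element-wise mean of a tweet's component embeddings."""
    n = len(component_embeddings)
    return [sum(values) / n for values in zip(*component_embeddings)]


def knn_predict(query, labeled_vectors, k):
    """Majority vote among the k nearest labeled vectors (Euclidean distance)."""
    neighbors = sorted(
        labeled_vectors, key=lambda item: math.dist(query, item[0])
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy labeled tweet representations (hypothetical values).
training = [
    ([0.0, 0.0], "human"),
    ([0.1, 0.2], "human"),
    ([0.2, 0.1], "human"),
    ([1.0, 1.0], "bot"),
    ([0.9, 1.1], "bot"),
]

# Average the component embeddings of an input tweet, then classify it.
tweet_vector = average_embeddings([[0.8, 1.0], [1.0, 1.2]])
prediction = knn_predict(tweet_vector, training, k=3)
```

Averaging keeps the representation at a fixed dimensionality regardless of which components a tweet has, at the cost of blending component-specific signals together.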

Location prediction
For our location prediction task, we concatenated tweet component embeddings into a single vector for input to a random forest classifier to predict in which one of 522 cities an input tweet was written. Compared to the Deepgeo baseline, our method performed slightly better. A future iteration of our proposed method may yield higher accuracy by taking advantage of the other tweet and user information used by Deepgeo, e.g. tweet creation time, user timezone, or account creation time.

Conclusions
In this paper, we proposed a joint embedding framework for representing multimodal tweet data to generate embeddings for machine learning tasks on Twitter. The framework aligns tweet component embeddings in the joint space using a loss function that incorporates hard negatives. We tested a trained tweet component embedding model on four Twitter machine learning applications and found that it can perform well on most applications evaluated. In the text retrieval task, our proposed method achieved recall@1 of 17.4% compared to 0.02% for the baselines. For hashtag recommendation, we achieved an F1 score of 0.1181 vs. the baseline's 0.0866 for k = 1. Our location prediction experiments showed accuracy of 52.43% for our method compared to 48.88% for the baseline. These results show that our proposed method can be applicable to a variety of tasks involving Twitter data. However, its performance in bot detection, where our method achieved an F1 score of only 69% compared to the baseline's 77% on the combined test dataset, shows that this is not the case for all tasks. Future work will investigate how our joint embedding framework can be improved, both in its design (e.g. incorporating different tweet components) and its application (e.g. how to best use generated tweet component embeddings for a given task), to perform well in additional tasks.