Task-agnostic representation learning of multimodal twitter data for downstream applications

Rivas, Ryan; Paul, Sudipta; Hristidis, Vagelis; Papalexakis, Evangelos E.; Roy-Chowdhury, Amit K.

doi:10.1186/s40537-022-00570-x

Research
Open access
Published: 10 February 2022


Ryan Rivas ORCID: orcid.org/0000-0001-5590-0274¹,
Sudipta Paul¹,
Vagelis Hristidis¹,
Evangelos E. Papalexakis¹ &
Amit K. Roy-Chowdhury¹

Journal of Big Data volume 9, Article number: 18 (2022)


Abstract

Twitter is a frequent target for machine learning research and applications. Many problems, such as sentiment analysis, image tagging, and location prediction have been studied on Twitter data. Much of the prior work that addresses these problems within the context of Twitter focuses on a subset of the types of data available, e.g. only text, or text and image. However, a tweet can have several additional components, such as the location and the author, that can also provide useful information for machine learning tasks. In this work, we explore the problem of jointly modeling several tweet components in a common embedding space via task-agnostic representation learning, which can then be used to tackle various machine learning applications. To address this problem, we propose a deep neural network framework that combines text, image, and graph representations to learn joint embeddings for 5 tweet components: body, hashtags, images, user, and location. In our experiments, we use a large dataset of tweets to learn a joint embedding model and use it in multiple tasks to evaluate its performance vs. state-of-the-art baselines specific to each task. Our results show that our proposed generic method has similar or superior performance to specialized application-specific approaches, including accuracy of 52.43% vs. 48.88% for location prediction and recall of up to 15.93% vs. 12.12% for hashtag recommendation.

Introduction

Twitter produces a wealth of information for analysis of trends, opinions, and interactions with 500 million tweets per day generated by its users [1]. As such, the microblogging service is a popular target for research involving machine learning. Several problem settings in the field of machine learning, or variants thereof, can focus on Twitter data. For example, sentiment analysis, spam detection, and location prediction, all well-established problem settings in their own right, are all applicable to tweets [2,3,4]. Much of the work in applying machine learning to Twitter data focuses on only one component, or a few components, of a tweet. A sentiment analysis model [5] might only use the text of a tweet, while a hashtag recommendation system [6] might use both text and image. A tweet can contain additional components that may be informative to a machine learning model. Examples of these components include the author of the tweet, whose interactions with other users can be modeled by a graph [7], and the location, which can link the tweet to others at the same location [8]. Incorporating several of these tweet components into a machine learning framework can potentially create a model that is better informed than others at a given task. One approach to accomplishing this is by creating a joint embedding framework.

Joint embeddings [9] are used in several machine learning tasks that handle different modalities, e.g. text and images, in order to leverage the relationship between them. The intuition behind a joint embedding space is that inputs from different modalities mapped into the space should be close if they are semantically related. For example, in the problem of image-text retrieval [10], the image captions or tags closest to an image in an embedding space should be those that best describe the visual content of the image. However, these models are generally limited to 2 or 3 modalities (components), like the aforementioned text and image. Introducing additional modalities has two potential benefits. First, it can better inform existing applications by taking these additional modalities into account. In the image-text retrieval task, also considering the author and the location of a post from an image sharing or microblogging service might achieve better results. Second, introducing additional modalities can open up a joint embedding space to new applications. Recent work in hashtag recommendation uses the text and image of a social media post as input to a neural network model [11,12,13], but a joint embedding model that includes hashtags as well as text, images, and other modalities might also perform well at this task.

Our motivation for this paper is thus to address the problem of creating a joint embedding framework that incorporates several tweet components, and of using an embedding model trained with such a framework to address multiple machine learning problems involving Twitter data. Specifically, the first question we ask in this paper, within the context of Twitter, is: can additional modalities in a joint embedding space improve its performance in typical applications and/or enable it to perform well in new ones? We show in our experiments that additional modalities can indeed allow a joint embedding model to perform better than a similar model with fewer modalities, and can also allow it to perform well at new tasks. The second question is: can we build a single, task-agnostic joint embedding model for tweets and use it in several diverse applications? Our results show that a single trained joint embedding model can be successfully applied to multiple different tasks.

Overview of the proposed approach

Our approach builds on VSE++, the framework proposed by Faghri et al. [14]. We do so by extending the framework to incorporate 3 more tweet components in addition to text and images: hashtags, considered separately from tweet text; the author, as represented by a graph embedding [15] learned from a graph of Twitter user mentions; and location, to represent the context of a tweet in terms of what other Twitter users are discussing at the same place. Extending VSE++ is not trivial, as adding 3 additional modalities complicates training. The loss function involved in training the model must account for how one modality interacts with four others rather than just one. We accomplish this with a novel approach that learns representations for each component using triplet loss [16] calculated using an embedding for one type of tweet component and the average embedding across all components of a tweet. Like VSE++, this loss incorporates hard negatives, which have been shown to be effective in several tasks [17,18,19,20].

Our proposed model is applicable to several tasks, including:

Image retrieval/text retrieval: Given a tweet t without an image (or text), retrieve text (or an image) relevant to t.
Hashtag recommendation: Given a tweet t without hashtags, predict one or more hashtags relevant to t.
Bot detection: Given a user u and several tweets written by u, predict whether or not u is a bot, i.e. an automated Twitter account.
Location prediction: Given a tweet t without geolocation data, predict where the author of t was when it was posted.

In our experiments, we show the performance of our proposed model compared to baselines from these domains using Twitter data.

Contributions: We make the following contributions.

We introduce the problem of developing a task-agnostic representation learning framework for tweets that incorporates several tweet components. To our knowledge, this is the first work to address this problem.
We develop a novel framework with pairwise ranking loss to learn a robust joint embedding with 5 tweet components.
We showcase the usefulness of additional tweet components by applying the learned embeddings to different tasks and comparing their performance to task-specific baselines.

The remainder of the paper is organized as follows: “Related work” discusses prior work. “Approach” describes our proposed method. “Experiments” explains the experimental evaluation of our proposed method and presents the results of our experiments. We conclude in “Conclusion”.

Related work

Joint embedding

Joint embedding models have been proposed for image-text retrieval [14, 21,22,23], video-sentence retrieval [24,25,26,27,28], video-paragraph retrieval [29, 30], temporal localization of moments [31,32,33], and a variety of other tasks [34,35,36,37]. The general idea behind a joint embedding model is to place vector representations of different media, such as text and images, into the same embedding space such that the distance between semantically similar vectors (e.g. an image and its captions or tags) is minimized. For the image-text retrieval task, Faghri et al. [14] projected images and text in a visual-semantic embedding space learned with a loss function that utilizes hard negatives. In [21], Mithun et al. used images and noisy text from the Web to improve a joint embedding model. Lee et al. [23] captured the fine-grained interplay between objects present in an image and text to better align images and text in a joint embedding space. For the task of video-sentence retrieval, Mithun et al. [24] employed multimodal cues such as image, motion, and audio for video encoding. In [25], Dong et al. used multi-level encodings for video and text to perform zero-example video retrieval. Wray et al. [26] enriched embedding learning by disentangling the parts-of-speech of captions. For the temporal localization task, moment-sentence pairs [31, 33] or clip-sentence pairs [32] are aligned in the joint embedding space.

Machine learning on Twitter data

Several studies have created machine learning methods specifically for use with Twitter data [38,39,40]. However, these works typically focus on text instead of also including other parts of a tweet. Other work incorporates additional tweet data, e.g. images and tweet author metadata, to accomplish specific tasks such as hashtag recommendation, bot detection, and location prediction. We explore some of these studies below.

Hashtag recommendation

Many studies have focused on hashtag recommendation for Twitter and other microblogging platforms. Rawat and Kankanhalli [11] proposed a deep neural framework to recommend descriptive tags for an image that combined image features from a convolutional neural network (CNN) [41] with features from a “ContextNet” neural network using the image’s associated location and time data as input to provide context for the image. Their proposed method outperformed a model that considered only image data. Zhang et al. [12] used an image’s features from a VGG^{Footnote 1} network [42] and text features from a long short-term memory (LSTM) network [43] representing the image’s caption as input to a co-attention mechanism [44] to recommend hashtags. They compared their proposed framework to state-of-the-art methods that used only the image caption and found that their method performed the best. The method proposed by Ma et al. [13] uses a framework similar to [12] to recommend hashtags, but it also incorporates images and text associated with previous uses of a candidate hashtag to achieve better performance than state-of-the-art methods.

Bot detection

Bot detection on Twitter is another area of active research. Botometer (formerly BotOrNot), originally proposed by Davis et al. [45], extracts over 1000 features from a Twitter user’s metadata, interaction patterns, and content, then uses them as input to a random forest classifier [46] to predict the likelihood of that user being a bot. Botometer’s most recent version, v4 [47], is an ensemble classifier that combines Botometer v3 [48] with random forest classifiers that are each trained on a specific class of Twitter bot. This method outperformed baselines on several datasets. Another approach by Kudugunta and Ferrara [49] uses a deep learning method to determine whether an individual tweet was made by a bot, rather than the more conventional approach of determining whether or not a Twitter user is a bot. It uses word embeddings from tweet text as input to an LSTM combined with tweet metadata to make its predictions and achieved high accuracy in testing compared to baselines.

Location prediction

Previous work has examined location prediction for tweets. Matsuo et al. [50] combined a text-based location estimator and a CNN for image features to perform grid-based location prediction and showed that combining image and text outperformed using only one modality. Kumar and Nezhurina [51] proposed a method to predict the location of a Twitter user’s next tweet based on the user’s past tweets. Their method used tweet geo-coordinates, mentions of predefined location categories, and tweet personality traits as input to an ensemble of popular classification methods, which outperformed individual classifiers used in the ensemble. The method proposed by Lau et al., Deepgeo [52], takes a tweet’s text and creation time as well as the author’s UTC offset, time zone, location (i.e. the free text location listed in a user’s profile), and account creation time as input to a deep learning framework to predict location as a city class, i.e. the city in which a tweet was written. The authors of [52] found that Deepgeo outperformed the state-of-the-art in this task.

Big Data analysis on Twitter

As one of the largest social media platforms, Twitter presents a prime target for studies involving analysis of big data. Linell et al. [53] estimated the prevalence of sleep loss incurred by the beginning of Daylight Saving Time (DST) by analyzing a dataset of 13.1 million tweets. They found that the beginning of DST causes changes in sleep behavior as indicated by a change in the time of peak Twitter activity. Feizollah et al. [54] analyzed tweets related to halal tourism by identifying related topics and performing sentiment analysis. They found that the word “halal” was primarily associated with food and hotels on Twitter, that many non-Muslim countries are popular halal tourist destinations, and that the majority of tweets expressed positive sentiment. Piña-García and Ramírez-Ramírez [55] examined data from Twitter and other sources to predict the most frequent crimes in Mexico City. They showed that their methods can successfully estimate the occurrences of crime, but they note that more work is necessary to develop an accurate model. Our proposed method can support future big data studies similar to these, as researchers can use it to analyze Twitter data they have collected according to their goals.

Approach

In this section, we describe our proposed model to represent tweet components in an embedding space. We first provide an overview of the VSE++ framework on which our proposed method is based (“Description of VSE++ framework”). Next, we describe the structure of the neural network framework and how we represent tweet components (“Network structure and input features”). Then, we present our approach to training a joint embedding model with this framework using pairwise ranking loss (“Training joint embedding”).

Description of VSE++ framework

VSE++, proposed by Faghri et al. [14], is a joint embedding framework designed for image-text retrieval. Given a dataset of images and captions, a trained VSE++ model attempts to pair each image with its corresponding caption. It does so by projecting them in a joint embedding space; image embeddings are generated via a ResNet [56] or VGG [42] model, while the embeddings for image captions are generated by a gated recurrent unit (GRU)-based text encoder, which is a commonly-used method to represent sentences [21].

The most notable contribution of [14] is the incorporation of hard negatives during training. In the context of image-text retrieval, a hard negative is the closest non-matching caption to an image (or vice-versa). More generally, a hard negative is the closest negative (i.e. non-matching embedding) to a training query. A loss function incorporating hard negatives will assign higher loss to a query with a negative that is particularly close to the query compared to a more conventional “sum of hinges” loss function that would be more influenced by the average distance of the negatives. Faghri et al. explain that one advantage of this approach is avoiding local minima created by the sum of hinges loss, leading to a better performing model. Their loss function for an image i and an image caption c is thus

$$\begin{aligned} {\mathcal {L}}_{ic} = \max [0, \Delta - f(i, c) + f(i, {\hat{c}})] + \max [0, \Delta - f(i, c) + f({\hat{i}}, c)], \end{aligned}$$

(1)

where ${\hat{c}}$ is the hardest negative caption with respect to i, ${\hat{i}}$ is the hardest negative image with respect to c, and $\Delta$ is the margin parameter. Experimental evaluation with the Microsoft COCO [57] and Flickr30K [58] datasets showed that this approach was superior to other image-text retrieval methods.
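To make the contrast between the hard-negative loss of Eq. 1 and a sum-of-hinges loss concrete, the following is a minimal sketch in plain Python. It assumes a square similarity matrix `sim` in which `sim[i][c]` holds f(i, c) and matching image-caption pairs lie on the diagonal; the function names and the toy matrix layout are illustrative and are not taken from the VSE++ code. Eq. 1 is applied to every pair in the minibatch and the results are summed.

```python
def hinge(x):
    """Clamp negative values to zero, i.e. max[0, x]."""
    return max(0.0, x)


def max_of_hinges_loss(sim, margin=0.2):
    """Eq. 1 summed over a minibatch: only the hardest negatives contribute."""
    n = len(sim)
    total = 0.0
    for k in range(n):
        positive = sim[k][k]
        # Hardest negative caption for image k (largest off-diagonal entry in row k).
        hardest_caption = max(sim[k][c] for c in range(n) if c != k)
        # Hardest negative image for caption k (largest off-diagonal entry in column k).
        hardest_image = max(sim[i][k] for i in range(n) if i != k)
        total += hinge(margin - positive + hardest_caption)
        total += hinge(margin - positive + hardest_image)
    return total


def sum_of_hinges_loss(sim, margin=0.2):
    """Conventional alternative: every negative in the minibatch contributes."""
    n = len(sim)
    total = 0.0
    for k in range(n):
        positive = sim[k][k]
        total += sum(hinge(margin - positive + sim[k][c]) for c in range(n) if c != k)
        total += sum(hinge(margin - positive + sim[i][k]) for i in range(n) if i != k)
    return total
```

Both functions need at least two pairs in the minibatch, since a single pair has no negatives. Because the hardest negative is one of the terms in the sum-of-hinges loss, the second function never returns a smaller value than the first on the same matrix.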

VSE++ is written in Python and uses the PyTorch deep learning library [59]. Using this as our basis, we extend the concepts presented by Faghri et al. to account for additional modalities present in tweets.

Network structure and input features

Network architecture

We learn a joint embedding model using a deep neural network framework. Our framework, shown in Fig. 1, has 5 branches for tweet text, an image, hashtags, the location of the tweet, and the user who wrote the tweet. Each of these branches uses a different network. The goal of this design is for the individual branch networks to focus on component-specific features while the fully connected layers convert these to embeddings in the joint space with a dimensionality of 1024.

Text representation

For encoding tweet text (i.e. the text of the tweet without hashtags), we use an embedding layer with weights initialized with word embeddings from a fastText [60] model trained on Twitter data. The dimensionality of this layer is 300. The word embeddings are then input to a GRU. The GRU maps the text features to the joint embedding space.

Image representation

To encode an image contained in a tweet, we use a 152-layer ResNet model [56] trained on the ImageNet dataset [61]. The dimensionality of the image embedding is 2048; this is mapped to the joint space via a fully connected layer. Similar frameworks have also evaluated a 19-layer VGG model [42] as an alternative; however, ResNet has been shown to perform better, at least for the task of image-text retrieval [14, 21], so we limit our experiments to the ResNet model.

Hashtag representation

To represent hashtags, which are a form of metadata used to denote keywords or topics within tweet text [13], we first separate hashtags from the text of the tweet. We then average over the fastText word embeddings of the extracted hashtags and map this to the joint space with a fully connected layer.
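The hashtag preprocessing described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the regular expression, the lowercasing of hashtags, and the plain dictionary `vectors` standing in for a trained fastText model are all hypothetical choices. The fully connected layer that maps the averaged vector to the joint space is omitted.

```python
import re

# Hypothetical pattern: a hashtag is '#' followed by word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def split_hashtags(tweet):
    """Return (text without hashtags, list of lowercased hashtags)."""
    hashtags = [tag.lower() for tag in HASHTAG_PATTERN.findall(tweet)]
    # Remove the hashtags, then collapse the leftover whitespace.
    body = " ".join(HASHTAG_PATTERN.sub(" ", tweet).split())
    return body, hashtags


def average_embedding(tokens, vectors):
    """Average the vectors of all tokens found in `vectors`; None if none are found."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[d] for vec in found) / len(found) for d in range(dim)]
```

For example, `split_hashtags("Sunset at the pier #Sunset #beach")` separates the tweet body from the two hashtags, and `average_embedding` then reduces the hashtag vectors to a single input vector for the hashtag branch.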

Location representation

To emphasize the context of a tweet, we consider location in terms of tweets from the same place. This is a two-step process. First, we collect the text of “neighbor” tweets from the same place as an input tweet. For simplicity, we do this by grouping tweets from our collected Twitter data with the same Twitter-assigned place ID as the input tweet. Second, the texts of these tweets are encoded through a network branch identical to that of the input tweet’s text, and the resulting encodings are averaged.
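A minimal sketch of this grouping-and-averaging step is shown below. The dictionary keys `place_id` and `text` and the `encode_text` callable (a stand-in for the GRU-based text branch) are hypothetical. The sketch also assumes that every tweet sharing a place ID, including the input tweet itself, counts as a neighbor, which the description above does not specify.

```python
from collections import defaultdict


def location_embeddings(tweets, encode_text):
    """Map each place ID to the average encoding of the tweets posted there."""
    encodings_by_place = defaultdict(list)
    for tweet in tweets:
        encodings_by_place[tweet["place_id"]].append(encode_text(tweet["text"]))

    averaged = {}
    for place_id, encodings in encodings_by_place.items():
        dim = len(encodings[0])
        averaged[place_id] = [
            sum(enc[d] for enc in encodings) / len(encodings) for d in range(dim)
        ]
    return averaged
```

Every tweet with a given place ID then shares the same location input, which is what lets the location component link a tweet to others posted at the same place.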

User representation

We represent the author of a tweet by using a trained graph embedding model. Specifically, we use the fastnode2vec [62] implementation of node2vec [63]. The weighted graph used to train this model was constructed with users as vertices and mentions as edges, i.e. an edge (u, v, w) represents user u mentioning user v in w tweets. The dimensionality of the graph embedding is 300; this is mapped to the joint space via a fully connected layer.
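The edge list for such a mention graph could be assembled along the following lines. This is a sketch only: the `(author, text)` input format and the regular expression for mentions are assumptions, and the edges are treated as directed from the author to the mentioned user. Repeated mentions of the same user within one tweet are counted once, so that w is a number of tweets, as in the definition above.

```python
import re
from collections import Counter

# Hypothetical pattern: a mention is '@' followed by word characters.
MENTION_PATTERN = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Build weighted edges (u, v, w): user u mentions user v in w tweets."""
    counts = Counter()
    for author, text in tweets:
        # A set ensures each mentioned user is counted once per tweet.
        for mentioned in set(MENTION_PATTERN.findall(text)):
            counts[(author, mentioned)] += 1
    return [(u, v, w) for (u, v), w in sorted(counts.items())]
```

The resulting list of `(u, v, w)` triples is the kind of weighted edge list that a node2vec implementation takes as input to learn one vector per user.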

Training joint embedding

For a pair of dissimilar tweets $t_1$ and $t_2$, $t_1$ should have embedding vectors for its components that are similar to each other, but are not similar to the embedding vectors of the components of $t_2$. Conversely, $t_2$ should have embedding vectors for its components that are similar to each other, but are not similar to the embedding vectors of the components of $t_1$. With that intuition in mind, our goal is to learn a joint embedding characterized by the weights of the fully connected layers, the text and location word embedding layers, and the GRUs.

We base our approach on previous work that uses hinge-based bi-directional ranking loss for visual-semantic embeddings [14, 21]. These approaches maximize the similarity between corresponding image and text embeddings and minimize similarity to non-matching embeddings. They also focus on hard negatives, i.e. given a pair (i, t) of image and text embedding vectors, the corresponding hard negatives are the image vector ${\hat{i}} \ne i$ and the text vector ${\hat{t}} \ne t$ closest to t and i, respectively.

Our approach must also account for hashtags, user, and location. To accomplish this, we first calculate the loss using each pair (c, a) in a minibatch, where c is the embedding for one component from a tweet (e.g. text or image) and a is the averaged tweet component embeddings from the same tweet. This can be written as follows:

$$\begin{aligned} {\mathcal {L}}_{ca} = \sum _{(c, a)}\{\max [0, \Delta - f(c, a) + f(c, {\hat{a}})] + \max [0, \Delta - f(a, c) + f(a, {\hat{c}})]\}, \end{aligned}$$

(2)

where $\Delta$ is the margin value for the ranking loss, $f(c,a)=f(a,c)$ is the similarity scoring function between a tweet component embedding c and averaged tweet component embeddings a, and ${\hat{a}} = \mathop {\mathrm {arg\,max}}\limits _{a^-}{f(c,a^-)}$ and ${\hat{c}} = \mathop {\mathrm {arg\,max}}\limits _{c^-}{f(a,c^-)}$ are the hardest negative samples. In our experiments we use cosine similarity for f(c, a), but our approach does not depend specifically on this. With Equation 2 and the set of tweet component embeddings (t, i, h, l, u), representing text, image, hashtags, location, and user of a tweet, respectively, our complete loss function is

$$\begin{aligned} {\mathcal {L}} = {\mathcal {L}}_{ta} + {\mathcal {L}}_{ia} + {\mathcal {L}}_{ha} + {\mathcal {L}}_{la} + {\mathcal {L}}_{ua}. \end{aligned}$$

(3)

That is, the total loss for a minibatch is the sum of the component-specific minibatch losses (Eq. 2).
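Equations 2 and 3 can be traced through a small plain-Python sketch. It assumes that each tweet in the minibatch is a dictionary mapping a component name to its joint-space embedding, and that the average a is taken over all of a tweet's component embeddings without any prior normalization; these representation details and the function names are illustrative. Cosine similarity is used for f, as in the experiments.

```python
import math


def cosine(x, y):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def component_loss(components, averages, margin=0.2):
    """Eq. 2 for one component type; index k refers to the same tweet in both lists."""
    n = len(components)
    loss = 0.0
    for k in range(n):
        positive = cosine(components[k], averages[k])
        # Hardest negative average for component k, and hardest negative component for average k.
        hardest_a = max(cosine(components[k], averages[j]) for j in range(n) if j != k)
        hardest_c = max(cosine(averages[k], components[j]) for j in range(n) if j != k)
        loss += max(0.0, margin - positive + hardest_a)
        loss += max(0.0, margin - positive + hardest_c)
    return loss


def total_loss(batch, margin=0.2):
    """Eq. 3: sum the Eq. 2 loss over every component type in the minibatch."""
    averages = [mean_vector(list(tweet.values())) for tweet in batch]
    return sum(
        component_loss([tweet[name] for tweet in batch], averages, margin)
        for name in batch[0]
    )
```

The minibatch must contain at least two tweets so that negatives exist. When the components of each tweet agree with each other and differ from those of other tweets, every hinge term is zero; when all embeddings collapse to the same vector, every term equals the margin.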

Experiments

Though the tweet component representations/embeddings generated by our framework are learned in a task-agnostic way, our experiments demonstrate the proposed model’s effectiveness on several machine learning applications involving Twitter data by comparing the results of experiments versus baselines designed specifically for those applications. In each of these experiments, we generate tweet component embeddings from our trained model and use them in an application-specific framework as shown in Fig. 2. Note that the tweet component embedding model is only trained once, then used in all of the applications evaluated in our experiments.

Table 1 summarizes the applications, baselines, and performance metrics in our experiments.

Table 1 Applications and baselines evaluated


Dataset

We trained our model on a dataset of tweets before applying the model to the machine learning tasks studied in our experiments. We created this dataset by collecting tweets from the Twitter Streaming API, which streams tweets in real-time.^{Footnote 2} To accomplish this, we used the Tweepy library for Python.^{Footnote 3} We collected all tweets from the API within the bounds established by the API’s rate limits. From these tweets, we filtered out tweets that do not contain all of the components necessary to generate embedding vectors with our proposed model (i.e. text, image, hashtags, geolocation data, and author ID). 100,000 tweets in the dataset from March 1–8, 2020 were used for training, while 5000 tweets from March 9, 2020 were used for validation. An additional 5000 tweets from March 10, 2020 were used for testing. The tweets used in our experiments were limited to those with a location that falls within a bounding box that encompasses most of North America.
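The filtering and date-based split described above might look like the following sketch. The field names in `REQUIRED_FIELDS` and the dictionary representation of a tweet are hypothetical, not the Twitter API's schema. The bounding-box check is left out because its coordinates are not given here, and sampling down to 100,000, 5000, and 5000 tweets is not shown.

```python
from datetime import date

# Hypothetical field names for the five components the model needs.
REQUIRED_FIELDS = ("text", "image", "hashtags", "place_id", "author_id")


def is_complete(tweet):
    """True if the tweet has a non-empty value for every required component."""
    return all(tweet.get(field) for field in REQUIRED_FIELDS)


def split_for(day):
    """Assign a tweet's posting date to a split; None if outside the collection window."""
    if date(2020, 3, 1) <= day <= date(2020, 3, 8):
        return "train"
    if day == date(2020, 3, 9):
        return "validation"
    if day == date(2020, 3, 10):
        return "test"
    return None
```

Splitting by date rather than at random keeps the validation and test tweets strictly later than the training tweets.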

Training details

The embedding networks in our model were trained with an Adam optimizer [64] over a total of 30 epochs. We set the initial learning rate to 0.0002 and decreased the learning rate by a factor of 10 after 15 epochs. We set the gradient L2 norm threshold for clipping gradients to 2, the margin $\Delta$ to 0.2, and the mini-batch size to 128. The model was evaluated on the validation set every 500 training iterations. The trained model used for evaluation on the test data was selected based on the sum of recalls (recall @ 1, 5, and 10) on the validation set to mitigate overfitting. The fastText and node2vec models used in our proposed model were trained on tweets from March 1–7, 2020, which were collected from the Twitter Streaming API as described in “Dataset”.
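The schedule, gradient clipping, and checkpoint selection described above can be written out as a short sketch. It uses plain Python rather than the PyTorch utilities the actual training code would rely on, assumes epochs are counted from zero, and assumes the gradient is clipped by rescaling the whole vector; the function names and the checkpoint identifiers are illustrative.

```python
import math


def learning_rate(epoch, initial=0.0002, decay_epoch=15, factor=10.0):
    """Initial rate for epochs 0-14, reduced by `factor` from epoch 15 onward."""
    return initial / factor if epoch >= decay_epoch else initial


def clip_by_l2_norm(gradient, threshold=2.0):
    """Rescale the gradient so that its L2 norm does not exceed `threshold`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)
    scale = threshold / norm
    return [g * scale for g in gradient]


def select_checkpoint(validation_recalls):
    """Pick the checkpoint with the largest sum of recall @ 1, 5, and 10."""
    return max(validation_recalls, key=lambda ckpt: sum(validation_recalls[ckpt]))
```

Here `validation_recalls` maps a checkpoint identifier (one per validation pass, i.e. every 500 iterations) to its three recall values, so the model kept for testing is the one that did best on held-out data rather than the last one trained.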

Image/text retrieval

For image retrieval and text retrieval, we compare our proposed model to VSE++ [14], which was the basis for our method. We also include results from using hashtags as the text for VSE++ because many hashtags have significant descriptive information of their associated images [65], which the text of a tweet might not. In our image retrieval experiments, which are based on the experiments in [14], each model is given a test tweet minus the image and attempts to retrieve a relevant image from the test data. Similarly, our text retrieval experiments involve attempting to retrieve relevant tweet text given a test tweet minus its text.

While retrieval with VSE++ is simply a matter of finding the most similar image to the input text (or vice versa), this becomes somewhat more complex with the additional embeddings in our proposed model’s joint embedding space. To determine which image (or text) embedding to retrieve, we retrieve the embedding corresponding to the highest similarity score from any of the input tweet’s component embeddings. This is shown with our proposed model in Fig. 3, where the image corresponding to the image embedding closest to the input tweet’s component embeddings is shown.
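The retrieval rule described above, scoring each candidate by its highest similarity to any of the input tweet's component embeddings, can be sketched as follows. The function names and the dictionary of candidates are illustrative, and cosine similarity is assumed as the scoring function.

```python
import math


def cosine(x, y):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def rank_candidates(query_components, candidates):
    """Rank candidate IDs by their best similarity to any query component embedding."""
    def best_score(candidate_embedding):
        return max(cosine(q, candidate_embedding) for q in query_components)

    return sorted(candidates, key=lambda cid: best_score(candidates[cid]), reverse=True)
```

For image retrieval, `query_components` would hold the text, hashtag, location, and user embeddings of the test tweet and `candidates` would hold the image embeddings of the test set; the first ID in the returned list is the retrieved image.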

Our results are shown in Table 2 (image retrieval) and Table 3 (text retrieval). Our method achieves higher recall (@ 1, 5, and 10) and a better median rank than VSE++ in both image retrieval and text retrieval of Twitter data.
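For reference, the two retrieval metrics can be computed as in the sketch below. It assumes `ranks` holds, for each test query, the 1-based position of the ground-truth item in the ranked results; the function names are illustrative. Higher recall is better, whereas a lower median rank is better.

```python
from statistics import median


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item appears in the top k results."""
    return 100.0 * sum(1 for rank in ranks if rank <= k) / len(ranks)


def median_rank(ranks):
    """Median 1-based rank of the ground-truth item across all queries."""
    return median(ranks)
```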

Table 2 Image retrieval results


Table 3 Text retrieval results
