Cross-modality representation learning from transformer for hashtag prediction

Hashtags are keywords that describe the theme of social media content and have become very popular in influence marketing and trending topics. In recent years, hashtag prediction has become a hot topic in AI research, aiming to help users with automatic hashtag recommendations by capturing the theme of the post. Most previous work focused only on textual information, but many microblog posts contain not only text but also corresponding images. This work explores both the image and text features of the microblog post. Inspired by the self-attention mechanism of the transformer in natural language processing, visual-linguistic pre-trained models with transfer learning have also achieved strong results on many downstream tasks that require image and text inputs. However, most existing models for multimodal hashtag recommendation are based on the traditional co-attention mechanism. This paper investigates the cross-modality transformer LXMERT for multimodal hashtag prediction and develops LXMERT4Hashtag, a cross-modality representation learning transformer model for hashtag prediction. It is a large-scale transformer model that consists of three encoders: a language encoder, an object encoder, and a cross-modality encoder. We evaluate the presented approach on the InstaNY100K dataset. Experimental results show that our model is competitive, achieving a precision of 50.5% vs 46.12%, recall of 44.02% vs 38.93%, and F1-score of 47.04% vs 42.22% compared to the existing state-of-the-art baseline model.


Introduction
In recent years, social media networks such as Twitter, Instagram, and Sina Weibo have gained huge popularity. People use these platforms for communication and for sharing their opinions on different daily activities. The rapid adoption of social media results in a huge volume of social media content on a daily basis. According to the latest statistics, Twitter has 330 million monthly active users, while the number for Pinterest has reached 444 million, and Instagram has more than 2 billion active users. To avoid being overwhelmed, a good choice that improves information diffusion is the use of hashtags. Hashtags indicate the theme of a social media post and have proven to be useful for influence, opinion analysis, forecasting, prediction, and other purposes. Hashtag prediction has therefore become an important research topic and has received considerable attention in recent years. Although many researchers have worked on hashtag prediction, most of the previous work has focused only on text information, which sometimes does not provide the context of user opinions. From our data analysis, we observe that social media posts contain different sources of data (e.g., texts, images, and videos) that express the user's opinions or daily life moments. It is therefore not easy to correctly recommend hashtags for multimodal data using a model that is designed around text information only. Figure 1 illustrates a multimodal microblog post from Instagram with the hashtag #cat, where the cat information is not given in the text content of the post. With only textual information, we may predict hashtags about gifts and celebrations. However, the hashtag #cat can hardly be identified.
Many approaches have been developed for hashtag recommendation for social media content. Most of these approaches have used traditional deep learning techniques, while recently, the transformer-based [1] BERT [2] model has been adopted as an effective method for many classification problems. Unlike previous research works, which use either a simple fusion of image and text or the traditional co-attention approach, we adopt a cross-modality transfer learning model for the hashtag prediction task and convert this task into a transfer learning problem.
Inspired by the achievements of visual-linguistic transformers in many downstream tasks, we adopt the cross-modality representation learning transformer mechanism to extract the features from the multimodal dataset with a cross-attention transformer layer to capture the interactions among images, texts, and hashtags. A multilabel classification head is added on top of the cross-attention layer (the last hidden state) to generate the probability score of each hashtag from the hashtag list. Our approach is based on the recently developed Learning Cross-Modality Encoder Representations from Transformers (LXMERT) [3]. LXMERT extends the self-attention mechanism of the recent language model BERT to cross-modality vision-and-language interaction. We train this cross-modality model on a multimodal hashtag dataset and call the resulting model LXMERT4Hashtag, a cross-modality transfer learning model for hashtag prediction. It is an efficient method of hashtag prediction from cross-modal representation learning. The model is based on separate streams for an object encoder and a language encoder that communicate through a cross-modality encoder. In summary, in this article, we present a cross-modality representation learning transformer model for hashtag recommendation. The key contributions can be summarized as follows:
• We investigate the hashtag recommendation task for a multimodal dataset and propose a cross-attention-based transformer framework, called LXMERT4Hashtag, to extract the features from both image and text and to capture the correlation between hashtags and image-text features.
• We frame the hashtag recommendation task as a multi-label classification problem. To formulate this task, we employ the cross-modality representation learning transformer architecture to model the recommendation process.
• Extensive experimental results on a large dataset crawled from Instagram, named InstaNY100K, demonstrate that our proposed model achieves better results for hashtag recommendation tasks by fully exploiting cross-modal representation learning.
The rest of the paper is organized as follows: the Related works section reviews the related research. The Approach section presents the proposed model, and the Experiments section presents the experimental results and evaluates the performance of the proposed model. The Conclusion section presents the conclusion and future work.

Related works
Our work relates to hashtag recommendation for multimodal content using a cross-modality transformer model. While previous work has relied on traditional deep learning, there is also increasing recognition of the importance of being able to handle multimodal social media content and cross-modal attention. We review some of this work in this section.

Hashtag recommendation
Due to the usefulness of hashtags in influence marketing, opinion mining, and many other purposes, hashtag recommendation has become an attractive research field in recent years. Researchers have proposed many approaches from different perspectives. Zangerle et al. [4] introduced an approach for highly appropriate hashtag recommendation based on the TF-IDF content similarity of the user's tweet and other tweets. The aim of the recommendation is to encourage the user to use more appropriate hashtags and avoid synonymous hashtags. Ref. [5] used Latent Dirichlet Allocation for hashtag recommendation for microblogs using a topic-specific translation model. Sedhai et al. [6] proposed a learning-to-rank solution for hyperlinked tweets. First, they select candidate hashtags through five similarity schemes, including similar documents, similar tweets, the named entities contained in the document, and the domain of the link; then, for hashtag recommendation, they adopt the RankSVM approach. Hashtag-LDA, proposed by [7], finds meaningful latent topics and the relationships between topics and hashtags. Motivated by the success of convolutional neural networks (CNNs) in some natural language processing tasks, [8] adopted CNNs for the hashtag recommendation problem and proposed a novel architecture with an attention mechanism. Ref. [9] proposed TAB_LSTM, a hashtag recommendation model with a Topical Attention-Based LSTM, in which an attention mechanism merges local hidden representations with global topic vectors. Li et al. [10] proposed a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) model for hashtag recommendation. They utilized tweet vector features to categorize hashtags without any feature engineering. In their work, distributed word representations are learned with the skip-gram model. Then, a convolutional neural network is trained on semantic phrase vectors, and the phrase vectors are used to train an LSTM-RNN. The hashtag prediction method based on multiple features of microblogs proposed by [11] utilized a user-based approach that takes advantage of similar users. Hashtag2Vec, proposed by [12], explores the hashtag-tweet, tweet-word, word-word, and hashtag-hashtag relationships based on a hierarchical heterogeneous network. Ref. [13] proposed DeepTagRec, a content-cum-user-based deep learning model to recommend appropriate hashtags on Stack Overflow. Ref. [14] proposed a Topical Co-Attention Network (TCAN), which learns the content representation with a bidirectional LSTM, constructs a topical word matrix to represent the topic, and combines them with a co-attention mechanism. Parallel Long Short-Term Memory (PLSTM) was proposed in [15] for the hashtag recommendation task based on the current post content and a representation of the post history. Zhang et al. [16] proposed Semantically Enhanced Tag Recommendation (STR), a deep learning-based approach that recommends tags through semantic learning of both tags and questions on software information sites (e.g., Stack Overflow).
Many researchers have worked on image datasets for hashtag recommendation using different approaches. Most of them pay attention to the tags annotated by users through social media services such as Flickr. For example, [17] introduced tag recommendation strategies to assist the user with photo annotation by recommending a set of tags for the photo. Ref. [18] studied the problem of personalized tag recommendation for images. Liu et al. [19] proposed a tag ranking system that ranks the tags associated with given Flickr photos according to their relevance to the image content. Li et al. [20] train efficient ensembles of Support Vector Machines per tag, enabling fast classification. To learn the semantics of images, most recent approaches rely on CNNs. HARRISON [21] introduced a hashtag recommendation model for real-world photos in social media, which includes a visual feature extractor based on a convolutional neural network (CNN) for multi-label hashtag classification. On the HARRISON dataset, two single-feature models (object-based and scene-based), as well as a model integrating both, are evaluated using this framework. For personalized hashtag recommendation, [22] proposed a deep learning model that considers the user's preferences and visual information for image tag recommendation. For sequence relationships between social media images and hashtags, an attention-based neural image hashtag network (A-NIH) is proposed in [23]. CNN Inception V3 with LSTM-Atten is used for the sequential image features, and a GRU is used to generate the output. Kao et al. [24] combined image classification and semantic embedding models to design an efficient and resource-aware hashtag recommendation model based on deep neural networks. The Voting Deep Neural Network with Associative Rules Mining (VDNN-ARM) framework was proposed in [25] for image hashtag recommendation. For user representation learning for image hashtag prediction, Durand et al. [26] first extracted the visual representation of each image and then computed a vectorial representation of each image-hashtag pair.
From the brief descriptions given above, we can observe that most of the previous works focused on either textual information or visual features. In this work, the proposed method incorporates both textual and visual information.

Multimodal hashtag recommendation
As information gets more diverse on social media, hashtag recommendation for multimodal data has attracted a lot of attention in recent years. Various approaches have been studied for multimodal hashtag recommendation from different aspects. Previous works on hashtag recommendation for multimodal datasets usually focused on traditional co-attention for multimodal representation learning. For example, [27] first extracted the image features with a 16-layer VGGNet [28] and the text features with an LSTM, and then fed them into a co-attention network that incorporates textual and visual information to recommend hashtags for multimodal tweets. The user habit-based hashtag recommendation proposed in [29] is based on two modules: a content model for post image and text data based on a parallel co-attention mechanism, and a model for users' tagging habits. Finally, the post feature vector from the content modelling module and the habit influence vector from the user habit module are concatenated for hashtag recommendation. The Attention-based Multimodal Neural Network model (AMNN) was proposed by [30] for hashtag sequence prediction. To extract the image feature representations, a hybrid neural network architecture was adopted. In the first step, the preliminary feature map of a given image is captured by a CNN, and then an LSTM is applied to process the intermediate features sequentially. Text features are extracted by a BiLSTM model. The multimodal representation was obtained by concatenating the image and text distributed representations, and a gated recurrent unit (GRU) [31] network decoder was used for the final hashtag recommendation. The Co-Attention Memory Network developed by [32] for multimodal microblog hashtag recommendation combines the attention mechanism with memory networks. With the co-attention mechanism, it first obtains text-based visual attention and image-based textual attention. It then feeds them into a cross-attention memory framework to extract the users' interests from the users' microblog history. CACNet [33] proposed a Cross-Active Connection Network for multimodal feature fusion. Ref. [34] proposed a deep neural network framework that combines the tweet components for representation learning of multimodal Twitter data for downstream applications, including hashtag recommendation tasks. An overview of the selected literature is shown in Table 1.
As mentioned in the introduction, these works are based on the traditional co-attention mechanism. Building on the self-attention mechanism of the transformer [1], BERT [2] has achieved great success in natural language processing, and cross-modal learning has propelled great advances in vision-and-language joint learning tasks. In this work, we adopt the cross-modal representation learning transformer mechanism to better learn cross-modal visual-linguistic features for hashtag prediction.

Approach
We frame the hashtag recommendation task as a multi-label classification problem. To formulate this task, we adopt the cross-modality transformer architecture to model the recommendation process and use the cross-modality head of the LXMERT model, similar to visual question answering. It learns joint representations on top of the image and text embeddings and assigns a probability score to each hashtag in the hashtag list according to its relevance, based on the image-text features from the cross-modality representations; the top k hashtags are then selected and used for hashtag prediction. The problem is formulated as follows: given an input image with its corresponding text content, we aim to train a cross-modality representation learning model with cross-attention, following the self-attention mechanism of the transformer, for the hashtag recommendation task. In this section, we describe the model architecture in detail.

Fig. 2 The overview of LXMERT4Hashtag. The [CLS] token on top of the text embedding passes through the cross-attention layers and learns the joint representation. The corresponding feature vector of the [CLS] token is represented as the yellow square on top of the language output vector, where multilabel classification is used for top K hashtag recommendation

Model architecture
Figure 2 presents the architecture of the proposed model. Self-attention sub-layers and cross-attention sub-layers are abbreviated as 'Self' and 'Cross', respectively. A feed-forward sub-layer is denoted by 'FF'. First, each sentence from the input text is represented as a sequence of word tokens, and the image as a sequence of objects. The model consists of two separate encoders for text and image representation that are based on the transformer self-attention mechanism, and a cross-modality encoder with cross-attention layers, following the transformer attention mechanism, for visual-linguistic interaction. The output is a vector in which each entry is the predicted probability of one hashtag; the model is trained with the Binary Cross-Entropy loss.
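To make the role of the Binary Cross-Entropy loss concrete, the following is a minimal sketch for a single post, assuming the classification head emits one raw score (logit) per hashtag and that the target is a 0/1 vector over the hashtag list. The function names and the toy values are illustrative assumptions, not the implementation used in this work.

```python
import math


def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over all hashtags of one post.

    logits  : one raw score per hashtag in the hashtag list
    targets : 1 if the hashtag is attached to the post, else 0
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        # Each hashtag is treated as an independent yes/no decision.
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(logits)


# Toy example (values are made up): three hashtags, two of them relevant.
logits = [2.0, -1.0, 0.5]
targets = [1, 0, 1]
loss = bce_loss(logits, targets)
print(f"BCE loss: {loss:.4f}")
```

Because each hashtag contributes its own independent term, a post can have several relevant hashtags at once, which is what distinguishes the multi-label setting from single-label softmax classification.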

Pre-processing
Both the image and the text input need to be converted into sequences of features. Each sentence from the input text is represented as a sequence of word tokens, and the image as a sequence of objects. These embedding features are then further processed by the corresponding encoding layers.

Text embedding
Text is tokenized with the WordPiece tokenizer [35], as in BERT [2]. Each sentence is split into words $\{w_1, \ldots, w_n\}$, and each word $w_i$ has an index $i$. Embedding layers map each word and its corresponding index to vectors. The final word embeddings are obtained by adding the word embedding and the index (position) embedding.
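The equation for this step is not reproduced in this text. As a hedged reconstruction, the index-aware word embedding in the LXMERT formulation [3] can be written as follows; the symbols $\hat{w}_i$, $\hat{u}_i$, and $h_i$ are taken from that formulation and are assumed, not confirmed, to match the notation intended here.

```latex
\begin{aligned}
\hat{w}_i &= \mathrm{WordEmbed}(w_i), \\
\hat{u}_i &= \mathrm{IdxEmbed}(i), \\
h_i &= \mathrm{LayerNorm}\left(\hat{w}_i + \hat{u}_i\right).
\end{aligned}
```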

Image embedding
The image embedding is based on object-level image embeddings. Faster R-CNN [36] with bottom-up top-down attention [37] is used for image embedding. Objects are detected via an object detector. Each object is represented by a position feature $p_j$ and a region-of-interest feature $f_j$. The position feature holds the bounding box coordinates of the object, and the region-of-interest feature is a 2048-dimensional feature vector of the object's region. The final object embedding $v_j$ is formed by adding both features $p_j$ and $f_j$. Thus, the image embedding is a position-aware object embedding. In order to balance the two types of features, each feature embedding is normalized before the addition.
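The position-aware object embedding can be sketched as below. This is an illustration under stated assumptions: both features are first projected to a common hidden size by a linear layer, layer normalization is applied without learned gain and bias, and the two normalized vectors are averaged as in the LXMERT formulation [3] (the text above describes this step as an addition). The weights, the toy dimensions, and all function names are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Compute y = W x + b, with W stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def object_embedding(roi_feature, box, w_f, b_f, w_p, b_p):
    """Combine the RoI feature f_j and the box feature p_j into v_j."""
    f_hat = layer_norm(linear(roi_feature, w_f, b_f))  # normalized visual feature
    p_hat = layer_norm(linear(box, w_p, b_p))          # normalized position feature
    # Normalizing first keeps either feature from dominating the sum.
    return [(f + p) / 2 for f, p in zip(f_hat, p_hat)]


# Toy sizes for readability; the paper uses 2048-dimensional RoI features.
ROI_DIM, BOX_DIM, HIDDEN = 8, 4, 6
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


w_f, b_f = rand_matrix(HIDDEN, ROI_DIM), [0.0] * HIDDEN
w_p, b_p = rand_matrix(HIDDEN, BOX_DIM), [0.0] * HIDDEN
roi_feature = [rng.uniform(0, 1) for _ in range(ROI_DIM)]
box = [0.1, 0.2, 0.6, 0.8]  # hypothetical normalized (x1, y1, x2, y2)

v_j = object_embedding(roi_feature, box, w_f, b_f, w_p, b_p)
print(len(v_j))
```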

Single-modality encoders
The proposed model consists of two single-modality encoders, one for text representation and the other for image representation. This two-stream architecture has some advantages. First, we can reuse existing pre-trained models; for example, the text encoder can benefit from the pre-trained text model BERT. Another advantage of the two-stream design is that the model can be fine-tuned for retrieval tasks separately for images and text. Both encoders are based on self-attention and adopt multi-head attention as described for the Transformer [1], in the same way as BERT for the text encoder. In the following, we describe each encoder in detail.

Text encoder
Index-aware word embeddings that consist of the word $w_i$ and its index $i$ are fed to a transformer encoder. Each of the $N_L$ layers in the BERT-based text encoder is formed of a self-attention layer and a feed-forward layer, each followed by a residual connection and a normalization layer.
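As a compact illustration of one such encoder layer, and assuming the post-layer-normalization arrangement of the original Transformer [1] and BERT [2], the two sub-layers applied to the token representations $h$ can be written as follows; the symbols $\tilde{h}$ and $h'$ are introduced here only for illustration.

```latex
\begin{aligned}
\tilde{h} &= \mathrm{LayerNorm}\left(h + \mathrm{SelfAtt}(h)\right), \\
h' &= \mathrm{LayerNorm}\left(\tilde{h} + \mathrm{FF}(\tilde{h})\right).
\end{aligned}
```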

Object-relationship encoder
The object-relationship encoder takes location features in addition to the visual features, because the self-attention mechanism of the Transformer model is order-less. After obtaining the representations of 36 objects, each represented by a position feature and a region-of-interest feature, the embedding is sent to the object-relationship encoder for subsequent processing. Each of the $N_R$ layers in the object-relationship encoder mainly includes a self-attention layer and a feed-forward layer. In addition, the "+" sign in Fig. 2 denotes a residual connection followed by layer normalization. The overall structure of this encoder is similar to that of the text encoder, where a transformer encoder takes the image embedding as input.

Cross-modality encoder
The cross-modality module is formed of (a) two self-attention layers, (b) one bi-directional cross-attention layer, and (c) two feed-forward layers. Each layer's output is used as input to the next layer. The bi-directional cross-modality layer is formed of two unidirectional cross-attention layers, one from language to image and one from image to language. Residual connections and normalization layers are added after each layer. The output vectors represent the language features and the vision features. The cross-attention layer aims to exchange information between the language and vision modalities. The output of the cross-attention layer is then passed to the self-attention layers. Finally, the output of the self-attention layers is passed to the feed-forward layers to give the final output.
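The information flow of one cross-modality layer can be sketched as follows. This is a deliberately simplified, single-head illustration: it assumes scaled dot-product attention without learned query, key, and value projections, and it omits layer normalization and the feed-forward sub-layers, so it shows only the order of the cross-attention and self-attention steps with their residual connections. All names and the toy feature values are assumptions for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attend(queries, context):
    """Scaled dot-product attention: each query attends over the context vectors.

    The context serves as both keys and values (no learned projections here).
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def add_residual(x, y):
    """Element-wise residual connection between two sequences of vectors."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def cross_modality_layer(lang, vis):
    # (1) Bi-directional cross-attention: each modality queries the other,
    #     both using the features from the previous layer.
    lang_x = add_residual(lang, attend(lang, vis))
    vis_x = add_residual(vis, attend(vis, lang))
    # (2) Self-attention within each modality on the cross-attended features.
    lang_s = add_residual(lang_x, attend(lang_x, lang_x))
    vis_s = add_residual(vis_x, attend(vis_x, vis_x))
    # (3) The feed-forward sub-layers would follow here (omitted in this sketch).
    return lang_s, vis_s


# Toy inputs: 3 word tokens and 2 objects, each with 4-dimensional features.
lang = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
vis = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
lang_out, vis_out = cross_modality_layer(lang, vis)
print(len(lang_out), len(vis_out))
```

Each modality keeps its own sequence length: the language stream still has one vector per word token and the vision stream one vector per object, but every vector now carries information from the other modality.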

Output representations
We consider hashtag recommendation as a multi-label classification task. As shown in the right-most part of Fig. 2, the cross-modality encoder has three outputs, for language, vision, and cross-modality. We adopt the cross-modality output representation for the hashtag recommendation task. For cross-modality representation learning, a special [CLS] token on top of the text embedding passes through the cross-attention layers, from language to image and from image to language, and learns the joint representation. The corresponding feature vector of the [CLS] token is used as the cross-modality output, which is represented by the yellow square on top of the language output vector. Finally, we use the last hidden state of the [CLS] token for multilabel classification to assign a probability score to each hashtag according to its relevance, based on the image-text features; the top k hashtags are then selected and used for hashtag prediction.
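The final selection step can be illustrated with a short sketch, assuming the classification head produces one raw score (logit) per hashtag and that a sigmoid turns each score into an independent probability. The hashtag list, the scores, and the function names are made up for illustration.

```python
import math


def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def recommend_hashtags(logits, hashtag_list, k=2):
    """Return the k hashtags with the highest predicted probability."""
    probs = [sigmoid(z) for z in logits]
    ranked = sorted(zip(hashtag_list, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical hashtag list and classifier scores for one post.
hashtag_list = ["#cat", "#gift", "#travel", "#art"]
logits = [2.1, 0.3, -1.5, 1.2]

top2 = recommend_hashtags(logits, hashtag_list, k=2)
for tag, prob in top2:
    print(f"{tag}: {prob:.3f}")
```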

Experiments
We apply the cross-modality representation learning transformer encoder to the task of hashtag recommendation to evaluate its performance. In this section, we design experiments to answer the following research questions: (i) How much can cross-modality representation learning from the transformer help in hashtag recommendation compared to the traditional baseline methods? (ii) Does the cross-attention mechanism inspired by the self-attention approach of the transformer help for this task?
We evaluate with precision, recall, and F1-score, where $\mathrm{Precision} = TP / (TP + FP)$, $\mathrm{Recall} = TP / (TP + FN)$, and the F1-score is the harmonic mean of precision and recall. Here, TP is True Positive, FP is False Positive, and FN is False Negative.
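As a worked example of these metrics for a single post, the sketch below counts TP, FP, and FN by comparing the recommended hashtags with the ground-truth hashtags. How the scores are aggregated over the whole test set (for example, per post or over all decisions) is not specified here, so this per-post computation is an assumption, as are the function name and the example hashtags.

```python
def precision_recall_f1(recommended, ground_truth):
    """Compute precision, recall, and F1 for one post from hashtag sets."""
    rec, gt = set(recommended), set(ground_truth)
    tp = len(rec & gt)   # recommended and correct
    fp = len(rec - gt)   # recommended but not in the ground truth
    fn = len(gt - rec)   # in the ground truth but not recommended
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Two recommended hashtags, three ground-truth hashtags, one overlap.
p, r, f1 = precision_recall_f1(["#cat", "#art"], ["#cat", "#gift", "#pets"])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With one correct hashtag among two recommendations and three ground-truth hashtags, TP = 1, FP = 1, and FN = 2, which gives a precision of 0.5, a recall of 1/3, and an F1-score of 0.4.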

Baseline methods
We refer to our proposed model as LXMERT4Hashtag, a cross-modality representation learning transformer encoder for hashtag recommendation (Fig. 2). To evaluate the proposed model, we compare it against the following baseline methods:
• Topical Attention-based LSTM: TAB_LSTM [9] is a hashtag recommendation model for text-only content. It is an attention-based LSTM model that incorporates topic modeling into the LSTM architecture through an attention mechanism.
• LSTM-CNN Concat: We combine an LSTM and VGG for the hashtag prediction task. To combine text and image features, we project the image and text representations to the same dimension and then concatenate the vectors.
• Co-Attention Network: CoA [27] is the state-of-the-art hashtag recommendation method for multimodal (image and text) microblog content. This model applies the traditional co-attention mechanism to extract post features and then directly uses the features to make recommendations.
• Attention-based Multimodal Neural Network model: AMNN [30] is a sequence-generation, attention-based multimodal neural network that first extracts text and image features separately using a neural network with an attention mechanism (encoder). Then, it merges the image and text representations and feeds the output values into GRU networks to generate a sequence of recommended hashtags (decoder).
• Tweet Embedding Network: TweetEmbd_Net [34] is a task-agnostic model for multimodal Twitter data. This model combines the tweet components (image, text, hashtags, user, location, and time) for representation learning for downstream applications, including hashtag recommendation tasks.

Experimental setup
We perform the training task as follows. We split the entire dataset into three parts with a ratio of 8:1:1 for the training, validation, and test sets, respectively. We train our model on the training dataset to learn the cross-modal joint embedding. A multilabel classification head is used on top of the cross-modal joint embedding (the last hidden state of the [CLS] token) to generate the probability score of each hashtag from the hashtag list. We save the model that has the best performance on the validation dataset. For training all the baseline models and our proposed model, we truncate or pad each text sequence to a length of 128 word tokens. Instead of using the feature output of a convolutional neural network, we adopt Faster R-CNN (FRCNN) [36] with pre-trained bottom-up top-down attention (BUTD) [37] for the proposed model to extract a sequence of 36 objects from each image via bounding boxes on the image. Each object in the sequence is represented by its position feature and its 2048-dimensional region-of-interest (RoI) feature.
For the baseline models, image features are extracted using their own visual encoders. The BertAdam optimizer is used to optimize the proposed model. For all the baseline methods and our model, we use the validation data to tune the hyperparameters and report the performance of all the models on the same test dataset. All the models are trained for 75 epochs with a batch size of 18. Each model fits on a single Nvidia Titan V GPU with this batch size.
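Two of the preprocessing steps in this setup, the 8:1:1 split and the fixed text length of 128 tokens, can be sketched as follows. The shuffling, the random seed, the padding id of 0, and the function names are assumptions made for illustration; the exact procedure used in the experiments is not specified beyond the ratio and the sequence length.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle and split items into train/validation/test with an 8:1:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


def pad_or_truncate(token_ids, max_len=128, pad_id=0):
    """Cut a token sequence to max_len or right-pad it with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))

short_text = pad_or_truncate([101, 2023, 102])  # hypothetical token ids
long_text = pad_or_truncate(list(range(1, 201)))
print(len(short_text), len(long_text))
```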
To summarize the setup: given an input image with its corresponding text content, we aim to train a cross-modality model for the hashtag recommendation task with cross-attention based on the transformer attention mechanism. The whole process is illustrated in Fig. 2. We adopt the cross-modality head of the model, which learns the cross-modality representation on top of the image and text representations, and a multilabel classification head is used to assign a probability score to each hashtag in the hashtag list according to its relevance, based on the image-text features from the cross-modality representation; the top k hashtags are then selected and used for hashtag prediction.

Results and discussion
Since a large number of social media users prefer to share posts with corresponding images, finding an effective way to extract meaningful information from both the textual and the visual content of a post is very important for social media analysis.
Effectiveness comparisons. Comparing our proposed model with the baseline models, it is clear that the cross-modality representation learning transformer encoder can significantly improve the performance of the hashtag recommendation task. We evaluated the proposed model against state-of-the-art baseline methods on the same dataset. In Table 3, we compare the results of our proposed model and the state-of-the-art baseline methods. The evaluation metrics show that our proposed model outperforms the other methods. The evaluation results presented in Table 3 were obtained with the top two hashtag recommendations for each post in the test dataset. Figure 3 shows the evaluation metrics for the top K recommended hashtags on the test dataset. Each point on a curve corresponds to a number of recommended hashtags, ranging from 1 to 5. The y-axis represents the accuracy, precision, recall, and F1-score, respectively; the x-axis indicates the number K of recommended hashtags. The model with the highest curve on the graph has the best performance compared to the other methods. We can observe from the curves that the performance of our proposed model is the highest among all the compared methods. Figure 3 indicates that when K varies from 1 to 5, our proposed method achieves absolute improvements of 6.2% to 4.4%, 1.4% to 5.1%, and 2.3% to 4.8% in terms of precision, recall, and F1-score, respectively, compared with the best competitor, CoA. The proposed model significantly outperforms all the baseline methods on all the evaluation metrics. These improvements demonstrate the effectiveness of the cross-attention mechanism of the proposed model in the hashtag recommendation task.
Among the baseline methods, CoA performs comparatively better than the other methods because of its co-attention mechanism. Although AMNN also considers both visual and linguistic content, it performs poorly compared to CoA. The main reason is that AMNN is a sequence generation model that uses GRU networks to generate a sequence of recommended hashtags. However, in the dataset, we removed all the hashtags below the threshold, which means that the input hashtags in the dataset do not represent a good sequence of hashtags. This discrepancy in the dataset affects the performance of AMNN, leading to suboptimal results compared to CoA. The performance of the TweetEmbed_Network depends on components such as image, text, hashtags, user, location, and time. In our case, we used images, text, and hashtags. However, we acknowledge that location can have an important impact, and another possible reason could be the loss function. TAB_LSTM is a text-only method, yet we observe that it performs comparatively well. This indicates the usefulness of topic attention in hashtag recommendation tasks.

Fig. 3 Accuracy, precision, recall, and F1-score curves with different numbers of recommended hashtags. LXMERT4Hashtag significantly outperforms the compared methods in all the evaluation metrics
The experimental results of the proposed model demonstrate that cross-modality representation learning with a transformer can generate better correlations between the hashtags and the visual-linguistic content of the social media post. The competitive performance of the proposed model shows that the cross-attention of the cross-modality transformer encoder can produce high-level representations of multimodal social media content compared to the other baseline methods. Based on the above discussion, it can be concluded that the proposed LXMERT4Hashtag performs better in the hashtag recommendation task than the state-of-the-art methods.

Qualitative analysis
In this section, we present some examples for a qualitative analysis of the results of our proposed model. These examples cover a broad range of feature learning by the proposed model, including objects, the cross-attention mechanism, and implicit learning. Post (a) in Fig. 4 shows the rich feature representation learned by LXMERT4Hashtag from the corresponding image of an Instagram post. In this example, the hashtags are correctly recommended by using the features of the post image. We observe that the hashtags #interiordesign, #art, and #design are very difficult to predict from the post text, but the visual features give the clue. In post (b), we observe that the hashtags #fashion and #lifestyle do not appear in the microblog post. Both hashtags are correctly predicted from the cross-media representation of the Instagram post learned by LXMERT4Hashtag. From post (c), we observe that in addition to the explicit features, our model can also extract implicit meaning such as #travel and #explore, which demonstrates that the cross-attention mechanism of the proposed model is very effective at capturing context.

Conclusion
In this article, we presented a cross-attention mechanism inspired by the self-attention approach of the transformer for the hashtag recommendation task. The presented algorithm is based on the multimodality of social media content and hashtags. We converted the hashtag recommendation task into a multi-label classification task and introduced a cross-modality representation learning transfer model for it. As both the image and the text content of the user's social media post are important in the hashtag recommendation task, the proposed model adopts the cross-attention layer to exchange information and align the entities between the image and text modalities in order to learn joint cross-modality representations.
We designed experiments to evaluate the proposed model against several state-of-the-art models. Our model significantly outperformed the baseline methods, including the traditional co-attention mechanism, and achieved state-of-the-art performance for hashtag prediction; we found that cross-modality representation learning from the transformer encoder significantly helped in this task.
In the future, there are some options that we would like to investigate in the domain of multimodal hashtag prediction. We want to consider different architectures of the cross-modality transformer encoder in which the features from different modalities can be fused in a better way. Further, we want to investigate the effectiveness of a visual transformer encoder for improving our model, based on recent competitive visual transformer models. Furthermore, we want to explore training strategies that generalize to settings where some hashtags have never been observed during training. We believe that different model architectures could achieve even better results in handling the cross-modality features of social media posts for hashtag prediction. We leave these issues to be further optimized and improved in our future work.

Fig. 1 An example of a multimedia post from Instagram, where the text offers limited information; without the visual information, we cannot recommend the correct hashtags

$$\mathrm{F1\text{-}Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

For the performance evaluation of the proposed model, we pass the test dataset through our trained model to obtain its embedding and then perform multilabel classification to generate the probability score for each hashtag.

Table 1
Summary of the state-of-the-art selected literature on hashtag recommendation

Table 3
Evaluation results of different models for the hashtag recommendation task. LXMERT4Hashtag obtained better performance than the other methods