Image captioning model using attention and object features to mimic human image understanding

Image captioning spans the fields of computer vision and natural language processing. The image captioning task generalizes object detection, where the descriptions are a single word. Recently, most research on image captioning has focused on deep learning techniques, especially Encoder-Decoder models with Convolutional Neural Network (CNN) feature extraction. However, few works have tried using object detection features to increase the quality of the generated captions. This paper presents an attention-based, Encoder-Decoder deep architecture that makes use of convolutional features extracted from a CNN model pre-trained on ImageNet (Xception), together with object features extracted from the YOLOv4 model, pre-trained on MS COCO. This paper also introduces a new positional encoding scheme for object features, the “importance factor”. Our model was tested on the MS COCO and Flickr30k datasets, and its performance is compared with that of similar works. Our new feature extraction scheme raises the CIDEr score by 15.04%. The code is available at: https://github.com/abdelhadie-almalla/image_captioning


Related works
In [9], Yin and Ordonez suggested a sequence-to-sequence model in which an LSTM network encodes a series of objects and their positions as an input sequence and an LSTM language model decodes this representation to generate captions. Their model uses the YOLO [8] object detection model to extract object layouts from images (object categories and locations) and increase the accuracy of captions. They also present a variation that uses the VGG [10] image classification model pre-trained on ImageNet [11] to extract visual features. The encoder at each time step takes as input a pair of object category (encoded as a one-hot vector), and the location configuration vector that contains the left-most position, top-most position, width and height of the bounding box corresponding to the object, all normalized. The model is trained with back-propagation, but the error is not propagated to the object detection model. They showed that their model increased in accuracy when combined with CNN and YOLO modules. They did not use all available data from the object features produced by YOLO, such as object dimensions and confidence.
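The per-object encoder input described above can be sketched as follows. This is a minimal illustration, not the implementation from [9]; the function name `encode_object`, the pixel-based box format and the three-class example are assumptions.

```python
def encode_object(class_id, box, image_size, num_classes):
    """Build one encoder input: a one-hot category followed by a normalized box.

    box is (left, top, width, height) in pixels; image_size is (width, height).
    """
    one_hot = [0.0] * num_classes
    one_hot[class_id] = 1.0
    left, top, width, height = box
    img_w, img_h = image_size
    # Normalize every value to [0, 1] by the matching image dimension.
    location = [left / img_w, top / img_h, width / img_w, height / img_h]
    return one_hot + location


# Hypothetical detection of class 1 in a 256 x 128 image.
vector = encode_object(class_id=1, box=(64, 32, 128, 96),
                       image_size=(256, 128), num_classes=3)
print(vector)
```

A sequence of such vectors, one per detected object, would then form the input sequence of the encoder.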
In [12], Vo-Ho et al. developed an image captioning system that extracts object features from YOLO9000 [8] and Faster R-CNN [13]. Each type of feature is processed through an attention module to produce local features that represent the part that the model is currently focusing on. The two local feature sets are combined and fed into an LSTM model to generate the probabilities of the words in the vocabulary set at each time step. A beam search strategy is used to process the results, in order to choose the best candidate caption. They used the ResNet [14] CNN to extract the features from images. Given an input image, they first extracted a list of tags using YOLO9000, then broke each tag into words and eliminated redundant ones so that the list contained only unique words. Each word i, including the "null" token, is represented by a one-hot vector of the size of the vocabulary set. After that, they embedded each word into a d-dimensional space using a word embedding method. They used LSTM units for language generation. They kept only the top 20 tags with the highest probabilities.
In [15], Lanzendörfer et al. proposed a model for Visual Question Answering (VQA) based on iBOWIMG. The model extracts features from Inception V3 [16] as well as object features extracted from the YOLO [8] object detection model, and uses the attention mechanism. The outputs of YOLO are encoded as vectors of size 80 × 1 in order to give more informative features to the iBOWIMG model, with each column containing the number of detected objects of the given type. Three of these object vectors are produced for detection confidence thresholds of 25%, 50% and 75% and then concatenated with the image features and question features.
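The thresholded count vectors described above could be built along these lines. This is a sketch under assumptions, not the code from [15]: the detection format `(class_id, confidence)`, the function names and the use of `>=` at the threshold are all illustrative choices.

```python
def count_vector(detections, threshold, num_classes=80):
    """Count detections per class whose confidence meets the threshold."""
    counts = [0] * num_classes
    for class_id, confidence in detections:
        if confidence >= threshold:
            counts[class_id] += 1
    return counts


def object_features(detections, thresholds=(0.25, 0.50, 0.75)):
    """Concatenate one count vector per confidence threshold."""
    features = []
    for threshold in thresholds:
        features.extend(count_vector(detections, threshold))
    return features


# Hypothetical detections given as (class_id, confidence).
detections = [(0, 0.9), (0, 0.6), (16, 0.3)]
features = object_features(detections)
print(len(features))
```

With 80 classes and three thresholds, the concatenated object vector has 240 entries before it is joined with the image and question features.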
In [17], Herdade et al. proposed a spatial attention-based encoder-decoder model that explicitly integrates information about the spatial relationship between detected objects. They employed an object detector to extract appearance and geometry features from all detected objects in the image, then the Object Relation Transformer to generate caption text. They used Faster R-CNN [13] with ResNet-101 [14] as the base CNN for object detection and feature extraction. A Region Proposal Network (RPN) generates bounding boxes for object proposals using intermediate feature maps from the ResNet-101 as inputs. Overlapping bounding boxes with an intersection-over-union (IoU) exceeding a threshold of 0.7 are discarded, using non-maximum suppression. All bounding boxes where the class prediction probability is below a threshold of 0.2 are also discarded. Then, for each object bounding box, they perform mean-pooling over the spatial dimension to build a 2048-dimensional feature vector. These feature vectors are then input to the Transformer model.
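The two filtering steps described above (non-maximum suppression at an IoU of 0.7 and a probability threshold of 0.2) can be illustrated with a small greedy sketch. It is not the Faster R-CNN implementation used in [17]; the corner-based box format, the function names and the use of a single score for both steps are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(boxes, scores, iou_threshold=0.7, score_threshold=0.2):
    """Drop low-probability boxes, then apply greedy non-maximum suppression."""
    candidates = sorted(
        (i for i, score in enumerate(scores) if score >= score_threshold),
        key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.6, 0.1]
print(filter_boxes(boxes, scores))
```

In this example the second box is suppressed because it overlaps the first with an IoU of 0.9, and the last box is dropped for its low score.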
In [18] Wang et al. studied end-to-end image captioning with highly interpretable representations obtained from explicit object detection. They performed a detailed review of the effectiveness of a number of object detection-based cues for image captioning. They discovered that frequency counts, object size, and location are all useful and complement the accuracy of the captions produced. They also discovered that certain object categories had a greater effect than others on image captioning.
The work of Sharif et al. [19] suggested leveraging the linguistic relations between objects in an image to boost image captioning quality. They leverage "word embeddings" to capture word semantics and encapsulate the semantic relatedness of objects. The proposed model uses linguistically-aware relationship embeddings to capture the spatial and semantic proximity of object pairs. It also uses NASNet to capture the image's global semantics. As a result, true semantic relations that are not apparent in an image's visual content can be learned, allowing the decoder to focus on the most important object relations and visual features, resulting in more semantically-meaningful captions.
Variš et al. [20] investigated the possibility of textual and visual modalities sharing a common embedding space. They presented an approach that takes advantage of object detection labels' textual nature as well as the possible expressiveness of visual object representations built from them. They investigated whether grounding the representations in the captioning system's word embedding space, rather than grounding words or sentences in their associated images, could improve the captioning system's efficiency. Their proposed grounding approaches ensure that the predicted object features and the term embedding space are mutually grounded.
Alkalouti and Masre [21] proposed a model to automate video captioning based on an Encoder-Decoder architecture. They first select the most important frames from the video and remove redundant ones. They used the YOLO model to detect objects in video frames and an LSTM model for language generation.
In [22], Ke et al. investigated the feature extraction performance of 16 popular CNNs on a dataset of chest X-ray images. They did not find a relationship between the performance on ImageNet and the performance on the medical image dataset. However, they found that the choice of CNN architecture influences performance more than the concrete model within the model family for medical tasks. They also observed that ImageNet pre-training yields a statistically significant boost in performance across architectures, with a higher boost for smaller architectures.
In [23], Xu et al. proposed a novel Anchor-Captioner method. They started by identifying the significant tokens that should be given more attention and using them as anchors. The relevant texts for each chosen anchor were then grouped to create the associated anchor-centered graph (ACG). Finally, they implemented multi-view caption generation based on various ACGs in order to improve the content diversity of generated captions.
In [24], Chen et al. suggested Verb-specific Semantic Roles (VSR) as a new Controllable Image Captioning (CIC) control signal. VSR is made up of a verb and some semantic roles that reflect a specific activity and the roles of the entities involved in it. They trained a Grounded Semantic Role Labeling (GSRL) model to locate and ground all entities associated with each role given a VSR. Then, to learn human-like descriptive semantic structures, they suggested a Semantic Structure Planner (SSP). Lastly, they used a role-shift captioning model to generate the captions.
In [25], Cornia et al. presented a unique framework for image captioning, which allows both grounding and controllability to generate diverse descriptions. They produced the relevant caption using a recurrent architecture that explicitly predicts textual chunks based on regions and adheres to the control's limitations, given a control signal in the form of a series or a collection of image regions. Experiments are carried out using Flickr30k Entities and COCO Entities, a more advanced version of COCO that includes semi-automated grounding annotations. Their findings showed that the method produces state-of-the-art outcomes in terms of caption quality and diversity for controllable image captioning.
Unlike previous works, our approach takes advantage of all object features available. The experiments section shows the effect of this scheme.

Research methodology
The experimental method involves extracting object features from the YOLO model and introducing them along with CNN convolutional features to a simple deep learning model that uses the widespread Encoder-Decoder architecture with the attention mechanism. The "Results and discussion" section compares the results before and after adding the object features. Although previous research encoded object features as a vector, we add object features by simple concatenation and achieve a good improvement. We also test the impact of sorting the object tags extracted from YOLO according to a metric that we propose here.

Datasets used
We test our method on two datasets commonly used for image captioning: MS COCO and Flickr30k. Table 1 contains a brief comparison between them. They are both collected from the Flickr photo sharing website and consist of real-life images, annotated by humans (five annotations per image).
It is worth noting that MS COCO does not publish the labels of the testing set.

Evaluation metrics
We use a set of evaluation metrics that are widely used in the image captioning field. BLEU [26] metrics are commonly used in automated text evaluation and quantify the correspondence between a machine translation output and a human translation; in the case of image captioning, the machine translation output corresponds to the automatically produced caption, and the human translation corresponds to the human description of the image. METEOR [27] is computed using the harmonic mean of unigram precision and recall, with the recall having a higher weight than the precision, as follows:
$$F_{mean} = \frac{10 \cdot P \cdot R}{R + 9P}$$
where P and R are the unigram precision and recall, respectively. ROUGE-L [28] uses a Longest Common Subsequence (LCS) score to assess the adequacy and fluency of the produced text, while CIDEr [29] focuses on grammaticality and saliency. SPICE [30] evaluates the semantics of the produced text by creating a "scene graph" for both the original and generated captions, and then only matches the terms if their lemmatized WordNet representations are identical. BLEU, METEOR, and ROUGE have low correlations with human quality tests, while SPICE and CIDEr have a better correlation but are more difficult to optimize.
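Two of the building blocks mentioned above, the recall-weighted harmonic mean of METEOR and the longest common subsequence underlying ROUGE-L, can be sketched in a few lines. The weighting 10PR / (R + 9P) is the one from the original METEOR formulation; the full metric also applies a fragmentation penalty, which is omitted here. The function names and example sentences are illustrative.

```python
def meteor_fmean(precision, recall):
    """Harmonic mean that weights recall nine times more than precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return 10 * precision * recall / (recall + 9 * precision)


def lcs_length(candidate, reference):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(reference) + 1)
    for token in candidate:
        current = [0]
        for j, ref_token in enumerate(reference, start=1):
            if token == ref_token:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


candidate = "a man rides a horse".split()
reference = "a man is riding a horse".split()
# High recall with low precision scores better than the reverse.
print(meteor_fmean(0.5, 1.0), meteor_fmean(1.0, 0.5))
print(lcs_length(candidate, reference))
```

The first print shows the asymmetry between precision and recall; the second counts the four tokens ("a man a horse") shared in order by the two captions.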

Model
Our model uses an attention-based Encoder-Decoder architecture. It has two methods of feature extraction for image captioning: an image classification CNN (Xception [31]), and an object detection model (YOLOv4 [7]). The outputs of these models are combined by concatenation to produce a feature matrix that carries more information to the language decoder to predict more accurate descriptions. Unlike other works that embedded object features before combining them with CNN features, we use raw object layout information directly. Language generation is done using an attention module (Bahdanau attention [32]), a GRU [3] and two fully connected layers. Our model is simple, fast to train and evaluate, and generates captions using attention. We believe that if humans can benefit from object features (such as the class of an object, its position, and its size) to better understand an image, a computer model can benefit from this information as well. A scene containing a group of people standing close together, for example, may suggest a meeting, whereas sparse crowds can indicate a public location. Figure 1 depicts our model.

Image encoding
A. Pre-trained image classification CNN
In this work, we use the Xception CNN pre-trained on ImageNet [11] to extract spatial features.
Xception [31] (Extreme version of Inception) is inspired by Inception V3 [16], but instead of Inception modules, it has 71 layers with a modified depth-wise separable convolution. It outperforms Inception V3 thanks to better model parameter usage.
We extract features from the last layer before the fully connected layer, following recent works in image captioning. This allows the overall model to gain insight about the objects in the image and the relationships between them instead of just focusing on the image class.
In a previous work, different feature extraction CNN models were compared for image captioning applications. The results showed that Xception was among the most robust in extracting features, and for this reason it was chosen as the feature extraction model.
B. Object detection model
Our method uses the YOLOv4 [7] model because of its speed and good accuracy, which make it suitable for big data and real-time applications. The extracted features are a list of object features, with every object feature containing the X coordinate, Y coordinate, width, height, confidence rate (from 0 to 1 inclusive), class number and a novel optional "importance factor".
Following human intuition, foreground objects are normally larger and more important when describing an image, and background objects are normally smaller and less important. Furthermore, it makes sense to prefer more accurate pieces of information over less accurate ones. Hence, our importance factor tries to balance the importance of large foreground objects and of objects with high confidence rates. It is calculated for each object individually: it gives a higher score to large foreground objects than to small background ones, and a higher score to objects with high confidence than to objects with lower confidence.
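As an illustration of the behaviour described above, the sketch below assumes that the importance factor is the product of the confidence rate and the bounding-box area (width × height). This particular form is an assumption chosen to match the description; the function name and the example values are hypothetical.

```python
def importance_factor(width, height, confidence):
    """Assumed form: bounding-box area scaled by the detection confidence."""
    return width * height * confidence


# Hypothetical detections with normalized width and height.
foreground = importance_factor(width=0.6, height=0.8, confidence=0.9)
background = importance_factor(width=0.1, height=0.1, confidence=0.9)
uncertain = importance_factor(width=0.6, height=0.8, confidence=0.3)
print(foreground > background, foreground > uncertain)
```

Under this assumption, a large and confidently detected object outranks both a small object with the same confidence and an equally large object with lower confidence.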
After extracting object features, the importance factor is calculated for each object and concatenated to its tag. Then, all objects in the list are sorted according to this importance factor using the quick sort algorithm. Unlike previous works, our method makes use of all of the image's object information. Because of the size restriction in the output of the CNN, we use up to 292 objects, each with seven attributes (including the importance factor), which is usually enough to represent important objects in an image.
The list of features is flattened into a 1D array, of length less than 2048. It is then padded with zeros to length 2048 to be compatible with the output of the CNN module. The output of this stage is an array (1 × 2048).
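The sorting, flattening and padding steps can be sketched as follows. Several details are assumptions: the importance factor is taken to be width × height × confidence, objects are sorted in descending order, and Python's built-in sort stands in for quick sort. The function name `build_object_row` is hypothetical.

```python
MAX_OBJECTS = 292      # 292 objects x 7 attributes = 2044 values
FEATURE_LENGTH = 2048  # row length of the CNN feature matrix


def build_object_row(detections):
    """Turn detections into one zero-padded row of length FEATURE_LENGTH.

    Each detection is (x, y, width, height, confidence, class_id).
    """
    rows = []
    for x, y, w, h, confidence, class_id in detections:
        importance = w * h * confidence  # assumed form of the importance factor
        rows.append([x, y, w, h, confidence, float(class_id), importance])
    # Most important objects first (descending order is an assumption).
    rows.sort(key=lambda row: row[6], reverse=True)
    flat = [value for row in rows[:MAX_OBJECTS] for value in row]
    return flat + [0.0] * (FEATURE_LENGTH - len(flat))


detections = [(0.5, 0.5, 0.1, 0.1, 0.9, 2), (0.4, 0.6, 0.6, 0.8, 0.8, 0)]
row = build_object_row(detections)
print(len(row), row[:7])
```

The larger object (class 0) is placed first even though its confidence is slightly lower, and the remaining positions of the row are zeros.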
As for calculating the confidence score, YOLO divides an image into a grid. B bounding boxes and confidence scores for these boxes are predicted in each of these grid cells. The confidence score indicates how confident the model is that the box includes an object, as well as how accurate the model believes the predicted box is. The object detection algorithm is evaluated using Intersection over Union (IoU) between the predicted box and the ground truth, which measures how similar the predicted box is to the ground truth by calculating the overlap between them. A cell's confidence score should be zero if no object exists in it. The formula for calculating the confidence score is:
$$\text{confidence} = \Pr(\text{Object}) \times \mathrm{IoU}^{\text{truth}}_{\text{pred}}$$
C. Concatenation and embedding
In order to take advantage of both the image classification features and the object detection features, we add this concatenation step, where we attach the output of the YOLOv4 subsystem as the last row in the output of stage 1. The output of this stage is of shape (101 × 2048).
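For the confidence score discussed above, a minimal sketch of the IoU computation and the resulting score could look as follows. The box format (left, top, width, height) and the function names are assumptions; the product of the object probability and the IoU follows the original YOLO formulation.

```python
def box_iou(pred, truth):
    """IoU of two boxes given as (left, top, width, height)."""
    left = max(pred[0], truth[0])
    top = max(pred[1], truth[1])
    right = min(pred[0] + pred[2], truth[0] + truth[2])
    bottom = min(pred[1] + pred[3], truth[1] + truth[3])
    intersection = max(0.0, right - left) * max(0.0, bottom - top)
    union = pred[2] * pred[3] + truth[2] * truth[3] - intersection
    return intersection / union if union > 0 else 0.0


def confidence_score(object_probability, pred, truth):
    """Confidence as Pr(Object) multiplied by IoU(pred, truth)."""
    return object_probability * box_iou(pred, truth)


pred = (0.0, 0.0, 2.0, 2.0)
truth = (1.0, 0.0, 2.0, 2.0)
with_object = confidence_score(1.0, pred, truth)
without_object = confidence_score(0.0, pred, truth)
print(with_object, without_object)
```

When no object is present the score is zero regardless of the overlap, which matches the requirement stated for empty cells.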
The embedding is done using one fully connected layer of length 256. This stage ensures a consistent size of the features and maps the feature space to a smaller space appropriate for the language decoder.
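The concatenation and embedding stages can be illustrated with a toy example in plain Python. The real shapes would be 100 CNN rows of length 2048 plus one object row, projected to length 256; the small dimensions, random weights, the absence of an activation function and the function names below are assumptions made to keep the sketch fast and self-contained.

```python
import random


def concatenate(cnn_features, object_row):
    """Append the object row as the last row of the CNN feature matrix."""
    assert all(len(row) == len(object_row) for row in cnn_features)
    return cnn_features + [object_row]


def dense(features, weights, bias):
    """Apply one fully connected layer to every row: row @ weights + bias."""
    columns = list(zip(*weights))  # weights has shape (in_dim, out_dim)
    return [
        [sum(x * w for x, w in zip(row, column)) + b
         for column, b in zip(columns, bias)]
        for row in features
    ]


random.seed(0)
in_dim, out_dim, cnn_rows = 8, 3, 4  # toy stand-ins for 2048, 256 and 100
cnn_features = [[random.random() for _ in range(in_dim)] for _ in range(cnn_rows)]
object_row = [random.random() for _ in range(in_dim)]
weights = [[random.gauss(0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]
bias = [0.0] * out_dim

combined = concatenate(cnn_features, object_row)
embedded = dense(combined, weights, bias)
print(len(combined), len(combined[0]))
print(len(embedded), len(embedded[0]))
```

The number of rows grows by one after concatenation, and the embedding only changes the length of each row, so every feature position stays available to the attention module.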

D. Attention
Our method uses the Bahdanau soft attention system [32]. This deterministic attention mechanism makes the model as a whole smooth and differentiable.
The term "attention" refers to a strategy that simulates cognitive attention. The effect highlights the most important parts of the input data while fading the rest. The concept is that the network should dedicate more computing resources to that small but critical portion of the data. Which component of the data is more relevant than others is determined by the context and is learned by gradient descent using training data. Attention is used in a number of machine learning tasks in natural language processing and computer vision.
The attention mechanism was created to increase the performance of the encoder-decoder architecture for machine translation. Since image captioning can be viewed as a specific case of machine translation, attention proved useful when analyzing images as well. The attention mechanism was intended to allow the decoder to use the most relevant parts of the input sequence in a flexible manner by combining all of the encoded input vectors into a weighted combination, with the most relevant vectors receiving the highest weights.
Attention follows the human intuition of focusing on different parts of an image when describing it. Using object detection features also follows the intuition that knowing about object classes and positions helps to grasp more about the image than mere convolutional features. When attention is applied to both feature types, the system will focus on different features of both object classes and positions in the same image.
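The weighted combination computed by additive (Bahdanau-style) attention can be sketched in plain Python. The score for each feature row is v · tanh(W1 f + W2 h), followed by a softmax over the rows. The tiny dimensions, identity weight matrices and function names are illustrative assumptions, not the trained parameters of the model.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def bahdanau_attention(features, hidden, w_feat, w_hidden, v):
    """Additive attention; returns the context vector and attention weights."""
    projected_hidden = matvec(w_hidden, hidden)
    scores = []
    for feature in features:
        projected = matvec(w_feat, feature)
        activated = [math.tanh(a + b) for a, b in zip(projected, projected_hidden)]
        scores.append(sum(vi * ai for vi, ai in zip(v, activated)))
    # Softmax over feature positions (shifted by the max for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the feature rows.
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(len(features[0]))]
    return context, weights


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = bahdanau_attention(features, hidden, identity, identity, [1.0, 1.0])
print(weights, context)
```

Because every operation is smooth, the weights can be learned by gradient descent together with the rest of the network. In the model, the object row is simply one more feature position that can receive weight.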

Language decoder
For decoding, a GRU [3] is used to exploit its speed and low memory usage. It produces a caption by generating one word at every time step, conditioned on a context vector, the previous hidden state, and the previously generated words. The model is trained using the backpropagation algorithm deterministically. The GRU is followed by two fully connected layers. The first one is of length 512, and the second one is of the size of the vocabulary to produce output text.
The training process for the decoder is as follows:
1. The features are extracted, then passed through the encoder.
2. The decoder receives the encoder output, the hidden state (initialized to 0), and the decoder input (which is the start token).
3. The decoder returns the predictions as well as the hidden state of the decoder.
4. The hidden state of the decoder is then passed back into the model, and the loss is calculated using the predictions.
5. To determine the next decoder input, "teacher forcing" is employed, which is a technique that passes the target word as the next input to the decoder.
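The training steps above can be sketched as a loop over the target caption. The callable `decoder_step`, its signature, and the uniform stand-in decoder are hypothetical; a real decoder would be the attention, GRU and fully connected layers described in this section.

```python
import math


def train_step(decoder_step, encoder_output, target_ids, start_id, hidden_size):
    """Average cross-entropy for one caption, computed with teacher forcing.

    decoder_step(token_id, encoder_output, hidden) is a hypothetical callable
    returning (probabilities over the vocabulary, new hidden state).
    """
    hidden = [0.0] * hidden_size  # hidden state initialized to zero
    decoder_input = start_id      # the first input is the start token
    total_loss = 0.0
    for target in target_ids:
        probabilities, hidden = decoder_step(decoder_input, encoder_output, hidden)
        total_loss += -math.log(probabilities[target])  # loss from predictions
        decoder_input = target    # teacher forcing: feed the target word next
    return total_loss / len(target_ids)


def uniform_decoder(token_id, encoder_output, hidden, vocab_size=4):
    """Stand-in decoder that predicts every word with equal probability."""
    return [1.0 / vocab_size] * vocab_size, hidden


loss = train_step(uniform_decoder, encoder_output=None,
                  target_ids=[2, 3, 1], start_id=0, hidden_size=8)
print(loss)
```

The important design point is the last line of the loop: the next input is the ground-truth word rather than the word the decoder just predicted, which stabilizes early training.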

Pre-processing
This section presents the pre-processing steps that were performed on the data:
1. Shuffle the image-caption pairs of the dataset at random. This helps the training process converge quickly and prevents the model from learning the order of the training data.
2. Read and decode the images.
3. Resize the images to the CNN requirements: whatever its original size, each image is resized to 299 × 299, as required by the Xception CNN model.
4. Tokenize the text. Tokenization breaks the raw text into words that are separated by punctuation, special characters, or white space. The separators are discarded.
5. Count the tokens, sort them by frequency and choose the 15,000 most common words as the system's vocabulary. This avoids over-fitting by eliminating terms that are not likely to be useful.
6. Generate word-to-index and index-to-word structures, which are then used to translate token sequences into word identifier sequences.
7. Pad the sequences. As sentences can differ in length, the inputs must be brought to the same size. Identifier sequences are padded at the end with null tokens to ensure that they are all of the same length.
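The text-related steps above (shuffling, tokenization, vocabulary selection, index structures and padding) can be sketched with the Python standard library. Image decoding and resizing are left out. The special tokens `<pad>` and `<unk>`, the lowercasing, the tokenization pattern and the function names are assumptions.

```python
import random
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def tokenize(text):
    """Lowercase the text and split it on punctuation and white space."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(captions, max_words=15000):
    """Map the most frequent words to integer ids; id 0 is the padding token."""
    counts = Counter(token for caption in captions for token in tokenize(caption))
    words = [PAD, UNK] + [word for word, _ in counts.most_common(max_words)]
    word_to_index = {word: index for index, word in enumerate(words)}
    index_to_word = {index: word for word, index in word_to_index.items()}
    return word_to_index, index_to_word


def encode(captions, word_to_index):
    """Convert captions to id sequences padded at the end to equal length."""
    sequences = [[word_to_index.get(t, word_to_index[UNK]) for t in tokenize(c)]
                 for c in captions]
    longest = max(len(s) for s in sequences)
    return [s + [word_to_index[PAD]] * (longest - len(s)) for s in sequences]


pairs = [("img1.jpg", "A dog runs."), ("img2.jpg", "A dog chases a red ball!")]
random.Random(0).shuffle(pairs)  # shuffle the image-caption pairs
captions = [caption for _, caption in pairs]
word_to_index, index_to_word = build_vocab(captions)
padded = encode(captions, word_to_index)
print(padded)
```

Words outside the chosen vocabulary map to the unknown token, and every sequence ends up as long as the longest caption in the batch.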

Results and discussion
Our code is written in the Python programming language using the TensorFlow library. The CNN implementation and trained model were imported from the Keras library, and a YOLOv4 model pre-trained on MS COCO was imported from the yolov4 library. This work uses the MS COCO evaluation tool to calculate scores.
Tests are conducted on two widely used datasets for image caption generation: MS COCO and Flickr30k. Every image has five reference captions in these two datasets, which contain 123,000 and 31,000 images, respectively. For MS COCO, 5000 images are reserved for validation and 5000 images are reserved for testing, according to Karpathy's split [33]. For the Flickr30k dataset, 29,000 images are used for training, 1000 for validation, and 1000 for testing. The model was trained for 20 epochs with Sparse Categorical Cross Entropy as the loss function and Adam as the optimizer. Table 2 presents the results of the proposed model on the MS COCO Karpathy split and compares them to the results of the baseline model with features only from Xception. The evaluation scores increase after adding object features to the model, especially the CIDEr score, which increased by 15.04%. This reflects a good improvement in correlation with human judgment when using full object features, as well as improved grammatical integrity and saliency. The importance factor appears to increase the BLEU metrics and decrease METEOR slightly, whereas the other metric values stay the same. Unlike the findings of Herdade et al. [17], who tested multiple artificial positional encoding schemes and compared them to their geometric attention mechanism, our artificial positional encoding scheme did not decrease the CIDEr score.
To show the effectiveness of our method, we compare our increase in results (with the importance factor) to the increase in results of Yin and Ordonez [9] on the MS COCO Karpathy split in Table 3. They also measured the effects of incorporating object features on image captioning results. Their object feature extraction method extracts object layouts from the YOLO9000 model and encodes them through an LSTM. Their baseline model has higher accuracy than ours, which may explain the difference between our scores and theirs. They did not report the BLEU-1, BLEU-2, BLEU-3 or SPICE scores. We notice in Table 3 that our results are somewhat comparable to those of Yin and Ordonez [9]. We report all eight standard evaluation scores. The introduction of this type of feature extraction improves all evaluation scores over our baseline model. The increase in the SPICE score (5.88%) reflects increased semantic correlation when using object features, an expected consequence of feeding object tags into the model. SPICE is one of the metrics that are harder to optimize. The score difference between our model and theirs may be related to the feature combination and encoding method. They encode each feature type in a vector, and then add the two vectors, while our model concatenates the two feature sets directly.
We also compare our work with the work of Sharif et al. [19], who tried to benefit from linguistic relations between objects in an image, and we present a comparison between our model and theirs on the Flickr30k dataset [34] in Table 4. Table 4 shows that our method also yields an improvement on Flickr30k, with the largest improvement being in the METEOR score. Sharif et al. benefited from linguistic information in addition to object detection features. Figure 3 displays a comparison between the baseline model and the model enhanced with object features, on the validation and testing sets of the MS COCO Karpathy split [33]. We see a clear increase in the results on all evaluation metrics on both sets, which indicates low generalization error and supports our hypothesis that enhancing the vision model with object detection features improves accuracy.
In order to compare the textual outputs of the approach qualitatively, we present in Fig. 4 a comparison between the results with object features and without them. The difference is remarkable: the addition of the object features makes the sentences more salient grammatically, with fewer object mistakes. In (a), for example, a skier was identified instead of just the skiing boots. In (b), the model before incorporating object features had mixed up people and snowboards. In (c), the two cows were correctly identified after adding object features. In (d), the model without object features falsely identified a man in the picture. In (e), the model could not identify the third bear without object features. In (f), object features helped to identify a group of people instead of only two women.

Conclusions
In this paper, we presented an attention-based Encoder-Decoder image captioning model that uses two methods of feature extraction, an image classification CNN (Xception) and an object detection module (YOLOv4), and demonstrated the effectiveness of this scheme. We introduced the importance factor, which prioritizes large foreground objects over small background ones and favors objects with high confidence over those with low confidence, and demonstrated its effect on increasing scores. We showed how our method improved the scores and compared the score increase to that of previous works, especially for the CIDEr metric, which increased by 15.04%, reflecting improved grammatical saliency. Unlike previous works, our work proposed benefiting from all object detection features extracted from YOLO and showed the effect of sorting the extracted object tags. This can be further improved by better methods for combining object detection features