Journal of Big Data

Table 6 Performance comparison of the proposed method on the Flickr30K dataset. For each metric, the two best-scoring models (higher is better) are shown in red and green

From: Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction

Method	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	CIDEr
Deep VS [25]	57.3	36.9	24.0	15.7	15.3	-
emb-gLSTM [5]	64.6	44.6	30.5	20.6	17.9	-
Soft attn [17]	66.7	43.4	28.8	19.1	18.5	-
Hard attn [17]	66.9	43.9	29.6	19.9	18.5	-
ATT [55]	64.7	46.0	32.4	23.0	18.9	-
SCA-CNN [14]	66.2	46.8	32.5	22.3	19.5	-
avtmNet [58]	-	-	-	24.8	20.8	59.8
Ours	70.1	49.4	35.8	27.2	21.7	67.3
