Table 4 Performance comparison of the proposed method against prior methods on the MSCOCO dataset. (-) indicates that the metric is not reported. The two best values for each metric are highlighted in red and green in the original. B@n: BLEU-n; MT: METEOR; R: ROUGE-L; CD: CIDEr.

From: Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction

| Method | B@1 | B@2 | B@3 | B@4 | MT | R | CD |
|---|---|---|---|---|---|---|---|
| Deep VS [25] | 62.5 | 45.0 | 32.1 | 23.0 | 19.5 | - | 66.0 |
| emb-gLSTM [5] | 67.0 | 49.1 | 35.8 | 26.4 | 22.74 | - | 81.25 |
| Soft attn [17] | 70.7 | 49.2 | 34.4 | 24.3 | 23.9 | - | - |
| Hard attn [17] | 71.8 | 50.4 | 35.7 | 25.0 | 23.04 | - | - |
| ATT [55] | 70.9 | 53.7 | 40.2 | 30.4 | 24.3 | - | - |
| SCA-CNN [14] | 71.9 | 54.8 | 41.1 | 31.1 | 25.0 | - | - |
| LSTM-A [56] | 75.4 | - | - | 35.2 | 26.9 | 55.8 | 108.8 |
| Up-down [7] | 77.2 | - | - | 36.2 | 27.0 | 56.4 | 113.5 |
| SCST [42] | - | - | - | 34.2 | 26.7 | 55.7 | 114.0 |
| RFNet [20] | 76.4 | 60.4 | 46.6 | 35.8 | 27.4 | 56.5 | 112.5 |
| GCN-LSTM [57] | 77.4 | - | - | 37.1 | 28.1 | 57.2 | 117.1 |
| avtmNet [58] | - | - | - | 33.2 | 27.3 | 56.7 | 112.6 |
| ERNN [59] | 73.2 | 56.9 | 42.9 | 32.2 | 25.2 | - | 101.4 |
| Tri-LSTM [62] | - | - | - | 37.3 | 28.4 | 58.1 | 123.5 |
| TDA+GLD [61] | 78.8 | 62.6 | 48.0 | 36.1 | 27.8 | 57.1 | 121.1 |
| Ours | 78.5 | 62.0 | 49.1 | 38.2 | 28.9 | 58.3 | 124.2 |
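For orientation, the B@1–B@4 columns are corpus-level BLEU scores: a geometric mean of clipped n-gram precisions multiplied by a brevity penalty. The sketch below is a minimal pure-Python illustration of that computation; it is not the evaluation code behind the table (captioning papers typically report scores from the standard COCO caption evaluation toolkit), and the toy sentences are made up for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidates, references_list, n):
    """Clipped n-gram precision over the whole corpus: each candidate
    n-gram count is clipped by its maximum count in any reference."""
    match, total = 0, 0
    for cand, refs in zip(candidates, references_list):
        cand_counts = Counter(ngrams(cand, n))
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        match += sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        total += max(len(cand) - n + 1, 0)
    return match / total if total else 0.0

def corpus_bleu(candidates, references_list, n_max=4):
    """BLEU-n_max (Papineni et al., 2002): geometric mean of the
    clipped 1..n_max-gram precisions times the brevity penalty."""
    precisions = [modified_precision(candidates, references_list, n)
                  for n in range(1, n_max + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # brevity penalty: total candidate length vs. the closest
    # reference length per caption, summed over the corpus
    c_len = sum(len(c) for c in candidates)
    r_len = sum(min((len(r) for r in refs), key=lambda L: abs(L - len(c)))
                for c, refs in zip(candidates, references_list))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / n_max)

# toy example: one generated caption with two human references
cand = ["a", "man", "riding", "a", "horse", "on", "the", "beach"]
refs = [["a", "man", "rides", "a", "horse", "on", "a", "beach"],
        ["someone", "riding", "a", "horse", "near", "the", "sea"]]
for n in range(1, 5):
    print(f"B@{n} = {100 * corpus_bleu([cand], [refs], n):.1f}")
```

Note that B@1 through B@4 are increasingly strict: each higher n adds a longer-n-gram precision to the geometric mean, which is why every row in the table decreases from B@1 to B@4.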