From: Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction
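| Method | B@1 | B@2 | B@3 | B@4 | MT |
|---|---|---|---|---|---|
| Deep VS [25] | 57.9 | 38.3 | 24.5 | 16.0 | - |
| emb-gLSTM [5] | 64.7 | 45.9 | 31.8 | 21.2 | 20.6 |
| Soft attn [17] | 67.0 | 44.8 | 29.9 | 19.5 | 18.9 |
| Hard attn [17] |  | 45.7 | 31.4 | 21.3 | 20.3 |
| SCA-CNN [14] | 68.2 | 49.6 | 35.9 | 25.8 | 22.4 |
| Ours | 70.5 | 50.2 | 37.3 | 28.6 |  |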