Table 5. Performance comparison of the proposed method on the Flickr8K dataset. The two models with the highest values of each metric are shown in red and green.

From: Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction

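| Method | B@1 | B@2 | B@3 | B@4 | MT |
|---|---|---|---|---|---|
| Deep VS [25] | 57.9 | 38.3 | 24.5 | 16.0 | - |
| emb-gLSTM [5] | 64.7 | 45.9 | 31.8 | 21.2 | 20.6 |
| Soft attn [17] | 67.0 | 44.8 | 29.9 | 19.5 | 18.9 |
| Hard attn [17] | 67.0 | 45.7 | 31.4 | 21.3 | 20.3 |
| SCA-CNN [14] | 68.2 | 49.6 | 35.9 | 25.8 | 22.4 |
| Ours | 70.5 | 50.2 | 37.3 | 28.6 | 24.5 |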