Table 6. Performance comparison of the proposed method on the Flickr30K dataset. The two best-scoring models for each metric are shown in red and green.

From: Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction

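| Method | B@1 | B@2 | B@3 | B@4 | MT | CD |
|---|---|---|---|---|---|---|
| Deep VS [25] | 57.3 | 36.9 | 24.0 | 15.7 | 15.3 | |
| emb-gLSTM [5] | 64.6 | 44.6 | 30.5 | 20.6 | 17.9 | - |
| Soft attn [17] | 66.7 | 43.4 | 28.8 | 19.1 | 18.5 | - |
| Hard attn [17] | 66.9 | 43.9 | 29.6 | 19.9 | 18.5 | - |
| ATT [55] | 64.7 | 46.0 | 32.4 | 23.0 | 18.9 | - |
| SCA-CNN [14] | 66.2 | 46.8 | 32.5 | 22.3 | 19.5 | - |
| avtmNet [58] | - | - | - | 24.8 | 20.8 | 59.8 |
| Ours | 70.1 | 49.4 | 35.8 | 27.2 | 21.7 | 67.3 |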