Table 4 Performance comparison of the proposed method against prior methods on the MSCOCO dataset. (-) indicates that the metric is not reported. The two best values for each metric are highlighted in red and green in the original. B@n: BLEU-n; MT: METEOR; R: ROUGE-L; CD: CIDEr.

From: Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction

| Method | B@1 | B@2 | B@3 | B@4 | MT | R | CD |
|---|---|---|---|---|---|---|---|
| Deep VS [25] | 62.5 | 45.0 | 32.1 | 23.0 | 19.5 | - | 66.0 |
| emb-gLSTM [5] | 67.0 | 49.1 | 35.8 | 26.4 | 22.74 | - | 81.25 |
| Soft attn [17] | 70.7 | 49.2 | 34.4 | 24.3 | 23.9 | - | - |
| Hard attn [17] | 71.8 | 50.4 | 35.7 | 25.0 | 23.04 | - | - |
| ATT [55] | 70.9 | 53.7 | 40.2 | 30.4 | 24.3 | - | - |
| SCA-CNN [14] | 71.9 | 54.8 | 41.1 | 31.1 | 25.0 | - | - |
| LSTM-A [56] | 75.4 | - | - | 35.2 | 26.9 | 55.8 | 108.8 |
| Up-down [7] | 77.2 | - | - | 36.2 | 27.0 | 56.4 | 113.5 |
| SCST [42] | - | - | - | 34.2 | 26.7 | 55.7 | 114.0 |
| RFNet [20] | 76.4 | 60.4 | 46.6 | 35.8 | 27.4 | 56.5 | 112.5 |
| GCN-LSTM [57] | 77.4 | - | - | 37.1 | 28.1 | 57.2 | 117.1 |
| avtmNet [58] | - | - | - | 33.2 | 27.3 | 56.7 | 112.6 |
| ERNN [59] | 73.2 | 56.9 | 42.9 | 32.2 | 25.2 | - | 101.4 |
| Tri-LSTM [62] | - | - | - | 37.3 | 28.4 | 58.1 | 123.5 |
| TDA+GLD [61] | 78.8 | 62.6 | 48.0 | 36.1 | 27.8 | 57.1 | 121.1 |
| Ours | 78.5 | 62.0 | 49.1 | 38.2 | 28.9 | 58.3 | 124.2 |
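For orientation, the B@1–B@4 columns are corpus-level BLEU scores: a geometric mean of clipped n-gram precisions multiplied by a brevity penalty. The sketch below is a minimal pure-Python illustration of that computation; it is not the evaluation code behind the table (captioning papers typically report scores from the standard COCO caption evaluation toolkit), and the toy sentences are made up for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidates, references_list, n):
    """Clipped n-gram precision over the whole corpus: each candidate
    n-gram count is clipped by its maximum count in any reference."""
    match, total = 0, 0
    for cand, refs in zip(candidates, references_list):
        cand_counts = Counter(ngrams(cand, n))
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        match += sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        total += max(len(cand) - n + 1, 0)
    return match / total if total else 0.0

def corpus_bleu(candidates, references_list, n_max=4):
    """BLEU-n_max (Papineni et al., 2002): geometric mean of the
    clipped 1..n_max-gram precisions times the brevity penalty."""
    precisions = [modified_precision(candidates, references_list, n)
                  for n in range(1, n_max + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # brevity penalty: total candidate length vs. the closest
    # reference length per caption, summed over the corpus
    c_len = sum(len(c) for c in candidates)
    r_len = sum(min((len(r) for r in refs), key=lambda L: abs(L - len(c)))
                for c, refs in zip(candidates, references_list))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / n_max)

# toy example: one generated caption with two human references
cand = ["a", "man", "riding", "a", "horse", "on", "the", "beach"]
refs = [["a", "man", "rides", "a", "horse", "on", "a", "beach"],
        ["someone", "riding", "a", "horse", "near", "the", "sea"]]
for n in range(1, 5):
    print(f"B@{n} = {100 * corpus_bleu([cand], [refs], n):.1f}")
```

Note that B@1 through B@4 are increasingly strict: each higher n adds a longer-n-gram precision to the geometric mean, which is why every row in the table decreases from B@1 to B@4.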