From: Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction
| Decomposition levels | MSCOCO B@4 | MSCOCO MT | MSCOCO CD |
|---|---|---|---|
| 1-level | 52.87 | 35.14 | 90.39 |
| 2-level | 53.64 | 36.53 | 91.71 |
| 3-level | 53.69 | 36.90 | 91.89 |