From: Image captioning model using attention and object features to mimic human image understanding
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr | ROUGE-L | SPICE |
---|---|---|---|---|---|---|---|---|
Baseline model | 0.463 | 0.273 | 0.156 | 0.087 | 0.157 | 0.339 | 0.345 | 0.102 |
Ours (with YOLO bounding boxes, without the importance factor) | 0.486 | 0.293 | 0.173 | 0.099 | 0.164 | 0.390 | 0.358 | 0.108 |
Ours (with YOLO bounding boxes and the importance factor) | 0.492 | 0.296 | 0.174 | 0.101 | 0.163 | 0.390 | 0.358 | 0.108 |
Increase due to the importance factor (%) | 1.23 | 1.02 | 0.57 | 2.02 | −0.99 | 0 | 0 | 0 |
Increase over the baseline model (%) | 6.26 | 8.42 | 11.53 | 16.09 | 3.82 | 15.04 | 3.76 | 5.88 |
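
The percentage rows can be checked against the scores shown in the table. The sketch below assumes that "increase" means the relative change, (new − old) / old × 100, computed from the rounded scores as printed. The source does not state the formula, and the names `METRICS`, `baseline`, `ours_full`, and `relative_increase_pct` are illustrative only.

```python
# Metric names and scores copied from the table above.
METRICS = ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4",
           "METEOR", "CIDEr", "ROUGE-L", "SPICE"]

baseline = [0.463, 0.273, 0.156, 0.087, 0.157, 0.339, 0.345, 0.102]
# "Ours (with YOLO bounding boxes and the importance factor)"
ours_full = [0.492, 0.296, 0.174, 0.101, 0.163, 0.390, 0.358, 0.108]


def relative_increase_pct(new: float, old: float) -> float:
    """Relative change from old to new, in percent (assumed definition)."""
    return (new - old) / old * 100.0


# Recompute the "Increase over the baseline model (%)" row.
increase_over_baseline = {
    name: relative_increase_pct(new, old)
    for name, new, old in zip(METRICS, ours_full, baseline)
}

for name, pct in increase_over_baseline.items():
    print(f"{name}: {pct:.2f}%")
```

Under this assumption the recomputed values agree with the "Increase over the baseline model (%)" row to within 0.01 percentage points. The last digit can differ in a few cells, which may be because the published figures were truncated rather than rounded, or were derived from unrounded scores.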