Image captioning model using attention and object features to mimic human image understanding

Journal of Big Data

Table 4 A comparison with the results of Sharif et al. [19] on the Flickr30k test split

Method	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	CIDEr	ROUGE-L	SPICE
Sharif et al.’s baseline model	0.4368	NA	NA	NA	0.1297	0.2517	0.2997	0.0700
Sharif et al.’s suggested model	0.4462	NA	NA	NA	0.1350	0.2835	0.3116	0.0741
Our baseline model	0.3990	0.2200	0.1170	0.0620	0.1230	0.1480	0.2930	0.0740
Our model (with the importance factor)	0.3980	0.2210	0.1160	0.0610	0.1290	0.1500	0.2980	0.0740
Sharif et al.’s increase (%)	2.15	NA	NA	NA	4.08	12.63	3.97	5.85
Our increase (%)	−0.25	0.45	−0.86	−1.63	4.87	1.35	1.7	0
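
The "increase (%)" rows appear to report the relative change of each model over its own baseline. The sketch below shows that arithmetic, assuming the formula (improved − baseline) / baseline × 100, which the source does not state explicitly. The function name `relative_increase` and the dictionary layout are illustrative only. Because the scores in the table are already rounded, values recomputed this way can differ from the printed percentages in the second decimal place.

```python
def relative_increase(baseline: float, improved: float) -> float:
    """Percentage change of `improved` relative to `baseline` (assumed formula)."""
    return (improved - baseline) / baseline * 100.0


# (baseline, improved) score pairs copied from Table 4.
# Metrics reported as NA for Sharif et al. are omitted.
sharif = {
    "BLEU-1": (0.4368, 0.4462),
    "METEOR": (0.1297, 0.1350),
    "CIDEr": (0.2517, 0.2835),
    "ROUGE-L": (0.2997, 0.3116),
    "SPICE": (0.0700, 0.0741),
}
ours = {
    "BLEU-1": (0.3990, 0.3980),
    "BLEU-2": (0.2200, 0.2210),
    "BLEU-3": (0.1170, 0.1160),
    "BLEU-4": (0.0620, 0.0610),
    "METEOR": (0.1230, 0.1290),
    "CIDEr": (0.1480, 0.1500),
    "ROUGE-L": (0.2930, 0.2980),
    "SPICE": (0.0740, 0.0740),
}

for label, scores in (("Sharif et al.", sharif), ("Ours", ours)):
    for metric, (base, new) in scores.items():
        # Print the signed percentage change with two decimals.
        print(f"{label:14s} {metric:8s} {relative_increase(base, new):+.2f}%")
```

Comparing relative rather than absolute changes is what makes the two rows comparable here, since the two baselines start from different absolute scores.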