From: Semantic context driven language descriptions of videos using deep neural network
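Models (MSVD) | METEOR (2-layer stacked LSTM) | ROUGE (2-layer stacked LSTM) | CIDEr (2-layer stacked LSTM) | SPICE (2-layer stacked LSTM) | METEOR (3-layer stacked LSTM) | ROUGE (3-layer stacked LSTM) | CIDEr (3-layer stacked LSTM) | SPICE (3-layer stacked LSTM) |
---|---|---|---|---|---|---|---|---|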
VGG16 + Stacked LSTM + GloVe (Model_1) | 24.7 | 60.7 | 32.4 | 3 | 24.1 | 60.9 | 29.6 | 3 |
InceptionV3 + Stacked LSTM + GloVe (Model_2) | 33.3 | 66.6 | 58.4 | 4.8 | 31.1 | 67.0 | 64.4 | 4.9 |
NASNet + Stacked LSTM + GloVe (Model_3) | 32.3 | 68.8 | 70.7 | 5.1 | 31.8 | 67.5 | 71.4 | 4.9 |