From: Semantic context driven language descriptions of videos using deep neural network
Results on the MSVD dataset for 2-layer and 3-layer stacked LSTM decoders (B@n = BLEU-n).

Models | 2-Layer B@1 | 2-Layer B@2 | 2-Layer B@3 | 2-Layer B@4 | 3-Layer B@1 | 3-Layer B@2 | 3-Layer B@3 | 3-Layer B@4 |
---|---|---|---|---|---|---|---|---|
VGG16 + Stacked LSTM + GloVe (Model_1) | 69.1 | 50.1 | 38.2 | 27.0 | 68.1 | 48.8 | 37.0 | 25.58 |
InceptionV3 + Stacked LSTM + GloVe (Model_2) | 74.3 | 60.1 | 49.7 | 40.2 | 73.6 | 59.8 | 49.5 | 38.5 |
NASNet + Stacked LSTM + GloVe (Model_3) | 78.4 | 64.8 | 54.2 | 43.7 | 78.2 | 65.3 | 55.1 | 44 |