Bilingual video captioning model for enhanced video retrieval

Journal of Big Data

Table 1 A summary of the reviewed keyframe extraction methods

Ref.	Year	Approach/method	Weaknesses
[13]	2022	Content-based: coarse extraction and partial-fine re-extraction of spatiotemporal slices	Not suitable for all videos (especially videos that contain fast scenes)
[15]	2021	Content-based: SSIM	Applied to specific regions, not to the complete frame
[17]	2015	Time-based	Inaccurate because selection is based on time alone (one frame per second)
[18]	2015	Frame-based	Inaccurate because selection is based on a fixed number of frames (240 frames per video)
[19]	2021	Content-based: RL-based filtration network	Requires efficient training
[20]	2022	Content-based: multiview fusion method-based	Complicated method; requires specific frame sizes
[22]	2022	Content-based: local consistent deformable convolution	Long processing time
[23]	2020	Content-based: Sobel gradient images and variance coefficient measure	Long processing time compared to other similar approaches [26]
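
The weakness noted for the time-based [17] and frame-based [18] rows can be illustrated with a minimal sketch of how such content-agnostic sampling might choose frame indices. This is not the implementation used in either reference: the function names, the parameters, and the assumption that the 240 frames are evenly spaced are hypothetical and chosen only for illustration.

```python
def time_based_indices(n_frames: int, fps: float, every_s: float = 1.0) -> list[int]:
    """Pick one frame index every `every_s` seconds (hypothetical helper)."""
    if n_frames <= 0 or fps <= 0 or every_s <= 0:
        return []
    step = fps * every_s  # number of frames between two sampled instants
    indices: list[int] = []
    k = 0
    while int(k * step) < n_frames:
        idx = int(k * step)
        # Skip repeats, which occur only when step < 1 frame.
        if not indices or idx != indices[-1]:
            indices.append(idx)
        k += 1
    return indices


def frame_count_indices(n_frames: int, n_keep: int = 240) -> list[int]:
    """Pick `n_keep` frame indices per video (hypothetical helper).

    Even spacing is an assumption made for this sketch.
    """
    if n_frames <= 0 or n_keep <= 0:
        return []
    if n_frames <= n_keep:
        # Short video: every frame is kept.
        return list(range(n_frames))
    return [i * n_frames // n_keep for i in range(n_keep)]
```

In both functions the selected indices depend only on the video length and frame rate, never on pixel content, so a short but important scene that falls between two sampled positions is missed. The content-based rows of Table 1 aim to avoid this.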