From: Bilingual video captioning model for enhanced video retrieval
Ref. | Year | Approach/method | Weaknesses |
---|---|---|---|
[13] | 2022 | Content-based: coarse extraction and partial-fine re-extraction of spatiotemporal slices | Not suitable for all videos (especially videos that contain fast scenes) |
[15] | 2021 | Content-based: SSIM | Applied to specific regions, not the complete frame |
[17] | 2015 | Time-based | Inaccurate because it is based on time (one frame each second) |
[18] | 2015 | Frame-based | Inaccurate because it is based on the number of frames (240 frames per video) |
[19] | 2021 | Content-based: RL-based filtration network | Requires efficient training |
[20] | 2022 | Content-based: multiview fusion-based method | Complicated method; requires specific frame sizes |
[22] | 2022 | Content-based: local consistent deformable convolution | Long processing time |
[23] | 2020 | Content-based: Sobel gradient images and variance coefficient measure | Long processing time compared to other similar approaches [26] |
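
The time-based and frame-based rows can be illustrated with a minimal sketch. It assumes only what the table states (one frame per second for [17], 240 frames per video for [18]); the function names and the even-spacing rule for the fixed-count case are hypothetical and not taken from the cited works. Both functions choose indices from the frame rate or frame count alone, never from the image content, which is why the table calls them inaccurate.

```python
def time_based_indices(total_frames: int, fps: float) -> list[int]:
    """Pick one frame index per second of video (assumed reading of [17])."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    indices = []
    second = 0
    # Step through whole seconds until the index would fall past the last frame.
    while int(second * fps) < total_frames:
        indices.append(int(second * fps))
        second += 1
    return indices


def fixed_count_indices(total_frames: int, count: int = 240) -> list[int]:
    """Pick a fixed number of evenly spaced frame indices (assumed reading of [18])."""
    if total_frames <= 0 or count <= 0:
        return []
    if total_frames <= count:
        # Short video: every frame is kept.
        return list(range(total_frames))
    # Even spacing is an assumption; the table only gives the count of 240.
    return [i * total_frames // count for i in range(count)]


if __name__ == "__main__":
    # A 10-second clip at 30 fps yields ten indices, one per second.
    print(time_based_indices(300, 30.0))
    # A 480-frame clip sampled to 240 frames keeps every second frame.
    print(fixed_count_indices(480)[:5])
```

A fast scene shorter than the sampling interval can fall entirely between two selected indices, which is the gap the content-based rows in the table try to close.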