
Table 9 Visual relationship detection

From: An analytical study of information extraction from unstructured and multidimensional big data

| Ref. | Purpose | Technique | Dataset | Results | Limitations/benefits |
|------|---------|-----------|---------|---------|----------------------|
| [43] | Map images to associated scene triples (subject, predicate, object) | Conditional multi-way model with an implicitly learned latent representation; semantic tensor model combined with an RCNN object detector | Stanford visual relationship dataset | The model achieved better performance in predicting unobserved triples | Results are comparable to the Bayesian fusion model (BFM); the proposed approach uses an implicitly learned prior for semantic triples, whereas BFM requires an explicit one |
| [38] | Improve extraction of global context cues | Variation-structured reinforcement learning (VRL): directed semantic graph built with a language prior + variation-structured traversal to construct the action set + sequential predictions using deep RL | VRD dataset (5000 images) and Visual Genome dataset (87,398 images) | Outperforms baseline methods on attribute recall@100 and @50 (26.43 and 24.87, respectively); other results are also notable compared with competing methods | Results are nearly the same as VRL with LSTM, although VRL with LSTM is claimed to take more training time |
| [39] | Identify unseen context interaction relationships | Context-aware interaction classification, i.e. Faster-RCNN + AP + C + CAT | VRD dataset and Visual Phrase dataset | Performed better than baseline methods using spatial and appearance features | Spatial feature representation produced better results than appearance-based representation; adding a language prior to the proposed approach brings no benefit |
| [44] | (1) Infuse semantic information and improve predicate detection; (2) use NMS to reduce redundancy and boost detection speed | Feature extraction covering spatial, classification, and appearance information + bidirectional RNN + paired non-maximum suppression (NMS) | Visual Genome (108,077 images) and VRD (5000 images) | Compared with existing methods for predicate, phrase, and relationship detection at recall@50 and @100; the proposed solution gave better results on both datasets | Filtering superfluous regions with NMS improves performance |
| [41] | Overcome the long-tail distribution challenge; handle the widely spread and imbalanced distribution of triples | Visual module using VGG16 and language module using softmax; the contribution is a spatial vector using the normalized relative location of objects and intersection over union | VRD and VG datasets | The proposed vector improved Recall@50 by 2% and 4% over competing methods; the solution can detect unseen visual relationships | The research addresses only the long-tail distribution challenge |
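The NMS step used in [44] builds on standard greedy non-maximum suppression: detections are visited in descending score order, and any box overlapping an already-kept box beyond an IoU threshold is discarded. The sketch below is a minimal illustration of that standard algorithm, not the authors' exact paired variant; all names are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring boxes, dropping any box whose
    IoU with an already-kept box exceeds `thresh`. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes and one distant box: the lower-scoring
# duplicate is suppressed, the distant box survives.
print(nms([(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)],
          [0.9, 0.8, 0.7]))  # → [0, 2]
```

This filtering of superfluous candidate regions is what the table credits for the speed and performance gain in [44].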
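The spatial vector contributed by [41] combines the normalized relative location of the subject and object boxes with their intersection over union. The paper's exact encoding is not reproduced here; the following is one plausible construction under that description, with all function names and the coordinate layout being assumptions for illustration.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def spatial_vector(subj, obj, img_w, img_h):
    """Hypothetical spatial feature for a (subject, object) box pair:
    corner offsets normalized by image size, plus the pair's IoU."""
    sx1, sy1, sx2, sy2 = subj
    ox1, oy1, ox2, oy2 = obj
    rel = [(sx1 - ox1) / img_w, (sy1 - oy1) / img_h,
           (sx2 - ox2) / img_w, (sy2 - oy2) / img_h]
    return rel + [box_iou(subj, obj)]
```

Because such a vector depends only on geometry, not on object identity, it can generalize to subject-object pairs never seen during training, which is consistent with the unseen-relationship detection reported for [41].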