Table 9 Visual relationship detection

From: An analytical study of information extraction from unstructured and multidimensional big data

 

Each entry below lists the source reference, followed by its Purpose, Technique, Dataset, Results, and Limitations/benefits.

[43]
Purpose: To map images to their associated scene triples (subject, predicate, object).
Technique: Conditional multiway model with an implicitly learned latent representation; both the semantic tensor model and the object detector use an RCNN model.
Dataset: Stanford Visual Relationship dataset.
Results: The model achieved better performance in predicting unobserved triples.
Limitations/benefits: Results are comparable to the Bayesian fusion model (BFM); the proposed approach uses an implicitly learned prior for semantic triples, whereas BFM requires an explicit one.
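The semantic tensor model in this entry scores candidate triples from learned representations. As a minimal sketch of that idea (a RESCAL-style bilinear scorer, not necessarily the paper's exact model; all names, sizes, and values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative vocabulary sizes and embedding dimension (assumptions).
n_entities, n_predicates, dim = 100, 70, 16

# Entity embeddings and one relation matrix per predicate
# (a bilinear factorization of the triple tensor).
E = rng.normal(size=(n_entities, dim))
R = rng.normal(size=(n_predicates, dim, dim))

def triple_score(subj: int, pred: int, obj: int) -> float:
    """Score a (subject, predicate, object) triple as e_s^T W_p e_o."""
    return float(E[subj] @ R[pred] @ E[obj])

# Rank all predicates for a fixed subject/object pair, as one would
# when predicting unobserved triples for a detected object pair.
scores = np.array([triple_score(3, p, 7) for p in range(n_predicates)])
print("top-5 predicates:", np.argsort(scores)[::-1][:5])
```

Because every entity and predicate has an embedding, triples never seen in training still receive scores, which is what makes predicting unobserved triples possible in this style of model.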

[38]
Purpose: To improve the extraction of global context cues.
Technique: Variation-structured Reinforcement Learning (VRL): a directed semantic graph built with a language prior, a variation-structured traversal to construct the action set, and sequential predictions made with deep RL.
Dataset: VRD dataset with 5000 images and Visual Genome dataset with 87,398 images.
Results: The proposed approach outperforms baseline methods on attribute recall@100 and recall@50 (26.43 and 24.87, respectively); the other reported results are also notably better than competing methods.
Limitations/benefits: Although the approach outperformed the baselines, its results are almost the same as those of VRL with LSTM; however, the authors claim that VRL with LSTM takes more training time.
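To make the sequential-prediction step concrete, here is a minimal policy-gradient sketch in the spirit of VRL: a policy network scores a per-step action set (here, candidate predicates) given a state feature and is trained with REINFORCE. The state features, action set, reward signal, and all dimensions are stand-in assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 32, 10  # illustrative sizes

# Policy network mapping a state feature to action logits.
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_episode(steps: int = 5) -> float:
    log_probs, rewards = [], []
    state = torch.randn(state_dim)  # stand-in for visual/graph features
    for _ in range(steps):
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()           # pick one predicate per step
        log_probs.append(dist.log_prob(action))
        rewards.append(torch.rand(()))   # stand-in reward signal
        state = torch.randn(state_dim)   # illustrative next state
    # REINFORCE: scale the summed log-probabilities by the return.
    ret = torch.stack(rewards).sum()
    loss = -(torch.stack(log_probs).sum() * ret)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(ret)

print("episode return:", run_episode())
```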

[39]
Purpose: To identify unseen context interaction relationships.
Technique: Context-aware interaction classification (Faster-RCNN + AP + C + CAT).
Dataset: VRD dataset and Visual Phrase dataset.
Results: The proposed approach performed better than baseline methods using spatial and appearance features.
Limitations/benefits: Spatial feature representation produced better results than appearance-based representation; adding a language prior to the proposed approach does not bring any benefit.
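Since this entry contrasts spatial and appearance features, the sketch below shows one plausible way to build and fuse the two representations for predicate classification; the feature definitions, dimensions, and linear classifier are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def spatial_feature(box_s, box_o):
    """Simple spatial representation of a subject/object box pair
    (boxes as [x, y, w, h]): normalized offsets and log size ratios."""
    xs, ys, ws, hs = box_s
    xo, yo, wo, ho = box_o
    return np.array([(xo - xs) / ws, (yo - ys) / hs,
                     np.log(wo / ws), np.log(ho / hs)])

# Stand-ins for appearance features (e.g., pooled CNN activations).
app_s, app_o = rng.normal(size=256), rng.normal(size=256)
spa = spatial_feature((10, 20, 50, 80), (40, 30, 30, 60))

# Concatenate appearance + spatial features; a linear layer then
# scores each candidate predicate.
x = np.concatenate([app_s, app_o, spa])
W = rng.normal(size=(70, x.size)) * 0.01  # 70 illustrative predicates
print("predicted predicate:", int(np.argmax(W @ x)))
```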

[44]
Purpose: 1. To infuse semantic information and improve predicate detection. 2. To reduce redundancy and boost detection speed using NMS.
Technique: Feature extraction capturing spatial, classification, and appearance information + bidirectional RNN + paired non-maximum suppression (NMS).
Dataset: Visual Genome with 108,077 images and VRD with 5000 images.
Results: Results are compared with other existing methods for predicate, phrase, and relationship detection at recall@50 and recall@100; the proposed solution gave better results on both datasets.
Limitations/benefits: Filtering superfluous regions with NMS improves performance (see the NMS sketch below).
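The paper applies a paired NMS over relationship proposals; the sketch below shows standard greedy box NMS, the building block of that step, rather than the paper's paired variant:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes overlapping it above iou_thresh, repeat.
    boxes: (N, 4) as [x1, y1, x2, y2]."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: the near-duplicate box is suppressed
```

Suppressing near-duplicate regions this way reduces the number of candidate pairs the relationship model must score, which is where the reported speed-up comes from.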

[41]
Purpose: To overcome the long-tail distribution challenge, i.e., to handle the widely spread and imbalanced distribution of triples.
Technique: Visual module using VGG16 and language module using softmax; the contribution is a spatial vector built from the normalized relative location of the objects and their intersection over union (IoU).
Dataset: VRD and VG datasets.
Results: The proposed spatial vector improved performance by 2% and 4% on recall@50 compared with other methods; the proposed solution is capable of detecting unseen visual relationships.
Limitations/benefits: The research only addressed the long-tail distribution challenge.
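As an illustration of the spatial vector described in this entry, the sketch below combines the normalized relative location of two boxes with their IoU; the exact normalization is an assumption, not necessarily the paper's formula:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def spatial_vector(subj, obj, img_w, img_h):
    """Spatial vector from the normalized relative location of the two
    boxes plus their IoU (one plausible reading of the entry above)."""
    rel = np.array([(obj[0] - subj[0]) / img_w,
                    (obj[1] - subj[1]) / img_h,
                    (obj[2] - subj[2]) / img_w,
                    (obj[3] - subj[3]) / img_h])
    return np.append(rel, iou(subj, obj))

print(spatial_vector([10, 20, 60, 100], [40, 30, 70, 90], 640, 480))
```

Because such a vector depends only on geometry, not on the object categories, it can generalize to subject/object pairs never seen together in training, which is one way to mitigate the long-tail problem the entry describes.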