Problem statement and solution
News reports describing event(s) are associated with or are routed to an intermediate node in a network of nodes describing a provenance graph where each node in this graph is associated with an activity. Every news report is presented by a user whose trustworthiness is measured with a computational trust value. This activity describes a “match and extract” procedure for transforming these input reports to the destination activity or event or topic node thus providing a more detailed version of the narrative. This search and extract procedure which uses the importance measure associated with individual attributes or records describing the news reports and the computational trust measure associated with their originators have been described in “Development of relevance of computational trust” section. The relevance of the clustering to this work has been described in “Related work in clustering of nodes in the provenance graph” section and our approach to clustering of nodes has been described in “Our solution to clustering nodes in provenance graph” section. This procedure of learning the path in a provenance graph of activity or event nodes has been described in “Rationale for application of Q Learning algorithm in this trust based representation” section. The rationale for application of Q Learning algorithm to the Trust Provenance Graph Representation has been described in “Application of Q Learning algorithm for learning provenance path” section. The provenance graph models thus constructed from identified dependencies between events or activities is relevant for an interval of time window. These graph models are merged to produce consistent models spanning larger intervals of time windows which apply also to incrementally acquired models for more recent time interval windows. The narrative of splitting and merging for producing the goal topic or whole story description in these identified merged models have been described in “Our approach to provenance graph structure and classifier learning” section.
A block diagram or the design of the solution steps has been presented in Fig. 1. The examples in support of the design have been presented in this paper in Figs. 2 and 3. The Theme or Event or Topic or Subtopic evolution threads describe the Asian Tsumami depicted in Fig. 1. Here for example, Lessons from Event has been described with the Theme evolution Theme 1 → Theme 1 → Theme 4. Possible threads for Aids include Theme 4 → Theme 3 → Theme 5 and thread for Personal Experience include Theme 5 → Theme 6 → Theme 3 → Theme 3. The confluence or merging of Theme 4 and Theme 5 into Theme 6 and then evolving into Theme 3 described the Personal Experience thread as Theme 4 → Theme 6 → Theme 3 → Theme 3 or as Theme 5 → Theme 6 → Theme 3 → Theme 3. Theme 1 at time interval 4 splits into and evolves towards Theme 1 at time 5 and Theme 6 at time 5, corresponding to Lessons from the event thread and Criticism on Iraq thread. The learning or construction of provenance path to a goal path discovery has been described in Fig. 3 as exemplified in [131].
Development of relevance of computational trust
The model for a message passing network is considered in the present work. Trust needs to be associated with messages as well. Two trust values are defined, one for the communicating node and the other for the message sent. In order to conclude about the trustworthiness of a message, a composite trust needs to be computed. A summarized description of issues of storage and retrieval of trusted information using a probabilistic temporal database approach is available in [12]. A measure has been introduced as a composite of calculated weight of record or attributes importance value and the computational trust of the agent attached to the node. The utility measure associated with any node can be calculated based on probability of action of forwarding data from that node to the network node option for the current congestion situation. This probabilistic utility measure may indicate the fraction of the sent packets from the source that are successfully delivered at the destination thus avoiding dropped packets (Eq. 8). This net trust value which is the composite trust of attribute importance values and network utility values and the computational trust of the agent associated with the node is derived from application of principles described in Eq. 10. An event or topic modelling approach is considered where the dependency between news report event or subtopic nodes is represented from their correlated property derived from similarity or distance measure associating these news report events or subtopics. The reward value calculation for transferring message between these nodes is described and which is utilized for deriving a policy path of message transfers using a reinforcement learning approach.
$$ {\text{The weight associated with a source record}} = \left( {\sum\nolimits_{i} {(p_{i} *mc_{i} )/(ascm)} } \right) $$
(1)
where i represents a valid or selected attribute in the merged record where merging records have identical value for this attribute, mc
i
is occurrence count of token associated with attribute i in merged record which is calculated as the sum of occurrence counts of corresponding attribute value in the individual records, ascm is the cumulative occurrence count of tokens in the merged news report for all valid or selected attributes i present in the merging records and p
i
determined from application of Eqs. 2, 3 and 4.
The value is normalized with a division by the summation of such probabilities identified for all attributes of the source record.
A measure can be associated with any field value based on the occurrence count of this value in the news report from where this record is represented. This can be calculated based on the iteratively developed conditional probability of a field token value given an event as a maximization step of EM algorithm solution as described in [132]. The EM algorithm maximizes the posterior probability of occurrence of the event.
The expectation step is \( {\text{p}}({\text{ej}}|{\text{xi}})\left( {{\text{t}} + {\text{1}}} \right) = {\text{p}}({\text{ej}})\left( {\text{t}}\right)\;*\;{\text{p}}({\text{xi}}|{\text{ej}})\left( {\text{t}} \right)/{\text{p}}({\text{xi}})\left( {\text{t}} \right). \) Here ej is the jth event, xi is the ith news report. The maximization step is derived from this equation and this also produces the probability value of nth entity.
$$ {\text{p}}({\text{w}}_{\text{n}} |{\text{e}}_{\text{j}} )({\text{t}}+1) =\left( \sum\limits_{{({\text{i}} = 1\;{\text{to}}\;{\text{M)}}}} {{\text{p(e}}_{\text{j}} |{\text{X}}_{\text{i}} ) ( {\text{t}}}+1)\,*\,\text{tf(i,n)}) /({\text{N}} + \sum\limits_{{({\text{i}} = 1\;{\text{to}}\;{\text{M)}}}} {{\text{p(e}}_{\text{j}} |{\text{X}}_{\text{i}} ) ( {\text{t}}+1)} + \sum\limits_{{(s = 1\;{\text{to}}\;{\text{N)}}}} {\text{tf(i,s)}}\right ) $$
(2)
This probability of field token value in presence of all values can also be calculated as a posterior probability value based on prior, likelihood of the value given the news reports about the event and the prior probability of occurrence of event in the reports as described in the context of combination and calibration of methods for the purpose of forecasting of events appearing in [133].
$$ {\text{f}}({\text{p}}_{\text{t}} |{\text{n}}_{\text{t}} ) = {\text{f}}\left( {{\text{p}}_{\text{t}} } \right)*{\text{f}}\left( {{\text{n}}_{\text{t}} |{\text{p}}_{\text{t}} } \right)/{\text{f}}({\text{n}}_{\text{t}} ) $$
(3)
where \( {\text{f}}\left( {{\text{n}}_{{\text{t}}} |{\text{p}}_{{\text{t}}} } \right) = \left( {{\text{n}}_{{\text{t}}} {\text{C}}_{{\text{m}}} } \right)*{\text{p}}_{{\text{t}}}^{{{\text{nt}}}} *\left( {{\text{1}}-{\text{p}}_{{\text{t}}} } \right)^{{({\text{m}}{-}{\text{nt}})}} \)
This probability value can also be calculated by a beta binomial method. This can also be calculated as an optimal score approach where the conditional probability of the field token value is calculated based on defined number of past observations, and observations from news reports with application of appropriate weight-age to each that minimize a posterior log likelihood measure as described in [133].
This is described as follows,
$$ E(p_{t} |n_{t} ) = (T_{p} + w_{nt} )/(T + w_{m} ) = w_{{\mathbf{1}}} *(n_{t} /m) + w_{{\mathbf{2}}} *p $$
(4)
where w
1 = w
m
/(T + w
m
), and w2 = T/(T + w
m
), where pt is the probability associated with entity of interest (posterior), nt is the occurrence count of the entity, m is the total occurrence count of all entities, T is the sample size and p(prior) is the sample mean.
The weight w is determined with the purpose of maximizing the log-likelihood value,
$$ {\text{L}}\left( {\text{w}} \right) \, = \Pi_{{{\text{t}} = 1}}^{\text{T}}\;\text{E}({\text{p}}_{\text{t}}^{*} |{\text{ n}}_{\text{t}} ).$$
This w value determines the posterior probability of entity of interest.
Thus if wi is the weight associated with ith message record, and Ti is the global trust value associated with source of record i, the composite trust in the event is calculated as
$$ COMPT = \sum\limits_{i} {(w_{i} *T_{i} )} $$
(5)
The fusion value of data is calculated using data fusion rule specified in [134]. This value is determined as
$$ {\text{DV = }}{{\sum\nolimits_{{\text{i}}} {({\text{w}}_{{\text{i}}}* {\text{T}}_{{\text{i}}} )} } \mathord{\left/ {\vphantom {{\sum\nolimits_{{\text{i}}} {({\text{w}}_{{\text{i}}}* {\text{T}}_{{\text{i}}} )} } {\sum\nolimits_{{\text{i}}} {{\text{T}}_{{\text{i}}} } }}} \right. \kern-\nulldelimiterspace} {\sum\nolimits_{{\text{i}}} {{\text{T}}_{{\text{i}}} } }} $$
(6)
where Σi (w
i
* T
i
) is the composite trust in the message produced from this node and T
i
is the Trust in the Trusted Partners associated with ith message record input to this node
$$ {\text{Composite Weight of Source Record}} = ({\text{w}}_{\text{A}} * {\text{u}}_{\text{A}} ) $$
(7)
where wA is weight associated with source record in “A” and UA is the utility associated with the source record from source at state “A” for the chosen forwarding action to B based on current network congestion situation. The utility measure uA is defined as,
$$ {\text{u}}_{\text{A}} = \frac{{\left( {{\text{Frequency count of successful transfer of packets from ``A''}}} \right)}}{{({\text{Frequency count of sum total of all packets transferred from ``A''}})}} $$
(8)
The reward received at node B is COMPTAB where the composite trust in A after interaction with “B” [40] which has been derived using Eq. 10 from a choice condition determined from application of Eq. 9.
The unit step reward for corresponding pAB(n) is valid for n = 1.
However multi step reward calculated for such node pair (A,B) = pAB(n) for n ≥ 1, which is symbolized as pABt and is derived using application of Eqs. 13, 14, 15 and 16 appearing later in this document.
A similarity measure Sim(i, j) between two records i and j is defined where each of these records are described by k attributes and attr_in_merge(j) = 1, indicates if the jth attribute values of the records have similarity value > 0, else attr_in_merge(j) = 0.
The net similarity measure is defined as
$$ Sim\left( {i, \, j} \right) = \sum\limits_{{\alpha (a = 1\,\,{\text{to}}\,\,k)}} {(attr\_in\_merge(a))(wa*Sima(i,j))} $$
(9)
where Sima (i,j) is the similarity of the ath attributes of records i and j and wa is the normalized probabilistic weight associated with the ath attribute of the merged record which is based on application of Eqs. 2, 3 and 4.
The procedure of cumulating occurrence counts of token or evidence for merging information from multiple sources has also been discussed in [135]. The choice of routing or forwarding of information based on reward calculated on aggregation based on data correlation and gain in reward or diminishment of distance measure to the goal node has been described in [136].
This net similarity measure calculated using Eq. 9, is used for determining the appropriate probability disjunction strategy. Here the probabilistic trust values pA and pB associated with individual source records participating in the merge are calculated from applying Eqs. 1 and 7.
If pB is the probabilistic importance or trust value associated with record at node “B” and pA is the probability of source record at “A” forwarded to node “B” situated in the provenance graph is determined from application of Eqs. 1 and 7. Here the weight of the relevant record is multiplied with the probabilistic trust value measuring the trustworthiness of the agent at the node to produce the probabilistic value presented to the probabilistic disjunctive strategy described for temporal probabilistic databases [12, 137].
$$ {\text{The Net}}\_{\text{trust}} = {\text{measure defined on net similarity}} $$
(10)
Where if the net similarity measure > high threshold, probabilistic disjunctive strategy [137] is adoped for the positively correlated case and the Net_trust is min(1, pA + pB). else if the net similarity measure = 0.5, the probabilistic disjunctive strategy [12] is adopted for the independent case and the Net_trust is \( pA + pB\;{-}\;pA\;*\;pB. \) else if the net similarity measure < low threshold and > minimum threshold, probabilistic disjunctive strategy is adopted for the negatively correlated case [12] and the Net_trust is min(1, pA + pB). else If net similarity measure < minimum threshold, probabilistic disjunctive strategy is adopted for the ignorance case and the Net trust is max(pA, pB).
This net trust value COMPTAB or the reward received at node A is the composite or net trust in the event after merging the message records from pair of sources using method as described in Eq. 10 of this section. This measure considers the importance of similar sources in a cluster alongside sources with the news report “contribution” [138] or measure of distinctive relevance of sources in the reward value computation.
Rationale for application of Q Learning algorithm in this trust based representation
The agent selects an action probabilistically based on Boltzmann distribution defined using Q value probabilities associating states with actions [139]. The Q values are associated with policy of actions and thus justifying the use of provenance path probability values in determining action probabilities.
The transition probability pij between states i and j = exp ((f(j) − f(i))/t) has been described in [55]. The cost functions associated with states i and j are represented as f(i) and f(j) representing their energy states. The value of a policy is the sum of rewards obtained from execution of actions described in the policy. The Q Learning algorithm selects a policy with a higher reward. The difference (f(j) − f(i)) has been calculated from the optimum value of the policy connecting these states i and j. As a policy with higher reward (dominant policy) has always been selected over less rewarding policy [121] the procedure described for calculating reward value (Eq. 16) from pAB (d) is appropriate here. Thus this procedure favours moves between similar or relevant nodes or states.
Only sub-topic sub-goal states of each cluster have been identified using application of method described in [24] and are retained for the re-computation of probabilistic weight connecting pair of nodes in the reduced clustered state space. The detected node pairs defining the provenance graph in the reduced cluster space are utilized to learn the policy path with application of Q Learning method. The Q Learning approach adopted is unique to our work in the news reports modelling application as the approach provides more weight-age to rewards from moves taken when the search space is less localized and distant from the goal while calculating a discounted sum of rewards present in the trust path to goal. The rewarding moves between nodes or states in a reduced clustered space using a Q Learning approach provides more weight-age to early moves based on relevance or contribution of information associating or linking these nodes as this information is of more value as compared to that in proximity with the goal where the learning or the search has already been localized. Also final steps or edges in the path are associated with excessive granular representation of events which are discounted or attached a lesser weight in computing the weighted average value of reward from start node to goal node.
Application of Q Learning algorithm for learning provenance path
The Q Learning procedure is described as follows. Initialize V(s) for all states using initial network congestion conditions and using procedures described in Eqs. 1, 7 and 10. The network nodes are considered to be connected by links based on probabilities determined from current congestion situation as described in Eq. 8. Here s is the set of nodes, that are base database [140] relation sites, where record merging operation is applied on records arriving from two or more network sites.
Here the transition probability T(s,a,s’) is determined from application of Eq. 16, and R(s,a) is the trust in the component of information contributed by state s in performing the transformation step at a state s’ reached from forwarding action of data or information packet and this is determined from application of Eq. 16 after reaching state s’ as described earlier.
The policy learned is considered to be “good” if the magnitude of the updated Value attributes are less than an acceptable threshold different from the magnitude of these attribute value prior to update. The greedy policy finds the optimal policy in finite number of steps sometimes earlier than convergence of iterated value function. Here on arriving at a better optimal policy, all the previously learned policy paths including the latest that are covered are ignored and a new policy which is sub-optimal but is presently optimal is learned. This procedure continues until all policy paths satisfying the goal are learned.
Thus a message is transferred through a selected policy path of nodes in a provenance network for maximizing the iterated value described for a node in the graph.
A composite trust value of a node “A” is calculated from its content importance, which includes the trust of the subject associated with the node and importance of the node. The change of importance measure between a node “A” of interest and active node “B” in the provenance path after traversing n steps in the provenance path is calculated where node B is active after executing n provenance steps.
A provenance trust score is calculated for every data “d” passing through this node “A”. This is calculated as minimum or average of the trust values associated with nodes in the provenance path. The provenance trust scores of all data passing through this node “A” are averaged to produce a measure of trust associated with node “A”.
The subject node “B” is included in the same cluster as the cluster identified for subject node “A” if the following conditions are satisfied.
The weight of source record at “A” after satisfying constraints associated with merging with record at “B” is wAB and is calculated using Eq. 7. The weight measure is further normalized by maximum wAB calculated for all links describing the provenance graph.
Also prior weight associated with record at “A” is wABt = wAB/max(wAB)
$$ {\text{p}}_{\text{AB}}(1) = { \text{max} }\left( {{\text{w}}_{\text{ABt}} ,0. 5} \right) $$
(12)
and for “B” reachable from “A” in 1 step, where pAB (1) is the Transition probabilistic trust value between nodes or the strength of trust connecting these nodes and also where in the absence of prior information this probability is initialized to 0.5.
If A is connected to B with a provenance path through one intermediate node C for a run of the workflow:
$$ {\text{p}}_{\text{AB}} \left( 2\right) = {\text{p}}_{\text{AC}} \left( 1\right)*{\text{p}}_{\text{CB}} \left( 1\right) $$
(13)
Alternatively pAB(2) can also be calculated as the average or the weighted average which is of more relevance in the adopted of Q Learning procedure of the trust transitions pAC(1) and pCB(1) if we adopt a procedure described in [44].
The transition probability pABt which is the probabilistic link value connecting any connected pair of nodes in the provenance graph has been defined in [27] and is calculated using a procedure Cluster_Node_1 described as follows.
If PABt > threshold value these nodes are positioned in the same cluster PABt which is the derived trust value of “A” to “B” obtained using distinct provenance paths (pathi) for a data input “d” to the stream.
Alternatively this trust value can be calculated as the weighted average of the trust values computed along the alternative paths where the weights are determined from the path length. Here this may be noteworthy that the links nearer to the start node contribute heavily to the net path trust value connecting any pair of start and goal nodes in the provenance path.
This is also described as to the procedure of combination of path trust values in [44].
The normalized derived probability of trust link A to B for action taken at state A
$$ {\text{p}}_{\text{ABc}} = {{{\text{P}}_{\text{ABt}} } \mathord{\left/ {\vphantom {{{\text{P}}_{\text{ABt}} } {\sum\nolimits_{x} { ( {\text{P}}_{\text{AXt}} )} }}} \right. \kern-0pt} {\sum\nolimits_{x} { ( {\text{P}}_{\text{AXt}} )} }}. $$
(16)
After every updated policy identification round, the revised trust values of agents/actors associated with nodes in the provenance network are propagated and updated according to the trust propagation rules described for social networks in [141].
Our solution to clustering nodes in provenance graph
The approach adopted in this work for Clustering of nodes in a provenance graph includes the following steps.
-
1.
Identify new node of high centrality importance based on its connectedness or between-ness. These nodes may also be marked as split or merge nodes in the provenance graph describing application of news reports topic modelling. This is sub-goal node for a cluster.
-
2.
Grow the cluster starting from a node of high importance which may be nodes of high measure value of degree or between-ness.
-
2.1.
The cluster node expansion and cluster edge expansion measures are considered for growing a cluster using a derived cluster node expansion capability measure. This method considers the nodes in the cluster and the complement of the nodes in the cluster remaining in the graph for defining the expansion measure. Also here if the link trust probabilities are fluctuating, a measure based on relative dependence is used for identifying the stability of clustering. The distance measure between any two nodes in the cluster is defined using a Joint Entropy distance value as specified later in Eq. 17. Pair of nodes are only considered to be members of the same cluster if the activity or task steps or units corresponding to these node are within a predefined threshold distance measure defined using Joint Entropy distance measure between node pair. Also the similarity or distance measure is refined using similarities of roles of Agents associated with these nodes. This distance measure is defined based on length of category or role classification tree path separating the nearest ancestor nodes of agent node pair situated in this topic-subtopic role hierarchy tree. The agent similarity or distance measure is provided higher weight-age when compared to trust link measure in calculating the similarity or distance measure between event node pair. Here an agent role may be represented as an attribute in the event record associated with a network node. This leverages the nodes associated with the same agent role for membership in the same cluster. The entity words and the topic words are important in calculating similarity or distance measure for documents or news reports for these nodes to be positioned in the same topic or sub topic cluster. This similarity measure is represented using the probabilistic trust weight link connecting the nodes positioned in the provenance network. The distance measure calculation from this weight measure has been described later in this section.
-
2.2.
An average distance measure is calculated for every pair of nodes in the identified cluster. Only those nodes are retained in the same cluster whose pair-wise distance is less than a defined threshold measure different from this average distance.
-
3.
Continue Steps 1 to 3 until a cluster is identified for all nodes in the graph.
-
4.
Identify all the cluster components in the graph.
-
4.1.
Alternatively, node pairs with very high trust link weights may be removed from the provenance graph, and thus the clusters of connected graphs may be detected. The high trust link weighted edges may then be introduced, to establish links to sub-goal node(s) of a cluster.
The adjacent clusters/modules are merged based on the following procedure. A module score or a cluster score is defined for each module group in a workflow [45] based on distances of module element pairs contained in the module group. The workflow score is defined as the sum of the module group scores. A greedy strategy is adopted which merges two adjacent module groups with best workflow score. This process continues until only a predefined number of module groups remain. The adjacent cluster components are merged if these identified sub-goal nodes corresponding to the clusters are the same. A concept of role of macro action is applicable where this macro action satisfies a sub-goal and where a reward function may be used which is particular to the sub-topic sub-goal. The roles corresponding to the macro actions satisfies the same sub-topic or sub-goal role and have correlated topic models describing the connected event nodes present in the sub-goal or sub-topic cluster. This role can be interpreted as the semantic role or users of the same role having the same or similar “structural signatures” or users having a trusted group [41] role corresponding to the sub-topic cluster associated with tracking a particular topic or sub-topic or sub-goal of the story. All agents associated with a macro action may be interpreted as playing an informing role to the evaluating agent associated with the sub-topic sub-goal agent. As a consequence of merging of clusters, a hierarchical clustering of sub-topic sub-goal nodes is achieved where sub-goal nodes inferred from recursively defined merged clusters may indicate split or merge points in the goal topic or whole story description. Here a merging of clusters may aim at forming a merged cluster which have nodes in the merged cluster of similar Q values or Reward values as calculated from application of Eq. 16. The actions connecting the sub-topic sub-goal nodes of finally produced clusters are the macro actions on which the Q Learning procedure is re-applied. The distance measure defined on agent roles and trust links in combination with the reward based Q value approach produces a clustering where every cluster has nodes with same or similar agent roles [64].
This condition for merging of clusters is applied with the module score approach to obtain the final clustering. The probabilistic trust weight connecting the sub-goal nodes has been described as a macro action for identifying clusters which are updated using principles described earlier. A role is associated with interpreting a story theme or topic or subtopic or aspect of a story which may be derived from input vector of aspects.
An alternative clustering or classification approach has been described as follows. The examples associated with a state can be classified into one or more identified categories. The full set of states is subdivided into smaller set of states to reduce the entropy measure defined on categories. An action is described on a state with an attribute which is used to further classify the subset of examples in the state. The Information Gain functions that are used are Entropy based, Gini Index based and Discriminant power function based [expected count/(total count)]. The leaf nodes of this decision tree must satisfy the constraint that all activity or tasks nodes which are members of any leaf cluster node are executed by agents with the same role.
Here for measuring the applicability of our approach, a module group score can be defined for a group of module element nodes based on Joint Entropy distance [139] separating every pair of nodes (A, B) defined on this derived probabilistic weight PABc as discussed in Eq. 16.
This Joint entropy value based distance measure
$$ {\text{D}}_{\text{ij}} = {\text{p}}_{\text{ij}} * {\text{log}}({\text{p}}_{\text{ij}} ) + ( 1{-}{\text{p}}_{\text{ij}} )*{ \log }( 1{-}{\text{p}}_{\text{ij}} ) $$
(17)
This distance measure is further normalized by the maximum distance value max (Dij) for the provenance graph model. Thus Dij = Dij/max(Dij).
Thus quality of identified grouping or clustering of nodes in a module group can be represented using this distance measure and the difference from the weighted average distance calculated between every pair of nodes in the identified cluster. If this difference is lesser than a threshold value, the nodes are considered as elements of the same cluster. The adjacent groups of modules can be merged that satisfies a workflow score constraint as discussed earlier for a pre-specified limit on the number of clusters/groups in workflow.
Rationale for provenance graph structure and classifier learning
Only those events have been considered that have one or more of the necessary seed terms that have been used to describe a topic of interest to the user. Here the topic specific words that remove other words commonly occurring across topics have been identified using procedure described in [83]. The identification of story clusters for snapshot intervals of time and procedure for linking these clusters have been described in [106]. The models thus identified may be merged based on their importance in description of the story. Here interpreting clustering of documents for representing model cluster and Bayesian probabilistic approach [142] of assignment of document to a model cluster as has described in [143] are relevant. The application driven text data source can be static or dynamic where a static source implies that the document collection as not having frequent updates, while other text streams can be characterized as having many updates [144]. The topic modelling techniques have also been adapted with respect to the temporal scale for narrowing down events to fine granularity [144]. Hierarchical topics provide an overview of topics from one corpora [34]. The method suggested in [34] provides a full picture of topics from multiple corpora which can represent time updated versions of earlier corpora, where the hierarchical topic models have been merged based on graph matching methods [34], such as graph edit distance and other such methods [34]. A phrase reinforcement learning has been proposed in [145] where a starting phrase represents the topic for which generating a summary of tweets has been proposed, and this best partial summary represents the selection of path with the maximum sum of weights along the path [145]. The above description summarizes related work in the merging of topic models and representing summary policy path.
The classifier policy path has been detected from this structure by application of Q Learning algorithm for learning provenance path as has been described in an earlier section. Q Learning procedure has earlier been described to produce a plan [10] of event or topic executions [74] from the start topic or event [107] to the goal topic or event [107].
Provenance graph model(s) have been constructed for time window (s) which have been determined from window length [127] and refresh rate [127]. The time windows can also be defined based on a concept of sliding window [127] with overlaps between the models defined for these time windows.
The integer linear programming, constraint satisfaction, and other emerging algorithm based set covering problem solutions have been described as relevant for generating a cover set of classifiers for the models. A classifier in a model is compared with all classifiers in every other model learned and the best match is considered for calculating significance of a classifier [146]. Each one of these significant classifiers present in the minimum cover set and which are associated with one or more model (s) are compared with all such significant classifiers which are associated with other such model (s) and a similarity measure between these models has been computed from the classifiers present in these models using a correlation ratio measure as described in [147]. Alternatively, this similarity measure can be calculated as Pearson Product moment correlation measure. A principle of covariance measure defined between a single value and a vector of values is the sum of the covariance measures calculated between the single value and each element of the vector. Alternatively a distance or similarity measure can be described between a pair of these models using a graph edit distance measure as has been described in [125]. The models are clustered using pair-wise similarity or distance measures between a model pair described by their representative classifiers as described earlier. A hierarchical agglomerative Approach [148] can be adopted for this with merging of most similar model pairs at every hierarchy where a merged model can been derived using procedures as has been described later in Steps 9.2, 9.3 and 11. The above description summarizes the relevance of covariance or correlation based clustering methods to our work.
A partitioning approach can be adopted for this model merging using the K-medoids approach where a model has been selected as representative for the cluster of models and this clustering method is realized with a Partitioning around “medoids” (PAM) or Clustering Large Applications (CLARA). The algorithm described as PAM starts with randomly collected seed models and improves the clustering with a greedy strategy by randomly selecting a model as a “medoid” which reduces the measure associated with the absolute error criterion [148] representing the sum of dissimilarities. Here all models present in a cluster are merged. Alternatively, models are merged from using their property of markov equivalence producing an essential model. These have similarities with ideas appearing in [35, 36]. The algorithm described in Step 11 has been utilized to complete the definition of merged model. PAM (partitioning around “medoids”), a medoid based clustering algorithm which has been cited in our work is less influenced by outliers. Our representation of record instances for a time interval with a provenance graph model reduces the cluster space and makes it feasible for application of PAM. For large data sets, a sampling based method called CLARA can be used where after sampling the clustering methodology PAM has been applied to detect the best “medoid”. A “medoid” based approach to clustering of points described in PAM, CLARA or CLARANS where two clusters has been be merged based on the farthest distance between two points in the cluster pair and this tested merged cluster satisfies less than a certain threshold value for the diameter measure. The revised centroid point has been identified from merging of clusters. This centroid point has a maximum distance measure value to a point in the merged cluster and satisfies a magnitude less than a threshold value in radius measure. The above description summarizes the database clustering methods such as PAM, CLARA, CLARANS which are relevant to our work.
Also for the cause of merging models, the “medoid” model in a cluster has been selected as the initial model. The other present models in the cluster has been merged iteratively such that the intermediate model minimally increases the error defined based on a maximum likelihood measure as has been described in [149]. Alternatively a Bayesian Model Merging principle has been adopted such that product of model prior and likelihood measure associated with the models has a maximum value [149]. Also the models can be ordered based on their relevance to a subject or topic using principles of Bayesian Factor or Bayesian sampling approaches as has been described in [150] and the models can be merged in an appropriate order within a cluster. Here a unified provenance graph model is thus constructed where model unification procedure has been applied for merging element models from incrementally acquired information obtained at more recent time intervals as has been described earlier. A merging of time series data using principles of Dynamic Time warping (DTW) has been described in [151] where the time series pair participating in the merge have been optimally aligned using principles of Dynamic Programming if the lengths of the time series pair have not been observed as same. A cluster has been represented by time series data which have not been considered as similar in nature and a representative time series has been derived from the time series data present in the cluster that considers the DTW distance to identify the closest time series [151]. The above description summarizes related work in merging models within a cluster using Bayesian Factor, or using methods for Time Series Merging based on Dynamic Time warping.
A hierarchical agglomerative clustering approach appearing in [11] has been adopted with merging of most similar model pairs at every hierarchy where a merged model has been derived using a procedure described later in Steps 9.2, 9.3 and 11. BOAT uses attribute selection method like Gini index which has been used for constructing for Regression Trees.
Boosting is a method of combining ensemble classifiers created from a weighted version of learning sample, where weights have been adjusted at each step to provide increased weight to cases misclassified earlier. Adapting re-sampling has been identified as the key to success with misclassified cases receiving larger weights in the next step. Bagging has been applied to larger trees in contrast to boosting that works well with stumps or slightly larger trees [152]. The idea of combining ensemble of classifiers using a weighted version of each as has been described in boosting and this has interpretation of relevance to our work where earlier steps of learning have been provided more weight-age than those derived later and the significant classifiers thus derived have more weight-age value. Also the start nodes describing a topic in the Phrase Reinforcement Learning in learning of topic description as has been described in [145]. The above description summarizes the relevance of Boosting over Bagging for combining ensemble classifiers.
A reinforcement learning approach has been described as providing a balance between pruning for generalization and growing deeper trees for accuracy [153]. A continuous U tree algorithm transfers traditional U tree algorithm to reinforcement learning and this U tree algorithm can be viewed as a Regression Tree algorithm for storing state values [154]. The regression clustering (CART) with splits satisfying a maximum gain measure as modelled by ginni coefficient describes a probabilistic measure modelling the fraction of points of the predecessor nodes that are present in the one or the other successor node. The above description summarizes the relevance of U tree algorithm and CART algorithm to our work.
Structural Regression Trees integrates the regression method of learning into inductive logic programming [155]. This method however produces a solution to the Relational Regression Problems which have been difficult to understand, and assumes that all features are equally relevant to all parts of instance space, and also does not have easy utilization of domain space [155]. The SRT method has a simple method of tree selection based on Minimum Description Length (MDL) [113] principle. The MDL algorithm measures the simplicity and accuracy of the theory and data. The theory description length has been derived from encoding of literals and encoding of the predicted values in leaves [155]. The data length has been derived from encoding of errors [155]. A model has been selected with minimum message length associated with the sum of theory message length and data message length of the model [155]. The balance provided by Regression Tree approach [153] with the methodology of error complexity pruning and growing deeper tree for accuracy has similarity with the MDL based approach to learning as has been described in [155]. The interaction network summarization has been described with independent topical events that are temporally and topically coherent and this interaction network has been summarized by large events [156]. A collection of k-events has been selected that maximizes the node coverage and this task maps to the finding the maximum set cover solution [156]. The above description summarizes related work on the MDL principle and maximum cover set solutions for representing models.
Our approach to provenance graph structure and classifier learning
-
Step 1:
An event or news report may be associated with more than single story. The time order of occurrence of events and their similarities is derived from a joint entropy distance measure linking the events. This measure has been used to calculate the dependency between these events. The provenance graph of events can thus be described. Only those events are considered that have one or more of the necessary seed terms used to describe a topic of interest to the user. Here the topic specific words that remove other words commonly occurring across topics are identified. The topic words and the entity words together qualify in determining topic or sub-topic association of documents or news reports. The path to the final goal content or event or topic node represents the learned path to the recognized goal topic node. Thus from the description in related work in topic or story or event and from the brief description that appears in this work we have provided a rationale for constructing a provenance graph from events from establishing links between these events or stories related to these events.
-
Step 2:
This information is also utilized to define the probabilistic weight of link connecting any two states or nodes and a distance measure separating these nodes present in the graph from applications of Eqs. 12–17.
-
Step 3:
The probabilistic weight connecting two nodes in a graph is utilized to cluster the nodes in the graph using application of Steps 1 to 4.
-
Step 4:
Only sub-goal states of each cluster are identified using application of method described in an earlier section on our approach to clustering of nodes of this work.
-
Step 5:
Bayesian methods utilize these probabilities defined for this revised graph obtained at step 4 to build the graphical structure satisfying a constraint such as mdl, bd, bic for all nodes present in a provenance graph from the dependencies describing the discovered event or topic or story model graph. Here more than one provenance graph models may be produced from the application of this procedure to a time window. Here interpreting clustering of documents to represent model cluster and bayesian probabilistic approach of assignment of document to a model cluster as described in [143] are relevant
-
Step 6:
The classifier policy path is detected from this structure by application of Q Learning algorithm for learning provenance path as described in an earlier section. Q Learning procedure has earlier been described to produce a plan of event or topic executions from the start topic or event to the goal topic or event. Many such alternative plans may be generated from this procedure which may be optimal or suboptimal.
-
Step 7:
Provenance Graph model(s) are constructed for a time window determined from window length and refresh rate. The time windows may also be defined based on a concept of sliding window with overlaps between the models defined for these time windows. The time stamp of stories describing an event may cause a more detailed definition of an event or a topic at a later time window sometimes with some overlap between these definitions of an event or topic causing an overlap between the provenance graph models described for these time windows.
-
Step 7.1:
A collection of classifiers has been identified in every model with this training information. Calculate a distance measure from these policy classifiers in a model from all policy classifiers in every other model using the SEC measure (distance as reciprocal of similarity value) concept for paths. Calculate a weight measure associated with the identified classifiers in every model based on these similarity measures, where classifiers more similar to others have stronger weights. This weight can also be determined from application of Eqs. 15 and 16. These identified classifiers which provide cover for all the classifiers [33] are considered for model merging.
-
Step 8:
This classifier weight adjustment for classifying “difficult” data is also included in the procedure for reward calculation in this work. This procedure also can provide insights into split or split-merge in topic/story definition. The reward calculation procedure as described earlier may also be sufficient in determining “important” or “difficult” with identifying edges or links of distinctive relevance data for classifying purposes.
-
Step 9:
The rationale for merging or linking provenance graph models has been provided in earlier steps.
-
Step 9.1:
Use this weight measure to compare the significance of the collection of classifiers detected in each model. Only those policy path classifiers which are more or most significant or those providing a minimum cover set for all classifiers are retained. A classifier in a model is compared with all classifiers in every other model learned and the best match is considered for calculating significance of a classifier. The procedure is applied to generate a reduced collection of classifiers that provides a cover set for all classifiers. Each one of these significant classifiers present in the minimum cover set and which are associated with one or more model(s) are compared with all such significant classifiers which are associated with other such model(s) and a similarity measure is computed for these models from their classifiers using a correlation ratio. Alternatively, this similarity measure can be calculated as Pearson Product moment correlation measure between a feature vector of edges describing a classifier associated with a model with those of another model. This is derived as the ratio of the covariance measure calculated between these vectors of values and the product of variance measures calculated for the individual vectors of values. A covariance measure defined between a single value and a vector of values is the sum of the covariance measures calculated between the single value and each element of the vector. Also the covariance measure is defined to have a commutative property that is used for this calculation. Only those model pairs are candidate for merging where this similarity measure exceeds a certain threshold value. Alternatively a distance or similarity measure may be described between a pair of these models using a graph edit distance measure. The models are clustered using pair-wise similarity or distance measures between a model pair as described earlier. A hierarchical agglomerative approach can be adopted for this with merging of most similar model pairs at every hierarchy where a merged model is derived using procedure a described later in Steps 9.2, 9.3 and 11.
-
Step 9.2:
A partitioning may be adopted for this model merging using the K-medoids approach where a model is selected as a representative for the cluster of models and this clustering method is realized with a partitioning around medoids (PAM) or CLARA or CURE. The algorithm which starts with randomly collected seed models and improves the clustering with a greedy strategy by randomly selecting a model as a medoid which reduces the measure associated with the absolute error criterion representing the sum of dissimilarities. Here all models present in a cluster are merged using procedure described in Step 11. The ideas described in this paper where the graph models are first partitioned using a partitioning approach like CLARA or PAM or CURE and thus identified representative models in each cluster are merged using a Bayesian scoring approach. Alternatively, models are merged from using their property of markov equivalence producing an essential model. The algorithm described in Step 11 is required to complete the definition of merged model.
-
Step 9.3:
Please refer steps 9.3.1, 9.3.2 and 9.3.3 below.
-
Step 9.3.1:
A system of similar models detected in Step 9.1 forming a cluster can be merged using application of procedure defined as follows on a Bayesian scoring approaches. Here tasks unit steps or activities are linked or connected that leads to maximum improvement in the selected Bayesian score metric value and the path describing the changes to the graphical structure is determined from application of hill climbing, or simulated annealing or TABU search as described in [102] with random restarts for avoiding the solution from getting trapped in local minima.
-
Step 9.3.2:
Alternatively the identified system of significant models represented with provenance graphs can be merged using techniques briefly described here. Here for graphs that are Markov equivalent and hence similar, a composite graph is constructed as the essential graph.
Here the essential graph thus constructed has trust weight link connecting common nodes in the models which are candidates for merging and this link weight is defined using a procedure described in Step 11.
-
Step 9.3.3:
The best classifier with maximum probability of selection for the clustered model is identified from application of Eqs. 13−16. Here the probabilistic trust weight link connecting any two nodes in the merged provenance graph, is derived as the weighted average of probabilistic trust values associated with common links or edges connecting the same pair of nodes corresponding to similar classifiers where weights are determined from Steps 2 and 3 of this algorithm. This approach is also applicable for computing weighted average of expected Q values of nodes with weights determined from the sample or model from where this information is retrieved. The candidate models participating in the procedure for this merging have policy path classifiers which are either only member of any set of similar classifiers or are members in the minimum cover set of such similar classifiers that are present in the cluster of models. Also for the cause of merging models, the medoid model in a cluster may be selected as the initial model. The other present models in the cluster may be merged iteratively such that the intermediate model minimally increases the error defined based on a maximum likelihood measure [113]. Alternatively a Bayesian Model Merging principle may be adopted such that product of model prior and likelihood measure associated with the model is maximized. Also the models may be ordered based on their relevance to a subject or topic using principles of Bayesian Factor or Bayesian sampling approaches. The models may be merged in this order within a cluster. Here a unified provenance graph model is thus constructed where model unification procedure may be applied for merging element models from incrementally acquired information obtained at more recent time intervals as has been described earlier.
-
Step 10:
The final classifier accuracy obtained from the similar models participating in merging is the vector of weighted average of the selected classifier feature edge weights that are common to the classifiers that are getting merged. This measure indicates the probability of selecting a candidate classifier (accuracy) derived from the system of significant classifiers.
-
Step 11:
Alternatively, previously calculated weights associated with features describing the nearest subtopic or sub-story or goal or topic node obtained using method described in this work can indicate a change of policy where previously suboptimal action can become optimal and vice versa. A Q Learning procedure is applied to the revised provenance graph model derived after merging of the significant models. If a newly calculated policy path is not destined to the same goal topic or event node then it is removed from farther consideration. Alternatively, the expected gain from executing action at a state is the difference between the expected Q value as reward value calculated from Eq. 16 from executing the changed action and earlier Q value associated with taking optimal action at the state. The Value of Perfect Information (VPI) associated with taking action at a state is the weighted sum of expected gain measure calculated for all discrete probabilities associating the state, action pair which separates the best classifier policy value and the considered classifier policy value. Here a strategy is selected that maximizes the sum of expected Q value for a state action pair and Value of Perfect Information associated with state action pair The alternatives to the best classifiers which when considered are re-ranked from this expected gain measure. The candidate set of classifiers which form the minimum cover set of classifiers or is the only member of a set of classifiers for a unified provenance graph model are considered for specifying recognition paths for relevant goal topics where these paths may be both general or discerning.