We formulate the problem of influence maximization on a graph G = (V, E) as follows: we seek to select the top-K users (i.e., the seed set) from the graph G, \(S \subseteq V\), so that the influence spread is maximized. This problem has been studied in several research papers [9, 21] and has been proven to be NP-hard. In this paper, we study the influence maximization problem for directed and undirected graphs under two cascade models, namely the independent cascade model and the linear threshold model. Figure 1 shows our system model and how the seed set is selected in general for directed and undirected graphs. The details of the selection process for each seed set are given in Algorithm 1 and Algorithm 2.
We make two assumptions: each selected seed is separated from the next seed by a multi-hop distance \(D=r-1\), and all immediate neighbors of a selected seed are excluded. The first assumption is motivated by coverage: seeds with different regions of influence reach users that the previously selected seeds do not. This can be observed in real-world applications. For instance, a small company that distributes free samples of a product to users separated by a certain multi-hop distance is likely to reach more distinct users, possibly from different backgrounds, than one that targets seeds close to each other. The second assumption comes from the observation that users have the greatest influence on their immediate neighbors, who are the most likely to adopt the diffused information; as the multi-hop distance between a seed and other nodes increases, the rate of adoption is likely to decrease. For this reason, we exclude all neighbors of a selected seed, since they may influence the same users as the seed itself.
We assume that each network has a selection threshold, defined formally in Eq. (1), that must be taken into account when selecting the seed set. This threshold depends on the diameter of the graph and determines the minimal score that a candidate node must have to be selected as a seed. The threshold is motivated by the observation that selecting the seed set without any constraint may result in the choice of low-ranked nodes, which decreases the influence. Setting a threshold therefore tends to increase the influence, as our experiments show. The threshold \(th_I\) is given as follows:
$$\begin{aligned} th_I=\frac{d*r}{2}-r \end{aligned}$$
(1)
where r is the graph radius and d is the graph diameter. Here the diameter and radius determine the number of nodes to which a selected seed should be linked. According to Eq. (1), a node is selected only if it is linked to at least half of radius × diameter nodes, minus the initial and farthest nodes from the candidate node.
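As an illustration of Eq. (1), the following minimal sketch computes the radius, the diameter, and the resulting threshold for a small graph. It assumes a connected, unweighted graph stored as adjacency lists; the helper names `eccentricity` and `selection_threshold` and the toy path graph are hypothetical and not part of the original algorithms.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest BFS distance from source (assumes a connected graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values())

def selection_threshold(adj):
    """Return (r, d, th_I) with th_I = d * r / 2 - r, following Eq. (1)."""
    ecc = [eccentricity(adj, v) for v in adj]
    r, d = min(ecc), max(ecc)  # radius and diameter
    return r, d, d * r / 2 - r

# Hypothetical toy graph: the undirected path 0-1-2-3-4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
r, d, th_I = selection_threshold(path)
print(r, d, th_I)
```

For this path, the middle node has eccentricity 2 and the end nodes have eccentricity 4, so r = 2, d = 4, and the threshold evaluates to 4·2/2 − 2 = 2.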
Directed graph algorithm for top-K influential users DERND D-hops
In this subsection, we present an algorithm for selecting the top-K influential users in a directed graph. The algorithm is based mainly on the indegree and outdegree of each node, counted up to the radius distance. Across our experiments, we observed that the ratio of the indegree to the outdegree of a node, both counted up to the radius, is a useful indicator of its importance. This ratio values the indegree of a node while discounting the effort that the node spends in linking to other nodes. A node is more important if it has a large indegree; taking a large outdegree into account as well identifies important nodes more accurately, since some nodes link to others merely as a follow-back. The ratio thus quantifies the importance of a node from the users that link to it and the effort the node makes in following other nodes. Formally, we write the indegree of node v over r hops, denoted by \(Ind_{r}(v)\), as follows:
$$\begin{aligned} Ind_r (v)=\sum _{u \in neig_r^I (v)}A(u,v) \end{aligned}$$
(2)
where A represents the adjacency matrix.
\(neig_{r}^{I}(v)\) represents all indegree neighbors of node v, from its immediate indegree neighbors up to the graph radius r.
In the same manner, we express the outdegree of node v over r hops, denoted by \(Outd_{r}(v)\), as follows:
$$\begin{aligned} Outd_r (v)=\sum _{u \in neig_r^O (v)}A(v,u) \end{aligned}$$
(3)
where \(neig_{r}^{O}(v)\) represents all outdegree neighbors of node v, from its immediate outdegree neighbors up to the graph radius r.
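The two quantities can be sketched with a depth-limited breadth-first search. This sketch reads Eqs. (2) and (3) as counting the distinct nodes that reach v (respectively, are reached from v) within r hops; that reading is an assumption, and the helper names and the toy graph are hypothetical.

```python
from collections import deque

def nodes_within(adj, source, hops):
    """Nodes reachable from source in 1..hops steps along adj."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == hops:  # do not expand beyond the hop limit
            continue
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[source]
    return set(dist)

def reverse(adj):
    """Reverse every edge so that BFS follows incoming links."""
    rev = {v: [] for v in adj}
    for u, outs in adj.items():
        for v in outs:
            rev.setdefault(v, []).append(u)
    return rev

def ind_r(adj, v, r):
    """Assumed reading of Eq. (2): nodes that reach v within r hops."""
    return len(nodes_within(reverse(adj), v, r))

def outd_r(adj, v, r):
    """Assumed reading of Eq. (3): nodes reached from v within r hops."""
    return len(nodes_within(adj, v, r))

# Hypothetical toy graph with edges a->b, b->c, d->c, c->e.
follows = {"a": ["b"], "b": ["c"], "c": ["e"], "d": ["c"], "e": []}
print(ind_r(follows, "c", 1), ind_r(follows, "c", 2), outd_r(follows, "c", 2))
```

In the toy graph, node c has two immediate in-neighbors (b and d), gains a third (a) at two hops, and reaches only e along outgoing edges.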
The DERND D-hops algorithm, called the Directed Extended Radius-Neighborhood D-hops algorithm, extends our previous work [9]: it reduces the time complexity compared to the previous version and improves the influence spread over state-of-the-art heuristic algorithms.
In this extended version of our algorithm, we filter the selection of influential users with a selection threshold that depends on the structural properties of each graph, and we separate every two consecutive seeds by a distance D. The distance D is defined by taking all neighbors of the selected seed \(S_d\) from \(d_{I+1}\) hops up to D hops. This ensures a certain quality of seed selection: the algorithm does not select a node whose score is under the selection threshold, and it separates consecutive seeds by a number of hops to avoid selecting nodes that lie close together in the same region of influence.
The algorithm starts by initializing the seed set S to the empty set (Line 1). Then it computes, for each vertex in graph G, the ratio of indegree over outdegree + 1; the added one avoids a zero denominator. This ratio compares the benefit received by the node with the effort it performs (Line 2). After assigning to each node the ratio that characterizes its importance in a directed graph, we set the size of the free sample that we are ready to offer; this size depends on the available budget and on how much can be offered free of charge while maximizing the profit. Next, we select the first seed in the queue by sorting the computed scores in decreasing order, which results in the selection of the node with the highest score ratio (Lines 3–4).
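The scoring and first-seed selection described above can be sketched as follows. The sketch assumes the r-hop indegree and outdegree counts are already available as dictionaries; the names `score_ratio` and `rank_by_score`, the tie-breaking by node name, and the sample values are hypothetical.

```python
def score_ratio(ind, outd):
    """Score of each node: indegree / (outdegree + 1).

    The +1 avoids a zero denominator for nodes without outgoing links.
    """
    return {v: ind[v] / (outd[v] + 1) for v in ind}

def rank_by_score(scores):
    """Nodes in decreasing score order; ties broken by name (an assumption)."""
    return sorted(scores, key=lambda v: (-scores[v], str(v)))

# Hypothetical r-hop counts for five nodes.
ind = {"a": 0, "b": 1, "c": 3, "d": 0, "e": 4}
outd = {"a": 3, "b": 2, "c": 1, "d": 2, "e": 0}

scores = score_ratio(ind, outd)
ranking = rank_by_score(scores)
first_seed = ranking[0]  # node with the highest score ratio
print(ranking, first_seed)
```

Node e receives many links while following nobody, so it gets the highest ratio (4 / 1) and becomes the first seed; node c follows with 3 / 2.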
Then the selected seed Seed, the node that will initiate the cascade process, is added to the seed set S (Line 5), and the last selected seed is stored in sed. Thereafter, SeedK is initialized to the empty set; it will contain all nodes whose score surpasses the selection threshold. The selected seed sed is then excluded from the graph data \(data-score_r\) (Lines 6–8). Next, the algorithm checks whether the size of S differs from the fixed size K and assigns to sed the last seed added to S; sed is the parameter that stores the most recently selected node and serves as the input for extracting its ego network.
Next, \(neigd_s\) is computed; it represents all indegree nodes of the selected seed sed, from its immediate indegree neighborhood up to the nodes \(d_s\) hops away. In other words, this amounts to creating an ego network for sed that spans its immediate indegree neighborhood up to the nodes connected to it by incoming links across \(d_s\) hops. The same is done for \(neigd_I\). Then Selected-neighborhood, the set of all nodes that are candidates for the seed set S, is obtained from the last selected seed by taking the neighborhood of sed up to D hops and subtracting the neighborhood of sed up to one hop, based on indegree centrality (Lines 9–13).
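The candidate region described above (nodes within D hops of the seed, minus its immediate neighbors) can be sketched as a set difference over breadth-first distances. The sketch assumes the graph is given as a map from each node to its in-neighbors; the names `hop_distances`, `candidate_ring`, and the toy chain are hypothetical.

```python
from collections import deque

def hop_distances(in_adj, seed, max_hops):
    """BFS distances from seed along incoming links, up to max_hops."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:  # stop expanding at the hop limit
            continue
        for nxt in in_adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def candidate_ring(in_adj, seed, D):
    """Nodes within D hops of seed, excluding seed and its 1-hop neighbors."""
    dist = hop_distances(in_adj, seed, D)
    return {v for v, h in dist.items() if 2 <= h <= D}

# Hypothetical chain d -> c -> b -> a -> s, stored as in-neighbor lists.
in_adj = {"s": ["a"], "a": ["b"], "b": ["c"], "c": ["d"], "d": []}
ring = candidate_ring(in_adj, "s", 3)
print(sorted(ring))
```

With D = 3, node a is excluded as an immediate neighbor of s and node d lies beyond the hop limit, so the candidates are b and c.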
If the selected seed sed has only immediate neighbors, the algorithm picks another seed sed from the graph data and proceeds as before: it takes the neighborhood up to D hops minus all immediate neighbors, and matches Selected-neighborhood against the graph nodes to select the candidate with the highest score ratio (Lines 14–20). Then the algorithm collects in SeedK the nodes with the highest score, selects from them one seed that surpasses the selection threshold, and excludes the selected seed from the graph data (Lines 21–27). If none of these conditions yields a seed, the algorithm selects from the graph data the node sed with the highest score, excludes it from the graph data, and finally returns the seed set S (Lines 28–33). Lines 3–8 run once for each \(K=10, 20, 30, 40, 50\), and Lines 9–33 run until K seeds are obtained. The reason is that, at each multiple of ten of the seed size K, the algorithm needs to select the most central node with the highest score and then executes the remainder of the procedure based on the multi-hop distance, the radius-neighborhood degree, and the selection threshold.
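The overall selection loop can be approximated by the simplified sketch below. It is not the authors' Algorithm 1: it omits the restart at each multiple of ten of K and the re-picking of a seed that has only immediate neighbors, and it treats a score equal to the threshold as acceptable. The function `select_seeds`, the callable `ring_of` (returning the candidate region of a seed), and the sample data are all hypothetical.

```python
def select_seeds(scores, ring_of, threshold, K):
    """Simplified greedy selection with a hop-distance ring and a threshold."""
    remaining = dict(scores)  # nodes still eligible, with their scores
    seeds = []

    def take(candidates):
        # Highest score wins; sorting first makes ties deterministic.
        best = max(sorted(candidates), key=remaining.get)
        seeds.append(best)
        del remaining[best]  # exclude the chosen seed from the graph data

    take(remaining)  # first seed: highest score overall
    while len(seeds) < K and remaining:
        last = seeds[-1]
        eligible = [v for v in ring_of(last)
                    if v in remaining and remaining[v] >= threshold]
        # Fall back to the best remaining node if the ring yields nothing.
        take(eligible if eligible else remaining)
    return seeds

# Hypothetical scores and candidate rings.
scores = {"a": 5, "b": 4, "c": 3, "d": 1, "e": 2}
rings = {"a": {"c", "d"}, "c": {"e"}}
ring_of = lambda v: rings.get(v, set())

print(select_seeds(scores, ring_of, threshold=2, K=4))
```

Here a is chosen first, c is the only node in its ring that meets the threshold, e follows from the ring of c, and b is picked by the fallback because the ring of e is empty.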
Top-K influential users selection UERND D-hops algorithm for undirected graph
In this algorithm, we use the radius-neighborhood metric from [9], because it is efficient at identifying the nodes that matter most for influencing others to adopt a behavior or product, and thus at increasing the influence spread within the network. This metric relies mainly on counting the degree of a node from its immediate neighbors up to the graph radius. The radius-neighborhood degree of a node v, as introduced in [9], is written as:
$$\begin{aligned} deg_{r}^{U} (v)=\sum _{u \in neig_{r_h=1}^{r}(v)} deg(u) \end{aligned}$$
(4)
The metric starts by identifying the neighbors of each node v at \(r_h=1\) hops, that is, its immediate neighbors, and increments \(r_h\) by 1 at each step up to the graph radius r; the degree of each node is then computed by counting all its neighbors from the immediate ones up to the graph radius. The notation \(neig_{r_h=1}^{r}(v)\) represents the neighbors of node v counted from the immediate neighborhood \(r_{h}=1\) up to the radius r of the graph. The notation deg(u) represents the degree of node u.
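Eq. (4) can be sketched as a depth-limited breadth-first search followed by a sum of degrees. The sketch assumes an undirected graph stored as adjacency lists and that each neighbor is counted once even if it is reachable by several paths; the function name and the toy graph are hypothetical.

```python
from collections import deque

def radius_neighborhood_degree(adj, v, r):
    """Sum of deg(u) over all nodes u within 1..r hops of v, as in Eq. (4)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        node = queue.popleft()
        if dist[node] == r:  # do not expand beyond the radius
            continue
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sum(len(adj[u]) for u in dist if u != v)

# Hypothetical toy graph: a star c-{x, y, z} with a tail z-w.
adj = {"c": ["x", "y", "z"], "x": ["c"], "y": ["c"],
       "z": ["c", "w"], "w": ["z"]}

print(radius_neighborhood_degree(adj, "c", 1),
      radius_neighborhood_degree(adj, "c", 2))
```

For node c, the immediate neighbors x, y, and z contribute degrees 1 + 1 + 2 = 4; extending to two hops adds w with degree 1, giving 5.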
The main idea of Algorithm 2 is straightforward and is, to some extent, the same as that of Algorithm 1 for a directed graph. The procedure for selecting the top-K influential users is therefore the same, except that here both indegree and outdegree are considered, since in an undirected graph the relationship between users is mutual. The algorithm employs the radius-neighborhood degree introduced in [9] and improves efficiency by determining a selection threshold for each graph and by restricting seed selection to nodes 2 to \(r-1\) hops away from the currently selected seed. This serves two objectives: a larger multi-hop distance from the currently chosen seed offers a wider choice of seeds, and the selection threshold avoids selecting negligible nodes that have only a marginal influence spread. In the next section, we provide a complexity analysis of the two algorithms for directed and undirected graphs, to assess how well they perform on large-scale graphs.
It is important to evaluate the performance of an algorithm, that is, how well it achieves its goals and benefits in minimal time. One way to do so is to test algorithms on the same input and observe which one provides the best performance in terms of benefit and running time. However, it is likely that some algorithms perform better than others on certain inputs. We address this issue in the current work by applying our algorithms to datasets with different densities and properties, over directed and undirected graphs. Performance analysis normally covers both time and space complexity. In this paper, we cover only time complexity; in other words, we perform an asymptotic analysis and derive the worst-case and average-case time complexity of the two proposed algorithms.
We proceed first with the analysis of the undirected graph algorithm for top-K influential users selection, starting with the time required to complete the calculation of the seed set S. As depicted in Algorithm 1 below, the procedure uses the radius-neighborhood degree introduced in our paper [9], which relies on computing the neighborhood of each vertex from its immediate neighbors up to radius hops. For each vertex u, it computes the set \(neig_r(u)\) by detecting all paths between u and its neighbors up to the radius r, and then records for the vertex the length of \(neig_r(u)\). In this way, the degree of each vertex is computed from the immediate neighbors of the candidate vertex up to the graph radius. This requires \(r(1+2+\cdots +n)\) operations, where r is the number of hops between the candidate vertex and the graph radius. Next, the length of the neighborhood of each vertex, from its immediate neighbors to those r hops away, is computed, which requires nL operations.
The runtime complexity of line 1 of Algorithm 1 can then be computed as \(r(1+2+\cdots +n)+nL=r\,\frac{n(n+1)}{2}+nL=O(rn^2+nL)\). Line 1 of our algorithm thus requires an overall time complexity of \(O(rn^2+nL)\), where n is the number of vertices in graph G and L is the length of the neighborhood list of each vertex. Most of the time complexity comes from line 1 of the algorithm, as it grows with the graph size. The while loop has a runtime complexity of \(O(k\log K)\), since the two loops are nested and the number of iterations of the inner for loop is independent of that of the outer while loop. Thus, the time complexity of Lines 9–23 is the time required by the inner for loop, which takes k, multiplied by that of the outer while loop, which takes \(\log K\). Therefore, the time complexity of Lines 9–33 is \(O(k\log K)\), where k is the length of the resulting neighborhood of each candidate seed and K is the seed set size given as input to our algorithm. Overall, the time complexity of our Algorithm 1 (and Algorithm 2) is \(O(rn^2+nL+k\log K)\).