Large-scale forecasting of information spreading

This research proposes a system combining several components for parallel modelling and forecasting of processes in networks, with data assimilation from the real network. The main novelty of this work is the assimilation of data for forecasting processes in social networks, which improves the quality of the forecast. The social network VK was used as a source of information for determining the types of entities and the parameters of the model. The main component is a model based on a combination of internal sub-models for more realistic reproduction of processes at the micro (single information message) and meso (series of messages) levels. Moreover, the results of the forecast must not lose their relevance during the calculations. In order to obtain the forecast for networks with millions of nodes in reasonable time, the simulation process has been parallelized. The accuracy of the forecast is estimated by the MAPE and MAE metrics at the micro scale and by the Kolmogorov–Smirnov criterion for aggregated dynamics. The quality in the operational regime is also estimated by the number of batches with assimilated data needed to achieve the required accuracy and by the ratio of calculation time to the forecasting period. In addition, the results include experimental studies of functional characteristics, scalability, and the performance of the system.

Introduction
Severiukhina et al. J Big Data (2020) 7:72
Information may be presented in different contexts (e.g. thematic communities of Online Social Networks (OSN), such as a community of football fans or classical music lovers) and for various audiences. Therefore, the model needs to be tailored to a given context.
We distinguish between several levels of modelling and forecasting information processes in OSN (Fig. 1):
• micro-level: information message (IM);
• meso-level: community (a sequence of information messages in one community);
• macro-level: information spread in the form of IMs between a set of communities.
At the micro level, the dynamics of the cascade around an individual post is usually studied. At the meso level, one may explore such things as the influence of publication time on the impact of various messages and the preferences of different segments of the audience. At the macro level, one may observe the exchange of information between communities.
In this research, we present and evaluate an agent-based forecasting system for information spreading at the micro and meso levels in a large-scale OSN. This system reproduces the aggregated impact of information messages from the individual agents' actions. To perform initial identification of the model parameters, retrospective data from the OSN has been collected, and the groups of agents and types of messages for a given community have been defined. During the forecast in the operational regime, we track the actual status of the information process using web crawlers and feed this data into the model running on the supercomputer to tune the forecast in real time. This study investigates: (i) the possibility of reproducing the observed dynamics from individual reactions, (ii) the quality of forecasts in terms of accuracy and earliness, (iii) the impact of data assimilation on the quality of forecasts, (iv) the scalability and performance of the prediction system. Experimental studies were performed on the Lomonosov supercomputer [1] using data collected during 2018 for two massive sets of news and charity communities.
The rest of the paper is organized as follows. Relevant information and related work are presented in Section 2. Section 3 describes the model, the architecture of our system, and the methods for assessing forecast quality. Section 4 gives details on the dataset, forecasting experiments, and scalability and performance analysis. Section 5 presents conclusions and a discussion of further research.

[Fig. 1: Online social networks - micro, meso and macro levels]

Related research
To create data-driven models of processes in a complex network, it is essential to address two intertwined issues: (i) the most appropriate way to define agents (e.g. individual users or communities), (ii) the structure and parameters of the predictive models. In order to specify agents in the model, it is necessary to determine the features of their behavior as well as the relationships between them.
The reaction of each user in a social network is unique and can be determined by their age, social status, preferences, internal state, history of interaction with the information source, and other characteristics. As users interact with various items over time, users' and items' features may change [2]. Moreover, the influence of media and social connections on the dynamics of user opinions should be noted [3]. Furthermore, the echo chamber effect exists in networks: polarized opinion in a cluster of nodes leads to the diffusion of complex contagions, for example, fake news [4]. Some individual characteristics may influence processes in networks: heterogeneity of stateful agents [5] or agents' curiosity [6]. In works [7, 8], the authors proposed an ontology-based approach to extract the semantics of textual data and define the domain of the data. More precisely, they semantically analysed the social data at the entity and domain levels. The proposed approach was evaluated on a public dataset collected from Twitter.
Although users differ in the number and ratio of responses to messages, OSNs may provide limited information about the activities of a single user (as opposed to properties of sub-populations of users). Therefore, the creation of reaction models at the level of individuals is hampered both by the lack of data and by the difficulty of distinguishing and defining all the factors that determine the users' response. The problem of reproducing agents' reactions can be solved by clustering users by their level of involvement, roles in the community, and profile parameters. For example, in [9] the authors classify users into four groups: celebrities, organizations/media accounts, grassroots stars, and ordinary individuals. In our study, data-driven agent-based models are developed; within these models, the parameters of clusters are learned from the history of individualized responses within a community.
The topology of the network affects information processes. The agent-based modelling approach allows investigating bottom-up behavior and performing what-if analysis. In this case, agents (micro level) create emergent network behavior (macro level) [10]. In article [10], the authors propose an approach of complex agent networks that combines agent-based models and network approaches. Moreover, social networks exhibit the "small world" effect, scale-free degree distributions and modular structure [11]. Classical generative models like the Erdos-Renyi, Watts-Strogatz and Barabasi-Albert models cannot reproduce all properties of real-world social networks. Moreover, the conductivity of links between entities may influence information dissemination: users who have more common friends may have a greater possibility for the dissemination of information [12]. In our work, we use real networks as inputs for the model.
Models of information spreading in networks can be divided into two categories: explanatory and predictive. In the explanatory models, information spreading is often considered in the same way as an epidemic process [13], where nodes can be in one of several possible states at a given moment, for example, susceptible, infected, removed (the SIR model). However, Weng [14] states that diseases differ from information: diseases spread as simple contagions, while information spreads as complex contagions, because the latter is affected by social reinforcement and homophily.
According to [13], there are three types of predictive models for OSN: the independent cascade model (ICM), the linear threshold model (LTM) and game-theoretic models. In the first type of model, an inactive node can be activated by an active node with some predefined probability. The LTM is a more complicated case: every interaction between nodes provides a cumulative effect on a node's state. In the last type of model, there are specific restrictions and various agents' strategies. For example, article [15] aims to explain how human factors impact competitive information dissemination.
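As an illustration, the independent cascade dynamics described above can be sketched as follows (a minimal, illustrative implementation; the toy graph, activation probability and seed set are hypothetical and not taken from the paper):

```python
import random

def independent_cascade(graph, seeds, prob, rng=random.Random(0)):
    """Simulate the independent cascade model (ICM).

    graph: dict mapping node -> list of neighbour nodes
    seeds: initially active nodes
    prob:  predefined activation probability for every edge
    Returns the set of all nodes activated during the cascade.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                # Each newly active node gets a single chance to
                # activate each still-inactive neighbour.
                if neighbour not in active and rng.random() < prob:
                    active.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return active

# Toy example: a small star graph with certain activation.
toy = {0: [1, 2, 3], 1: [], 2: [], 3: []}
print(independent_cascade(toy, [0], 1.0))  # all nodes activate
```

With prob = 1.0 the cascade reaches every neighbour, while prob = 0.0 leaves only the seed active; intermediate values reproduce the probabilistic activation the ICM is built on.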
One source of heterogeneity is the topicality of the message contents. Topic-aware independent cascade and topic-aware linear threshold models were proposed in [16]. These models have different topic distributions and strengths of node-to-node influence depending on the topic. In other research, posts or news have a defined virality coefficient that captures the popularity of the message and affects the probability of sharing. In [14], the authors predict the virality of memes based on early spreading patterns in terms of community structure. In this case, investigation of activity patterns helps to detect viral memes [17].
In recent years, a significant number of papers have been dedicated to more complex methods. Prediction of the number of shares based on temporal behavior patterns of users was studied in [18]. A machine learning approach with the passive-aggressive algorithm for predicting users' behavior in Twitter was proposed in [19]. Kefato et al. proposed a novel algorithm called CAS2VEC [20] that models information cascades as time series and discretizes them using time slices. In [21], the authors trained a probabilistic collaborative filtering model to predict future retweets using Twitter data. In our work, we propose a forecasting method based on an agent-based model and data assimilation for increased accuracy.
To forecast processes on large graphs, one needs to parallelize computation in order to obtain the result of the forecast in time. Parallelization can be applied to different steps of simulation, from generative models (e.g. the parallel Chung-Lu model [22]) to models of information spread (e.g. parallel SIR [23]). Moreover, different hierarchical synchronous parallel models for graph analytics can be used [24]. In this study, we modify our previous algorithm for parallel simulation of dynamical processes on stochastic Kronecker graphs [25] to support arbitrary topologies and complicated models of agents' behavior.
Summarizing, our goal is to combine the advantages of complex agent networks with a data-driven approach, learning the topology and the parameters of agents from the data. To tackle the overall complexity of the resulting model and to adapt it to changing conditions, we constantly tune its parameters using data assimilation. A scalable parallel implementation is aimed at solving the problem of operational forecasting of information messages in OSN. This research is an attempt to propose a whole methodology: from retrospective data collection to obtaining the results of forecasts. In addition, the proposed approach supports fine-tuning of the properties for various OSN contexts, temporal dynamics and different scales of simulation.

Dataset description
For this study, the datasets were collected from the VKontakte social network using the web crawler. The data includes a charity community and a news community (the latter contains two types of IMs: regular IMs and IMs with advertisements), with IMs from April to May 2018 as historical data and several months from August as real-time data. The first community has 295 k followers and 100 IMs; the second community has 1900 k followers and 1500 k IMs.
The key features of the retrospective data are presented in Table 1. This dataset was used to train the basic parameters of the models.
Communities differ both in the proportion of different reactions and in the sources of reactions (Fig. 2). For instance, the charity community is distinguished by a high share of reposts made by communities (these shares are made by administrators of individual charities or small communities), as well as a high share of reactions not from community subscribers, while in the news community there is larger activity of subscribers. Figure 3b shows an example of a profile of user reactions containing the parameters of reactions to different topics. The selection of behavior types was carried out in a space proportional to the selected types of reactions (Fig. 3a); e.g. the point (0.2, 0.3, 0.6) means that the user left a response of the first type to 20% of IMs, a response of the second type to 30% of IMs and, finally, a response of the third type to 60% of the available IMs.
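The mapping of a user's reaction history to a point in this reaction-proportion space can be sketched as follows (an illustrative sketch; the counts and the rounding to two decimals are our assumptions):

```python
def reaction_profile(user_reactions, n_ims):
    """Map a user's reaction history to a point in the
    reaction-proportion space (likes, comments, shares).

    user_reactions: counts of IMs the user liked, commented on
    and shared; n_ims: number of IMs available to the user.
    """
    return tuple(round(user_reactions.get(kind, 0) / n_ims, 2)
                 for kind in ("like", "comment", "share"))

# The example point from the text: out of 10 available IMs the user
# liked 2 (20%), commented on 3 (30%) and shared 6 (60%).
profile = reaction_profile({"like": 2, "comment": 3, "share": 6}, 10)
print(profile)  # (0.2, 0.3, 0.6)
```

Behavior types are then selected by grouping users whose profile points lie close together in this space.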
To sum up, as the output we obtained the following characteristics: the networks of communities and users, and their actions used to determine temporal activity and attitudes to various topics.

Description of main entities in model
There are three main entities in online social networks: communities, users and IMs (more details are given in our previous study [23]). The network of entities in cyberspace is a directed graph, where the vertices are a set of communities and a set of users, and the edges are subscription or friendship links. The unit of information is the IM, which can be transferred between vertices. The main characteristics of the entities are presented in Fig. 4.
Every IM has a source of information and an identifier, a time of publication, a topic, as well as values of the virality coefficient, which differ for the possible types of reaction to messages, for example, the virality of sharing. We define virality as the potential impact of the message, which implies the probability of reaction (as an analogue of virality in epidemic spreading). A user is a vertex (receiver of information) which can respond to the received IM. For online social networks, three basic types of reactions are available: like (approval of content), comment (discussion of information), and share (distribution of the IM). Moreover, each user is described by types of daily/weekly activity and a set of reactions to a set of IMs. A community is a network vertex that broadcasts an IM to multiple users (community subscribers). It is described by a set of subscribers and a set of possible topics for IMs. Each topic has a probability distribution of publication by the community depending on the time of day and the day of the week.
The input parameters of the model are the network of subscriptions and friendships as well as the parameters of the internal models of communities and users. The network can be created artificially or extracted from a real social network. For the second case, we used the web crawler and collected information about subscribers of communities and friends of users. There are three internal models representing different drivers of the information process and defining the behavior of entities: a model of IM generation, a model of activity and a model of reaction. The generative model of a community reproduces its publication activity. For a given community, it defines temporal patterns of publication (frequency during a day/week) for IMs of different types according to the data presented on their page. The model of user activity determines the probability that the user will be online at different intervals of the day and may vary for different types of users (for example, early birds and night owls). For this model, we used the number of a user's reactions at different intervals of a day. The reaction model has parameters depending on the type of user and IM, for different types of reactions. These parameters were set according to the digital traces of users in different communities. All of the parameters above are estimated using historical data from the social network. More details on the implementation of the internal models as well as the setting of internal parameters may be found in [23].
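The three entity types and the parameters listed above can be summarised in a minimal sketch (field names are our assumptions for illustration; the actual implementation described in [23] is in C++ and may differ):

```python
from dataclasses import dataclass, field

@dataclass
class IM:
    """Information message: the unit of information in the model."""
    source_id: int
    im_id: int
    publication_time: float
    topic: str
    # Virality differs per reaction type, e.g. the virality of sharing.
    virality: dict = field(default_factory=dict)  # reaction type -> probability

@dataclass
class User:
    """A vertex that receives IMs and may react to them."""
    user_id: int
    activity: list          # probability of being online per interval of the day
    reaction_params: dict   # (user type, IM type) -> reaction probabilities

@dataclass
class Community:
    """A vertex that broadcasts IMs to its subscribers."""
    community_id: int
    subscribers: list
    topic_schedule: dict    # topic -> publication probability by time of day/week

# Hypothetical instances, not data from the paper.
news = Community(1, subscribers=[10, 11, 12], topic_schedule={"sport": {}})
post = IM(source_id=1, im_id=100, publication_time=0.0,
          topic="sport", virality={"like": 0.05, "share": 0.01})
```

The directed graph of entities then connects Community and User vertices by subscription and friendship edges, over which IM objects are transferred.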

Description of the general scheme
The architecture of the implemented system for operational forecasting of information processes in cyberspace is presented in Fig. 5. The web crawler allows efficient collection of data from various sources on the Internet, including the largest Russian online social network, VKontakte. Depending on the chosen scenario, the crawler can obtain two types of data: historical data for past time intervals and real-time data. Data collected by the crawler are presented in JSON format with various parameters. However, further application of the data in the model requires additional processing of the received files. Historical data are required to adjust the input parameters of the model. The data are stored as separate collections in MongoDB. The data obtained in this way can be processed and presented as input parameters for the model: network topology and communities' and users' parameters.
The forecasting model is responsible for two main tasks: (i) disseminating messages through the network, and (ii) refining the parameters when we get data about the actual state of the information process from the crawler. Parallel computation allows modelling on large-scale networks to finish within a reasonable time. Reasonable means that we can obtain the forecast before it becomes outdated, and that we have enough time to react if this architecture is used for a decision support system. The model was implemented in C++ with the MPI standard for message passing. The model uses the Master/Slave pattern of parallel communication. Master nodes are responsible for keeping statistics and IM generation. Slave nodes are responsible for hosting a subnetwork and propagating IMs through it. A more detailed description of the algorithms and the functionality of the Master and Slave processes is presented in our previous article [26].
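The Master/Slave decomposition can be illustrated conceptually as follows: the network is partitioned across slave processes so that each hosts a subnetwork. This is a Python sketch only; the actual implementation is in C++ with MPI [26], and the simple hash-based partitioning rule below is our assumption, not the paper's algorithm:

```python
def partition_network(nodes, n_slaves):
    """Assign each vertex to a slave process by a simple modulo rule,
    so that every slave hosts a subnetwork of roughly equal size.
    Rank 0 is reserved for the master (statistics, IM generation)."""
    subnets = {rank: [] for rank in range(1, n_slaves + 1)}
    for node in nodes:
        rank = 1 + node % n_slaves
        subnets[rank].append(node)
    return subnets

# Ten vertices distributed over three slave processes.
subnets = partition_network(range(10), n_slaves=3)
sizes = {rank: len(v) for rank, v in subnets.items()}
print(sizes)  # {1: 4, 2: 3, 3: 3}
```

In the real system a partition would also try to minimise the number of edges crossing process boundaries, since those edges generate MPI traffic when an IM propagates between subnetworks.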
Real-time data are added to the model in the form of sequential batches that are processed and saved to a specific folder. A batch is a JSON file with the network data: each row, corresponding to one IM, contains the number of reactions of different types. In addition, it can include lists of users who actually reacted to the information. Every batch has an index starting from one.
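A batch file of this kind could be parsed as follows (the exact JSON field names are our assumption; the source only states that each record holds per-IM reaction counts and, optionally, the reacting users):

```python
import json

def load_batch(text):
    """Parse one assimilation batch: one record per IM with the
    observed numbers of reactions of each type."""
    batch = json.loads(text)
    return {rec["im_id"]: rec["reactions"] for rec in batch["ims"]}

# A hypothetical one-IM batch in the assumed layout.
raw = json.dumps({
    "batch_index": 1,  # batch indices start from one
    "ims": [
        {"im_id": 100,
         "reactions": {"like": 42, "comment": 5, "share": 3},
         "reacted_users": [10, 11]},
    ],
})
print(load_batch(raw))  # {100: {'like': 42, 'comment': 5, 'share': 3}}
```

The parsed counts are what the model compares against its own predictions when a batch is assimilated.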
During forecasting for one IM, the values of m predicted parameters for iterations 1, 2, …, n are calculated (from the current moment 0 to the forecast period T). A series of n values is recalculated every time a new batch of data from the crawler is received (Fig. 6). Thus, the result of a single prediction cycle is m · p · n · k values, where p is the number of IMs and k is the number of batches. After applying a new batch, the internal parameters of the model can be adjusted for more accurate prediction via the following options: resetting the model time to the batch time, changing the number of reactions, modifying the set of users who viewed and reacted to the IMs, and modifying the virality coefficients. A detailed description of the forecasting scheme was given in our previous work [27].
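One of the listed adjustment options, modification of the virality coefficients, can be sketched with a simple proportional rescaling rule (our own simplification for illustration; the actual tuning procedure is described in [27]):

```python
def update_virality(virality, predicted, observed):
    """Rescale a virality coefficient after a new batch arrives:
    if the model under- or over-predicts the observed number of
    reactions, scale the coefficient proportionally."""
    if predicted == 0:
        return virality  # nothing to calibrate against yet
    return virality * (observed / predicted)

# Model predicted 50 likes but the batch reports 100, so the
# like-virality coefficient is doubled.
print(update_virality(0.05, predicted=50, observed=100))  # 0.1
```

Repeating this after every batch is what lets the forecast converge toward the actually observed dynamics.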

Forecast quality assessment
The quality of the forecast is estimated in terms of accuracy, earliness and the number of batches needed to achieve the desired accuracy of the forecast. Quality assessment is carried out according to the procedure illustrated in Fig. 7. After IM generation, we can get information about the number of responses at different intervals of time using batches. We use information from batches in two ways: first, for evaluation of the forecast obtained by our model (batches of evaluation); second, for assimilation of data from the OSN and subsequent recalculation of the parameters in the model (batches of assimilation). For forecasting at the micro level, the forecast accuracy for each IM is estimated by the MAPE and MAE metrics (Eqs. 1, 2) for an individual batch and is averaged over the set of batch files (Eq. 3).
where A_t is the actual value of the parameter at iteration t, and F_t is the forecast value of the parameter at iteration t.
where A_{b,t} is the actual value of the predicted parameter at iteration t for batch b, and F_{b,t} is the forecast value of the parameter at iteration t for batch b.
For forecasting at the meso level, the accuracy of the forecast for a series of IMs is estimated with the mean value (Eq. 4). Also, we measure the rate of improvement of forecast accuracy during data assimilation. It is estimated in terms of the number of batches needed to achieve a given accuracy ε at a fixed update rate ν (Eq. 5).
where A_{i,t} is the actual value of the predicted parameter at iteration t for IM i, and F_{i,t} is the forecast value.
where n is the number of iterations, p is the number of IMs, b is the batch number, M is the selected metric for assessing the accuracy of the forecast, and ε is the minimal accuracy of the forecast.
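Under the definitions above, the per-batch micro-level metrics and their average over batch files can be computed as follows (a straightforward reading of Eqs. 1–3, assuming the standard MAPE and MAE formulas):

```python
def mape(actual, forecast):
    """Mean absolute percentage error over n iterations (cf. Eq. 1)."""
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

def mae(actual, forecast):
    """Mean absolute error over n iterations (cf. Eq. 2)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def averaged_metric(metric, batches):
    """Average a metric over the set of evaluation batches (cf. Eq. 3).

    batches: list of (actual_series, forecast_series) pairs,
    one pair per batch file b."""
    return sum(metric(a, f) for a, f in batches) / len(batches)

# Hypothetical reaction counts over three iterations.
actual, forecast = [100, 200, 400], [110, 180, 400]
print(mae(actual, forecast))   # 10.0
print(mape(actual, forecast))  # about 6.67 (%)
```

Here `actual` would come from an evaluation batch and `forecast` from the model's predicted series for the same iterations.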
To estimate the aggregated response to messages, the actual and model distributions of the number of reactions are compared using the Kolmogorov-Smirnov criterion with a significance level of 0.05 (Eq. 6).
where N_1, N_2 denote the sizes of the first and second samples respectively, and F_{1,N}(x) is the empirical distribution function based on a sample of size N.
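The two-sample Kolmogorov–Smirnov comparison can be sketched in pure Python (an illustration of the criterion; in practice a library routine such as scipy.stats.ks_2samp would be used, and the two samples below are hypothetical):

```python
def ks_statistic(sample1, sample2):
    """Two-sample KS statistic: the maximum distance between the
    empirical distribution functions of the two samples."""
    s1, s2 = sorted(sample1), sorted(sample2)

    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(s1, x) - ecdf(s2, x)) for x in s1 + s2)

def ks_critical(n1, n2, c_alpha=1.36):
    """Approximate critical value; c(alpha) = 1.36 corresponds to
    the significance level 0.05 used in the paper."""
    return c_alpha * ((n1 + n2) / (n1 * n2)) ** 0.5

# Hypothetical model vs. observed reaction-count samples.
model = [3, 5, 7, 9, 11]
observed = [4, 5, 6, 9, 12]
d = ks_statistic(model, observed)
# If d is below the critical value, we cannot reject the null
# hypothesis that the samples share the same distribution.
print(d < ks_critical(len(model), len(observed)))  # True
```

This is exactly the decision made in the experiments: when the statistic stays below the 0.05-level threshold, the model and actual aggregated dynamics are considered indistinguishable.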
Finally, earliness is estimated by the length of time between the completion of the calculation and the end of the forecast period (Eq. 7), as well as by the ratio of the time spent on the calculation to the length of the forecast period (Eq. 8).
where b is the batch number, τ_b is the forecast time for the current batch b, l is the time between batch assimilations, and T is the forecast period.
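With these definitions, earliness (Eqs. 7–8) reduces to two simple per-batch quantities (a sketch assuming τ_b is the computation time for batch b and T the forecast period; the numbers are hypothetical, chosen to match the 2–4% range reported later):

```python
def earliness(tau_b, T):
    """Time left between completion of the calculation and the end
    of the forecast period (cf. Eq. 7)."""
    return T - tau_b

def earliness_ratio(tau_b, T):
    """Share of the forecast period spent on the calculation (cf. Eq. 8)."""
    return tau_b / T

# A 900 s forecast period computed in 27 s leaves 873 s of slack,
# i.e. the computation takes 3% of the period.
print(earliness(27, 900), earliness_ratio(27, 900))  # 873 0.03
```

A ratio well below 1 is what makes operational (real-time) forecasting feasible: the forecast is ready long before the period it covers has elapsed.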
The description of the assessment methods is given in Table 2. The following notation is introduced: t is the length of a model time unit, n is the number of iterations, T is the prediction period (n · t = T), p is the number of posts, b is the number of batches (A is the set of assimilation batches, B is the set of evaluation batches, A ⊆ B), and l is the length of the time interval between batches (l · k is the period of assimilation).

Experimental study of the quality of forecasts
To test the functionality and the quality of forecasts, we consider two scenarios:
1. Prediction of the response to a single IM in the community (corresponds to micro-scale modelling).
2. Prediction of the aggregated response to messages on various topics within a single community (corresponds to the meso-scale).
Fix T and n (in Fig. 7, n = 9). A series of forecasts for n iterations is carried out: (a) at the time of IM generation; (b) after each assimilation batch b ∈ A (with adjustment of the model parameters from the obtained data). Figure 7 shows the assimilation of three batches. Additionally, evaluation batches b ∈ B must be collected to verify the predicted values for all |A| + 1 forecasts. The averaged values of the forecast accuracy metrics for different assimilation batches are calculated. To study the quality of forecasts, experiments were carried out with the parameters listed in Table 3. The results of the experiments were evaluated by the metrics described in Sect. 3.3. The quality of forecasting reactions to a single IM was investigated for the news community with 1,912,769 subscribers.
The accuracy assessments of IM response prediction for the basic value of virality and for the cases with data assimilation are shown in Fig. 8 (method 1 in grey and method 2 in blue). For all types of reactions, the error decreases as the batch number increases (for likes, a median error of less than 20% is reached on average 1.5 h after post publication; for shares and comments, after 3.5 h, due to the smaller number of these types of reactions). The error without assimilation (for the basic virality averaged over historical data) is quite high (50-150%), due to the large variance of the actual viralities of the IMs. Moreover, Fig. 8 depicts the results of the assessment according to method 3 (the impact of the frequency of assimilation batches). The median values of responses for different frequencies of receiving batches are quite stable. The spread of the values for likes is smaller for more frequent batches; for rarer reactions (comments and shares), this trend is not observed.
To measure the accuracy according to method 4, we investigated the number of batches needed to achieve the desired accuracy (the interval between batches is 30 min). Results for MAPE values from 0 to 1.5 are shown in Fig. 9a. To achieve an error of less than 10% for likes and comments, about 10 batches are required, and 14 batches for shares, which are less predictable. Figure 9b shows the dependence between the forecast number and the prediction period.
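The number of batches needed to reach a given accuracy (method 4, cf. Eq. 5) can be computed directly from the per-batch error series (an illustrative sketch; the MAPE values below are hypothetical, not the paper's measurements):

```python
def batches_to_accuracy(errors_by_batch, eps):
    """Return the index (starting from one) of the first batch after
    which the error drops below eps and stays below it; None if the
    accuracy is never reached."""
    for b in range(1, len(errors_by_batch) + 1):
        if all(e < eps for e in errors_by_batch[b - 1:]):
            return b
    return None

# Hypothetical MAPE values (%) after each assimilated batch.
mape_per_batch = [80, 45, 30, 18, 12, 9, 8, 7]
print(batches_to_accuracy(mape_per_batch, eps=10))  # 6
```

Requiring the error to *stay* below eps from batch b onward (rather than merely dipping below it once) is a design choice that avoids declaring success on a transient fluctuation.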
To study forecast quality for aggregated reactions on a set of IMs in communities, method 5 was used. Figure 10 provides a comparison of the aggregated dynamics of reactions for two types of IMs in the news community. Table 4 contains the results of calculating the Kolmogorov-Smirnov statistic and the p-value for the two studied communities and three types of messages. For all the considered cases, we cannot reject the null hypothesis that the samples have the same distribution.
[Table 3: Initial parameters for forecast quality assessment]
The dynamics of forecasting time for a set of batches is shown in Fig. 11a. For the first batches, the time grows, which relates to the need to update the states of the nodes of the complex network according to the batch data. However, starting from the fourth batch, the time decreases due to the shorter remaining forecast period. At the same time, the earliness values (Fig. 11b) are in the range of 0.02-0.04 (that is, the forecasting time ranges from 2 to 4% of the forecasting period), which shows the possibility of effective system operation in the operational mode.
Figure 12 shows the results of the efficiency study for the parallel implementation. The first experiment examines the scalability of predictive modelling in parallel mode on the Lomonosov supercomputer. In this experiment, the size of one community's network is set to 5-15 M, and the number of IMs is 5 or 10. In the data assimilation mode, 1 IM is added with each of 22 assimilation batches, and the number of processes varies from 1 to 128. Figure 12a demonstrates the parallel efficiency obtained in this experiment. For community sizes of 5 or 10 M nodes, parallel efficiency varies from 0.64 to 0.78, peaking at 32 processes and then decreasing to 0.75 for 128 processes (Fig. 12a). For the graph with 15 M vertices, the parallel efficiency grows all the way up to 128 processes. The computational load (Fig. 12b) grows from batch to batch because with every batch one new post is added to the system.
The second experiment estimates the performance of the predictive modelling module in parallel mode on the supercomputer. The experiment parameters are the same as above. Figure 13a shows the total simulation time for 22 batches for different sizes of the MPI communicator. Figure 13b shows the time taken by one iteration on 3 processes. In this experiment, the prediction period was set to 900 s, and the community size was 15 M.
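The parallel efficiency reported in these experiments follows the standard definition E = T1 / (p · Tp), where T1 is the single-process time and Tp the time on p processes (the timings in the example are hypothetical, chosen only to land in the 0.64–0.78 range reported above):

```python
def parallel_efficiency(t_serial, t_parallel, n_processes):
    """Efficiency E = speedup / p = T1 / (p * Tp)."""
    return t_serial / (n_processes * t_parallel)

# Hypothetical timings: 1 process takes 1000 s, 32 processes take 40 s.
print(parallel_efficiency(1000.0, 40.0, 32))  # 0.78125
```

Efficiency below 1 reflects communication and load-imbalance overhead; the fact that the 15 M-vertex graph keeps gaining efficiency up to 128 processes indicates that, for large enough subnetworks, computation still dominates that overhead.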
Running the module on three processes performs the forecast in 130% of the prediction period; 64 processes require 5% of the prediction period, and 128 processes only 2%. Therefore, the performance of the module is sufficient to obtain forecasts for large networks.