TreeNet analysis of human stress behavior using socio-mobile data

Human behavior is essentially social, and people start their daily routines by interacting with others. Among the many forms of social interaction, we use mobile-phone-based social interaction features together with social surveys to study human stress behavior. For this, we gathered a mobile phone call log data set containing 111,444 voice calls made by 131 adult members of a living community over a period of more than 5 months. We found that the top 5 social network measures of individuals, namely hierarchy, density, farness, reachability and eigenvector centrality, have a profound influence on their stress levels in a social network. If an ego lies on the shortest paths between all other alters, the ego receives more information and is hence more stressed. In this paper, we use the TreeNet machine learning algorithm for its speed and immunity to outliers. We also compared our results against a Random Forest classifier and found TreeNet to be more efficient. This research can be of vital importance to economists, professionals, analysts, and policy makers.

status [11]. A few reality mining experiments also focus on sleep and mood, since these have a significant public health impact with societal and financial effects [12].
In the last decade, many researchers in sociology have described stress behavior as a social construct, pointing out the major role played by social influences. Call-log-based social interaction patterns provide more predictive power on human stress.
The paper is organized as follows. We survey the related work and then list the social network measures used in this paper. We present the TreeNet gradient boosting technique for characterizing stress behavior, and discuss the socio-mobile and stress features used to study the interconnections. Next, the social network features are ranked by importance, and the influence of the top predictors on the target class is shown through visualizations.

Related work
Social networks play a key role in the diffusion of ideas, opinions and recommendations. Influence is the ability of a node in a social network to lead other nodes to accept or reject information [13]. The Social Evolution experiment conducted by MIT closely tracks the day-to-day life of people through their mobile phones and studies their adaptation of diet, exercise, political affiliation, depression and stress [14]. The frequency of interaction patterns extracted from communication logs is used to infer the strength of social ties and to identify relationships.
Stress is a mental state that everyone experiences in daily life. In the modern era, stress is one of the major causes of health problems such as heart disease, stroke, depression and cancer. Individuals are often under stress because of project and work deadlines, and in the long run high stress can become chronic. Stress detection technologies may help people to better understand and relieve stress by increasing their awareness. A recent study at the Harvard and Stanford Business Schools reports high mortality rates related to stress, which causes hypertension, cardiovascular disease and mental health problems, leading to 120,000 deaths in America each year [15].
Alison Dillon et al. studied the effectiveness of smartphone games that use biofeedback in reducing stress [16]. The biofeedback method detects physiological signals such as heart rate, respiration, muscle activity or skin temperature from the user's body, and helps users gain control over these signals [17].
Amir Muaremi et al. worked on stress prediction using smartphones and wearable devices during working days and sleep, and reported an accuracy of 61 % for three different stress levels [18]. [20].
Akane Sano et al. studied physiological and behavioral markers of stress, using mobile phones, a wrist sensor and social surveys to classify whether participants were stressed or not [21].
To understand multiple aspects of human behavior, many reality mining approaches have been used in recent years, at scales ranging from the individual to the group, to study measures such as personality, stress, interest level, spread of diseases and product adoption [3,22]. Smart phones in particular have been used to study human mobility patterns at both the macro and the individual level [23].
Social interactions, as captured in data collected from mobile phones, have been identified as potent stressors that often affect an individual's social behavior [24]. Previous research on human stress comes largely from neurobiology, and a few methods are based on physiological signals such as blood pressure, heart rate [25], heart rate variability (HRV) [26], skin conductance [27,28] and cortisol [29,30]. Many stress assessment methods are based on surveys. Questions related to the perceived stress scale (PSS) were used as an objective stress marker that assesses the degree to which a subject feels stressed in different situations.
Today, many wearable devices and mobile phones contain sensors that measure behavioral data in our day-to-day lives. This paper aims to use socio-mobile data, specifically call log data collected from social interactions through smart phones, to study human stress behavior. Classifying human stress behavior with TreeNet analysis is a unique method, and the efficiency of the results was verified by comparing it with another popular machine learning algorithm, the Random Forest classifier.

Evaluation of individual well-being
A few behavioral signals produced by smart phones have been correlated with the function of some major brain systems. Recent reality mining research on data streams offers direct assessment of the cognitive and emotional states of individuals, their perception of events, and information on their behaviors [31].

Mapping social networks
The capability of reality mining to map social networks automatically is one of its important areas of research. A smart phone can sense and continuously monitor a user's call patterns, SMS patterns and location information; by statistical analysis of this data, we can show different behavioral patterns based on the user's social relationships.

Evaluation of population well-being
Reality mining techniques are used to assess health conditions within a community. Such analysis shows that people tend to be least deprived in regions where there is greater diversity of communication [32].

Infectious disease
GPS and other sensing technologies provided by smart phones make it easy to track people's movement and thereby trace diseases, such as bird flu, that spread through the population or by physical proximity. Human behavior plays an important role both in the spread of these diseases and in improving control over them [32].

Mental health
Reality mining techniques assist in the early detection of psychiatric disorders such as depression and attention deficit hyperactivity disorder (ADHD). Reality mining data stream approaches allow direct, continuous and long-term assessment of health patterns and behaviors.
An individual's communication patterns, frequency of communication with others, and the content and manner of speech are also key signs of several psychiatric disorders [33].

Behavioral health analytics
Behavioral patterns collected from smart phone social sensors are used to improve the quality and reduce the cost of healthcare. Emerging mobile apps provide patient data based on the user's location, call records, SMS records and app usage. This data is then used in big data analytics to find deviations in an individual's daily activity, so that something wrong or suspicious can be predicted even before an event occurs [34].
Our research is based on smart phone call logs, which are the simplest way of maintaining a personal network diary on users' phones over a long duration with little or no effort. This approach collects data from smart phones passively, which helps us gather complete information about user networks without interrupting the user's daily activity. The dataset does not suffer from cognitive or social biases in drawing network information. Table 1 shows the uniqueness of this research and summarizes the related work of other researchers in terms of social granularity. Most earlier work on stress measurement collected data from surveys, smart phone sensors and wearable devices. In this paper, by contrast, we use a call log data set of smart phone users to study human stress on a daily basis over a period of time. Such a method of finding stress behavior has not been explored before.

Social network analysis
A social network is a group of collaborating and/or competing individuals who are related to each other by one or more types of relations; it is formally defined as a set of social actors, or nodes [35]. Social network analysis (SNA) is a technique for analyzing social networks in order to trace and understand the social relationships among the members of the network and to apply the inferred information. Many concepts from graph theory are adopted in SNA, because representing a social network as a graph is the best way to analyze the relationships and interaction strengths among the actors or nodes [35][36][37]. Many graph tools have been developed to help researchers visualize social networks.
In this paper, we have used UCINET 6, a software package for social network analysis. The SNA literature proposes many metrics for discovering the characteristics of a social network, such as degree/size, density, different types of centrality, clustering coefficient, path analysis, flow, cohesion and influence, along with other essential information obtained by various types of analysis [38]. For our analysis, we use the following metrics:
• Degree: The number of actors (alters) that an ego is directly connected to.
• Farness: The sum of the weights of the shortest paths from ego to all other nodes, or to ego from all other nodes. If the social network is directed, farness can be computed separately for sending and receiving information: the sum of geodesic distances from alters to ego is called in-Farness, and the sum of geodesic distances from ego to alters is called out-Farness.
• Closeness centrality: The reciprocal of farness. This metric is based on the notion of the average shortest path between a node and all the nodes in the graph, that is, the mean geodesic distance between an actor and all alters reachable from it. Closeness is an important measure that indicates how long it will take information to spread from a given node to the other nodes in the network. For a directed graph, in-closeness and out-closeness are calculated separately.
• Structural holes: The gaps or weaker connections between non-redundant contacts or groups in the social structure. Individuals on either side of a structural hole circulate different flows of information. Structural holes give a broker the opportunity to pass information between people on opposite sides of the hole [39].
• Ego betweenness: The sum, over each pair of alters, of the proportion of times that ego lies on the shortest path between them. If two alters are connected to each other without passing through ego, the contribution of that pair is 0; for alters connected to each other only through ego, the contribution is 1. Similarly, a pair of alters linked through ego and one or more other alters contributes 1/n, where n is the total number of nodes connecting that pair of alters. N Ego Betweenness is ego betweenness normalized by a function of the number of nodes in the ego network [40].
• Proximal betweenness: The number of times a node occurs in the penultimate position on a geodesic. Let $a_{jk}$ be the proportion of all geodesics linking vertex j and vertex k that pass through vertex i, where i is the penultimate node on the geodesic, that is, (i, k) is the last edge of the geodesic path. The proximal betweenness of node i is the sum of all $a_{jk}$ where i, j and k are distinct.
• Betweenness centrality: A measure of the position of a node, defined as the number of times the node connects pairs of other nodes that would otherwise be unable to reach one another, so that it acts as an intermediary in the interaction between the other nodes.
• Flow betweenness: Let $a_{jk}$ be the amount of flow between node j and node k that must pass through i for any maximum flow. The flow betweenness of node i is the sum of all $a_{jk}$ where i, j and k are distinct and j < k. Flow betweenness is therefore a measure of the contribution of a node to all possible maximum flows.
• Reach centrality: The number of nodes that a node can reach in k or fewer steps. For k = 1, this is equivalent to degree centrality. For directed networks, separate measures are calculated for out-Reach and in-Reach. Through the key individuals who are well positioned in a social network, we can reach many people in just a few steps, so this measure gives a natural metric for evaluating each node.
• Density: The total number of ties divided by the total number of possible ties. Given a directed graph G = (V, E), density is defined as $\mathrm{Density} = \frac{|E|}{|V|\,(|V| - 1)}$, where |V| is the total number of vertices and |E| is the total number of edges of the graph.
• Degree centrality: The number of direct connections a node has with other actors or alters. A node with a high degree centrality acts as a hub in the network. For a directed network, degree centrality is the sum of in-degree and out-degree. It signifies the activity or popularity of the node in the network, owing to its large number of interactions with other nodes.
• Clique: A subset of nodes in which every possible pair of nodes is directly linked.
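As an illustration of how some of these metrics follow from their definitions, the following minimal Python sketch computes density, out-Farness, out-closeness and out-Reach on a small directed graph. It is not the UCINET 6 implementation used in this paper: the toy network and all function names are assumptions made for illustration, distances are unweighted hop counts, and unreachable alters are simply ignored, which may differ from how UCINET treats them.

```python
from collections import deque


def bfs_distances(adj, source):
    """Geodesic (hop-count) distances from source to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def density(adj):
    """|E| / (|V| * (|V| - 1)) for a directed graph without self-loops."""
    n = len(adj)
    edges = sum(len(neighbors) for neighbors in adj.values())
    return edges / (n * (n - 1))


def out_farness(adj, node):
    """Sum of geodesic distances from node to the alters it can reach."""
    return sum(bfs_distances(adj, node).values())


def out_closeness(adj, node):
    """Reciprocal of out-farness (0.0 if the node reaches no one)."""
    farness = out_farness(adj, node)
    return 1.0 / farness if farness > 0 else 0.0


def out_reach(adj, node, k):
    """Number of other nodes reachable from node in k or fewer steps."""
    dist = bfs_distances(adj, node)
    return sum(1 for v, d in dist.items() if v != node and d <= k)


# Hypothetical directed call network: an edge u -> v means u called v.
# Every node must appear as a key so that |V| is counted correctly.
calls = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"D"},
    "D": set(),
}

print(density(calls))            # 4 edges out of 12 possible ties
print(out_farness(calls, "A"))   # distances 1 + 1 + 2
print(out_reach(calls, "A", 1))  # equals the out-degree of A
```

With k = 1 the out-Reach of a node equals its out-degree, which matches the remark above that reach centrality reduces to degree centrality for k = 1.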

A social analysis of human stress behavior
Our study is based on a smart phone call log dataset. This dataset is a continuous collection of the call logs of individuals residing in a community, including the date, time and duration of each call. The call types are incoming, outgoing and missed calls between individuals. Using this dataset, we build a phone communication network for the community in which each node is an actor and each link is the type of call made between actors. Table 2 shows a sample of the data set containing two types of community users, SP and FA, who also participated in the survey. The data set contains the call logs of both community and non-community users, but we preprocessed the data and concentrated only on community users. Table 3 shows a sample of the survey dataset, which contains 12,658 records on human stress from the same users, with stress given in the range 1-7. After preprocessing the data, we computed the Avg_Stress of each community user.
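The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the user identifiers and the function names are hypothetical, and the link weight is taken here to be the number of calls between two community users, which is one possible encoding and not necessarily the one used in this study.

```python
from collections import defaultdict

# Hypothetical call-log rows: (caller, callee, call_type, duration_seconds).
call_log = [
    ("SP01", "SP02", "outgoing", 120),
    ("SP02", "SP01", "outgoing", 45),
    ("SP01", "FA03", "missed", 0),
    ("SP01", "X999", "outgoing", 30),  # X999 is not a community user
]
community = {"SP01", "SP02", "FA03"}

# Hypothetical survey rows: (user, stress score on the 1-7 scale).
survey = [("SP01", 5), ("SP01", 3), ("SP02", 2), ("FA03", 6), ("FA03", 7)]


def build_network(rows, members):
    """Directed network keeping only calls between community users.

    The weight of edge (caller, callee) is the number of calls (an assumption).
    """
    edges = defaultdict(int)
    for caller, callee, _call_type, _duration in rows:
        if caller in members and callee in members:
            edges[(caller, callee)] += 1
    return dict(edges)


def average_stress(rows):
    """Mean survey stress score per user (the Avg_Stress value)."""
    totals = defaultdict(lambda: [0, 0])  # user -> [sum of scores, count]
    for user, score in rows:
        totals[user][0] += score
        totals[user][1] += 1
    return {user: total / count for user, (total, count) in totals.items()}


network = build_network(call_log, community)
avg_stress = average_stress(survey)
print(network)
print(avg_stress)
```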
In this section, we present a study of human stress behavior that aims to elicit useful information by using social network analysis, in particular the metrics described in the "Social network analysis" section. As already mentioned, we used the UCINET 6 software package to compute the social network metrics. Figure 1 shows the social network metrics computed using the UCINET 6.0 package.
Here we propose to study human stress behavior from a social point of view, with the goal of extracting information from the dynamics of the relationships among the members of the network. We used the TreeNet algorithm as provided by Salford Systems (SPM 8.0), which typically generates thousands of small decision trees built in a sequential error-correcting process that converges to an accurate model.

Stochastic Gradient TreeBoost algorithm
The Stochastic Gradient_TreeBoost algorithm is a minor modification of the gradient boosting technique in which randomness is incorporated at each iteration of the algorithm [41]. At every iteration, a subsample of the training data set is drawn at random (without replacement), and this subsample is used to fit the base learner and compute the model update for the current iteration. The subsample size is a constant fraction f of the size of the training data set. Smaller values of f introduce more randomness into the algorithm, which acts as a kind of regularization and helps prevent overfitting. Let $\{y_i, X_i\}_1^N$ be the entire training data sample and $\{r(i)\}_1^N$ be a random permutation of the integers {1, 2, 3, ..., N}. Then a random subsample of size $\tilde{N} < N$ is given by $\{y_{r(i)}, X_{r(i)}\}_1^{\tilde{N}}$.
Algorithm: Stochastic Gradient_TreeBoost
The value $\tilde{N} = N$ introduces no randomness. A smaller fraction $f = \tilde{N}/N$ makes the random samples used in successive iterations differ more, thereby introducing more overall randomness.
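The subsampling step can be sketched as follows. This is a minimal illustration of drawing a subsample through a random permutation, not the TreeNet implementation; the function name `draw_subsample` and its signature are assumptions made for this example.

```python
import random


def draw_subsample(y, X, f, rng):
    """Return a random subsample of size floor(f * N), drawn without replacement.

    A random permutation r(1), ..., r(N) of the record indices is generated,
    and the first N-tilde permuted records form the subsample.
    """
    n = len(y)
    n_sub = int(f * n)
    permutation = list(range(n))
    rng.shuffle(permutation)
    chosen = permutation[:n_sub]
    return [y[i] for i in chosen], [X[i] for i in chosen]


# Toy data: ten records with one predictor each.
y = list(range(10))
X = [[i] for i in range(10)]

rng = random.Random(0)  # seeded only to make the sketch reproducible
y_half, X_half = draw_subsample(y, X, 0.5, rng)  # f = 0.5: half of the records
y_full, X_full = draw_subsample(y, X, 1.0, rng)  # f = 1: the whole sample, reordered
```

With f = 1 every iteration sees the same set of records, so no randomness enters the fit; with a smaller f the subsamples seen by successive iterations overlap less.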

TreeNet analysis of call log dataset
TreeNet is a Stochastic Gradient Boosting technique, a new machine learning approach which is good for classification and regression problems. It is built on CART trees and thus is fast, efficient, data driven, immune to outliers and invariant to monotone transformation of variables.
The model is built from several hundred to several thousand small trees, each of which contributes a small portion of the overall model. The final model prediction is obtained by adding up all the individual contributions. The model is similar to a long series expansion, a sum of terms that becomes more accurate as the series expands. Figure 2 shows the tree building process using the gradient boosting technique. The model can be written as
$F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \cdots + \beta_M T_M(X)$
where each $T_i$ is a small tree.
Procedure of TreeNet:
1. It begins with a very small tree; the initial model may contain as little as one split, generating 2 terminal nodes.
2. Generally, however, a tree has 3-5 splits, yielding 4-6 terminal nodes.
3. After the first tree, it computes "residuals" for this simple model for every record in the data. It then grows a second small tree to predict the residuals from the first tree.
4. It then computes residuals from this new two-tree model and grows a third tree to predict the revised residuals.
5. It repeats this process to grow a sequence of trees.
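The residual-fitting procedure above can be sketched in plain Python. This is a simplified illustration and not the TreeNet implementation: it assumes a single numeric predictor, squared-error residuals, two-node trees (one split) and a fixed learn rate, and the names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(x, r):
    """Fit the best one-split tree (two terminal nodes) to residuals r.

    Assumes x holds at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= threshold]
        right = [ri for xi, ri in zip(x, r) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((ri - left_mean) ** 2 for ri in left)
               + sum((ri - right_mean) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees, learn_rate):
    """Grow a sequence of small trees, each predicting the current residuals."""
    f0 = sum(y) / len(y)            # initial model: the mean of the target
    trees = []
    predictions = [f0] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        predictions = [pi + learn_rate * tree(xi)
                       for xi, pi in zip(x, predictions)]
    # The final prediction adds up the contributions of all trees.
    return lambda xi: f0 + learn_rate * sum(tree(xi) for tree in trees)


# Toy data with a step between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(x, y, n_trees=50, learn_rate=0.1)

mean_y = sum(y) / len(y)
baseline_mse = sum((yi - mean_y) ** 2 for yi in y) / len(y)
model_mse = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
```

Each added tree corrects part of the error left by the trees before it, so on the training data the squared error of the summed model falls below that of the initial constant model.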
Every tree yields at least one positive and one negative node. Red indicates a relatively large positive node and deep blue a relatively large negative node. The total "score" is obtained by identifying the relevant terminal node in every tree and summing the scores across all trees in the model.
A TreeNet (TN) model is generally summarized in several ways, such as partial dependence plots, variable importance rankings, ROC curves and the confusion matrix. Figure 3 shows a partial dependence plot between the target and a predictor variable as captured by the model.
Fig. 2 The TreeNet modeling process starts with a very small tree containing 2 terminal nodes; the model normally contains 3-5 splits in a tree, generating 4-6 terminal nodes. It computes prediction errors for the model for every record in the dataset and grows a second small tree to predict the residuals from the first tree. This process repeats to grow a sequence of trees until it generates an optimal tree for the dataset
Fig. 3 A partial dependence plot between the target and predictor variable as captured by the TreeNet (TN) model
In our dataset, Avg_Stress is the target variable, and 17 predictors are used to predict the target class STRESS (S). We used a train-to-test ratio of 80:20, a learn rate of 0.1, 200 as the initial number of trees, and the cross-entropy criterion to determine the optimal number of trees for our logistic model. Our model is a binary classifier: whether a person is stressed (S) or not stressed (NS) is determined by whether the sign of the predicted outcome is positive or negative. This technique produces the smallest possible two-node tree in each stage. The default TreeNet uses a six-node tree, but in our model the optimal likelihood and ROC models are attained when 171 trees are grown; that is, the model shows its best performance at this number of trees. Figure 4 shows the average negative log-likelihood value of 0.653 at the optimal number of trees, N = 171, in our model. The TN model produces ROC curves which are unique for each record. This model allows records to be ranked from best to worst. Table 4 shows our optimal model, where the area under the ROC curve, a measure of overall model performance, is 50 % and the average log-likelihood is 65 %, emphasizing the probability interpretation of the model predictions for the target variable STRESS (S).
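The sign-based classification and the average negative log-likelihood can be illustrated with a short sketch. The standard logistic link used here to turn a score into a probability is an assumption made for illustration, since the exact link used by TreeNet is not stated in this paper; the function names and toy scores are hypothetical as well.

```python
import math


def classify(score):
    """Label a record from the sign of its summed tree score."""
    return "S" if score > 0 else "NS"


def probability(score):
    """P(stressed) under an assumed standard logistic link."""
    return 1.0 / (1.0 + math.exp(-score))


def avg_negative_log_likelihood(scores, labels):
    """Mean cross-entropy of the predicted probabilities against the labels."""
    total = 0.0
    for score, label in zip(scores, labels):
        p = probability(score)
        total -= math.log(p) if label == "S" else math.log(1.0 - p)
    return total / len(scores)


# Hypothetical summed scores for four records and their true classes.
scores = [1.5, -0.7, 0.2, -2.0]
labels = ["S", "NS", "S", "NS"]

predicted = [classify(s) for s in scores]
anll = avg_negative_log_likelihood(scores, labels)

# A model that scores every record 0 predicts probability 0.5 throughout,
# which gives an average negative log-likelihood of ln 2.
uninformed = avg_negative_log_likelihood([0.0, 0.0], ["S", "NS"])
```

Lower values of this quantity indicate predicted probabilities that agree better with the observed classes, which is why it can serve as the criterion for choosing the number of trees.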
The TN model gives stable variable importance rankings after assessing the relative importance of the predictors. Table 5 shows the importance of the top 5 predictor variables for the target class. Figure 5 shows the dependence plots of the top two predictor variables for our target class STRESS (S).
Fig. 4 The green line shows our optimal model with respect to the test and learn likelihood. The X-axis shows the actual number of trees and the Y-axis shows the corresponding value of the likelihood. In our case, the optimal 171-tree model has an average negative log-likelihood of 0.653
We also applied another modeling engine, the RandomForest classifier, to our dataset and compared the results. The TN model generates the confusion matrix using an adjustable score threshold, and this matrix is an indicator of the model's false positive and false negative rates. Table 6 shows that the sensitivity (true positive rate) or specificity (true negative rate) is higher in TreeNet than in RandomForest; that is, the proportion of positives or negatives correctly identified as STRESS (S) or NOT STRESSED (NS) is higher in TreeNet.
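For reference, sensitivity and specificity follow from the confusion matrix as sketched below. The scores, labels and threshold are made-up illustrative values, not results from this study, and the function names are hypothetical.

```python
def predict_with_threshold(scores, threshold):
    """Label a record S when its score exceeds the adjustable threshold."""
    return ["S" if s > threshold else "NS" for s in scores]


def confusion_rates(actual, predicted, positive="S"):
    """Count the confusion-matrix cells and derive sensitivity and specificity."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical predicted scores and true classes for five records.
actual = ["S", "S", "NS", "NS", "S"]
scores = [0.8, 0.4, 0.6, 0.1, 0.9]

rates = confusion_rates(actual, predict_with_threshold(scores, 0.5))
print(rates)
```

Raising the threshold generally trades sensitivity for specificity, which is why the confusion matrix is reported at a chosen threshold.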

Hypothesis testing
We analyzed the call log data set on a daily basis over a period of time and formulated a set of hypotheses, given in Table 7. We tested these hypotheses using Pearson correlation analysis at the 95 % confidence level (0.05 significance level).
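A Pearson correlation test of this kind can be sketched as follows. The data values are made up for illustration and the function names are hypothetical; the sketch computes the correlation coefficient r and the associated t statistic with n - 2 degrees of freedom, which would then be compared against the critical value of the t distribution at the 0.05 significance level.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom.

    Undefined for |r| = 1, where the denominator is zero.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Made-up values: a network measure and an average stress score for 5 users.
measure = [1, 2, 3, 4, 5]
stress = [2, 1, 4, 3, 5]

r = pearson_r(measure, stress)
t = t_statistic(r, len(measure))
print(r, t)
```

For these made-up values r is 0.8 and t is about 2.31, which is below the two-sided critical value of 3.182 for 3 degrees of freedom, so with only five observations the correlation would not be significant at the 0.05 level.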

Conclusion
In this paper, we presented a study of human stress behavior from a social point of view. The position of an individual (actor) in a social network, and the flow of information through that node, largely determine the stress behavior of that person. When an ego is well located on the communication paths linking pairs of other nodes in a social network, that person is in a central location. A centrally positioned ego can influence the group by restraining information in transmission. Such individuals are responsible for maintaining communications, and they act as potential coordinators of the group. The more central a node is in a network, the lower its total distance from all other nodes. Eigenvector centrality assigns high scores to nodes based on their influence in the network. Similarly, the reach centrality of a node determines its reachability from other nodes. If the out-Reach of a node is high, the node acts like a hub in the network. From our analysis, we identified closeness, eigenvector and reach centrality as the major predictors of stress behavior. Hierarchy describes the nature of the constraint on ego, which is a crucial property shaping the nature of social interactions. Density is another important criterion, because it shows the extent to which information diffuses among the nodes. Therefore, how stressed an individual is also depends on that individual's social hierarchy and density. We hope this study can serve as a benchmark for further research in the field of social science.