Regularized Simple Graph Convolution (SGC) for improved interpretability of large datasets

Classification of data points which correspond to complex entities such as people or journal articles is an ongoing research task. Notable applications are recommendation systems for customer behaviors based upon their features or past purchases and, in academia, labeling relevant research papers in order to reduce the reading time required. The features that can be extracted are many and result in large datasets which are a challenge to process with complex machine learning methodologies. There is also the issue of how the results are presented and how to interpret the parameterizations beyond the classification accuracies. This work shows how the network information contained in an adjacency matrix allows improved classification of entities through their associations and how the framework of the SGC provides an expressive and fast approach. The proposed regularized SGC incorporates shrinkage upon three different aspects of the projection vectors to reduce the number of parameters, the size of the parameters and the overlap in direction between the vectors, producing more meaningful interpretations.

In addition to the node interlinking information which can produce a network (graph), the attributes of the users can provide useful information for improving predictive analytics. A notable technique used by online shopping platforms is collaborative filtering [10], which works as a recommendation system for improving the shopping experience of customers by optimizing the product view towards those items predicted to be of interest. These new link predictions are based upon past purchases and the historical records of other customers. Here 'clusters' are formed between groups of items in a multidimensional space of these choices [11,12]. This relies on the observation that items are not selected independently of previous purchases and that there is information gained from utilizing the data collected [13]. Another area where predictive analytics has been used is student performance [14]. Logistic regression has been used in situations where a model is required in a decision framework in order to predict achievements [15]. A key difference between the network based approach and these approaches is that the latter exclude from the model the information contained in the network and how the links influence the node of concern, which can play a key role in the economic behaviors predicted [16]. In the effort to fuse these sources together for models to incorporate, [17] discusses how user identity linkage across online social networks can be accomplished.
Online social networks have been at the forefront of the interest in networks and in approaches which use the interlinking information, due to their size and the effect they have on human behaviors [18]. The network topologies of virtual networks find applications of their own, but they also carry over into the physical world. The techniques have been used in other domains such as examining the centrality of streets in urban spaces [19], which can also be seen as a continuation of the original graph theoretical formulation in Euler's investigation of the 'Seven Bridges of Königsberg' problem [20]. As these urban networks fit within networks of urban spaces themselves, multilayer networks [21] are produced that span the globe, allowing an analysis of even global migration patterns [22]. In a similar fashion, it is also possible to consider academic literature as a network with similar properties governing its construction, such as homophily [23]. The nodes in such academic networks are publications and the links are the citations between the articles, which provide information of association. There is active research in this field [24], which notes that the key motivation is that researchers can spend considerable amounts of time searching for the relevant research in order not to allocate time to topics already explored with similar approaches. Being able to find associated research is important since research can otherwise be directed into areas already investigated, wasting time as well as materials, for example in studies requiring expensive lab equipment. Navigating the network to extract relevant research is therefore a key activity in preventing this. The work of [25] discusses how investigators can seek from these datasets insight into the dynamics of the growth and the interconnectivity of scientific thought.
With the growth of the citation datasets (such as the ones described in "Data" section), the processing time, the complexity of the models and the ability to interpret the results are becoming key concerns.
This poses key questions about how to process and then reason about the results from large datasets with large variations. Even the results themselves require effort in their interpretation. Work such as [26] looks at the problem from a conceptual perspective on the areas of focus for big data and how the user can interact with the data that result from a post analysis. The work of [27] provides a high level overview of the tools and approaches available for visually investigating the data and the results that different methods return. It is possible to include the full set of interpretable outcomes and the full set of relevant data features, but that produces a challenge for the practitioner in determining which features are of prime interest. A dimensionality reduction approach [28] provides a more effective experience for the practitioner.
Graph Neural Networks (GNNs) [29] provide a methodological framework for combining node features and network data in order to produce predictions within a machine learning paradigm. Applications range from image object positions in a non-Euclidean space representation to molecular properties and citation networks [30]. This work investigates a particularly simple GNN, the Simple Graph Convolution (SGC) [31], which has a simpler methodological definition and a competitive predictive accuracy. It is in developing a modified SGC that the task of reducing the dimensionality in large datasets with a GNN will be explored. As will be shown in "Methodology" section, the simplicity of the model allows it to be a basis for extensions that can incorporate constraints such as shrinkage upon the parameters. This is done in a manner similar to the regularization procedure of Lasso [32]. This allows large complex datasets to be processed in a manner that is interpretable and more accessible in terms of the computational resources required. Models other than the SGC would add complexity to procedures that are already complex, making investigations of large datasets increasingly challenging. The work effectively takes the SGC and extends it so that the model can introduce constraints upon the parameter vectors in such a manner as to allow the model to be more easily interpreted. This allows the parameters for each class to be more sparse and the classes to have less overlap with each other. Altogether this results in a parameter matrix which can more easily be inspected.
In "Related work" section, a selection of previous work which highlights the development of the GNN that led to the SGC is acknowledged. The "Data" section describes the datasets used in exploring the results of the proposed modification of the SGC with regularization. The methodology is described in "Methodology" section, where the SGC formalism is presented together with the proposed modification, in which regularization upon the features and their parameters allows a reduction of the redundant features and hence eases interpretability. The results are displayed in "Results" section, which shows how the SGC and the proposed regularized SGC fit data that would not be linearly separable but is made separable by incorporating the graph information, followed by the application to a scientific citation dataset (Cora [33]).

Related work
Convolutional Neural Networks (CNNs) [34] have brought a methodological approach for handling high dimensional problems more efficiently than other paradigms. As noted in [35], in conjunction with deep learning, CNNs have greatly improved the ability to classify sound and image data. The work of [36] formally introduces how graph based methods can be used with CNNs. A key contribution of [36] is that the extension of the model to graphs is founded upon localized graph filters instead of the CNN's localized convolution filter (or kernel). It presents a spectral graph formulation and shows how filters can be defined with respect to individual nodes in the graph within a certain number of 'hops' distance. These 'hops' represent the number of edges traversed between nodes and result from the powers of the adjacency matrix, from which the number of walks can be calculated [5] (walks are paths which allow node revisits).
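The walk-counting property of adjacency matrix powers can be illustrated with a minimal sketch. The 4-node path graph and the helper name `count_walks` are hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
import numpy as np

# Hypothetical toy graph: a path on 4 nodes, 0 - 1 - 2 - 3.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

def count_walks(adjacency, hops):
    """Entry (i, j) of the result is the number of walks of length `hops` from i to j."""
    return np.linalg.matrix_power(adjacency, hops)

walks_2 = count_walks(A, 2)  # walks using exactly 2 edge traversals
walks_3 = count_walks(A, 3)  # walks using exactly 3 edge traversals

# Node revisits are allowed: 0-1-0-1 and 0-1-2-1 are both walks of length 3 from 0 to 1.
print(walks_3[0, 1], walks_3[0, 3])
```

Entries that are zero in the k-th power mark node pairs that cannot be reached in exactly k hops, which is what makes the powers usable as localized filters.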
An introduction to the motivation from basic principles can be found in [37], where the fundamental analysis operations for signals are developed from regular grids (lattice structures) to more general graphs. The authors in [38] utilize the theory of signals on graphs in order to show how a shift-invariant convolution filter can be formulated as a polynomial of adjacency matrices. How low pass filters are an underlying principle in the GNN is discussed in [39], which is also described in the work of the SGC. [40] proposes a Graph Convolutional Network (GCN) by adapting Convolutional Neural Networks (CNNs) for graph-structured data. The GCN learns a graph representation via layer-wise propagation rules that represent localized spectral filters.
The GNN allows a user's social network to be combined with their features to make a more accurate prediction, and similarly for an academic paper, where the features (keywords or a low dimensional representation) together with the citation links can more accurately place its relevance. The machine learning framework can introduce large overheads in the processing time, especially for large datasets, but fortunately research has shown that simpler GNN models display peak performance [41]. The work of [40], which introduces a semi-supervised approach to GNNs, shows in appendix B the performance of the methodology against the number of 'layers' employed in the model and how there is an actual degradation of the performance after a few layers. The SGC [31] provides an efficient framework that models the data associations in a similar way to the GCN but avoids the necessity of the layers the GCN introduces. The methodology of the SGC, as shown in "Methodology" section, allows a single layer of matrix computations with a non-linear activation function. This is similar to the processing steps taken for logistic regression, which can be computed for large datasets very efficiently. Building upon this efficient model allows an investigator to explore further constraints which would be much more computationally demanding with the incorporation of layers.

Data
Three different datasets are employed in order to explore the proposed model, 2 of them synthetic and the last a real dataset which is well explored [33]. The first synthetic dataset has 2 dimensional features with data points placed on a circle and labels applied on opposite sides of the identity line ($x_1 = x_2$). The other synthetic dataset also has 2 dimensions, with points placed in such a way that clustering or a non-network based model relying upon distance measures would incorrectly classify the node labels. More about these 2 datasets is described below.

Circular data
Figure 1 shows the synthetic data produced with points allocated along a circle centered at the origin. There are 100 points and 50 of them are allocated to each class, placing them on either side of the identity line. A key aspect of this data is that the model will attempt to shrink the feature projections, which can incur a penalty on the optimization procedure. The compromise between the error function on the data and the regularization penalization term will require a balance, as a single feature reduces the shrinkage penalty but the direction for the optimal fit uses both dimensions. This compromise is induced since the optimal projection will be with a vector containing non-zero parameters of equal magnitude for each dimension, along the direction $x_1 = -x_2$, therefore highlighting the shrinkage of one of the parameters. The network data (the adjacency matrix) is a ring network connecting neighboring nodes.

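A dataset of the kind described for the circular data can be generated with the following minimal sketch. The function name `make_circular_data`, the unit radius and the half-step angular offset (used so that no point falls exactly on the identity line) are assumptions made for illustration and are not taken from the original work.

```python
import numpy as np

def make_circular_data(n_points=100, radius=1.0):
    """Equally spaced points on a circle, labeled by side of the line x1 = x2,
    together with a ring adjacency matrix linking neighboring points."""
    # Assumption: start at the identity line (angle pi/4) and shift by half a step
    # so that no point lies exactly on the line.
    angles = 2 * np.pi * (np.arange(n_points) + 0.5) / n_points + np.pi / 4
    X = radius * np.column_stack([np.cos(angles), np.sin(angles)])

    # Class 1 lies above the identity line (x2 > x1), class 0 below it.
    y = (X[:, 1] > X[:, 0]).astype(int)

    # Ring network: node i is linked to node i + 1, wrapping around at the end.
    A = np.zeros((n_points, n_points))
    idx = np.arange(n_points)
    A[idx, (idx + 1) % n_points] = 1.0
    A[(idx + 1) % n_points, idx] = 1.0
    return X, y, A

X, y, A = make_circular_data()
print(X.shape, int(y.sum()), int(A.sum()))
```

With an even number of points this construction places the same number of points in each class, and every node has exactly two neighbors in the ring.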
Fig. 1 The application of the proposed methodology to the circular data. Data points are produced about the origin at a fixed radius so that 100 points are equally spaced. The data points have class labels allocated so that there are 50 in each of 2 classes. This is shown in the plot with a line which determines the separation

Pho and Mantzaris J Big Data (2020) 7:91

Linearly inseparable data
Figure 2 shows 30 synthetically produced data points in 2 dimensions ($x_1$, $x_2$) which form 4 distinct clusters. Each class has 15 randomly generated data points and is separated into 2 clusters across the axis. We produce a connected (single component) network structure for the data points, with a denser set of connections between points of the same label. This construction is in line with the concept of modularity in networks [42], where the density of the edges between nodes of the same label is proportionately greater than the density between nodes with different labels. Without the network structure, distance metrics would produce erroneous results, and the introduction of this information increases the accuracy. This allows the linear operations to produce a separation for the class labels.

Fig. 2 Data points of 2 classes (15 per class) are randomly generated, each class forming 2 distinct clusters residing on opposite sides of the axis. a Presents the scatter plots of these data points. b Shows the network connections among those data points. In terms of a distance metric between the points, or from using a linear projection, erroneous label assignments can arise from using only the feature data, but the incorporation of the network associations shown in b allows this to be bypassed. The results of applying the proposed methodology are shown in "Results" section

Methodology
[40] develops Graph Convolutional Networks (GCNs) by adapting Convolutional Neural Networks (CNNs) for graph-structured data, and the work of [31], proposing the Simple Graph Convolution (SGC), builds upon it. The SGC removes the non-linear transitions between the layers in the model. This simplification speeds up processing time significantly yet still performs on par with GCNs and other state-of-the-art graph neural network models across multiple benchmark graph datasets. The model modification proposed here will allow easier interpretability of the parameters fitted by the optimization procedure through the application of a set of constraints. The constraints introduced into the loss function will force the stochastic gradient descent algorithm to find directions which have fewer non-zero values and less overlap for the parameters between the classes. This addresses the problem of how to inspect effectively the matrix of parameters and the vectors of the parameters for each class. This takes inspiration from regularization methods.
We adopt the notations presented in [40] and [31] for the GCN and SGC respectively. A graph $G = (V; A)$ can be defined as a set of nodes (vertices) $V = (v_1, v_2, \ldots, v_N)$ containing $N$ nodes and an adjacency matrix $A \in \mathbb{R}^{N \times N}$, where $a_{ij}$ is the weighted edge between nodes $v_i$ and $v_j$ ($a_{ij} = 0$ if $v_i$ and $v_j$ are not connected). We define the degree matrix $D = \mathrm{diag}(d_1, d_2, \ldots, d_N)$ as a diagonal matrix whose off-diagonal elements are zero and whose diagonal elements $d_i$ capture the degree of node $v_i$, with $d_i = \sum_j a_{ij}$. There is a feature matrix (also referred to as the design matrix) $X \in \mathbb{R}^{N \times D}$ where each row $x_i$ is the feature vector measured on node $i$ of the graph, and each column corresponds to a different dimension of the features. Each node $i$ has a class label from $C$ classes and hence can be coded as a one-hot vector $y_i \in \{0, 1\}^C$. The GCN and SGC add self-loops and normalize the adjacency matrix to get the matrix $S$:

$$S = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \tag{1}$$

where $\tilde{A} = A + I$ and $\tilde{D}$ is the degree matrix of $\tilde{A}$. This normalization allows successive powers of the matrix to not influence the overall size of the projections. The SGC removes the non-linear transformations between the layers of the GCN, resulting in a linear model of the form:

$$\hat{Y} = \mathrm{softmax}(S \cdots S S X \Theta^{(1)} \Theta^{(2)} \cdots \Theta^{(k)}). \tag{2}$$

The SGC classifier is then achieved by collapsing the repeated multiplication by the matrix $S$ into the $k$-th power matrix $S^k$ and reparameterizing the successive weight matrices as $\Theta = \Theta^{(1)} \Theta^{(2)} \cdots \Theta^{(k)}$:

$$\hat{Y} = \mathrm{softmax}(S^k X \Theta). \tag{3}$$

The parameter $k$ corresponds to the number of 'hops', which is the number of edge traversals in the normalized network adjacency matrix $S$; $k$ can be thought of as accumulating information from a certain number of hops away from a node (as described visually in [31]). If $k = 0$ the methodology becomes equivalent to logistic regression, which is known to be scalable to large datasets. Since the SGC introduces the matrix $S$ as a linear operation, the same scalability applies. The weight matrix is trained by minimizing the cross entropy loss:

$$\mathcal{L} = -\sum_{l \in Y_L} \sum_{c=1}^{C} Y_{lc} \ln \hat{Y}_{lc} \tag{4}$$

where $Y_L$ is the collection of labeled nodes. As motivated in "Introduction", although the SGC shows how an efficient formulation of GNNs can be derived, it does not also provide the ability to reduce the feature set. To reduce the number of parameter values, we introduce a flexible set of constraints as shrinkage operators added to the loss of Eq. 4, giving the regularized loss $\mathcal{L}_R$ (Eq. 5). The first component of $\mathcal{L}_R$ is the loss from the SGC, $\mathcal{L}$. Next, $L_1$ is the shrinkage term penalizing the number of parameters, where the penalization is reduced by a larger skew in the elements within the columns of $\Theta_R$. The term $|\Theta_R(\cdot, c)|^4$ denotes the normalized vector for each class projection in the parameter matrix (the columns) with each element raised to the power of 4. The $L_2$ term is the total magnitude of the parameter vector, so that the distribution of the terms is not influential but only the resulting norm. The $L_3$ term penalizes class label projections which have large overlaps, so that vectors will be orthogonal or, depending upon the value of $L_3$, support opposing directions. The parameters for the regularized fit using the shrinkage in the loss will be referred to as $\Theta_R$. To impose an orthogonality constraint between the projection vectors, the term for $L_3$ is modified (Eq. 6).
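The SGC classifier described in this section (normalized adjacency with self-loops, propagation over k hops, softmax, and cross entropy over the labeled nodes) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `normalized_adjacency`, `softmax`, `sgc_predict` and `cross_entropy` are hypothetical, and the 3-node graph is a toy example.

```python
import numpy as np

def normalized_adjacency(A):
    """S = D~^{-1/2} (A + I) D~^{-1/2}, with D~ the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(Z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def sgc_predict(A, X, Theta, k):
    """Class probabilities softmax(S^k X Theta); k = 0 reduces to logistic regression."""
    S = normalized_adjacency(A)
    return softmax(np.linalg.matrix_power(S, k) @ X @ Theta)

def cross_entropy(Y_onehot, Y_hat, labeled):
    """Cross entropy summed over the labeled nodes only."""
    return -np.sum(Y_onehot[labeled] * np.log(Y_hat[labeled]))

# Toy usage on a hypothetical 3-node path graph with 2 features and 2 classes.
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([[1, 0], [0, 1], [1, 0]])
Theta = np.zeros((2, 2))                      # untrained weights
Y_hat = sgc_predict(A, X, Theta, k=2)
loss = cross_entropy(Y, Y_hat, labeled=[0, 1, 2])
print(Y_hat, loss)
```

Because the product of the k-th power of S with X does not depend on the weights, it can be computed once before training, which is the source of the efficiency noted in the text.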
This methodology therefore delivers a formulation based upon an approach with layers, as other 'deep learning' frameworks provide, but without the computational burdens that come along with it. The simplified model implementation is therefore capable of being run on a personal computer with PyTorch [43].
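The three shrinkage components described in this section can be illustrated with the sketch below. It is an assumption, not the authors' exact loss: the specific functional forms (one minus the sum of fourth powers of each unit-normalized column for sparsity, the Frobenius norm for magnitude, and pairwise cosine similarity between class columns for overlap), the function name `shrinkage_penalty` and the argument names `lam1`, `lam2`, `lam3` (standing in for $L_1$, $L_2$, $L_3$) are all hypothetical choices that merely follow the verbal description. NumPy is used instead of PyTorch to keep the example self-contained.

```python
import numpy as np

def shrinkage_penalty(Theta, lam1, lam2, lam3, orthogonal=False, eps=1e-12):
    """Hypothetical penalty to be added to the cross entropy loss.

    Theta has one row per feature and one column per class.
    """
    # Unit-normalize each class column (eps guards against all-zero columns).
    norms = np.linalg.norm(Theta, axis=0) + eps
    Theta_hat = Theta / norms

    # Sparsity: the sum of 4th powers of a unit vector is 1 when a single element
    # carries all the weight and smaller when the weight is spread out, so this
    # term shrinks as the column becomes more skewed.
    sparsity = np.sum(1.0 - np.sum(Theta_hat ** 4, axis=0))

    # Magnitude: only the overall norm matters, not how it is distributed.
    magnitude = np.linalg.norm(Theta)

    # Overlap: cosine similarity between every pair of distinct class columns.
    cosines = Theta_hat.T @ Theta_hat
    pairwise = cosines[np.triu_indices_from(cosines, k=1)]
    # Plain cosines reward opposing directions; squared cosines favor orthogonality.
    overlap = np.sum(pairwise ** 2) if orthogonal else np.sum(pairwise)

    return lam1 * sparsity + lam2 * magnitude + lam3 * overlap

# Two classes using disjoint single features incur no sparsity or overlap penalty.
print(shrinkage_penalty(np.eye(2), lam1=1.0, lam2=0.0, lam3=1.0))
```

In a gradient based fit this value would be added to the cross entropy before differentiation; setting all three coefficients to 0 recovers the unregularized SGC.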

Results
Here we present the results of applying the proposed methodology to the datasets described in "Data" section. The results for the synthetic circularly placed datapoints, with labels allocated on either side of the identity line of the 2 dimensions (described in "Circular data" section of the "Data" section), are shown in the subsection "Synthetic circular data". The synthetic datapoints placed along 2 dimensions without a linear separation of the labels based upon a distance metric, but separable with the network information, are described in "Linearly inseparable data" section, and the results are shown in the subsection of "Results" section, "Synthetic linearly inseparable data". The results of the application to the real dataset of [33] (Cora citation dataset) are shown in the subsection of "Results" section, "Application to the Cora dataset". The methods of logistic regression, SGC and the regularized SGC are applied and the results are compared, revealing that with regularization the fitted parameter vectors for each class have less overlap between themselves, so that the characteristics of the classes can be more effectively interpreted.

Synthetic circular data
The results of applying the SGC methodology with and without regularization to the synthetic circular data are shown in Fig. 3. The points in the dataset have 2 features, $x_1$ and $x_2$. In Subfigure (a) the SGC model produces a perfect accuracy with 2 parameters used and the projection vectors pointing in the proper direction of each class. The regularized SGC returns different solutions due to the regularization of the parameter matrix $\Theta_R$. Subfigure (b) shows the regularized parameter vectors under different initializations of the learning algorithm which applies the constraints. These constraints reduce the number of parameters used and the size of the vectors as a norm, and adjust the direction between the vectors to be more informative. It can be seen how different random initializations produce different loss values and accuracies depending upon the local optima arrived at. These different stable points show that the shrinkage factors are affecting the vectors for each class in $\Theta_R$.
A separation based upon the identity line for the 2 dimensions represents a situation where there is equal weight upon all the features of the data and the inference scheme must make a choice in the penalization. The choice results in a loss of accuracy in exchange for a decrease in the penalization from the 3 regularization components calculated from the projections, $L_1$, $L_2$ and $L_3$, as discussed in "Methodology" section. The variation shows that the model is able to explore a wide range of vectors for the matrix columns of $\Theta_R$. From this range the choice with the largest accuracy (lowest loss) can be chosen.

Synthetic linearly inseparable data
In this subsection, we apply the SGC method with and without regularization on the linearly inseparable data which contains feature coordinates and a network of associations. The dataset used here is described in "Data" section's subsection "Linearly inseparable data" where the coordinate space of the datapoints and the network are displayed.
The key aspect which this dataset emphasizes is that the features alone, without the network information, cannot produce a linear separation, but with the incorporation of the network information (with linear operators) this classification becomes possible. Figure 4 shows the results of applying the SGC to the dataset without the network information being used ($k = 0$), which is effectively an application of logistic regression. The methodology cannot separate the data correctly with a pair of linear projections, but this can be alleviated, as seen in the next figures, by incorporating the network information as well ($k > 0$). Using the SGC (by setting the shrinkage parameters to 0), Subfigure (a) of Fig. 5 shows that although the vectors for the class projections, the columns of $\Theta$, do not enable a separation between the groups, the network information enables a perfect accuracy to be produced. This is because, although support for an erroneous class can be accumulated for a point, the feature space 'communicated' to it through its edge connections overrides the node's own features in these cases. Each plot is an independent run with slight changes in $\Theta$. Subfigure (b) shows a set of plots where the axes $x_1^*$ and $x_2^*$ for each data point represent the projection of the features with the 'neighborhoods' of the points. With $k = 2$, $S^2$ aggregates the weights from '2 hops' distance in the network, so that the product $S^2 X$ is shown on these new axes. It can then be understood why the data is 'linearly' separable after this transformation. This emphasizes how the network information can be used to improve the accuracy and maintain model simplicity.
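The effect of the 2-hop transformation can be reproduced on a small hand-made example. The 8 points, their labels and the network below are hypothetical and are not the dataset of the paper: the two classes interleave along $x_1$ so that the raw features cannot be split by a threshold, while links are dense within each class and a single link joins the classes.

```python
import numpy as np

def normalized_adjacency(A):
    """S = D~^{-1/2} (A + I) D~^{-1/2}, with D~ the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Hypothetical toy features: each class forms 2 clusters that interleave along x1.
X = np.array([
    [-3.0, 0.5], [-3.0, -0.5], [1.0, 0.5], [1.0, -0.5],   # class 0
    [-1.0, 0.5], [-1.0, -0.5], [3.0, 0.5], [3.0, -0.5],   # class 1
])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Dense links inside each class and one link between classes (single component).
A = np.zeros((8, 8))
for group in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for i in group:
        for j in group:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0

S = normalized_adjacency(A)
X_star = np.linalg.matrix_power(S, 2) @ X   # coordinates on the axes x1*, x2* for k = 2

# Before propagation the classes overlap along x1; afterwards they do not.
raw_overlap = X[labels == 0, 0].max() > X[labels == 1, 0].min()
separated = X_star[labels == 0, 0].max() < X_star[labels == 1, 0].min()
print(raw_overlap, separated)
```

Each node's propagated coordinates are pulled towards the average of its neighborhood, so nodes whose own features sit on the 'wrong' side are moved by the features of their same-class neighbors.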
In Fig. 6 the regularized SGC is applied to the dataset (with $L_2 = 0$) and the parameter vectors for each class from $\Theta_R$ are plotted in both Subfigures (a) and (b). The constraints (shrinkages) are placed on the sum of the elements within each column $\Theta_R(\cdot, j)$ and on the direction of the vectors, which reduces the total value summation for feature extraction. In Subfigure (a) the projection vectors of $\Theta_R$ are shown and, as with Fig. 5, the results produce a perfect accuracy. What can be seen is that the model explores alternative parameterizations which were not found previously without regularization. The projections all display the drop of a feature dimension. Subfigure (b) shows the $S^2 X$ projection and that perfect accuracy can still be achieved. In each of the plots it can be seen how various projections, equivalent in terms of accuracy, can be found which effectively reduce the number of features used for each class being predicted. The ability for the non-linearly separable data to be correctly classified without the introduction of new parameters or 'layers' in the CNN enables explorations to be done more efficiently on large datasets in terms of time and processing capabilities.
The change of the $L_3$ constraint is utilized so that the projection vectors for each class are fit to be orthogonal to each other (shown in Eq. 6). This allows the possibility for a smaller number of classes to be fit if the class number is not known, and differs from the previous application in that the support for each class would be seen as a separate linear function's projected value. Figure 7 shows the results: Subfigure (a) shows the vectors fit in the space of the data points and Subfigure (b) shows how the data is transformed onto different axes using the network data. It can be seen how the orthogonality constraint is preserved and the accuracy of the fits is still achieved for this problem.

Application to the Cora dataset
Here the application of the SGC methodology and the proposed regularized SGC to the Cora dataset [33] is presented. The purpose is to examine the capability of both the SGC and the regularized SGC on a dataset with a large number of features. There are many situations in big data applications where the datasets have large numbers of features due to larger data gathering schemes and a requirement to select key features without supervision. The SGC has been applied to the Cora dataset [44], and here the regularized version is mainly directed at interpretability, highlighting the key variables in the feature set while also applying other constraints. Figure 8 presents the results in 2 Subfigures with heatmaps displaying the parameter values fitted for each class in $\Theta$ and $\Theta_R$. The dataset classifies each document as belonging to one of 7 different classes, so the SGC produces a parameter matrix with 7 columns and $d$ rows, one per feature. The constraint upon $L_3$ is set so that the projection vectors between classes are in opposing directions, so that class feature loadings are differentiated by their placement in a histogram of the values. With the SGC applied, Subfigure (a) shows the weights of the parameters for each class (a single column in $\Theta$) as a separate heatmap with a legend for the values indicated. Analogously, the same set of results produced with the regularized SGC proposed here is shown in Subfigure (b). The approach produces a new parameter matrix $\Theta_R$ by introducing the regularizations into the inference scheme for the parameters, penalizing their total sum and directions so that they are as informative as possible about the features in terms of prediction accuracy and lack of overlap (removing redundancy within the large feature spaces typical of large datasets).
It can be seen that there are fewer variables highlighted for the practitioner to examine when looking to investigate which variables are important for determining class membership. The 7 cells highlighted in the bottom right are padding. Using the data in the heatmaps shown in Fig. 8, a histogram of the parameter values for each class and the features is created for the SGC and for the regularized SGC, with values held in the matrices $\Theta$ (Subfigure a) and $\Theta_R$ (Subfigure b). In Fig. 9, Subfigure (a) shows how there is a smaller group of features which provide a positive contribution to the class identification and that an apparent bimodal distribution can be made out. Each plot belongs to a different class in the dataset and is a different column in $\Theta$. Subfigure (b) shows the parameter value distribution within $\Theta_R$. The effect of the regularization can be seen in comparison with Subfigure (a), in that the majority of the feature values are at 0. This makes exploring and backtracking through the features easier.

Fig. 6 The application of the regularized SGC to the dataset where linear projections are incapable of class separation. a Shows the parameter vectors produced by introducing constraints into the SGC method on the feature space with the original data points and how the classification can then produce perfect accuracy. It can be seen that the proposed SGC reduces the effective number of features used in the columns of the matrix $\Theta_R$. b Shows the plots for the same set of weight vectors displayed on the projection axes of $S^2 X$ where each datapoint (node) accumulates feature information from neighbors 2-hops away ($k = 2$)
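The kind of inspection described for the fitted parameter matrices (how many weights sit near zero for each class, and which features carry the largest weights) can be sketched as follows. The helper names `near_zero_fraction` and `top_features`, the tolerance `tol` and the tiny 3-feature, 2-class matrix are hypothetical illustrations, not values from the Cora experiment.

```python
import numpy as np

def near_zero_fraction(Theta, tol=1e-3):
    """Fraction of features per class (column) whose weight is within `tol` of zero."""
    return np.mean(np.abs(Theta) < tol, axis=0)

def top_features(Theta, class_index, n=5):
    """Indices and weights of the `n` largest-magnitude features for one class."""
    column = Theta[:, class_index]
    order = np.argsort(-np.abs(column))[:n]
    return [(int(i), float(column[i])) for i in order]

# Hypothetical parameter matrix: rows are features, columns are classes.
Theta = np.array([
    [0.0, 2.0],
    [0.5, 0.0],
    [-1.5, 0.0004],
])

print(near_zero_fraction(Theta))          # share of near-zero weights per class
print(top_features(Theta, class_index=0, n=2))
```

Applied to both the unregularized and the regularized parameter matrices, a summary of this kind gives a numerical counterpart to the heatmaps and histograms, and the top-ranked indices indicate which features a practitioner might examine first.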

Discussion
This work proposes a model extension of the Simple Graph Convolution (SGC) which aims at producing a smaller and more meaningful set of projections in which classification labels are presented. It addresses a key issue with the interpretability of model applications in big data, where many of the features used may be redundant and remove the ability for a practitioner to examine the weights. A key reason why the SGC was chosen to be extended with this capability is that the operations in the methodology are linear, with the exception of the softmax function application, and can be run relatively efficiently in comparison to methodologies relying on more parameterizations and more 'layers' in order to improve accuracy.
The SGC incorporating the network information can produce accurate classification of points in a feature space which is not linearly separable by utilizing the network information via linear operations. The results demonstrated this capability on a small dataset where the network projection effectively linearizes the search by having information from the node 'neighborhood' accumulated from 'k-hops' distance (relying upon the powers of the adjacency matrix). This allows for fast run times and the application to services which rely upon small delays. The methodology was applied to the Cora citation dataset, which has a large number of features, and the reduction in the number of features highlighted to the user is significant. This provides a set small enough to explore manually if required.

Fig. 7 The results of applying the regularized SGC with an orthogonal constraint upon the projection, achieved by using a change in the regularization term $L_3$. In a the vector projections within the methodology can be fit so that there is no loss and the orthogonality constraint is satisfied. In b the transformation upon the data with the network information is presented, showing how within this space the linear projections can separate the classes with the orthogonal vectors

Fig. 8 The plots showing the results of applying the proposed method of SGC with and without regularization on the features to the dataset of Cora. a Shows the heatmap of the class columns of the matrix $\Theta$, which holds the parameter values for the feature projections of the data $X$ after the inference with SGC without the regularization. b Analogously shows the parameter values but with the inference procedure applying the constraints for the regularization as proposed, which produces the shown values of $\Theta_R$. On the bottom right of each plot there are 7 cells with padded values to produce the heatmaps. The columns of the parameter matrix correspond to different classes, each shown separately, and hold the weights applied to each feature belonging to the nodes. It can be seen that the regularization reduces the amount of weighting over the features, highlighting key variables

Fig. 9 The distribution of the parameter values inferred for the Cora dataset with the application of the SGC and the regularized SGC. In a the SGC is applied and the histograms of the parameter values for each class in $\Theta$ are shown in the plots. b Shows the equivalent plots but using the regularized SGC that penalizes the number of features. The majority of the features are around value zero

Conclusion and future work
The SGC model extension presented here allows for a more explainable set of results to be presented to the user. The regularization terms reduce the number of non-zero parameters and the overlap between parameterizations of the different classes. Future work could entail a more in depth exploration of how the network can be 'decomposed' in such a way as to minimize the number of label alterations. Producing a network separation by eliminating edges can find applications in social networks where polarized communities must be isolated as a means of inoculation.