GraphZIP: a clique-based sparse graph compression method
Ryan A. Rossi^{1} and
Rong Zhou^{2}
Received: 29 December 2017
Accepted: 24 February 2018
Published: 3 March 2018
Abstract
Massive graphs are ubiquitous and at the heart of many real-world problems and applications, ranging from the World Wide Web to social networks. As a result, techniques for compressing graphs have become increasingly important, yet graph compression remains a challenging and unsolved problem. In this work, we propose a graph compression and encoding framework called GraphZIP based on the observation that real-world graphs often contain many large cliques. Using this as a foundation, the proposed technique decomposes a graph into a set of large cliques, which is then used to compress and represent the graph succinctly. In particular, disk-resident and in-memory graph encodings are proposed and shown to be effective with three important benefits. First, they reduce the space needed to store the graph on disk (or other permanent storage device) and in memory. Second, GraphZIP reduces the I/O traffic involved in using the graph. Third, it reduces the amount of work involved in running an algorithm on the graph. The experiments demonstrate the scalability, flexibility, and effectiveness of the clique-based compression techniques using a collection of networks from various domains.
Introduction
This work proposes a fast parallel framework for graph compression based on the notion of cliques. There are two fundamentally important challenges in the era of big graph data [1]: developing faster and more efficient graph algorithms, and reducing the amount of space required to store a graph on disk or load it into memory [2, 3]. To address both, we propose a clique-based graph compression method called GraphZIP that improves the runtime and performance of graph algorithms (and computations more generally). Moreover, it reduces the space required to store the graph, both in memory and on disk, and to carry out computations. In addition, the techniques are shown to be effective for compressing both large sparse graphs and dense graphs. Both types arise frequently in many application domains and are thus of fundamental importance.
The clique-based methods support both lossless [4] and lossy graph compression [5]. For lossless compression, any heuristic or exact clique finder can be used in the clique discovery phase [6–9]. Heuristic clique finders are faster but lead to worse compression, so depending on application constraints, one may trade off time against accuracy (better compression). For instance, compressing the graph is a one-time cost, and for many applications it may be warranted to spend extra time computing a better compression (via an exact clique finder). The techniques are also flexible enough for lossy graph compression by relaxing the notion of clique via any clique relaxation [10–12]. For instance, consider approximate cliques (near-cliques) that are missing a few of the edges that would otherwise form a valid clique. It is easy to encode the missing edges as errors and reuse the previous techniques, with the exception that computations must incorporate the errors; this adds a small computational cost but may reduce the space needed quite significantly. Clearly, the pseudo-cliques must not have too many missing edges (errors), otherwise the computational costs outweigh the benefits. Most of these algorithms are easily constrained so that a pseudo-clique contains no more than k missing edges.
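To make the near-clique idea concrete, the following is a minimal sketch (all names illustrative, not the paper's implementation) that accepts a vertex set as a pseudo-clique only when at most a given number of edges are missing, returning those missing edges so they can be encoded explicitly as errors:

```python
from itertools import combinations

def as_pseudo_clique(vertices, edges, max_errors):
    """Return the missing edges if `vertices` form a near-clique with
    at most `max_errors` missing edges (to be stored explicitly as
    'errors' alongside the clique encoding), otherwise None."""
    edge_set = {frozenset(e) for e in edges}
    # every pair of vertices should be adjacent in a true clique
    missing = [(u, v) for u, v in combinations(sorted(vertices), 2)
               if frozenset((u, v)) not in edge_set]
    return missing if len(missing) <= max_errors else None
```

For example, the 4-vertex graph with all edges except (1, 4) is accepted as a pseudo-clique with one error, but rejected when zero errors are allowed.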
The applications and advantages of such graph compression techniques are numerous. For instance, parallel computing architectures such as the GPU [13–15] have limited memory and are thus unable to handle massive graphs [16, 17]. In addition, GPU algorithms must be designed with this limitation in mind and, in many cases, may be required to perform significantly more computation in order to avoid using additional memory [18, 19]. For instance, checking if a neighbor exists on the CPU takes \(\mathcal {O}(1)\) time using a perfect hash table, whereas on the GPU it takes \(\mathcal {O}(\log |N(v)|)\) using a binary search, because storing a perfect hash table for each worker (core, processing unit) on the GPU is impractical for most graphs.
Existing graph algorithms may leverage the proposed graph encoding with little effort (as opposed to existing specialized techniques). For instance, it can easily replace a current graph encoding such as CSR/CSC [20] and the like by implementing a few graph primitive functions such as “retrieve the neighbors of a node”, “check if a neighbor exists”, and “get the degree of a node”, among other primitive graph operations used as building blocks. Most importantly, the method is efficient for big data and easily parallelized for both shared and distributed memory architectures. The method also has a variety of applications beyond the time and space improvements described above. For instance, one may use this technique to visualize graph data better, as the large cliques can be combined to reveal other important structural properties [21, 22].
In particular, we propose two graph encodings:
(1) In-memory graph encoding.
(2) Disk-resident graph encoding.
These encodings are designed to:
Reduce space/storage requirements.
Speed up graph algorithms/computations and primitives.
Reduce I/O costs (e.g., CPU to GPU).
Improve caching (and thus performance).
Summary of contributions
We propose a class of clique-based graph compression techniques. Our main contributions are as follows:
Query-preserving compression
Given an arbitrary query Q on G, the query Q can be evaluated directly on the significantly smaller compressed graph \(G_{c}\). Thus, any algorithm for answering Q in G can be applied to answer Q in the compressed graph \(G_{c}\), without requiring decompression. This is in contrast to many existing techniques designed for social networks [25] and web graphs [26–29].
Support for arbitrary graph queries
The approach naturally generalizes for arbitrary graph queries including (i) graph matching queries that determine if a subgraph pattern exists or not, as well as (ii) reachability queries that determine whether a path exists between two vertices in G, among others.
Efficient incremental updates
Most real-world graphs change over time with the addition and deletion of nodes and edges [30, 31]. In contrast to existing methods, our approach provides efficient incremental maintenance of the compressed graph directly. Thus, the compressed graph is computed once and then incrementally maintained using fast, parallel, localized updates. In particular, given a new edge \((v_i, v_j)\), our approach searches only the local neighborhood around that edge (to check whether a larger clique of size \(k+1\) is possible).
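The localized update can be sketched as follows; this is a hedged illustration with our own data structures (`adj` maps vertices to neighbor sets, `cliques` maps clique ids to vertex sets, `vertex_to_clique` maps a vertex to its clique id), not the paper's implementation:

```python
def try_extend_clique(adj, cliques, vertex_to_clique, u, v):
    """Localized incremental update for a new edge (u, v): only the
    cliques containing u or v are inspected to see whether one can
    grow from size k to k+1. Returns the grown clique's id, or None."""
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
    for w, other in ((u, v), (v, u)):
        cid = vertex_to_clique.get(w)
        # the clique grows iff `other` is adjacent to every member
        if cid is not None and cliques[cid] <= adj[other]:
            cliques[cid].add(other)
            vertex_to_clique[other] = cid
            return cid
    return None
```

Only the neighborhoods of u and v are touched, which is what makes the update cheap and easy to run in parallel over independent edges.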
Parallel
The graph compression method is also scalable for massive data with billions or more edges. A parallel implementation of the algorithms is used and shown to be fast and scalable for massive networks.
Flexible
The methods support both lossless and lossy compression, and any exact or heuristic clique finder (or clique relaxation) can be used in the clique discovery phase.
Methods
Preliminaries
Let \(G=(V,E)\) be a graph. Given a vertex \(v \in V\), let \(N(v) = \{w \mid (v,w) \in E\}\) be the set of vertices adjacent to v in G. The degree \(\mathrm{d}_{v}\) of \(v \in V\) is the size of the neighborhood \(N(v)\) of v. Let \(diam(G) = \max_{v,u} D(v,u)\) denote the diameter of G, defined as the longest shortest path between any two vertices (v, u) of G, where D(v, u) is the graph distance between v and u, i.e., the minimum length of the paths connecting them.

A clique is a set of vertices any two of which are adjacent. The maximum size of a clique in G is the clique number of G and is denoted by \(\omega (G)\). We exploit many of the fundamental properties of cliques in the design of the parallel framework. Importantly, cliques are nested, such that a clique C of size k contains a clique of size \(k-1\). Cliques are also closed under exclusion: given a clique C and \(v \in C\), the set \(C \setminus \{v\}\) is also a clique. Hence, being a clique is a hereditary graph property, since every induced subgraph of C is also a clique. Finally, cliques can overlap, and in general may overlap in all but a single vertex. In network analysis and network science, cliques can be viewed as the strictest definition of a community. From that perspective, cliques are perfectly dense (every vertex has degree \(k-1\), the largest possible), perfectly compact (\(diam(C) = 1\)), and perfectly connected (C is a \((k-1)\)-vertex-connected and \((k-1)\)-edge-connected graph). Examples of cliques of size 5 and 10 are provided in Fig. 1.

We also use the notion of k-cores. A k-core in G is a vertex-induced subgraph in which all vertices have degree at least k. The k-core number of a vertex v, denoted \(K(v)\), is the largest k such that v is in a k-core. Let \(K(G)\) be the largest core number in G; then \(K(G)+1\) is an upper bound on the size of the maximum clique. Many of these facts are exploited in the design of the parallel framework for graph compression and encoding.
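The k-core numbers behind the \(K(G)+1\) bound can be computed by a simple peeling procedure; a hedged sketch (a sequential, quadratic-time version for clarity, not the paper's parallel implementation):

```python
def core_numbers(adj):
    """k-core decomposition by repeatedly peeling a minimum-degree
    vertex; returns K(v) for every vertex. max(K.values()) + 1 then
    upper-bounds the maximum clique size, as used for pruning.
    `adj` maps each vertex to its set of neighbors."""
    deg = {v: len(ns) for v, ns in adj.items()}
    neigh = {v: set(ns) for v, ns in adj.items()}
    K, k = {}, 0
    while deg:
        v = min(deg, key=deg.get)      # peel a minimum-degree vertex
        k = max(k, deg[v])             # core numbers are nondecreasing
        K[v] = k
        for w in neigh.pop(v):
            if w in neigh:
                neigh[w].discard(v)
                deg[w] -= 1
        del deg[v]
    return K
```

For a triangle with one pendant vertex, the triangle's vertices get core number 2 and the pendant gets 1, so the clique bound is 3, which here equals \(\omega(G)\) exactly.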
For each individual vertex, edge, and subgraph, we compute an upper bound on the size of the largest clique in which it can participate. These vertex, edge, and subgraph upper bounds enable us to remove any vertex, edge, or subgraph that cannot give rise to a clique of maximum size.
Inmemory graph encoding
Now, we describe a generalization of the compressed sparse column/row (CSC/CSR) encoding [20] for the clique-based graph compression techniques.^{1} This is particularly useful as an in-memory encoding of the graph. In particular, the generalization makes it easy to use the compression techniques with many existing graph algorithms. To generalize CSC/CSR, a new vertex type is introduced, called a clique vertex. Intuitively, such a vertex represents a clique C in G, and the size of the clique, denoted \(k=|C|\), must be sufficiently large. The approach always finds and uses the largest such cliques in G to achieve the best compression. However, any clique of size \(k>2\) is beneficial, as it may reduce space and I/O costs and improve algorithm performance; thus, this work requires cliques of size at least \(k=3\). Cliques of size 2 are simply edges and would not improve compression. Clearly, the larger the cliques, the more compactly the graph can be encoded, resulting in reduced I/O costs, and so forth.^{2}
Moreover, this also reduces the space needed to store the graph (on disk), as we avoid writing each clique vertex by vertex and can simply write the first vertex id of each clique. For instance, one possibility is to write the first vertex id of each clique (in order) on the first line of the file. See Fig. 3 for an example illustrating the space savings of using such an approach to encode the graph in Fig. 4. To read this type of compressed graph from disk (assuming the graph reader simply reads the graph and encodes it using any arbitrary encoding), we read the first line and immediately recover the cliques by examining each start id together with the start id of the next clique. Notice that this also reduces the costly I/O associated with reading the graph from disk, sending the data from the CPU to the GPU, transferring it across the network in a distributed architecture, and so on. Afterwards, the graph reader proceeds as usual, reading the remaining edges.
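A minimal sketch of such a disk format is shown below. It assumes vertices were already relabeled so each clique occupies a contiguous id range; the trailing sentinel (one past the last clique's last id) is our assumption to close the final range, and is not spelled out in the text:

```python
def write_graphzip(path, cliques, remaining_edges):
    """Write the first (relabeled) vertex id of each clique on the
    first line, plus a sentinel closing the last range; remaining
    edges follow, one per line."""
    starts = [c[0] for c in cliques] + [cliques[-1][-1] + 1]
    with open(path, "w") as f:
        f.write(" ".join(map(str, starts)) + "\n")
        for u, v in remaining_edges:
            f.write(f"{u} {v}\n")

def read_graphzip(path):
    """Recover each clique as a contiguous id range from consecutive
    start ids on the first line, then read the remaining edges."""
    with open(path) as f:
        starts = list(map(int, f.readline().split()))
        cliques = [range(starts[i], starts[i + 1])
                   for i in range(len(starts) - 1)]
        edges = [tuple(map(int, line.split())) for line in f]
    return cliques, edges
```

Reading the single start-id line is enough to reconstruct every clique, which is where the I/O savings over writing each clique edge by edge come from.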
The difference in cost between storing the cliques directly and the format that relabels vertices (that is, exchanges vertex ids) based on the cliques is \(\mathcal {O}(\sum _{C \in \mathcal {C}} |C|)\) compared to \(\mathcal {O}(|\mathcal {C}|+1)\). As shown in the "Results and discussion" section, this difference is significant for many real-world graphs, since \(|\mathcal {C}|\) is often in the thousands (and in many cases much larger), and the cliques in \(\mathcal {C}\) can be quite large as well. The cost of encoding the remaining edges is identical in both formats and thus not shown above. Note that the costs above are for disk-resident graph encodings, though the difference is similar for in-memory encodings.
Example
Given the simple graph in Fig. 4a, the method identifies the cliques of largest size, shown in Fig. 4b. Figure 4c shows an intuitive example of a possible graph encoding scheme using the proposed framework. To store the graph compactly (e.g., to disk), notice that one can simply write the cliques to a file such that each line contains the vertices of a clique \(C \in \mathcal {C}\), followed by the remaining edges (between nodes that do not share a clique). This can be reduced further by first recoding the vertex ids so that the ids in the cliques are \(C_1=\{v_1,\ldots , v_k\}\), \(C_{2}=\{v_{k+1},\ldots \}\), and so on. Now, it is possible to encode the vertices of each clique by simply storing the first (minimum) vertex id. For instance, in memory or on disk, the first vertex id of each clique is stored one after the other; storing \(v_1, v_6, v_{10}\) (in memory as an array, or on disk as a single line of a file) is enough, since the clique \(C_{i}\) is retrieved by examining the ith position. That is, to retrieve the first clique in Fig. 4b, we retrieve the first vertex \(v_1\) and compute the last vertex id from the \((i+1)\)th position, which holds the first vertex id of clique \(i+1\); this immediately gives the last vertex id \(v_5\). Furthermore, fundamental graph primitives (such as checking whether a node is a neighbor) benefit from this encoding, as many graph operations can be answered in \(\mathcal {O}(1)\) with a simple range check.
Algorithm
Table 1. GraphZIP compression results for a variety of real-world graphs. Note that \(S_c\) denotes the space saved by GraphZIP, defined as \(S_c = 1 - \frac{|G_c|}{|G|}\)

| Graph | \(\rho\) | \(|G|\) (bytes) | \(|G_{c}|\) (bytes) | Space savings (\(S_c\)) | Compressed \(f(G_c)\) | Time (s) |
|---|---|---|---|---|---|---|
| ia-enron-only | 0.0614 | 4119 | 3004 | 27.07 | 69.94 | 0.006 |
| ia-infect-dub. | 0.0330 | 20,662 | 13,287 | 35.69 | 79.90 | 0.007 |
| ia-email-univ | 0.0085 | 41,868 | 32,405 | 22.60 | 79.34 | 0.017 |
| bio-DD21 | 0.0009 | 136,847 | 85,510 | 37.51 | 80.72 | 0.139 |
| ca-HepPh | 0.0019 | 1,178,671 | 498,661 | 57.69 | 89.49 | 5.637 |
| ca-AstroPh | 0.0012 | 2,121,214 | 1,423,607 | 32.89 | 65.89 | 1.294 |
| ca-CondMat | 0.0004 | 1,001,147 | 695,000 | 30.58 | 80.46 | 0.892 |
| ca-MathSciNet | 0.0000 | 10,482,388 | 8,370,757 | 20.14 | 76.53 | 24.909 |
| web-google | 0.0033 | 23,043 | 12,216 | 46.99 | 81.42 | 0.004 |
| web-BerkStan | 0.0003 | 199,330 | 135,813 | 31.87 | 86.55 | 0.170 |
| web-sk-2005 | 0.0000 | 4,083,783 | 2,040,783 | 50.03 | 92.60 | 9.030 |
| web-indochina | 0.0007 | 473,984 | 167,729 | 64.61 | 94.23 | 0.141 |
| soc-gplus | 0.0001 | 453,060 | 328,927 | 27.40 | 92.36 | 0.046 |
| hamming6-2 | 0.9048 | 10,493 | 4916 | 53.15 | 90.59 | 0.004 |
| hamming8-2 | 0.9686 | 226,317 | 110,838 | 51.03 | 96.32 | 0.745 |
| c-fat200-1 | 0.0771 | 10,664 | 4674 | 56.17 | 84.57 | 0.003 |
| chesapeake | 0.2294 | 1035 | 749 | 27.63 | 58.94 | 0.001 |
Heuristic and exact clique methods are used for finding the largest possible clique at each iteration (Algorithm 1, Line 4). The heuristic method is faster but results in slightly less compression, whereas the exact method takes longer but results in better compression. Many other variants also exist. For instance, one approach that performed well was to search only the top-k vertices (using either a heuristic or exact clique finder, that is, a method that for each vertex neighborhood subgraph \(N(v)\) returns either a large clique or the maximum clique in that neighborhood). The top-k vertices to search were chosen based on the maximum k-core number of each vertex.^{5} This ordering can be recomputed after each iteration, as it is fast, taking \(\mathcal {O}(|E|)\) time where \(|E|\) is the number of edges remaining in G.
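The overall greedy loop can be sketched as below. This is an illustration under stated assumptions: the trivial degree-greedy clique finder stands in for the heuristic/exact finders cited in the text, and plain degree ordering stands in for the k-core ordering; neither is the paper's algorithm.

```python
def greedy_clique(adj, v):
    """Cheap heuristic clique finder: grow a clique around v, trying
    neighbors in decreasing-degree order (a stand-in for the exact
    and heuristic finders cited in the text)."""
    clique = {v}
    for w in sorted(adj[v], key=lambda x: -len(adj[x])):
        if clique <= adj[w]:        # w adjacent to every member so far
            clique.add(w)
    return clique

def graphzip_greedy(adj, min_size=3, top_k=10):
    """Sketch of the greedy loop (cf. Algorithm 1): repeatedly search
    the top-k highest-degree vertices (a proxy for the k-core
    ordering), record the largest clique found, remove its vertices,
    and stop when only cliques below `min_size` remain."""
    adj = {v: set(ns) for v, ns in adj.items()}
    cliques = []
    while adj:
        candidates = sorted(adj, key=lambda v: -len(adj[v]))[:top_k]
        best = max((greedy_clique(adj, v) for v in candidates),
                   key=len, default=set())
        if len(best) < min_size:
            break
        cliques.append(sorted(best))
        for v in best:                     # remove the clique's vertices
            for w in adj.pop(v):
                if w in adj:
                    adj[w].discard(v)
    remaining = {frozenset((u, v)) for u in adj for v in adj[u]}
    return cliques, remaining
```

On a 4-clique with a pendant vertex, the loop extracts the 4-clique and stops, leaving no remaining edges among the uncompressed vertices.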
Parallel algorithm variant
Results and discussion
To evaluate the effectiveness of the proposed techniques, we use a wide variety of real-world graphs^{7} from different application domains.
Table 2. Results comparing the encoded graph \(G_c\) to the compressed \(f(G_c)\). Space savings are \(S_c = 1 - \frac{|G_c|}{|G|}\)

| Graph | Encoded graph, \(G_c\) | Compressed, \(f(G_c)\) |
|---|---|---|
| ia-enron-only | 27.07 | 69.94 |
| ia-infect-dub. | 35.69 | 79.90 |
| ia-email-univ | 22.60 | 79.34 |
| bio-DD21 | 37.51 | 80.72 |
| ca-HepPh | 57.69 | 89.49 |
| ca-AstroPh | 32.89 | 65.89 |
| ca-CondMat | 30.58 | 80.46 |
| web-google | 46.99 | 81.42 |
| web-BerkStan | 31.87 | 86.55 |
| web-sk-2005 | 50.03 | 92.60 |
| web-indochina | 64.61 | 94.23 |
| soc-gplus | 27.40 | 92.36 |
| hamming6-2 | 53.15 | 90.59 |
| hamming8-2 | 51.03 | 96.32 |
| c-fat200-1 | 56.17 | 84.57 |
| chesapeake | 27.63 | 58.94 |
Results for compressing the encoded graph are shown in Table 2. These results use the same basic GraphZIP format as Table 1, but reduce the space further by applying an additional compression technique to the encoded graph. This is most useful for storing large graphs on disk. Table 3 compares GraphZIP to the BVGraph compression method using the Layered Label Propagation (LLP) order [45]. In both cases, GraphZIP compares favorably.
Table 3. GraphZIP compared to BVGraph compression with the Layered Label Propagation (LLP) order

| Graph | \(|V|\) | \(|E|\) | \(|G|\) (bytes) | \(S_{LLP}\) | \(S_c\) | \(S_{f(G_c)}\) |
|---|---|---|---|---|---|---|
| socfb-Penn | 42K | 1.4M | 4,237,507 | 40.49 | 46.56 | 78.16 |
| socfb-Texas | 36K | 1.6M | 4,605,427 | 29.80 | 34.13 | 79.53 |
Conclusion
In this work, we described a class of clique-based graph compression methods that leverage the large cliques present in real-world networks. Using this as a foundation, the proposed technique decomposes a graph into a set of large cliques, which is then used to compress and encode the graph succinctly. Disk-resident and in-memory graph encodings were proposed and shown to be effective, with the following important benefits: (1) they reduce the space needed to store the graph on disk (or other permanent storage device) and in memory, (2) GraphZIP reduces the I/O traffic involved in using the graph, (3) it reduces the amount of work involved in running an algorithm on the graph, and (4) it improves caching and performance. The graph compression techniques were shown to reduce memory usage (both on disk and in memory) and to speed up graph computations considerably. The methods also have a variety of applications beyond the time and space improvements described above. For instance, one may use this technique to visualize graph data better, as the large cliques can be combined to reveal other important structural properties. Future work will explore using GraphZIP and its in-memory encoding to speed up common graph algorithms; this requires implementing graph primitives that leverage the encoding. In addition, we will investigate GraphZIP for compressing graphs from a variety of other domains.
The CSR (CSC) encoding represents the sparse (adjacency) matrix \(\varvec{\mathrm {M}}\) of a graph G by three arrays that contain the nonzero values, the extents of the rows (columns), and the column (row) indices.
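For instance, a toy CSR instance for the undirected triangle on vertices 0, 1, 2 (each edge stored in both directions; the value array is omitted for an unweighted graph, and the array names are illustrative):

```python
row_ptr = [0, 2, 4, 6]           # extents of each vertex's row (neighbor list)
col_ind = [1, 2, 0, 2, 0, 1]     # concatenated column (neighbor) indices

def neighbors(v):
    """Neighbors of v are the slice of col_ind delimited by row_ptr."""
    return col_ind[row_ptr[v]:row_ptr[v + 1]]
```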
However, since it requires additional time to find such cliques, depending on the application domain it might be preferable to find only cliques of a larger size. Thus, the methods allow the user to adjust this parameter, which can improve runtime at the cost of space.
A successor iterator is an enumerator that can be used to obtain a list of the neighbors of a node in some (usually deterministic) order.
Significant overlap between cliques occurs frequently due to properties discussed in "Preliminaries" section.
Declarations
Authors' contributions
All authors contributed equally in the conceptualization of GraphZIP and its design and development. Both authors read and approved the final manuscript.
Acknowledgements
We thank Dan Davies for discussions.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
Data is available online at NetworkRepository [46]: http://networkrepository.com.
Ethics approval and consent to participate
Not applicable.
Funding
No funding to declare.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
1. Fisher D, DeLine R, Czerwinski M, Drucker S. Interactions with big data analytics. Interactions. 2012;19(3):50–9.
2. Kambatla K, Kollias G, Kumar V, Grama A. Trends in big data analytics. J Parallel Distrib Comput. 2014;74(7):2561–73.
3. Herodotou H, Lim H, Luo G, Borisov N, Dong L, Cetin FB, Babu S. Starfish: a self-tuning system for big data analytics. CIDR. 2011;11:261–72.
4. Peshkin L. Structure induction by lossless graph compression. arXiv preprint arXiv:cs/0703132. 2007.
5. Gupta A, Verdú S. Nonlinear sparse-graph codes for lossy compression. IEEE Trans Inf Theory. 2009;55(5):1961–75.
6. Rossi RA, Gleich DF, Gebremedhin AH, Patwary M, Ali M. Parallel maximum clique algorithms with applications to network analysis and storage, vol 10. arXiv preprint arXiv:1302.6256. 2013.
7. Tomita E, Sutani Y, Higashi T, Takahashi S, Wakatsuki M. A simple and faster branch-and-bound algorithm for finding a maximum clique. In: WALCOM: algorithms and computation. 2010. p. 191–203.
8. San Segundo P, Rodríguez-Losada D, Jiménez A. An exact bit-parallel algorithm for the maximum clique problem. Comput Oper Res. 2011;38:571–81.
9. Pullan WJ, Hoos HH. Dynamic local search for the maximum clique problem. JAIR. 2006;25:159–85.
10. Pattillo J, Youssef N, Butenko S. Clique relaxation models in social network analysis. In: Handbook of optimization in complex networks. Springer; 2012. p. 143–62.
11. Balasundaram B, Butenko S, Hicks I, Sachdeva S. Clique relaxations in social network analysis: the maximum k-plex problem. Oper Res. 2011;59(1):133–42.
12. Alba RD. A graph-theoretic definition of a sociometric clique. J Math Soc. 1973;3(1):113–26.
13. Harish P, Narayanan P. Accelerating large graph algorithms on the GPU using CUDA. In: HiPC. 2007. p. 197–208.
14. Vineet V, Narayanan P. CUDA cuts: fast graph cuts on the GPU. In: Computer vision and pattern recognition workshops (CVPRW). 2008. p. 1–8.
15. Zhou R. System and method for selecting useful smart kernels for general-purpose GPU computing. US Patent 20,150,324,707. 2015.
16. Liu X, Li M, Li S, Peng S, Liao X, Lu X. IMGPU: GPU-accelerated influence maximization in large-scale social networks. IEEE Trans Parallel Distrib Syst. 2014;25(1):136–45.
17. Pan Y, Wang Y, Wu Y, Yang C, Owens JD. Multi-GPU graph analytics. arXiv preprint arXiv:1504.04804. 2015.
18. Ryoo S, Rodrigues CI, Baghsorkhi SS, Stone SS, Kirk DB, Hwu WW. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: SIGPLAN. New York: ACM; 2008. p. 73–82.
19. Zhou R. Systems and methods for efficient sparse matrix representations with applications to sparse matrix-vector multiplication and PageRank on the GPU. 2015.
20. Kepner J, Gilbert J. Graph algorithms in the language of linear algebra. SIAM; 2011.
21. Von Landesberger T, Kuijper A, Schreck T, Kohlhammer J, van Wijk JJ, Fekete JD, Fellner DW. Visual analysis of large graphs: state-of-the-art and future research challenges. Computer Graphics Forum. New York: Wiley Online Library; 2011. p. 1719–49.
22. Ahmed NK, Rossi RA. Interactive visual graph analytics on the web. In: ICWSM. 2015. p. 566–9.
23. Traud AL, Mucha PJ, Porter MA. Social structure of Facebook networks. Phys A Stat Mech Appl. 2011;391:4165–80.
24. Girvan M, Newman MEJ. Community structure in social and biological networks. PNAS. 2002;99(12):7821–6.
25. Chierichetti F, Kumar R, Lattanzi S, Mitzenmacher M, Panconesi A, Raghavan P. On compressing social networks. In: SIGKDD. 2009. p. 219–28.
26. Grabowski S, Bieniecki W. Tight and simple web graph compression. arXiv preprint arXiv:1006.0809. 2010.
27. Buehrer G, Chellapilla K. A scalable pattern mining approach to web graph compression with communities. In: WSDM. New York: ACM; 2008. p. 95–106.
28. Boldi P, Vigna S. The WebGraph framework I: compression techniques. In: WWW. 2004. p. 595–602.
29. Suel T, Yuan J. Compressing the graph structure of the web. In: IEEE data compression conference. 2001. p. 213–22.
30. Kempe D, Kleinberg J, Kumar A. Connectivity and inference problems for temporal networks. In: STOC. 2000. p. 504–13.
31. Ahmed NK, Berchmans F, Neville J, Kompella R. Time-based sampling of social network activity graphs. In: SIGKDD MLG. 2010. p. 1–9.
32. Friedman N, Getoor L, Koller D, Pfeffer A. Learning probabilistic relational models. In: IJCAI. 1999. p. 1300–09.
33. Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Comput Surv. 2009;41(3):15.
34. Akoglu L, McGlohon M, Faloutsos C. Oddball: spotting anomalies in weighted graphs. In: Advances in knowledge discovery and data mining. 2010. p. 410–21.
35. Eberle W, Holder L. Anomaly detection in data represented as graphs. Intell Data Anal. 2007;11(6):663–89.
36. Rossi RA, Ahmed NK. Role discovery in networks. TKDE. 2015;27(4):1112–31.
37. Sun J, Faloutsos C, Papadimitriou S, Yu PS. GraphScope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining. New York: ACM; 2007. p. 687–96.
38. Prakash BA, Sridharan A, Seshadri M, Machiraju S, Faloutsos C. EigenSpokes: surprising patterns and scalable community chipping in large graphs. In: Advances in knowledge discovery and data mining. 2010. p. 435–48.
39. Leskovec J, Lang KJ, Dasgupta A, Mahoney MW. Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math. 2009;6(1):29–123.
40. Hayashida M, Akutsu T. Comparing biological networks via graph compression. BMC Syst Biol. 2010;4(Suppl 2):13.
41. Ketkar NS, Holder LB, Cook DJ. Subdue: compression-based frequent pattern discovery in graph data. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations. New York: ACM; 2005. p. 71–6.
42. Ahmed NK, Duffield N, Neville J, Kompella R. Graph sample and hold: a framework for big-graph analytics. In: SIGKDD. New York: ACM; 2014. p. 1446–55.
43. Margaritis D, Faloutsos C, Thrun S. NetCube: a scalable tool for fast data mining and compression. In: VLDB. 2001.
44. Rossi RA, Ahmed NK. Coloring large complex networks. Soc Netw Anal Min. 2014;4(1):37.
45. Boldi P, Rosa M, Santini M, Vigna S. Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In: WWW. New York: ACM; 2011. p. 587–96.
46. Rossi RA, Ahmed NK. The network data repository with interactive graph analytics and visualization. In: Proceedings of the 29th AAAI conference on artificial intelligence. 2015. p. 4292–93. http://networkrepository.com