Very large graphs, such as Web snapshots, large social networks, or software dependency graphs, can be analyzed using two approaches: distributing the computation across multiple computational units, or compressing the graph so that it fits in the memory of a single one. A popular instance of the latter approach is the WebGraph framework, which has recently been rethought and reimplemented in Rust to overcome some Java limitations, such as the restricted maximum array size.

This project aims to improve compression performance by replacing the instantaneous codes used by WebGraph with a recently proposed entropy coder, Asymmetric Numeral Systems (ANS), which, combined with other ideas such as symbol folding, has proven to be remarkably effective.
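To give a rough intuition of how symbol folding works, here is a minimal sketch (the `RADIX` and `THRESHOLD` values are hypothetical, and the crate's actual folding scheme may differ in detail):

```rust
/// Illustrative symbol folding: values below THRESHOLD enter the ANS
/// alphabet unchanged, while larger values repeatedly shed RADIX low-order
/// bits, which are written raw to a side stream. This keeps the alphabet
/// small at the cost of a few uncompressed bits on rare large values.
const RADIX: u32 = 8; // bits shed per folding step (hypothetical)
const THRESHOLD: u64 = 1 << 10; // alphabet cutoff (hypothetical)

fn fold(mut value: u64, raw: &mut Vec<u64>) -> u64 {
    let mut folds = 0;
    while value >= THRESHOLD {
        raw.push(value & ((1 << RADIX) - 1)); // low bits go out uncompressed
        value >>= RADIX;
        folds += 1;
    }
    // Pack the fold count with the reduced value so the decoder knows
    // how many raw chunks to read back.
    folds * THRESHOLD + value
}

fn unfold(folded: u64, raw: &mut Vec<u64>) -> u64 {
    let folds = folded / THRESHOLD;
    let mut value = folded % THRESHOLD;
    // ANS decodes symbols in reverse encoding order, which conveniently
    // matches popping the raw chunks in LIFO order.
    for _ in 0..folds {
        value = (value << RADIX) | raw.pop().expect("missing raw chunk");
    }
    value
}
```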
Since this crate is based on WebGraph, please refer to its documentation to learn more about the files and structs mentioned below.
Recompressing a `BvGraph` produces the `<graph_name>.ans`, `<graph_name>.pointers`, and `<graph_name>.states` files in the output directory.
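For example, after recompressing a graph, the output directory should contain:

```shell
$ ls <output_dir>
<graph_name>.ans  <graph_name>.pointers  <graph_name>.states
```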
- Compile the bvcomp binary:

  ```shell
  $ cargo build --release --bin bvcomp
  ```

- Run bvcomp to recompress the graph:

  ```shell
  $ ./target/release/bvcomp <path_to_graph> <output_dir> <new_graph_name> [<compression_params>]
  ```

For example:

```shell
$ ./target/release/bvcomp tests/data/cnr-2000/cnr-2000 . ans-cnr-2000
```

This command recompresses the cnr-2000.graph file located in tests/data/cnr-2000/ and saves the output in the current directory under the name ans-cnr-2000.
Note: `[<compression_params>]` is optional. If omitted, default values are used.
To load a `BvGraphSeq`, you only need the `BASENAME.ans` file. Then, you can use `ANSBvGraphSeq`:

```rust
let graph = ANSBvGraphSeq::load("BASENAME")?;
```
Once you have a `BvGraphSeq`, you can use it to visit the graph sequentially.
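For instance, a sequential visit might look like the following sketch; it assumes the lending-iterator style of the underlying webgraph crate, where `iter()` yields each node together with an iterator over its successors:

```rust
// A minimal sketch of a sequential visit (API assumed from the webgraph crate).
let graph = ANSBvGraphSeq::load("BASENAME")?;
let mut iter = graph.iter();
while let Some((node, successors)) = iter.next() {
    // Count this node's successors, i.e. its outdegree.
    let outdegree = successors.into_iter().count();
    println!("node {node}: outdegree {outdegree}");
}
```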
To load a `BvGraph`, you need to supply the following files: `BASENAME.ans`, `BASENAME.pointers`, and `BASENAME.states`.

```rust
let graph = ANSBvGraph::load("BASENAME")?;
```
Once you have a `BvGraph`, you can use it to visit the graph, even with random access.
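For instance, random access might look like this sketch, assuming the `outdegree` and `successors` methods of the webgraph crate's random-access trait:

```rust
// A minimal sketch of random access (API assumed from the webgraph crate).
let graph = ANSBvGraph::load("BASENAME")?;
let node = 42; // hypothetical node id
// Jump straight to any node without decoding its predecessors.
println!("outdegree of {node}: {}", graph.outdegree(node));
for succ in graph.successors(node) {
    println!("{node} -> {succ}");
}
```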
The proposed methodology is evaluated by comparing its compression effectiveness and decoding speed to those of WebGraph, using several real web graphs and social graphs available on the LAW website:
Graph | Nodes | Arcs |
---|---|---|
dblp-2011 | 986,324 | 6,707,236 |
enwiki-2023 | 6,625,370 | 165,206,104 |
eu-2015 | 1,070,557,254 | 91,792,261,600 |
eu-2015-host | 11,264,052 | 386,915,963 |
gsh-2015 | 988,490,691 | 33,877,399,152 |
gsh-2015-host | 68,660,142 | 1,802,747,600 |
hollywood-2011 | 2,180,759 | 228,985,632 |
twitter-2010 | 41,652,230 | 1,468,365,182 |
All experiments were performed on an Intel 12th Gen i7-12700KF CPU with 64 GB of RAM and 37 MB of cache; the code was compiled with level 3 optimizations and native CPU instructions.
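For reference, such a build presumably corresponds to something along these lines (`opt-level = 3` is already the default of cargo's release profile, and `target-cpu=native` enables native CPU instructions):

```shell
$ RUSTFLAGS="-C target-cpu=native" cargo build --release
```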
P.S.: the tables in the next sections are computed using percomponent_analysis.py and script.py, so you can measure the performance on your own machine by running those scripts.
The next tables show the comparison between WebGraph and the proposed methodology. In particular, given a graph:
- BVGraph is the size of the graph compressed with WebGraph;
- ANSBVGraph is the size of the file resulting from compressing the graph with the proposed methodology;
- bit/link is the number of bits required to represent an arc of the graph;
- phases is the cost of representing all the data needed to randomly access the graph;
- random speed is the time needed to enumerate the successors of 10 million random nodes.

The values in parentheses represent the comparison with WebGraph.
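As a sanity check on the bit/link column: for dblp-2011, 7.728 bit/link over 6,707,236 arcs gives about 51.8 million bits, i.e. roughly 6.2 MiB, matching the reported ANSBVGraph size up to rounding.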
The results highlight how the proposed approach improves compression performance by an average of 10% across the whole set of graphs. This is particularly striking for web graphs, since WebGraph already uses specialized codes tailored to the distribution of residuals, and indeed the compression records obtained by BVGraph with LLP ordering have stood for almost two decades. This is the first time a significant improvement in compression has been observed. Even social graphs, notoriously hard to compress due to the absence of the exploitable patterns present in web graphs, are compressed with better results.
Graph | BVGraph | ANSBVGraph | bit/link | phases | total | random speed (ns/arc) |
---|---|---|---|---|---|---|
dblp-2011 | 7.0MiB | 6.2MiB | 7.728 (-11%) | 4.2MiB (+341%) | 10.4MiB | 58.0 (+10.9%) |
enwiki-2023 | 266.8MiB | 244.0MiB | 12.389 (-9%) | 30.4MiB (+263%) | 274.4MiB | 55.0 (+65.2%) |
eu-2015 | 12.8GiB | 11.4GiB | 1.064 (-11%) | 4.6GiB (+313%) | 16.0GiB | 22.0 (+22.7%) |
eu-2015-host | 151.0MiB | 140.9MiB | 3.056 (-7%) | 49.6MiB (+299%) | 190.6MiB | 23.0 (+37.7%) |
gsh-2015 | 8.1GiB | 7.2GiB | 1.823 (-11%) | 4.2GiB (+330%) | 11.4GiB | 36.0 (+25.3%) |
gsh-2015-host | 1.3GiB | 1.2GiB | 5.764 (-7%) | 306.9MiB (+289%) | 1.5GiB | 38.0 (+32.0%) |
hollywood-2011 | 139.6MiB | 132.0MiB | 4.834 (-5%) | 10.2MiB (+245%) | 142.2MiB | 32.0 (+64.9%) |
twitter-2010 | 2.4GiB | 2.2GiB | 12.974 (-8%) | 194.1MiB (+237%) | 2.4GiB | 45.0 (+61.0%) |
The next table gives a more in-depth look at the compression results shown above. In particular, the compression rates of the proposed methodology are shown and compared with those of WebGraph for each individual component of the BVGraph format. This highlights the performance of ANS compression without taking into account the cost of storing the models.
Graph | Blocks | Intervals | Residuals | References | Outdegree |
---|---|---|---|---|---|
dblp-2011 | 546.2KiB (-2.4%) | 256.1KiB (-33.1%) | 4.6MiB (-11.4%) | 235.8KiB (+1.1%) | 489.6KiB (-18.0%) |
enwiki-2023 | 11.2MiB (-2.5%) | 1.3MiB (-38.3%) | 224.6MiB (-8.3%) | 2.1MiB (-5.0%) | 4.6MiB (-24.7%) |
eu-2015 | 1.6GiB (-5.4%) | 1.1GiB (-22.1%) | 7.4GiB (-6.8%) | 334.4MiB (-24.1%) | 948.0MiB (-26.4%) |
eu-2015-host | 7.7MiB (+0.5%) | 2.8MiB (-29.0%) | 122.4MiB (-6.4%) | 2.2MiB (-5.1%) | 5.8MiB (-8.0%) |
gsh-2015 | 814.7MiB (-4.4%) | 480.9MiB (-18.1%) | 4.9GiB (-8.0%) | 305.0MiB (-21.8%) | 749.7MiB (-24.7%) |
gsh-2015-host | 70.4MiB (+0.1%) | 23.0MiB (-25.4%) | 1.1GiB (-7.0%) | 16.2MiB (-6.0%) | 39.9MiB (-9.3%) |
hollywood-2011 | 10.8MiB (-0.4%) | 6.5MiB (-26.7%) | 111.9MiB (-4.0%) | 649.3KiB (-14.1%) | 2.0MiB (-23.2%) |
twitter-2010 | 79.3MiB (+6.8%) | 7.9MiB (-45.7%) | 2.1GiB (-8.3%) | 12.0MiB (+1.5%) | 26.2MiB (-18.7%) |
The next table shows the encoding results for the highly compressed graphs, with an average compression improvement of around 10% in this scenario as well. As in the previous results, the sequential speed, which measures the time needed to enumerate the successors of all graph nodes visited sequentially, increases only moderately, still allowing the user to retrieve data at very high speed despite the additional layer of compression.
Graph | hc-BVGraph | hc-ANSBVGraph | hc-bit/link | Sequential Speed (ns/arc) |
---|---|---|---|---|
dblp-2011 | 6.8MiB | 6.0MiB | 7.456 (-12%) | 21.5 (+94.2%) |
enwiki-2023 | 259.8MiB | 236.0MiB | 11.985 (-9%) | 18.7 (+74.9%) |
eu-2015 | 9.6GiB | 8.6GiB | 0.801 (-11%) | 3.4 (-13.0%) |
eu-2015-host | 144.5MiB | 134.5MiB | 2.916 (-7%) | 6.4 (+81.6%) |
gsh-2015 | 6.3GiB | 5.6GiB | 1.418 (-12%) | 4.4 (+46.6%) |
gsh-2015-host | 1.2GiB | 1.2GiB | 5.492 (-7%) | 9.9 (+72.5%) |
hollywood-2011 | 133.3MiB | 125.6MiB | 4.602 (-6%) | 10.5 (+71.6%) |
twitter-2010 | 2.4GiB | 2.2GiB | 12.731 (-8%) | 16.4 (+76.3%) |
Even if less important, the compression rates of individual components reach peaks of around 45%, despite the fewer exploitable patterns left by the higher level of compression performed by WebGraph.
Graph | Outdegree | Reference | Block | Intervals | Residuals |
---|---|---|---|---|---|
dblp-2011 | 489.6KiB (-18.0%) | 250.8KiB (-3.6%) | 634.9KiB (-0.6%) | 232.1KiB (-35.6%) | 4.3MiB (-12.4%) |
enwiki-2023 | 4.6MiB (-24.7%) | 2.7MiB (-10.6%) | 14.2MiB (+1.0%) | 1.0MiB (-43.7%) | 213.3MiB (-9.1%) |
eu-2015 | 947.8MiB (-26.4%) | 337.0MiB (-23.9%) | 1.4GiB (-5.7%) | 457.1MiB (-24.7%) | 5.4GiB (-6.6%) |
eu-2015-host | 5.8MiB (-8.0%) | 2.4MiB (-1.6%) | 8.7MiB (+2.7%) | 2.0MiB (-33.6%) | 115.5MiB (-7.1%) |
gsh-2015 | 749.7MiB (-24.7%) | 275.0MiB (-20.4%) | 661.6MiB (-6.2%) | 202.3MiB (-25.0%) | 3.7GiB (-7.9%) |
gsh-2015-host | 39.9MiB (-9.3%) | 18.8MiB (-5.8%) | 80.8MiB (+1.0%) | 15.5MiB (-31.1%) | 1.0GiB (-7.5%) |
hollywood-2011 | 1.9MiB (-23.2%) | 514.0KiB (-25.3%) | 11.8MiB (-0.7%) | 5.7MiB (-28.6%) | 105.5MiB (-4.3%) |
twitter-2010 | 26.2MiB (-18.7%) | 13.5MiB (-5.9%) | 97.8MiB (+9.6%) | 6.3MiB (-51.3%) | 2.0GiB (-8.8%) |
The proposed methodology has proven to be an excellent alternative for encoding the BV format representation of a web graph. The average improvement is by no means negligible: with an average saving of 10% in storage, the gains on very large web graphs can be significant. The proposed methodology is the first attempt to make this encoding technique work in combination with the BV format, so it can certainly be enhanced by improving some of its parts.
An important open problem is improving the storage of phases. At this time, for very sparse graphs, phases are so large that they partially cancel out the improvement in the size of the bit stream. While this does not matter for sequential decoding, better phase storage would further reduce the footprint of a random-access ANSBVGraph.
P.S.: the BV graphs used in the experiments are available on the LAW website.