webgraph-ans

Web graph compression using ANS over the BvGraph format.

Very large graphs, such as Web snapshots, large social networks, or software dependency graphs, can be analyzed using two approaches: distributing the computation across multiple computational units, or compressing the graph so that it fits in the memory of a single one. A popular instance of the latter approach is the WebGraph framework, which has recently been rethought and reimplemented in Rust to overcome some Java limitations, such as the restricted array size ($2^{31}$ elements).

This project aims to improve compression performance by replacing the instantaneous codes used by WebGraph with a recently proposed entropy coder, Asymmetric Numeral Systems (ANS), which, combined with other ideas such as symbol folding, has proven to be remarkably effective.

Since this crate is based on WebGraph, please refer to its documentation to learn more about the files and structs mentioned below.

Recompressing a BvGraph

Recompressing a BvGraph produces the <graph_name>.ans, <graph_name>.pointers, and <graph_name>.states files in the output directory.

  1. Compile the bvcomp binary:
$ cargo build --release --bin bvcomp
  2. Run bvcomp to recompress the graph:
$ ./target/release/bvcomp <path_to_graph> <output_dir> <new_graph_name> [<compression_params>]

For example:

$ ./target/release/bvcomp tests/data/cnr-2000/cnr-2000 ans-cnr-2000

This command recompresses the cnr-2000.graph file located in tests/data/cnr-2000/ and saves the output in the current directory under the name ans-cnr-2000.

Note: <compression_params> is optional. If omitted, default values are used.

Loading a BvGraphSeq with ANSBvGraphSeq

To load a BvGraphSeq, you only need the BASENAME.ans file. Then, you can use ANSBvGraphSeq:

 let graph = ANSBvGraphSeq::load("BASENAME")?;

Once loaded, the BvGraphSeq can be used to visit the graph sequentially, as in the sketch below.
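
A minimal sketch of such a sequential scan, assuming the lender-style iteration API of the underlying webgraph crate (the import path and the use of anyhow for error handling are assumptions; check the crate docs for exact names):

 use webgraph_ans::ANSBvGraphSeq; // hypothetical import path

 fn main() -> anyhow::Result<()> {
     let graph = ANSBvGraphSeq::load("BASENAME")?;
     // The iterator lends (node, successors) pairs in node order.
     let mut iter = graph.iter();
     let mut arcs = 0u64;
     while let Some((node, successors)) = iter.next() {
         for succ in successors {
             let _ = (node, succ); // process the arc (node, succ); here we only count
             arcs += 1;
         }
     }
     println!("visited {arcs} arcs");
     Ok(())
 }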

Loading a BvGraph with ANSBvGraph

To load a BvGraph, you need to supply the following files: BASENAME.ans, BASENAME.pointers, and BASENAME.states. Then, you can use ANSBvGraph:

 let graph = ANSBvGraph::load("BASENAME")?;

Once loaded, the BvGraph can be used to visit the graph, including via random access, as sketched below.
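
For example, a random-access lookup might look like this (a sketch assuming webgraph's random-access interface with outdegree() and successors(); the import path is hypothetical):

 use webgraph_ans::ANSBvGraph; // hypothetical import path

 fn main() -> anyhow::Result<()> {
     let graph = ANSBvGraph::load("BASENAME")?;
     // Jump directly to an arbitrary node's successor list.
     let node = 42;
     println!("outdegree of {node}: {}", graph.outdegree(node));
     for succ in graph.successors(node) {
         println!("{node} -> {succ}");
     }
     Ok(())
 }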

Results

The proposed methodology is evaluated by comparing its compression effectiveness and decoding speed against WebGraph on several real web graphs and social graphs available on the LAW website:

Graph Nodes Arcs
dblp-2011 986,324 6,707,236
enwiki-2023 6,625,370 165,206,104
eu-2015 1,070,557,254 91,792,261,600
eu-2015-host 11,264,052 386,915,963
gsh-2015 988,490,691 33,877,399,152
gsh-2015-host 68,660,142 1,802,747,600
hollywood-2011 2,180,759 228,985,632
twitter-2010 41,652,230 1,468,365,182

All experiments were performed on an Intel 12th-gen i7-12700KF CPU equipped with 64 GB of RAM and 37 MB of cache, while the code was compiled with level-3 optimizations and native CPU instructions.
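
With a standard Cargo setup, this roughly corresponds to the following invocation (a sketch: Cargo's release profile already defaults to opt-level 3, and the repository may pin different flags in its own configuration):

$ RUSTFLAGS="-C target-cpu=native" cargo build --release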

P.S.: the tables in the next sections are produced by percomponent_analysis.py and script.py, so you can measure the performance on your own machine by running those scripts.

Standard compression

The next tables show the comparison between WebGraph and the proposed methodology. In particular, given a graph:

  • BVGraph is the size of the graph compressed with WebGraph;
  • ANSBVGraph is the size of the file resulting from compressing the graph with the proposed methodology;
  • bit/link is the number of bits required to represent an arc of the graph;
  • phases is the cost of representing all the data needed to randomly access the graph;
  • random speed is the time needed to enumerate the successors of 10 million random nodes (a measurement sketch follows below).

The values in parentheses represent the comparison with WebGraph.
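
As a rough illustration, the random-speed measurement can be reproduced with a sketch along these lines (it assumes the random-access API shown earlier plus the rand crate, and is not the repository's own benchmark script):

 use rand::Rng;
 use std::time::Instant;
 use webgraph_ans::ANSBvGraph; // hypothetical import path

 fn main() -> anyhow::Result<()> {
     let graph = ANSBvGraph::load("BASENAME")?;
     let mut rng = rand::thread_rng();
     let mut arcs = 0u64;
     let start = Instant::now();
     for _ in 0..10_000_000 {
         // Pick a random node and enumerate its successors.
         let node = rng.gen_range(0..graph.num_nodes());
         for succ in graph.successors(node) {
             std::hint::black_box(succ); // prevent the loop from being optimized away
             arcs += 1;
         }
     }
     let ns = start.elapsed().as_nanos() as f64;
     println!("{:.1} ns/arc", ns / arcs as f64);
     Ok(())
 }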

The results show that the proposed approach improves compression by an average of 10% over the whole set of graphs. This is particularly striking for web graphs: WebGraph already uses specialized codes tailored to the distribution of residuals, and the compression records obtained by BVGraph with LLP ordering have stood for almost two decades; this is the first time a significant improvement in compression has been observed. Even social graphs, notoriously hard to compress due to the absence of the exploitable patterns present in web graphs, are compressed with better results.

Graph BVGraph ANSBVGraph bit/link phases total random speed (ns/arc)
dblp-2011 7.0MiB 6.2MiB 7.728 (-11%) 4.2MiB (+341%) 10.4MiB 58.0 (+10.9%)
enwiki-2023 266.8MiB 244.0MiB 12.389 (-9%) 30.4MiB (+263%) 274.4MiB 55.0 (+65.2%)
eu-2015 12.8GiB 11.4GiB 1.064 (-11%) 4.6GiB (+313%) 16.0GiB 22.0 (+22.7%)
eu-2015-host 151.0MiB 140.9MiB 3.056 (-7%) 49.6MiB (+299%) 190.6MiB 23.0 (+37.7%)
gsh-2015 8.1GiB 7.2GiB 1.823 (-11%) 4.2GiB (+330%) 11.4GiB 36.0 (+25.3%)
gsh-2015-host 1.3GiB 1.2GiB 5.764 (-7%) 306.9MiB (+289%) 1.5GiB 38.0 (+32.0%)
hollywood-2011 139.6MiB 132.0MiB 4.834 (-5%) 10.2MiB (+245%) 142.2MiB 32.0 (+64.9%)
twitter-2010 2.4GiB 2.2GiB 12.974 (-8%) 194.1MiB (+237%) 2.4GiB 45.0 (+61.0%)

The next table takes a more in-depth look at the compression results shown above. In particular, it reports the compression rates of the proposed methodology for each individual component of the BVGraph format, compared with those of WebGraph, highlighting the gains of ANS coding without accounting for the cost of storing the models.

Graph Blocks Intervals Residuals References Outdegree
dblp-2011 546.2KiB (-2.4%) 256.1KiB (-33.1%) 4.6MiB (-11.4%) 235.8KiB (+1.1%) 489.6KiB (-18.0%)
enwiki-2023 11.2MiB (-2.5%) 1.3MiB (-38.3%) 224.6MiB (-8.3%) 2.1MiB (-5.0%) 4.6MiB (-24.7%)
eu-2015 1.6GiB (-5.4%) 1.1GiB (-22.1%) 7.4GiB (-6.8%) 334.4MiB (-24.1%) 948.0MiB (-26.4%)
eu-2015-host 7.7MiB (+0.5%) 2.8MiB (-29.0%) 122.4MiB (-6.4%) 2.2MiB (-5.1%) 5.8MiB (-8.0%)
gsh-2015 814.7MiB (-4.4%) 480.9MiB (-18.1%) 4.9GiB (-8.0%) 305.0MiB (-21.8%) 749.7MiB (-24.7%)
gsh-2015-host 70.4MiB (+0.1%) 23.0MiB (-25.4%) 1.1GiB (-7.0%) 16.2MiB (-6.0%) 39.9MiB (-9.3%)
hollywood-2011 10.8MiB (-0.4%) 6.5MiB (-26.7%) 111.9MiB (-4.0%) 649.3KiB (-14.1%) 2.0MiB (-23.2%)
twitter-2010 79.3MiB (+6.8%) 7.9MiB (-45.7%) 2.1GiB (-8.3%) 12.0MiB (+1.5%) 26.2MiB (-18.7%)

High compression

The next table shows the encoding results for the high-compression graphs, with an average compression gain of around 10% in this scenario as well. As in the previous results, the sequential speed, which measures the time needed to enumerate the successors of all graph nodes visited sequentially, increases moderately, still allowing the user to retrieve the data at very high speed despite the additional layer of compression.

Graph hc-BVGraph hc-ANSBVGraph hc-bit/link Sequential Speed (ns/arc)
dblp-2011 6.8MiB 6.0MiB 7.456 (-12%) 21.5 (+94.2%)
enwiki-2023 259.8MiB 236.0MiB 11.985 (-9%) 18.7 (+74.9%)
eu-2015 9.6GiB 8.6GiB 0.801 (-11%) 3.4 (-13.0%)
eu-2015-host 144.5MiB 134.5MiB 2.916 (-7%) 6.4 (+81.6%)
gsh-2015 6.3GiB 5.6GiB 1.418 (-12%) 4.4 (+46.6%)
gsh-2015-host 1.2GiB 1.2GiB 5.492 (-7%) 9.9 (+72.5%)
hollywood-2011 133.3MiB 125.6MiB 4.602 (-6%) 10.5 (+71.6%)
twitter-2010 2.4GiB 2.2GiB 12.731 (-8%) 16.4 (+76.3%)

Although less pronounced, the per-component compression gains peak at around 45%, despite the fewer exploitable patterns left by WebGraph's more aggressive compression.

Graph Outdegree References Blocks Intervals Residuals
dblp-2011 489.6KiB (-18.0%) 250.8KiB (-3.6%) 634.9KiB (-0.6%) 232.1KiB (-35.6%) 4.3MiB (-12.4%)
enwiki-2023 4.6MiB (-24.7%) 2.7MiB (-10.6%) 14.2MiB (+1.0%) 1.0MiB (-43.7%) 213.3MiB (-9.1%)
eu-2015 947.8MiB (-26.4%) 337.0MiB (-23.9%) 1.4GiB (-5.7%) 457.1MiB (-24.7%) 5.4GiB (-6.6%)
eu-2015-host 5.8MiB (-8.0%) 2.4MiB (-1.6%) 8.7MiB (+2.7%) 2.0MiB (-33.6%) 115.5MiB (-7.1%)
gsh-2015 749.7MiB (-24.7%) 275.0MiB (-20.4%) 661.6MiB (-6.2%) 202.3MiB (-25.0%) 3.7GiB (-7.9%)
gsh-2015-host 39.9MiB (-9.3%) 18.8MiB (-5.8%) 80.8MiB (+1.0%) 15.5MiB (-31.1%) 1.0GiB (-7.5%)
hollywood-2011 1.9MiB (-23.2%) 514.0KiB (-25.3%) 11.8MiB (-0.7%) 5.7MiB (-28.6%) 105.5MiB (-4.3%)
twitter-2010 26.2MiB (-18.7%) 13.5MiB (-5.9%) 97.8MiB (+9.6%) 6.3MiB (-51.3%) 2.0GiB (-8.8%)

Conclusions

The proposed methodology has proven to be an excellent alternative for encoding the BV-format representation of a web graph. The improvement is by no means negligible: an average storage saving of 10%, which for very large web graphs can be substantial. Since this is the first attempt to combine this encoding technique with the BV format, it can certainly be enhanced by improving some of its parts.

An important open problem is improving the storage of phases. At present, for very sparse graphs the phases are so large that they partially cancel the improvement in the size of the bit stream. While this does not matter for sequential decoding, better phase storage would further reduce the footprint of a random-access ANSBvGraph.

P.S.: the BV graphs used in these experiments can be found on the LAW website.