Skip to content

Commit

Permalink
readme updated
Browse files Browse the repository at this point in the history
  • Loading branch information
ciminilorenzo committed Nov 13, 2024
1 parent bc02bf2 commit 2a47597
Showing 1 changed file with 10 additions and 28 deletions.
38 changes: 10 additions & 28 deletions README.MD
Original file line number Diff line number Diff line change
Expand Up @@ -11,45 +11,27 @@ a framework that, beyond offering various tools that can be used to operate un s
of web graphs (locality and similarity), as well as some other ideas tailored for the context, to compress them in an efficient format called BVGraph.

This project aims to improve the records of the mentioned frameworks, which have been standing for almost
two decades, by adding another layer of compression by means of
[Asymmetrical Numeral Systems](https://en.wikipedia.org/wiki/Asymmetric_numeral_systems) (ANS) over the first layer of compression performed by WebGraph.

two decades, by switching from instantaneous codes to [Asymmetrical Numeral Systems](https://en.wikipedia.org/wiki/Asymmetric_numeral_systems) (ANS) when compressing
the graph in the BvGraph format.

### Compressor
Since the BVGraph format is composed of 9 models (Outdegree, ReferenceOffset, BlockCount, Blocks, IntervalCount,
IntervalStart, IntervalLen, FirstResidual, Residual), the model used by the compressor is going to be switched on the
fly among the nine built for each specific component of the compression format. Moreover, to overcome the problem of
dealing with enormous alphabets (we set as maximum symbol $2^{48} - 1$), the symbol folding technique introduced
by Moffat and Petri in their [work](https://dl.acm.org/doi/10.1145/3397175) is
implemented.

In general, the coder uses a 32-bit state and 16-bit renormalization step with some other constraints discussed below.

The compressor's interval for each model is defined as:
```math
I = [M * K, M * K * B)
```
where:
1. $M$ is the sum (power of two) of all approximated symbols frequencies for a specific model.
2. $K = 2^{16} - M$
3. $B = 2^{16}$

All this is done to guarantee that the shared interval between all models is $[2^{16}, 2^{32})$, a guarantee needed to make the compressor
able to switch models on the fly.

PS: This implementation assumes that the most frequent symbols are the smallest positive integers.
WIP

### Binaries
The binary that can be used to recompress a .graph is bvcomp:
```
$ cargo build --release --bin bvcomp
$ ./target/release/bvcomp <path_to_graph> <output_dir> <new_graph_name> <compression_params>
$ ./target/release/bvcomp <path_to_graph> <output_dir> <new_graph_name> [<compression_params>]
```
For example:
```
$ ./target/release/bvcomp tests/data/cnr-2000/cnr-2000 ans-cnr-2000
```
recompresses with standard compression parameters the cnr-2000.graph file in the tests/data/cnr-2000/ directory and save
the new compressed graph in the current directory with the name ans-cnr-2000.graph.
This command recompresses the cnr-2000.graph file located in the tests/data/cnr-2000/ directory using the default compression parameters and saves the new compressed graph in the current directory as ans-cnr-2000.graph.

Note: <compression_params> is optional. When not specified, default compression values are utilized.

### Benches
WIP

PS: graphs can be found [here](http://law.di.unimi.it/datasets.php).

0 comments on commit 2a47597

Please # to comment.