[WIP] Minor compression ratio improvements for high compression modes on small data #2781
This PR bundles 3 modifications for high compression modes, targeting higher compression ratio on "small" (one-block) data. The benefits may extend beyond the first block, but they vanish quickly (and so does the cost).
1. Use the `greedy` algorithm to collect initial statistics for `btultra`. This step is globally positive, though benefits vary widely depending on the file. It comes at a speed cost though, of around -10% on one-block data (the speed cost rapidly becomes insignificant for large inputs, like `silesia.tar`). For this reason, this new method is not used for `btopt`.
2. Modify statistics collection for `btultra2`, which uses `btultra` as an initializer. The impact is small but consistently positive, and there is no significant cost.
3. `greedy -> btultra -> btultra2` initialization. The impact is globally positive, but not always: there are wide variations depending on file type and input size. For this reason, the proposed heuristic is to enable the new method for inputs >= 8 KB, as the impact is rather positive for larger blocks and rather negative for small ones. The speed cost, while not null, is not significant.

1. Using `greedy` to collect initial statistics for `btultra`
This step is globally beneficial, by non-negligible amounts. But there is also a sizable speed cost to it, so the trade-off is debatable. I think I'm rather favorable to it, since speed is less of a concern at `btultra` levels; this also explains why the technique is not extended to the faster variant `btopt`.

For benchmarking, I'm using multiple block sizes, in order to observe the impact of initialization. This is compared to the previous PR #2771, which updated the initialization of `btultra` with slightly altered starting values, producing better compression ratios on small data.
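For reference, this kind of "independent small blocks" measurement can be approximated with the public libzstd API, as in the sketch below. This is not the exact tool used for the numbers that follow; it simply cuts the input into fixed-size chunks and compresses each chunk as its own frame with the strategy forced to `ZSTD_btultra`. Frame headers add a small constant overhead compared to a true block-level benchmark, but ratio deltas between two library builds remain comparable.

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Returns the total compressed size of `src` cut into `blockSize` chunks,
 * each compressed as an independent frame (so each chunk is a "first block"). */
static size_t compressByBlocks(const void* src, size_t srcSize,
                               size_t blockSize, int level, ZSTD_strategy strat)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t const dstCap = ZSTD_compressBound(blockSize);
    void* const dst = malloc(dstCap);
    size_t total = 0;
    size_t pos;
    if (!cctx || !dst) exit(1);
    /* parameters are sticky: they apply to every frame compressed below */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_strategy, (int)strat);  /* e.g. ZSTD_btultra */
    for (pos = 0; pos < srcSize; pos += blockSize) {
        size_t const chunk = (srcSize - pos < blockSize) ? srcSize - pos : blockSize;
        size_t const cSize = ZSTD_compress2(cctx, dst, dstCap,
                                            (const char*)src + pos, chunk);
        if (ZSTD_isError(cSize)) exit(1);
        total += cSize;
    }
    free(dst);
    ZSTD_freeCCtx(cctx);
    return total;
}

/* example: compressByBlocks(buf, bufSize, 2*1024, 14, ZSTD_btultra); */
```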
2 KB blocks, btultra, level 14

[table: per-file compressed sizes and speeds, `dev` vs `ultra_init`]
At 2K block size, the impact is small but globally positive, saving ~100 KB.
But it comes at a cost, with speed down by -8.2% on average, due to the heavier initialization process using `greedy` instead of "blind" default values. There are very wide variations in this initialization cost depending on the exact file. I'm unable to explain some of the most positive changes, such as a speed improvement for `nci` (+3.3%), but it was tested multiple times to confirm the timing, and all measurements are consistently accurate within ~0.2%, so it's not a fluke.

The best impact is achieved on `x-ray` (ratio +0.66%), but there is one negative impact, on `mr`, which is pretty large (ratio -0.59%). I haven't yet been able to find a clear explanation for this case.

follow-up investigation: the large negative impact on `mr` seems strongly correlated to the difference in initialization of literals. Essentially, with the "default" literals initialization "from source", the resulting literals statistics are a lot flatter than when using the outcome of `greedy` (which seems to leave a lot of literals). For some reason, flatter literals statistics seem better for `mr`. It's unclear whether this is a general conclusion, or really a specificity of `mr`.

One possible explanation could be that `greedy` leaves too many literals, thus producing more squeezed statistics favoring the most frequent literals, making them cheaper and hence preferable. These frequent literals might be over-represented, possibly as a consequence of `greedy` being unable to find 3-byte matches and thus eliminate the corresponding literals. This impact is of course very variable depending on the file, and it's also possible that in some other cases, the literals probabilities produced by `greedy` are accurate enough, maybe because there aren't many 3-byte matches worth selecting, and trying to artificially distort these statistics, typically by flattening them, ends up hurting compression ratio (ex: `nci`).
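To make the "flat vs. squeezed" intuition concrete, here is a small standalone illustration. It is not zstd's actual cost model (which works on scaled integer statistics inside the optimal parser): it just shows that a literal priced at roughly -log2(probability) bits becomes very cheap when it dominates a squeezed histogram, and expensive when it is rare, while a flat histogram prices every literal identically.

```c
#include <stdio.h>
#include <math.h>

/* Approximate cost, in bits, of encoding `symbol` given a frequency table.
 * The +1 avoids log2(0) for symbols never seen (a crude minimum-cost floor). */
static double literalCostBits(const unsigned freq[256], unsigned symbol)
{
    unsigned long long total = 0;
    int i;
    for (i = 0; i < 256; i++) total += freq[i];
    return -log2((double)(freq[symbol] + 1) / (double)(total + 256));
}

int main(void)
{
    unsigned flat[256];
    unsigned squeezed[256] = {0};
    int i;
    for (i = 0; i < 256; i++) flat[i] = 1;       /* flat: every byte equally likely */
    squeezed['e'] = 1000; squeezed[' '] = 800;   /* squeezed: a few very frequent bytes */
    squeezed['a'] = 500;  squeezed['z'] = 2;
    printf("flat    : cost('e')=%.2f bits, cost('z')=%.2f bits\n",
           literalCostBits(flat, 'e'), literalCostBits(flat, 'z'));
    printf("squeezed: cost('e')=%.2f bits, cost('z')=%.2f bits\n",
           literalCostBits(squeezed, 'e'), literalCostBits(squeezed, 'z'));
    return 0;
}
```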
16 KB blocks, btultra, level 14

[table: per-file compressed sizes and speeds, `dev` vs `ultra_init`]
At 16 KB block size, there are a few improvements: the compression benefit increases to 130 KB, now representing a 0.18% ratio improvement, and the speed cost is slightly better, at -6.4%.

The best case remains `x-ray`, with a ratio at +0.65%, but there are many changes in the other files. `mozilla`, which used to be neutral at 2K, is now a big positive contributor at 16K. `mr`, which was negative at 2K, is now positive at 16K. A few files which were slightly positive at 2K become slightly negative at 16K. So it's all over the place. It's difficult to draw conclusions from these details, other than it's "globally more positive" at 16K. `nci` continues to defy expectations by running faster despite a heavier initialization process.
128 KB blocks, btultra, level 17

[table: per-file compressed sizes and speeds, `dev` vs `ultra_init`]
At 128 KB block size, the compression benefits increase even more, to > 200 KB, now representing a +0.33% ratio improvement, which is quite sizable for just a difference in statistics initialization.

The majority of the gains come from `mozilla`, `mr` and `x-ray`. `mr` is the new champion of compression ratio gains, at +0.87%. It's all the more remarkable as it was previously the only file in negative territory at 2K blocks. There are a few files which lose compression ratio (compared to default initialization), though never by large amounts. Once again, it's difficult to pinpoint a single clearly reproducible reason.

The average speed cost increases to -11.4%, with large differences depending on the exact file. The worst speed costs are on `mr` and `sao`, with a drop of almost -25%, while `nci` and `webster` end up running faster despite the much heavier initialization.

It might be possible to reduce the speed cost by making initialization faster, either by searching less or by compressing only a portion of the input, but this would likely have an impact on compression ratio, so it would have to be carefully monitored.

follow-up investigation: limiting statistics analysis to the first 64 KB of the first block noticeably reduces the speed cost (by more than half). However, it also loses part of the compression ratio gain in the process (between 1/4 and 1/3).
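A minimal sketch of that follow-up idea, for illustration only (the constant and function name are mine, not zstd's internal code): cap the range fed to the statistics-gathering pass so that its cost stops growing with block size.

```c
#include <stddef.h>

/* Illustrative only: cap the span analyzed by the greedy statistics pass.
 * 64 KB matches the follow-up experiment described above. */
#define STATS_ANALYSIS_CAP ((size_t)64 * 1024)

static size_t statsAnalysisSize(size_t blockSize)
{
    return (blockSize < STATS_ANALYSIS_CAP) ? blockSize : STATS_ANALYSIS_CAP;
}
```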
2. `btultra2`: modify statistics collection

`btultra2` runs the first block with `btultra`, then throws away the result but keeps the statistics, and re-starts compression of the first block using these baseline statistics.

The second modification changes the statistics transmission logic, rebuilding them from the `seqStore` instead of simply inheriting the ones which are naturally stored into the `optPtr` as a consequence of running `btultra` (a simplified sketch is shown at the end of this section).

This results in a small, yet consistently positive, compression ratio gain for `btultra2`, at essentially no cost. Note that, in this scenario, the first round of `btultra` still uses the default "blind" initialization (same as `dev`), not the `greedy` first pass mentioned in the earlier part.

The benefit though is pretty small, and barely worth mentioning (I'm not detailing benchmarks here), averaging 0.01% ratio improvement. What makes it desirable is that it comes for free, and is a necessary ingredient for stage 3 to work properly.
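Below is a simplified, self-contained sketch of what "rebuilding statistics from the `seqStore`" means, as described above. The types and field names are placeholders: zstd's real `seqStore_t`/`optState_t` bucket lengths and offsets into codes and use different layouts, so treat this as an illustration of the idea rather than the PR's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for zstd's internal structures (illustrative only). */
typedef struct {
    unsigned litLength;    /* literals emitted before this match */
    unsigned matchLength;  /* match length */
} SimpleSeq;

typedef struct {
    unsigned litFreq[256]; /* literal byte frequencies */
    unsigned llFreq[64];   /* literal-length frequencies (clamped) */
    unsigned mlFreq[64];   /* match-length frequencies (clamped) */
    /* offset-code frequencies would be rebuilt the same way; omitted here */
} SimpleStats;

static unsigned clamp63(unsigned v) { return v < 63 ? v : 63; }

/* Walk the sequences and literals produced by the first (discarded) btultra
 * pass and turn them into baseline statistics for the real btultra2 pass,
 * instead of inheriting whatever the optimal parser accumulated internally. */
static void rebuildStatsFromSequences(SimpleStats* stats,
                                      const SimpleSeq* seqs, size_t nbSeqs,
                                      const unsigned char* literals, size_t nbLiterals)
{
    size_t i;
    memset(stats, 0, sizeof(*stats));
    for (i = 0; i < nbLiterals; i++) stats->litFreq[literals[i]]++;
    for (i = 0; i < nbSeqs; i++) {
        stats->llFreq[clamp63(seqs[i].litLength)]++;
        stats->mlFreq[clamp63(seqs[i].matchLength)]++;
    }
}
```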
3. `btultra2`: initialize statistics from `btultra`, itself initialized by `greedy`

So that's more or less the intended end game of this exercise. Let's see if, with a first round using `greedy`, the statistics produced by `btultra` get better, resulting in a better final compression ratio for `btultra2`.
2K blocks, btultra2, level 19

[table: per-file results]

16K blocks, btultra2, level 19

[table: per-file results]

128K blocks, btultra2, level 19

[table: per-file results]
So that's interesting. For some reason, this technique doesn't work well for small blocks (2 KB). It does however tend to improve compression ratio for larger blocks, by a measurable margin, reaching an almost 300 KB gain at 128 KB blocks. That's a 0.44% ratio advantage, which is quite a lot for high compression modes.
For this reason, the logic proposed in this PR has a cutoff point at 8 KB block size: below that threshold, let's start with default statistics; above it, let's initialize with `greedy` (a hypothetical sketch of this decision is shown below).
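The sketch, under the assumption that the decision can be expressed as a simple size test (the constant and function name are mine, not zstd's; the real check sits inside the `btultra2` code path):

```c
#include <stddef.h>

/* Hypothetical: below 8 KB the extra greedy pass tends to hurt ratio,
 * so keep the default "blind" statistics; at 8 KB and above, seed the
 * statistics with a greedy pass before the btultra -> btultra2 rounds. */
#define GREEDY_INIT_MIN_SRC_SIZE ((size_t)8 * 1024)

static int shouldSeedStatsWithGreedy(size_t srcSize)
{
    return srcSize >= GREEDY_INIT_MIN_SRC_SIZE;
}
```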
Let's be clear though that this is not a universal gain: it depends on the file. In this sample set, the largest gain is achieved by `x-ray` at 128 KB, with a staggering +3.7% compression ratio gain, which is almost oversized for a simple alteration of initial statistics. It's followed by `samba` (+0.52%) and `mozilla` (+0.36%), which are pretty respectable.

On the flip side, `dickens` loses -0.14% and `reymont` -0.12%. So this is not universal.

Results at 16 KB are much more tame, with the `x-ray` gain reduced to +0.07%, hence barely relevant, while the `dickens` loss increases to -0.25%. The total of the 16 KB changes still achieves a compression ratio benefit, but only by +0.04%, which is limited.
The oversized `x-ray` gains deserve an explanation. Sadly, I'm reduced to guesses right now. It likely receives a correct hint from the initial statistics built with `greedy`, making it capable of finding a better global path. A classical example would be a path which favors the usage of repcodes over another path consisting of larger yet more costly matches. This shorter global path would otherwise be rejected due to wrong cost estimation when using the more standard (`dev`) initialization, and the difference adds up over the full block.

However, with such an explanation, I would have expected some good results at 16 KB block size too. Why it needs a larger block size to really show up is unclear.
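As a rough illustration of why cost estimation can flip that choice (this only approximates the general shape of offset coding, not zstd's exact scheme): the bit cost of a regular match grows roughly with log2(offset), while a repcode reuses one of the three most recent offsets and maps to one of the cheapest codes, so with good statistics a chain of short repcode matches can win over a single longer match at a distant offset.

```c
#include <stdio.h>
#include <math.h>

/* Approximate cost of signalling a match offset: ~log2(offset) extra bits
 * plus a small code. Illustrative numbers only. */
static double approxOffsetCostBits(unsigned offset)
{
    return log2((double)offset) + 1.0;
}

int main(void)
{
    double const repcodeCostBits = 2.0;  /* repcodes map to the smallest codes (illustrative) */
    printf("match at offset 1 MB : ~%.1f bits just for the offset\n",
           approxOffsetCostBits(1u << 20));
    printf("repcode match        : ~%.1f bits\n", repcodeCostBits);
    return 0;
}
```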
follow-up investigation: it seems `x-ray` is "a medical picture of child's hand stored in 12-bit gray scaled image"; maybe it needs a certain size for multiple image lines to be present, hence for repcodes to start making an impact (?). This is highly speculative though: the image is "large" (8 MB) and compresses poorly, so it's difficult to imagine +3.7% compression ratio gains just from repcodes targeting the previous line.

(sidenote: any speed cost due to `greedy` initialization is pretty minimal and irrelevant at this compression level)