Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

How do I get --patch-from to work when using zstd as a library? #2835

Closed
wspl opened this issue Oct 28, 2021 · 21 comments
Closed

How do I get --patch-from to work when using zstd as a library? #2835

wspl opened this issue Oct 28, 2021 · 21 comments
Labels

Comments

@wspl
Copy link

wspl commented Oct 28, 2021

I found out that the implementation of the --patch-from is placed in the "programs" folder and it is not distributed with the static or dynamic libraries. This means that if I don't use the cli program, I wouldn't be able to use the --patch-from function directly.

But I don't know much about how zstd works, how should I configure the compress or decompress parameters to make the --patch-from works in my own program?

@Cyan4973
Copy link
Contributor

The core library API equivalent to --patch-from are ZSTD_compress_usingDict() and ZSTD_decompress_usingDict().
--patch-from additionally modifies some advanced parameters, like windowLog and enables the ldm mode.
In short, whatever --patch-from does can be done with only the library, but --patch-from packages in a single command a number of advanced parameters that must be done manually when using the library directly.

@wspl wspl closed this as completed Oct 28, 2021
@wspl wspl reopened this Oct 28, 2021
@wspl
Copy link
Author

wspl commented Oct 28, 2021

The core library API equivalent to --patch-from are ZSTD_compress_usingDict() and ZSTD_decompress_usingDict(). --patch-from additionally modifies some advanced parameters, like windowLog and enables the ldm mode. In short, whatever --patch-from does can be done with only the library, but --patch-from packages in a single command a number of advanced parameters that must be done manually when using the library directly.

@Cyan4973 I use a zstd Rust Wrapper (https://github.com/gyscos/zstd-rs) that uses ZSTD_compress_usingDict for file compression internally, and when I compressed the files through this library, I found that the resulting diff file was much larger than the one produced by --patch-from of cli.

Also, compressing a small file with zstd -D produces a much larger file than --patch-from.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Oct 28, 2021

As mentioned, --patch-from is a combination of dictionary mode + a number of well selected parameters.
If the rust wrapper gives you access to the full stable API, then it should be possible to reproduce the same thing.
But that requires understanding advanced parameters.

If what you are asking for is a dedicated API to replicate what is achieved with command --patch-from at library level, without the need to understand parameters, then be aware that it will take quite some time before it gets accessible through the rust wrapper.

@wspl
Copy link
Author

wspl commented Oct 29, 2021

@Cyan4973 My guess is the same. So I found out by reading the code that the windowLog parameter might need to be set, so I set a value as calculated in the source code, and it didn't work. This is what I found by reading the pull request for --patch-from (#1940), I don't understand if I need to set some other parameters? Can you improve some hint for this piece?

zstd is great for delta updates, and I want my application to benefit from its extremely fast decompression speed and small delta package size.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Nov 1, 2021

The 2 main things I remember is to set the windowLog to a value larger than the content being compressed,
and to enable the --long mode (ZSTD_c_enableLongDistanceMatching in API equivalent).
There are then other differences, but it's more complex, and less clear cut.
The 2 above parameters matter the most.

@wspl
Copy link
Author

wspl commented Nov 2, 2021

@Cyan4973 For a 123,983kb file, I use windowLog=27 and set ZSTD_c_enableLongDistanceMatching to 1. Set the old file (the one that needs to be patched) to cctx via ZSTD_CCtx_refPrefix (ZSTD_CCtx_loadDictionary also has no effect). Then use ZSTD_compressCCtx for compression. Nothing works, the resulting file is still 55M in size.

The size (55M) is the same as if it had been compressed directly without the dictionary, which I suspect is probably because the dictionary didn't take effect at all. And ZSTD_compress_usingDict is the same without any effect.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Nov 2, 2021

ZSTD_CCtx_refPrefix() and ZSTD_CCtx_loadDictionary() are part of the "advanced compression API", applying sticky parameters into target ZSTD_CCtx* object.

Sticky parameters are compatible with ZSTD_compressStream() and ZSTD_compress2().

In contrast, ZSTD_compressCCtx() resets sticky parameters, in order to remain consistent with legacy behavior, which predates advanced sticky parameters.

Try with ZSTD_compress2().

@wspl
Copy link
Author

wspl commented Nov 2, 2021

@Cyan4973 Now it works properly! Thank you for your help!

@wspl wspl closed this as completed Nov 2, 2021
@reyqn
Copy link

reyqn commented Mar 6, 2023

In my case, the dictionary didn't seem to have any effect when using ZSTD_CCtx_loadDictionary() but using ZSTD_CCtx_refPrefix() worked.

@Cyan4973
Copy link
Contributor

In my case, the dictionary didn't seem to have any effect when using ZSTD_CCtx_loadDictionary() but using ZSTD_CCtx_refPrefix() worked.

This is weird. Do you have a reproduction case for this issue with ZSTD_CCtx_loadDictionary() ?

@reyqn
Copy link

reyqn commented Mar 10, 2023

This is weird. Do you have a reproduction case for this issue with ZSTD_CCtx_loadDictionary() ?

I'm sorry, I didn't think it was worth looking into, since I don't think there is an issue with zstd, just with how I'm using it without really knowing how the internals work.
I should have mentioned I'm using this library.
The difference I saw is that ZSTD_CCtx_loadDictionary() is setting localDict while ZSTD_CCtx_refPrefix() is setting prefixDict, but there could be an issue in ZstdSharp.

@Cyan4973
Copy link
Contributor

Yeah, it's probably better to forward this issue to ZstdSharp maintainer.

@reyqn
Copy link

reyqn commented Mar 13, 2023

In the end, I removed my dependency to the library, and I'm just calling the dll functions via C# interop. I still have the issue though, so maybe there really is something to it?
Basically if I call

ZSTD_createCCtx();
ZSTD_CCtx_setParameter(_internalDict, ZSTD_cParameter.ZSTD_c_windowLog, highByte);
ZSTD_CCtx_setParameter(_internalDict, ZSTD_cParameter.ZSTD_c_enableLongDistanceMatching, 1);
ZSTD_CCtx_refPrefix(_internalDict, dict, (nuint)arr.Length);
ZSTD_compress2(_internalDict, destPtr, (nuint)dest.Length, srcPtr, (nuint)src.Length)

it works, but if I call ZSTD_CCtx_loadDictionary() instead of ZSTD_CCtx_refPrefix() the dictionary doesn't seem to be applied.

I attached my project files in zsharptd.zip (it's in C# though, and you have to change the hardcoded paths and run with --patch in order to try it out)

@Cyan4973
Copy link
Contributor

What's the size of the dictionary content ?

@reyqn
Copy link

reyqn commented Mar 13, 2023

I think it was something like 700MB. The patch is 160MB with refPrefix and 700MB with loadDictionary.

Edit: I can check the actual size if necessary (to calculate windowLog maybe)

@Cyan4973
Copy link
Contributor

Cyan4973 commented Mar 13, 2023

OK, so I think we have a reason here.

To begin with, I'm even surprised it accepted to load such a dictionary, this is way beyond its default memory usage limits. So maybe there's something to look at here.
(Or maybe it returns an error and that's not checked so it's invisible?)

Second, the LDM mode is not compatible with CDict, and by extension, it's not compatible with loadDictionary.
So, even if it was loaded, the dictionary wouldn't benefit from LDM, which tends to be important for such a huge amount of data to search into.

Now, there are still differences between "it's less effective", "it's barely effective", "it's not effective at all".
If your scenario is implying the 3rd option, meaning that compression is the same, with or without dictionary loadDictionary(), then I expect the scenario to be the one mentioned above : the dictionary is not loaded at all, it's way too big and blows memory limits. And the API provides a return error code to tell it, but it's probably not checked.

But then, even if the dictionary is loaded, by increasing memory limit typically, it's likely going to be less effective than refPrefix() anyway, which is the proper pattern for such a use case.

@terrelln
Copy link
Contributor

@Cyan4973 we don't have a dictionary size limit on either the compressor or decompressor side.

But yeah, LDM doesn't work with loadDictionary(), and the normal match finders won't be able to effectively index the 700MB dictionary, unless you drastically increased the hashLog and chainLog. So I expect the benefit of a dictionary to shrink drastically because LDM is disabled.

@reyqn
Copy link

reyqn commented Mar 13, 2023

So I tried with a 31913kB dictionary, (the target file being 34335kB - I just downloaded a zip of wordpress 6.0 and 6.1.1 source code from their github) using ZSTD_CCtx_refPrefix() without ldm gives a 9264kB delta, with ldm a 9300kB delta, and ZSTD_CCtx_loadDictionary() instead gives a 30768kB delta.

I'm not sure ldm alone being disabled explains everything, I remember disabling it from the 700MB file I used first, and the delta was about 400MB, so obviously ldm is really useful for this usecase.

I'm not sure there is even a problem here, I just wanted to mention for people who followed this issue as kind of a tutorial on how to use --patch-from like functionality with the library rather than the executable that ZSTD_CCtx_refPrefix() seems to work while ZSTD_CCtx_loadDictionary() doesn't.

If you think this is worth investigating I have no trouble providing any information you want, but I wouldn't want you spending time on this because you think I'm stuck or anything.

@Cyan4973
Copy link
Contributor

Good point @reyqn , we understand that ZSTD_CCtx_refPrefix() is a good enough answer to unblock your use case.

At this point, we are trying to understand if there is an API antipattern with ZSTD_CCtx_loadDictionary() which would be either not well documented, or actually buggy in some way. The idea is to avoid future users to go into the same troubles as you.

So, my only remaining question would be :
what happens when you compress the aforementioned target file directly, without ZSTD_CCtx_loadDictionary() ?
Does it compress the same as with ZSTD_CCtx_loadDictionary(), as in exactly identical, or roughly the same (there's a difference) ?

This result would be helpful to guide an investigation.

@reyqn
Copy link

reyqn commented Mar 13, 2023

It's roughly the same. File hashes are different. One is 31,506,058 bytes, the other is 31,506,225 bytes. The file hashes are reproducible (each time I run the C# program calling the library with ZSTD_CCtx_loadDictionary() they give the same hash, and without, the same different hash)

@Cyan4973
Copy link
Contributor

OK, so it's roughly inefficient.

That seems a little too inefficient, given the results with refPrefix() without ldm,
so there might be something to investigate there.

Thanks for the feedback @reyqn .

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants