EstimateSketchUDF isn't processing BINARY fields correctly #50
Comments
This is a bug in lots of places, as noted in the PR. I'll start working to fix them all (that might take a bit, but I'll try to get it done soon). Wrapping the larger buffer will be harmless in most cases. The specific issue here is that an older version of the specialized SingleItemSketch didn't include a flag indicating that it followed the slightly different single-item format. To handle that, the only workaround that could be identified was to use the buffer length. In general, the images shouldn't need to rely on buffer size when being read (other than ensuring it's not too small). Checks against size for the empty-sketch scenario should be only a performance optimization, and reading should still work properly if an actual empty sketch gets parsed later. But we should fix this properly anyway.
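To illustrate why inferring the format from the buffer length is fragile, here is a minimal sketch. The one-byte tags and the size thresholds are hypothetical and are not the real DataSketches preamble layout; the point is only that a reader keyed on a header field tolerates trailing padding, while a reader keyed on an exact length does not.

```java
final class HeaderVersusLength {
    // Hypothetical one-byte tags; NOT the real DataSketches preamble layout.
    static final byte EMPTY = 1;
    static final byte SINGLE_ITEM = 2;

    /** Fragile: infers the kind of image from the buffer size alone. */
    static String kindByLength(byte[] buf) {
        if (buf.length == 8) return "empty";
        if (buf.length == 16) return "single-item";
        return "general";
    }

    /** More robust: reads the tag and only checks that the buffer is not too small. */
    static String kindByHeader(byte[] buf) {
        if (buf.length < 8) throw new IllegalArgumentException("buffer too small");
        switch (buf[0]) {
            case EMPTY:
                return "empty";
            case SINGLE_ITEM:
                if (buf.length < 16) throw new IllegalArgumentException("buffer too small");
                return "single-item";
            default:
                return "general";
        }
    }

    public static void main(String[] args) {
        byte[] padded = new byte[24];   // 16 valid bytes plus 8 bytes of zero padding
        padded[0] = SINGLE_ITEM;
        System.out.println(kindByLength(padded));  // prints "general"
        System.out.println(kindByHeader(padded));  // prints "single-item"
    }
}
```

An image written without any such tag, like the older single-item format described above, leaves the reader with nothing but the length to go on, which is why padded input breaks it.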
My PR replaces all calls to getBytes() in the repo (aside from in the new code to wrap BytesWritable in a Memory). Had to copy bytes in a few places, but I was mostly able to just wrap using getLength() bytes. Just need it reviewed now.
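The difference between wrapping and copying can be sketched with the standard library alone. This is not the code from the PR: java.nio.ByteBuffer stands in for Memory here, and the backing array and valid length are passed in directly, where with a BytesWritable they would come from getBytes() and getLength().

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class WrapVersusCopy {
    /** Zero-copy view over the first validLength bytes of a possibly padded array. */
    static ByteBuffer view(byte[] backing, int validLength) {
        return ByteBuffer.wrap(backing, 0, validLength).slice();
    }

    /** Defensive copy, for callers that need an exactly sized byte[]. */
    static byte[] copy(byte[] backing, int validLength) {
        return Arrays.copyOf(backing, validLength);
    }

    public static void main(String[] args) {
        byte[] backing = new byte[24];  // padded storage
        int validLength = 16;           // bytes that actually belong to the value
        System.out.println(view(backing, validLength).remaining()); // prints 16
        System.out.println(copy(backing, validLength).length);      // prints 16
    }
}
```

The view avoids an allocation but shares storage with the source, so it is only safe while the source is not reused; the copy costs an allocation but is independent.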
@leerho Please revert that change and apply the PR that completely addresses the issue.
Done.
On Mon, Jun 1, 2020 at 8:52 PM Jon Malkin wrote:
> @leerho Please revert that change and apply the PR that completely addresses the issue.
Thanks again @koke for finding this. We did add your unit test, while fixing the underlying issue everywhere in the repo so that future changes in other sketches won't risk triggering this problem again.
Is there a plan to do a new release with this? I just hit a new issue with the |
I've been scratching my head for a while with this one, but I was writing some unit tests where I created a theta sketch with a single item, and the estimate function was returning an estimate of minus 800M.
This seems easily reproducible for me (using Hive version 1.1.0-cdh5.16.1):
For some reason there were some extra bytes in the BytesWritable storage, which was breaking the calculations. What was supposed to be a 16-byte SingleItemSketch got an extra 8 bytes (zero-filled), making datasketches think it was a completely different thing.
A unit test of what I was seeing coming from Hive:
Adding this wrapper around EstimateSketchUDF fixes the problem:
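The original snippets are not preserved in this text. As a rough, stdlib-only illustration of the idea (not the author's code), the sketch below uses a hypothetical GrowableBytes class as a stand-in for BytesWritable, with an assumed 1.5x growth policy that happens to turn a 16-byte payload into a 24-byte backing array. The withTrimmedInput helper and the delegate function are also hypothetical; the delegate stands in for whatever consumes the sketch bytes.

```java
import java.util.Arrays;
import java.util.function.ToIntFunction;

final class PaddedBufferDemo {
    /** Hypothetical stand-in for a growable container such as BytesWritable. */
    static final class GrowableBytes {
        private byte[] backing = new byte[0];
        private int length;

        void set(byte[] data) {
            if (data.length > backing.length) {
                // Assumed growth policy: over-allocate to 1.5x the requested size.
                backing = new byte[data.length * 3 / 2];
            }
            System.arraycopy(data, 0, backing, 0, data.length);
            length = data.length;
        }

        byte[] getBytes() { return backing; }  // whole backing array, may include padding
        int getLength() { return length; }     // number of valid bytes
    }

    /** Trims the input to its valid bytes, then hands it to the wrapped consumer. */
    static int withTrimmedInput(GrowableBytes in, ToIntFunction<byte[]> delegate) {
        return delegate.applyAsInt(Arrays.copyOf(in.getBytes(), in.getLength()));
    }

    public static void main(String[] args) {
        GrowableBytes value = new GrowableBytes();
        value.set(new byte[16]);                       // a 16-byte payload
        System.out.println(value.getBytes().length);   // prints 24
        System.out.println(withTrimmedInput(value, bytes -> bytes.length)); // prints 16
    }
}
```

Passing getBytes() straight through would hand the consumer all 24 bytes, which matches the symptom described above; trimming to getLength() first restores the 16-byte input.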