Question about file size incorporation #14
Hi, and thanks for your interest in the project! The short answer is that both approaches are practical, and the project's origin story explains why it is the way it is more than any technical analysis does. It has been over ten years, but I recall reviewing lists of large files I needed to sync and immediately gravitating to how most of the differentiation could be done by looking at just the size. But there were some dupes based on size alone, so the sampling+hash was added to differentiate same-sized files. Do I love the choice... meh. I considered a V2 (see #6) that would incorporate the size into the hash, and I prefer the uniformity of that approach from an aesthetic point of view.

That said, I've thought about this and consider file size to be essentially a valuable "hash" that the file system is already keeping track of, and scrambling it into the rest of the hash could reasonably be less effective than keeping it separate. A rigorous technical analysis of which yields better performance would be pretty tough and probably application-dependent. Over the years there has been some decent use of the project and some ports to other languages, so I think the major hurdle of "is this useful?" has been passed. By default, folks accept the "good enough" nature of this scheme, and since they can hash more data by adjusting the sample size, the size-injection method is probably a minor factor.
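To make that concrete, here's a rough Go sketch of that size-first workflow: group candidate files by size, and only fall back to the sampled hash to tell apart files whose sizes collide. The grouping code is illustrative, and it assumes the library's SumFile helper:

```go
package main

import (
	"fmt"
	"os"

	"github.com/kalafut/imohash"
)

func main() {
	// Group the candidate files (passed as arguments) by size.
	bySize := make(map[int64][]string)
	for _, p := range os.Args[1:] {
		fi, err := os.Stat(p)
		if err != nil {
			continue
		}
		bySize[fi.Size()] = append(bySize[fi.Size()], p)
	}

	// Size alone differentiates most files; only same-sized
	// groups need the sampled hash to tell them apart.
	for size, group := range bySize {
		if len(group) == 1 {
			continue
		}
		for _, p := range group {
			sum, err := imohash.SumFile(p)
			if err != nil {
				continue
			}
			fmt.Printf("size %d  %x  %s\n", size, sum, p)
		}
	}
}
```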
Oh, I completely forgot that one of the V1.1 features was in fact to incorporate size into the hash (everything in that update would be optional). I've not had a chance to progress that work, so it slipped my mind.
This hash is a very interesting idea, and I'd been thinking about something like it for years before running into imohash just now.
I like the file size being included in the hash, but I'm wondering why it is being incorporated the way it is. My thinking is to just append the file size to the end of the hash input, incorporating it that way.
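To illustrate what I mean, here's a rough Go sketch; the sampling and the choice of murmur3 package are just assumptions for the example, not the library's actual code:

```go
package main

import (
	"encoding/binary"
	"fmt"

	"github.com/spaolacci/murmur3"
)

// Append the file size to the hash input instead of
// overwriting the output bytes with it.
func sizeInInputHash(sample []byte, fileSize uint64) []byte {
	var sz [8]byte
	binary.LittleEndian.PutUint64(sz[:], fileSize)

	h := murmur3.New128()
	h.Write(sample) // sampled file bytes
	h.Write(sz[:])  // file size appended to the input
	return h.Sum(nil)
}

func main() {
	fmt.Printf("%x\n", sizeInInputHash([]byte("sampled bytes"), 1<<20))
}
```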
The readme states:
If two files happened by happenstance to have identical data in the sampled areas but different lengths, then yes, partially clobbering the hash bits with the size would change up the hash. But including the size in the input to the hash would also do that, per the avalanche effect of most good hash functions (which certainly includes MurmurHash3, even though it isn't a cryptographic hash).
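A quick way to see the avalanche effect in action (again, the murmur3 package choice is just an assumption for the demo):

```go
package main

import (
	"fmt"
	"math/bits"

	"github.com/spaolacci/murmur3"
)

func main() {
	// Identical "sampled" data, differing only in a trailing size byte.
	a1, a2 := murmur3.Sum128([]byte("identical sampled data\x01"))
	b1, b2 := murmur3.Sum128([]byte("identical sampled data\x02"))

	// Count how many of the 128 digest bits flip; expect roughly half.
	diff := bits.OnesCount64(a1^b1) + bits.OnesCount64(a2^b2)
	fmt.Printf("%d of 128 bits differ\n", diff)
}
```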
And what's more, including the size in the input would do this consistently, whereas clobbering the hash bits would take away more and more of them for larger and larger files, making theoretical collisions more likely for larger files. (Not actually likely, since the avalanche effect is still in play for the hash portion, and so even very similar inputs don't produce "neighboring" results, but still more likely.)
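For example, assuming the size is spliced in as a varint (an assumption about the encoding on my part), the number of digest bytes it consumes grows with the file size:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Bigger files consume more of the 16-byte digest for the
	// size varint, leaving fewer bits of actual hash output.
	for _, size := range []uint64{100, 1 << 20, 1 << 30, 1 << 40, 1 << 50} {
		var buf [binary.MaxVarintLen64]byte
		n := binary.PutUvarint(buf[:], size)
		fmt.Printf("size %15d: varint takes %d bytes, %3d hash bits remain\n",
			size, n, (16-n)*8)
	}
}
```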
In general, there's something about taking a thoughtfully constructed hash algorithm with its good properties and then writing over parts of its output that makes me wonder if it's not undoing those properties, when the alternative of just varying the input is easy to consider and would retain as many of those properties as possible. The two downsides I see are the inability to extract the file size from the hash by knowing where to look (at the risk of false positives), and no longer knowing for sure that a certain set of bits is not the hash.
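To sketch the extraction idea (the varint layout here is my assumption about the format): with the size spliced over the leading digest bytes, it can be read straight back out, whereas with the size folded into the hash input, no bits of the digest correspond to the size at all.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	var digest [16]byte
	for i := range digest {
		digest[i] = byte(0x90 + i) // stand-in hash bytes
	}
	binary.PutUvarint(digest[:], 1<<30) // splice a 1 GiB size over the front

	// The varint terminates itself, so the size reads back out directly.
	size, n := binary.Uvarint(digest[:])
	fmt.Printf("recovered size %d from the first %d bytes\n", size, n)
}
```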
So, this question is likely mostly academic, but I'm curious, since I did not see this alternative or its tradeoffs addressed in the documentation or in other issues.