New hashmap implementation #5999
Conversation
Just to clarify, I didn't do any work on the new hash map implementation. That was all @andrewrk. I just made some graphs.
There are still a few errors (for example, I think the translate-c code depends on the ordering of the current hashmap somewhere), which I'll solve if there is interest in this.
Ran my basic insertion benchmark out of curiosity and got some interesting results. (note: this only goes up to 10 million insertions) I keep coming back to this comment: ziglang/gotta-go-fast#2 (comment)
I'm a total beginner to this type of stuff so I'm more than willing to believe that I'm not taking the proper things into account. My graphs are very strange indeed.
I haven't read through your implementation in detail, but from my knowledge of the swiss tables implementation they make good use of SIMD to speed up lookups. What advantages does your design have over such an approach? Or could that be a potential future improvement?
@ifreund It is my understanding that google tries to achieve extremely high load factors in its hashmaps, because memory usage is literally money to them. I would have to watch their videos again so don't quote me on that, but IIRC they go as high as 97.5% occupancy to reduce memory waste. This means that even with a very good hash function that provides good distribution, they likely have long collision chains when their maps are full, because the set of possible input keys is almost always larger than the set of possible hash results.

The performance of flat hashmaps (like this one, or the standard one) relies heavily on reducing collision probability by using more memory than would be minimally needed to store all their elements. The more additional memory you reserve, the more slots will be unused and thus break collision chains. This means that when looking for a slot (to either do a lookup or a modification) you can statistically stop traversing your data earlier. When you allow very high load factors, the number of free slots is reduced to the bare minimum that you find acceptable (this really is just a tradeoff) and collision chains get longer and longer. This implies that when looking for a slot, you will likely have to probe more already-used slots before finding the right one.

In both google's and this implementation, probing is mainly done by looking at an array of 8-bit metadata. When collision chains get longer, testing multiple metadata bytes at once with SIMD becomes interesting. However, probing in packs of, say, 16 metadata bytes implies some complexification of the design and implementation (which is best explained in their conferences, so I will not go on only by my memory of it :) that IMO is not needed to achieve good performance with less extreme load factors.

So to conclude, it might be an improvement but it also might not. 😄
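For a concrete picture of what "testing multiple metadata at once with SIMD" means, here is a tiny Zig sketch (illustrative only, assuming a recent Zig compiler; the function name and 16-byte group size are hypothetical, not this PR's or Abseil's actual code):

```zig
const std = @import("std");

/// Returns true if any of the 16 metadata bytes in the group stores the probed
/// fingerprint, letting a single vector comparison stand in for 16 scalar tests.
fn groupHasMatch(metadata: *const [16]u8, fingerprint: u8) bool {
    const Group = @Vector(16, u8);
    const group: Group = metadata.*; // arrays coerce to vectors of the same length
    const needle: Group = @splat(fingerprint);
    return @reduce(.Or, group == needle);
}

test "group probing sketch" {
    var bytes = [_]u8{0} ** 16;
    bytes[9] = 0x2a;
    try std.testing.expect(groupHasMatch(&bytes, 0x2a));
    try std.testing.expect(!groupHasMatch(&bytes, 0x07));
}
```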
they were pretty graphs tho 😉
I agree with this conclusion. However, let's try to make it really clear what the differences and usage patterns would be. I think it would be fair to say that the API of the master branch hash map is strictly more convenient. Having the ArrayList of entries available for direct access is pretty nice in terms of ergonomics, and the fact that order is preserved (and independent of the hash function) is a really nice property. So it would seem the benefits of this implementation are trading some of this API convenience for better resource usage (memory & CPU). This being the case, I'd like to make sure it actually is significantly better than status quo hash maps before adding it as another option. Based on some of the data here, it's not entirely clear, right? I think before merging this it would be worth it to understand the benefits in a more conclusive way, so that the doc comments in the std lib can confidently explain when to choose one over the other.
My guess is it has to do with the underlying memory allocator returning memory that it already has available vs requesting extra memory from the OS. E.g. imagine appending to an array list, and on the 9th element appended, it doubles its capacity.
Well, to me the results are pretty clear in that this PR has better (up to 2.5-3x) performance on all operations except iteration. The results are focused on large hashmaps, but that is what we currently have to measure. Can we define the required level of insight you need to accept or reject this PR?

Thinking again about the sawtooth: apart from the influence of the allocator, in my opinion this is normal and expected, but not necessarily representative. Since storage is not reserved for the hashmap in the benchmark, it has to grow when necessary. Depending on the number of elements inserted (x values), if a grow was just triggered for this number because it exceeded the previous capacity, the
I see - I think I focused too much on the graphs and neglected to pay more attention to your original performance statistics. OK I see now.
OK never mind about the performance situation, I'm convinced on that end. Let's start with the fully qualified names for each hash map, along with their doc comments, and once we get those settled I think it will be easy to take the next step towards merging. So far it looks like we have:
Now imagine you're someone who has never used zig before and you want a hash map, and you're faced with these two options. Let's come up with some nice names and explanations to guide such a user into making the choice that will work best for them.
Totally agree, naming things is hard though. I've been in this position before, having to name two different implementations optimized for different use cases, and it's not easy. For example, we have to avoid names that, when compared to one another, lead to thinking there's an absolutely better option.
For the current hashmap, I propose
For the new hashmap from this PR, it's a bit harder; I would go for
"fast" is always implied in all code, and there is always a better adjective to describe in the name. Better for the name to mention the constraints rather than the ephemeral performance. Consider how misleading the name "quicksort" is, regardless of whether it was actually faster than its contemporaries. A better name for "quicksort" would have been "partition-exchange sort". It is misleading because depending on the usage pattern, the sliceable hash map might be faster, for example, if the bottleneck in the code will be iterating over the hash map. Better to describe the semantic limitations or lackthereof. Here are some ideas:
After thinking about this a bit I agree with your original proposal to make it the default and name it simply HashMap (the second bullet point). I inspected the code a bit more closely and I see that this new hash map has low memory overhead for small maps, which is great! I think this is quite suitable for default use.
First off, I'm sorry this comes shortly after @squeek502's work on the robin hood hashmap. I have been working on and off on this hashmap implementation for quite some time with the goal to contribute it to the std. However, I think the two implementations are complementary and no work was wasted.
I tried to adhere as much as possible to the current API,
Design
It's based on open addressing (all elements are stored in a single contiguous array) and linear probing (we resolve collisions by just trying the next slot in the array). It is quite similar to Google's widely publicized Swiss tables, but a lot simpler.
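As a rough illustration of that probing loop, here is a minimal Zig sketch (hypothetical names and sentinel values, not the PR's actual code; it ignores the metadata fingerprint described further down):

```zig
const FREE: u8 = 0; // assumed sentinel: slot was never used
const TOMBSTONE: u8 = 1; // assumed sentinel: slot was used, then deleted

/// Linear-probing lookup: start at `hash & mask` and walk forward one slot at a
/// time until the key is found or a free slot proves it is absent.
/// Assumes the map always keeps at least one free slot, so the loop terminates.
fn lookupIndex(metadata: []const u8, keys: []const u32, key: u32, hash: usize) ?usize {
    const mask = metadata.len - 1; // the number of slots is a power of two
    var i = hash & mask;
    while (metadata[i] != FREE) : (i = (i + 1) & mask) {
        if (metadata[i] != TOMBSTONE and keys[i] == key) return i;
    }
    return null;
}
```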
1. Fast
The goal is to have a hashmap that is as fast as possible for lookups (considered the most important use case) and insertion/removal (the second most important).
Statistics
We assume that the hash function is of good quality, giving unbiased results and distributing elements evenly over the available slots. This is an absolute prerequisite for hashmap implementations, and it is the case with Zig's standard hash function (though there are other candidates, obviously).
The probability of an element being assigned to a given slot is `1/number_of_slots`. This does not mean that we have a bijection between keys and slots, therefore we need to handle collisions.

Linear probing is an efficient (see next §) and very simple way to deal with collisions. I quite like that it's very easy to understand and has very predictable behavior for the CPU. However, when probing a collision chain to find the correct key, the algorithm needs to perform equality comparisons on the keys. For simple types such as ints that's not really a problem, but keys can be larger and much more complex (you don't want to do lots of string comparisons, for example). To remedy this, the hashmap keeps 6 bits from each hash and stores them as per-slot metadata (along with the slot's state: free, used, tombstone).
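For illustration, such a metadata byte could be packed like this (a hypothetical encoding meant only to show the idea, not necessarily the PR's exact layout):

```zig
const std = @import("std");

// One byte per slot: 2 bits of state plus 6 bits of hash fingerprint.
const Metadata = packed struct {
    state: u2, // 0 = free, 1 = tombstone, 2 = used
    fingerprint: u6, // the 6 high bits of the key's hash
};

comptime {
    std.debug.assert(@sizeOf(Metadata) == 1); // still a single byte per slot
}
```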
The mechanism to determine which ideal slot a key belongs to already uses `log2(number_of_slots)` bits from the hash, since we always keep a power-of-two number of slots to do fast modulus by masking. Keys that belong in the same ideal slot will thus have hashes with identical `log2(number_of_slots)` low bits. But since the hash function is assumed to give random results, the 6 high bits of their hashes are very likely to be different. Using these bits further helps to differentiate keys without resorting to equality comparison.
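A small sketch of that split, assuming a 64-bit hash (the function and field names are illustrative, not the PR's):

```zig
/// Splits a hash into the ideal-slot index (low bits) and a 6-bit fingerprint
/// (high bits). `number_of_slots` must be a power of two.
fn splitHash(hash: u64, number_of_slots: u64) struct { index: u64, fingerprint: u64 } {
    const mask = number_of_slots - 1;
    return .{
        .index = hash & mask, // fast modulus by masking
        .fingerprint = hash >> 58, // keep only the 6 highest bits
    };
}
```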
Spatial locality

Pieces of data that are accessed together benefit greatly from being close together in memory, as CPUs optimize for this use case. By using only 8 bits of metadata per element and storing all the metadata contiguously, when accessing one (while doing a lookup, for example) we typically get the metadata of 7 other slots for free (assuming a 64-byte cache line).
Effectively this means that even if the hashmap is nearly full and has collisions that require probing, it is almost free since a probe chain of length 8 is already in cache.
Not having to access the slot array until we're certain to have found the correct slot also means that we don't waste memory bandwidth on unused data. If the metadata were embedded within the slots, probing would make bad use of the cache, as metadata entries would be spaced apart by the size of an element.
2. Memory efficient
8 bits of metadata is quite low, and it's hard to go further with comparable advantages.
The hashmap also holds only one allocation that contains both the metadata and slot array. This is hard to measure but I think this helps reduce pressure on the allocator and fragmentation.
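A rough sketch of how the size of such a combined allocation could be computed (illustrative only; alignment padding and the PR's exact layout are omitted):

```zig
/// One buffer holds a small header, the metadata bytes, and the key/value slots
/// back to back; `capacity` is the number of slots.
fn combinedAllocSize(comptime K: type, comptime V: type, capacity: usize) usize {
    const header_size = @sizeOf(usize); // e.g. the capacity stored inside the allocation
    const metadata_size = capacity; // one metadata byte per slot
    const slots_size = capacity * (@sizeOf(K) + @sizeOf(V));
    return header_size + metadata_size + slots_size;
}
```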
3. Small footprint
Extra effort was spent in trying to keep the struct as small as possible: it's only 16 bytes (24 for the `Managed` variant). The fields embedded in the struct are the ones needed most frequently, to avoid unnecessary cache misses. Other field candidates are stored in the allocation (mainly the capacity).
To keep it at 16 bytes, the size of the hashmap is limited to 32 bits. This seems like a reasonable choice to me.
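For illustration, a 16-byte unmanaged header could look roughly like this on a 64-bit target (hypothetical field names, not necessarily the PR's exact fields):

```zig
// 8-byte pointer + two 32-bit counters = 16 bytes; capacity lives in the allocation.
const Unmanaged = struct {
    metadata: ?[*]u8 = null, // points into the single combined allocation
    size: u32 = 0, // number of live entries, hence the 32-bit limit
    available: u32 = 0, // insertions left before the next grow
};
```

A `Managed` variant that additionally carries an allocator pointer would presumably land at the 24 bytes mentioned above.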
Comparison with status quo
Pros
Cons
Performance
I've taken the best of 3 runs from my benchmark, which is basically just a longer version of what's now used in gotta-go-fast. Ported from https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/
Master
This PR
Right tool for the job
I think there's use for both implementations, and I renamed the current one to `sliceable_hash_map`, as being sliceable is its main advantage. That is only a proposal.