add benchmarks for std.AutoHashMap #2
Relevant, but not very fleshed out (and not updated to latest Zig), so maybe not that useful: https://github.com/squeek502/zig-hash-map-bench

The links in the readme should be useful, though.
One thing I'm unsure about with this is whether or not it's possible to have a single benchmark be representative of the speed of a hash map (if you look at the links in the repo linked above, all benchmarks are presented as graphs). That is, because hash maps can have very different performance characteristics depending on the size of the map, a benchmark like 'add 1,000,000 elements' would only tell you about performance at that particular size.

Unless I'm thinking about it wrong, I'm not sure this is a solvable problem without breaking the benchmarking into multiple different benchmarks differentiated by the size of the hash map being benchmarked. And, if that's true, then it's not clear how many different sizes should be benchmarked and what those sizes should be (would only 'small' and 'huge' be necessary? 'small', 'medium', 'large', and 'huge'? way more?).
The idea here would be to create multiple different benchmarks, one for each "use case" that you want to keep a performance history on. Hopefully that is reasonable to maintain, since the manifest JSON file lets you re-use directories for multiple different benchmarks, so you should only have to implement one tiny function for each different benchmark. My intuition tells me that 3-5 benchmarks should be enough to get a reasonable performance picture of this API.

Another way to approach this would be to create benchmarks that represent real-world usage of the API. So if you have a project that uses the API, make a benchmark that simulates how your project would use it. This ensures that the performance of your use case is explicitly being monitored for regressions. As you can imagine, this will become relevant for making sure self-hosted is and stays fast.
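For concreteness, here is a rough sketch of what size-differentiated AutoHashMap benchmarks could look like. It is only an illustration, not code from this repo's manifest: the function names, sizes, and the multiplier constant are made up, and it is written against the same older std.AutoHashMap API that appears in the snippet later in this thread (init taking a pointer to an allocator, putNoClobber, getValue).

```zig
const std = @import("std");
const AutoHashMap = std.AutoHashMap;

// Shared body: insert `n` keys, then look each one up again.
fn insertThenLookup(alloc: *std.mem.Allocator, n: u32) !u64 {
    var map = AutoHashMap(u32, u32).init(alloc);
    defer map.deinit();
    var i: u32 = 0;
    while (i < n) : (i += 1) {
        try map.putNoClobber(i, i *% 2654435761);
    }
    var sum: u64 = 0;
    i = 0;
    while (i < n) : (i += 1) {
        if (map.getValue(i)) |v| sum += v;
    }
    return sum; // return a value so the work can't be optimized away
}

// One tiny entry point per benchmarked size (hypothetical names/sizes).
pub fn benchSmall(alloc: *std.mem.Allocator) !u64 {
    return insertThenLookup(alloc, 1000);
}
pub fn benchMedium(alloc: *std.mem.Allocator) !u64 {
    return insertThenLookup(alloc, 100000);
}
pub fn benchLarge(alloc: *std.mem.Allocator) !u64 {
    return insertThenLookup(alloc, 10000000);
}
```

The point is that each entry point stays tiny: the sizes are the only thing that varies, so the performance history can show how the map behaves at each scale.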
So I was toying around with Zig and implemented Project Euler 14 (longest Collatz sequence for starting numbers under one million) for fun. Since it was easy, I ported it directly to Go and ran some benchmarks. Since this makes heavy use of the HashMap, maybe it's a decent benchmark for some use cases. TL;DR:
Code:

```zig
const std = @import("std");
const AutoHashMap = std.AutoHashMap;

// One step of the Collatz sequence.
fn step(x: u64) u64 {
    if (x & 1 > 0) {
        return 3 * x + 1;
    } else {
        return x / 2;
    }
}

// Memoized length of the Collatz sequence starting at x.
fn length(cache: *AutoHashMap(u64, u64), x: u64) anyerror!u64 {
    if (x <= 1) return 0;
    if (cache.getValue(x)) |e| {
        return e;
    } else {
        const next = step(x);
        const len = 1 + try length(cache, next);
        try cache.putNoClobber(x, len);
        return len;
    }
}

pub fn main() anyerror!void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const alloc = &arena.allocator;

    var cache = AutoHashMap(u64, u64).init(alloc);
    defer cache.deinit();
    try cache.ensureCapacity(2000000);

    // Find the starting number below one million with the longest chain.
    var x: u64 = 0;
    var maxx: u64 = 0;
    var maxl: u64 = 0;
    while (x < 1000000) : (x += 1) {
        const l = try length(&cache, x);
        if (l > maxl) {
            maxl = l;
            maxx = x;
        }
    }
    std.debug.warn("{} {}\n", .{ maxx, maxl });
}
```

```go
package main

import "fmt"

// One step of the Collatz sequence.
func step(x uint64) uint64 {
    if x&1 > 0 {
        return 3*x + 1
    } else {
        return x / 2
    }
}

// Memoized length of the Collatz sequence starting at x.
func length(cache map[uint64]uint64, x uint64) uint64 {
    if x <= 1 {
        return 0
    }
    if e, ok := cache[x]; ok {
        return e
    } else {
        next := step(x)
        l := 1 + length(cache, next)
        cache[x] = l
        return l
    }
}

func main() {
    cache := make(map[uint64]uint64, 2000000)
    var maxx uint64 = 0
    var maxl uint64 = 0
    for x := uint64(0); x < 1000000; x++ {
        l := length(cache, x)
        if l > maxl {
            maxl = l
            maxx = x
        }
    }
    fmt.Println(maxx, maxl)
}
```

Timings
Thanks @lemmi for this benchmark; it's now added to this repo. Before closing this issue, I think we should come up with 2-3 more benchmarks that try to cover other use cases of hash maps: maybe one that deals with strings, and one that creates and destroys many small hash maps in memory. Any other suggestions are welcome too.
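To make the second suggestion concrete, here is a rough sketch of a "many small, short-lived maps" benchmark. It is my illustration, not code from this repo; the map size and loop count are arbitrary, and it again uses the older std.AutoHashMap API from the snippet above.

```zig
const std = @import("std");
const AutoHashMap = std.AutoHashMap;

// Hypothetical use case: lots of tiny, short-lived maps, so init/deinit and
// small-table growth dominate rather than steady-state lookups.
pub fn benchSmallMapChurn(alloc: *std.mem.Allocator) !u64 {
    var sum: u64 = 0;
    var round: u32 = 0;
    while (round < 100000) : (round += 1) {
        var map = AutoHashMap(u32, u32).init(alloc);
        defer map.deinit(); // runs at the end of each loop iteration
        var i: u32 = 0;
        while (i < 8) : (i += 1) {
            try map.putNoClobber(i, round +% i);
        }
        if (map.getValue(round % 8)) |v| sum += v;
    }
    return sum; // keep the result observable so nothing is optimized away
}
```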
Just saw Andrew's tweets, good work everyone. The code is a bit old, however.
Short update: while @andrewrk integrated the test, I was in the middle of benchmarking my implementation of a Swiss-table hash map that uses SIMD for (hopefully) faster access times. Code: https://gist.github.com/lemmi/6576efbfd9fcc7fbf74371e42e3146a8

Once I've ported my code over to the new API, I'll run the tests again with the new HashMap from current master.

Brief explanation: the last number indicates the SIMD width, so 32, for example, means 32-byte-wide SIMD is used for metadata storage. With 16 the compiler emits SSE code; with 32 I can actually see AVX code. It's still rather puzzling to me why the scalar version is the fastest on all architectures. Maybe the cost of loading a whole vector and iterating through the bitmask is too high compared to the cheap comparison of a 64-bit key, which will almost never be a miss with this kind of table.

AMD FX(tm)-8350

Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz

AMD Ryzen 5 3600X 6-Core Processor
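For readers unfamiliar with the approach, here is a minimal sketch of the group-matching idea being described (not the code from the gist above): each group stores 16 one-byte tags, and a single vector compare checks all 16 slots at once. It assumes current Zig vector builtins (@Vector, @splat, @reduce), whose syntax differs from the older Zig used elsewhere in this thread.

```zig
const Group = @Vector(16, u8); // 16 one-byte metadata tags, one per slot

// Return the index of the first slot whose tag matches the key's one-byte
// fingerprint `h2`, or null if no slot in the group matches.
fn findInGroup(tags: Group, h2: u8) ?usize {
    const needle: Group = @splat(h2);
    const matches = tags == needle; // one SIMD compare covers all 16 slots
    if (!@reduce(.Or, matches)) return null; // whole group rejected at once
    var i: usize = 0;
    while (i < 16) : (i += 1) {
        if (matches[i]) return i;
    }
    return null;
}
```

With `@Vector(32, u8)` the same comparison covers a 32-byte group, which is the 16-versus-32 width distinction in the measurements above.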
@lemmi In my experience, going for a similar design to Google's hash table (8-bit metadata per element) with scalar code was faster. I think the SIMD is beneficial to them because they try to achieve an extremely high load factor (97.5% iirc), which implies quite a lot of probing, especially when they use C++'s standard hash, which is of very poor quality. In your implementation, I see you're using modulus to map hashes to slots; a power-of-two capacity with a bitwise mask is usually noticeably cheaper.
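As an illustration of that last point (a general sketch, not code from lemmi's gist): reducing a 64-bit hash to a slot index with `%` needs an integer division, while a power-of-two table size lets the same reduction be a single AND.

```zig
// Works for any table size, but compiles to an integer division.
fn slotByMod(hash: u64, capacity: u64) u64 {
    return hash % capacity;
}

// Requires the capacity to be a power of two; compiles to a single AND.
fn slotByMask(hash: u64, capacity_pow2: u64) u64 {
    return hash & (capacity_pow2 - 1);
}
```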
@Sahnvour thanks a lot for your input. Your comment made me want to measure memory usage behaviour when not preallocating memory. Also, getting rid of the modulus operations did indeed gain me a couple tens of milliseconds compared to the numbers above.

Benchmarks without preallocation: the third column is the maximum allocated memory. I left out the worse-performing vector sizes. Code: https://gist.github.com/lemmi/c4e9dbedc5424a82e300c0000cf46728

AMD FX(tm)-8350

Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz

AMD Ryzen 5 3600X 6-Core Processor

I hope I don't carry this too much off topic.
This is very much on topic!
I'm adding some benchmarks that can take up to a few tens of seconds. Should we make them as atomic as possible, or is it ok to have them all in a single benchmark?