cache-align the shards to improve throughput #303

conradludgate · 2024-06-10T09:00:35Z

This PR uses CachePadded to cache-align the RwLock shards for improved locking performance. On my M2, this gives considerable (30-50%) improvements in both latency and throughput.

Note

This increases memory usage. On x86_64, the cache alignment is set to 128 bytes. The current size of RwLock<HashMap<K, V, std::RandomState>> is 8 + 16 + 32 = 56 bytes. The current size of RwLock<HashMap<K, V, ahash::RandomState>> is 8 + 32 + 32 = 72 bytes. Padding each shard to 128 bytes therefore roughly doubles the size of the empty collection. E.g. on a 64-core CPU, the empty std dashmap grows from 14KiB to 32KiB. This increase is constant, though: it does not scale with the number of elements inserted into the map.
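The arithmetic above can be checked with a minimal sketch. It makes several assumptions that are not part of this PR: Padded is a hypothetical stand-in for CachePadded with a fixed 128-byte alignment, StdShard is a 56-byte placeholder rather than the real RwLock<HashMap<..>> type, and shard_count assumes the default of four shards per core rounded up to a power of two (which matches the 14KiB figure for 64 cores).

```rust
use std::mem::size_of;

/// Hypothetical stand-in for CachePadded, assuming 128-byte alignment.
/// The compiler rounds the struct size up to a multiple of the alignment.
#[repr(align(128))]
pub struct Padded<T>(pub T);

/// Placeholder with the 56-byte size quoted above for a std-hasher shard.
/// This is not the real shard type, only a type of the same size.
pub type StdShard = [u8; 56];

/// Placeholder with the 72-byte size quoted above for an ahash shard.
pub type AhashShard = [u8; 72];

/// Assumed default shard count: 4 shards per core, rounded up to a power of two.
pub fn shard_count(cores: usize) -> usize {
    (cores * 4).next_power_of_two()
}

/// Bytes used by the shard array alone, for shards of type `T`.
pub fn shard_array_bytes<T>(cores: usize) -> usize {
    shard_count(cores) * size_of::<T>()
}
```

Under these assumptions, shard_array_bytes::<StdShard>(64) is 256 * 56 bytes and shard_array_bytes::<Padded<StdShard>>(64) is 256 * 128 bytes. Both the 56-byte and the 72-byte shard pad to the same 128 bytes, so the padded size does not depend on the hasher.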

Important

This is a breaking change for the raw shards API.

Benchmark results

In this benchmark,