Skip to content

Generic hashtable #12848

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Generic hashtable #12848

wants to merge 1 commit into from

Conversation

mechakotik
Copy link
Contributor

SDL_HashTable uses callbacks to calculate hash, compare keys and destroy items. Compiler doesn't know anything about functions that hashtable is going to call, so some optimizations can't be applied. But in almost all cases we know the functions to be called at compile time.

Generic hashtable is a macro that generates hashtable implementation for concrete hash, compare and destroy functions. It allows them to be inlined by the compiler, resulting in 1.5-4x performance boost compared to SDL_HashTable. The implementation is taken from SDL_HashTable, with a few improvements:

  • Lowered max load factor to 128 (50%). Seems to be a good default value, as find is unlikely to load more than one cache line. The value also can be overriden by using SDL_MAKE_HASHTABLE_TYPE_EXT.

  • Removed live bit field, using probe_len = 0 to indicate whether the item is live instead. Seems to be a bit faster, as CPU doesn't need to reset live bit each time probe_len is accessed.

  • Removed max_probe_len. Unneeded, as we compare to probe_len of each item we visit, which is a stronger break condition.

Benchmarks were done with Google Benchmark on Ryzen 9 7940HS, Gentoo Linux, performance governor, turbo boost disabled.

> taskset -c 0 compare.py benchmarks ./hashtable ./hashtable_generic --benchmark_enable_random_interleaving=true
Benchmark                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------
insert/n:100                            -0.5584         -0.5573          1317           582          1317           583
insert/n:10000                          -0.6977         -0.6978        100929         30510        100841         30471
insert/n:1000000                        -0.6082         -0.6083      75190100      29460990      75101661      29420417
find/n:100/exists:0                     -0.8311         -0.8311           657           111           657           111
find/n:10000/exists:0                   -0.8244         -0.8245         61932         10874         61887         10863
find/n:1000000/exists:0                 -0.6472         -0.6476      51812061      18281111      51705272      18222144
find/n:100/exists:1                     -0.8419         -0.8419           760           120           760           120
find/n:10000/exists:1                   -0.7959         -0.7960         84718         17289         84666         17274
find/n:1000000/exists:1                 -0.6622         -0.6625      55225998      18654651      55115561      18600205
erase/n:100/exists:0                    -0.5960         -0.5952          1123           454          1124           455
erase/n:10000/exists:0                  -0.8033         -0.8033         76623         15071         76567         15059
erase/n:1000000/exists:0                -0.7002         -0.7002      53117387      15926845      52996234      15888620
erase/n:100/exists:1                    -0.5693         -0.5688          1379           594          1379           595
erase/n:10000/exists:1                  -0.6896         -0.6898        101706         31569        101613         31519
erase/n:1000000/exists:1                -0.5730         -0.5730      76410369      32630551      76293897      32577283
OVERALL_GEOMEAN                         -0.7112         -0.7112             0             0             0             0
Benchmark code for SDL_HashTable
#include <SDL3/SDL.h>
#include <SDL_hashtable.h>
#include <benchmark/benchmark.h>
#include <chrono>
#include <random>

void insert(benchmark::State& state) {
    int n = state.range(0);
    std::mt19937_64 gen(42);
    std::vector<Uint64> seq(n);
    for(int i = 0; i < n; i++) {
        seq[i] = gen();
    }
    benchmark::DoNotOptimize(seq);

    SDL_HashTable* table = SDL_CreateHashTable(n * 2, false, SDL_HashID, SDL_KeyMatchID, nullptr, nullptr);
    benchmark::DoNotOptimize(table);

    for(auto _ : state) {
        state.PauseTiming();
        SDL_ClearHashTable(table);
        state.ResumeTiming();

        for(int i = 0; i < n; i++) {
            SDL_InsertIntoHashTable(table, (const void*)seq[i], nullptr, false);
        }
    }

    SDL_DestroyHashTable(table);
}

BENCHMARK(insert)->ArgNames({"n"})->ArgsProduct({{100, 10000, 1000000}});

void find(benchmark::State& state) {
    int n = state.range(0);
    std::mt19937_64 gen(42);
    std::vector<Uint64> seq(n * 2);
    for(int i = 0; i < n * 2; i++) {
        seq[i] = gen();
    }
    benchmark::DoNotOptimize(seq);

    SDL_HashTable* table = SDL_CreateHashTable(0, false, SDL_HashID, SDL_KeyMatchID, nullptr, nullptr);
    for(int i = 0; i < n; i++) {
        SDL_InsertIntoHashTable(table, (const void*)seq[i], nullptr, false);
    }
    benchmark::DoNotOptimize(table);

    for(auto _ : state) {
        const void* value = nullptr;
        if(state.range(1)) {
            for(int i = 0; i < n; i++) {
                SDL_FindInHashTable(table, (const void*)seq[i], &value);
            }
        } else {
            for(int i = n; i < n * 2; i++) {
                SDL_FindInHashTable(table, (const void*)seq[i], &value);
            }
        }
        benchmark::DoNotOptimize(value);
    }
}

BENCHMARK(find)->ArgNames({"n", "exists"})->ArgsProduct({{100, 10000, 1000000}, {0, 1}});

void erase(benchmark::State& state) {
    int n = state.range(0);
    std::mt19937_64 gen(42);
    std::vector<Uint64> seq(n * 2);
    for(int i = 0; i < n * 2; i++) {
        seq[i] = gen();
    }
    benchmark::DoNotOptimize(seq);

    SDL_HashTable* table = SDL_CreateHashTable(0, false, SDL_HashID, SDL_KeyMatchID, nullptr, nullptr);
    benchmark::DoNotOptimize(table);

    for(auto _ : state) {
        state.PauseTiming();
        SDL_ClearHashTable(table);
        for(int i = 0; i < n; i++) {
            SDL_InsertIntoHashTable(table, (const void*)seq[i], nullptr, false);
        }
        state.ResumeTiming();

        if(state.range(1)) {
            for(int i = 0; i < n; i++) {
                SDL_RemoveFromHashTable(table, (const void*)seq[i]);
            }
        } else {
            for(int i = n; i < n * 2; i++) {
                SDL_RemoveFromHashTable(table, (const void*)seq[i]);
            }
        }
    }
}

BENCHMARK(erase)->ArgNames({"n", "exists"})->ArgsProduct({{100, 10000, 1000000}, {0, 1}});

BENCHMARK_MAIN();
Benchmark code for generic hashtable
#include <SDL3/SDL.h>
#include <SDL3/SDL_stdinc.h>
#include <SDL_generic_hashtable.h>
#include <SDL_hashtable.h>
#include <benchmark/benchmark.h>
#include <chrono>
#include <iostream>
#include <random>

SDL_MAKE_HASHTABLE_TYPE(IDHashTable, SDL_HashIDInline, SDL_KeyMatchIDInline, SDL_DestroyEmptyInline, false);

void insert(benchmark::State& state) {
    int n = state.range(0);
    std::mt19937_64 gen(42);
    std::vector<Uint64> seq(n);
    for(int i = 0; i < n; i++) {
        seq[i] = gen();
    }
    benchmark::DoNotOptimize(seq);

    IDHashTable* table = IDHashTable_Create(n * 2, nullptr);
    benchmark::DoNotOptimize(table);

    for(auto _ : state) {
        state.PauseTiming();
        IDHashTable_Clear(table);
        state.ResumeTiming();

        for(int i = 0; i < n; i++) {
            IDHashTable_Insert(table, (const void*)seq[i], nullptr, false);
        }
    }

    IDHashTable_Destroy(table);
}

BENCHMARK(insert)->ArgNames({"n"})->ArgsProduct({{100, 10000, 1000000}});

void find(benchmark::State& state) {
    int n = state.range(0);
    std::mt19937_64 gen(42);
    std::vector<Uint64> seq(n * 2);
    for(int i = 0; i < n * 2; i++) {
        seq[i] = gen();
    }
    benchmark::DoNotOptimize(seq);

    IDHashTable* table = IDHashTable_Create(0, nullptr);
    for(int i = 0; i < n; i++) {
        IDHashTable_Insert(table, (const void*)seq[i], nullptr, false);
    }
    benchmark::DoNotOptimize(table);

    for(auto _ : state) {
        const void* value = nullptr;
        if(state.range(1)) {
            for(int i = 0; i < n; i++) {
                IDHashTable_Find(table, (const void*)seq[i], &value);
            }
        } else {
            for(int i = n; i < n * 2; i++) {
                IDHashTable_Find(table, (const void*)seq[i], &value);
            }
        }
        benchmark::DoNotOptimize(value);
    }
}

BENCHMARK(find)->ArgNames({"n", "exists"})->ArgsProduct({{100, 10000, 1000000}, {0, 1}});

void erase(benchmark::State& state) {
    int n = state.range(0);
    std::mt19937_64 gen(42);
    std::vector<Uint64> seq(n * 2);
    for(int i = 0; i < n * 2; i++) {
        seq[i] = gen();
    }
    benchmark::DoNotOptimize(seq);

    IDHashTable* table = IDHashTable_Create(0, nullptr);
    benchmark::DoNotOptimize(table);

    for(auto _ : state) {
        state.PauseTiming();
        IDHashTable_Clear(table);
        for(int i = 0; i < n; i++) {
            IDHashTable_Insert(table, (const void*)seq[i], nullptr, false);
        }
        state.ResumeTiming();

        if(state.range(1)) {
            for(int i = 0; i < n; i++) {
                IDHashTable_Remove(table, (const void*)seq[i]);
            }
        } else {
            for(int i = n; i < n * 2; i++) {
                IDHashTable_Remove(table, (const void*)seq[i]);
            }
        }
    }
}

BENCHMARK(erase)->ArgNames({"n", "exists"})->ArgsProduct({{100, 10000, 1000000}, {0, 1}});

BENCHMARK_MAIN();

This was not tested thoroughly, but I've done basic stress test against SDL_HashTable, so at least there is no obvious issues with it.

Stress test code
#include <SDL3/SDL.h>
#include <SDL3/SDL_stdinc.h>
#include <SDL_generic_hashtable.h>
#include <SDL_hashtable.h>
#include <benchmark/benchmark.h>
#include <chrono>
#include <iostream>
#include <random>

SDL_MAKE_HASHTABLE_TYPE(IDHashTable, SDL_HashIDInline, SDL_KeyMatchIDInline, SDL_DestroyEmptyInline, false);

std::mt19937 gen(std::chrono::steady_clock::now().time_since_epoch().count());

void stress() {
    SDL_HashTable* table = SDL_CreateHashTable(0, true, SDL_HashID, SDL_KeyMatchID, NULL, NULL);
    IDHashTable* idtable = IDHashTable_Create(0, NULL);

    const int N = 1000;
    std::vector<Uint32> keys(N), values(N);
    for(int i = 0; i < N; i++) {
        keys[i] = gen();
        values[i] = gen();
    }

    Uint32* res1 = new Uint32;
    Uint32* res2 = new Uint32;
    bool fl1, fl2;

    for(int it = 0; it < N * 20; it++) {
        Uint32 op = gen() % 3;
        Uint32 arg = gen() % N;
        switch(op) {
            case 0:
                fl1 = SDL_InsertIntoHashTable(table, (const void*)keys[arg], &values[arg], true);
                fl2 = IDHashTable_Insert(idtable, (const void*)keys[arg], &values[arg], true);
                if(fl1 != fl2) exit(1);
                break;
            case 1:
                fl1 = SDL_RemoveFromHashTable(table, (const void*)keys[arg]);
                fl2 = IDHashTable_Remove(idtable, (const void*)keys[arg]);
                if(fl1 != fl2) exit(1);
                break;
            case 2:
                *res1 = UINT32_MAX;
                *res2 = UINT32_MAX;
                SDL_FindInHashTable(table, (const void*)keys[arg], (const void**)(&res1));
                IDHashTable_Find(idtable, (const void*)keys[arg], (const void**)(&res2));
                if(*res1 != *res2) exit(1);
                break;
        }
    }
}

int main() {
    while(true) {
        stress();
    }
}

@slouken slouken requested a review from icculus April 22, 2025 19:04
@slouken slouken added this to the 3.4.0 milestone Apr 22, 2025
@slouken
Copy link
Collaborator

slouken commented Apr 22, 2025

Do you have a separate commit that switches SDL to use this implementation? Do you have info on real-world performance and size differences?

@icculus
Copy link
Collaborator

icculus commented Apr 23, 2025

I was trying to avoid this sort of thing, but we've started to depend on hashes pretty heavily in some places, and the benchmarks are pretty compelling.

I'd like to see if the implementation tweaks are more important than the inlining, but I doubt they are.

I'd also like @Akaricchi's opinion here if possible.

@Akaricchi
Copy link
Contributor

I've always advocated for templated hashtables (my original implementation was templated), but if we're doing this, I'd like to see adjustable key and value types instead of being stuck with void*. IMO that's actually the reason to bother with a templated implementation, and the inlining is a nice bonus.

I'm not a fan of the function definitions contained entirely inside macros. A reusable header would've been a little nicer IMO, e.g.

#define SDL_HASHTABLE_SUFFIX             StringToMyBigStruct
#define SDL_HASHTABLE_KEY_TYPE           char*
#define SDL_HASHTABLE_VALUE_TYPE         MyBigStruct
#define SDL_HASHTABLE_KEYS_EQUAL(a, b)   (!SDL_strcmp(a, b))
#define SDL_HASHTABLE_HASH_KEY(k)        my_hash_func(k)
#define SDL_HASHTABLE_COPY_KEY(dst, src) (*(dst) = SDL_strdup(src))
#define SDL_HASHTABLE_FREE_KEY(k)        SDL_free(k)
// ... other parameters here

#include "SDL_hashtable.impl.h"
// Inside header: no include guard, #undef all parameter macros at the end
// This way it can be re-included from the same file if needed

Additionally, some of the internal helper functions like get_probe_length don't need to be templated, though it probably doesn't matter if it's going to get inlined anyway.

As for the implementation tweaks, lowering the load factor this much is questionable to me, since robin hood hashtables are supposed to be well behaved under high loads. I'm not sure how much it affects performance, but it's definitely going to increase memory usage. The other tweaks look ok from a quick glance, but it's almost 6 AM and i haven't slept in like 12 hours, so I'll need to have another look at this later.

@mechakotik
Copy link
Contributor Author

Thanks for the feedback! I've implemented it as reusable header, now the usage looks like this:

#define SDL_HASHTABLE_TYPENAME UintTable
#define SDL_HASHTABLE_KEY_TYPE Uint64
#define SDL_HASHTABLE_VALUE_TYPE Uint64
#define SDL_HASHTABLE_HASH_KEY(userdata, key) ((key) * 0x9E3779B1u)
#define SDL_HASHTABLE_KEYS_EQUAL(userdata, a, b) ((a) == (b))
#define SDL_HASHTABLE_THREAD_SAFE false
#define SDL_HASHTABLE_MAX_LOAD_FACTOR 128
#include <SDL_generic_hashtable.h>
Updated benchmark code
#include <SDL3/SDL.h>
#include <SDL3/SDL_stdinc.h>
#include <SDL_hashtable.h>
#include <benchmark/benchmark.h>
#include <chrono>
#include <iostream>
#include <random>

#define SDL_HASHTABLE_TYPENAME UintTable
#define SDL_HASHTABLE_KEY_TYPE Uint64
#define SDL_HASHTABLE_VALUE_TYPE Uint64
#define SDL_HASHTABLE_HASH_KEY(userdata, key) ((key) * 0x9E3779B1u)
#define SDL_HASHTABLE_KEYS_EQUAL(userdata, a, b) ((a) == (b))
#include <SDL_generic_hashtable.h>

void insert(benchmark::State& state) {
    int n = state.range(0);
    std::mt19937_64 gen(42);
    std::vector<Uint64> seq(n);
    for(int i = 0; i < n; i++) {
        seq[i] = gen();
    }
    benchmark::DoNotOptimize(seq);

    UintTable* table = UintTable_Create(n * 2, nullptr);
    benchmark::DoNotOptimize(table);

    for(auto _ : state) {
        state.PauseTiming();
        UintTable_Clear(table);
        state.ResumeTiming();

        for(int i = 0; i < n; i++) {
            UintTable_Insert(table, seq[i], 0, false);
        }
    }

    UintTable_Destroy(table);
}

BENCHMARK(insert)->ArgNames({"n"})->ArgsProduct({{100, 10000, 1000000}});

void find(benchmark::State& state) {
    int n = state.range(0);
    std::mt19937_64 gen(42);
    std::vector<Uint64> seq(n * 2);
    for(int i = 0; i < n * 2; i++) {
        seq[i] = gen();
    }
    benchmark::DoNotOptimize(seq);

    UintTable* table = UintTable_Create(0, nullptr);
    for(int i = 0; i < n; i++) {
        UintTable_Insert(table, seq[i], 0, false);
    }
    benchmark::DoNotOptimize(table);

    for(auto _ : state) {
        Uint64 value;
        if(state.range(1)) {
            for(int i = 0; i < n; i++) {
                UintTable_Find(table, seq[i], &value);
            }
        } else {
            for(int i = n; i < n * 2; i++) {
                UintTable_Find(table, seq[i], &value);
            }
        }
        benchmark::DoNotOptimize(value);
    }
}

BENCHMARK(find)->ArgNames({"n", "exists"})->ArgsProduct({{100, 10000, 1000000}, {0, 1}});

void erase(benchmark::State& state) {
    int n = state.range(0);
    std::mt19937_64 gen(42);
    std::vector<Uint64> seq(n * 2);
    for(int i = 0; i < n * 2; i++) {
        seq[i] = gen();
    }
    benchmark::DoNotOptimize(seq);

    UintTable* table = UintTable_Create(0, nullptr);
    benchmark::DoNotOptimize(table);

    for(auto _ : state) {
        state.PauseTiming();
        UintTable_Clear(table);
        for(int i = 0; i < n; i++) {
            UintTable_Insert(table, seq[i], 0, false);
        }
        state.ResumeTiming();

        if(state.range(1)) {
            for(int i = 0; i < n; i++) {
                UintTable_Remove(table, seq[i]);
            }
        } else {
            for(int i = n; i < n * 2; i++) {
                UintTable_Remove(table, seq[i]);
            }
        }
    }
}

BENCHMARK(erase)->ArgNames({"n", "exists"})->ArgsProduct({{100, 10000, 1000000}, {0, 1}});

BENCHMARK_MAIN();
Updated stress test code
#include <SDL3/SDL.h>
#include <SDL3/SDL_stdinc.h>
#include <SDL_hashtable.h>
#include <benchmark/benchmark.h>
#include <chrono>
#include <iostream>
#include <random>

std::mt19937 gen(std::chrono::steady_clock::now().time_since_epoch().count());

void stress() {
    SDL_HashTable* table = SDL_CreateHashTable(0, true, SDL_HashID, SDL_KeyMatchID, NULL, NULL);
    UintTable* utable = UintTable_Create(0, NULL);

    const int N = 1000;
    std::vector<Uint64> keys(N), values(N);
    for(int i = 0; i < N; i++) {
        keys[i] = gen();
        values[i] = gen();
    }

    Uint64 res1, res2;
    bool fl1, fl2;

    for(int it = 0; it < N * 20; it++) {
        Uint64 op = gen() % 3;
        Uint64 arg = gen() % N;
        switch(op) {
            case 0:
                fl1 = SDL_InsertIntoHashTable(table, (const void*)keys[arg], (const void*)values[arg], true);
                fl2 = UintTable_Insert(utable, keys[arg], values[arg], true);
                if(fl1 != fl2) {
                    exit(1);
                }
                break;
            case 1:
                fl1 = SDL_RemoveFromHashTable(table, (const void*)keys[arg]);
                fl2 = UintTable_Remove(utable, keys[arg]);
                if(fl1 != fl2) {
                    exit(2);
                }
                break;
            case 2:
                res1 = UINT64_MAX;
                res2 = UINT64_MAX;
                SDL_FindInHashTable(table, (const void*)keys[arg], (const void**)(&res1));
                UintTable_Find(utable, keys[arg], &res2);
                if(res1 != res2) {
                    exit(3);
                }
                break;
        }
    }
}

int main() {
    while(true) {
        stress();
    }
}

Key and value are still stored in SDL_HashItem, so this is not suitable for large size keys and values. In that case user is expected to manage memory by himself and store pointers/indices in hashtable.

SDL_HashTable uses callbacks to calculate hash, compare keys and destroy items. Compiler doesn't know anything about functions that hashtable is going to call, so some optimizations can't be applied. But in almost all cases we know the functions to be called at compile time.

Generic hashtable is a reusable header that generates hashtable implementation for concrete hash, compare and destroy functions. It allows them to be inlined by the compiler, resulting in 1.5-4x performance boost compared to SDL_HashTable. The implementation is taken from SDL_HashTable, with a few improvements:

- Lowered max load factor to 128 (50%). Seems to be a good default value, as find is unlikely to load more than one cache line. The value also can be overriden by setting SDL_HASHTABLE_MAX_LOAD_FACTOR.

- Removed live bit field, using probe_len = 0 to indicate whether the item is live instead. Seems to be a bit faster, as CPU doesn't need to reset live bit each time probe_len is accessed.

- Removed max_probe_len. Unneeded, as we compare to probe_len of each item we visit, which is a stronger break condition.
@mechakotik
Copy link
Contributor Author

Do you have a separate commit that switches SDL to use this implementation?

I'll try to do this when the API will be more stable, currently it changes quickly and keeping everything up to date will be a lot of work.

@icculus
Copy link
Collaborator

icculus commented Apr 24, 2025

I'll try to do this when the API will be more stable, currently it changes quickly and keeping everything up to date will be a lot of work.

Yes, don't switch things over until everyone is happy with the implementation.

I appreciate how responsive to feedback you're being, but let's not overwork you. :)

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants