[llvm-branch-commits] [libc] [libc][wctype] Upstream common utils header file from PtrHash-cc prototype to LLVM libc (PR #174798)

Muhammad Bassiouni via llvm-branch-commits llvm-branch-commits at lists.llvm.org
Sat Jan 10 10:15:54 PST 2026


bassiounix wrote:

> More of a general question looking at the complexity of the solution: have you actually measured that this perfect hash approach does better than a simple identity hash? I haven't looked at the hash function it generates in detail, but I assume it has a pretty big constant overhead over a simple modulo. Is it worth it over a few collisions? The lower and upper conversion tables will have only ~1400 entries and even the library author says PtrHash is optimized for tables with [at least a million entries](https://github.com/RagnarGrootKoerkamp/PtrHash?tab=readme-ov-file#input).

We want a perfect hash table by design, not a hash table with holes or collisions. We know our data at compile time, so a sparser implementation would be wasteful memory-wise.

While searching the available implementations of perfect hash functions, I found PtrHash to be the fastest across any number of keys, from the smallest to the largest, with a very low memory footprint, which is exactly what we are looking for. If you remember, I was working with gperf at first, but it had some limitations, so I ended up with PtrHash.

Alternatively, I could work out a minimal perfect hash function by hand for our existing data. That would mean that whenever the conversion data is updated, a new hash function has to be derived manually.
Writing the perfect hash function by hand is a bad idea for that reason: it's a lot of effort, and it would have to be redone for every Unicode update.

A third option is a generic (unoptimized) hash table implementation, which is undesirable both memory-wise and performance-wise.

A PtrHash lookup is a single pass: no probing and no extra cache misses.
Its internal memory footprint is very small, as I said: about 2.4 bits per key (roughly 840 additional bytes for both tables) plus the size of the value table elements.
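To make the single-pass claim concrete, here is a hedged C++ sketch of a pilot-based perfect-hash lookup. The table names, sizes, and the `mix` function are hypothetical placeholders, not the actual PtrHash construction or the data generated in this PR; the point is only that the lookup is a couple of multiplies and two table reads, with no probing loop.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical generated tables: a small pilot array (a few bits per key)
// and the conversion values themselves, both emitted offline.
constexpr size_t NUM_BUCKETS = 512;
constexpr size_t NUM_SLOTS = 1400;
constexpr uint8_t PILOTS[NUM_BUCKETS] = {/* filled in by the generator */};
constexpr uint32_t VALUES[NUM_SLOTS] = {/* filled in by the generator */};

// Illustrative 64-bit mixer; the real scheme fixes a specific hash family
// when the pilots are searched for.
constexpr uint64_t mix(uint64_t x) {
  x ^= x >> 33;
  x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 29;
  return x;
}

// Single pass: hash -> bucket -> pilot -> slot -> value. The pilots were
// chosen offline so every key maps to a distinct slot, so there is no
// probing and no collision handling at lookup time.
constexpr uint32_t lookup(uint32_t key) {
  uint64_t h = mix(key);
  size_t bucket = static_cast<size_t>(h % NUM_BUCKETS);
  size_t slot = static_cast<size_t>(mix(h ^ PILOTS[bucket]) % NUM_SLOTS);
  return VALUES[slot];
}
```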

I can only expect this conversion data to keep growing over time, and having something fast, memory-efficient, and future-proof for LLVM libc seems like a very good idea!

Generating the internal pilots has a constant overhead regardless of the number of keys, because the original implementation is written in Rust and computes the pilots at runtime. My initial plan was to make the code `constexpr` so the pilots are built at compile time, but that increased the number of `constexpr` steps (which we ideally don't want). So what I'm going with is computing the pilots when we generate the data (which happens once anyway). That gives us all the benefits of the implementation without the downsides of the other approaches :)
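As a sketch of what the generate-time approach could look like (the names, namespace, and layout here are my own assumptions, not the header this PR adds): the offline generator searches for the pilots once and emits them as plain constant arrays, so building libc just compiles data and no `constexpr` evaluation of the search is needed.

```cpp
// Hypothetical header emitted by the offline generator and checked in next
// to the Unicode conversion data; regenerating the data regenerates it.
#include <cstddef>
#include <cstdint>

namespace wctype_internal { // placeholder namespace for this sketch

// Pilots found once at data-generation time. At libc build time they are
// ordinary constants, so no constexpr search has to be evaluated.
inline constexpr size_t TOLOWER_NUM_BUCKETS = 512;
inline constexpr uint8_t TOLOWER_PILOTS[TOLOWER_NUM_BUCKETS] = {
    // values produced by the generator (zero-filled here for illustration)
};

// The runtime lookup then only reads TOLOWER_PILOTS and the value table,
// as in the earlier sketch.

} // namespace wctype_internal
```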

https://github.com/llvm/llvm-project/pull/174798
