[LLVMdev] RFC: Using hashing for switch statements

Fri Jan 24 12:35:44 PST 2014

On Fri, Jan 24, 2014 at 08:51:56PM +0100, Jasper Neumann wrote:
> Why not Jenkins' hashing?
> 
> The search time of Jenkins' hashing might become unacceptably long
> for extraordinary large label sets.

It is generally not linear time, making it hardly appropiate for the
default pass pipeline. That's not just a question of very large label
sets -- it happens even with much smaller sets in the area of 100+
elements.

> Why CHM?
> 
> CHM is extraordinarily fast at generating the extra table with a
> runtime of O(1).

Probalistic O(n), not O(1).

> Why not CHM?
> 
> CHM needs extra table space of 2.09 * n elements; this is larger
> compared to Jenkins' method.

This is misleading. It requires 2 * n elements for the construction
using two independent hash functions, it is ~1.24 * n elments for the
construction using three independent hash functions.

> CHM (in the implementation of cmph-2.0) is much more complicated
> than the code of Jenkins' hashing:
> h1 = f1(x) % N;
> h2 = f2(x) % N;
> if (h1 == h2 && ++h2 >= N)  h2 = 0;
> h = (Table[h1] + Table[h2]) % n;

The third line is strange -- that's normally just rejected during graph
construction. As such the normal output is:

table[M] = { ... }
h1 = f1(x) % M;
h2 = f2(x) % M;

return (table[h1] + table[h2]) % N;

with M and N being compile-time constants. As such, the modulo
operations will be replaced by two multiplications. Note that the two
table accesses are independent and parallizable.

The third alternative is BPZ, which would give a hash function like:
h1 = f1(x) % M;
h2 = f2(x) % M;
idx = (g[h1] + g[h2]) % 2
return idx == 0 ? h1 : h2;

The table load is a bit load here. This version would require 2 bit/key
for a perfect hash function. Bit load tends to be somewhat messier
though, so this is less attractive.

Joerg