[llvm-dev] Concurrent Hashmap?

Wols Lists via llvm-dev llvm-dev at lists.llvm.org
Tue Apr 13 09:35:35 PDT 2021


On 12/04/21 21:45, Reid Kleckner wrote:
> On Fri, Apr 9, 2021 at 12:01 PM antlists via llvm-dev
> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> 
>     Could you do a dynamic hash instead? In other words grow the hash
>     instead of starting over - what's going to be the cost of retrying as
>     opposed to growing?
> 
> 
> At a really high level, I believe starting the insertion over has the
> same algorithmic complexity as rehashing with the partially filled
> table, it's all O(n). Eventually you build a table of size n, and along
> the way you have to build a table of size n/2, n/4, n/8..., and if you
> sum up all the work to build those tables, you get 2n, or O(n).

Hmmm ... looking at it like that, dynamic hashing, although the maths is
completely different, comes to the same result:

cost( 1->2 ) == cost( 2->3 ) == cost( 3->4 )

and going direct, cost( 1->4 ) is the same as summing the previous three
(because in practice it IS the previous three).
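
To put toy numbers on it (my own back-of-envelope, not a measurement):
if the final table holds 1024 entries and we doubled our way there, the
intermediate tables cost 512 + 256 + 128 + ... + 1 = 1023 insertions'
worth of work on top of the final 1024 - roughly 2n in total, which is
the 2n above. Growing in place instead of restarting pays the same
geometric series, just spread about differently.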
> 
> At a lower level, for common inputs, rehashing could be faster by a
> constant factor. Suppose there are n input symbols and m unique output
> symbols, and m is significantly smaller than n. Rehashing only takes
> O(m) time, but reinserting takes O(n) time. The overall work is still
> O(n). However, rehashing requires a much fancier concurrent data
> structure, probably with higher constant factor overhead. You may gain
> as much as you lose. And you open yourself up to concurrency bugs. Even
> if you take a tested concurrent data structure off the shelf, they can
> be tricky to use correctly.

Well, provided we're happy with a colliding read/write returning either
the pre- or post-write state indeterminately, dynamic hashing scores
here - I think I can easily spec an N-reader/1-writer (NR&1W)
implementation. I might need more help implementing it, seeing as I'm a
FORTRAN/C guy, not a C++ guy, but I pick things up quickly.
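
Something along these lines is roughly what I have in mind - a toy
sketch only, with names and types I've made up for illustration, not
anything that exists in lld, and with no resizing yet. One thread
inserts; any number of threads look up concurrently, and an in-flight
insert is simply either visible or not:

#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy sketch of an "N readers, 1 writer" chained hash table. All names
// here are mine, purely for illustration.
struct Node {
  std::string Key;
  int Value;
  Node *Next; // never modified once the node is published
};

class NR1WMap {
  std::vector<std::atomic<Node *>> Buckets;

public:
  explicit NR1WMap(std::size_t NumBuckets) : Buckets(NumBuckets) {
    for (auto &B : Buckets)
      B.store(nullptr, std::memory_order_relaxed);
  }

  // Writer thread only: build the node fully, then publish it with a
  // release store so readers that see it also see its contents.
  void insert(std::string Key, int Value) {
    std::size_t I = std::hash<std::string>{}(Key) % Buckets.size();
    Node *N = new Node{std::move(Key), Value,
                       Buckets[I].load(std::memory_order_relaxed)};
    Buckets[I].store(N, std::memory_order_release);
  }

  // Any thread, lock-free: a colliding insert is either visible or
  // not, which is exactly the "pre or post state" behaviour above.
  const Node *lookup(const std::string &Key) const {
    std::size_t I = std::hash<std::string>{}(Key) % Buckets.size();
    for (const Node *N = Buckets[I].load(std::memory_order_acquire);
         N; N = N->Next)
      if (N->Key == Key)
        return N;
    return nullptr;
  }
};

The only ordering that matters is the release on publish pairing with
the acquire on lookup; everything else is single-threaded from the
writer's point of view.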
> 
> So, at least for PDB global type hashing, I felt it was better to
> implement a less general data structure that only does bulk inserts
> followed by lookups. I also knew how many insertions I was going to do,
> so I could also conservatively overallocate a big hash table and
> sidestep rehashing. We can't quite do the same for symbols because
> symbol resolution can discover new inputs.

Well, that's the thing: dynamic hashing can grow with minimal cost.
About the only thing I can think of that might require a global lock
(and we might well be able to get round that) is expanding the bucket
table, which would effectively be "just" a realloc.

For anything else, the table could grow in the background, locking at
most one bucket at a time and allowing reads to proceed completely
unhindered.
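
As a very rough sketch of that growth step (my names again, and I've
deliberately left out both the writer's normal insert path during the
migration window and the question of when it's safe to free the old
nodes): the writer allocates the doubled table up front - that's the
"realloc" - then copies one old bucket's chain at a time and advances
an atomic watermark; readers pick which table to look in from the
watermark and never block:

#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy sketch of background growth under the same single-writer model.
// Illustrative names only; reclaiming superseded nodes safely is a
// separate problem (epochs, hazard pointers, ...) and is omitted.
struct GNode {
  std::string Key;
  int Value;
  GNode *Next;
};

class GrowingMap {
  std::vector<std::atomic<GNode *>> Old, New; // New is the bigger table
  std::atomic<std::size_t> Watermark{0}; // old buckets [0, Watermark) moved

  static std::size_t hashOf(const std::string &K) {
    return std::hash<std::string>{}(K);
  }

public:
  explicit GrowingMap(std::size_t OldSize) : Old(OldSize), New(OldSize * 2) {
    for (auto &B : Old) B.store(nullptr, std::memory_order_relaxed);
    for (auto &B : New) B.store(nullptr, std::memory_order_relaxed);
  }

  // Writer only, one bucket per call: copy old bucket I's chain into
  // the new table, then publish progress with a release store.
  void migrateBucket(std::size_t I) {
    for (GNode *N = Old[I].load(std::memory_order_relaxed); N; N = N->Next) {
      std::size_t J = hashOf(N->Key) % New.size();
      New[J].store(new GNode{N->Key, N->Value,
                             New[J].load(std::memory_order_relaxed)},
                   std::memory_order_release);
    }
    Watermark.store(I + 1, std::memory_order_release);
  }

  // Any thread: decide which table holds this key's bucket, then walk
  // the chain exactly as before - no locks, no waiting.
  const GNode *lookup(const std::string &Key) const {
    std::size_t H = hashOf(Key);
    bool Moved = (H % Old.size()) < Watermark.load(std::memory_order_acquire);
    const auto &Table = Moved ? New : Old;
    for (const GNode *N = Table[H % Table.size()].load(std::memory_order_acquire);
         N; N = N->Next)
      if (N->Key == Key)
        return N;
    return nullptr;
  }
};

In practice the writer would interleave migrateBucket() calls with its
normal work, so the growth really does happen in the background without
ever stalling a reader.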

Cheers,
Wol


