[PATCH] D79467: [PDB] Optimize public symbol processing

Alexandre Ganea via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri May 8 12:20:22 PDT 2020


aganea added a comment.

Yes, I wanted to get back to that GHash parallelization at some point, but I'm swamped with deploying Clang in production and shipping some of our games. I was planning to get back to it eventually: D55585 <https://reviews.llvm.org/D55585> -- However, the plan was to first move all the type-related code from PDB.cpp to DebugTypes.cpp, to ease things a bit: D59226 <https://reviews.llvm.org/D59226> -- this still needs to be completed (steps 5-7).

D55585 <https://reviews.llvm.org/D55585> was only parallelizing the hashing, not the type merging. It would however be possible to do the hashing in parallel without dividing the keyspace, though I'm not sure yet what the best strategy is. Internally at Ubisoft we have a lock-free hashmap that has been in use for the past 12 years; it's very stable and well tested. We would be happy to open-source it or reimplement it in LLVM. However, it is lock-free, not wait-free, so I wanted to try it first (on the type merging).
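To illustrate why the hashing step needs no keyspace partitioning at all: each record's hash lands in its own preallocated slot, so workers never share mutable state. This is only a hypothetical sketch with plain std::thread and FNV-1a standing in for the real content hash, not the actual D55585 code.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in content hash (FNV-1a); the real implementation hashes CodeView
// type records, but any per-record hash works the same way here.
static uint64_t hashRecord(const std::vector<uint8_t> &Rec) {
  uint64_t H = 0xcbf29ce484222325ULL;
  for (uint8_t B : Rec) {
    H ^= B;
    H *= 0x100000001b3ULL;
  }
  return H;
}

// Hash all records in parallel. Each worker strides over the record list
// and writes into a distinct slot of the output vector, so there is no
// contention and no need to divide the keyspace.
std::vector<uint64_t>
hashAllParallel(const std::vector<std::vector<uint8_t>> &Records,
                unsigned NumThreads) {
  std::vector<uint64_t> Hashes(Records.size());
  std::vector<std::thread> Workers;
  for (unsigned T = 0; T < NumThreads; ++T)
    Workers.emplace_back([&, T] {
      for (size_t I = T; I < Records.size(); I += NumThreads)
        Hashes[I] = hashRecord(Records[I]); // disjoint writes, no locking
    });
  for (auto &W : Workers)
    W.join();
  return Hashes;
}
```

The contention only appears later, in the merging step, when those hashes are looked up in a shared table.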

The other strategy I was pursuing was a hashmap that is both lock-free and wait-free. But that requires atomic operations on fixed 64-bit (or 128-bit) buckets (key + value). Two 64-bit atomics are also doable, I suppose, if the target architecture doesn't have 128-bit atomics. The big problem then is resizing the hashmap, but there again there could be a lock-free solution: re-hashing in one thread while the other threads are still inserting into the old hashmap. It's tricky, but there's prior art in this domain.
I think this solution would scale better than our current lock-free hashmap, which requires spinlocks when inserting nodes. Spinning can give time slices back to the kernel in the form of Sleep() or SuspendThread(), and that is a bad thing IMHO. I don't see that scaling well past the hundred-core mark; only an atomic hashmap could work, in my view (if we want to plan ahead for the next decade).

Another subject is how to scale this kind of algorithm across NUMA nodes. Any operation that crosses a CPU-socket or NUMA boundary is very expensive. This may require a deferred strategy, where operations are synchronized in bulk, not independently, across NUMA boundaries. Again, this is a hot topic in my view, and I don't know whether there's active research there (aside from decades of MPI knowledge in the supercomputing world, which could perhaps apply). If we build a parallel type merging today that won't scale well in two years on future EPYC parts, that would be a pity. I don't know, maybe it's worth doing it until it doesn't scale anymore?

If you feel like modifying or landing any of the patches above, feel free! If not, I'll eventually get back to them.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79467/new/

https://reviews.llvm.org/D79467
