[PATCH] D27155: Merge strings using concurrent hash map (3rd try!)

Sean Silva via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Nov 29 22:58:19 PST 2016


silvas added a comment.

Sorry for the delay.



================
Comment at: ELF/Concurrent.cpp:53
+      size_t Cnt = Counter.fetch_add(256);
+      if (Cnt * 4 > NumBuckets * 3)
+        IsTableFull = true;
----------------
I assume the idea here is to identify as early as possible whether we underestimated the number of buckets. I like this idea. Do you have any justification for the choice of numbers? Why is this better than just waiting for the table to become full? How much better? Should the check be 50%? (Knuth Vol 3 Sec 6.4 provides a good analysis for linearly probed open-addressed hash tables; the lookup costs will skyrocket after 75% load; is that your motivation? It would be good to have a comment)

It may be better to instead track the duplication factor (or an approximation thereof). That way, even if we end up full this time, the caller can ask for the estimated duplication factor and get it right next time with high probability. This will probably be enough to avoid any worry about an excessive number of retries (with the current 10x duplication factor assumption, we might need up to 4 retries with doubling like you have it now; with an estimate, we could make it max of 2 with high probability).
Another possibility is 

Also, I would like to see a comment justifying why it is deterministic whether or not IsTableFull is set to true. Note that if it is nondeterministic then LLD's output will be nondeterministic, so it is important to get this right.


================
Comment at: ELF/Concurrent.h:34
+// them in single thread. This concurrent hash table can uniquify them
+// in a few hundred milliseconds.
+//
----------------
Using how many cores?


================
Comment at: ELF/Concurrent.h:57
+// multiple threads claim buckets, claimed buckets are claimed, and
+// unclaimed buckets remain unclaimed. Therefore, by sorting buckets
+// for each streak, we can make the entire hash table deterministic.
----------------
Please mention that this is crucially dependent on the linear probing strategy. Also, I would add a scary comment on the loop that does the probing saying that it *must* be kept as linear probing.


https://reviews.llvm.org/D27155





More information about the llvm-commits mailing list