[PATCH] D27155: Merge strings using concurrent hash map (3rd try!)

Rui Ueyama via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 30 12:55:30 PST 2016


ruiu added inline comments.


================
Comment at: ELF/Concurrent.cpp:24
+    : Alignment(Alignment),
+      NumBuckets(std::max<size_t>(PowerOf2Ceil(EstimatedNumEntries * 2), 1024)),
+      Counter(0) {
----------------
silvas wrote:
> Why double it?
> 
> Also, you have a similar std::max calculation in MergeOutputSection<ELFT>::finalize. Do you need it in both places?
Because my target load factor is 0.5. I chose this number because

 - I don't want to exceed 0.75, the point at which an open-addressing hash table gets much slower.
 - Even if the table can contain all the given strings, a long streak of occupied buckets would make the hash table slower because of std::sort.

In order to tune this formula, I need to run it against various programs, but that is not so important at the moment. I'll do that later.
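
For concreteness, here is a minimal sketch of how the 0.5 target load factor turns into the bucket-count formula above (PowerOf2Ceil comes from llvm/Support/MathExtras.h; the standalone function and its name are just for illustration, not part of the patch):

  #include "llvm/Support/MathExtras.h"
  #include <algorithm>
  #include <cstddef>

  // Doubling the estimated number of entries keeps the expected load factor
  // near 0.5, well below the 0.75 point where open addressing degrades.
  // The 1024 floor avoids creating tiny tables for small inputs.
  static size_t numBucketsFor(size_t EstimatedNumEntries) {
    return std::max<size_t>(llvm::PowerOf2Ceil(EstimatedNumEntries * 2), 1024);
  }

  // E.g. an estimate of 3000 entries yields 8192 buckets (load factor ~0.37).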


================
Comment at: ELF/Concurrent.cpp:26
+      Counter(0) {
+  Buckets = (EntryTy *)calloc(NumBuckets, sizeof(EntryTy));
+}
----------------
silvas wrote:
> Can you use a std::unique_ptr<EntryTy[]> to manage this?
Changed to use std::vector.


================
Comment at: ELF/Concurrent.cpp:53
+      size_t Cnt = Counter.fetch_add(256);
+      if (Cnt * 4 > NumBuckets * 3)
+        IsTableFull = true;
----------------
silvas wrote:
> I assume the idea here is to identify as early as possible whether we underestimated the number of buckets. I like this idea. Do you have any justification for the choice of numbers? Why is this better than just waiting for the table to become full? How much better? Should the check be 50%? (Knuth Vol 3 Sec 6.4 provides a good analysis for linearly probed open-addressed hash tables; the lookup costs will skyrocket after 75% load; is that your motivation? It would be good to have a comment)
> 
> It may be better to instead track the duplication factor (or an approximation thereof). That way, even if we end up full this time, the caller can ask for the estimated duplication factor and get it right next time with high probability. This will probably be enough to avoid any worry about an excessive number of retries (with the current 10x duplication factor assumption, we might need up to 4 retries with doubling like you have it now; with an estimate, we could make it max of 2 with high probability).
> Another possibility is 
> 
> Also, I would like to see a comment justifying why it is deterministic whether or not IsTableFull is set to true. Note that if it is nondeterministic then LLD's output will be nondeterministic, so it is important to get this right.
I do not have precise reasoning for these questions, but I can give my gut feelings. There should be room to optimize this, but we can do that later.

 - Why is it 0.75 and not 0.5?

I think a load factor of 0.5 is too early to give up. When we give up, we need to create a new table and restart, so we should continue using the same table until it approaches a much higher load. 0.75 seems like a good cutoff because beyond that point it gets really slow.

 - We should keep track of the duplication factor, shouldn't we?

We would if we could. But currently we stop adding items to the hash table once it gets "full", so after that point we don't know whether the remaining strings are duplicates or not.


I added a comment about the load factor and simplified the code for isFull(). I hope this makes things clearer.
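
For reference, a rough sketch of the batched counter and the 0.75 cutoff described above (the standalone struct and member names are assumptions for illustration, not the patch itself):

  #include <atomic>
  #include <cstddef>

  struct TableSketch {
    size_t NumBuckets = 1 << 20;
    std::atomic<size_t> Counter{0};
    std::atomic<bool> IsTableFull{false};

    // Each worker calls this once per 256 insertions rather than per string,
    // so the shared counter is only an approximation of the real load.
    void noteBatchInserted() {
      size_t Cnt = Counter.fetch_add(256);
      // Declare the table full once roughly 3/4 of the buckets are claimed
      // (Cnt / NumBuckets > 0.75), because open-addressing probes get long
      // beyond that load factor; callers then retry with a larger table.
      if (Cnt * 4 > NumBuckets * 3)
        IsTableFull = true;
    }
  };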


================
Comment at: ELF/Concurrent.cpp:73
+  }
+  IsTableFull = true;
+}
----------------
silvas wrote:
> I think this technically needs to be atomic to avoid races.
Yeah, I was tempted to say that this is a benign race, but according to the experts there is no such thing as a "benign race". Fixed.
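
In other words, if several worker threads can set the flag, a plain bool store is a data race and therefore undefined behavior. A minimal illustration of the kind of fix (names are illustrative):

  #include <atomic>

  std::atomic<bool> IsTableFull{false};

  // Safe to call concurrently from any number of threads; with a plain bool
  // the same store would be a data race and therefore undefined behavior.
  void markTableFull() { IsTableFull = true; }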


================
Comment at: ELF/OutputSections.cpp:564
+      break;
+    Estimated *= 2;
+  }
----------------
silvas wrote:
> I'm very concerned that there may be user scenarios where we end up almost always needing multiple trips through this loop (e.g., maybe "most" non-debug builds end up with only 2 duplicates on average?). How will we determine if this is the case? Otherwise, this "optimization" may backfire and users in the wild will get slower links; we need to have some feedback loop to correct this if it is the case, or be really confident that this optimization will always speed things up.
> 
> I would propose that we do a run of Poudriere or Debian or Gentoo with this patch applied and treat the resize case as a fatal error (and ideally the error message would also give the number of resizes). That way we can get an idea of how common this is.
> 
I added code here to get a better estimate. We now keep track of an approximate number N of successfully inserted items. If the table fills up with N < NumPieces, only a fraction N / NumPieces of the pieces were inserted, so we need to enlarge the table by the inverse factor, namely NumPieces / N.
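
A hedged sketch of the retry loop this describes (BuildTable and its return convention are assumptions for illustration, not the actual MergeOutputSection code; the initial 10x duplication guess comes from the discussion above):

  #include <algorithm>
  #include <cstddef>
  #include <functional>

  // BuildTable(Estimated) tries to insert all NumPieces strings into a table
  // sized for Estimated unique entries. It returns the approximate number of
  // items inserted before the table filled up, or NumPieces on success.
  // (Hypothetical signature, for illustration only.)
  size_t chooseEstimate(size_t NumPieces,
                        const std::function<size_t(size_t)> &BuildTable) {
    size_t Estimated = std::max<size_t>(NumPieces / 10, 1); // assume ~10x dedup
    for (;;) {
      size_t N = BuildTable(Estimated);
      if (N >= NumPieces)
        return Estimated; // everything fit
      N = std::max<size_t>(N, 1);
      // Only a fraction N / NumPieces of the pieces fit, so grow the estimate
      // by the inverse factor NumPieces / N instead of blindly doubling.
      Estimated = Estimated * NumPieces / N;
    }
  }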


https://reviews.llvm.org/D27155




