[PATCH] D27152: Merge strings using sharded hash tables.

Sean Silva via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun Nov 27 14:58:51 PST 2016


silvas added a comment.

This is the way I was going to suggest making the probabilistic approach deterministic.

> Unlike the probabilistic algorithm, it degrades performance if the number of available CPU cores is smaller than N, because we now have more work to do in total than the original code.

Not exactly. You are assuming that there is no contention for the `Offset.fetch_add(Size);`; keep in mind that that atomic add can actually be as expensive as one of the hash table lookups. If the small string map covers, say, 95% of the strings, you are still spending 1/20 of your time contending with other cores (these are just ballpark numbers). Using an approach like the one in this patch, where each thread can operate truly independently (well, except for false sharing, as I note in the inline comments), is generally easier to reason about and can scale better.
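To make the contrast concrete, here is a minimal sketch of the two offset-assignment schemes (the names and shapes below are invented for illustration; this is not the LLD code):

    // Scheme A: every thread bumps one shared counter. Each insertion of a
    // new string does an atomic RMW on the same cache line, so under
    // contention a fetch_add can cost about as much as a hash table probe.
    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> SharedOffset{0};
    uint64_t assignShared(uint64_t Size) { return SharedOffset.fetch_add(Size); }

    // Scheme B (roughly what this patch does): each shard accumulates sizes
    // privately and shard base offsets are fixed up afterwards, so the hot
    // path touches only thread-local state.
    struct Shard {
      uint64_t Off = 0; // private to one thread, no atomics needed
      uint64_t assign(uint64_t Size) {
        uint64_t R = Off;
        Off += Size;
        return R;
      }
    };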

Also, the approach in this patch is an interesting contrast with respect to the map lookups: each map shard will generally be smaller, so lookups in the hash table are faster. On the other hand, the cores aren't sharing one map in the LLC, which the approach in https://reviews.llvm.org/D27146 does do.

Also, one thing that may be worth considering is reducing the overall number of times that we read all the strings from DRAM. Currently I can think of at least three places where we do it:

1. in MergeInputSection<ELFT>::splitStrings
2. in the string deduplication
3. when writing to the output

Ideally we could do some amount of this deduplication work (or all of it) in step 1, while we still have the string in cache right after hashing it.
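As a rough illustration of that idea (purely a sketch; the function, map type, and offset handling below are assumptions, not LLD's actual splitStrings code), the split loop could feed each piece into the dedup map while its bytes are still hot in cache:

    #include <cstdint>
    #include "llvm/ADT/DenseMap.h"
    #include "llvm/ADT/StringRef.h"

    // Hypothetical fused split + dedup pass over one NUL-terminated string
    // table. The string bytes were just scanned, so hashing and inserting
    // them here avoids re-reading them from DRAM in a later pass.
    void splitAndDedup(llvm::StringRef Data,
                       llvm::DenseMap<llvm::StringRef, uint64_t> &Map,
                       uint64_t &Off) {
      while (!Data.empty()) {
        size_t End = Data.find('\0');
        if (End == llvm::StringRef::npos)
          break; // assume well-formed input for this sketch
        llvm::StringRef S = Data.take_front(End + 1);
        if (Map.try_emplace(S, Off).second)
          Off += S.size(); // first occurrence: assign it an output offset
        Data = Data.drop_front(End + 1);
      }
    }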



================
Comment at: ELF/OutputSections.cpp:573
+      if (P.second) {
+        Piece.First = true;
+        Piece.OutputOff = Off;
----------------
If two cores try to update nearby section pieces, they will suffer false sharing. SectionPiece is only 16 bytes, so on x86-64-hosted LLD there are 4 of them per cache line; assuming the hashes are distributed well, it's not unlikely for two shards to end up writing to pieces in the same line.
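To spell out the arithmetic, here is a simplified stand-in for SectionPiece (not the real definition, just the layout that matters for false sharing):

    #include <cstdint>

    // Two 8-byte fields, so four pieces fit in one 64-byte cache line.
    struct PieceLike {
      uint64_t InputOff;  // stand-in for the packed input offset/flags
      uint64_t OutputOff; // written by whichever shard owns the piece's hash
    };
    static_assert(sizeof(PieceLike) == 16, "4 pieces per 64-byte cache line");

    // If shard A writes Pieces[i].OutputOff while shard B writes
    // Pieces[i+1].OutputOff, each store forces the other core's copy of the
    // line out of its cache, even though the data is logically disjoint.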

Even worse, there is a negative-feedback effect that prevents individual cores from running ahead, so they will have a hard time "spreading out" enough to stop stepping on each other's toes.
- Since they all visit the string tables in the same order, if one core gets ahead, it will start taking LLC misses when it moves on to a new string table, which slows it down and lets the others catch up with it.
- With three threads T1, T2, T3 where T1 is ahead of T2 which is ahead of T3, the following happens: T1 updates a section piece. This leaves a "landmine" for T2 to trip on (T2 has to fetch and invalidate T1's cache line when it tries to write). While T2 is tripping on it, T3 catches up with T2, and so on. The net result is that the cores won't easily spread out.

I can think of two solutions:
1. Have forEachPiece visit the input object files in a different order for each shard (this may hurt cache locality for the string values, though); a rough sketch follows below.
2. Do a locality-improving sort of the string pieces (even just a couple of levels of partitioning would be enough; it doesn't have to be perfectly sorted). It doesn't even have to be sort-like, really; it just needs to permute the entries in a way that avoids false sharing.
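A rough sketch of option 1 (forEachPieceRotated, the Pieces member, and the rotation formula are invented for illustration; the real forEachPiece may look quite different):

    #include <cstddef>
    #include <vector>

    // Each shard starts its walk over the input sections at a different
    // point, so two shards are rarely writing to neighbouring pieces (and
    // hence the same cache line) at the same time.
    template <class SectionT, class Callback>
    void forEachPieceRotated(std::vector<SectionT *> &Sections, size_t ShardId,
                             size_t NumShards, Callback Fn) {
      size_t N = Sections.size();
      size_t Start = N * ShardId / NumShards; // shard-specific starting point
      for (size_t I = 0; I != N; ++I)
        for (auto &Piece : Sections[(Start + I) % N]->Pieces)
          Fn(Piece);
    }

The caveat from point 1 still applies: rotating the visitation order means each shard reads the string data in a different order, which may hurt cache locality for the string values themselves.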


================
Comment at: ELF/OutputSections.cpp:591
+  // Add a shard starting offset to each section piece.
+  parallel_for_each(Sections.begin(), Sections.end(),
+                    [&](MergeInputSection<ELFT> *Sec) {
----------------
One interesting thing about this patch is that it shows that visiting all the pieces is not too expensive. This suggests that doing some amount of preprocessing on the pieces in order to improve locality for the later passes could be worth it.


https://reviews.llvm.org/D27152




