[PATCH] D27155: Merge strings using concurrent hash map (3rd try!)

Fri Dec 9 21:01:08 PST 2016

Rui Ueyama via Phabricator via llvm-commits
<llvm-commits at lists.llvm.org> writes:

> ruiu added a comment.
>
> I'm struggling to improve single-core performance of this patch. It scales well, but its single-core performance sucks. This is a table of the link time of clang with debug info (unit is seconds). As you can see, you need at least 4 cores to take advantage of this patch.

What is the number of cache misses?

Given

 +  size_t NumPieces = 0;
 +  for (MergeInputSection<ELFT> *Sec : Sections)
 +    NumPieces += Sec->Pieces.size();
 +  ParallelBuilder =
 +      new ParallelStringTableBuilder(NumPieces / 2, StringAlignment);

I expect this table to be enormous. Also, why is it valid to divide by 2?

> We cannot make the linker use this algorithm only when it detects 4 or more cores because the choice of algorithm affects the layout of mergeable output sections. We want to get deterministic outputs for the same input regardless of how many processors are available on a computer.

What would be the slowdown from fully sorting the table?

Out of curiosity, have you tried something that works by divide and
conquer? It should be interesting to try a parallel_sort followed by
std::unique since there are already good implementations of that.
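
To make the sort-then-unique idea concrete, here is a minimal sketch.
It is not code from the patch: uniqueStrings is a hypothetical helper,
it works on std::string rather than on the section pieces, and
std::sort stands in for a parallel sort. The result depends only on the
ordering, so it would be the same for any number of threads.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not from the patch): deduplicate strings by
// sorting them and then dropping adjacent duplicates.
std::vector<std::string> uniqueStrings(std::vector<std::string> Pieces) {
  // std::sort is a stand-in here; a parallel sort using the same
  // comparison would produce the same sequence.
  std::sort(Pieces.begin(), Pieces.end());
  // std::unique keeps the first element of each run of equal elements
  // and returns the new logical end; erase trims the leftover tail.
  Pieces.erase(std::unique(Pieces.begin(), Pieces.end()), Pieces.end());
  return Pieces;
}
```

The output is fully sorted, so this would also answer the determinism
concern, at the cost of the sort itself.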

Last but not least, I would still suggest checking how many strings .dwo
avoids copying.

Cheers,
Rafael