[PATCH] D27146: Merge strings using a probabilistic algorithm to reduce latency.

Sean Silva via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun Nov 27 04:00:53 PST 2016


silvas added a comment.

I looked more closely at the data (an LLD self-link, which includes LLVM for LTO and so is reasonably large). I don't think there will be an easy static heuristic. The bulk of the strings are symbol names, and the duplication factor is a complicated function of how the input program is structured (e.g. whether a particular class is a common utility used in many places).

Also, one interesting thing is that the total size is not dominated by strings with any particular number of duplicates; it is spread fairly evenly across many discrete duplicate counts.

Here is a plot. The horizontal axis is the number of duplicates. The vertical axis is the number of bytes represented by strings with that number of duplicates. Area under the curve is proportional to total string size for the non-deduplicated output.
https://reviews.llvm.org/F2622328

Another useful thing to look at is that same plot, but "integrated": https://reviews.llvm.org/F2622335

This makes it easier to see why the approach in this patch works well. About 80% of string mass is present in strings which are duplicated more than 50 times.
There are approximately 750 files in the link. So if we sample 10% of them (75 files) at random, a string present in 50 distinct files is missed with probability (700 / 750) * (699 / 749) * (698 / 748) * ... * (626 / 676) ~= 0.4% (this is the hypergeometric "no hits" probability, C(700, 75) / C(750, 75)); strings duplicated more than 50 times are even less likely to be missed, so this approach saves almost all of that 80% of string mass.
(This is just to build intuition; of course, some strings replicated fewer than 50 times will be deduplicated too, which only increases the savings.)
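As a sanity check of the sampling arithmetic above, here is a small, hypothetical script (not part of the patch; the file counts are the approximate ones quoted in this comment) that computes the exact miss probability via the same telescoping product:

```python
# Sketch: probability that a uniform random sample of `sampled` files out of
# `total` input files misses *all* `copies` files containing a given
# duplicated string. This is C(total - copies, sampled) / C(total, sampled),
# computed as the product (total-copies-i)/(total-i) for i = 0..sampled-1.
from fractions import Fraction

def miss_probability(total, sampled, copies):
    p = Fraction(1)
    for i in range(sampled):
        p *= Fraction(total - copies - i, total - i)
    return p

# Numbers from the comment above: ~750 files, sample 10% (75), string in 50 files.
p = miss_probability(750, 75, 50)
print(float(p))  # well under 1%, so almost no 50-way-duplicated strings are missed
```

Strings duplicated more than 50 times only drive this probability lower, which is why sampling a small fraction of the inputs still captures nearly all of the high-duplication string mass.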


https://reviews.llvm.org/D27146
