[PATCH] D27146: Merge strings using a probabilistic algorithm to reduce latency.

Sean Silva via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun Nov 27 14:19:23 PST 2016


silvas added a comment.

In https://reviews.llvm.org/D27146#606203, @ruiu wrote:

> Sean, thank you for the investigation and the numbers! I knew some of the numbers (that's why I could come up with this), but I just implemented it, set the threshold to 10%, and it just worked, so I didn't take a close look at each section's contents.
>
> One thing I'd like to note in addition to your results is that strings are already uniquified within each input section (for an obvious reason: compilers don't emit redundant strings). So the likelihood of identifying duplicate strings when picking random samples from input sections is larger than when picking random samples from 30 million individual strings. Instead of a single bag containing 30 million strings, we have approximately 1000 bags containing 30 million strings in total, and inside each bag all strings are unique. That's why I construct the small string table from random samples of input sections instead of random samples of section pieces.


Nice observation!
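
To make the sampling idea concrete, here is a minimal C++ sketch (my own illustration, not the code in this patch) of measuring how much cross-section duplication a random sample of whole input sections reveals; `Section`, `pieces`, and `estimateDuplicateRate` are hypothetical stand-ins for the real LLD data structures:

#include <algorithm>
#include <numeric>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a mergeable input section; in the real linker
// the pieces are already unique within each section.
struct Section {
  std::vector<std::string> pieces;
};

// Estimate how often strings repeat across sections by sampling whole
// sections (not individual pieces) and inserting their strings into one
// small table.
double estimateDuplicateRate(const std::vector<Section> &sections,
                             size_t sampleSize, std::mt19937 &rng) {
  std::vector<size_t> idx(sections.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::shuffle(idx.begin(), idx.end(), rng);
  sampleSize = std::min(sampleSize, idx.size());

  std::unordered_set<std::string> smallTable;
  size_t total = 0, dups = 0;
  for (size_t i = 0; i < sampleSize; ++i) {
    for (const std::string &p : sections[idx[i]].pieces) {
      ++total;
      if (!smallTable.insert(p).second)
        ++dups; // already seen in a previously sampled section
    }
  }
  return total ? double(dups) / double(total) : 0.0;
}

Because strings are already unique within a section, every hit in the small table really is a cross-section duplicate, which is why sampling whole sections gives a less noisy estimate than sampling individual pieces.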

> 
> 
>> One other question: how much of this speedup is due to the parallelizing, and how much due to the improved algorithm?
> 
> The speedup is entirely due to the parallelizing. If you run this with a single thread, you'd get the same latency as before. And that's a good property because this algorithm doesn't slow down the linker when available CPU resource is scarce.

An approach like this could also improve overall CPU time by reducing the time spent in hash table lookups: smaller hash tables are faster to look up, so we can bail out sooner for strings without duplicates.
Also, the observation that false negatives don't hurt could be used to speed up the hash table itself (though that would be quite a bit of work). I think your observation that string merging for -O1 can accept false negatives when detecting duplicates is the key idea; lots of different optimizations could be built on it.
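
As a rough illustration of that last point (my own sketch, nothing in this patch): once false negatives are acceptable, the table can be fixed-size with a bounded probe length, and simply giving up on a crowded bucket is safe, because a missed duplicate only costs output size, never correctness.

#include <functional>
#include <string>
#include <vector>

// Sketch of a lossy, fixed-capacity string set. A lookup that misses a
// string we failed to insert is a false negative; for -O1 string merging
// that just means the string is emitted twice. Assumes nonempty input
// strings (the empty string marks a free slot).
class LossyStringSet {
  std::vector<std::string> slots;
  static constexpr size_t MaxProbes = 4;

public:
  explicit LossyStringSet(size_t capacity) : slots(capacity) {}

  // Returns true if `s` was seen before. Inserts `s` if a free slot exists
  // within MaxProbes; otherwise gives up, disabling merging for this string.
  bool insertOrFind(const std::string &s) {
    size_t h = std::hash<std::string>{}(s);
    for (size_t i = 0; i < MaxProbes; ++i) {
      std::string &slot = slots[(h + i) % slots.size()];
      if (slot == s)
        return true; // duplicate found
      if (slot.empty()) {
        slot = s; // claim the free slot
        return false;
      }
    }
    return false; // bucket region full: accept a possible false negative
  }
};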

I was going to suggest using modified bloom filters that detect duplicates, but Anton beat me to it! (I didn't know such a thing already had a name: "counting bloom filters".)
If we already have hash values for each string, then doing the bloom filter lookup is pretty cheap, and per-file filters are easy to combine.
This talk is a good example of bloom filters applied to a similar problem (except that "input object files" are "documents" and "strings" are "terms"): https://www.youtube.com/watch?v=80LKF2qph6I
The good thing is that the final bloom filter can be constructed hierarchically in parallel. I'm not sure how useful they would be for this application, though: for string deduplication "most" strings are duplicates, so we need to fetch the offset anyway, and there is not much advantage in trying to filter out the cases where we *don't* need to do the offset lookup.
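
For completeness, a small counting-Bloom-filter sketch (again just my illustration, not code from this review): each string bumps k counters, a minimum counter of 1 means "definitely added at most once", >= 2 means "possibly duplicated", and per-file filters merge by element-wise addition, which is what makes hierarchical, parallel construction cheap. The hash derivation below is a placeholder for the string hashes the linker would already have.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class CountingBloomFilter {
  std::vector<uint8_t> counters;
  static constexpr int NumHashes = 3;

  // Placeholder hash mixing; a real version would reuse the precomputed
  // string hashes mentioned above.
  size_t slot(const std::string &s, int i) const {
    size_t h = std::hash<std::string>{}(s);
    return (h + size_t(i) * 0x9e3779b97f4a7c15ULL) % counters.size();
  }

public:
  explicit CountingBloomFilter(size_t size) : counters(size, 0) {}

  void add(const std::string &s) {
    for (int i = 0; i < NumHashes; ++i) {
      uint8_t &c = counters[slot(s, i)];
      if (c != UINT8_MAX) // saturating increment
        ++c;
    }
  }

  // False means "definitely added at most once"; true means "may be a dup"
  // (subject to the usual false-positive rate from collisions).
  bool maybeDuplicate(const std::string &s) const {
    uint8_t minCount = UINT8_MAX;
    for (int i = 0; i < NumHashes; ++i)
      minCount = std::min(minCount, counters[slot(s, i)]);
    return minCount >= 2;
  }

  // Element-wise saturating addition: per-file filters of the same size can
  // be merged pairwise, in parallel, in any order.
  void merge(const CountingBloomFilter &other) {
    for (size_t i = 0; i < counters.size(); ++i)
      counters[i] = uint8_t(
          std::min<unsigned>(UINT8_MAX, counters[i] + other.counters[i]));
  }
};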


https://reviews.llvm.org/D27146




