[PATCH] D47289: [scudo] Improve the scalability of the shared TSD model

Thu Jun 7 11:16:02 PDT 2018

cryptoad added a comment.

Here are some answers to Dmitry's requests:

- Regarding `getTSDAndLockSlow` and the division: pprof shows no significant time spent in the function outside of the `tryLock` and `lock` calls, so I think we are good here;
- Regarding the precedence: I tested a version where I dropped it entirely, and the results are mixed:
  - For Android's "improved" `memory_replay`: the version without precedence is faster in all cases, but we only have 2 caches on that specific platform (due to memory constraints compared to the default allocator);
  - For `rpc2-benchmark`: mostly similar numbers;
  - For `t-test1`: the version with precedence performs better in almost all situations. This benchmark also shows a slowdown as the number of TSDs scanned in the slow path grows, e.g. scanning 4 TSDs and falling back to a blocking lock if every `tryLock` failed performs better overall than scanning 32. The slowdown can be significant: with `t-test1 800 40 800000 100000`, it's 900s spent in allocation functions vs 1150s. The caveat is that this benchmark only does {de}allocations (& memset) and as such isn't very representative of "real" programs, but it is the one exercising the most contention on the caches.
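
To make the trade-off above concrete, here is a minimal standalone sketch of a bounded slow-path scan with a precedence fallback. It is not the code from the CL: the `Tsd` struct, `scanTsds`, the `ScanLimit` parameter and the caller-supplied `Now` timestamp are all hypothetical names and simplifications for illustration, and a spin flag stands in for the real mutex.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a shared TSD: a spin flag plus a "precedence"
// stamp recorded the first time a tryLock fails (0 = no recorded contention).
struct Tsd {
  std::atomic<bool> Locked{false};
  std::atomic<std::uint64_t> Precedence{0};

  bool tryLock(std::uint64_t Now) {
    if (!Locked.exchange(true, std::memory_order_acquire)) {
      Precedence.store(0, std::memory_order_relaxed);
      return true;
    }
    // Only the first failed attempt stamps the TSD; later ones keep it.
    std::uint64_t Expected = 0;
    Precedence.compare_exchange_strong(Expected, Now,
                                       std::memory_order_relaxed);
    return false;
  }

  void lock() {
    while (Locked.exchange(true, std::memory_order_acquire))
      std::this_thread::yield();
    Precedence.store(0, std::memory_order_relaxed);
  }

  void unlock() { Locked.store(false, std::memory_order_release); }
};

struct ScanResult {
  std::size_t Index; // TSD that was locked, or the candidate to block on
  bool Locked;       // true if a tryLock succeeded during the scan
};

// Scans at most ScanLimit TSDs starting at Start. Returns the first one that
// could be locked; otherwise returns the scanned TSD with the lowest non-zero
// precedence (i.e. the one whose contention was recorded earliest).
ScanResult scanTsds(Tsd *Tsds, std::size_t Count, std::size_t Start,
                    std::size_t ScanLimit, std::uint64_t Now) {
  std::size_t Candidate = Start;
  std::uint64_t Lowest = UINT64_MAX;
  const std::size_t Steps = ScanLimit < Count ? ScanLimit : Count;
  for (std::size_t I = 0; I < Steps; ++I) {
    const std::size_t Index = (Start + I) % Count;
    if (Tsds[Index].tryLock(Now))
      return {Index, true};
    const std::uint64_t P =
        Tsds[Index].Precedence.load(std::memory_order_relaxed);
    if (P != 0 && P < Lowest) {
      Lowest = P;
      Candidate = Index;
    }
  }
  return {Candidate, false};
}

// Slow path: bounded scan first, then a blocking lock on the candidate.
std::size_t getTsdAndLockSlow(Tsd *Tsds, std::size_t Count, std::size_t Start,
                              std::size_t ScanLimit, std::uint64_t Now) {
  const ScanResult R = scanTsds(Tsds, Count, Start, ScanLimit, Now);
  if (!R.Locked)
    Tsds[R.Index].lock();
  return R.Index;
}
```

In this sketch, `ScanLimit` is the knob corresponding to "scanning 4" vs "scanning 32": a larger value means more failed `tryLock` attempts under heavy contention before falling back to the blocking `lock`. Dropping the precedence would amount to always blocking on a fixed TSD (e.g. `Start`) once the scan fails.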

I can't seem to get a definitive answer overall, as both variants (with and without precedence) win in some situations and lose in others.
The only sure thing so far is that both are better than the current version.

I am open to suggestions or potential improvements; otherwise I'd keep the current version of the CL (and will address the review comments).

Repository:
  rCRT Compiler Runtime

https://reviews.llvm.org/D47289