[PATCH] D47289: [scudo] Improve the scalability of the shared TSD model
Dmitry Vyukov via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat May 26 01:28:09 PDT 2018
dvyukov added inline comments.
================
Comment at: lib/scudo/scudo_tsd_shared.cpp:71
if (NumberOfTSDs > 1) {
- // Go through all the contexts and find the first unlocked one.
- for (u32 i = 0; i < NumberOfTSDs; i++) {
- TSD = &TSDs[i];
- if (TSD->tryLock()) {
- setCurrentTSD(TSD);
- return TSD;
+ u32 RandState = static_cast<u32>(TSD->getPrecedence());
+ const u32 R = Rand(&RandState);
----------------
The rationale from the change description for why precedence is used as the rand state should be duplicated as a comment here; without it, this is confusing.
================
Comment at: lib/scudo/scudo_tsd_shared.cpp:73
+ const u32 R = Rand(&RandState);
+ const u32 Inc = CoPrimes[R % NumberOfCoPrimes];
+ u32 Index = R % NumberOfTSDs;
----------------
Do you see getTSDAndLockSlow in profiles? There are two divisions here, and divisions are expensive. If they show up in profiles, we could replace them with a multiplication and a shift.
================
Comment at: lib/scudo/scudo_tsd_shared.cpp:78
+ // Go randomly through at most 4 contexts and find a candidate.
+ for (u32 I = 0; I < Min(4U, NumberOfTSDs); I++) {
+ if (&TSDs[Index] == TSD)
----------------
Have you tried dropping precedence entirely and just going over all TSDs pseudo-randomly once (or maybe twice), tryLock'ing each of them? If all tryLock's fail, we could then lock the old one, or maybe a random one.
That would shave a few cycles off the malloc/free fast paths. The benefit of the precedence scheme is still unproven, right?
Also, if we have a hundred cores/caches, giving up after just 4 can lead to unnecessary blocking (there could still be dozens of available caches).
Repository:
rCRT Compiler Runtime
https://reviews.llvm.org/D47289