[PATCH] D31947: [scudo] Android support groundwork

Tue Apr 18 04:37:56 PDT 2017

dvyukov added a comment.

First of all, do you have any benchmarks? Single-threaded, multi-threaded with/without contention. How do profiles look like?

================
Comment at: lib/scudo/scudo_allocator.cpp:317
+                 bool ForceZeroContents = false) {
+    initThreadMaybe();
+    if (UNLIKELY(!IsPowerOfTwo(Alignment))) {
----------------
cryptoad wrote:
> eugenis wrote:
> > cryptoad wrote:
> > > eugenis wrote:
> > > > Calling initThreadMaybe() from all entry points looks error-prone. Is it ok (performance-wise) to do lazy initialization in getCurrentThreadContext?
> > > Things like BackendAllocator.ReturnNullOrDieOnBadRequest will require initialization, but are called prior to getCurrentThreadContext.
> > > I am also actually not super happy about the way this turned out. The alternative would be a preinit or equivalent?
> > Ideally access to the allocator cache should go through ThreadLocalContext as well. Maybe.
> > 
> > What do you mean by preinit?
> Access to the local cache & quarantine cache go through the thread context.
> The issue I think is more about the global init, which is currently called via a pthread_once in the thread init.
> So realistically, the global init doesn't happen until a thread has been initialized.
> Unfortunately, that means all the options for the backend haven't been parsed yet if we wait to get the thread context (unless I move it up the code).
> I was thinking like in ASan, where it could be initialized once and for all via the preinit_array?
Agree, this does not look good if we care about performance. We could use preinit_array (or some other android alternative), but we also need to init thread (do we?). So we could piggy-back on thread init check. Namely:

    ThreadContext* GetThreadContext() {
       ThreadContext* ctx = __get_tls()[TLS_TSAN_SLOT];
       if (LIKELY(ctx))
         return ctx; // implies that thread and global context were initialized
       return GetThreadContextSlow();
    }

    ThreadContext* GetThreadContextSlow() {
       init global context once;
       create and init thread context;
       store it into TLS_TSAN_SLOT and return;
    }

If we call GetThreadContext() once in each interface function and then explicitly propagate the object pointer, we don't do any excessive work at all.
This needs special care if some interface functions can be called from signal handlers.

================
Comment at: lib/scudo/scudo_thread_android.cpp:35
+#ifndef SCUDO_N_CONTEXTS
+# define SCUDO_N_CONTEXTS 4
+#endif
----------------
Humm... is the idea that you build scudo for each particular device knowing number of cores?
We can get severe performance drop if 2 or more active threads share the same context. We need to avoid contention at all costs. I am thinking of the following scheme.
 - obtain number of cores at runtime
 - create NCORES (or 2*NCORES) contexts
 - allow threads to dynamically re-associate with contexts to resolve contention
The re-association algorithm can be along the following lines. Remember the last used contenxt. Try to TryLock it. If TryLock succeeds, use it. If TryLock fails, try to find another unlocked context. If all contexts are locked, find the one that was unused for the longest time. The usage accounting can be done with only 1 additional check/store on fast path. Namely:

  struct Context {
    Mutex mu;
    atomic_uint64_t unused_since;
    ...
  };

  bool Context::TryLock() {
    if (mu.TryLock()) {
      atomic_store_relaxed(&unused_since, 0);
      return true;
    }
    if (atomic_load_relaxed(&unused_since) == 0)
      atomic_store_relaxed(&unused_since, now());
    return false;
  }

I think this should provide pretty good contention resolution with minimal cost and reasonable complexity.

https://reviews.llvm.org/D31947