[cfe-dev] Fwd: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Wed Oct 7 15:00:57 PDT 2015

Oops, I've sent it to the old list address!

---------- Forwarded message ----------
From: Ismail Pazarbasi <ismail.pazarbasi at gmail.com>
Date: Wed, Oct 7, 2015 at 11:58 PM
Subject: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV
To: Kostya Serebryany <kcc at google.com>, Dmitry Vyukov
<dvyukov at google.com>, Alexey Samsonov <vonosmas at gmail.com>, Anna Zaks
<ganna at apple.com>, Kuba Brecka <jbrecka at apple.com>
Cc: "cfe-dev at cs.uiuc.edu" <cfe-dev at cs.uiuc.edu>

Hi,

I've submitted some of the small patches to get TSan running on Mac OS
X. The time for the real challenge is approaching, and that needs some
discussion.

The biggest problem with TSan on Darwin is the way thread local
variables are created and destroyed. TSan uses a thread-local variable
called "cur_thread_placeholder" to store each thread's state. Thread
locals are initialized lazily on Darwin.

The Darwin dynamic linker uses pthread keys to manage the lifetime and
storage of thread-local variables. pthread key destructors run in the
order the keys were created, from lowest index to highest, not in
reverse. Per the POSIX standard this order is unspecified, but the dyld
code (and its comments) explain the order. We want TSan's thread locals
to live until the end of execution, because we'll be accessing
"cur_thread_placeholder" when a thread exits. However, since TSan is
initialized so early, its thread locals are taken down earlier than the
others. This is problematic, because the destructors of these
late-destroyed thread locals may still call functions in the TSan RTL
(interposed functions or functions called from instrumented code, e.g.
__tsan_func_entry). That call either crashes immediately, because TSan
is in an invalid state, or tries to "resurrect" the TLV, which
generally causes infinite recursion and ends in a stack overflow.
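
To see the destructor order on a given platform, a small probe along
these lines may help. It is only a sketch: every name in it is made up
for illustration, it is not TSan or dyld code, and since POSIX leaves
the order unspecified, the result is an observation about one
implementation rather than a guarantee.

```cpp
#include <pthread.h>
#include <cassert>
#include <vector>

// Hypothetical names throughout; none of this is TSan or dyld code.
static std::vector<int> g_dtor_order;  // tags, in the order destructors ran
static int g_tags[2] = {0, 1};
static pthread_key_t g_first_key, g_second_key;

static void record_dtor(void *p) {
  // Only the exiting thread's values are destroyed, and the parent joins
  // before reading the vector, so no extra locking is needed here.
  g_dtor_order.push_back(*static_cast<int *>(p));
}

static void *thread_body(void *) {
  // Set the values in the opposite order of key creation, so that the
  // order of the pthread_setspecific calls is not what gets observed.
  pthread_setspecific(g_second_key, &g_tags[1]);
  pthread_setspecific(g_first_key, &g_tags[0]);
  return nullptr;
}

// Runs one thread and returns the tags in the order their destructors fired.
std::vector<int> observe_dtor_order() {
  g_dtor_order.clear();
  int rc = pthread_key_create(&g_first_key, record_dtor);
  assert(rc == 0);
  rc = pthread_key_create(&g_second_key, record_dtor);
  assert(rc == 0);
  pthread_t t;
  rc = pthread_create(&t, nullptr, thread_body, nullptr);
  assert(rc == 0);
  pthread_join(t, nullptr);
  pthread_key_delete(g_first_key);
  pthread_key_delete(g_second_key);
  return g_dtor_order;
}
```

Tag 0 belongs to the key created first. If the returned vector starts
with 0, destructors ran in creation order on that system.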

As far as I understand, the dynamic linker creates a pthread key that
acts like a class "destructor" (quoted to distinguish it from
pthread's destructor). When dyld loads an image, it creates another
key, whose destructor function frees the storage. At thread exit, the
destructor functions are called from first to last. Therefore, the
destructor function of the "destructor" key is called before the
"finalize" destructor. After finalize, the storage is freed, but it
can be resurrected if the address of the TLV is taken.

I have found a workaround (or a solution -- you name it) that aims to
keep early initialization behavior of TSan TLV keys, but postpone
their destruction as much as possible. It seems like the only way to
do this is to assign highest pthread key indices to dyld's TLV code
when it allocates thread-local variables for TSan, and preserve their
order. dyld already creates the keys in the right order; we just want
to move them to the end of the allocatable range. We have N pthread
keys available in the range, and we need to allocate three keys at the
highest possible indices, and order them. Two of these keys are used
by TLV code (in dyld), and one is used internally in TSan. In this
case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
Destructor functions for these keys will be called in ascending order,
and TSan will work as expected in this case.
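
A minimal, standalone sketch of landing one key at the top of the
range, outside of TSan, could look like the following. The helper name
create_highest_key is hypothetical, and the sketch assumes that
pthread_key_t is an integer index that can be compared, which POSIX
does not promise. It also assumes no other thread creates or deletes
keys while it runs.

```cpp
#include <pthread.h>
#include <cassert>
#include <vector>

// Hypothetical helper, not the TSan patch itself. Assumes pthread_key_t is
// an integer index, which holds on common platforms but is not required by
// POSIX.
int create_highest_key(pthread_key_t *out, void (*dtor)(void *)) {
  std::vector<pthread_key_t> keys;
  pthread_key_t k;
  int rc;
  // Drain the key table. POSIX specifies EAGAIN once no key is left; any
  // other error also stops the loop.
  while ((rc = pthread_key_create(&k, dtor)) == 0)
    keys.push_back(k);
  if (keys.empty())
    return rc;  // nothing could be allocated at all
  // Keep the largest index seen and release everything else.
  pthread_key_t highest = keys[0];
  for (pthread_key_t c : keys)
    if (c > highest) highest = c;
  for (pthread_key_t c : keys)
    if (c != highest) pthread_key_delete(c);
  *out = highest;
  return 0;
}
```

Calling it repeatedly hands out keys from the top of the range
downwards, so the call order decides which caller ends up with which of
the last indices.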

TSan wasn't interposing pthread_key_create, but I had to hijack it on
Darwin to achieve the behavior I explained. First, the interceptor
identifies the caller. If it's one of dyld's TLV (Thread Local
Variable) functions, I call "pthread_key_create_high" to allocate the
highest possible key index. It's basically a 'while' loop that runs
until the real pthread_key_create returns EAGAIN. It then frees all the
other keys it has allocated along the way. This places TSan's TLV keys
at the end of the destructor list. The function does one more thing: it
stores the pointer to the original destructor function (passed in to
pthread_key_create) and replaces it with its own destructor function,
which calls 'pthread_setspecific' on the key
PTHREAD_DESTRUCTOR_ITERATIONS - 1 times to postpone destruction of the
TSan TLV even further (IIRC, this was necessary). The replacement
destructor function then calls the actual destructor function.
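
The postponing trick can be tried in isolation. The sketch below is an
illustration with hypothetical names (PostponeState, postponing_dtor,
run_postponed_dtor), not the code from my patch: the destructor re-arms
its key with pthread_setspecific a fixed number of times and only then
performs the real cleanup.

```cpp
#include <pthread.h>
#include <limits.h>  // PTHREAD_DESTRUCTOR_ITERATIONS
#include <cassert>

// Hypothetical per-thread bookkeeping, for illustration only.
struct PostponeState {
  pthread_key_t key;
  int postpones_left;  // how many more times to re-arm the key
  int dtor_calls;      // how often the destructor was entered
  int final_cleanups;  // how often the real cleanup ran
};

static void postponing_dtor(void *p) {
  PostponeState *s = static_cast<PostponeState *>(p);
  ++s->dtor_calls;
  if (s->postpones_left > 0) {
    // Storing a non-NULL value again makes the pthread library run this
    // destructor once more in its next cleanup round.
    --s->postpones_left;
    pthread_setspecific(s->key, s);
    return;
  }
  ++s->final_cleanups;  // stands in for calling the original destructor
}

static void *postpone_thread(void *p) {
  PostponeState *s = static_cast<PostponeState *>(p);
  pthread_setspecific(s->key, s);
  return nullptr;
}

// Runs one thread whose key destructor postpones itself, then reports the
// counters collected during that thread's exit.
PostponeState run_postponed_dtor() {
  PostponeState s = {};
  int rc = pthread_key_create(&s.key, postponing_dtor);
  assert(rc == 0);
  s.postpones_left = PTHREAD_DESTRUCTOR_ITERATIONS - 1;
  pthread_t t;
  rc = pthread_create(&t, nullptr, postpone_thread, &s);
  assert(rc == 0);
  pthread_join(t, nullptr);
  pthread_key_delete(s.key);
  return s;
}
```

Re-arming PTHREAD_DESTRUCTOR_ITERATIONS - 1 times leaves exactly one
guaranteed round for the real cleanup. Re-arming more often than that
would risk the library giving up before the cleanup runs.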

As a side note, dyld uses libc++, and it's loading the instrumented
version. This hasn't caused any problems so far, but I haven't tested
it on a sufficiently large code base yet. TSan tests generally pass,
but they don't have the "late-destroyed thread locals" problem I
mentioned.

Does my approach sound wrong, flawed or too brittle? Is there any
other way to order these events in a sensible way? Can you help me
fix some of the problems flagged in the code below, especially the
'identify_caller' function?

The code is in a very dirty state now, so I'm refraining from
submitting it. But, no worries; there's always pseudo code! I'd like
to share the key creation part; the rest of the TSan patches are much
simpler than this part.

Thanks,
Ismail

#if SANITIZER_MAC
enum {
  kPthHighIndexTlvInitializer = 0,
  kPthHighIndexTlvLoadNotification,
  kPthHighIndexTsanKeyCtor,
  kPthHighIndexTsanFinalize,
  kPthHighIndexCount
};

// FIXME: This function is problematic. Can we avoid dladdr?
static int identify_caller(uptr pc) {
  if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
    return kPthHighIndexTsanFinalize;
  // There are 2 "hidden" callers
  Dl_info dli;
  // dladdr can succeed without finding a symbol, leaving dli_sname NULL.
  if (!dladdr((const void*)pc, &dli) || !dli.dli_sname)
    return -1;
  else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
    return kPthHighIndexTlvInitializer;
  else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
    return kPthHighIndexTlvLoadNotification;
  return -1;
}

TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
                 void (*dtor)(void*)) {
  if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
    SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
    return REAL(pthread_key_create)(key, dtor);
  }
  const uptr caller_pc = GET_CALLER_PC();
  const int caller_index = identify_caller(caller_pc);
  return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
                            : pthread_key_create_high(key, dtor, caller_index);
}

// 'index' represents caller index so that we allocate them
// in an order. We want to get the last N-index keys
// in a sensible order.
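int __tsan::pthread_key_create_high(unsigned long *key,
                                    void (*dtor)(void *), int index) {
  original_dtors[index] = dtor;
  // Can be a function-local ulong[512]?
  unsigned long lowest_key = 0, highest_key = 0, tmp_key;
  bool have_key = false;

  // The following 2 loops may terminate early or later
  // depending on 'index'.
  // Allocate keys until the table is exhausted (EAGAIN), remembering the
  // first and the last key that were actually created.
  while (REAL(pthread_key_create)(&tmp_key, replacement_dtor) == 0) {
    if (!have_key)
      lowest_key = tmp_key;
    highest_key = tmp_key;
    have_key = true;
  }
  if (!have_key)
    return EAGAIN;
  // Free every temporary key below the highest one. This assumes the keys
  // were handed out as consecutive, ascending indices.
  for (; lowest_key < highest_key; ++lowest_key)
    pthread_key_delete(lowest_key);
  tlv_keys[index] = highest_key;
  *key = highest_key;
  return 0;
}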

// This will be called from _pthread_tsd_cleanup
static void replacement_dtor(void *p) {
  uptr cur_key;
  unsigned index = kPthHighIndexCount;
  // Can this make it through code review?
  __asm__ __volatile__ ("" : "=b" (cur_key));
  for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
    if (cur_key == tlv_keys[c]) {
      index = c;
      break;
    }
  }
  if (index == kPthHighIndexCount)
    return;  // not one of our keys; avoid indexing past the arrays
  if (--tls_dtor_counters[index] == 1) {
    if (original_dtors[index])
      original_dtors[index](p);
    return;
  }
  pthread_setspecific(tlv_keys[index], p);
}
#endif