[cfe-dev] TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Ismail Pazarbasi via cfe-dev cfe-dev at lists.llvm.org
Thu Oct 29 13:19:49 PDT 2015


Hi Kuba,

Given enough hacks, there's always a workaround! I'll try to cleanup
the messy parts and put my branch on github so that we can share the
code. I haven't posted these parts for review, so I haven't spent
serious effort to make it pretty. It's ugly, but it'll give you the
idea.

I've started on 10.9. It took a while for Cisco IT to give the green
light for 10.10 upgrade, then it took a while for me to upgrade,
because I didn't want any glitch before my sanitizers presentation in
April 2014. I haven't tested on 10.11 yet, because Cisco IT gave green
light for the upgrade just a few days ago.

TL;DR; set a flag when dyld tells you that TLV has been created, and
use that flag in interceptor macros. Likewise, reset that flag when
dyld tells you that it's been deallocated. Note that there were some
interceptors (pthread_mutex_lock, I think, and some C++ ABI functions)
that needed explicit checks for Mac.

1. For Mac, I've made a function named InitCurThreadDarwin, and
__tsan_init calls __tsan::InitCurThreadDarwin()
void __tsan_init() {
#if SANITIZER_MAC
  __tsan::InitCurThreadDarwin();
#endif
  Initialize(cur_thread());
}

2. We'll create a pthread key that indicates that TSan's TLV has been
allocated, and TSan is ready to work. Interceptors will call
TsanIsTlsReady function (on Mac). This prevents interceptors to
intercept anything before TSan's TLV is initialized. As you can see,
it relies on pthread_key_create_high, which I explained in the first
email.

static unsigned long tls_init_key;
static pthread_once_t once_control = PTHREAD_ONCE_INIT;

void InitCurThreadDarwin() {
  pthread_once(&once_control, []{
    __tsan::pthread_key_create_high(&tls_init_key, nullptr, 2);
  });
  // Register handler first; will be called once TLS is allocated
  // FIXME: Must be done only once per thread!
  dyld_register_tlv_state_change_handler(
      dyld_tlv_state_allocated,
      ^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
        tls_state_notify(state, info);
      });
  dyld_register_tlv_state_change_handler(
      dyld_tlv_state_deallocated,
      ^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
        tls_state_notify(state, info);
      });

  // Taking address will force it to to be initialized.
  volatile void *tmp = &cur_thread_placeholder;
  (void)tmp;
  // enumerate TLS slots, if we are not initialized yet.
  if (!TsanTlsReady()) {
    dyld_enumerate_tlv_storage(
        ^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
          tls_state_notify(state, info);
        });
  }
}

3. TsanIsTlsReady function:
bool TsanIsTlsReady() {
  return !tls_init_key ?
    false : pthread_getspecific(tls_init_key) != nullptr;
}

4. Some macros!
#define ENSURE_TSAN_INITED() do { \
CHECK(!tsan_init_is_running); \
if (UNLIKELY(!tsan_inited)) { \
  TsanInitFromRtl(); \
} \
} while (0)

#define COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED \
  (!tsan_inited || ! TsanIsTlsReady() || !cur_thread()->is_inited)

5. dyld will call this function when TSan's TLV gets initialized.
Watch your eyes!
static void tls_state_notify(dyld_tlv_states state, const dyld_tlv_info *info) {
  const bool is_allocated = state == dyld_tlv_state_allocated;
  if (!is_allocated) {
    // If this key is freed before thread_finalize, ensure we stop it now.
    uptr AlignedSize = RoundUpToAlignment(sizeof(ThreadState), 64);
    // FIXME: Is this working by chance?
    uptr A = (uptr)((char *)info->tlv_addr) + info->info_size;
    A = RoundUpToAlignment(A, 64);
    // assert( that this is the pointer we're looking for -- but how? )
    // FIXME: Can pthread_self() be used as a key to lookup original address
    // of &cur_thread_placeholder so that we can compare it with 'A'?
    ThreadState *thr = reinterpret_cast<ThreadState *>(A);
    TsanForceFinalize(thr);
    pthread_setspecific(tls_init_key, nullptr);
    return;
  }
  const bool V = is_placeholder_addr(info);
  if (!pthread_getspecific(tls_init_key) && V)
    pthread_setspecific(tls_init_key, (void *)is_allocated);
}
// FIXME: Ensure that this function is not called after TLV is freed!
// Taking the address will resurrect it, and infinitely recurse.
ALWAYS_INLINE static bool is_placeholder_addr(const dyld_tlv_info *info) {
  return (info->tlv_addr <= &cur_thread_placeholder) &&
         (char *)&cur_thread_placeholder <
             ((char *)info->tlv_addr + info->tlv_size);
}

I hope that this takes a much better shape, and lands to trunk soon! I
would be very happy, if you could subscribe me to your related
patches.

Ismail

On Tue, Oct 27, 2015 at 5:13 PM, Kuba Brecka <jbrecka at apple.com> wrote:
> Hi Ismail and Dmitry,
>
> I’ve been following this (and related) conversations.  We’ve had a recent
> discussion with Kostya about the issues with TLVs and interceptors on OS X,
> as I am also experimenting with porting TSan to work on OS X.
>
> While I understand that this patch is about TLV destruction, I’m seeing a
> more imminent issue, which is the initialization of TLV during process
> startup and during new thread creation.  The issue is very similar, because
> TSan’s interceptors get called *before* TLVs are initialized by dyld, and by
> accessing cur_thread_placeholder the program either crashes or infinitely
> recurses.  Do you already have some workaround for this as well?
>
> On which version of OS X are you running your tests?
>
> Thanks,
> Kuba
>
>
> On 26 Oct 2015, at 17:45, Dmitry Vyukov <dvyukov at google.com> wrote:
>
> On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi
> <ismail.pazarbasi at gmail.com> wrote:
>
> Hi,
>
> I've submitted some of the small patches to get TSan running on Mac OS
> X. Time for the real challenge is approaching, and that needs some
> discussion.
>
> The biggest problem with TSan on Darwin is the way thread local
> variables are created and destroyed. TSan uses a thread-local variable
> called "cur_thread_placeholder" to store each thread's state. Thread
> locals are initialized lazily on Darwin.
>
> Darwin dynamic linker uses pthread keys to maintain lifetime and
> storage for thread local variables. pthread keys are destroyed in the
> order of their construction, not the opposite direction; from lowest
> index to highest index. Per POSIX standard, this order is unspecified,
> but dyld code (and comments) explain the order. We want TSan's
> thread-locals to live until the end of execution, because we'll be
> accessing "cur_thread_placeholder" when thread exits. However, since
> TSan is initialized too early, it's taken down earlier than other
> thread locals. This is problematic, because destructor of these
> late-destroyed thread-locals may still call functions in TSan RTL
> (interposed functions or functions called from instrumented code, e.g.
> __tsan_func_entry). That either immediately crashes, because TSan is
> in invalid state or tries to "resurrect" TLV, which generally creates
> an infinite recursion, and end up with stack overflow.
>
> As far as I understood, dynamic linker creates a pthread key acting
> like a class "destructor" (quoting to distinguish it from pthread's
> destructor). When dyld loads image, it creates another key, whose
> destructor function frees the storage. At thread exit, it will call
> destructor functions from first to last. Therefore, the first
> "destructor" destructor function will be called before "finalize"
> destructor. After finalize, storage is freed, but can be resurrected
> if address of TLV is taken.
>
> I have found a workaround (or a solution -- you name it) that aims to
> keep early initialization behavior of TSan TLV keys, but postpone
> their destruction as much as possible. It seems like the only way to
> do this is to assign highest pthread key indices to dyld's TLV code
> when it allocates thread-local variables for TSan, and preserve their
> order. Because dyld will create keys in the right order; we just want
> to move those keys to the end of allocatable range. We have N pthread
> keys available in the range, and we need to allocate three keys at the
> highest possible indices, and order them. Two of these keys are used
> by TLV code (in dyld), and one is used internally in TSan. In this
> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
> Destructor functions for these keys will be called in ascending order,
> and TSan will work as expected in this case.
>
> TSan wasn't interposing pthread_key_create, but I had to highjack it
> on Darwin to achieve the behavior I explained. First, it identifies
> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
> call "pthread_key_create_high" to allocate highest possible key
> indices. It's basically a 'while' loop until real pthread_key_create
> returns EAGAIN. It then frees all other keys it has previously
> allocated. This places TSan's TLV keys to the end of destructor list.
> There is one more thing this function does; it stores pointer to
> original destructor function (passed in to pthread_key_create), and
> replaces it with its own destructor function, which will call
> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
> times to postpone destruction of TSan TLV even further (IIRC, this was
> necessary). The replacement destructor function then calls the actual
> destructor function.
>
> As a side note, dyld uses libc++, and its loading the instrumented
> version. This didn't seem to be creating any problem so far. I didn't
> test this on a sufficiently large code base just yet. TSan tests
> generally pass, but they didn't have this "late-destroyed thread
> locals" problem I mentioned.
>
> Does my approach sound wrong, flawed or too brittle? Is there any
> other way to order these events in a sensible way? Can you help me
> fixing some of the problems I mentioned below, especially the
> 'identify_caller' function?
>
> The code is in a very dirty state now, so I refrain submitting it.
> But, no worries; there's always pseudo code! I'd like to share the key
> creation part, the rest of TSan patches are much simpler than this
> part.
>
> Thanks,
> Ismail
>
> #if SANITIZER_MAC
> enum { kPthHighIndexTlvInitializer = 0,
> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
> kPthHighIndexTsanFinalize,
> kPthHighIndexCount };
>
> // FIXME: This function is problematic. Can we avoid dladdr?
> static int identify_caller(uptr pc) {
>  if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>    return kPthHighIndexTsanFinalize;
>  // There are 2 "hidden" callers
>  Dl_info dli;
>  if (!dladdr((const void*)pc, &dli))
>    return -1;
>  else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>    return kPthHighIndexTlvInitializer;
>  else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>    return kPthHighIndexTlvLoadNotification;
>  return -1;
> }
>
> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>                 void (*dtor)(void*)) {
>  if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>    SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>    return REAL(pthread_key_create)(key, dtor);
>  }
>  const uptr caller_pc = GET_CALLER_PC();
>  const int caller_index = identify_caller(caller_pc);
>  return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>                            : pthread_key_create_high(key, dtor,
> caller_index);
> }
>
> // 'index' represents caller index so that we allocate them
> // in an order. We want to get the last N-index keys
> // in a sensible order.
> int __tsan::pthread_key_create_high(unsigned long *key,
>                                    void (*dtor)(void *), int index) {
>  int res = 0;
>  original_dtors[index] = dtor;
>  // Can be a function-local ulong[512]?
>  unsigned long lowest_key, highest_key;
>
>  // The following 2 loops may terminate early or later
>  // depending on 'index'.
>  do
>    res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>  while (res != EAGAIN);
>  for (; lowest_key < highest_key; ++lowest_key) {
>    pthread_key_delete(lowest_key);
>  }
>  tlv_keys[index] = highest_key;
>  return res;
> }
>
> // This will be called from _pthread_tsd_cleanup
> static void replacement_dtor(void *p) {
>  uptr cur_key;
>  unsigned index = kPthHighIndexCount;
>  // Can this make it through code review?
>  __asm__ __volatile__ ("" : "=b" (cur_key));
>  for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++)
> {
>    if (cur_key == tlv_keys[c]) {
>      index = c;
>      break;
>    }
>  }
>  if (--tls_dtor_counters[index] == 1) {
>    if (original_dtors[index])
>      original_dtors[index](p);
>    return;
>  }
>  pthread_setspecific(tlv_keys[index], p);
> }
> #endif
>
>
>
>
> Hi Ismail,
>
> I have a counter-proposal that I wanted to implement for some time,
> but never actually get to it because it was low-priority.
> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
> postponing. There is other smart code that does the same and that also
> breaks the assumption that tsan thread finalization runs last.
> Another problem was with asan on mac. Essentially what you see, but in
> asan it is easier to work around. We just ignore all memory accesses
> that come after our thread destructor.
>
> The idea that I had is relatively clean, reasonably
> platform-independent and should solve _all_ problems of such kind once
> and for all.
> Namely: we execute asan/tsan/msan thread destructor when the thread
> _join_ returns. At that point there is definitely no thread code
> running. The high-level implementation plan: for joinable threads we
> hook into join (we already intercept it); for detached threads we
> start a helper background thread when thread function returns and join
> the user thread from that helper thread. I suspect that actual
> implementation will be messier than that: for example, a thread can
> call pthread_detach in a pthread_specific destructor after we've
> decided to _not_ start a helper thread; or a thread can call
> pthread_exit and does not return from thread function.
>
> What do you think about this idea?
>
>
> Another implementation plan can be to always join all threads
> ourselves. Then in pthread_detach merely remember that the thread is
> detached, but don't call REAL(pthread_detach). Then when we join a
> thread check whether it is detached or not; if it is detached then we
> can discard all thread state; if it is non-detached then we need to
> remember that the thread has finished and its return value, wait for
> user pthread_join call and satisfy it from our state.
>
> I don't know which plan will lead to simpler and more reliable
> implementation.
>
> I didn't think thoroughly about this case. This sounds easier to
> implement, but I am not sure if it'll work correctly (i.e. will it
> miss problems that happen after TSan TLV dies, or will it report
> false-positives, because TSan TLV dies too early). Still, we need to
>
>
>
> Can we somehow get access to ThreadState from cur_thread() until we
> join the thread? That would be ideal and prevent false positives and
> negatives. The simplest way I can think of is to consult
> ThreadRegistriy using pthread_self() as the key to find current
> ThreadState if pthread_getspecific() returns NULL.
>
>
> change a few things in TSan so that it tolerates missing
> cur_thread_state. I forgot many details of TSan, I haven't been
> hacking it for a long time.
>
> To make it clear, what I don't like in your proposal: it is somewhat
> platform-dependent, should be enabled only on mac (thus less tested)
> and does not solve all potential problems of such kind (e.g. on some
> other OS there can be some other magical way to run user code after
> tsan thread destructor).
>
> I agree; it's best to avoid platform-specific workarounds.
>
> Ismail
>
>



More information about the cfe-dev mailing list