[cfe-dev] TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Tue Oct 27 09:13:25 PDT 2015

Hi Ismail and Dmitry,

I’ve been following this (and related) conversations.  We’ve had a recent discussion with Kostya about the issues with TLVs and interceptors on OS X, as I am also experimenting with porting TSan to work on OS X.

While I understand that this patch is about TLV destruction, I’m seeing a more imminent issue, which is the initialization of TLV during process startup and during new thread creation.  The issue is very similar, because TSan’s interceptors get called *before* TLVs are initialized by dyld, and by accessing cur_thread_placeholder the program either crashes or infinitely recurses.  Do you already have some workaround for this as well?

On which version of OS X are you running your tests?

Thanks,
Kuba

> On 26 Oct 2015, at 17:45, Dmitry Vyukov <dvyukov at google.com> wrote:
> 
> On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi
> <ismail.pazarbasi at gmail.com> wrote:
>>>>> Hi,
>>>>> 
>>>>> I've submitted some of the small patches to get TSan running on Mac OS
>>>>> X. Time for the real challenge is approaching, and that needs some
>>>>> discussion.
>>>>> 
>>>>> The biggest problem with TSan on Darwin is the way thread local
>>>>> variables are created and destroyed. TSan uses a thread-local variable
>>>>> called "cur_thread_placeholder" to store each thread's state. Thread
>>>>> locals are initialized lazily on Darwin.
>>>>> 
>>>>> Darwin dynamic linker uses pthread keys to maintain lifetime and
>>>>> storage for thread local variables. pthread keys are destroyed in the
>>>>> order of their construction, not the opposite direction; from lowest
>>>>> index to highest index. Per POSIX standard, this order is unspecified,
>>>>> but dyld code (and comments) explain the order. We want TSan's
>>>>> thread-locals to live until the end of execution, because we'll be
>>>>> accessing "cur_thread_placeholder" when thread exits. However, since
>>>>> TSan is initialized too early, it's taken down earlier than other
>>>>> thread locals. This is problematic, because destructor of these
>>>>> late-destroyed thread-locals may still call functions in TSan RTL
>>>>> (interposed functions or functions called from instrumented code, e.g.
>>>>> __tsan_func_entry). That either immediately crashes, because TSan is
>>>>> in invalid state or tries to "resurrect" TLV, which generally creates
>>>>> an infinite recursion, and end up with stack overflow.
>>>>> 
>>>>> As far as I understood, dynamic linker creates a pthread key acting
>>>>> like a class "destructor" (quoting to distinguish it from pthread's
>>>>> destructor). When dyld loads image, it creates another key, whose
>>>>> destructor function frees the storage. At thread exit, it will call
>>>>> destructor functions from first to last. Therefore, the first
>>>>> "destructor" destructor function will be called before "finalize"
>>>>> destructor. After finalize, storage is freed, but can be resurrected
>>>>> if address of TLV is taken.
>>>>> 
>>>>> I have found a workaround (or a solution -- you name it) that aims to
>>>>> keep early initialization behavior of TSan TLV keys, but postpone
>>>>> their destruction as much as possible. It seems like the only way to
>>>>> do this is to assign highest pthread key indices to dyld's TLV code
>>>>> when it allocates thread-local variables for TSan, and preserve their
>>>>> order. Because dyld will create keys in the right order; we just want
>>>>> to move those keys to the end of allocatable range. We have N pthread
>>>>> keys available in the range, and we need to allocate three keys at the
>>>>> highest possible indices, and order them. Two of these keys are used
>>>>> by TLV code (in dyld), and one is used internally in TSan. In this
>>>>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
>>>>> Destructor functions for these keys will be called in ascending order,
>>>>> and TSan will work as expected in this case.
>>>>> 
>>>>> TSan wasn't interposing pthread_key_create, but I had to highjack it
>>>>> on Darwin to achieve the behavior I explained. First, it identifies
>>>>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
>>>>> call "pthread_key_create_high" to allocate highest possible key
>>>>> indices. It's basically a 'while' loop until real pthread_key_create
>>>>> returns EAGAIN. It then frees all other keys it has previously
>>>>> allocated. This places TSan's TLV keys to the end of destructor list.
>>>>> There is one more thing this function does; it stores pointer to
>>>>> original destructor function (passed in to pthread_key_create), and
>>>>> replaces it with its own destructor function, which will call
>>>>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
>>>>> times to postpone destruction of TSan TLV even further (IIRC, this was
>>>>> necessary). The replacement destructor function then calls the actual
>>>>> destructor function.
>>>>> 
>>>>> As a side note, dyld uses libc++, and its loading the instrumented
>>>>> version. This didn't seem to be creating any problem so far. I didn't
>>>>> test this on a sufficiently large code base just yet. TSan tests
>>>>> generally pass, but they didn't have this "late-destroyed thread
>>>>> locals" problem I mentioned.
>>>>> 
>>>>> Does my approach sound wrong, flawed or too brittle? Is there any
>>>>> other way to order these events in a sensible way? Can you help me
>>>>> fixing some of the problems I mentioned below, especially the
>>>>> 'identify_caller' function?
>>>>> 
>>>>> The code is in a very dirty state now, so I refrain submitting it.
>>>>> But, no worries; there's always pseudo code! I'd like to share the key
>>>>> creation part, the rest of TSan patches are much simpler than this
>>>>> part.
>>>>> 
>>>>> Thanks,
>>>>> Ismail
>>>>> 
>>>>> #if SANITIZER_MAC
>>>>> enum { kPthHighIndexTlvInitializer = 0,
>>>>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
>>>>> kPthHighIndexTsanFinalize,
>>>>> kPthHighIndexCount };
>>>>> 
>>>>> // FIXME: This function is problematic. Can we avoid dladdr?
>>>>> static int identify_caller(uptr pc) {
>>>>>  if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>>>>    return kPthHighIndexTsanFinalize;
>>>>>  // There are 2 "hidden" callers
>>>>>  Dl_info dli;
>>>>>  if (!dladdr((const void*)pc, &dli))
>>>>>    return -1;
>>>>>  else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>>>>    return kPthHighIndexTlvInitializer;
>>>>>  else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>>>>    return kPthHighIndexTlvLoadNotification;
>>>>>  return -1;
>>>>> }
>>>>> 
>>>>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>>>>                 void (*dtor)(void*)) {
>>>>>  if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>>>>    SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>>>>    return REAL(pthread_key_create)(key, dtor);
>>>>>  }
>>>>>  const uptr caller_pc = GET_CALLER_PC();
>>>>>  const int caller_index = identify_caller(caller_pc);
>>>>>  return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>>>>                            : pthread_key_create_high(key, dtor, caller_index);
>>>>> }
>>>>> 
>>>>> // 'index' represents caller index so that we allocate them
>>>>> // in an order. We want to get the last N-index keys
>>>>> // in a sensible order.
>>>>> int __tsan::pthread_key_create_high(unsigned long *key,
>>>>>                                    void (*dtor)(void *), int index) {
>>>>>  int res = 0;
>>>>>  original_dtors[index] = dtor;
>>>>>  // Can be a function-local ulong[512]?
>>>>>  unsigned long lowest_key, highest_key;
>>>>> 
>>>>>  // The following 2 loops may terminate early or later
>>>>>  // depending on 'index'.
>>>>>  do
>>>>>    res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>>>>>  while (res != EAGAIN);
>>>>>  for (; lowest_key < highest_key; ++lowest_key) {
>>>>>    pthread_key_delete(lowest_key);
>>>>>  }
>>>>>  tlv_keys[index] = highest_key;
>>>>>  return res;
>>>>> }
>>>>> 
>>>>> // This will be called from _pthread_tsd_cleanup
>>>>> static void replacement_dtor(void *p) {
>>>>>  uptr cur_key;
>>>>>  unsigned index = kPthHighIndexCount;
>>>>>  // Can this make it through code review?
>>>>>  __asm__ __volatile__ ("" : "=b" (cur_key));
>>>>>  for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>>>>>    if (cur_key == tlv_keys[c]) {
>>>>>      index = c;
>>>>>      break;
>>>>>    }
>>>>>  }
>>>>>  if (--tls_dtor_counters[index] == 1) {
>>>>>    if (original_dtors[index])
>>>>>      original_dtors[index](p);
>>>>>    return;
>>>>>  }
>>>>>  pthread_setspecific(tlv_keys[index], p);
>>>>> }
>>>>> #endif
>>>> 
>>>> 
>>>> 
>>>> Hi Ismail,
>>>> 
>>>> I have a counter-proposal that I wanted to implement for some time,
>>>> but never actually get to it because it was low-priority.
>>>> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
>>>> postponing. There is other smart code that does the same and that also
>>>> breaks the assumption that tsan thread finalization runs last.
>>>> Another problem was with asan on mac. Essentially what you see, but in
>>>> asan it is easier to work around. We just ignore all memory accesses
>>>> that come after our thread destructor.
>>>> 
>>>> The idea that I had is relatively clean, reasonably
>>>> platform-independent and should solve _all_ problems of such kind once
>>>> and for all.
>>>> Namely: we execute asan/tsan/msan thread destructor when the thread
>>>> _join_ returns. At that point there is definitely no thread code
>>>> running. The high-level implementation plan: for joinable threads we
>>>> hook into join (we already intercept it); for detached threads we
>>>> start a helper background thread when thread function returns and join
>>>> the user thread from that helper thread. I suspect that actual
>>>> implementation will be messier than that: for example, a thread can
>>>> call pthread_detach in a pthread_specific destructor after we've
>>>> decided to _not_ start a helper thread; or a thread can call
>>>> pthread_exit and does not return from thread function.
>>>> 
>>>> What do you think about this idea?
>>> 
>>> Another implementation plan can be to always join all threads
>>> ourselves. Then in pthread_detach merely remember that the thread is
>>> detached, but don't call REAL(pthread_detach). Then when we join a
>>> thread check whether it is detached or not; if it is detached then we
>>> can discard all thread state; if it is non-detached then we need to
>>> remember that the thread has finished and its return value, wait for
>>> user pthread_join call and satisfy it from our state.
>>> 
>>> I don't know which plan will lead to simpler and more reliable implementation.
>>> 
>> I didn't think thoroughly about this case. This sounds easier to
>> implement, but I am not sure if it'll work correctly (i.e. will it
>> miss problems that happen after TSan TLV dies, or will it report
>> false-positives, because TSan TLV dies too early). Still, we need to
> 
> 
> Can we somehow get access to ThreadState from cur_thread() until we
> join the thread? That would be ideal and prevent false positives and
> negatives. The simplest way I can think of is to consult
> ThreadRegistriy using pthread_self() as the key to find current
> ThreadState if pthread_getspecific() returns NULL.
> 
> 
>> change a few things in TSan so that it tolerates missing
>> cur_thread_state. I forgot many details of TSan, I haven't been
>> hacking it for a long time.
>> 
>>> To make it clear, what I don't like in your proposal: it is somewhat
>>> platform-dependent, should be enabled only on mac (thus less tested)
>>> and does not solve all potential problems of such kind (e.g. on some
>>> other OS there can be some other magical way to run user code after
>>> tsan thread destructor).
>> I agree; it's best to avoid platform-specific workarounds.
>> 
>> Ismail

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20151027/af81e099/attachment.html>