[cfe-dev] TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Mon Oct 26 17:45:19 PDT 2015

On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi
<ismail.pazarbasi at gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I've submitted some of the small patches to get TSan running on Mac OS
>>>> X. Time for the real challenge is approaching, and that needs some
>>>> discussion.
>>>>
>>>> The biggest problem with TSan on Darwin is the way thread local
>>>> variables are created and destroyed. TSan uses a thread-local variable
>>>> called "cur_thread_placeholder" to store each thread's state. Thread
>>>> locals are initialized lazily on Darwin.
>>>>
>>>> Darwin dynamic linker uses pthread keys to maintain lifetime and
>>>> storage for thread local variables. pthread keys are destroyed in the
>>>> order of their construction, not the opposite direction; from lowest
>>>> index to highest index. Per POSIX standard, this order is unspecified,
>>>> but dyld code (and comments) explain the order. We want TSan's
>>>> thread-locals to live until the end of execution, because we'll be
>>>> accessing "cur_thread_placeholder" when thread exits. However, since
>>>> TSan is initialized too early, it's taken down earlier than other
>>>> thread locals. This is problematic, because destructor of these
>>>> late-destroyed thread-locals may still call functions in TSan RTL
>>>> (interposed functions or functions called from instrumented code, e.g.
>>>> __tsan_func_entry). That either immediately crashes, because TSan is
>>>> in invalid state or tries to "resurrect" TLV, which generally creates
>>>> an infinite recursion, and end up with stack overflow.
>>>>
>>>> As far as I understood, dynamic linker creates a pthread key acting
>>>> like a class "destructor" (quoting to distinguish it from pthread's
>>>> destructor). When dyld loads image, it creates another key, whose
>>>> destructor function frees the storage. At thread exit, it will call
>>>> destructor functions from first to last. Therefore, the first
>>>> "destructor" destructor function will be called before "finalize"
>>>> destructor. After finalize, storage is freed, but can be resurrected
>>>> if address of TLV is taken.
>>>>
>>>> I have found a workaround (or a solution -- you name it) that aims to
>>>> keep early initialization behavior of TSan TLV keys, but postpone
>>>> their destruction as much as possible. It seems like the only way to
>>>> do this is to assign highest pthread key indices to dyld's TLV code
>>>> when it allocates thread-local variables for TSan, and preserve their
>>>> order. Because dyld will create keys in the right order; we just want
>>>> to move those keys to the end of allocatable range. We have N pthread
>>>> keys available in the range, and we need to allocate three keys at the
>>>> highest possible indices, and order them. Two of these keys are used
>>>> by TLV code (in dyld), and one is used internally in TSan. In this
>>>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
>>>> Destructor functions for these keys will be called in ascending order,
>>>> and TSan will work as expected in this case.
>>>>
>>>> TSan wasn't interposing pthread_key_create, but I had to highjack it
>>>> on Darwin to achieve the behavior I explained. First, it identifies
>>>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
>>>> call "pthread_key_create_high" to allocate highest possible key
>>>> indices. It's basically a 'while' loop until real pthread_key_create
>>>> returns EAGAIN. It then frees all other keys it has previously
>>>> allocated. This places TSan's TLV keys to the end of destructor list.
>>>> There is one more thing this function does; it stores pointer to
>>>> original destructor function (passed in to pthread_key_create), and
>>>> replaces it with its own destructor function, which will call
>>>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
>>>> times to postpone destruction of TSan TLV even further (IIRC, this was
>>>> necessary). The replacement destructor function then calls the actual
>>>> destructor function.
>>>>
>>>> As a side note, dyld uses libc++, and its loading the instrumented
>>>> version. This didn't seem to be creating any problem so far. I didn't
>>>> test this on a sufficiently large code base just yet. TSan tests
>>>> generally pass, but they didn't have this "late-destroyed thread
>>>> locals" problem I mentioned.
>>>>
>>>> Does my approach sound wrong, flawed or too brittle? Is there any
>>>> other way to order these events in a sensible way? Can you help me
>>>> fixing some of the problems I mentioned below, especially the
>>>> 'identify_caller' function?
>>>>
>>>> The code is in a very dirty state now, so I refrain submitting it.
>>>> But, no worries; there's always pseudo code! I'd like to share the key
>>>> creation part, the rest of TSan patches are much simpler than this
>>>> part.
>>>>
>>>> Thanks,
>>>> Ismail
>>>>
>>>> #if SANITIZER_MAC
>>>> enum { kPthHighIndexTlvInitializer = 0,
>>>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
>>>> kPthHighIndexTsanFinalize,
>>>> kPthHighIndexCount };
>>>>
>>>> // FIXME: This function is problematic. Can we avoid dladdr?
>>>> static int identify_caller(uptr pc) {
>>>>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>>>     return kPthHighIndexTsanFinalize;
>>>>   // There are 2 "hidden" callers
>>>>   Dl_info dli;
>>>>   if (!dladdr((const void*)pc, &dli))
>>>>     return -1;
>>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>>>     return kPthHighIndexTlvInitializer;
>>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>>>     return kPthHighIndexTlvLoadNotification;
>>>>   return -1;
>>>> }
>>>>
>>>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>>>                  void (*dtor)(void*)) {
>>>>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>>>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>>>     return REAL(pthread_key_create)(key, dtor);
>>>>   }
>>>>   const uptr caller_pc = GET_CALLER_PC();
>>>>   const int caller_index = identify_caller(caller_pc);
>>>>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>>>                             : pthread_key_create_high(key, dtor, caller_index);
>>>> }
>>>>
>>>> // 'index' represents caller index so that we allocate them
>>>> // in an order. We want to get the last N-index keys
>>>> // in a sensible order.
>>>> int __tsan::pthread_key_create_high(unsigned long *key,
>>>>                                     void (*dtor)(void *), int index) {
>>>>   int res = 0;
>>>>   original_dtors[index] = dtor;
>>>>   // Can be a function-local ulong[512]?
>>>>   unsigned long lowest_key, highest_key;
>>>>
>>>>   // The following 2 loops may terminate early or later
>>>>   // depending on 'index'.
>>>>   do
>>>>     res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>>>>   while (res != EAGAIN);
>>>>   for (; lowest_key < highest_key; ++lowest_key) {
>>>>     pthread_key_delete(lowest_key);
>>>>   }
>>>>   tlv_keys[index] = highest_key;
>>>>   return res;
>>>> }
>>>>
>>>> // This will be called from _pthread_tsd_cleanup
>>>> static void replacement_dtor(void *p) {
>>>>   uptr cur_key;
>>>>   unsigned index = kPthHighIndexCount;
>>>>   // Can this make it through code review?
>>>>   __asm__ __volatile__ ("" : "=b" (cur_key));
>>>>   for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>>>>     if (cur_key == tlv_keys[c]) {
>>>>       index = c;
>>>>       break;
>>>>     }
>>>>   }
>>>>   if (--tls_dtor_counters[index] == 1) {
>>>>     if (original_dtors[index])
>>>>       original_dtors[index](p);
>>>>     return;
>>>>   }
>>>>   pthread_setspecific(tlv_keys[index], p);
>>>> }
>>>> #endif
>>>
>>>
>>>
>>> Hi Ismail,
>>>
>>> I have a counter-proposal that I wanted to implement for some time,
>>> but never actually get to it because it was low-priority.
>>> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
>>> postponing. There is other smart code that does the same and that also
>>> breaks the assumption that tsan thread finalization runs last.
>>> Another problem was with asan on mac. Essentially what you see, but in
>>> asan it is easier to work around. We just ignore all memory accesses
>>> that come after our thread destructor.
>>>
>>> The idea that I had is relatively clean, reasonably
>>> platform-independent and should solve _all_ problems of such kind once
>>> and for all.
>>> Namely: we execute asan/tsan/msan thread destructor when the thread
>>> _join_ returns. At that point there is definitely no thread code
>>> running. The high-level implementation plan: for joinable threads we
>>> hook into join (we already intercept it); for detached threads we
>>> start a helper background thread when thread function returns and join
>>> the user thread from that helper thread. I suspect that actual
>>> implementation will be messier than that: for example, a thread can
>>> call pthread_detach in a pthread_specific destructor after we've
>>> decided to _not_ start a helper thread; or a thread can call
>>> pthread_exit and does not return from thread function.
>>>
>>> What do you think about this idea?
>>
>> Another implementation plan can be to always join all threads
>> ourselves. Then in pthread_detach merely remember that the thread is
>> detached, but don't call REAL(pthread_detach). Then when we join a
>> thread check whether it is detached or not; if it is detached then we
>> can discard all thread state; if it is non-detached then we need to
>> remember that the thread has finished and its return value, wait for
>> user pthread_join call and satisfy it from our state.
>>
>> I don't know which plan will lead to simpler and more reliable implementation.
>>
> I didn't think thoroughly about this case. This sounds easier to
> implement, but I am not sure if it'll work correctly (i.e. will it
> miss problems that happen after TSan TLV dies, or will it report
> false-positives, because TSan TLV dies too early). Still, we need to

Can we somehow get access to ThreadState from cur_thread() until we
join the thread? That would be ideal and prevent false positives and
negatives. The simplest way I can think of is to consult
ThreadRegistriy using pthread_self() as the key to find current
ThreadState if pthread_getspecific() returns NULL.

> change a few things in TSan so that it tolerates missing
> cur_thread_state. I forgot many details of TSan, I haven't been
> hacking it for a long time.
>
>> To make it clear, what I don't like in your proposal: it is somewhat
>> platform-dependent, should be enabled only on mac (thus less tested)
>> and does not solve all potential problems of such kind (e.g. on some
>> other OS there can be some other magical way to run user code after
>> tsan thread destructor).
> I agree; it's best to avoid platform-specific workarounds.
>
> Ismail