[cfe-dev] TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Thu Oct 8 01:41:56 PDT 2015

On Thu, Oct 8, 2015 at 10:20 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
> On Wed, Oct 7, 2015 at 11:58 PM, Ismail Pazarbasi
> <ismail.pazarbasi at gmail.com> wrote:
>> Hi,
>>
>> I've submitted some of the small patches to get TSan running on Mac OS
>> X. Time for the real challenge is approaching, and that needs some
>> discussion.
>>
>> The biggest problem with TSan on Darwin is the way thread local
>> variables are created and destroyed. TSan uses a thread-local variable
>> called "cur_thread_placeholder" to store each thread's state. Thread
>> locals are initialized lazily on Darwin.
>>
>> Darwin dynamic linker uses pthread keys to maintain lifetime and
>> storage for thread local variables. pthread keys are destroyed in the
>> order of their construction, not the opposite direction; from lowest
>> index to highest index. Per POSIX standard, this order is unspecified,
>> but dyld code (and comments) explain the order. We want TSan's
>> thread-locals to live until the end of execution, because we'll be
>> accessing "cur_thread_placeholder" when thread exits. However, since
>> TSan is initialized too early, it's taken down earlier than other
>> thread locals. This is problematic, because destructor of these
>> late-destroyed thread-locals may still call functions in TSan RTL
>> (interposed functions or functions called from instrumented code, e.g.
>> __tsan_func_entry). That either immediately crashes, because TSan is
>> in invalid state or tries to "resurrect" TLV, which generally creates
>> an infinite recursion, and end up with stack overflow.
>>
>> As far as I understood, dynamic linker creates a pthread key acting
>> like a class "destructor" (quoting to distinguish it from pthread's
>> destructor). When dyld loads image, it creates another key, whose
>> destructor function frees the storage. At thread exit, it will call
>> destructor functions from first to last. Therefore, the first
>> "destructor" destructor function will be called before "finalize"
>> destructor. After finalize, storage is freed, but can be resurrected
>> if address of TLV is taken.
>>
>> I have found a workaround (or a solution -- you name it) that aims to
>> keep early initialization behavior of TSan TLV keys, but postpone
>> their destruction as much as possible. It seems like the only way to
>> do this is to assign highest pthread key indices to dyld's TLV code
>> when it allocates thread-local variables for TSan, and preserve their
>> order. Because dyld will create keys in the right order; we just want
>> to move those keys to the end of allocatable range. We have N pthread
>> keys available in the range, and we need to allocate three keys at the
>> highest possible indices, and order them. Two of these keys are used
>> by TLV code (in dyld), and one is used internally in TSan. In this
>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
>> Destructor functions for these keys will be called in ascending order,
>> and TSan will work as expected in this case.
>>
>> TSan wasn't interposing pthread_key_create, but I had to highjack it
>> on Darwin to achieve the behavior I explained. First, it identifies
>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
>> call "pthread_key_create_high" to allocate highest possible key
>> indices. It's basically a 'while' loop until real pthread_key_create
>> returns EAGAIN. It then frees all other keys it has previously
>> allocated. This places TSan's TLV keys to the end of destructor list.
>> There is one more thing this function does; it stores pointer to
>> original destructor function (passed in to pthread_key_create), and
>> replaces it with its own destructor function, which will call
>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
>> times to postpone destruction of TSan TLV even further (IIRC, this was
>> necessary). The replacement destructor function then calls the actual
>> destructor function.
>>
>> As a side note, dyld uses libc++, and its loading the instrumented
>> version. This didn't seem to be creating any problem so far. I didn't
>> test this on a sufficiently large code base just yet. TSan tests
>> generally pass, but they didn't have this "late-destroyed thread
>> locals" problem I mentioned.
>>
>> Does my approach sound wrong, flawed or too brittle? Is there any
>> other way to order these events in a sensible way? Can you help me
>> fixing some of the problems I mentioned below, especially the
>> 'identify_caller' function?
>>
>> The code is in a very dirty state now, so I refrain submitting it.
>> But, no worries; there's always pseudo code! I'd like to share the key
>> creation part, the rest of TSan patches are much simpler than this
>> part.
>>
>> Thanks,
>> Ismail
>>
>> #if SANITIZER_MAC
>> enum { kPthHighIndexTlvInitializer = 0,
>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
>> kPthHighIndexTsanFinalize,
>> kPthHighIndexCount };
>>
>> // FIXME: This function is problematic. Can we avoid dladdr?
>> static int identify_caller(uptr pc) {
>>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>     return kPthHighIndexTsanFinalize;
>>   // There are 2 "hidden" callers
>>   Dl_info dli;
>>   if (!dladdr((const void*)pc, &dli))
>>     return -1;
>>   else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>     return kPthHighIndexTlvInitializer;
>>   else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>     return kPthHighIndexTlvLoadNotification;
>>   return -1;
>> }
>>
>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>                  void (*dtor)(void*)) {
>>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>     return REAL(pthread_key_create)(key, dtor);
>>   }
>>   const uptr caller_pc = GET_CALLER_PC();
>>   const int caller_index = identify_caller(caller_pc);
>>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>                             : pthread_key_create_high(key, dtor, caller_index);
>> }
>>
>> // 'index' represents caller index so that we allocate them
>> // in an order. We want to get the last N-index keys
>> // in a sensible order.
>> int __tsan::pthread_key_create_high(unsigned long *key,
>>                                     void (*dtor)(void *), int index) {
>>   int res = 0;
>>   original_dtors[index] = dtor;
>>   // Can be a function-local ulong[512]?
>>   unsigned long lowest_key, highest_key;
>>
>>   // The following 2 loops may terminate early or later
>>   // depending on 'index'.
>>   do
>>     res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>>   while (res != EAGAIN);
>>   for (; lowest_key < highest_key; ++lowest_key) {
>>     pthread_key_delete(lowest_key);
>>   }
>>   tlv_keys[index] = highest_key;
>>   return res;
>> }
>>
>> // This will be called from _pthread_tsd_cleanup
>> static void replacement_dtor(void *p) {
>>   uptr cur_key;
>>   unsigned index = kPthHighIndexCount;
>>   // Can this make it through code review?
>>   __asm__ __volatile__ ("" : "=b" (cur_key));
>>   for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>>     if (cur_key == tlv_keys[c]) {
>>       index = c;
>>       break;
>>     }
>>   }
>>   if (--tls_dtor_counters[index] == 1) {
>>     if (original_dtors[index])
>>       original_dtors[index](p);
>>     return;
>>   }
>>   pthread_setspecific(tlv_keys[index], p);
>> }
>> #endif
>
>
>
> Hi Ismail,
>
> I have a counter-proposal that I wanted to implement for some time,
> but never actually get to it because it was low-priority.
> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
> postponing. There is other smart code that does the same and that also
> breaks the assumption that tsan thread finalization runs last.
> Another problem was with asan on mac. Essentially what you see, but in
> asan it is easier to work around. We just ignore all memory accesses
> that come after our thread destructor.
>
> The idea that I had is relatively clean, reasonably
> platform-independent and should solve _all_ problems of such kind once
> and for all.
> Namely: we execute asan/tsan/msan thread destructor when the thread
> _join_ returns. At that point there is definitely no thread code
> running. The high-level implementation plan: for joinable threads we
> hook into join (we already intercept it); for detached threads we
> start a helper background thread when thread function returns and join
> the user thread from that helper thread. I suspect that actual
> implementation will be messier than that: for example, a thread can
> call pthread_detach in a pthread_specific destructor after we've
> decided to _not_ start a helper thread; or a thread can call
> pthread_exit and does not return from thread function.
>
> What do you think about this idea?

Another implementation plan can be to always join all threads
ourselves. Then in pthread_detach merely remember that the thread is
detached, but don't call REAL(pthread_detach). Then when we join a
thread check whether it is detached or not; if it is detached then we
can discard all thread state; if it is non-detached then we need to
remember that the thread has finished and its return value, wait for
user pthread_join call and satisfy it from our state.

I don't know which plan will lead to simpler and more reliable implementation.

To make it clear, what I don't like in your proposal: it is somewhat
platform-dependent, should be enabled only on mac (thus less tested)
and does not solve all potential problems of such kind (e.g. on some
other OS there can be some other magical way to run user code after
tsan thread destructor).