[cfe-dev] TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Ismail Pazarbasi via cfe-dev cfe-dev at lists.llvm.org
Mon Oct 26 12:41:48 PDT 2015


On Thu, Oct 8, 2015 at 10:41 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
> On Thu, Oct 8, 2015 at 10:20 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
>> On Wed, Oct 7, 2015 at 11:58 PM, Ismail Pazarbasi
>> <ismail.pazarbasi at gmail.com> wrote:
>>> Hi,
>>>
>>> I've submitted some of the small patches to get TSan running on Mac OS
>>> X. Time for the real challenge is approaching, and that needs some
>>> discussion.
>>>
>>> The biggest problem with TSan on Darwin is the way thread local
>>> variables are created and destroyed. TSan uses a thread-local variable
>>> called "cur_thread_placeholder" to store each thread's state. Thread
>>> locals are initialized lazily on Darwin.
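>>>
>>> Roughly, the placeholder pattern in the runtime looks like the
>>> following; this is a from-memory sketch, so treat the exact
>>> declarations and the ThreadState usage as approximate:
>>>
>>> // Raw per-thread storage for the thread's state. It is a plain
>>> // thread-local byte array, constructed lazily on first use --
>>> // which is exactly why teardown order matters on Darwin.
>>> static __thread char cur_thread_placeholder[sizeof(ThreadState)]
>>>     __attribute__((aligned(64)));
>>>
>>> ThreadState *cur_thread() {
>>>   return reinterpret_cast<ThreadState *>(&cur_thread_placeholder);
>>> }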
>>>
>>> The Darwin dynamic linker uses pthread keys to maintain the lifetime
>>> and storage of thread-local variables. pthread key destructors run
>>> in the order of key creation, from lowest index to highest, not in
>>> the reverse order. Per the POSIX standard, this order is
>>> unspecified, but the dyld code (and its comments) make the order
>>> explicit. We want TSan's thread-locals to live until the end of
>>> execution, because we'll be accessing "cur_thread_placeholder" when
>>> a thread exits. However, since TSan is initialized very early, its
>>> thread-locals are torn down before other thread locals. This is
>>> problematic, because the destructors of these late-destroyed
>>> thread-locals may still call functions in the TSan RTL (interposed
>>> functions, or functions called from instrumented code, e.g.
>>> __tsan_func_entry). That either crashes immediately, because TSan is
>>> in an invalid state, or tries to "resurrect" the TLV, which
>>> generally creates infinite recursion and ends with a stack overflow.
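>>>
>>> To make the ordering concrete, here is a tiny standalone test (not
>>> part of the patch) of what dyld's comments describe; POSIX itself
>>> leaves the destructor order unspecified:
>>>
>>> #include <pthread.h>
>>> #include <stdio.h>
>>>
>>> static pthread_key_t key_low, key_high;
>>>
>>> static void dtor_low(void *)  { printf("dtor of lower-index key\n"); }
>>> static void dtor_high(void *) { printf("dtor of higher-index key\n"); }
>>>
>>> static void *worker(void *) {
>>>   // Make both slots non-NULL so both destructors run at thread exit.
>>>   pthread_setspecific(key_low, (void *)1);
>>>   pthread_setspecific(key_high, (void *)1);
>>>   return 0;
>>> }
>>>
>>> int main() {
>>>   pthread_key_create(&key_low, dtor_low);    // created first
>>>   pthread_key_create(&key_high, dtor_high);  // created second
>>>   pthread_t t;
>>>   pthread_create(&t, 0, worker, 0);
>>>   pthread_join(t, 0);
>>>   // Expected on Darwin: "lower-index" prints before "higher-index".
>>>   return 0;
>>> }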
>>>
>>> As far as I understand, the dynamic linker creates a pthread key
>>> that acts like the C++ "destructor" for the variable (quoted to
>>> distinguish it from a pthread destructor). When dyld loads an image,
>>> it creates another key whose destructor function frees the TLV
>>> storage. At thread exit, pthread calls the destructor functions from
>>> first to last, so the "destructor" key's function runs before the
>>> "finalize" key's function. After finalize, the storage is freed, but
>>> it can be resurrected if the address of a TLV is taken again.
>>>
>>> I have found a workaround (or a solution -- you name it) that aims
>>> to keep the early-initialization behavior of TSan's TLV keys, but
>>> postpone their destruction as much as possible. It seems like the
>>> only way to do this is to hand the highest pthread key indices to
>>> dyld's TLV code when it allocates the keys for TSan's thread-local
>>> variables, while preserving their relative order. dyld already
>>> creates the keys in the right order; we just want to move those keys
>>> to the end of the allocatable range. We have N pthread keys
>>> available in the range, and we need to allocate three keys at the
>>> highest possible indices and keep them ordered. Two of these keys
>>> are used by the TLV code (in dyld), and one is used internally by
>>> TSan. In this scheme, key N-3 is internal-use, N-2 is tlv_init, and
>>> N-1 is the finalizer. The destructor functions for these keys are
>>> then called in ascending index order, and TSan works as expected.
>>>
>>> TSan wasn't interposing pthread_key_create, but I had to hijack it
>>> on Darwin to achieve the behavior described above. First, the
>>> interceptor identifies the caller. If it is one of dyld's TLV
>>> (Thread Local Variable) functions, I call "pthread_key_create_high"
>>> to allocate the highest possible key index. It is basically a
>>> 'while' loop that calls the real pthread_key_create until it returns
>>> EAGAIN; it then frees all the other keys it allocated along the way.
>>> This places TSan's TLV keys at the end of the destructor list. The
>>> function does one more thing: it stores a pointer to the original
>>> destructor function (the one passed to pthread_key_create) and
>>> replaces it with its own destructor, which calls 'pthread_setspecific'
>>> on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1 times to postpone the
>>> destruction of the TSan TLV even further (IIRC, this was necessary).
>>> The replacement destructor then calls the actual destructor.
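>>>
>>> In isolation, the postponement trick looks roughly like this
>>> (simplified sketch with hypothetical names; it abuses the slot value
>>> as a countdown instead of storing real data):
>>>
>>> #include <limits.h>
>>> #include <pthread.h>
>>> #include <stdint.h>
>>>
>>> static pthread_key_t g_key;
>>>
>>> static void postponing_dtor(void *p) {
>>>   uintptr_t rounds_left = (uintptr_t)p;
>>>   if (rounds_left > 1) {
>>>     // Re-setting the value from inside a destructor keeps the slot
>>>     // non-NULL, so pthread runs another destructor iteration (up to
>>>     // PTHREAD_DESTRUCTOR_ITERATIONS in total), postponing the real
>>>     // teardown to the last allowed round.
>>>     pthread_setspecific(g_key, (void *)(rounds_left - 1));
>>>     return;
>>>   }
>>>   // rounds_left == 1: final round, do the real per-thread teardown.
>>> }
>>>
>>> // In the thread: arm the countdown with the maximum round count.
>>> // pthread_setspecific(g_key,
>>> //     (void *)(uintptr_t)PTHREAD_DESTRUCTOR_ITERATIONS);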
>>>
>>> As a side note, dyld uses libc++, and it's loading the instrumented
>>> version. This hasn't seemed to cause any problems so far, but I
>>> haven't tested on a sufficiently large code base just yet. The TSan
>>> tests generally pass, but they didn't exhibit the "late-destroyed
>>> thread locals" problem I mentioned.
>>>
>>> Does my approach sound wrong, flawed, or too brittle? Is there any
>>> other way to order these events sensibly? Can you help me fix some
>>> of the problems I mention below, especially the 'identify_caller'
>>> function?
>>>
>>> The code is in a very dirty state right now, so I'll refrain from
>>> submitting it. But no worries; there's always pseudo code! I'd like
>>> to share the key creation part; the rest of the TSan patches are
>>> much simpler than this.
>>>
>>> Thanks,
>>> Ismail
>>>
>>> #if SANITIZER_MAC
>>> enum {
>>>   kPthHighIndexTlvInitializer = 0,
>>>   kPthHighIndexTlvLoadNotification,
>>>   kPthHighIndexTsanKeyCtor,
>>>   kPthHighIndexTsanFinalize,
>>>   kPthHighIndexCount
>>> };
>>>
>>> // FIXME: This function is problematic. Can we avoid dladdr?
>>> static int identify_caller(uptr pc) {
>>>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>>     return kPthHighIndexTsanFinalize;
>>>   // There are 2 "hidden" callers inside dyld's TLV support code;
>>>   // identify them by symbol name.
>>>   Dl_info dli;
>>>   if (!dladdr((const void *)pc, &dli) || !dli.dli_sname)
>>>     return -1;
>>>   if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>>     return kPthHighIndexTlvInitializer;
>>>   if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>>     return kPthHighIndexTlvLoadNotification;
>>>   return -1;
>>> }
>>>
>>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>>                  void (*dtor)(void*)) {
>>>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>>     return REAL(pthread_key_create)(key, dtor);
>>>   }
>>>   const uptr caller_pc = GET_CALLER_PC();
>>>   const int caller_index = identify_caller(caller_pc);
>>>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>>                             : pthread_key_create_high(key, dtor, caller_index);
>>> }
>>>
>>> // 'index' is the caller index, so that we allocate these keys in a
>>> // fixed order: we want the last few key indices, in that order.
>>> int __tsan::pthread_key_create_high(unsigned long *key,
>>>                                     void (*dtor)(void *), int index) {
>>>   original_dtors[index] = dtor;
>>>   // Can be a function-local ulong[512]?
>>>   unsigned long lowest_key = 0, highest_key = 0, k = 0;
>>>   bool have_first = false;
>>>
>>>   // Exhaust the key space: keep allocating until EAGAIN. The last
>>>   // successful allocation is the highest available index (the real
>>>   // pthread_key_create leaves 'k' untouched on failure). Note that
>>>   // both loops may terminate earlier or later depending on 'index'.
>>>   while (REAL(pthread_key_create)(&k, replacement_dtor) != EAGAIN) {
>>>     if (!have_first) {
>>>       lowest_key = k;
>>>       have_first = true;
>>>     }
>>>     highest_key = k;
>>>   }
>>>   // Free every key allocated above except the highest one.
>>>   for (; lowest_key < highest_key; ++lowest_key)
>>>     pthread_key_delete(lowest_key);
>>>   tlv_keys[index] = highest_key;
>>>   *key = highest_key;
>>>   return 0;
>>> }
>>>
>>> // This will be called from _pthread_tsd_cleanup.
>>> static void replacement_dtor(void *p) {
>>>   uptr cur_key;
>>>   unsigned index = kPthHighIndexCount;
>>>   // The destructor is not told which key it runs for, so grab the
>>>   // current key index from %rbx, relying on _pthread_tsd_cleanup's
>>>   // register usage. Can this make it through code review?
>>>   __asm__ __volatile__("" : "=b"(cur_key));
>>>   for (unsigned c = kPthHighIndexTlvInitializer;
>>>        c < kPthHighIndexCount; c++) {
>>>     if (cur_key == tlv_keys[c]) {
>>>       index = c;
>>>       break;
>>>     }
>>>   }
>>>   if (index == kPthHighIndexCount)
>>>     return;  // not one of our keys
>>>   // Assuming tls_dtor_counters[] start at PTHREAD_DESTRUCTOR_ITERATIONS,
>>>   // the original destructor runs only on the last allowed iteration.
>>>   if (--tls_dtor_counters[index] == 1) {
>>>     if (original_dtors[index])
>>>       original_dtors[index](p);
>>>     return;
>>>   }
>>>   // Re-arm the key so pthread runs one more destructor iteration.
>>>   pthread_setspecific(tlv_keys[index], p);
>>> }
>>> #endif  // SANITIZER_MAC
>>
>>
>>
>> Hi Ismail,
>>
>> I have a counter-proposal that I wanted to implement for some time,
>> but never actually got to, because it was low priority.
>> The problem we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
>> postponing: there is other smart code that does the same thing, and
>> it also breaks the assumption that tsan thread finalization runs
>> last.
>> Another problem was with asan on mac: essentially what you are
>> seeing, but in asan it is easier to work around; we just ignore all
>> memory accesses that come after our thread destructor.
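>>
>> Roughly, in pseudo-code (hypothetical names, not the actual asan
>> sources):
>>
>> static __thread bool thread_finished;  // set by our thread destructor
>>
>> void OnThreadFinalize() { thread_finished = true; }
>>
>> void CheckMemoryAccess(uptr addr, uptr size, bool is_write) {
>>   if (thread_finished)
>>     return;  // silently drop accesses arriving after finalization
>>   // ... normal shadow-memory check ...
>> }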
>>
>> The idea I had is relatively clean, reasonably platform-independent,
>> and should solve _all_ problems of this kind once and for all.
>> Namely: we execute the asan/tsan/msan thread destructor when the
>> thread _join_ returns. At that point there is definitely no thread
>> code running. The high-level implementation plan: for joinable
>> threads we hook into join (we already intercept it); for detached
>> threads we start a helper background thread when the thread function
>> returns and join the user thread from that helper thread. I suspect
>> the actual implementation will be messier than that: for example, a
>> thread can call pthread_detach in a pthread_specific destructor
>> after we've decided to _not_ start a helper thread, or a thread can
>> call pthread_exit and never return from its thread function.
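>>
>> A rough sketch of the join-side hook under this plan (the finalize
>> helper is hypothetical, not existing code):
>>
>> TSAN_INTERCEPTOR(int, pthread_join, void *th, void **ret) {
>>   SCOPED_INTERCEPTOR_RAW(pthread_join, th, ret);
>>   int res = REAL(pthread_join)(th, ret);
>>   if (res == 0) {
>>     // The joined thread can no longer run any user code, so this is
>>     // a safe point to destroy its ThreadState.
>>     FinalizeThreadStateFor(th);  // hypothetical helper
>>   }
>>   return res;
>> }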
>>
>> What do you think about this idea?
>
> Another implementation plan could be to always join all threads
> ourselves. In pthread_detach we would merely remember that the thread
> is detached, but not call REAL(pthread_detach). Then, when we join a
> thread, we check whether it is detached: if it is, we can discard all
> of its thread state; if it is not, we remember that the thread has
> finished, save its return value, wait for the user's pthread_join
> call, and satisfy it from our state.
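>
> Sketched as data, the bookkeeping might look like this (hypothetical
> names):
>
> struct ThreadRecord {
>   bool detached;   // our pthread_detach only sets this flag;
>                    // REAL(pthread_detach) is never called
>   bool finished;   // the runtime's own REAL(pthread_join) completed
>   void *retval;    // saved so a later user pthread_join can be
>                    // satisfied from this record
> };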
>
> I don't know which plan will lead to a simpler and more reliable
> implementation.
>
I didn't think this case through thoroughly. It sounds easier to
implement, but I am not sure whether it will work correctly (i.e.
whether it will miss problems that happen after the TSan TLV dies, or
report false positives because the TSan TLV dies too early). I've
forgotten many details of TSan; I haven't been hacking on it for a
long time. Either way, we need to change a few things in TSan so that
it tolerates a missing cur_thread_state.
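
One possible shape of that tolerance, purely as an illustration (the
'is_dead' flag is an assumption, not existing code):

void __tsan_func_entry(void *pc) {
  ThreadState *thr = cur_thread();
  if (thr->is_dead)  // hypothetical flag set once the state is retired
    return;          // ignore events arriving after teardown
  // ... normal handling ...
}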

> To make it clear, what I don't like in your proposal: it is somewhat
> platform-dependent, would be enabled only on mac (and thus less
> tested), and does not solve all potential problems of this kind
> (e.g. on some other OS there may be some other magical way to run
> user code after the tsan thread destructor).
I agree; it's best to avoid platform-specific workarounds.

Ismail


