[cfe-dev] TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Thu Oct 29 13:45:31 PDT 2015

We can access cur_thread until _pthread_tsd_cleanup is called, which
is called after tsan finishes with the thread entry point, and returns
control to system. After that point, calling cur_thread will
"resurrect" that storage (re-allocated and filled with 0 -- like,
resurrecting in another body?). Accessing the storage through another
pointer aliasing now-freed cur_thread is undefined behavior (because
dyld frees the storage). Accessing cur_thread from within pthread_join
will work, because thread is still alive.

I've tried following so far, but not spent significant time on them.
1. Use malloc, not TLV. This wasn't an impressive way to resolve the
problem, and was also hacky. That required other machinery to make it
thread-safe and thread-specific. System's malloc is initialized a
little too late, IIRC. I also recall a problem with the instrumented
libc++ loaded early on by dyld. Finally, there's was a more serious
problem with freeing the storage; when and which function should free
it?
2. Keep using TLV as-is. Find "the right moment" and move cur_thread
to a dynamically allocated area that lives long enough. Use
pthread_self as key, because it's still valid. Unfortunately, moving
ThreadState didn't work, because there were non-moveable types along
the chain. Again, freeing malloc'ed storage will become a problem.

I gave up dynamic memory allocation for cur_thread, because I couldn't
resolve the problem with lifetime management of that storage. dyld
already manages that memory, so I thought working around its
limitations is a better solution.

Ismail

On Tue, Oct 27, 2015 at 1:45 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
> On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi
> <ismail.pazarbasi at gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> I've submitted some of the small patches to get TSan running on Mac OS
>>>>> X. Time for the real challenge is approaching, and that needs some
>>>>> discussion.
>>>>>
>>>>> The biggest problem with TSan on Darwin is the way thread local
>>>>> variables are created and destroyed. TSan uses a thread-local variable
>>>>> called "cur_thread_placeholder" to store each thread's state. Thread
>>>>> locals are initialized lazily on Darwin.
>>>>>
>>>>> Darwin dynamic linker uses pthread keys to maintain lifetime and
>>>>> storage for thread local variables. pthread keys are destroyed in the
>>>>> order of their construction, not the opposite direction; from lowest
>>>>> index to highest index. Per POSIX standard, this order is unspecified,
>>>>> but dyld code (and comments) explain the order. We want TSan's
>>>>> thread-locals to live until the end of execution, because we'll be
>>>>> accessing "cur_thread_placeholder" when thread exits. However, since
>>>>> TSan is initialized too early, it's taken down earlier than other
>>>>> thread locals. This is problematic, because destructor of these
>>>>> late-destroyed thread-locals may still call functions in TSan RTL
>>>>> (interposed functions or functions called from instrumented code, e.g.
>>>>> __tsan_func_entry). That either immediately crashes, because TSan is
>>>>> in invalid state or tries to "resurrect" TLV, which generally creates
>>>>> an infinite recursion, and end up with stack overflow.
>>>>>
>>>>> As far as I understood, dynamic linker creates a pthread key acting
>>>>> like a class "destructor" (quoting to distinguish it from pthread's
>>>>> destructor). When dyld loads image, it creates another key, whose
>>>>> destructor function frees the storage. At thread exit, it will call
>>>>> destructor functions from first to last. Therefore, the first
>>>>> "destructor" destructor function will be called before "finalize"
>>>>> destructor. After finalize, storage is freed, but can be resurrected
>>>>> if address of TLV is taken.
>>>>>
>>>>> I have found a workaround (or a solution -- you name it) that aims to
>>>>> keep early initialization behavior of TSan TLV keys, but postpone
>>>>> their destruction as much as possible. It seems like the only way to
>>>>> do this is to assign highest pthread key indices to dyld's TLV code
>>>>> when it allocates thread-local variables for TSan, and preserve their
>>>>> order. Because dyld will create keys in the right order; we just want
>>>>> to move those keys to the end of allocatable range. We have N pthread
>>>>> keys available in the range, and we need to allocate three keys at the
>>>>> highest possible indices, and order them. Two of these keys are used
>>>>> by TLV code (in dyld), and one is used internally in TSan. In this
>>>>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
>>>>> Destructor functions for these keys will be called in ascending order,
>>>>> and TSan will work as expected in this case.
>>>>>
>>>>> TSan wasn't interposing pthread_key_create, but I had to highjack it
>>>>> on Darwin to achieve the behavior I explained. First, it identifies
>>>>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
>>>>> call "pthread_key_create_high" to allocate highest possible key
>>>>> indices. It's basically a 'while' loop until real pthread_key_create
>>>>> returns EAGAIN. It then frees all other keys it has previously
>>>>> allocated. This places TSan's TLV keys to the end of destructor list.
>>>>> There is one more thing this function does; it stores pointer to
>>>>> original destructor function (passed in to pthread_key_create), and
>>>>> replaces it with its own destructor function, which will call
>>>>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
>>>>> times to postpone destruction of TSan TLV even further (IIRC, this was
>>>>> necessary). The replacement destructor function then calls the actual
>>>>> destructor function.
>>>>>
>>>>> As a side note, dyld uses libc++, and its loading the instrumented
>>>>> version. This didn't seem to be creating any problem so far. I didn't
>>>>> test this on a sufficiently large code base just yet. TSan tests
>>>>> generally pass, but they didn't have this "late-destroyed thread
>>>>> locals" problem I mentioned.
>>>>>
>>>>> Does my approach sound wrong, flawed or too brittle? Is there any
>>>>> other way to order these events in a sensible way? Can you help me
>>>>> fixing some of the problems I mentioned below, especially the
>>>>> 'identify_caller' function?
>>>>>
>>>>> The code is in a very dirty state now, so I refrain submitting it.
>>>>> But, no worries; there's always pseudo code! I'd like to share the key
>>>>> creation part, the rest of TSan patches are much simpler than this
>>>>> part.
>>>>>
>>>>> Thanks,
>>>>> Ismail
>>>>>
>>>>> #if SANITIZER_MAC
>>>>> enum { kPthHighIndexTlvInitializer = 0,
>>>>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
>>>>> kPthHighIndexTsanFinalize,
>>>>> kPthHighIndexCount };
>>>>>
>>>>> // FIXME: This function is problematic. Can we avoid dladdr?
>>>>> static int identify_caller(uptr pc) {
>>>>>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>>>>     return kPthHighIndexTsanFinalize;
>>>>>   // There are 2 "hidden" callers
>>>>>   Dl_info dli;
>>>>>   if (!dladdr((const void*)pc, &dli))
>>>>>     return -1;
>>>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>>>>     return kPthHighIndexTlvInitializer;
>>>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>>>>     return kPthHighIndexTlvLoadNotification;
>>>>>   return -1;
>>>>> }
>>>>>
>>>>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>>>>                  void (*dtor)(void*)) {
>>>>>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>>>>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>>>>     return REAL(pthread_key_create)(key, dtor);
>>>>>   }
>>>>>   const uptr caller_pc = GET_CALLER_PC();
>>>>>   const int caller_index = identify_caller(caller_pc);
>>>>>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>>>>                             : pthread_key_create_high(key, dtor, caller_index);
>>>>> }
>>>>>
>>>>> // 'index' represents caller index so that we allocate them
>>>>> // in an order. We want to get the last N-index keys
>>>>> // in a sensible order.
>>>>> int __tsan::pthread_key_create_high(unsigned long *key,
>>>>>                                     void (*dtor)(void *), int index) {
>>>>>   int res = 0;
>>>>>   original_dtors[index] = dtor;
>>>>>   // Can be a function-local ulong[512]?
>>>>>   unsigned long lowest_key, highest_key;
>>>>>
>>>>>   // The following 2 loops may terminate early or later
>>>>>   // depending on 'index'.
>>>>>   do
>>>>>     res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>>>>>   while (res != EAGAIN);
>>>>>   for (; lowest_key < highest_key; ++lowest_key) {
>>>>>     pthread_key_delete(lowest_key);
>>>>>   }
>>>>>   tlv_keys[index] = highest_key;
>>>>>   return res;
>>>>> }
>>>>>
>>>>> // This will be called from _pthread_tsd_cleanup
>>>>> static void replacement_dtor(void *p) {
>>>>>   uptr cur_key;
>>>>>   unsigned index = kPthHighIndexCount;
>>>>>   // Can this make it through code review?
>>>>>   __asm__ __volatile__ ("" : "=b" (cur_key));
>>>>>   for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>>>>>     if (cur_key == tlv_keys[c]) {
>>>>>       index = c;
>>>>>       break;
>>>>>     }
>>>>>   }
>>>>>   if (--tls_dtor_counters[index] == 1) {
>>>>>     if (original_dtors[index])
>>>>>       original_dtors[index](p);
>>>>>     return;
>>>>>   }
>>>>>   pthread_setspecific(tlv_keys[index], p);
>>>>> }
>>>>> #endif
>>>>
>>>>
>>>>
>>>> Hi Ismail,
>>>>
>>>> I have a counter-proposal that I wanted to implement for some time,
>>>> but never actually get to it because it was low-priority.
>>>> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
>>>> postponing. There is other smart code that does the same and that also
>>>> breaks the assumption that tsan thread finalization runs last.
>>>> Another problem was with asan on mac. Essentially what you see, but in
>>>> asan it is easier to work around. We just ignore all memory accesses
>>>> that come after our thread destructor.
>>>>
>>>> The idea that I had is relatively clean, reasonably
>>>> platform-independent and should solve _all_ problems of such kind once
>>>> and for all.
>>>> Namely: we execute asan/tsan/msan thread destructor when the thread
>>>> _join_ returns. At that point there is definitely no thread code
>>>> running. The high-level implementation plan: for joinable threads we
>>>> hook into join (we already intercept it); for detached threads we
>>>> start a helper background thread when thread function returns and join
>>>> the user thread from that helper thread. I suspect that actual
>>>> implementation will be messier than that: for example, a thread can
>>>> call pthread_detach in a pthread_specific destructor after we've
>>>> decided to _not_ start a helper thread; or a thread can call
>>>> pthread_exit and does not return from thread function.
>>>>
>>>> What do you think about this idea?
>>>
>>> Another implementation plan can be to always join all threads
>>> ourselves. Then in pthread_detach merely remember that the thread is
>>> detached, but don't call REAL(pthread_detach). Then when we join a
>>> thread check whether it is detached or not; if it is detached then we
>>> can discard all thread state; if it is non-detached then we need to
>>> remember that the thread has finished and its return value, wait for
>>> user pthread_join call and satisfy it from our state.
>>>
>>> I don't know which plan will lead to simpler and more reliable implementation.
>>>
>> I didn't think thoroughly about this case. This sounds easier to
>> implement, but I am not sure if it'll work correctly (i.e. will it
>> miss problems that happen after TSan TLV dies, or will it report
>> false-positives, because TSan TLV dies too early). Still, we need to
>
>
> Can we somehow get access to ThreadState from cur_thread() until we
> join the thread? That would be ideal and prevent false positives and
> negatives. The simplest way I can think of is to consult
> ThreadRegistriy using pthread_self() as the key to find current
> ThreadState if pthread_getspecific() returns NULL.
>
>
>> change a few things in TSan so that it tolerates missing
>> cur_thread_state. I forgot many details of TSan, I haven't been
>> hacking it for a long time.
>>
>>> To make it clear, what I don't like in your proposal: it is somewhat
>>> platform-dependent, should be enabled only on mac (thus less tested)
>>> and does not solve all potential problems of such kind (e.g. on some
>>> other OS there can be some other magical way to run user code after
>>> tsan thread destructor).
>> I agree; it's best to avoid platform-specific workarounds.
>>
>> Ismail