<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">Hi Ismail and Dmitry,</div><div class=""><br class=""></div><div class="">I’ve been following this (and related) conversations.  We’ve had a recent discussion with Kostya about the issues with TLVs and interceptors on OS X, as I am also experimenting with porting TSan to work on OS X.</div><div class=""><br class=""></div><div class="">While I understand that this patch is about TLV destruction, I’m seeing a more imminent issue, which is the initialization of TLV during process startup and during new thread creation.  The issue is very similar, because TSan’s interceptors get called *before* TLVs are initialized by dyld, and by accessing cur_thread_placeholder the program either crashes or infinitely recurses.  Do you already have some workaround for this as well?</div><div class=""><br class=""></div><div class="">On which version of OS X are you running your tests?</div><div class=""><br class=""></div><div class="">Thanks,</div><div class="">Kuba</div><div class=""><br class=""></div><br class=""><div><blockquote type="cite" class=""><div class="">On 26 Oct 2015, at 17:45, Dmitry Vyukov <<a href="mailto:dvyukov@google.com" class="">dvyukov@google.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class=""><<a href="mailto:ismail.pazarbasi@gmail.com" class="">ismail.pazarbasi@gmail.com</a>> wrote:</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" class=""><blockquote type="cite" class=""><blockquote type="cite" class="">Hi,<br class=""><br class="">I've submitted some of the small patches to get TSan running on Mac OS<br class="">X. Time for the real challenge is approaching, and that needs some<br class="">discussion.<br class=""><br class="">The biggest problem with TSan on Darwin is the way thread local<br class="">variables are created and destroyed. TSan uses a thread-local variable<br class="">called "cur_thread_placeholder" to store each thread's state. Thread<br class="">locals are initialized lazily on Darwin.<br class=""><br class="">Darwin dynamic linker uses pthread keys to maintain lifetime and<br class="">storage for thread local variables. pthread keys are destroyed in the<br class="">order of their construction, not the opposite direction; from lowest<br class="">index to highest index. Per POSIX standard, this order is unspecified,<br class="">but dyld code (and comments) explain the order. We want TSan's<br class="">thread-locals to live until the end of execution, because we'll be<br class="">accessing "cur_thread_placeholder" when thread exits. However, since<br class="">TSan is initialized too early, it's taken down earlier than other<br class="">thread locals. This is problematic, because destructor of these<br class="">late-destroyed thread-locals may still call functions in TSan RTL<br class="">(interposed functions or functions called from instrumented code, e.g.<br class="">__tsan_func_entry). That either immediately crashes, because TSan is<br class="">in invalid state or tries to "resurrect" TLV, which generally creates<br class="">an infinite recursion, and end up with stack overflow.<br class=""><br class="">As far as I understood, dynamic linker creates a pthread key acting<br class="">like a class "destructor" (quoting to distinguish it from pthread's<br class="">destructor). When dyld loads image, it creates another key, whose<br class="">destructor function frees the storage. At thread exit, it will call<br class="">destructor functions from first to last. Therefore, the first<br class="">"destructor" destructor function will be called before "finalize"<br class="">destructor. After finalize, storage is freed, but can be resurrected<br class="">if address of TLV is taken.<br class=""><br class="">I have found a workaround (or a solution -- you name it) that aims to<br class="">keep early initialization behavior of TSan TLV keys, but postpone<br class="">their destruction as much as possible. It seems like the only way to<br class="">do this is to assign highest pthread key indices to dyld's TLV code<br class="">when it allocates thread-local variables for TSan, and preserve their<br class="">order. Because dyld will create keys in the right order; we just want<br class="">to move those keys to the end of allocatable range. We have N pthread<br class="">keys available in the range, and we need to allocate three keys at the<br class="">highest possible indices, and order them. Two of these keys are used<br class="">by TLV code (in dyld), and one is used internally in TSan. In this<br class="">case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.<br class="">Destructor functions for these keys will be called in ascending order,<br class="">and TSan will work as expected in this case.<br class=""><br class="">TSan wasn't interposing pthread_key_create, but I had to highjack it<br class="">on Darwin to achieve the behavior I explained. First, it identifies<br class="">caller. If it's one of dyld's TLV (Thread Local Variable) functions, I<br class="">call "pthread_key_create_high" to allocate highest possible key<br class="">indices. It's basically a 'while' loop until real pthread_key_create<br class="">returns EAGAIN. It then frees all other keys it has previously<br class="">allocated. This places TSan's TLV keys to the end of destructor list.<br class="">There is one more thing this function does; it stores pointer to<br class="">original destructor function (passed in to pthread_key_create), and<br class="">replaces it with its own destructor function, which will call<br class="">'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1<br class="">times to postpone destruction of TSan TLV even further (IIRC, this was<br class="">necessary). The replacement destructor function then calls the actual<br class="">destructor function.<br class=""><br class="">As a side note, dyld uses libc++, and its loading the instrumented<br class="">version. This didn't seem to be creating any problem so far. I didn't<br class="">test this on a sufficiently large code base just yet. TSan tests<br class="">generally pass, but they didn't have this "late-destroyed thread<br class="">locals" problem I mentioned.<br class=""><br class="">Does my approach sound wrong, flawed or too brittle? Is there any<br class="">other way to order these events in a sensible way? Can you help me<br class="">fixing some of the problems I mentioned below, especially the<br class="">'identify_caller' function?<br class=""><br class="">The code is in a very dirty state now, so I refrain submitting it.<br class="">But, no worries; there's always pseudo code! I'd like to share the key<br class="">creation part, the rest of TSan patches are much simpler than this<br class="">part.<br class=""><br class="">Thanks,<br class="">Ismail<br class=""><br class="">#if SANITIZER_MAC<br class="">enum { kPthHighIndexTlvInitializer = 0,<br class="">kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,<br class="">kPthHighIndexTsanFinalize,<br class="">kPthHighIndexCount };<br class=""><br class="">// FIXME: This function is problematic. Can we avoid dladdr?<br class="">static int identify_caller(uptr pc) {<br class=""> if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)<br class="">   return kPthHighIndexTsanFinalize;<br class=""> // There are 2 "hidden" callers<br class=""> Dl_info dli;<br class=""> if (!dladdr((const void*)pc, &dli))<br class="">   return -1;<br class=""> else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))<br class="">   return kPthHighIndexTlvInitializer;<br class=""> else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))<br class="">   return kPthHighIndexTlvLoadNotification;<br class=""> return -1;<br class="">}<br class=""><br class="">TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,<br class="">                void (*dtor)(void*)) {<br class=""> if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {<br class="">   SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);<br class="">   return REAL(pthread_key_create)(key, dtor);<br class=""> }<br class=""> const uptr caller_pc = GET_CALLER_PC();<br class=""> const int caller_index = identify_caller(caller_pc);<br class=""> return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)<br class="">                           : pthread_key_create_high(key, dtor, caller_index);<br class="">}<br class=""><br class="">// 'index' represents caller index so that we allocate them<br class="">// in an order. We want to get the last N-index keys<br class="">// in a sensible order.<br class="">int __tsan::pthread_key_create_high(unsigned long *key,<br class="">                                   void (*dtor)(void *), int index) {<br class=""> int res = 0;<br class=""> original_dtors[index] = dtor;<br class=""> // Can be a function-local ulong[512]?<br class=""> unsigned long lowest_key, highest_key;<br class=""><br class=""> // The following 2 loops may terminate early or later<br class=""> // depending on 'index'.<br class=""> do<br class="">   res = REAL(pthread_key_create)(&highest_key, replacement_dtor);<br class=""> while (res != EAGAIN);<br class=""> for (; lowest_key < highest_key; ++lowest_key) {<br class="">   pthread_key_delete(lowest_key);<br class=""> }<br class=""> tlv_keys[index] = highest_key;<br class=""> return res;<br class="">}<br class=""><br class="">// This will be called from _pthread_tsd_cleanup<br class="">static void replacement_dtor(void *p) {<br class=""> uptr cur_key;<br class=""> unsigned index = kPthHighIndexCount;<br class=""> // Can this make it through code review?<br class=""> __asm__ __volatile__ ("" : "=b" (cur_key));<br class=""> for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {<br class="">   if (cur_key == tlv_keys[c]) {<br class="">     index = c;<br class="">     break;<br class="">   }<br class=""> }<br class=""> if (--tls_dtor_counters[index] == 1) {<br class="">   if (original_dtors[index])<br class="">     original_dtors[index](p);<br class="">   return;<br class=""> }<br class=""> pthread_setspecific(tlv_keys[index], p);<br class="">}<br class="">#endif<br class=""></blockquote><br class=""><br class=""><br class="">Hi Ismail,<br class=""><br class="">I have a counter-proposal that I wanted to implement for some time,<br class="">but never actually get to it because it was low-priority.<br class="">The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1<br class="">postponing. There is other smart code that does the same and that also<br class="">breaks the assumption that tsan thread finalization runs last.<br class="">Another problem was with asan on mac. Essentially what you see, but in<br class="">asan it is easier to work around. We just ignore all memory accesses<br class="">that come after our thread destructor.<br class=""><br class="">The idea that I had is relatively clean, reasonably<br class="">platform-independent and should solve _all_ problems of such kind once<br class="">and for all.<br class="">Namely: we execute asan/tsan/msan thread destructor when the thread<br class="">_join_ returns. At that point there is definitely no thread code<br class="">running. The high-level implementation plan: for joinable threads we<br class="">hook into join (we already intercept it); for detached threads we<br class="">start a helper background thread when thread function returns and join<br class="">the user thread from that helper thread. I suspect that actual<br class="">implementation will be messier than that: for example, a thread can<br class="">call pthread_detach in a pthread_specific destructor after we've<br class="">decided to _not_ start a helper thread; or a thread can call<br class="">pthread_exit and does not return from thread function.<br class=""><br class="">What do you think about this idea?<br class=""></blockquote><br class="">Another implementation plan can be to always join all threads<br class="">ourselves. Then in pthread_detach merely remember that the thread is<br class="">detached, but don't call REAL(pthread_detach). Then when we join a<br class="">thread check whether it is detached or not; if it is detached then we<br class="">can discard all thread state; if it is non-detached then we need to<br class="">remember that the thread has finished and its return value, wait for<br class="">user pthread_join call and satisfy it from our state.<br class=""><br class="">I don't know which plan will lead to simpler and more reliable implementation.<br class=""><br class=""></blockquote>I didn't think thoroughly about this case. This sounds easier to<br class="">implement, but I am not sure if it'll work correctly (i.e. will it<br class="">miss problems that happen after TSan TLV dies, or will it report<br class="">false-positives, because TSan TLV dies too early). Still, we need to<br class=""></blockquote><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Can we somehow get access to ThreadState from cur_thread() until we</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">join the thread? That would be ideal and prevent false positives and</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">negatives. The simplest way I can think of is to consult</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">ThreadRegistriy using pthread_self() as the key to find current</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">ThreadState if pthread_getspecific() returns NULL.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">change a few things in TSan so that it tolerates missing<br class="">cur_thread_state. I forgot many details of TSan, I haven't been<br class="">hacking it for a long time.<br class=""><br class=""><blockquote type="cite" class="">To make it clear, what I don't like in your proposal: it is somewhat<br class="">platform-dependent, should be enabled only on mac (thus less tested)<br class="">and does not solve all potential problems of such kind (e.g. on some<br class="">other OS there can be some other magical way to run user code after<br class="">tsan thread destructor).<br class=""></blockquote>I agree; it's best to avoid platform-specific workarounds.<br class=""><br class="">Ismail</blockquote></div></blockquote></div><br class=""></body></html>