[PATCH] D38073: [XRay][compiler-rt] Use pthread for initializing thread-local data

Thu Sep 28 21:09:56 PDT 2017

On Fri, Sep 29, 2017 at 4:11 AM David Blaikie <dblaikie at gmail.com> wrote:

> On Wed, Sep 27, 2017 at 5:36 PM Dean Michael Berris <dberris at google.com>
> wrote:
>
>> On 28 Sep 2017, at 03:47, David Blaikie <dblaikie at gmail.com> wrote:
>>
>>
>>
>> On Wed, Sep 20, 2017 at 6:15 PM Dean Michael Berris via Phabricator <
>> reviews at reviews.llvm.org> wrote:
>>
>>> dberris added a comment.
>>>
>>> Thanks for the review!
>>>
>>>
>>>
>>> ================
>>> Comment at: lib/xray/xray_fdr_logging_impl.h:110
>>>  // call so that it can be initialized on first use instead of as a
>>> global.
>>> -struct ThreadLocalData {
>>> +struct alignas(64) ThreadLocalData {
>>>    BufferQueue::Buffer Buffer;
>>> ----------------
>>> dblaikie wrote:
>>> > Presumably this should use LLVM_ALIGNAS?
>>> >
>>> > Is this related to the rest of the patch, if so, how? Is it for
>>> correctness or performance?
>>> It's related in that we'd like to use aligned uninitialised storage to
>>> store one of these objects. We'd the buffer to not only be correctly sized
>>> but also aligned correctly,
>>
>> since we're going to interpret the buffer's address as a different type
>>> (punning? aliasing?
>>
>>
>> Not sure I follow, the buffer:
>>
>>  thread_local std::aligned_storage<sizeof(ThreadLocalData),
>>                                     alignof(ThreadLocalData)>::type
>> TLSBuffer;
>>
>> is requested to be of the appropriate alignment for ThreadLocalData? So
>> why would ThreadLocalData need to request any specific alignment itself?
>>
>>
>> I thought you'd need the specific alignment so that pointers of the type
>> would be properly aligned
>>
>
> Why do you need a specific pointer alignment? Are you using the low bits
> for something like PointerIntPair?
>

For this object in particular, it's accessed in the hot path of every FDR
mode handler. Aligning to 32-byte boundaries on x86 gives us a better
chance of having this code be in one of two possible cache line positions.
If/when the compiler learns to align 64-byte objects to cache line
alignment, then the hot path gets to access the same cache line for any
thread-local access for the currently running thread.

This means any operations that will be performed on any of the members can
use aligned versions of instructions -- things like aligned mov operations,
and synthesized memcpy intrinsics that work better with aligned addresses.

>
>
>> -- thus helping the compiler decide more appropriate aligned versions of
>> the operations on members. If that's not the case, then that's fine.
>>
>
> Not sure I understand. If the members have certain minimum alignment
> requirements naturally, the compiler will ensure they're aligned
> sufficiently and depend on that for operations on them, etc.
>
>

It's not just the members we care about, it's also where the base pointer
is located (cache-line aligned, etc.). If the object and pointers to it
will be always aligned (with forced alignment) then the natural member
alignments will also be aligned.

>
>> Or do you mean the contents of the Buffer member will be used to store
>> other things? Then perhaps it'd be more appropriate to put the alignment
>> attribute on the member? (& an assert somewhere that compares the alignment
>> of the thing being stored with the alignment of the buffer itself).
>>
>>
>>> I can never get the right terminology for that). The performance benefit
>>> is a side-effect of the correctness issue (since we have non-trivially
>>> destructible things in it).
>>>
>>> BTW, I can't use LLVM_ALIGNAS because that seems to be undefined in
>>> compiler-rt. :/
>>>
>>
>> Is using the attribute directly sufficiently portable for compiler-rt?
>> Are there other uses of alignment in compiler-rt to compare/check against?
>>
>>
>>
>> It's C++11, and all of compiler-rt seems to be written with the
>> requirement of C++11 so it's suitably portable that way. Removed it anyway.
>>
>>
>>>
>>> ================
>>> Comment at: lib/xray/xray_fdr_logging_impl.h:145
>>> +// the FDR logging implementation. The implementation details require a
>>> bit of
>>> +// care to maintain, so document the semantics that we want to achieve
>>> here.
>>> +//
>>> ----------------
>>> dblaikie wrote:
>>> > "so document the semantics that we want to achieve here" seems weird -
>>> that sounds like a patch note or feature request (& somewhat redundant, it
>>> feels like - the reader can probably see that the semantics are documented
>>> below), rather than a statement of fact.
>>> >
>>> > "the semantics are documented below" or maybe "The following semantics
>>> are intended:" or skip this comment entirely? Not sure.
>>> I like skipping it. Stream of consciousness comments at the end of the
>>> day don't come out very well. :)
>>>
>>>
>>> ================
>>> Comment at: lib/xray/xray_fdr_logging_impl.h:175-183
>>> +// We're doing this to avoid using a `thread_local` object that has a
>>> +// non-trivial destructor, because the C++ runtime might call
>>> std::malloc(...)
>>> +// to register calls to destructors. Deadlocks may arise when, for
>>> example, an
>>> +// externally provided malloc implementation is XRay instrumented, and
>>> +// initializing the thread-locals involves calling into malloc. A malloc
>>> +// implementation that does global synchronization might be holding a
>>> lock for a
>>> +// critical section, calling a function that might be XRay instrumented
>>> (and
>>> ----------------
>>> dblaikie wrote:
>>> > Maybe this goes into more detail than needed if the XRay runtime in
>>> general shouldn't be using the standard library for all these reasons -
>>> that's a general requirement that wouldn't need to be rejustified at every
>>> use case/implementation detail?
>>> This is specifically for the function, since the original implementation
>>> (and the potentially more obvious solution) uses a thread-exit guard type
>>> (i.e. RAII but using a thread-local). Other parts of the implementation
>>> have avoided using memory allocation routines because of performance
>>> reasons but not entirely for correctness.
>>>
>>>
>>> ================
>>> Comment at: lib/xray/xray_fdr_logging_impl.h:190-191
>>> +  thread_local pthread_once_t key_once = PTHREAD_ONCE_INIT;
>>> +  thread_local std::aligned_storage<sizeof(ThreadLocalData),
>>> +                                    alignof(ThreadLocalData)>::type
>>> TLSBuffer;
>>> +  // Ensure that we only actually ever do the pthread initialization
>>> once.
>>> ----------------
>>> dblaikie wrote:
>>> > Seems the goal in general was/is to move away from C++ standard
>>> library usage in the runtime library, right? Is there an attempt to pick
>>> and choose which parts can be relied upon, or is the overall goal to
>>> actually remove C++ standard library usage entirely, like other parts of
>>> compiler-rt?
>>> >
>>> > If not, then maybe std::aligned_union<1, ThreadLocalData> might be
>>> marginally simpler here, but only marginally. (& I wouldn't say it's
>>> strictly better - since a union of only one thing is a bit strange)
>>> I think I should document why we're using a thread-local uninitialized
>>> memory at all in this situation. :)
>>>
>>> First, the type ThreadLocalData isn't trivially destructable. If this
>>> was a thread_local object,
>>
>>
>> I wasn't intending to suggest using a thread_local object, but replacing
>> the aligned_storage with aligned_union?
>>
>>
>>
>> Just read up on std::aligned_union, it seems it's overkill for this case
>> as it's equivalent to just the aligned storage. Used std::aligned_union
>> anyway if that's more common to use than std::aligned_storage.
>>
>> we end up with the same problem we had in the first place -- having a
>>> thread_local object with a non-trivial destructor may cause the C++ runtime
>>> to call std::malloc(...) and introduce a deadlock when initialising the
>>> implementation as well as when destroying the object at thread-exit time.
>>>
>>> Second, having uninitialised buffer space makes the initialisation of
>>> the memory trivial. We do want uninitialised bytes because we don't need to
>>> zero out the bytes, since we're going to initialize it once using placement
>>> new anyway.
>>>
>>
>> I would imagine the union's probably uninitialized too?
>>
>>
>>
>> The docs on std::aligned_union suggest you're right. But now it does look
>> a little weird that it's a union with just one type.
>>
>
> *nod* whichever seems better - just a thought.
>
>
>>
>>
>>> So unfortunately, this is the solution that works and fits within the
>>> requirements we laid out. Updated with a bit more documentation on
>>> specifically why we're using an uninitialised aligned buffer. :)
>>>
>>> Removing the dependency on standard C++ library things that have runtime
>>> issues I think is a good goal, but is orthogonal to this particular change
>>> (since really we're just avoiding the use of `thread_local` non-trivially
>>> destructible objects).
>>>
>>
>> Is that philosophy consistent with the rest of compiler-rt? Picking bits
>> of the standard library that are known/expected not to allocate or use
>> other 'interesting' functions?
>>
>>
>>
>> I'm not sure whether that's written down somewhere that I can read as
>> policy (I sure haven't seen anything to that effect) or whether that's
>> something I'd need to infer myself.
>>
>
> Infer to some degree, but probably also talk to the rest of the folks
> (sanitizers) in compiler-rt (I'd suggest you start with Kostya) to have a
> good idea of the goals, gotchas, etc, and maybe write it down somewhere.
>

Agreed. Probably not in the scope of this patch? :)

>
>
>> For XRay, it's one of convenience -- if we can use something in the
>> standard library for making things simpler, i.e. like using the algorithms
>> and some utility types, then that should be fair game.
>>
>> Systematically removing the dependency on containers or things that have
>> runtime components seem to be the minimum required to make this work
>> without severely limiting the available tools we have from the STL. This
>> change doesn't involve much of the STL aside from the aligned storage
>> types. The avoidance of nontrivially destructible thread_local objects has
>> more to do with the deadlock avoidance than anything else.
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20170929/c244c6b8/attachment.html>