[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

Xinliang David Li via llvm-commits llvm-commits at lists.llvm.org
Mon Sep 4 11:12:05 PDT 2017


On Mon, Sep 4, 2017 at 9:17 AM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> On 09/04/2017 03:57 AM, Chandler Carruth wrote:
>
> On Sun, Sep 3, 2017 at 10:41 PM Hal Finkel via llvm-commits <
> llvm-commits at lists.llvm.org> wrote:
>
>> Nevertheless, I think that you've convinced me that this is a least-bad
>> solution. I'll want a flag preserving the old behavior. Something like
>> -fwiden-bitfield-accesses (modulo bikeshedding).
>>
> Several different approaches have been discussed in this thread; I'm not
> sure which one you mean by "least-bad solution"...
>
> I remain completely unconvinced that we should change the default
> behavior. At most, I'm not strongly opposed to adding an attribute that
> indicates "please try to use narrow loads for this bitfield member" and is
> an error if applied to a misaligned or non-byte-sized bitfield member.
>
>
> I like this solution too (modulo the fact that I dislike all of these
> solutions). Restricting this to affect only the loads, and not the stores,
> is also an interesting thought. The attribute could also be placed on the
> access itself (which, at least from the theoretical perspective, I'd
> prefer).
>
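> To make the opt-in concrete, here is a minimal sketch of what it could look
> like at the source level. The attribute spelling "narrow_bitfield_loads" is
> purely hypothetical; no such Clang attribute exists today:
>
> struct S {
>   unsigned long a : 8;
>   // Hypothetical: request narrow loads for this member. Applying it to a
>   // misaligned or non-byte-sized member would be a compile-time error.
>   [[clang::narrow_bitfield_loads]] unsigned long b : 8;
> };
>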
> But I remain strongly opposed to changing the default behavior. We have
> one or two cases that regress and are easily addressed by source changes
> (either to not use bitfields or to use an attribute). I don't think that is
> cause to change the lowering Clang has used for years and potentially
> regress many other use cases.
>
>
> I have mixed feelings about all of the potential fixes here. To walk
> through my thoughts on this:
>
>  1. I don't like any solution that requires changes affecting source-level
> semantics. Something that the compiler does automatically is fine, as is an
> attribute.
>
>  2. Next, regarding the default behavior, we have a trade-off:
>
>    A. Breaking apart the accesses, as proposed here, falls into the
> category of "generally, it makes everything a little bit slower." But it's
> worse than that, because it comes on a spectrum: I can easily construct
> variants of the provided benchmark in which the separate loads carry a
> severe performance penalty. For example:
>
> $ cat ~/tmp/3m.cpp
> class A {
> public:
> #ifdef BF
>   unsigned long f7:8;
>   unsigned long f6:8;
>   unsigned long f5:8;
>   unsigned long f4:8;
>   unsigned long f3:8;
>   unsigned long f2:8;
>   unsigned long f1:8;
>   unsigned long f0:8;
> #else
>   unsigned char f7;
>   unsigned char f6;
>   unsigned char f5;
>   unsigned char f4;
>   unsigned char f3;
>   unsigned char f2;
>   unsigned char f1;
>   unsigned char f0;
> #endif
> };
> A a;
> bool b;
> unsigned long N = 1000000000;
>
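> // A single bitfield store; the backend typically narrows it to an 8-bit store.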
> __attribute__((noinline))
> void foo() {
>   a.f4 = 3;
> }
>
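> // Reads all eight fields: a single wide 64-bit load (plus masking) under the
> // current lowering, or up to eight separate byte loads under the proposal.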
> __attribute__((noinline))
> void goo() {
>   b = (a.f0 == 0 && a.f1 == (unsigned char)-1 &&
>        a.f2 == 0 && a.f3 == 0 && a.f4 == 0 && a.f5 == 0 &&
>        a.f6 == 0 && a.f7 == (unsigned char)-1);
> }
>
> int main() {
>   unsigned long i;
>   foo();
>   for (i = 0; i < N; i++) {
>     goo();
>   }
> }
>
>     Run this and you'll find that our current scheme, on Haswell, beats
> the separate-loads scheme by nearly 60% (2.77s for separate loads vs.
> 1.75s for the current bitfield lowering).
>
>     So, how common is it to have a bitfield with a large number of fields
> that could be loaded separately (i.e. have the right size and alignment)
> and have code that loads a large number of them like this (i.e. without
> other work to hide the relative costs)? I have no idea, but regardless,
> there is definitely a high-cost end to this spectrum.
>


I am not sure about the answer. The test case itself is a little extreme, and
there are many ways to change it slightly to tilt the balance. For instance:
1) add one more field with a store-forwarding issue;
2) reduce the number of tested fields in goo() by two or three or more;
3) add two or three more field stores in foo();
4) store a non-zero value to a.f0 so that short-circuiting can happen without
load widening (see the sketch after this list); or
5) add more stores in foo() and enable inlining and more aggressive
constant/copy propagation.
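
As a concrete (untested) sketch of variant 4, reusing the definitions from
the benchmark above: if foo() leaves a non-zero value in a.f0, the "&&"
chain in goo() short-circuits after the first comparison, so the
separate-loads scheme touches only a single byte and the wide load loses
its advantage:

__attribute__((noinline))
void foo() {
  // Variant 4 (illustrative): a.f0 != 0 makes "a.f0 == 0" in goo() false,
  // so the remaining seven comparisons (and their loads) are skipped.
  a.f0 = 1;
}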




>
>   B. However, our current scheme can trigger expensive architectural
> hazards. Moreover, there's no local after-the-fact analysis that can fix
> this consistently. I think that Wei has convincingly demonstrated both of
> these things. How common is this? I have no idea. More specifically, how do
> the relative costs of hitting these hazards compare to the costs of the
> increased number of loads under the proposed scheme? I have no idea (and
> this certainly has a workload-dependent answer).
>
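> To make the hazard concrete, a minimal sketch (the struct and field names
> are illustrative, not from the patch). The backend narrows bitfield stores
> after the fact, but the loads stay wide, so a wide load that partially
> overlaps an in-flight narrow store cannot be forwarded the data:
>
> struct Flags {
>   unsigned long mode  : 8;
>   unsigned long state : 8;
>   unsigned long rest  : 48;
> };
>
> void set(Flags &f) {
>   // Emitted as a load/modify/store of the 64-bit unit; the backend later
>   // narrows the store to a single byte.
>   f.mode = 3;
> }
>
> bool get(const Flags &f) {
>   // Remains a 64-bit load of the whole unit. If it executes soon after
>   // set(), it partially overlaps the narrowed store, store-to-load
>   // forwarding fails, and the load stalls.
>   return f.state == 1;
> }
>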
>  C. This situation seems unlikely to change in the future: it seems like a
> physics problem. The data surrounding the narrower store is simply not in
> the pipeline to be matched with the wider load. Keeping the data in the
> pipeline would have extra costs, perhaps significant. I'm guessing the
> basic structure of this hazard is here to stay.
>
>  D. In the long run, could this be a PGO-driven decision? Yes, and this
> seems optimal. It would depend on infrastructure enhancements, and
> critically, the hardware having the right performance counter(s) to sample.
>
> So, as to the question of what to do right now, I have two thoughts: 1)
> All of the solutions will be bad for someone. 2) Which is a least-bad
> default depends on the workload. Your colleagues seem to be asserting that,
> for Google, the separate loads are least bad (and, FWIW, you're more likely
> to have hot code like this than I am). This is definitely an issue on which
> reasonable people can disagree. In the end, I'll begrudgingly agree that
> this should be an empirical decision. We should have some
> flag/pragma/attribute/etc. to allow selection of the other method.
>

This is a good summary.

thanks,

David


>
> Regardless, we should continue to use the current lowering when a
> sanitizer is enabled.
>
> And, hey, the rest of cfe-dev, please feel free to express an opinion on
> the matter!
>
>  -Hal
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>