[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

Sun Sep 3 22:12:47 PDT 2017

On Sun, Sep 3, 2017 at 9:23 PM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> On 09/03/2017 11:06 PM, Xinliang David Li wrote:
>
> I think we can think this in another way.
>
> For modern CPU architectures which supports store forwarding with store
> queues, it is generally not "safe" to blindly do local optimizations to
> widen the load/stores
>
>
> Why not widen stores? Generally the problem with store forwarding is where
> the load is wider than the store (plus alignment issues).
>
>
True, but probably with some caveats which are target dependent.  Store
widening also requires additional bit operations (and possibly addition
load), so the it is tricky to model the the overall benefit.

> without sophisticated inter-procedural analysis. Doing so will run the
> risk of greatly reduce performance of some programs. Keep the naturally
> aligned load/store using its natural type is safer.
>
> Does it make sense?
>
>
> It makes sense. I could, however, say the same thing about inlining. We
> need to make inlining decisions locally, but they have global impact.
> Nevertheless, we need to make inlining decisions, and there's no practical
> way to make them in a truly non-local way.
>

Speaking of inlining, we are actually thinking of ways to make the
decisions more globally optimal, but that is off topic.

>
> We also don't pessimize common cases to improve performance in rare cases.
> In the common case, reducing pressure on the memory units, and reducing the
> critical path, seem likely to me to be optimal. If that's not true, or
> doing otherwise has negligible cost (but can prevent rare downsides), we
> should certainly consider those options.
>

Since we don't do load widening for non-bitfield cases (but the only the
very limited case of naturally aligned bitfields) so it is hard to say we
pessimize common cases for rare cases:

1) the upside doing widening such access is not high from experience with
other compiler (which does not do so)
2) there is obvious downside of introducing additional extract instructions
which degrades performance
3) there is obvious downside of severely degrading performance when store
forwarding is blocked.

> And none of this answers the question of whether it's better to have the
> store wider or the load split and narrow.
>

It seems safer to do store widening more aggressively to avoid store
forwarding stall issue, but doing this aggressively may also mean other
runtime overhead introduced (extra load, data merge etc).

Thanks,

David

>
> Thanks again,
> Hal
>
>
>
> David
>
>
>
> On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> On 09/03/2017 10:38 PM, Xinliang David Li wrote:
>>
>> Store forwarding stall cost is usually much higher compared with a load
>> hitting L1 cache. For instance, on Haswell,  the latter is ~4 cycles, while
>> the store forwarding stalls cost about 10 cycles more than a successful
>> store forwarding, which is roughly 15 cycles. In some scenarios, the store
>> forwarding stalls can be as high as 50 cycles. See Agner's documentation.
>>
>>
>> I understand. As I understand it, there are two potential ways to fix
>> this problem:
>>
>>  1. You can make the store wider (to match the size of the wide load,
>> thus permitting forwarding).
>>  2. You can make the load smaller (to match the size of the small store,
>> thus permitting forwarding).
>>
>> At least in this benchmark, which is a better solution?
>>
>> Thanks again,
>> Hal
>>
>>
>>
>> In other words, the optimizer needs to be taught to avoid defeating  the
>> HW pipeline feature as much as possible.
>>
>> David
>>
>> On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>>
>>> On 09/03/2017 03:44 PM, Wei Mi wrote:
>>>
>>>> On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>
>>>>> On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>>>>>
>>>>>> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <
>>>>>> davidxl at google.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>>>>>> <reviews at reviews.llvm.org> wrote:
>>>>>>>
>>>>>>>> chandlerc added a comment.
>>>>>>>>
>>>>>>>> I'm really not a fan of the degree of complexity and subtlety that
>>>>>>>> this
>>>>>>>> introduces into the frontend, all to allow particular backend
>>>>>>>> optimizations.
>>>>>>>>
>>>>>>>> I feel like this is Clang working around a fundamental deficiency in
>>>>>>>> LLVM
>>>>>>>> and we should instead find a way to fix this in LLVM itself.
>>>>>>>>
>>>>>>>> As has been pointed out before, user code can synthesize large
>>>>>>>> integers
>>>>>>>> that small bit sequences are extracted from, and Clang and LLVM
>>>>>>>> should
>>>>>>>> handle those just as well as actual bitfields.
>>>>>>>>
>>>>>>>> Can we see how far we can push the LLVM side before we add
>>>>>>>> complexity to
>>>>>>>> Clang here? I understand that there remain challenges to LLVM's
>>>>>>>> stuff,
>>>>>>>> but I
>>>>>>>> don't think those challenges make *all* of the LLVM improvements
>>>>>>>> off the
>>>>>>>> table, I don't think we've exhausted all ways of improving the LLVM
>>>>>>>> changes
>>>>>>>> being proposed, and I think we should still land all of those and
>>>>>>>> re-evaluate how important these issues are when all of that is in
>>>>>>>> place.
>>>>>>>>
>>>>>>>
>>>>>>> The main challenge of doing  this in LLVM is that inter-procedural
>>>>>>> analysis
>>>>>>> (and possibly cross module) is needed (for store forwarding issues).
>>>>>>>
>>>>>>> Wei, perhaps you can provide concrete test case to illustrate the
>>>>>>> issue
>>>>>>> so
>>>>>>> that reviewers have a good understanding.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>> Here is a runable testcase:
>>>>>> -------------------- 1.cc ------------------------
>>>>>> class A {
>>>>>> public:
>>>>>>     unsigned long f1:2;
>>>>>>     unsigned long f2:6;
>>>>>>     unsigned long f3:8;
>>>>>>     unsigned long f4:4;
>>>>>> };
>>>>>> A a;
>>>>>> unsigned long b;
>>>>>> unsigned long N = 1000000000;
>>>>>>
>>>>>> __attribute__((noinline))
>>>>>> void foo() {
>>>>>>     a.f3 = 3;
>>>>>> }
>>>>>>
>>>>>> __attribute__((noinline))
>>>>>> void goo() {
>>>>>>     b = a.f3;
>>>>>> }
>>>>>>
>>>>>> int main() {
>>>>>>     unsigned long i;
>>>>>>     for (i = 0; i < N; i++) {
>>>>>>       foo();
>>>>>>       goo();
>>>>>>     }
>>>>>> }
>>>>>> ------------------------------------------------------------
>>>>>> Now trunk takes about twice running time compared with trunk + this
>>>>>> patch. That is because trunk shrinks the store of a.f3 in foo (Done by
>>>>>> DagCombiner) but not shrink the load of a.f3 in goo, so store
>>>>>> forwarding will be blocked.
>>>>>>
>>>>>
>>>>> I can confirm that this is true on Haswell and also on an POWER8.
>>>>> Interestingly, on a POWER7, the performance is the same either way (on
>>>>> the
>>>>> good side). I ran the test case as presented and where I replaced f3
>>>>> with a
>>>>> non-bitfield unsigned char member. Thinking that the POWER7 result
>>>>> might be
>>>>> because it's big-Endian, I flipped the order of the fields, and found
>>>>> that
>>>>> the version where f3 is not a bitfield is faster than otherwise, but
>>>>> only by
>>>>> 12.5%.
>>>>>
>>>>> Why, in this case, don't we shrink the load? It seems like we should
>>>>> (and it
>>>>> seems like a straightforward case).
>>>>>
>>>>> Thanks again,
>>>>> Hal
>>>>>
>>>>> Hal, thanks for trying the test.
>>>>
>>>> Yes, it is straightforward to shrink the load in the test. I can
>>>> change the testcase a little to show why it is sometime difficult to
>>>> shrink the load:
>>>>
>>>> class A {
>>>> public:
>>>>    unsigned long f1:16;
>>>>    unsigned long f2:16;
>>>>    unsigned long f3:16;
>>>>    unsigned long f4:8;
>>>> };
>>>> A a;
>>>> bool b;
>>>> unsigned long N = 1000000000;
>>>>
>>>> __attribute__((noinline))
>>>> void foo() {
>>>>    a.f4 = 3;
>>>> }
>>>>
>>>> __attribute__((noinline))
>>>> void goo() {
>>>>    b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
>>>> }
>>>>
>>>> int main() {
>>>>    unsigned long i;
>>>>    for (i = 0; i < N; i++) {
>>>>      foo();
>>>>      goo();
>>>>    }
>>>> }
>>>>
>>>> For the load a.f4 in goo, it is diffcult to motivate its shrink after
>>>> instcombine because the comparison with a.f3 and the comparison with
>>>> a.f4 are merged:
>>>>
>>>> define void @_Z3goov() local_unnamed_addr #0 {
>>>>    %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
>>>>    %2 = and i64 %1, 0xffffff00000000
>>>>    %3 = icmp eq i64 %2, 0xffff00000000
>>>>    %4 = zext i1 %3 to i8
>>>>    store i8 %4, i8* @b, align 1, !tbaa !2
>>>>    ret void
>>>> }
>>>>
>>>
>>> Exactly. But this behavior is desirable, locally. There's no good answer
>>> here: We either generate extra load traffic here (because we need to load
>>> the fields separately), or we widen the store (which generates extra load
>>> traffic there). Do you know, in terms of performance, which is better in
>>> this case (i.e., is it better to widen the store or split the load)?
>>>
>>>  -Hal
>>>
>>>
>>>
>>>> Thanks,
>>>> Wei.
>>>>
>>>> The testcases shows the potential problem of store shrinking. Before
>>>>>> we decide to do store shrinking, we need to know all the related loads
>>>>>> will be shrunk,  and that requires IPA analysis. Otherwise, when load
>>>>>> shrinking was blocked for some difficult case (Like the instcombine
>>>>>> case described in
>>>>>> https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html
>>>>>> ),
>>>>>> performance regression will happen.
>>>>>>
>>>>>> Wei.
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> Repository:
>>>>>>>>     rL LLVM
>>>>>>>>
>>>>>>>> https://reviews.llvm.org/D36562
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>> llvm-commits mailing list
>>>>>> llvm-commits at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>>>>
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Lead, Compiler Technology and Programming Languages
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20170903/6fa84f68/attachment-0001.html>