[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

Hal Finkel via llvm-commits llvm-commits at lists.llvm.org
Sun Sep 3 22:46:15 PDT 2017


On 09/03/2017 11:22 PM, Wei Mi wrote:
> On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> On 09/03/2017 10:38 PM, Xinliang David Li wrote:
>>
>> A store-forwarding stall usually costs much more than a load hitting the
>> L1 cache. For instance, on Haswell the latter is ~4 cycles, while a
>> store-forwarding stall costs about 10 cycles more than a successful
>> store forward, roughly 15 cycles in total. In some scenarios the stall
>> can be as high as 50 cycles. See Agner Fog's documentation.
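
To make the hazard concrete, here is a minimal illustrative sketch (not code
from this review; the function name is made up): the pattern that blocks store
forwarding is a narrow store followed by a wider load of overlapping memory,
which cannot be served from the store buffer.

  // Illustrative sketch only: a 1-byte store followed by an 8-byte load of
  // the same location. The wide load overlaps, but is not covered by, the
  // pending narrow store, so it cannot be forwarded and the pipeline stalls.
  #include <cstdint>

  uint64_t narrow_store_wide_load(uint64_t *p) {
    *reinterpret_cast<uint8_t *>(p) = 3;  // narrow store (1 byte)
    return *p;                            // wide load (8 bytes): forwarding blocked
  }
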
>>
>>
>> Understood. As I see it, there are two potential ways to fix this
>> problem:
>>
>>   1. You can make the store wider (to match the size of the wide load, thus
>> permitting forwarding).
>>   2. You can make the load smaller (to match the size of the small store,
>> thus permitting forwarding).
>>
>> At least in this benchmark, which is a better solution?
>>
>> Thanks again,
>> Hal
>>
> For this benchmark, the smaller load is better. On my Sandy Bridge
> desktop, the wider store takes 3.77s and the smaller load 3.45s; when
> store forwarding is blocked, it takes 6.9s.
>
> However, we have no good way to narrow the load to match the store
> shrinking, because the bitfield layout information has been lost by this
> point. Consider the IR below:
>
> define void @_Z3goov() local_unnamed_addr #0 {
>    %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
>    %2 = and i64 %1, 72057589742960640   ; 0xFFFFFF00000000
>    %3 = icmp eq i64 %2, 281470681743360 ; 0xFFFF00000000
>    %4 = zext i1 %3 to i8
>    store i8 %4, i8* @b, align 1, !tbaa !2
>    ret void
> }
>
> We know that the 24 bits from bit 32 through bit 55 of @a are accessed,
> but we don't know whether that range holds an 8-bit + 16-bit split, a
> 16-bit + 8-bit split, or an 8-bit + 8-bit + 8-bit split of bitfields.
> Once the load shrinking done locally is inconsistent with the store
> shrinking, we hit the store-forwarding stall and suffer a huge
> regression.
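
For illustration (hypothetical layouts, not from the review), all of the
following classes place different field boundaries inside the same 24-bit
range, yet produce the same merged mask/compare IR, so the IR alone cannot
tell us how to split the load back into field-sized pieces:

  // Hypothetical example: three different layouts of the same 24 bits
  // (bit 32 through bit 55). After InstCombine merges the comparisons, the
  // IR for each of these looks identical, so a per-field load shrinking
  // that matches the store shrinking cannot be recovered from the IR.
  struct B1 { unsigned long pad:32, x:8,  y:16, rest:8; };
  struct B2 { unsigned long pad:32, x:16, y:8,  rest:8; };
  struct B3 { unsigned long pad:32, x:8,  y:8,  z:8, rest:8; };
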

Understood. This is a convincing argument. The cost of splitting the
loads does not seem high, at least not in isolation.

Thanks again,
Hal

>
> Thanks,
> Wei.
>
>
>
>>
>> In other words, the optimizer needs to be taught to avoid defeating the
>> hardware store-forwarding feature of the pipeline as much as possible.
>>
>> David
>>
>> On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>> On 09/03/2017 03:44 PM, Wei Mi wrote:
>>>> On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>> On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>>>>>> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davidxl at google.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>>>>>> <reviews at reviews.llvm.org> wrote:
>>>>>>>> chandlerc added a comment.
>>>>>>>>
>>>>>>>> I'm really not a fan of the degree of complexity and subtlety that this
>>>>>>>> introduces into the frontend, all to allow particular backend
>>>>>>>> optimizations.
>>>>>>>>
>>>>>>>> I feel like this is Clang working around a fundamental deficiency in
>>>>>>>> LLVM and we should instead find a way to fix this in LLVM itself.
>>>>>>>>
>>>>>>>> As has been pointed out before, user code can synthesize large integers
>>>>>>>> that small bit sequences are extracted from, and Clang and LLVM should
>>>>>>>> handle those just as well as actual bitfields.
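
As a hypothetical illustration of that point (not code from this review; the
names are made up), hand-written extraction like the following has the same
access pattern as a bitfield load and should get the same treatment:

  // Hypothetical example: a hand-packed "bitfield" in a plain integer.
  // Extracting 8 bits at bit offset 8 mirrors a real bitfield load.
  #include <cstdint>

  uint64_t packed;  // stand-in for a manually packed set of fields

  uint8_t load_field() {
    return static_cast<uint8_t>((packed >> 8) & 0xff);
  }
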
>>>>>>>>
>>>>>>>> Can we see how far we can push the LLVM side before we add complexity
>>>>>>>> to Clang here? I understand that challenges remain on the LLVM side,
>>>>>>>> but I don't think those challenges take *all* of the LLVM improvements
>>>>>>>> off the table, I don't think we've exhausted all ways of improving the
>>>>>>>> LLVM changes being proposed, and I think we should still land all of
>>>>>>>> those and re-evaluate how important these issues are when all of that
>>>>>>>> is in place.
>>>>>>>
>>>>>>> The main challenge of doing this in LLVM is that interprocedural
>>>>>>> (and possibly cross-module) analysis is needed for the store-forwarding
>>>>>>> issue.
>>>>>>>
>>>>>>> Wei, perhaps you can provide a concrete test case to illustrate the
>>>>>>> issue so that reviewers have a good understanding.
>>>>>>>
>>>>>>> David
>>>>>> Here is a runnable test case:
>>>>>> -------------------- 1.cc ------------------------
>>>>>> class A {
>>>>>> public:
>>>>>>      unsigned long f1:2;
>>>>>>      unsigned long f2:6;
>>>>>>      unsigned long f3:8;
>>>>>>      unsigned long f4:4;
>>>>>> };
>>>>>> A a;
>>>>>> unsigned long b;
>>>>>> unsigned long N = 1000000000;
>>>>>>
>>>>>> __attribute__((noinline))
>>>>>> void foo() {
>>>>>>      a.f3 = 3;
>>>>>> }
>>>>>>
>>>>>> __attribute__((noinline))
>>>>>> void goo() {
>>>>>>      b = a.f3;
>>>>>> }
>>>>>>
>>>>>> int main() {
>>>>>>      unsigned long i;
>>>>>>      for (i = 0; i < N; i++) {
>>>>>>        foo();
>>>>>>        goo();
>>>>>>      }
>>>>>> }
>>>>>> ------------------------------------------------------------
>>>>>> Now trunk takes about twice the running time of trunk + this patch.
>>>>>> That is because trunk shrinks the store of a.f3 in foo (done by the
>>>>>> DAGCombiner) but does not shrink the load of a.f3 in goo, so store
>>>>>> forwarding is blocked.
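
Roughly speaking, the resulting memory accesses look like the sketch below
(illustrative only, assuming the usual little-endian x86-64 bitfield layout;
the function names are made up, and this is not actual compiler output):

  // Illustrative sketch of the mismatched access sizes:
  // f1:2 + f2:6 fill byte 0, so f3:8 is exactly byte 1 of the 64-bit word.
  #include <cstdint>
  #include <cstring>

  uint64_t word;  // stand-in for the storage of 'a'

  void foo_shrunk_store() {
    uint8_t v = 3;
    std::memcpy(reinterpret_cast<char *>(&word) + 1, &v, 1);  // 1-byte store
  }

  uint64_t goo_wide_load() {
    return (word >> 8) & 0xff;  // still an 8-byte load: forwarding blocked
  }
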
>>>>>
>>>>> I can confirm that this is true on Haswell and also on a POWER8.
>>>>> Interestingly, on a POWER7 the performance is the same either way (on
>>>>> the good side). I ran the test case as presented, and also a version
>>>>> where I replaced f3 with a non-bitfield unsigned char member. Thinking
>>>>> that the POWER7 result might be because it's big-endian, I flipped the
>>>>> order of the fields and found that the version where f3 is not a
>>>>> bitfield is faster than the bitfield version, but only by 12.5%.
>>>>>
>>>>> Why, in this case, don't we shrink the load? It seems like we should
>>>>> (and it seems like a straightforward case).
>>>>>
>>>>> Thanks again,
>>>>> Hal
>>>>>
>>>> Hal, thanks for trying the test.
>>>>
>>>> Yes, it is straightforward to shrink the load in that test. I can
>>>> change the test case a little to show why it is sometimes difficult to
>>>> shrink the load:
>>>>
>>>> class A {
>>>> public:
>>>>     unsigned long f1:16;
>>>>     unsigned long f2:16;
>>>>     unsigned long f3:16;
>>>>     unsigned long f4:8;
>>>> };
>>>> A a;
>>>> bool b;
>>>> unsigned long N = 1000000000;
>>>>
>>>> __attribute__((noinline))
>>>> void foo() {
>>>>     a.f4 = 3;
>>>> }
>>>>
>>>> __attribute__((noinline))
>>>> void goo() {
>>>>     b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
>>>> }
>>>>
>>>> int main() {
>>>>     unsigned long i;
>>>>     for (i = 0; i < N; i++) {
>>>>       foo();
>>>>       goo();
>>>>     }
>>>> }
>>>>
>>>> For the load of a.f4 in goo, it is difficult to shrink it after
>>>> InstCombine, because the comparison with a.f3 and the comparison with
>>>> a.f4 have been merged:
>>>>
>>>> define void @_Z3goov() local_unnamed_addr #0 {
>>>>     %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
>>>>     %2 = and i64 %1, 72057589742960640   ; 0xFFFFFF00000000
>>>>     %3 = icmp eq i64 %2, 281470681743360 ; 0xFFFF00000000
>>>>     %4 = zext i1 %3 to i8
>>>>     store i8 %4, i8* @b, align 1, !tbaa !2
>>>>     ret void
>>>> }
>>>
>>> Exactly. But this behavior is desirable, locally. There's no good answer
>>> here: we either generate extra load traffic in goo (because we need to
>>> load the fields separately), or we widen the store in foo (which
>>> generates extra load traffic there, since the bitfield store becomes a
>>> read-modify-write of the wider word). Do you know, in terms of
>>> performance, which is better in this case (i.e., is it better to widen
>>> the store or split the load)?
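
As a hypothetical source-level sketch of the two options for this example
(illustrative only, not from the review; the function names are made up, and
the byte offsets assume the usual little-endian x86-64 layout where f3 is
bytes 4-5 and f4 is byte 6):

  #include <cstdint>
  #include <cstring>

  uint64_t word;  // stand-in for the storage of 'a'

  // Option 1: widen the store in foo to the full word (a read-modify-write),
  // so the existing wide load in goo can be forwarded.
  void foo_widened_store() {
    uint64_t w = word;                          // extra load traffic here
    w = (w & ~(0xffULL << 48)) | (3ULL << 48);  // set f4 = 3
    word = w;                                   // 8-byte store
  }

  // Option 2: keep the narrow store in foo and split the load in goo into
  // field-sized pieces, each of which can forward from the store buffer.
  bool goo_split_loads() {
    uint8_t f4;
    uint16_t f3;
    std::memcpy(&f4, reinterpret_cast<const char *>(&word) + 6, 1);
    std::memcpy(&f3, reinterpret_cast<const char *>(&word) + 4, 2);
    return f4 == 0 && f3 == 0xffff;
  }
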
>>>
>>>   -Hal
>>>
>>>
>>>> Thanks,
>>>> Wei.
>>>>
>>>>>> The test case shows the potential problem with store shrinking. Before
>>>>>> we decide to shrink a store, we need to know that all the related loads
>>>>>> will be shrunk as well, and that requires interprocedural analysis.
>>>>>> Otherwise, when load shrinking is blocked in some difficult case (like
>>>>>> the InstCombine case described in
>>>>>> https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
>>>>>> a performance regression will occur.
>>>>>>
>>>>>> Wei.
>>>>>>
>>>>>>
>>>>>>>> Repository:
>>>>>>>>      rL LLVM
>>>>>>>>
>>>>>>>> https://reviews.llvm.org/D36562
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


