[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type
Hal Finkel via cfe-commits
cfe-commits at lists.llvm.org
Sun Sep 3 22:46:15 PDT 2017
On 09/03/2017 11:22 PM, Wei Mi wrote:
> On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> On 09/03/2017 10:38 PM, Xinliang David Li wrote:
>>
>> The cost of a store-forwarding stall is usually much higher than that of
>> a load hitting the L1 cache. For instance, on Haswell, the latter is ~4
>> cycles, while a store-forwarding stall costs about 10 cycles more than a
>> successful store forward, i.e., roughly 15 cycles in total. In some
>> scenarios, store-forwarding stalls can cost as much as 50 cycles. See
>> Agner Fog's documentation.
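>>
>> As a sketch of the bad pattern (a hand-written example, not from the
>> benchmark): the stall occurs when a narrow store is followed by a wider,
>> overlapping load while the store is still in the store buffer:
>>
>>   unsigned long x;
>>   void blocked() {
>>     *(unsigned char *)&x = 1;   // 1-byte store into x
>>     unsigned long v = x;        // 8-byte load overlapping the store;
>>     (void)v;                    // forwarding fails and the load stalls
>>   }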
>>
>>
>> I understand. As I see it, there are two potential ways to fix this
>> problem:
>>
>> 1. You can make the store wider (to match the size of the wide load, thus
>> permitting forwarding).
>> 2. You can make the load smaller (to match the size of the small store,
>> thus permitting forwarding).
>>
>> At least in this benchmark, which is a better solution?
>>
>> Thanks again,
>> Hal
>>
> For this benchmark, the smaller load is better. On my Sandy Bridge
> desktop, the wider store runs in 3.77s and the smaller load in 3.45s. If
> store forwarding is blocked, it takes 6.9s.
>
> However, we don't have a good way to narrow the load to match the store
> shrinking, because the field information has been lost. For the IR
> below:
>
> define void @_Z3goov() local_unnamed_addr #0 {
>   %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
>   %2 = and i64 %1, 72057589742960640      ; 0x00ffffff00000000
>   %3 = icmp eq i64 %2, 281470681743360    ; 0x0000ffff00000000
>   %4 = zext i1 %3 to i8
>   store i8 %4, i8* @b, align 1, !tbaa !2
>   ret void
> }
>
> We know that the 24-bit range from bit 32 to bit 56 of @a is accessed,
> but we don't know whether that range contains 8-bit + 16-bit bitfields,
> 16-bit + 8-bit bitfields, or 8-bit + 8-bit + 8-bit bitfields. If the
> load shrinking done locally is inconsistent with the store shrinking, we
> hit the store-forwarding problem and suffer a huge regression.
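>
> As a sketch (with made-up field names), all of these layouts can compile
> to the same load-and-mask of bits 32-55, so the field boundaries cannot
> be recovered from the IR alone:
>
>   struct B1 { unsigned long pad:32, x:8,  y:16; };       // 8 + 16
>   struct B2 { unsigned long pad:32, x:16, y:8; };        // 16 + 8
>   struct B3 { unsigned long pad:32, x:8,  y:8, z:8; };   // 8 + 8 + 8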
Understood. This is a convincing argument. The cost of splitting the
loads does not seem high, at least not in isolation.
Thanks again,
Hal
>
> Thanks,
> Wei.
>
>
>
>>
>> In other words, the optimizer needs to be taught to avoid defeating the HW
>> pipeline feature as much as possible.
>>
>> David
>>
>> On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>> On 09/03/2017 03:44 PM, Wei Mi wrote:
>>>> On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>> On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>>>>>> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davidxl at google.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>>>>>> <reviews at reviews.llvm.org> wrote:
>>>>>>>> chandlerc added a comment.
>>>>>>>>
>>>>>>>> I'm really not a fan of the degree of complexity and subtlety that
>>>>>>>> this introduces into the frontend, all to enable particular backend
>>>>>>>> optimizations.
>>>>>>>>
>>>>>>>> I feel like this is Clang working around a fundamental deficiency in
>>>>>>>> LLVM, and we should instead find a way to fix this in LLVM itself.
>>>>>>>>
>>>>>>>> As has been pointed out before, user code can synthesize large
>>>>>>>> integers that small bit sequences are extracted from, and Clang and
>>>>>>>> LLVM should handle those just as well as actual bitfields.
>>>>>>>>
>>>>>>>> Can we see how far we can push the LLVM side before we add
>>>>>>>> complexity to Clang here? I understand that there remain challenges
>>>>>>>> on the LLVM side, but I don't think those challenges take *all* of
>>>>>>>> the LLVM improvements off the table, I don't think we've exhausted
>>>>>>>> all ways of improving the LLVM changes being proposed, and I think
>>>>>>>> we should still land all of those and re-evaluate how important
>>>>>>>> these issues are when all of that is in place.
>>>>>>>
>>>>>>> The main challenge of doing this in LLVM is that inter-procedural
>>>>>>> (and possibly cross-module) analysis is needed (for store-forwarding
>>>>>>> issues).
>>>>>>>
>>>>>>> Wei, perhaps you can provide a concrete test case to illustrate the
>>>>>>> issue so that reviewers have a good understanding.
>>>>>>>
>>>>>>> David
>>>>>> Here is a runnable test case:
>>>>>> -------------------- 1.cc ------------------------
>>>>>> class A {
>>>>>> public:
>>>>>>   unsigned long f1:2;
>>>>>>   unsigned long f2:6;
>>>>>>   unsigned long f3:8;
>>>>>>   unsigned long f4:4;
>>>>>> };
>>>>>> A a;
>>>>>> unsigned long b;
>>>>>> unsigned long N = 1000000000;
>>>>>>
>>>>>> __attribute__((noinline))
>>>>>> void foo() {
>>>>>>   a.f3 = 3;
>>>>>> }
>>>>>>
>>>>>> __attribute__((noinline))
>>>>>> void goo() {
>>>>>>   b = a.f3;
>>>>>> }
>>>>>>
>>>>>> int main() {
>>>>>>   unsigned long i;
>>>>>>   for (i = 0; i < N; i++) {
>>>>>>     foo();
>>>>>>     goo();
>>>>>>   }
>>>>>> }
>>>>>> ------------------------------------------------------------
>>>>>> Trunk now takes about twice the running time of trunk + this patch,
>>>>>> because trunk shrinks the store of a.f3 in foo (done by the
>>>>>> DAGCombiner) but does not shrink the load of a.f3 in goo, so store
>>>>>> forwarding is blocked.
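>>>>>>
>>>>>> Roughly, in source terms, this is what runs after the store is shrunk
>>>>>> but the load is not (a little-endian sketch, not the actual codegen):
>>>>>>
>>>>>>   void foo() {
>>>>>>     ((unsigned char *)&a)[1] = 3;           // 1-byte store: f3 is bits 8-15
>>>>>>   }
>>>>>>   void goo() {
>>>>>>     unsigned long v = *(unsigned long *)&a; // 8-byte load overlapping the
>>>>>>     b = (v >> 8) & 0xff;                    // narrower store: forwarding fails
>>>>>>   }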
>>>>>
>>>>> I can confirm that this is true on Haswell and also on a POWER8.
>>>>> Interestingly, on a POWER7, the performance is the same either way (on
>>>>> the good side). I ran the test case as presented, and also with f3
>>>>> replaced by a non-bitfield unsigned char member. Thinking that the
>>>>> POWER7 result might be because it's big-endian, I flipped the order of
>>>>> the fields, and found that the version where f3 is not a bitfield is
>>>>> faster than otherwise, but only by 12.5%.
>>>>>
>>>>> Why, in this case, don't we shrink the load? It seems like we should
>>>>> (and it
>>>>> seems like a straightforward case).
>>>>>
>>>>> Thanks again,
>>>>> Hal
>>>>>
>>>> Hal, thanks for trying the test.
>>>>
>>>> Yes, it is straightforward to shrink the load in this test. I can
>>>> change the test case a little to show why it is sometimes difficult to
>>>> shrink the load:
>>>>
>>>> class A {
>>>> public:
>>>>   unsigned long f1:16;
>>>>   unsigned long f2:16;
>>>>   unsigned long f3:16;
>>>>   unsigned long f4:8;
>>>> };
>>>> A a;
>>>> bool b;
>>>> unsigned long N = 1000000000;
>>>>
>>>> __attribute__((noinline))
>>>> void foo() {
>>>>   a.f4 = 3;
>>>> }
>>>>
>>>> __attribute__((noinline))
>>>> void goo() {
>>>>   b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
>>>> }
>>>>
>>>> int main() {
>>>>   unsigned long i;
>>>>   for (i = 0; i < N; i++) {
>>>>     foo();
>>>>     goo();
>>>>   }
>>>> }
>>>>
>>>> For the load of a.f4 in goo, it is difficult to motivate shrinking it
>>>> after instcombine, because the comparison with a.f3 and the comparison
>>>> with a.f4 have been merged:
>>>>
>>>> define void @_Z3goov() local_unnamed_addr #0 {
>>>>   %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
>>>>   %2 = and i64 %1, 72057589742960640      ; 0x00ffffff00000000
>>>>   %3 = icmp eq i64 %2, 281470681743360    ; 0x0000ffff00000000
>>>>   %4 = zext i1 %3 to i8
>>>>   store i8 %4, i8* @b, align 1, !tbaa !2
>>>>   ret void
>>>> }
>>>
>>> Exactly. But this behavior is desirable, locally. There's no good answer
>>> here: We either generate extra load traffic here (because we need to load
>>> the fields separately), or we widen the store (which generates extra load
>>> traffic there). Do you know, in terms of performance, which is better in
>>> this case (i.e., is it better to widen the store or split the load)?
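>>>
>>> (In source terms, splitting the load would mean roughly
>>>
>>>   b = ((unsigned char *)&a)[6] == 0             // f4: byte 6 (little-endian)
>>>       && ((unsigned short *)&a)[2] == 0xffff;   // f3: bytes 4-5
>>>
>>> while widening the store would keep the single 8-byte load in goo and
>>> turn foo's store into an 8-byte read-modify-write.)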
>>>
>>> -Hal
>>>
>>>
>>>> Thanks,
>>>> Wei.
>>>>
>>>>>> The test case shows the potential problem with store shrinking:
>>>>>> before we decide to shrink a store, we need to know that all the
>>>>>> related loads will be shrunk too, and that requires IPA analysis.
>>>>>> Otherwise, when load shrinking is blocked in some difficult case
>>>>>> (like the instcombine case described in
>>>>>> https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
>>>>>> a performance regression will happen.
>>>>>>
>>>>>> Wei.
>>>>>>
>>>>>>
>>>>>>>> Repository:
>>>>>>>> rL LLVM
>>>>>>>>
>>>>>>>> https://reviews.llvm.org/D36562
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory