[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

Sat Sep 2 18:04:17 PDT 2017

On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davidxl at google.com> wrote:
>>
>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>> <reviews at reviews.llvm.org> wrote:
>>> chandlerc added a comment.
>>>
>>> I'm really not a fan of the degree of complexity and subtlety that this
>>> introduces into the frontend, all to allow particular backend optimizations.
>>>
>>> I feel like this is Clang working around a fundamental deficiency in LLVM
>>> and we should instead find a way to fix this in LLVM itself.
>>>
>>> As has been pointed out before, user code can synthesize large integers
>>> that small bit sequences are extracted from, and Clang and LLVM should
>>> handle those just as well as actual bitfields.
>>>
>>> Can we see how far we can push the LLVM side before we add complexity to
>>> Clang here? I understand that there remain challenges to LLVM's stuff, but I
>>> don't think those challenges make *all* of the LLVM improvements off the
>>> table, I don't think we've exhausted all ways of improving the LLVM changes
>>> being proposed, and I think we should still land all of those and
>>> re-evaluate how important these issues are when all of that is in place.
>>
>> The main challenge of doing  this in LLVM is that inter-procedural analysis
>> (and possibly cross module) is needed (for store forwarding issues).
>>
>> Wei, perhaps you can provide concrete test case to illustrate the issue so
>> that reviewers have a good understanding.
>>
>> David
> Here is a runable testcase:
> -------------------- 1.cc ------------------------
> class A {
> public:
>    unsigned long f1:2;
>    unsigned long f2:6;
>    unsigned long f3:8;
>    unsigned long f4:4;
> };
> A a;
> unsigned long b;
> unsigned long N = 1000000000;
>
> __attribute__((noinline))
> void foo() {
>    a.f3 = 3;
> }
>
> __attribute__((noinline))
> void goo() {
>    b = a.f3;
> }
>
> int main() {
>    unsigned long i;
>    for (i = 0; i < N; i++) {
>      foo();
>      goo();
>    }
> }
> ------------------------------------------------------------
> Now trunk takes about twice running time compared with trunk + this
> patch. That is because trunk shrinks the store of a.f3 in foo (Done by
> DagCombiner) but not shrink the load of a.f3 in goo, so store
> forwarding will be blocked.

I can confirm that this is true on Haswell and also on an POWER8. 
Interestingly, on a POWER7, the performance is the same either way (on 
the good side). I ran the test case as presented and where I replaced f3 
with a non-bitfield unsigned char member. Thinking that the POWER7 
result might be because it's big-Endian, I flipped the order of the 
fields, and found that the version where f3 is not a bitfield is faster 
than otherwise, but only by 12.5%.

Why, in this case, don't we shrink the load? It seems like we should 
(and it seems like a straightforward case).

Thanks again,
Hal

>
> The testcases shows the potential problem of store shrinking. Before
> we decide to do store shrinking, we need to know all the related loads
> will be shrunk,  and that requires IPA analysis. Otherwise, when load
> shrinking was blocked for some difficult case (Like the instcombine
> case described in
> https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
> performance regression will happen.
>
> Wei.
>
>
>>>
>>>
>>> Repository:
>>>    rL LLVM
>>>
>>> https://reviews.llvm.org/D36562
>>>
>>>
>>>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory