[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

Wei Mi via cfe-commits cfe-commits at lists.llvm.org
Sun Sep 3 13:44:41 PDT 2017


On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>>
>> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davidxl at google.com>
>> wrote:
>>>
>>>
>>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>> <reviews at reviews.llvm.org> wrote:
>>>>
>>>> chandlerc added a comment.
>>>>
>>>> I'm really not a fan of the degree of complexity and subtlety that this
>>>> introduces into the frontend, all to allow particular backend
>>>> optimizations.
>>>>
>>>> I feel like this is Clang working around a fundamental deficiency in
>>>> LLVM
>>>> and we should instead find a way to fix this in LLVM itself.
>>>>
>>>> As has been pointed out before, user code can synthesize large integers
>>>> that small bit sequences are extracted from, and Clang and LLVM should
>>>> handle those just as well as actual bitfields.
>>>>
>>>> Can we see how far we can push the LLVM side before we add complexity to
>>>> Clang here? I understand that there remain challenges on the LLVM side,
>>>> but I don't think those challenges take *all* of the LLVM improvements off
>>>> the table. I don't think we've exhausted all ways of improving the LLVM
>>>> changes being proposed, and I think we should still land all of those and
>>>> re-evaluate how important these issues are once all of that is in place.
>>>
>>>
>>> The main challenge of doing this in LLVM is that inter-procedural analysis
>>> (and possibly cross-module analysis) is needed (for store-forwarding
>>> issues).
>>>
>>> Wei, perhaps you can provide a concrete test case to illustrate the issue
>>> so that reviewers have a good understanding.
>>>
>>> David
>>
>> Here is a runnable testcase:
>> -------------------- 1.cc ------------------------
>> class A {
>> public:
>>    unsigned long f1:2;
>>    unsigned long f2:6;
>>    unsigned long f3:8;
>>    unsigned long f4:4;
>> };
>> A a;
>> unsigned long b;
>> unsigned long N = 1000000000;
>>
>> __attribute__((noinline))
>> void foo() {
>>    a.f3 = 3;
>> }
>>
>> __attribute__((noinline))
>> void goo() {
>>    b = a.f3;
>> }
>>
>> int main() {
>>    unsigned long i;
>>    for (i = 0; i < N; i++) {
>>      foo();
>>      goo();
>>    }
>> }
>> ------------------------------------------------------------
>> With trunk, this takes about twice the running time compared with trunk +
>> this patch. That is because trunk shrinks the store of a.f3 in foo (done
>> by the DAGCombiner) but does not shrink the load of a.f3 in goo, so store
>> forwarding is blocked.
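
To make the store-forwarding problem concrete, here is roughly the shape of
the code trunk ends up with for foo() and goo(). This is a hand-written IR
sketch, not actual compiler output (the store narrowing really happens in the
SelectionDAG, so it only shows up in the generated machine code), and it
assumes the little-endian x86-64 layout, where f3 occupies byte 1 of @a:

; Hand-written sketch, not compiler output: the DAGCombiner narrows the
; bitfield store in foo() to a single byte store, but the load of the
; containing 64-bit word in goo() is not narrowed, so the narrow store
; cannot be forwarded to the wide load.
define void @_Z3foov() {
  ; store only byte 1 of @a, i.e. the 8 bits of f3
  store i8 3, i8* getelementptr (i8, i8* bitcast (%class.A* @a to i8*), i64 1), align 1
  ret void
}

define void @_Z3goov() {
  %w  = load i64, i64* bitcast (%class.A* @a to i64*), align 8   ; still a wide load
  %s  = lshr i64 %w, 8
  %f3 = and i64 %s, 255                                          ; extract f3
  store i64 %f3, i64* @b, align 8
  ret void
}
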
>
>
> I can confirm that this is true on Haswell and also on a POWER8.
> Interestingly, on a POWER7, the performance is the same either way (on the
> good side). I ran the test case both as presented and with f3 replaced by a
> non-bitfield unsigned char member. Thinking that the POWER7 result might be
> because it's big-endian, I flipped the order of the fields, and found that
> the version where f3 is not a bitfield is faster than otherwise, but only
> by 12.5%.
>
> Why, in this case, don't we shrink the load? It seems like we should (and it
> seems like a straightforward case).
>
> Thanks again,
> Hal
>

Hal, thanks for trying the test.

Yes, it is straightforward to shrink the load in that test. I can change the
testcase a little to show why it is sometimes difficult to shrink the load:

class A {
public:
  unsigned long f1:16;
  unsigned long f2:16;
  unsigned long f3:16;
  unsigned long f4:8;
};
A a;
bool b;
unsigned long N = 1000000000;

__attribute__((noinline))
void foo() {
  a.f4 = 3;
}

__attribute__((noinline))
void goo() {
  b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
}

int main() {
  unsigned long i;
  for (i = 0; i < N; i++) {
    foo();
    goo();
  }
}

For the load of a.f4 in goo, it is difficult to motivate shrinking it after
InstCombine, because the comparison with a.f3 and the comparison with a.f4
have been merged into a single wide compare:

define void @_Z3goov() local_unnamed_addr #0 {
  %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
  %2 = and i64 %1, 72057589742960640    ; 0x00FFFFFF00000000, the bits of f3 and f4
  %3 = icmp eq i64 %2, 281470681743360  ; 0x0000FFFF00000000, i.e. f3 == 0xffff and f4 == 0
  %4 = zext i1 %3 to i8
  store i8 %4, i8* @b, align 1, !tbaa !2
  ret void
}
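
For the load of a.f4 to stay narrow, the merged compare above would have to
be split back into two naturally aligned accesses, roughly like the
following. This is again a hand-written sketch rather than compiler output;
the name @_Z3goov_shrunk is made up for illustration, @a, @b and %class.A are
the globals/types from the snippet above, and the byte offsets (f3 at offset
4, f4 at offset 6) assume the little-endian x86-64 layout:

define void @_Z3goov_shrunk() local_unnamed_addr {
  ; two narrow, naturally aligned loads instead of one masked i64 load
  %f4 = load i8, i8* getelementptr (i8, i8* bitcast (%class.A* @a to i8*), i64 6), align 2
  %f3 = load i16, i16* bitcast (i8* getelementptr (i8, i8* bitcast (%class.A* @a to i8*), i64 4) to i16*), align 4
  %c4 = icmp eq i8 %f4, 0
  %c3 = icmp eq i16 %f3, -1        ; -1 is 0xffff as an i16
  %r  = and i1 %c4, %c3
  %z  = zext i1 %r to i8
  store i8 %z, i8* @b, align 1
  ret void
}

Recovering this from the single i64 load plus mask-and-compare means
introducing an extra narrow load, and whether that pays off is exactly the
kind of decision that is hard to make locally in InstCombine or the
DAGCombiner.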

Thanks,
Wei.

>>
>> The testcases show the potential problem with store shrinking. Before we
>> decide to shrink a store, we need to know that all the related loads will
>> be shrunk as well, and that requires inter-procedural analysis. Otherwise,
>> when load shrinking is blocked for some difficult case (like the
>> InstCombine case described in
>> https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
>> a performance regression will happen.
>>
>> Wei.
>>
>>
>>>>
>>>>
>>>> Repository:
>>>>    rL LLVM
>>>>
>>>> https://reviews.llvm.org/D36562
>>>>
>>>>
>>>>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>

