[PATCH] D30416: [InstCombine] Redo reduceLoadOpStoreWidth in instcombine for bitfield store optimization.

Wei Mi via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 7 10:51:59 PST 2017


On Tue, Mar 7, 2017 at 9:36 AM, Friedman, Eli <efriedma at codeaurora.org>
wrote:

> On 3/7/2017 9:15 AM, Wei Mi wrote:
>
>
>
> On Mon, Mar 6, 2017 at 8:50 PM, Wei Mi <wmi at google.com> wrote:
>
>>
>>
>> On Mon, Mar 6, 2017 at 4:36 PM, Friedman, Eli <efriedma at codeaurora.org>
>> wrote:
>>
>>> On 3/6/2017 11:06 AM, Wei Mi wrote:
>>>
>>>> After looking at the benchmark that motivated the test, I find it is much
>>>> harder for codegenprepare to catch all the shrinking opportunities than it
>>>> is for instcombine, for two reasons:
>>>>
>>>> * loadpre may merge the original load in a shrinking pattern with another
>>>> load at a much earlier position, and there may be other stores in between.
>>>> The safety check in codegenprepare currently just scans for
>>>> mayWriteToMemory instructions between the load and the store of the
>>>> candidate shrinking pattern, so it fails because of those intervening
>>>> stores.
>>>>
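
For reference, the check described above is roughly the following (a sketch,
not the exact code in the patch; it assumes the load and the store are in the
same basic block):

  #include "llvm/IR/Instructions.h"
  #include <iterator>
  using namespace llvm;

  // Conservative legality check: walk the instructions between the load and
  // the store and give up as soon as anything may write to memory.
  static bool hasNoInterveningWrite(LoadInst *Load, StoreInst *Store) {
    for (auto I = std::next(Load->getIterator()), E = Store->getIterator();
         I != E; ++I)
      if (I->mayWriteToMemory())
        return false; // any intervening store/call defeats the shrinking
    return true;
  }
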
>>>
>>> It would be easy to make the check a lot more powerful using MemorySSA;
>>> I guess you don't have access to that in the passes where you're doing the
>>> transform?
>>
>>
>> Yes, if MemorySSA plus alias queries can be used, that should solve the
>> problem. I didn't find any existing use of MemorySSA in instcombine or CGP,
>> so I am not sure whether it is available there.
>>
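
A MemorySSA-based version of that check could look roughly like the sketch
below. This is only an illustration of the idea, not code from the patch: the
helper name is made up, and it assumes a MemorySSA analysis result is already
available in the pass doing the transform.

  #include "llvm/Analysis/MemorySSA.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Returns true if the bytes Load observed are still in memory at Store,
  // i.e. nothing in between may have written the same location. Any failure
  // to prove this simply disables the shrinking, so being conservative is
  // fine.
  static bool loadValueStillLiveAt(LoadInst *Load, StoreInst *Store,
                                   MemorySSA &MSSA) {
    // The access that produced the memory state the load read from.
    MemoryAccess *LoadDef = MSSA.getMemoryAccess(Load)->getDefiningAccess();
    // The nearest access above the store that may clobber the same location
    // (the load and the store use the same pointer and width here).
    MemoryAccess *StoreClobber =
        MSSA.getWalker()->getClobberingMemoryAccess(Store);
    return StoreClobber == LoadDef;
  }
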
>>
>>>
>>>
>>>> * we may see the following pattern. a.f1 is a bitfield. Ideally, we
>>>> should do the shrinking for all three stores. But codegenprepare works
>>>> in top-down order: after the first store is shrunk, it is hard for the
>>>> second and the third to know that %bf.set is the same as the value loaded
>>>> from %0. If we do this in instcombine it is much easier, because before
>>>> load/store elimination the value is still loaded from %0 separately in
>>>> if.then9 and in if.end41.
>>>>
>>>
>>> I'm not following how it helps to do this in instcombine?  Normally,
>>> EarlyCSE runs first.  (A complete C testcase might be useful here.)
>>
>>
>> It is not EarlyCSE; here it is again loadpre that causes the load to be
>> replaced by %bf.set.
>>
>> Before global value numbering:
>>
>> entry:
>>   %0 = bitcast %class.B* %this to i64*
>>   %bf.load = load i64, i64* %0, align 8
>>   %dec = add i64 %bf.load, 65535
>>   %bf.value = and i64 %dec, 65535
>>   %bf.clear5 = and i64 %bf.load, -65536
>>   %bf.set = or i64 %bf.value, %bf.clear5
>>   store i64 %bf.set, i64* %0, align 8
>>   ...
>> if.then9:
>>   ; %bf.load1 is helpful for the patch to recognize the load-and-or-store pattern.
>>   %bf.load1 = load i64, i64* %0, align 8
>>   %inc79 = add i64 %bf.load1, 281474976710656
>>   %bf.shl = and i64 %inc79, 71776119061217280
>>   %bf.clear18 = and i64 %bf.load1, -71776119061217281
>>   %bf.set19 = or i64 %bf.shl, %bf.clear18
>>   store i64 %bf.set19, i64* %0, align 8
>>   ...
>> if.end41:
>>   ; %bf.load2 is helpful for the patch to recognize the load-and-or-store pattern.
>>   %bf.load2 = load i64, i64* %0, align 8
>>   %inc4578 = add i64 %bf.load2, 65536
>>   %bf.shl48 = and i64 %inc4578, 4294901760
>>   %bf.clear49 = and i64 %bf.load2, -4294901761
>>   %bf.set50 = or i64 %bf.shl48, %bf.clear49
>>   store i64 %bf.set50, i64* %0, align 8
>>
>> After global value numbering:
>>
>> entry:
>>   %0 = bitcast %class.B* %this to i64*
>>   %bf.load = load i64, i64* %0, align 8
>>   %dec = add i64 %bf.load, 65535
>>   %bf.value = and i64 %dec, 65535
>>   %bf.clear5 = and i64 %bf.load, -65536
>>   %bf.set = or i64 %bf.value, %bf.clear5
>>   store i64 %bf.set, i64* %0, align 8
>>   ...
>> if.then9:
>>   ; we hope to know that %bf.set is the same as "load i64, i64* %0, align 8".
>>   %inc79 = add i64 %bf.set, 281474976710656
>>   %bf.shl = and i64 %inc79, 71776119061217280
>>   %bf.clear18 = and i64 %bf.set, -71776119061217281
>>   %bf.set19 = or i64 %bf.shl, %bf.clear18
>>   store i64 %bf.set19, i64* %0, align 8
>>   ...
>> if.end41:
>>   ; we hope to know that %bf.set is the same as "load i64, i64* %0, align 8".
>>   %inc4578 = add i64 %bf.set, 65536
>>   %bf.shl48 = and i64 %inc4578, 4294901760
>>   %bf.clear49 = and i64 %bf.set, -4294901761
>>   %bf.set50 = or i64 %bf.shl48, %bf.clear49
>>   store i64 %bf.set50, i64* %0, align 8
>>
>
> I don't like depending on pass ordering like this... I mean, I can't
> really say without the original C testcase, but at first glance, EarlyCSE
> should be catching this, and you're just getting lucky.
>

For the testcase above, EarlyCSE doesn't replace the load with the previously
stored value because there are calls and other stores in between (in the
ellipsis). You are right that, in general, doing the shrinking in instcombine
after EarlyCSE can run into the same problem, unless we add another
instcombine pass before EarlyCSE. I haven't found a general solution other
than adding that extra instcombine run. Still thinking...

I attached a reduced testcase that we want to optimize. We expect all the
bitfield stores in the testcase to be shrunk.
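
The attached 1.cc is the real reduced testcase; the sketch below is only an
illustration of the kind of source pattern the IR above comes from, with the
class layout and names made up to match the masks in the IR (bits 0-15,
bits 16-31, and bits 48-55 of one 64-bit word):

  // Illustration only, not the attached 1.cc.
  class B {
    unsigned long f1 : 16; // bits 0-15,  updated in the entry block
    unsigned long f2 : 16; // bits 16-31, updated in if.end41
    unsigned long f3 : 16; // bits 32-47, untouched here
    unsigned long f4 : 8;  // bits 48-55, updated in if.then9
  public:
    void update(bool a, bool b) {
      f1--;       // wide i64 load/and/or/store today; want an i16 store
      // ... calls and other stores ...
      if (a)
        f4++;     // want an i8 store
      // ...
      if (b)
        f2++;     // want an i16 store
    }
  };
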



>
>  Or can I move all the types of shrinking to instcombine? Since I only
> shrink to i8, i16, and i32, I think almost all common architectures
> support i8/i16/i32 load/store directly?
>
> It should be safe to assume every architecture has a reasonable lowering
> for i8, i16, and i32 load/store.
>
> -Eli
>
> --
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
>
>
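
To make the i8/i16/i32 point concrete, here is a source-level picture of what
shrinking the entry-block store above amounts to (the function names are made
up, the byte offset assumes a little-endian target, and strict aliasing is
ignored; the patch itself works on IR, not C++):

  #include <cstdint>

  // Today: the whole 64-bit word is read, masked, merged and written back.
  void storeWide(uint64_t *p, uint64_t newLow16) {
    uint64_t w = *p;
    w = (w & ~0xFFFFull) | (newLow16 & 0xFFFF);
    *p = w;                                 // 8-byte store
  }

  // After shrinking: only the 16 bits that actually change are stored.
  void storeNarrow(uint64_t *p, uint64_t newLow16) {
    *reinterpret_cast<uint16_t *>(p) =
        static_cast<uint16_t>(newLow16);    // 2-byte store, same effect
  }
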
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1.cc
Type: text/x-c++src
Size: 1703 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20170307/23ad5702/attachment-0001.cc>

