[PATCH] D30416: [InstCombine] Redo reduceLoadOpStoreWidth in instcombine for bitfield store optimization.

Wei Mi via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 7 13:57:42 PST 2017


On Tue, Mar 7, 2017 at 10:51 AM, Wei Mi <wmi at google.com> wrote:

>
>
> On Tue, Mar 7, 2017 at 9:36 AM, Friedman, Eli <efriedma at codeaurora.org>
> wrote:
>
>> On 3/7/2017 9:15 AM, Wei Mi wrote:
>>
>>
>>
>> On Mon, Mar 6, 2017 at 8:50 PM, Wei Mi <wmi at google.com> wrote:
>>
>>>
>>>
>>> On Mon, Mar 6, 2017 at 4:36 PM, Friedman, Eli <efriedma at codeaurora.org>
>>> wrote:
>>>
>>>> On 3/6/2017 11:06 AM, Wei Mi wrote:
>>>>
>>>>> After looking at the benchmark motivating the test, I find it is much
>>>>> harder for codegenprepare to catch all the shrinking opportunities than
>>>>> for instcombine, because of two things:
>>>>>
>>>>> * loadpre may merge the original load in a shrinking pattern with
>>>>> another load at a much earlier position, and there may be other stores in
>>>>> between. The safety check in codegenprepare currently just scans for
>>>>> mayWriteToMemory instructions between the load and the store of the
>>>>> candidate shrinking pattern, so it fails because of those stores in between.
>>>>>
>>>>
>>>> It would be easy to make the check a lot more powerful using MemorySSA;
>>>> I guess you don't have access to that in the passes where you're doing the
>>>> transform?
>>>
>>>
>>> Yes, if MemorySSA + alias query can be used, it should solve the
>>> problem. I didn't find any existing use of MemorySSA in instcombine or
>>> CGP, so I am not sure whether it can be used.
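
For reference, a rough, conservative sketch of what such a MemorySSA-based
check could look like (the helper name and the surrounding pass plumbing are
my assumptions, not what the patch does today):

  #include "llvm/Analysis/MemoryLocation.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/Transforms/Utils/MemorySSA.h"

  using namespace llvm;

  // Hypothetical helper: return true if, according to MemorySSA, nothing
  // between Load and Store may clobber the location Load reads from. This
  // would replace the current "scan every mayWriteToMemory instruction
  // between the load and the store" check.
  static bool isLoadStillAvailableAt(LoadInst *Load, StoreInst *Store,
                                     MemorySSA &MSSA) {
    MemoryLocation Loc = MemoryLocation::get(Load);
    MemorySSAWalker *Walker = MSSA.getWalker();

    // The write that actually defines the value Load observes.
    MemoryAccess *LoadClobber = Walker->getClobberingMemoryAccess(Load);

    // Walk upwards from just above Store, skipping writes that do not alias
    // Loc. If we end up at the same clobber the load saw, no instruction in
    // between may have modified the loaded bits.
    auto *StoreAccess = cast<MemoryDef>(MSSA.getMemoryAccess(Store));
    MemoryAccess *StoreClobber = Walker->getClobberingMemoryAccess(
        StoreAccess->getDefiningAccess(), Loc);

    return LoadClobber == StoreClobber;
  }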
>>>
>>>
>>>>
>>>>
>>>>> * we may see the following pattern (a.f1 is a bitfield). Ideally, we
>>>>> should do the shrinking for all three stores. But codegenprepare works
>>>>> in top-down order: after the first store is shrunk, it is hard for the
>>>>> second and the third to know %bf.set is the same as the value loaded from
>>>>> %0. If we do this in instcombine, it is much easier because, before
>>>>> load/store elimination, %bf.set is still loaded from %0 separately in
>>>>> if.then9 and in if.end41.
>>>>>
>>>>
>>>> I'm not following how it helps to do this in instcombine?  Normally,
>>>> EarlyCSE runs first.  (A complete C testcase might be useful here.)
>>>
>>>
>>> It is not earlycse; it is again loadpre that causes the load to be
>>> replaced by %bf.set.
>>>
>>> Before global value numbering:
>>>
>>> entry:
>>>   %0 = bitcast %class.B* %this to i64*
>>>   %bf.load = load i64, i64* %0, align 8
>>>   %dec = add i64 %bf.load, 65535
>>>   %bf.value = and i64 %dec, 65535
>>>   %bf.clear5 = and i64 %bf.load, -65536
>>>   %bf.set = or i64 %bf.value, %bf.clear5
>>>   store i64 %bf.set, i64* %0, align 8
>>>   ...
>>> if.then9:
>>>   ; %bf.load1 is what lets the patch recognize the load-and-or-store pattern
>>>   %bf.load1 = load i64, i64* %0, align 8
>>>   %inc79 = add i64 %bf.load1, 281474976710656
>>>   %bf.shl = and i64 %inc79, 71776119061217280
>>>   %bf.clear18 = and i64 %bf.load1, -71776119061217281
>>>   %bf.set19 = or i64 %bf.shl, %bf.clear18
>>>   store i64 %bf.set19, i64* %0, align 8
>>>   ...
>>> if.end41:
>>>   ; %bf.load2 is what lets the patch recognize the load-and-or-store pattern
>>>   %bf.load2 = load i64, i64* %0, align 8
>>>   %inc4578 = add i64 %bf.load2, 65536
>>>   %bf.shl48 = and i64 %inc4578, 4294901760
>>>   %bf.clear49 = and i64 %bf.load2, -4294901761
>>>   %bf.set50 = or i64 %bf.shl48, %bf.clear49
>>>   store i64 %bf.set50, i64* %0, align 8
>>>
>>> After global value numbering:
>>>
>>> entry:
>>>   %0 = bitcast %class.B* %this to i64*
>>>   %bf.load = load i64, i64* %0, align 8
>>>   %dec = add i64 %bf.load, 65535
>>>   %bf.value = and i64 %dec, 65535
>>>   %bf.clear5 = and i64 %bf.load, -65536
>>>   %bf.set = or i64 %bf.value, %bf.clear5
>>>   store i64 %bf.set, i64* %0, align 8
>>>   ...
>>> if.then9:
>>>   ; we hope to know that %bf.set is the same as "load i64, i64* %0, align 8"
>>>   %inc79 = add i64 %bf.set, 281474976710656
>>>   %bf.shl = and i64 %inc79, 71776119061217280
>>>   %bf.clear18 = and i64 %bf.set, -71776119061217281
>>>   %bf.set19 = or i64 %bf.shl, %bf.clear18
>>>   store i64 %bf.set19, i64* %0, align 8
>>>   ...
>>> if.end41:
>>>   ; we hope to know that %bf.set is the same as "load i64, i64* %0, align 8"
>>>   %inc4578 = add i64 %bf.set, 65536
>>>   %bf.shl48 = and i64 %inc4578, 4294901760
>>>   %bf.clear49 = and i64 %bf.set, -4294901761
>>>   %bf.set50 = or i64 %bf.shl48, %bf.clear49
>>>   store i64 %bf.set50, i64* %0, align 8
>>>
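
For orientation, source of roughly this shape produces IR like the above.
The class layout below is reconstructed from the masks in the IR; the field
names and the control flow are made up, and this is not the reduced testcase
mentioned further down:

  // Hypothetical layout: the bitfields updated above share one 64-bit
  // allocation unit, so clang expands every field update into a 64-bit
  // load / and / or / store sequence.
  class B {
    unsigned long f1 : 16;  // bits 0..15  : decremented in 'entry'
                            //   (add 65535 == -1 mod 2^16, mask 0xFFFF)
    unsigned long f2 : 16;  // bits 16..31 : incremented in 'if.end41'
                            //   (add 1 << 16, mask 0xFFFF0000)
    unsigned long f3 : 16;  // bits 32..47 : untouched in this excerpt
    unsigned long f4 : 8;   // bits 48..55 : incremented in 'if.then9'
                            //   (add 1 << 48, mask 0xFF << 48)

  public:
    void update(bool c1, bool c2) {  // control flow is only illustrative
      --f1;                          // 'entry'
      // ... calls and other stores (the ellipses above) ...
      if (c1)
        ++f4;                        // 'if.then9'
      // ...
      if (c2)
        ++f2;                        // 'if.end41'
    }
  };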
>>
>> I don't like depending on pass ordering like this... I mean, I can't
>> really say without the original C testcase, but at first glance, EarlyCSE
>> should be catching this, and you're just getting lucky.
>>
>
> For the testcase above, EarlyCSE doesn't replace the load with the previous
> store's value because there are calls and other stores in between (in the
> ellipsis). You are right that, in general, unless we add another pass of
> instcombine before earlycse, doing the shrinking in instcombine after
> earlycse faces the same problem. I haven't got an idea for solving it
> generally other than adding another pass of instcombine. Still thinking...
>

To catch all the bitfield shrinking opportunities, it seems the optimization
should be placed before the first run of earlycse. But I don't think adding
another pass of instcombine before earlycse is a good idea, because I vaguely
remember we saw cases requiring earlycse to happen before instcombine.

Is it ok to add a separate pass before earlycse for bitfield-related
shrinking? This is the best way I can think of for now. Suggestions are
welcome.
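
To make that concrete, the core rewrite such a pass would do could look
roughly like the following. This is only a sketch of the simplest case
(store (or (and (load Ptr), ClearMask), NewBits) to Ptr, where just the low
16 bits change, on a little-endian target); the function name and structure
are mine, not what the patch implements:

  #include "llvm/Analysis/ValueTracking.h"
  #include "llvm/IR/DataLayout.h"
  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/PatternMatch.h"

  using namespace llvm;
  using namespace llvm::PatternMatch;

  // Hypothetical core of a bitfield-store shrinking pass (illustrative only).
  // Handles: store (or (and (load Ptr), ClearMask), NewBits), Ptr
  // where ClearMask preserves everything but the low 16 bits. Real code would
  // also handle i8/i32 slices at any byte offset, pick the right offset on
  // big-endian targets, and prove (e.g. with MemorySSA) that nothing clobbers
  // Ptr between the load and the store.
  static bool tryShrinkBitfieldStore(StoreInst *SI, const DataLayout &DL) {
    Value *Ptr = SI->getPointerOperand();
    Value *StoredVal = SI->getValueOperand();
    if (!StoredVal->getType()->isIntegerTy(64) || !DL.isLittleEndian())
      return false;

    Value *NewBits = nullptr;
    ConstantInt *ClearMask = nullptr;
    if (!match(StoredVal,
               m_c_Or(m_And(m_Load(m_Specific(Ptr)), m_ConstantInt(ClearMask)),
                      m_Value(NewBits))))
      return false;

    // Only the simplest slice: bits 0..15 are rewritten, bits 16..63 are kept.
    if (ClearMask->getValue() != APInt::getHighBitsSet(64, 48))
      return false;

    // The new bits must not leak into the preserved part, otherwise the
    // narrow store would drop them.
    if (!MaskedValueIsZero(NewBits, ClearMask->getValue(), DL))
      return false;

    // Store only the modified half-word through an i16* view of Ptr.
    IRBuilder<> B(SI);
    Value *NarrowPtr = B.CreateBitCast(Ptr, B.getInt16Ty()->getPointerTo());
    Value *NarrowVal = B.CreateTrunc(NewBits, B.getInt16Ty());
    B.CreateAlignedStore(NarrowVal, NarrowPtr, SI->getAlignment());
    SI->eraseFromParent();
    return true;
  }

For the entry block in the IR above, this replaces the i64 store of %bf.set
with an i16 store of the decremented low half-word; narrowing the feeding
load as well, so the arithmetic happens in i16, would be the natural
follow-up.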

Thanks,
Wei.


>
> I attached a reduced testcase we want to optimize. We expect all the
> bitfield stores in the testcase to be shrunk.
>
>
>
>>
>>  Or can I move all the types of shrinking to instcombine? Since I only
>> shrink to i8, i16, and i32, I think almost all common architectures
>> support i8/i16/i32 loads/stores directly?
>>
>> It should be safe to assume every architecture has a reasonable lowering
>> for i8, i16, and i32 load/store.
>>
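
For completeness, if one wanted a defensive guard anyway, the shrink width
can be checked against the target's native integer widths from the
datalayout; a tiny sketch (the function name is hypothetical):

  #include "llvm/IR/DataLayout.h"

  using namespace llvm;

  // Hypothetical guard: only shrink to widths the target declares as native
  // integer widths in its datalayout string (e.g. "n8:16:32:64" on x86-64).
  static bool isSupportedShrinkWidth(const DataLayout &DL, unsigned Bits) {
    return (Bits == 8 || Bits == 16 || Bits == 32) && DL.isLegalInteger(Bits);
  }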
>> -Eli
>>
>> --
>> Employee of Qualcomm Innovation Center, Inc.
>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
>>
>>
>

