[PATCH] D30416: [InstCombine] Redo reduceLoadOpStoreWidth in instcombine for bitfield store optimization.

Wei Mi via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 8 12:26:21 PST 2017


Eli, thank you for pointing me to the thread. It is very helpful for
understanding the background.

Now I have a better understanding of the problem. The change discussed in
that thread, which wraps consecutive bitfields into a single group, has the
benefit that loads of different bitfields can be merged. If the bitfield
accesses are not of a legal type such as i8/i16/i32, the merge is helpful.
If the bitfield accesses are of a legal type, we want to do the shrinking
proposed here.

However, for bitfield load/store merging (earlycse, gvn, and even
instcombine can do such merging, but gvn does it most thoroughly) and
legal-type bitfield access shrinking, whichever happens first can make the
other more difficult. If we do legal-type bitfield access shrinking first,
the shrunk load/store will block other loads/stores of the same bitfield
group from being merged. If we do bitfield load/store merging first, we
face the problems described previously: 1. we need MemorySSA for the safety
check; 2. the pattern matching becomes more difficult.

After some thought, I think it is still better to do the bitfield shrinking
later. To keep the shrinking effective for the case below: 1. I need to use
MemorySSA to check that there is no aliasing memory write between the load
and the store below. 2. I need to extend the existing pattern matching to
handle store(or(and(... or(and(load Ptr, Cst1), Value1) ..., CstN), ValueN),
Ptr), i.e., a chain of "or + and" links on top of the load (a rough matching
sketch follows the IR below).

%bf.load = load i64, i64* %0, align 8
%bf.clear = and i64 %bf.load, -65536
%bf.set = or i64 %bf.value, %bf.clear
.....

%bf.clear1 = and i64 %bf.set, -4294901761
%bf.set1 = or i64 %bf.value1, %bf.clear1
.....

%bf.clear2 = and i64 %bf.set1, -4294901761
%bf.set2 = or i64 %bf.value2, %bf.clear2
store i64 %bf.set2, i64* %9, align 8
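
To make the extended pattern matching concrete, here is a rough C++ sketch
against the LLVM IR/PatternMatch APIs (the helper name and exact structure
are mine, not code from the patch) of how the chain of "or + and" links
could be walked from the stored value back to the load, collecting the bits
each link may modify:

// A rough sketch, not the patch itself: walk from the value being stored
// down a chain of "or(and(X, ClearMask), InsertedBits)" links until the
// load that starts the chain is found, recording which bits each link may
// modify.  Name collectBitfieldChain is made up for illustration.
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PatternMatch.h"
using namespace llvm;

static bool collectBitfieldChain(StoreInst &SI,
                                 SmallVectorImpl<APInt> &ModifiedMasks,
                                 LoadInst *&ChainLoad) {
  using namespace PatternMatch;
  Value *V = SI.getValueOperand();
  while (true) {
    Value *X, *InsertedBits;
    ConstantInt *ClearMask;
    // Match one link: or(and(X, ClearMask), InsertedBits), in either
    // operand order.
    if (!match(V, m_c_Or(m_And(m_Value(X), m_ConstantInt(ClearMask)),
                         m_Value(InsertedBits))))
      return false;
    // The bits not preserved by the mask are the bits this link may change.
    ModifiedMasks.push_back(~ClearMask->getValue());
    // Stop once we reach the load that feeds the innermost "and".
    if (auto *LI = dyn_cast<LoadInst>(X)) {
      if (LI->getPointerOperand() != SI.getPointerOperand())
        return false;
      ChainLoad = LI;
      return true;
    }
    V = X; // Otherwise keep walking down the chain.
  }
}

Such a walk would yield both the load to check against (for the MemorySSA
safety test in point 1) and the per-link masks needed for the disjointness
test in point 2.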

For 2, we know that the extended pattern above loads the bitfield group
first, and the several "or + and" operations adjust different parts of the
loaded value. If the last "or + and" touches bits disjoint from all the
previous "or + and" operations, it is fine to split the final store into two
stores and do the shrinking for the second one, like below (a small sketch
of the disjointness check follows the IR). I think it just adds more
complexity to the existing patch, but it should still be doable.

%bf.load = load i64, i64* %0, align 8
%bf.clear = and i64 %bf.load, -65536
%bf.set = or i64 %bf.value, %bf.clear
.....

%bf.clear1 = and i64 %bf.set, -4294901761
%bf.set1 = or i64 %bf.value1, %bf.clear1
store i64 %bf.set1, i64* %0, align 8
.....

%bf.value2.shifted = lshr i64 %bf.value2, 16          ; assuming the field occupies bits 16..31 per the mask above
%bf.value2.shrunk = trunc i64 %bf.value2.shifted to i16
%bf.cast = bitcast i64* %0 to i16*
%bf.addr = getelementptr i16, i16* %bf.cast, i64 1    ; 2-byte offset of the field (little-endian)
store i16 %bf.value2.shrunk, i16* %bf.addr, align 2
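
The disjointness test itself could look roughly like the following (again a
sketch with made-up names, not the patch): given the per-link modified-bit
masks collected while walking the chain, the last link may only be peeled
into its own narrow store if its bits do not overlap the bits touched by any
earlier link.

// Sketch only: ModifiedMasks is ordered from the store backwards, so
// ModifiedMasks[0] belongs to the last "or + and" link before the store.
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/ArrayRef.h"
using namespace llvm;

static bool lastLinkIsDisjoint(ArrayRef<APInt> ModifiedMasks) {
  if (ModifiedMasks.size() < 2)
    return false;
  // Accumulate the bits touched by all the earlier links.
  APInt EarlierBits = APInt::getNullValue(ModifiedMasks[0].getBitWidth());
  for (const APInt &M : ModifiedMasks.drop_front())
    EarlierBits |= M;
  // The last link may be split off and shrunk only if it touches none of
  // the bits the earlier links touch.
  return (ModifiedMasks[0] & EarlierBits) == 0;
}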

For 1, I probably need to add a new pass for it, because I cannot put it in
CGP anymore (there is no dominator tree to use in CGP; the dominator tree
update there is broken, as described in
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20140120/202291.html.
That thread is three years old, but from my reading of the code, the
conclusion still holds).
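
For reference, the MemorySSA-based safety check I have in mind is roughly
the following C++ sketch (the helper name is illustrative, and it assumes
MemorySSA is available in whatever pass ends up hosting the transform):

// Sketch only: verify there is no memory write that may alias the loaded
// location between the load starting the chain and the store we want to
// shrink, by comparing clobbering accesses in MemorySSA.
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

static bool noAliasingWriteBetween(LoadInst &Load, StoreInst &Store,
                                   MemorySSA &MSSA) {
  MemorySSAWalker *Walker = MSSA.getWalker();
  MemoryLocation LoadLoc = MemoryLocation::get(&Load);

  // The access that defines the memory the load actually reads.
  MemoryAccess *LoadClobber = Walker->getClobberingMemoryAccess(&Load);

  // Starting just above the store, find the nearest access that clobbers
  // the load's location.  If it is the very access the load saw, nothing
  // in between writes memory that may alias the loaded location.
  MemoryUseOrDef *StoreAccess = MSSA.getMemoryAccess(&Store);
  MemoryAccess *ClobberAboveStore = Walker->getClobberingMemoryAccess(
      StoreAccess->getDefiningAccess(), LoadLoc);

  return ClobberAboveStore == LoadClobber;
}

If the check fails, the shrinking would simply be skipped for that store.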

Comments are welcome.

Thanks,
Wei.




On Tue, Mar 7, 2017 at 3:26 PM, Friedman, Eli <efriedma at codeaurora.org>
wrote:

> On 3/7/2017 1:57 PM, Wei Mi wrote:
>
>
>
> On Tue, Mar 7, 2017 at 10:51 AM, Wei Mi <wmi at google.com> wrote:
>
>>
>>
>> On Tue, Mar 7, 2017 at 9:36 AM, Friedman, Eli <efriedma at codeaurora.org>
>> wrote:
>>
>>> On 3/7/2017 9:15 AM, Wei Mi wrote:
>>>
>>>
>>>
>>> On Mon, Mar 6, 2017 at 8:50 PM, Wei Mi <wmi at google.com> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Mar 6, 2017 at 4:36 PM, Friedman, Eli <efriedma at codeaurora.org>
>>>> wrote:
>>>>
>>>>> On 3/6/2017 11:06 AM, Wei Mi wrote:
>>>>>
>>>>>> After looking at the benchmark motivating the test, I find it is much
>>>>>> harder for codegenprepare than for instcombine to catch all the
>>>>>> shrinking opportunities, for two reasons:
>>>>>>
>>>>>> * loadpre may merge the original load in a shrinking pattern with
>>>>>> another load in a much earlier position, and there may be other stores
>>>>>> in between. The safety check in codegenprepare now simply scans the
>>>>>> mayWriteToMemory instructions between the load and store in the candidate
>>>>>> shrinking pattern, and it will fail because of those stores in between.
>>>>>>
>>>>>
>>>>> It would be easy to make the check a lot more powerful using
>>>>> MemorySSA; I guess you don't have access to that in the passes where you're
>>>>> doing the transform?
>>>>
>>>>
>>>> Yes, if MemorySSA + alias query can be used, it should solve the
>>>> problem. I didn't find existing usage of MemorySSA in instcombine or CGP so
>>>> I am not sure whether it can be used.
>>>>
>>>>
>>>>>
>>>>>
>>>>>> * we may see the following pattern. a.f1 is a bitfield. Ideally, we
>>>>>> should do the shrinking for all three stores. But codegenprepare works
>>>>>> in top-down order. After the first store is shrunk, it is hard for the
>>>>>> second and the third to know that %bf.set is the same as the value loaded
>>>>>> from %0. If we do this in instcombine, it is much easier because, before
>>>>>> load/store elimination, %bf.set will still be loaded from %0 in if.then9
>>>>>> and in if.end41, respectively.
>>>>>>
>>>>>
>>>>> I'm not following how it helps to do this in instcombine?  Normally,
>>>>> EarlyCSE runs first.  (A complete C testcase might be useful here.)
>>>>
>>>>
>>>> It is not earlycse; it is loadpre that causes the load to be replaced
>>>> by %bf.set.
>>>>
>>>> Before global value numbering:
>>>>
>>>> entry:
>>>>   %0 = bitcast %class.B* %this to i64*
>>>>   %bf.load = load i64, i64* %0, align 8
>>>>   %dec = add i64 %bf.load, 65535
>>>>   %bf.value = and i64 %dec, 65535
>>>>   %bf.clear5 = and i64 %bf.load, -65536
>>>>   %bf.set = or i64 %bf.value, %bf.clear5
>>>>   store i64 %bf.set, i64* %0, align 8
>>>>   ...
>>>> if.then9:
>>>>   %bf.load1 = load i64, i64* %0, align 8                      //
>>>> %bf.load1 is helpful for the patch to recognize the load-and-or-store
>>>> pattern.
>>>>   %inc79 = add i64 %bf.load1, 281474976710656
>>>>   %bf.shl = and i64 %inc79, 71776119061217280
>>>>   %bf.clear18 = and i64 %bf.load1, -71776119061217281
>>>>   %bf.set19 = or i64 %bf.shl, %bf.clear18
>>>>   store i64 %bf.set19, i64* %0, align 8
>>>>   ...
>>>> if.end41:
>>>>   %bf.load2 = load i64, i64* %0, align 8                      //
>>>> %bf.load2 is helpful for the patch to recognize the load-and-or-store
>>>> pattern.
>>>>   %inc4578 = add i64 %bf.load2, 65536
>>>>   %bf.shl48 = and i64 %inc4578, 4294901760
>>>>   %bf.clear49 = and i64 %bf.load2, -4294901761
>>>>   %bf.set50 = or i64 %bf.shl48, %bf.clear49
>>>>   store i64 %bf.set50, i64* %0, align 8
>>>>
>>>> After global value numbering:
>>>>
>>>> entry:
>>>>   %0 = bitcast %class.B* %this to i64*
>>>>   %bf.load = load i64, i64* %0, align 8
>>>>   %dec = add i64 %bf.load, 65535
>>>>   %bf.value = and i64 %dec, 65535
>>>>   %bf.clear5 = and i64 %bf.load, -65536
>>>>   %bf.set = or i64 %bf.value, %bf.clear5
>>>>   store i64 %bf.set, i64* %0, align 8
>>>>   ...
>>>> if.then9:
>>>>   %inc79 = add i64 %bf.set, 281474976710656       // we hope to know
>>>> %bf.set is the same as load i64, i64* %0, align 8.
>>>>   %bf.shl = and i64 %inc79, 71776119061217280
>>>>   %bf.clear18 = and i64 %bf.set, -71776119061217281
>>>>   %bf.set19 = or i64 %bf.shl, %bf.clear18
>>>>   store i64 %bf.set19, i64* %0, align 8
>>>>   ...
>>>> if.end41:
>>>>   %inc4578 = add i64 %bf.set, 65536                        // we hope
>>>> to know %bf.set is the same as load i64, i64* %0, align 8.
>>>>   %bf.shl48 = and i64 %inc4578, 4294901760
>>>>   %bf.clear49 = and i64 %bf.set, -4294901761
>>>>   %bf.set50 = or i64 %bf.shl48, %bf.clear49
>>>>   store i64 %bf.set50, i64* %0, align 8
>>>>
>>>
>>> I don't like depending on pass ordering like this... I mean, I can't
>>> really say without the original C testcase, but at first glance, EarlyCSE
>>> should be catching this, and you're just getting lucky.
>>>
>>
>> For the testcase above, EarlyCSE doesn't replace the load with the previous
>> store's value because there are calls and other stores in between (in the
>> ellipsis). You are right: in general, unless we add another instcombine pass
>> before earlycse, doing the shrinking in instcombine after earlycse can face
>> the same problem. I haven't got an idea for solving it generally other than
>> adding another instcombine pass. Still thinking...
>>
>
> To catch all the bitfield shrinking opportunities, it seems the optimization
> should be placed before the first run of earlycse. But I don't think adding
> another instcombine pass before earlycse is a good idea, because I vaguely
> remember we saw cases requiring earlycse to happen before instcombine.
>
> Is it ok to add a separate pass before earlycse for bitfield-related
> shrinking? This is the best way I can think of for now. Suggestions are
> welcome.
>
>
> You might want to skim the thread
> http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html
> for some background on our current treatment of bitfields.
>
> I'm not sure what the right answer here looks like, given the way it
> interacts with other memory transforms like GVN. Could you send an email to
> llvm-dev with a description of what we're looking at, to see if anyone else
> has suggestions?
>
> -Eli
>
> --
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
>
>

