[PATCH] D30416: [InstCombine] Redo reduceLoadOpStoreWidth in instcombine for bitfield store optimization.

Friedman, Eli via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 7 15:26:17 PST 2017


On 3/7/2017 1:57 PM, Wei Mi wrote:
>
>
> On Tue, Mar 7, 2017 at 10:51 AM, Wei Mi <wmi at google.com> wrote:
>
>
>
>     On Tue, Mar 7, 2017 at 9:36 AM, Friedman, Eli <efriedma at codeaurora.org> wrote:
>
>         On 3/7/2017 9:15 AM, Wei Mi wrote:
>>
>>
>>         On Mon, Mar 6, 2017 at 8:50 PM, Wei Mi <wmi at google.com> wrote:
>>
>>
>>
>>             On Mon, Mar 6, 2017 at 4:36 PM, Friedman, Eli <efriedma at codeaurora.org> wrote:
>>
>>                 On 3/6/2017 11:06 AM, Wei Mi wrote:
>>
>>                     After looking at the benchmark motivating the
>>                     test, I find it is much harder for codegenprepare
>>                     to catch all the shrinking opportunities compared
>>                     with instcombine, because of two things:
>>
>>                     * loadpre may merge the original load in a
>>                     shrinking pattern with another load at a much
>>                     earlier position, and there may be other stores
>>                     in between. The safety check in the
>>                     codegenprepare version simply scans the
>>                     mayWriteToMemory instructions between the load
>>                     and the store of the candidate shrinking
>>                     pattern, so it fails because of those stores in
>>                     between.
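>>
>>                     For illustration, a minimal sketch of such a
>>                     candidate pattern with a merged-in store sitting
>>                     between the load and the store (all names made
>>                     up):
>>
>>                       %val = load i64, i64* %p, align 8
>>                       ; the conservative scan bails out on this
>>                       ; store, whether or not %q can alias %p
>>                       store i32 0, i32* %q
>>                       %clear = and i64 %val, -65536
>>                       %set = or i64 %clear, 42
>>                       store i64 %set, i64* %p, align 8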
>>
>>
>>                 It would be easy to make the check a lot more
>>                 powerful using MemorySSA; I guess you don't have
>>                 access to that in the passes where you're doing the
>>                 transform?
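>>
>>                 For instance (a sketch with made-up names), when the
>>                 pointers are provably disjoint, MemorySSA's
>>                 clobbering-access walker plus an alias query could
>>                 tell that the intervening store never clobbers %p,
>>                 where a plain mayWriteToMemory scan has to give up:
>>
>>                   define void @f(i64* noalias %p, i32* noalias %q) {
>>                     %val = load i64, i64* %p, align 8
>>                     store i32 0, i32* %q   ; not a clobber of %p
>>                     %clear = and i64 %val, -65536
>>                     %set = or i64 %clear, 42
>>                     store i64 %set, i64* %p, align 8
>>                     ret void
>>                   }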
>>
>>
>>             Yes, if MemorySSA + alias queries can be used, that
>>             should solve the problem. I didn't find any existing use
>>             of MemorySSA in instcombine or CGP, so I am not sure
>>             whether it can be used.
>>
>>
>>
>>                     * we may see the following pattern, where a.f1
>>                     is a bitfield. Ideally, we should do the
>>                     shrinking for all three stores. But
>>                     codegenprepare works in top-down order: after
>>                     the first store is shrunk, it is hard for the
>>                     second and the third to know that %bf.set is the
>>                     same as the value loaded from %0. If we do this
>>                     in instcombine, it is much easier, because
>>                     before load/store elimination the value is still
>>                     loaded from %0 separately in if.then9 and in
>>                     if.end41.
>>
>>
>>                 I'm not following how it helps to do this in
>>                 instcombine? Normally, EarlyCSE runs first.  (A
>>                 complete C testcase might be useful here.)
>>
>>
>>             It is not earlycse; it is loadpre that causes the load
>>             to be replaced by %bf.set.
>>
>>             Before global value numbering:
>>
>>             entry:
>>               %0 = bitcast %class.B* %this to i64*
>>               %bf.load = load i64, i64* %0, align 8
>>               %dec = add i64 %bf.load, 65535
>>               %bf.value = and i64 %dec, 65535
>>               %bf.clear5 = and i64 %bf.load, -65536
>>               %bf.set = or i64 %bf.value, %bf.clear5
>>               store i64 %bf.set, i64* %0, align 8
>>             ...
>>             if.then9:
>>               ; %bf.load1 is what lets the patch recognize the
>>               ; load-and-or-store pattern
>>               %bf.load1 = load i64, i64* %0, align 8
>>               %inc79 = add i64 %bf.load1, 281474976710656
>>               %bf.shl = and i64 %inc79, 71776119061217280
>>               %bf.clear18 = and i64 %bf.load1, -71776119061217281
>>               %bf.set19 = or i64 %bf.shl, %bf.clear18
>>               store i64 %bf.set19, i64* %0, align 8
>>             ...
>>             if.end41:
>>               ; %bf.load2 likewise lets the patch recognize the
>>               ; load-and-or-store pattern
>>               %bf.load2 = load i64, i64* %0, align 8
>>               %inc4578 = add i64 %bf.load2, 65536
>>               %bf.shl48 = and i64 %inc4578, 4294901760
>>               %bf.clear49 = and i64 %bf.load2, -4294901761
>>               %bf.set50 = or i64 %bf.shl48, %bf.clear49
>>               store i64 %bf.set50, i64* %0, align 8
>>
>>             After global value numbering:
>>
>>             entry:
>>               %0 = bitcast %class.B* %this to i64*
>>               %bf.load = load i64, i64* %0, align 8
>>               %dec = add i64 %bf.load, 65535
>>               %bf.value = and i64 %dec, 65535
>>               %bf.clear5 = and i64 %bf.load, -65536
>>               %bf.set = or i64 %bf.value, %bf.clear5
>>               store i64 %bf.set, i64* %0, align 8
>>             ...
>>             if.then9:
>>               ; here we hope to know %bf.set is the same as
>>               ; "load i64, i64* %0, align 8"
>>               %inc79 = add i64 %bf.set, 281474976710656
>>               %bf.shl = and i64 %inc79, 71776119061217280
>>               %bf.clear18 = and i64 %bf.set, -71776119061217281
>>               %bf.set19 = or i64 %bf.shl, %bf.clear18
>>               store i64 %bf.set19, i64* %0, align 8
>>             ...
>>             if.end41:
>>               ; here too we hope to know %bf.set is the same as
>>               ; "load i64, i64* %0, align 8"
>>               %inc4578 = add i64 %bf.set, 65536
>>               %bf.shl48 = and i64 %inc4578, 4294901760
>>               %bf.clear49 = and i64 %bf.set, -4294901761
>>               %bf.set50 = or i64 %bf.shl48, %bf.clear49
>>               store i64 %bf.set50, i64* %0, align 8
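>>
>>             For reference, the shrinking being discussed would turn,
>>             e.g., the if.then9 sequence above into a one-byte update
>>             of the 8-bit field at bit offset 48, roughly like this
>>             (the cast and GEP names are made up; little-endian
>>             layout assumed, so byte 6 holds bits 48-55):
>>
>>               %addr8 = bitcast i64* %0 to i8*
>>               %addr = getelementptr i8, i8* %addr8, i64 6
>>               %old = load i8, i8* %addr, align 2
>>               ; adding 281474976710656 is "+ (1 << 48)", i.e. an
>>               ; increment of that byte
>>               %new = add i8 %old, 1
>>               store i8 %new, i8* %addr, align 2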
>>
>
>         I don't like depending on pass ordering like this... I mean, I
>         can't really say without the original C testcase, but at first
>         glance, EarlyCSE should be catching this, and you're just
>         getting lucky.
>
>
>     For the testcase above, EarlyCSE doesn't replace the load with
>     the previously stored value because there are calls and other
>     stores in between (in the ellipsis). You are right that, in
>     general, unless we add another pass of instcombine before
>     earlycse, doing the shrinking in instcombine after earlycse can
>     hit the same problem. I haven't got an idea to solve it generally
>     other than adding another pass of instcombine. Still thinking...
>
>
> To catch all the bitfield shrinking opportunities, it seems the 
> optimization should be placed before the first pass of earlycse. But I 
> don't think adding another pass of instcombine before earlycse is a 
> good idea, because I vaguely remember we saw cases requiring earlycse 
> to happen before instcombine.
>
> Is it OK to add a separate pass before earlycse for bitfield-related 
> shrinking? This is the best way I can think of for now. Suggestions 
> are welcome.
>

You might want to skim the thread 
http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html 
for some background on our current treatment of bitfields.

I'm not sure what the right answer here looks like, given the way it 
interacts with other memory transforms like GVN. Could you send an email 
to llvm-dev with a description of what we're looking at, to see if 
anyone else has suggestions?

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
