[PATCH] D30416: [InstCombine] Redo reduceLoadOpStoreWidth in instcombine for bitfield store optimization.
Friedman, Eli via llvm-commits
llvm-commits at lists.llvm.org
Tue Mar 7 09:36:16 PST 2017
On 3/7/2017 9:15 AM, Wei Mi wrote:
>
>
> On Mon, Mar 6, 2017 at 8:50 PM, Wei Mi <wmi at google.com> wrote:
>
> On Mon, Mar 6, 2017 at 4:36 PM, Friedman, Eli <efriedma at codeaurora.org> wrote:
>
> On 3/6/2017 11:06 AM, Wei Mi wrote:
>
> After looking at the benchmark motivating the test, I find
> it is much harder for codegenprepare to catch all the
> shrinking opportunities compared with instcombine, because
> of two things:
>
> * loadpre may merge the original load in a shrinking
> pattern with another load in a much earlier position, and
> there may be other stores in between. The safety check in
> codegenprepare currently just scans the mayWriteToMemory
> instructions between the load and the store of the candidate
> shrinking pattern, so it will fail because of those stores
> in between.
>
>
> It would be easy to make the check a lot more powerful using
> MemorySSA; I guess you don't have access to that in the passes
> where you're doing the transform?
>
>
> Yes, if MemorySSA + alias query can be used, it should solve the
> problem. I didn't find any existing use of MemorySSA in instcombine
> or CGP, so I am not sure whether it is available there.
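>
> If MemorySSA can be used there, I imagine the check would look
> roughly like the sketch below (untested; noClobberBetween is a
> hypothetical helper, and the exact header location and walker API
> may differ on trunk):
>
>     #include "llvm/Analysis/MemoryLocation.h"
>     #include "llvm/Analysis/MemorySSA.h" // Transforms/Utils/MemorySSA.h on older trees
>     #include "llvm/IR/Instructions.h"
>     using namespace llvm;
>
>     // Hypothetical helper: return true if nothing between Load and
>     // Store can clobber the location Load reads from, replacing the
>     // linear scan over mayWriteToMemory() instructions.
>     static bool noClobberBetween(LoadInst *Load, StoreInst *Store,
>                                  MemorySSA &MSSA) {
>       // The memory state the load actually read from.
>       MemoryAccess *LoadDef =
>           MSSA.getMemoryAccess(Load)->getDefiningAccess();
>       // The nearest access above the store that clobbers the
>       // load's location.
>       MemoryAccess *Clobber = MSSA.getWalker()->getClobberingMemoryAccess(
>           MSSA.getMemoryAccess(Store), MemoryLocation::get(Load));
>       // If both queries resolve to the same access, no intervening
>       // instruction wrote to that location.
>       return Clobber == LoadDef;
>     }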
>
> * we may see the following pattern, where a.f1 is a bitfield.
> Ideally, we should do the shrinking for all three
> stores. But codegenprepare works in top-down order: after
> the first store is shrunk, it is hard for the second and
> the third to know that %bf.set is the same as the value loaded
> from %0. If we do this in instcombine, it is much easier,
> because before load/store elimination %bf.set will be
> loaded from %0 separately in if.then9 and in if.end41.
>
>
> I'm not following how it helps to do this in instcombine?
> Normally, EarlyCSE runs first. (A complete C testcase might
> be useful here.)
>
>
> It is not EarlyCSE; it is again loadpre that causes the load to be
> replaced by %bf.set.
>
> Before global value numbering:
>
> entry:
> %0 = bitcast %class.B* %this to i64*
> %bf.load = load i64, i64* %0, align 8
> %dec = add i64 %bf.load, 65535
> %bf.value = and i64 %dec, 65535
> %bf.clear5 = and i64 %bf.load, -65536
> %bf.set = or i64 %bf.value, %bf.clear5
> store i64 %bf.set, i64* %0, align 8
> ...
> if.then9:
> %bf.load1 = load i64, i64* %0, align 8 // %bf.load1 is helpful for the patch to recognize the load-and-or-store pattern.
> %inc79 = add i64 %bf.load1, 281474976710656
> %bf.shl = and i64 %inc79, 71776119061217280
> %bf.clear18 = and i64 %bf.load1, -71776119061217281
> %bf.set19 = or i64 %bf.shl, %bf.clear18
> store i64 %bf.set19, i64* %0, align 8
> ...
> if.end41:
> %bf.load2 = load i64, i64* %0, align 8 // %bf.load2 is helpful for the patch to recognize the load-and-or-store pattern.
> %inc4578 = add i64 %bf.load2, 65536
> %bf.shl48 = and i64 %inc4578, 4294901760
> %bf.clear49 = and i64 %bf.load2, -4294901761
> %bf.set50 = or i64 %bf.shl48, %bf.clear49
> store i64 %bf.set50, i64* %0, align 8
>
> After global value numbering:
>
> entry:
> %0 = bitcast %class.B* %this to i64*
> %bf.load = load i64, i64* %0, align 8
> %dec = add i64 %bf.load, 65535
> %bf.value = and i64 %dec, 65535
> %bf.clear5 = and i64 %bf.load, -65536
> %bf.set = or i64 %bf.value, %bf.clear5
> store i64 %bf.set, i64* %0, align 8
> ...
> if.then9:
> %inc79 = add i64 %bf.set, 281474976710656 // we hope to know %bf.set is the same as load i64, i64* %0, align 8.
> %bf.shl = and i64 %inc79, 71776119061217280
> %bf.clear18 = and i64 %bf.set, -71776119061217281
> %bf.set19 = or i64 %bf.shl, %bf.clear18
> store i64 %bf.set19, i64* %0, align 8
> ...
> if.end41:
> %inc4578 = add i64 %bf.set, 65536 // we hope to know %bf.set is the same as load i64, i64* %0, align 8.
> %bf.shl48 = and i64 %inc4578, 4294901760
> %bf.clear49 = and i64 %bf.set, -4294901761
> %bf.set50 = or i64 %bf.shl48, %bf.clear49
> store i64 %bf.set50, i64* %0, align 8
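>
> For reference, a C++ pattern consistent with the constants above
> would be something like the class below (field names and widths are
> my reconstruction from the masks, not the actual benchmark source):
>
>     class B {
>       // 64-bit container: the masks 65535, 4294901760 (0xFFFF0000)
>       // and 71776119061217280 (0xFF000000000000) select these fields.
>       unsigned long f1 : 16; // bits 0-15:  decremented in entry
>       unsigned long f2 : 16; // bits 16-31: incremented in if.end41
>       unsigned long f3 : 16; // bits 32-47: untouched
>       unsigned long f4 : 8;  // bits 48-55: incremented in if.then9
>
>     public:
>       void update(bool c1, bool c2) {
>         f1--;     // entry:    store i64 %bf.set
>         if (c1)
>           f4++;   // if.then9: store i64 %bf.set19
>         if (c2)
>           f2++;   // if.end41: store i64 %bf.set50
>       }
>     };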
>
I don't like depending on pass ordering like this... I mean, I can't
really say without the original C testcase, but at first glance,
EarlyCSE should be catching this, and you're just getting lucky.
> Or can I move all the types of shrinking to instcombine? Since I only
> do the shrinking to i8, i16 and i32, I think almost all common
> architectures support i8/i16/i32 load/store directly?
It should be safe to assume every architecture has a reasonable lowering
for i8, i16, and i32 load/store.
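For concreteness, the shrinking in question turns the 64-bit
read-modify-write into a single narrow store; a C++ illustration for a
16-bit field at bit 16 (little-endian layout assumed; this is not the
patch's actual code):

    #include <cstdint>
    #include <cstring>

    // Wide form: read-modify-write of the whole 64-bit word, matching
    // the load/and/or/store IR pattern above.
    void storeWide(uint64_t *P, uint16_t NewVal) {
      uint64_t Word = *P;
      Word = (Word & ~UINT64_C(0xFFFF0000)) | ((uint64_t)NewVal << 16);
      *P = Word;
    }

    // Shrunk form: one 16-bit store at byte offset 2 on a
    // little-endian target (memcpy sidesteps strict aliasing).
    void storeNarrow(uint64_t *P, uint16_t NewVal) {
      std::memcpy(reinterpret_cast<char *>(P) + 2, &NewVal, sizeof(NewVal));
    }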
-Eli
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project