[llvm-dev] [RFC] bitfield access shrinking

Sun Mar 12 18:27:14 PDT 2017

On 03/10/2017 02:26 AM, Chandler Carruth wrote:
> On Thu, Mar 9, 2017 at 3:55 PM Hal Finkel <hfinkel at anl.gov 
> <mailto:hfinkel at anl.gov>> wrote:
>
>
>     On 03/09/2017 03:49 PM, Wei Mi wrote:
>     > On Thu, Mar 9, 2017 at 10:54 AM, Hal Finkel <hfinkel at anl.gov
>     <mailto:hfinkel at anl.gov>> wrote:
>     >> On 03/09/2017 12:14 PM, Wei Mi via llvm-dev wrote:
>     >>> In
>     >>>
>     http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html,
>     >>> consecutive bitfields are wrapped as a group and represented as a
>     >>> large integer and emits loads stores and bit operations
>     appropriate
>     >>> for extracting bits from within it. It fixes the problem of
>     violating
>     >>> C++11 memory model that original widen load/store of bitfield was
>     >>> facing. It also brings more coalescing opportunities across
>     bitfields.
>     >>>
>     >>> If some bitfields are natural aligned and their num of bits
>     can form a
>     >>> legal integer type that the target supports, it is more
>     efficient to
>     >>> access them directly than doing a large integer load/store plus a
>     >>> series of bit operations. We call such reverse transformation
>     legal
>     >>> type bitfield shrinking. Currently, llvm depends on
>     DAGCombiner to do
>     >>> such shrinking.
>     >>>
>     >>> However, DAGCombiner has the one-basic-block-a-time
>     limitation, so we
>     >>> started to implement a new shrinking optimization in
>     >>> https://reviews.llvm.org/D30416, and initially put it in
>     instcombine,
>     >>> then moved it to CGP because we want to use some TargetLowering
>     >>> information.
>     >>>
>     >>> The initial implementation in D30416 only supports
>     load-and-or-store
>     >>> pattern matching, and it uses a inst-by-inst scan as a safety
>     check to
>     >>> make sure there is no other memory write insn between the load and
>     >>> store (no alias query is done).  After getting the initial
>     >>> implementation, we found more problems: EarlyCSE, LoadPRE and even
>     >>> InstCombine itself can do coalescing before the shrinking (LoadPRE
>     >>> does it the most thoroughly).
>     >>> The coalescing can move the load many BasicBlocks earlier and make
>     >>> simple inst-by-inst scan unscalable and oftentimes fail. It also
>     >>> breaks the load-and-or-store pattern. An example is below:
>     >>>
>     >>> Before coalescing done by earlycse/loadpre:
>     >>> %bf.load = load i64, i64* %0, align 8
>     >>> %bf.clear = and i64 %bf.load, -65536
>     >>> %bf.set = or i64 %bf.value, %bf.clear
>     >>> store i64 %bf.set2, i64* %9, align 8
>     >>> .....
>     >>> %bf.load1 = load i64, i64* %0, align 8
>     >>> %bf.clear1 = and i64 %bf.load1, -4294901761
>     >>> %bf.set1 = or i64 %bf.value1, %bf.clear1
>     >>> store i64 %bf.set2, i64* %9, align 8
>     >>> .....
>     >>> %bf.load2 = load i64, i64* %0, align 8
>     >>> %bf.clear2 = and i64 %bf.load2, -4294901761
>     >>> %bf.set2 = or i64 %bf.value2, %bf.clear2
>     >>> store i64 %bf.set2, i64* %9, align 8
>     >>>
>     >>> After coalescing, it will become:
>     >>> %bf.load = load i64, i64* %0, align 8
>     >>> %bf.clear = and i64 %bf.load, -65536
>     >>> %bf.set = or i64 %bf.value, %bf.clear
>     >>> .....
>     >>> %bf.clear1 = and i64 %bf.set, -4294901761
>     >>> %bf.set1 = or i64 %bf.value1, %bf.clear1
>     >>> .....
>     >>> %bf.clear2 = and i64 %bf.set1, -4294901761
>     >>> %bf.set2 = or i64 %bf.value2, %bf.clear2
>     >>> store i64 %bf.set2, i64* %9, align 8
>     >>>
>     >>> After load-store coalescing, %bf.load now is far away from the
>     store,
>     >>> and the previous load-and-or-store pattern disappears.
>     >>>
>     >>> A simple idea to fix it is to move the shrinking in a very
>     early pass
>     >>> before the first pass of EarlyCSE. However, if we move shrinking
>     >>> earlier, it is possible to block the coalescing of other
>     ilegal type
>     >>> bitfields which can not be shrinked. So for coalescing and
>     shrinking,
>     >>> no matter which one is first, it will block the other one.
>     >>>
>     >>> After some discussions with Eli and Michael, I come up with
>     another
>     >>> idea to let shrinking stay in the late pipeline. It needs two
>     changes
>     >>> to the current patch:
>     >>>
>     >>> 1. extending the pattern match to handle store(or(and(or(and...))
>     >>> pattern above. It needs to analyze the bit mask of every
>     and-or pairs.
>     >>> If the last and-or pair touch different section with the other
>     and-or
>     >>> pairs, we can split the original store into two, and do the
>     shrinking
>     >>> for the second store, like below
>     >>>
>     >>> %bf.load = load i64, i64* %0, align 8
>     >>> %bf.clear = and i64 %bf.load, -65536
>     >>> %bf.set = or i64 %bf.value, %bf.clear
>     >>> .....
>     >>>
>     >>> %bf.clear1 = and i64 %bf.set, -4294901761
>     >>> %bf.set1 = or i64 %bf.value1, %bf.clear1
>     >>> store i64 %bf.set1, i64* %0, align 8              // the first
>     >>> store.
>     >>> .....
>     >>>
>     >>> %bf.value2.shrinked = shrink_op %bf.value2
>     >>> store i16 %bf.value2.shrinked, i64* %0, align 8       //
>     shrinked store.
>     >>>
>     >>> 2. use memoryssa + alias query to do the safety check. Because
>     >>> dominator tree is not properly updated in CGP, I have to
>     create a new
>     >>> pass and put it before CGP, in order to use memoryssa.
>     >>
>     >> This makes sense to me. Should we just fix CGP to update the DT
>     instead of
>     >> working around it?
>     >>
>     >>   -Hal
>     > I am not familiar enough with CGP to tell. I just notice
>     ModifiedDT is
>     > modified at several places in CGP, which indicates there are a few
>     > transformations needing their specific dominator tree maintainance
>     > work. And I remember simplifyCFG also doesn't use dominator tree to
>     > save the effort to maintain it on the fly. So maybe it is easier to
>     > separate CGP into two parts: which does need dominator tree and
>     > doesn't. Those transformations which don't need dominator tree can
>     > stay together.
>     >
>     > Currently, I know consthoisting is another potential CGP
>     > transformation but left out because of its need of dominator
>     tree. But
>     > consthoisting has already evolved to contain a substantial amount of
>     > code so may be better to stay as a separate pass.
>
>     That might very well be the case here too (the matching logic
>     might grow
>     in complexity). CGP has now grown to over 5K lines, and there are
>     definitely pieces that can get split out. There is a big chunk
>     that runs
>     iteratively, however, so that might not buy us as much as we might
>     hope.
>
>
> I think a lot of the reasons why we used CGP as a grabbag was because 
> for a while we didn't really have a compelling model for *why* we were 
> moving out of canonical form. But we've really clarified the pass 
> pipeline structure now so I'd love to try and build reasonable 
> "optimization" (as opposed to "canonicalization" or "simplification") 
> passes and organize them nicely.

Sounds good (although we still need to be careful not to lose the 
iterative nature of the current optimizations in cases where that's 
important).

>
> One perhaps especially notable thing: this will often end up 
> *narrowing*. I think that it would be very valuable to try and place 
> this or other passes that effectively narrow data access types *before 
> the vectorizers*. This seems important to enable efficient packing of 
> element types.

I think it would be useful to work through some examples in this 
context. It might help vectorization to narrow beforehand, or it might 
end up requiring more scatter/gather load/stores (whereas, in the 
context of vectorization, the shifts and masks might have been more 
efficient).

  -Hal

>
> Past that, I generally like the plan of a dedicated pass to narrow 
> integer stuff across basic blocks, etc., using MemorySSA and other 
> efficient tools. =]
>
> -Chandler
>
>
>       -Hal
>
>     >
>     > Thanks,
>     > Wei.
>     >
>     >>> Eli suggested me to ask for more opinions before start
>     writting code.
>     >>> I think it is a good idea and here is the post. Comments are
>     >>> appreciated.
>     >>>
>     >>> Thanks,
>     >>> Wei.
>     >>> _______________________________________________
>     >>> LLVM Developers mailing list
>     >>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>     >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>     >>
>     >> --
>     >> Hal Finkel
>     >> Lead, Compiler Technology and Programming Languages
>     >> Leadership Computing Facility
>     >> Argonne National Laboratory
>     >>
>
>     --
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170312/f5d8a00e/attachment.html>