[llvm-dev] [RFC] bitfield access shrinking
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Sun Mar 12 18:27:14 PDT 2017
On 03/10/2017 02:26 AM, Chandler Carruth wrote:
> On Thu, Mar 9, 2017 at 3:55 PM Hal Finkel <hfinkel at anl.gov
> <mailto:hfinkel at anl.gov>> wrote:
>
>
> On 03/09/2017 03:49 PM, Wei Mi wrote:
> > On Thu, Mar 9, 2017 at 10:54 AM, Hal Finkel <hfinkel at anl.gov
> <mailto:hfinkel at anl.gov>> wrote:
> >> On 03/09/2017 12:14 PM, Wei Mi via llvm-dev wrote:
> >>> In
> >>>
> http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html,
> >>> consecutive bitfields are wrapped as a group and represented as a
> >>> large integer and emits loads stores and bit operations
> appropriate
> >>> for extracting bits from within it. It fixes the problem of
> violating
> >>> C++11 memory model that original widen load/store of bitfield was
> >>> facing. It also brings more coalescing opportunities across
> bitfields.
> >>>
> >>> If some bitfields are natural aligned and their num of bits
> can form a
> >>> legal integer type that the target supports, it is more
> efficient to
> >>> access them directly than doing a large integer load/store plus a
> >>> series of bit operations. We call such reverse transformation
> legal
> >>> type bitfield shrinking. Currently, llvm depends on
> DAGCombiner to do
> >>> such shrinking.
> >>>
> >>> However, DAGCombiner has the one-basic-block-a-time
> limitation, so we
> >>> started to implement a new shrinking optimization in
> >>> https://reviews.llvm.org/D30416, and initially put it in
> instcombine,
> >>> then moved it to CGP because we want to use some TargetLowering
> >>> information.
> >>>
> >>> The initial implementation in D30416 only supports
> load-and-or-store
> >>> pattern matching, and it uses a inst-by-inst scan as a safety
> check to
> >>> make sure there is no other memory write insn between the load and
> >>> store (no alias query is done). After getting the initial
> >>> implementation, we found more problems: EarlyCSE, LoadPRE and even
> >>> InstCombine itself can do coalescing before the shrinking (LoadPRE
> >>> does it the most thoroughly).
> >>> The coalescing can move the load many BasicBlocks earlier and make
> >>> simple inst-by-inst scan unscalable and oftentimes fail. It also
> >>> breaks the load-and-or-store pattern. An example is below:
> >>>
> >>> Before coalescing done by earlycse/loadpre:
> >>> %bf.load = load i64, i64* %0, align 8
> >>> %bf.clear = and i64 %bf.load, -65536
> >>> %bf.set = or i64 %bf.value, %bf.clear
> >>> store i64 %bf.set2, i64* %9, align 8
> >>> .....
> >>> %bf.load1 = load i64, i64* %0, align 8
> >>> %bf.clear1 = and i64 %bf.load1, -4294901761
> >>> %bf.set1 = or i64 %bf.value1, %bf.clear1
> >>> store i64 %bf.set2, i64* %9, align 8
> >>> .....
> >>> %bf.load2 = load i64, i64* %0, align 8
> >>> %bf.clear2 = and i64 %bf.load2, -4294901761
> >>> %bf.set2 = or i64 %bf.value2, %bf.clear2
> >>> store i64 %bf.set2, i64* %9, align 8
> >>>
> >>> After coalescing, it will become:
> >>> %bf.load = load i64, i64* %0, align 8
> >>> %bf.clear = and i64 %bf.load, -65536
> >>> %bf.set = or i64 %bf.value, %bf.clear
> >>> .....
> >>> %bf.clear1 = and i64 %bf.set, -4294901761
> >>> %bf.set1 = or i64 %bf.value1, %bf.clear1
> >>> .....
> >>> %bf.clear2 = and i64 %bf.set1, -4294901761
> >>> %bf.set2 = or i64 %bf.value2, %bf.clear2
> >>> store i64 %bf.set2, i64* %9, align 8
> >>>
> >>> After load-store coalescing, %bf.load now is far away from the
> store,
> >>> and the previous load-and-or-store pattern disappears.
> >>>
> >>> A simple idea to fix it is to move the shrinking in a very
> early pass
> >>> before the first pass of EarlyCSE. However, if we move shrinking
> >>> earlier, it is possible to block the coalescing of other
> ilegal type
> >>> bitfields which can not be shrinked. So for coalescing and
> shrinking,
> >>> no matter which one is first, it will block the other one.
> >>>
> >>> After some discussions with Eli and Michael, I come up with
> another
> >>> idea to let shrinking stay in the late pipeline. It needs two
> changes
> >>> to the current patch:
> >>>
> >>> 1. extending the pattern match to handle store(or(and(or(and...))
> >>> pattern above. It needs to analyze the bit mask of every
> and-or pairs.
> >>> If the last and-or pair touch different section with the other
> and-or
> >>> pairs, we can split the original store into two, and do the
> shrinking
> >>> for the second store, like below
> >>>
> >>> %bf.load = load i64, i64* %0, align 8
> >>> %bf.clear = and i64 %bf.load, -65536
> >>> %bf.set = or i64 %bf.value, %bf.clear
> >>> .....
> >>>
> >>> %bf.clear1 = and i64 %bf.set, -4294901761
> >>> %bf.set1 = or i64 %bf.value1, %bf.clear1
> >>> store i64 %bf.set1, i64* %0, align 8 // the first
> >>> store.
> >>> .....
> >>>
> >>> %bf.value2.shrinked = shrink_op %bf.value2
> >>> store i16 %bf.value2.shrinked, i64* %0, align 8 //
> shrinked store.
> >>>
> >>> 2. use memoryssa + alias query to do the safety check. Because
> >>> dominator tree is not properly updated in CGP, I have to
> create a new
> >>> pass and put it before CGP, in order to use memoryssa.
> >>
> >> This makes sense to me. Should we just fix CGP to update the DT
> instead of
> >> working around it?
> >>
> >> -Hal
> > I am not familiar enough with CGP to tell. I just notice
> ModifiedDT is
> > modified at several places in CGP, which indicates there are a few
> > transformations needing their specific dominator tree maintainance
> > work. And I remember simplifyCFG also doesn't use dominator tree to
> > save the effort to maintain it on the fly. So maybe it is easier to
> > separate CGP into two parts: which does need dominator tree and
> > doesn't. Those transformations which don't need dominator tree can
> > stay together.
> >
> > Currently, I know consthoisting is another potential CGP
> > transformation but left out because of its need of dominator
> tree. But
> > consthoisting has already evolved to contain a substantial amount of
> > code so may be better to stay as a separate pass.
>
> That might very well be the case here too (the matching logic
> might grow
> in complexity). CGP has now grown to over 5K lines, and there are
> definitely pieces that can get split out. There is a big chunk
> that runs
> iteratively, however, so that might not buy us as much as we might
> hope.
>
>
> I think a lot of the reasons why we used CGP as a grabbag was because
> for a while we didn't really have a compelling model for *why* we were
> moving out of canonical form. But we've really clarified the pass
> pipeline structure now so I'd love to try and build reasonable
> "optimization" (as opposed to "canonicalization" or "simplification")
> passes and organize them nicely.
Sounds good (although we still need to be careful not to lose the
iterative nature of the current optimizations in cases where that's
important).
>
> One perhaps especially notable thing: this will often end up
> *narrowing*. I think that it would be very valuable to try and place
> this or other passes that effectively narrow data access types *before
> the vectorizers*. This seems important to enable efficient packing of
> element types.
I think it would be useful to work through some examples in this
context. It might help vectorization to narrow beforehand, or it might
end up requiring more scatter/gather load/stores (whereas, in the
context of vectorization, the shifts and masks might have been more
efficient).
-Hal
>
> Past that, I generally like the plan of a dedicated pass to narrow
> integer stuff across basic blocks, etc., using MemorySSA and other
> efficient tools. =]
>
> -Chandler
>
>
> -Hal
>
> >
> > Thanks,
> > Wei.
> >
> >>> Eli suggested me to ask for more opinions before start
> writting code.
> >>> I think it is a good idea and here is the post. Comments are
> >>> appreciated.
> >>>
> >>> Thanks,
> >>> Wei.
> >>> _______________________________________________
> >>> LLVM Developers mailing list
> >>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
> >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>
> >> --
> >> Hal Finkel
> >> Lead, Compiler Technology and Programming Languages
> >> Leadership Computing Facility
> >> Argonne National Laboratory
> >>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170312/f5d8a00e/attachment.html>
More information about the llvm-dev
mailing list