[llvm-dev] [RFC] bitfield access shrinking

Thu Mar 9 15:42:54 PST 2017

I think it'd be nice to keep CGP as "a grab-bag of relatively simple
transformations that don't require/preserve any complex analyses".

Note that that's not quite where we're at right now - while CGP doesn't
seem to use AA at all, it does have *one* transformation that uses LI and
DT - eliminateMostlyEmptyBlocks. And it constructs those on demand. :-\
We may want to split that out as well.

(The "ModifiedDT" flag is a bit of red herring - IIUC, what it really means
is ModifiedCFG, I think the only place that uses it is the optimizeBlock()
iteration - and that probably ought to be using a worklist instead...)

On Thu, Mar 9, 2017 at 1:49 PM, Wei Mi <wmi at google.com> wrote:

> On Thu, Mar 9, 2017 at 10:54 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > On 03/09/2017 12:14 PM, Wei Mi via llvm-dev wrote:
> >>
> >> In
> >> http://lists.llvm.org/pipermail/cfe-commits/Week-of-
> Mon-20120827/063200.html,
> >> consecutive bitfields are wrapped as a group and represented as a
> >> large integer and emits loads stores and bit operations appropriate
> >> for extracting bits from within it. It fixes the problem of violating
> >> C++11 memory model that original widen load/store of bitfield was
> >> facing. It also brings more coalescing opportunities across bitfields.
> >>
> >> If some bitfields are natural aligned and their num of bits can form a
> >> legal integer type that the target supports, it is more efficient to
> >> access them directly than doing a large integer load/store plus a
> >> series of bit operations. We call such reverse transformation legal
> >> type bitfield shrinking. Currently, llvm depends on DAGCombiner to do
> >> such shrinking.
> >>
> >> However, DAGCombiner has the one-basic-block-a-time limitation, so we
> >> started to implement a new shrinking optimization in
> >> https://reviews.llvm.org/D30416, and initially put it in instcombine,
> >> then moved it to CGP because we want to use some TargetLowering
> >> information.
> >>
> >> The initial implementation in D30416 only supports load-and-or-store
> >> pattern matching, and it uses a inst-by-inst scan as a safety check to
> >> make sure there is no other memory write insn between the load and
> >> store (no alias query is done).  After getting the initial
> >> implementation, we found more problems: EarlyCSE, LoadPRE and even
> >> InstCombine itself can do coalescing before the shrinking (LoadPRE
> >> does it the most thoroughly).
> >> The coalescing can move the load many BasicBlocks earlier and make
> >> simple inst-by-inst scan unscalable and oftentimes fail. It also
> >> breaks the load-and-or-store pattern. An example is below:
> >>
> >> Before coalescing done by earlycse/loadpre:
> >> %bf.load = load i64, i64* %0, align 8
> >> %bf.clear = and i64 %bf.load, -65536
> >> %bf.set = or i64 %bf.value, %bf.clear
> >> store i64 %bf.set2, i64* %9, align 8
> >> .....
> >> %bf.load1 = load i64, i64* %0, align 8
> >> %bf.clear1 = and i64 %bf.load1, -4294901761
> >> %bf.set1 = or i64 %bf.value1, %bf.clear1
> >> store i64 %bf.set2, i64* %9, align 8
> >> .....
> >> %bf.load2 = load i64, i64* %0, align 8
> >> %bf.clear2 = and i64 %bf.load2, -4294901761
> >> %bf.set2 = or i64 %bf.value2, %bf.clear2
> >> store i64 %bf.set2, i64* %9, align 8
> >>
> >> After coalescing, it will become:
> >> %bf.load = load i64, i64* %0, align 8
> >> %bf.clear = and i64 %bf.load, -65536
> >> %bf.set = or i64 %bf.value, %bf.clear
> >> .....
> >> %bf.clear1 = and i64 %bf.set, -4294901761
> >> %bf.set1 = or i64 %bf.value1, %bf.clear1
> >> .....
> >> %bf.clear2 = and i64 %bf.set1, -4294901761
> >> %bf.set2 = or i64 %bf.value2, %bf.clear2
> >> store i64 %bf.set2, i64* %9, align 8
> >>
> >> After load-store coalescing, %bf.load now is far away from the store,
> >> and the previous load-and-or-store pattern disappears.
> >>
> >> A simple idea to fix it is to move the shrinking in a very early pass
> >> before the first pass of EarlyCSE. However, if we move shrinking
> >> earlier, it is possible to block the coalescing of other ilegal type
> >> bitfields which can not be shrinked. So for coalescing and shrinking,
> >> no matter which one is first, it will block the other one.
> >>
> >> After some discussions with Eli and Michael, I come up with another
> >> idea to let shrinking stay in the late pipeline. It needs two changes
> >> to the current patch:
> >>
> >> 1. extending the pattern match to handle store(or(and(or(and...))
> >> pattern above. It needs to analyze the bit mask of every and-or pairs.
> >> If the last and-or pair touch different section with the other and-or
> >> pairs, we can split the original store into two, and do the shrinking
> >> for the second store, like below
> >>
> >> %bf.load = load i64, i64* %0, align 8
> >> %bf.clear = and i64 %bf.load, -65536
> >> %bf.set = or i64 %bf.value, %bf.clear
> >> .....
> >>
> >> %bf.clear1 = and i64 %bf.set, -4294901761
> >> %bf.set1 = or i64 %bf.value1, %bf.clear1
> >> store i64 %bf.set1, i64* %0, align 8                         // the
> first
> >> store.
> >> .....
> >>
> >> %bf.value2.shrinked = shrink_op %bf.value2
> >> store i16 %bf.value2.shrinked, i64* %0, align 8       // shrinked store.
> >>
> >> 2. use memoryssa + alias query to do the safety check. Because
> >> dominator tree is not properly updated in CGP, I have to create a new
> >> pass and put it before CGP, in order to use memoryssa.
> >
> >
> > This makes sense to me. Should we just fix CGP to update the DT instead
> of
> > working around it?
> >
> >  -Hal
>
> I am not familiar enough with CGP to tell. I just notice ModifiedDT is
> modified at several places in CGP, which indicates there are a few
> transformations needing their specific dominator tree maintainance
> work. And I remember simplifyCFG also doesn't use dominator tree to
> save the effort to maintain it on the fly. So maybe it is easier to
> separate CGP into two parts: which does need dominator tree and
> doesn't. Those transformations which don't need dominator tree can
> stay together.
>
> Currently, I know consthoisting is another potential CGP
> transformation but left out because of its need of dominator tree. But
> consthoisting has already evolved to contain a substantial amount of
> code so may be better to stay as a separate pass.
>
> Thanks,
> Wei.
>
> >
> >>
> >> Eli suggested me to ask for more opinions before start writting code.
> >> I think it is a good idea and here is the post. Comments are
> >> appreciated.
> >>
> >> Thanks,
> >> Wei.
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
> >
> > --
> > Hal Finkel
> > Lead, Compiler Technology and Programming Languages
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170309/97b71e2b/attachment.html>