[llvm-dev] [RFC] bitfield access shrinking

Wei Mi via llvm-dev llvm-dev at lists.llvm.org
Thu Mar 9 13:49:40 PST 2017


On Thu, Mar 9, 2017 at 10:54 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> On 03/09/2017 12:14 PM, Wei Mi via llvm-dev wrote:
>>
>> In
>> http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html,
>> consecutive bitfields are wrapped as a group and represented as a
>> large integer and emits loads stores and bit operations appropriate
>> for extracting bits from within it. It fixes the problem of violating
>> C++11 memory model that original widen load/store of bitfield was
>> facing. It also brings more coalescing opportunities across bitfields.
>>
>> If some bitfields are natural aligned and their num of bits can form a
>> legal integer type that the target supports, it is more efficient to
>> access them directly than doing a large integer load/store plus a
>> series of bit operations. We call such reverse transformation legal
>> type bitfield shrinking. Currently, llvm depends on DAGCombiner to do
>> such shrinking.
>>
>> However, DAGCombiner has the one-basic-block-a-time limitation, so we
>> started to implement a new shrinking optimization in
>> https://reviews.llvm.org/D30416, and initially put it in instcombine,
>> then moved it to CGP because we want to use some TargetLowering
>> information.
>>
>> The initial implementation in D30416 only supports load-and-or-store
>> pattern matching, and it uses a inst-by-inst scan as a safety check to
>> make sure there is no other memory write insn between the load and
>> store (no alias query is done).  After getting the initial
>> implementation, we found more problems: EarlyCSE, LoadPRE and even
>> InstCombine itself can do coalescing before the shrinking (LoadPRE
>> does it the most thoroughly).
>> The coalescing can move the load many BasicBlocks earlier and make
>> simple inst-by-inst scan unscalable and oftentimes fail. It also
>> breaks the load-and-or-store pattern. An example is below:
>>
>> Before coalescing done by earlycse/loadpre:
>> %bf.load = load i64, i64* %0, align 8
>> %bf.clear = and i64 %bf.load, -65536
>> %bf.set = or i64 %bf.value, %bf.clear
>> store i64 %bf.set2, i64* %9, align 8
>> .....
>> %bf.load1 = load i64, i64* %0, align 8
>> %bf.clear1 = and i64 %bf.load1, -4294901761
>> %bf.set1 = or i64 %bf.value1, %bf.clear1
>> store i64 %bf.set2, i64* %9, align 8
>> .....
>> %bf.load2 = load i64, i64* %0, align 8
>> %bf.clear2 = and i64 %bf.load2, -4294901761
>> %bf.set2 = or i64 %bf.value2, %bf.clear2
>> store i64 %bf.set2, i64* %9, align 8
>>
>> After coalescing, it will become:
>> %bf.load = load i64, i64* %0, align 8
>> %bf.clear = and i64 %bf.load, -65536
>> %bf.set = or i64 %bf.value, %bf.clear
>> .....
>> %bf.clear1 = and i64 %bf.set, -4294901761
>> %bf.set1 = or i64 %bf.value1, %bf.clear1
>> .....
>> %bf.clear2 = and i64 %bf.set1, -4294901761
>> %bf.set2 = or i64 %bf.value2, %bf.clear2
>> store i64 %bf.set2, i64* %9, align 8
>>
>> After load-store coalescing, %bf.load now is far away from the store,
>> and the previous load-and-or-store pattern disappears.
>>
>> A simple idea to fix it is to move the shrinking in a very early pass
>> before the first pass of EarlyCSE. However, if we move shrinking
>> earlier, it is possible to block the coalescing of other ilegal type
>> bitfields which can not be shrinked. So for coalescing and shrinking,
>> no matter which one is first, it will block the other one.
>>
>> After some discussions with Eli and Michael, I come up with another
>> idea to let shrinking stay in the late pipeline. It needs two changes
>> to the current patch:
>>
>> 1. extending the pattern match to handle store(or(and(or(and...))
>> pattern above. It needs to analyze the bit mask of every and-or pairs.
>> If the last and-or pair touch different section with the other and-or
>> pairs, we can split the original store into two, and do the shrinking
>> for the second store, like below
>>
>> %bf.load = load i64, i64* %0, align 8
>> %bf.clear = and i64 %bf.load, -65536
>> %bf.set = or i64 %bf.value, %bf.clear
>> .....
>>
>> %bf.clear1 = and i64 %bf.set, -4294901761
>> %bf.set1 = or i64 %bf.value1, %bf.clear1
>> store i64 %bf.set1, i64* %0, align 8                         // the first
>> store.
>> .....
>>
>> %bf.value2.shrinked = shrink_op %bf.value2
>> store i16 %bf.value2.shrinked, i64* %0, align 8       // shrinked store.
>>
>> 2. use memoryssa + alias query to do the safety check. Because
>> dominator tree is not properly updated in CGP, I have to create a new
>> pass and put it before CGP, in order to use memoryssa.
>
>
> This makes sense to me. Should we just fix CGP to update the DT instead of
> working around it?
>
>  -Hal

I am not familiar enough with CGP to tell. I just notice ModifiedDT is
modified at several places in CGP, which indicates there are a few
transformations needing their specific dominator tree maintainance
work. And I remember simplifyCFG also doesn't use dominator tree to
save the effort to maintain it on the fly. So maybe it is easier to
separate CGP into two parts: which does need dominator tree and
doesn't. Those transformations which don't need dominator tree can
stay together.

Currently, I know consthoisting is another potential CGP
transformation but left out because of its need of dominator tree. But
consthoisting has already evolved to contain a substantial amount of
code so may be better to stay as a separate pass.

Thanks,
Wei.

>
>>
>> Eli suggested me to ask for more opinions before start writting code.
>> I think it is a good idea and here is the post. Comments are
>> appreciated.
>>
>> Thanks,
>> Wei.
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>


More information about the llvm-dev mailing list