[llvm-dev] [RFC] bitfield access shrinking

Thu Mar 9 10:14:48 PST 2017

In http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html,
consecutive bitfields are wrapped as a group and represented as a
large integer and emits loads stores and bit operations appropriate
for extracting bits from within it. It fixes the problem of violating
C++11 memory model that original widen load/store of bitfield was
facing. It also brings more coalescing opportunities across bitfields.

If some bitfields are natural aligned and their num of bits can form a
legal integer type that the target supports, it is more efficient to
access them directly than doing a large integer load/store plus a
series of bit operations. We call such reverse transformation legal
type bitfield shrinking. Currently, llvm depends on DAGCombiner to do
such shrinking.

However, DAGCombiner has the one-basic-block-a-time limitation, so we
started to implement a new shrinking optimization in
https://reviews.llvm.org/D30416, and initially put it in instcombine,
then moved it to CGP because we want to use some TargetLowering
information.

The initial implementation in D30416 only supports load-and-or-store
pattern matching, and it uses a inst-by-inst scan as a safety check to
make sure there is no other memory write insn between the load and
store (no alias query is done).  After getting the initial
implementation, we found more problems: EarlyCSE, LoadPRE and even
InstCombine itself can do coalescing before the shrinking (LoadPRE
does it the most thoroughly).
The coalescing can move the load many BasicBlocks earlier and make
simple inst-by-inst scan unscalable and oftentimes fail. It also
breaks the load-and-or-store pattern. An example is below:

Before coalescing done by earlycse/loadpre:
%bf.load = load i64, i64* %0, align 8
%bf.clear = and i64 %bf.load, -65536
%bf.set = or i64 %bf.value, %bf.clear
store i64 %bf.set2, i64* %9, align 8
.....
%bf.load1 = load i64, i64* %0, align 8
%bf.clear1 = and i64 %bf.load1, -4294901761
%bf.set1 = or i64 %bf.value1, %bf.clear1
store i64 %bf.set2, i64* %9, align 8
.....
%bf.load2 = load i64, i64* %0, align 8
%bf.clear2 = and i64 %bf.load2, -4294901761
%bf.set2 = or i64 %bf.value2, %bf.clear2
store i64 %bf.set2, i64* %9, align 8

After coalescing, it will become:
%bf.load = load i64, i64* %0, align 8
%bf.clear = and i64 %bf.load, -65536
%bf.set = or i64 %bf.value, %bf.clear
.....
%bf.clear1 = and i64 %bf.set, -4294901761
%bf.set1 = or i64 %bf.value1, %bf.clear1
.....
%bf.clear2 = and i64 %bf.set1, -4294901761
%bf.set2 = or i64 %bf.value2, %bf.clear2
store i64 %bf.set2, i64* %9, align 8

After load-store coalescing, %bf.load now is far away from the store,
and the previous load-and-or-store pattern disappears.

A simple idea to fix it is to move the shrinking in a very early pass
before the first pass of EarlyCSE. However, if we move shrinking
earlier, it is possible to block the coalescing of other ilegal type
bitfields which can not be shrinked. So for coalescing and shrinking,
no matter which one is first, it will block the other one.

After some discussions with Eli and Michael, I come up with another
idea to let shrinking stay in the late pipeline. It needs two changes
to the current patch:

1. extending the pattern match to handle store(or(and(or(and...))
pattern above. It needs to analyze the bit mask of every and-or pairs.
If the last and-or pair touch different section with the other and-or
pairs, we can split the original store into two, and do the shrinking
for the second store, like below

%bf.load = load i64, i64* %0, align 8
%bf.clear = and i64 %bf.load, -65536
%bf.set = or i64 %bf.value, %bf.clear
.....

%bf.clear1 = and i64 %bf.set, -4294901761
%bf.set1 = or i64 %bf.value1, %bf.clear1
store i64 %bf.set1, i64* %0, align 8                         // the first store.
.....

%bf.value2.shrinked = shrink_op %bf.value2
store i16 %bf.value2.shrinked, i64* %0, align 8       // shrinked store.

2. use memoryssa + alias query to do the safety check. Because
dominator tree is not properly updated in CGP, I have to create a new
pass and put it before CGP, in order to use memoryssa.

Eli suggested me to ask for more opinions before start writting code.
I think it is a good idea and here is the post. Comments are
appreciated.

Thanks,
Wei.