<div dir="ltr">Yeah.<div><br></div><div>This seems similar to trying to optimize extract/insert values with real binary operations.</div><div><br></div><div>We do it, but ... historically we only do it by teaching passes that these things look like things they already know ;)</div><div>(i.e., we teach GVN that when it looks like this, it's really an add of these two things)<br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 9, 2017 at 10:47 AM, Hal Finkel via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

On 03/09/2017 12:28 PM, Krzysztof Parzyszek via llvm-dev wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

We could add intrinsics to extract/insert a bitfield, which would simplify a lot of that bitwise logic.<br>

</blockquote>

<br></span>

But then you need to teach a bunch of places about how to simplify them, fold using bitwise logic and other things that reduce demanded bits into them, etc. This seems like a difficult tradeoff.<br>

<br>

 -Hal<div><div class="h5"><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

-Krzysztof<br>

<br>

<br>

On 3/9/2017 12:14 PM, Wei Mi via llvm-dev wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

In <a href="http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html" rel="noreferrer" target="_blank">http://lists.llvm.org/pipermai<wbr>l/cfe-commits/Week-of-Mon-<wbr>20120827/063200.html</a>,<br>

consecutive bitfields are wrapped as a group and represented as a<br>

large integer, with loads, stores and bit operations emitted as appropriate<br>

for extracting bits from within it. This fixes the C++11 memory model<br>

violation that the original widened load/store of bitfields was<br>

facing. It also brings more coalescing opportunities across bitfields.<br>

<br>

If some bitfields are naturally aligned and their number of bits forms a<br>

legal integer type that the target supports, it is more efficient to<br>

access them directly than to do a large integer load/store plus a<br>

series of bit operations. We call this reverse transformation legal<br>

type bitfield shrinking. Currently, LLVM depends on DAGCombiner to do<br>

such shrinking.<br>
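<br>

To make the shrinking condition above concrete, here is a minimal Python sketch. It is not the code from D30416: the set of legal widths and the name is_shrinkable are assumptions for illustration, and a real pass would get the legal types from the target rather than from a hard-coded tuple.<br>

```python
# Assumed set of integer widths (in bits) that a hypothetical target can
# load and store directly. A real pass would query the target instead.
LEGAL_WIDTHS = (8, 16, 32, 64)


def is_shrinkable(bit_offset, bit_width, legal_widths=LEGAL_WIDTHS):
    """Return True if a bitfield at bit_offset with bit_width bits can be
    accessed with a single naturally aligned access of a legal width."""
    return bit_width in legal_widths and bit_offset % bit_width == 0


# A 16-bit field at bit 16 of the group qualifies; a 12-bit field or a
# 16-bit field at bit 8 does not.
print(is_shrinkable(16, 16), is_shrinkable(0, 12), is_shrinkable(8, 16))
```

Fields that fail this check stay on the wide load/store plus bit operations path.<br>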

<br>

However, DAGCombiner has the one-basic-block-at-a-time limitation, so we<br>

started to implement a new shrinking optimization in<br>

<a href="https://reviews.llvm.org/D30416" rel="noreferrer" target="_blank">https://reviews.llvm.org/D3041<wbr>6</a>, and initially put it in instcombine,<br>

then moved it to CGP because we want to use some TargetLowering<br>

information.<br>

<br>

The initial implementation in D30416 only supports load-and-or-store<br>

pattern matching, and it uses an inst-by-inst scan as a safety check to<br>

make sure there is no other memory write instruction between the load and<br>

store (no alias query is done).  After getting the initial<br>

implementation, we found more problems: EarlyCSE, LoadPRE and even<br>

InstCombine itself can do coalescing before the shrinking (LoadPRE<br>

does it the most thoroughly).<br>
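<br>

The inst-by-inst safety scan described above can be sketched roughly as follows. This is a toy model in Python, not the patch itself: a basic block is represented as a plain list of opcode names, and the MAY_WRITE set and the function name are assumptions for illustration. Since no alias query is done, anything that may write memory has to block the transformation.<br>

```python
# Opcodes treated as possible memory writes. This set is an assumption for
# illustration; without alias queries every possible write must be rejected.
MAY_WRITE = {"store", "call", "invoke", "atomicrmw", "cmpxchg", "fence"}


def no_write_between(opcodes, load_idx, store_idx):
    """Return True if no instruction strictly between the load and the
    store may write memory. opcodes is one basic block, in order."""
    return not any(op in MAY_WRITE for op in opcodes[load_idx + 1:store_idx])


# The plain load-and-or-store pattern passes the check.
print(no_write_between(["load", "and", "or", "store"], 0, 3))
```

The scan is linear in the distance between the load and the store and only works inside one block, which is the limitation discussed next.<br>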

The coalescing can move the load many BasicBlocks earlier, which makes a<br>

simple inst-by-inst scan unscalable and often makes it fail. It also<br>

breaks the load-and-or-store pattern. An example is below:<br>

<br>

Before coalescing done by earlycse/loadpre:<br>

%bf.load = load i64, i64* %0, align 8<br>

%bf.clear = and i64 %bf.load, -65536<br>

%bf.set = or i64 %bf.value, %bf.clear<br>

store i64 %bf.set, i64* %9, align 8<br>

.....<br>

%bf.load1 = load i64, i64* %0, align 8<br>

%bf.clear1 = and i64 %bf.load1, -4294901761<br>

%bf.set1 = or i64 %bf.value1, %bf.clear1<br>

store i64 %bf.set1, i64* %9, align 8<br>

.....<br>

%bf.load2 = load i64, i64* %0, align 8<br>

%bf.clear2 = and i64 %bf.load2, -4294901761<br>

%bf.set2 = or i64 %bf.value2, %bf.clear2<br>

store i64 %bf.set2, i64* %9, align 8<br>

<br>

After coalescing, it will become:<br>

%bf.load = load i64, i64* %0, align 8<br>

%bf.clear = and i64 %bf.load, -65536<br>

%bf.set = or i64 %bf.value, %bf.clear<br>

.....<br>

%bf.clear1 = and i64 %bf.set, -4294901761<br>

%bf.set1 = or i64 %bf.value1, %bf.clear1<br>

.....<br>

%bf.clear2 = and i64 %bf.set1, -4294901761<br>

%bf.set2 = or i64 %bf.value2, %bf.clear2<br>

store i64 %bf.set2, i64* %9, align 8<br>

<br>

After load-store coalescing, %bf.load now is far away from the store,<br>

and the previous load-and-or-store pattern disappears.<br>
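<br>

For reference, the `and` constants in the IR above encode which bits of the i64 each bitfield write clears. The small Python sketch below decodes them; the helper name cleared_range and the 64-bit word size are assumptions for illustration, not part of the patch.<br>

```python
WORD_BITS = 64  # the examples above operate on i64
WORD_MASK = (1 << WORD_BITS) - 1


def cleared_range(and_mask):
    """Return (bit_offset, bit_width) of the contiguous bits cleared by an
    `and` with and_mask, or None if the cleared bits are empty or are not
    one contiguous run."""
    cleared = ~and_mask & WORD_MASK
    if cleared == 0:
        return None
    # Index of the lowest cleared bit.
    offset = (cleared & -cleared).bit_length() - 1
    run = cleared >> offset
    # A contiguous run of ones has the form 2**k - 1.
    if run & (run + 1):
        return None
    return offset, run.bit_length()


print(cleared_range(-65536))       # bits cleared by %bf.clear
print(cleared_range(-4294901761))  # bits cleared by %bf.clear1
```

With these constants, -65536 clears bits 0 to 15 and -4294901761 clears bits 16 to 31, so each write targets a 16-bit, naturally aligned field.<br>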

<br>

A simple idea to fix it is to move the shrinking into a very early pass<br>

before the first pass of EarlyCSE. However, if we move shrinking<br>

earlier, it may block the coalescing of other illegal-type<br>

bitfields which cannot be shrunk. So for coalescing and shrinking,<br>

no matter which one runs first, it will block the other one.<br>

<br>

After some discussions with Eli and Michael, I came up with another<br>

idea to let shrinking stay in the late pipeline. It needs two changes<br>

to the current patch:<br>

<br>

1. Extend the pattern match to handle the store(or(and(or(and(...)))))<br>

pattern above. It needs to analyze the bit mask of every and-or pair.<br>

If the last and-or pair touches a different section from the other and-or<br>

pairs, we can split the original store into two, and do the shrinking<br>

for the second store, like below:<br>

<br>

%bf.load = load i64, i64* %0, align 8<br>

%bf.clear = and i64 %bf.load, -65536<br>

%bf.set = or i64 %bf.value, %bf.clear<br>

.....<br>

<br>

%bf.clear1 = and i64 %bf.set, -4294901761<br>

%bf.set1 = or i64 %bf.value1, %bf.clear1<br>

store i64 %bf.set1, i64* %0, align 8                         // the first store.<br>

.....<br>

<br>

%bf.value2.shrinked = shrink_op %bf.value2<br>

store i16 %bf.value2.shrinked, i64* %0, align 8       // shrinked store.<br>
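<br>

The mask analysis in step 1 could look roughly like the Python sketch below. It is only a sketch under assumptions: the function names are hypothetical, the chain is given as the list of `and` constants from the earliest pair to the last, and the third mask in the usage line is a made-up one that clears bits 32 to 47 so that the last pair really touches a different section.<br>

```python
WORD_MASK = (1 << 64) - 1  # the chain operates on i64


def cleared_bits(and_mask):
    """Bits of the 64-bit word that an `and` with and_mask clears."""
    return ~and_mask & WORD_MASK


def last_pair_is_independent(and_masks):
    """and_masks holds the `and` constants of the chain, earliest first.
    Return True if the last and-or pair clears bits that no earlier pair
    touches, so that its store can be split off and shrunk."""
    *earlier, last = and_masks
    touched_earlier = 0
    for mask in earlier:
        touched_earlier |= cleared_bits(mask)
    return (cleared_bits(last) & touched_earlier) == 0


# Hypothetical third field at bits 32..47: disjoint from the first two.
print(last_pair_is_independent([-65536, -4294901761, ~(0xFFFF << 32)]))
```

When the check fails, the last pair overlaps an earlier one and the chain has to stay on the wide store.<br>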

<br>

2. Use MemorySSA + alias queries to do the safety check. Because the<br>

dominator tree is not properly updated in CGP, I have to create a new<br>

pass and put it before CGP in order to use MemorySSA.<br>

<br>

Eli suggested that I ask for more opinions before starting to write code.<br>

I think it is a good idea and here is the post. Comments are<br>

appreciated.<br>

<br>

Thanks,<br>

Wei.<br>

______________________________<wbr>_________________<br>

LLVM Developers mailing list<br>

<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>

<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>

<br>

</blockquote>

<br>

</blockquote>

<br>

-- <br></div></div>

Hal Finkel<br>

Lead, Compiler Technology and Programming Languages<br>

Leadership Computing Facility<br>

Argonne National Laboratory<div class="HOEnZb"><div class="h5"><br>

<br>


</div></div></blockquote></div><br></div>