[llvm-dev] Load combine pass

Thu Sep 29 14:36:33 PDT 2016

We have seen cases where widening is profitable at a given point in compilation, but once transformations such as inlining expose more surrounding code, later optimizations do a better job with the non-widened code [1].

A couple of concerns:
1. Instcombine rules for widening would need to take care of endianness and of interactions with other optimizations, since instcombine runs multiple times. From the above discussion there are a couple of passes, such as PRE and LICM (probably more), that can be negatively affected by the widening. I'm not sure it's a good idea to add this as an instcombine rule.

2. I think it may be difficult to identify enough obvious cases, ones that are an improvement or neutral on all architectures, to warrant a late fix-up pass :) There's also the question of maintainability when adding new rules to the pass. Widening rules are perhaps best suited to back-end optimizations, where an architecture-specific profitability cost model is available. For example, there is store widening in the Hexagon backend. I'm not sure whether we have more of this in place.
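To make the endianness concern in point 1 concrete, here is a minimal sketch in Python (not LLVM code; the helper names combine_le, combine_be, wide_load and bswap32 are made up for illustration). It shows that a byte-wise pattern equals a plain wide load only when the pattern's byte order matches the target's; otherwise a byte swap is needed on top of the wide load.

```python
def combine_le(buf):
    """Byte-wise pattern that assembles a little-endian 32-bit value."""
    b0, b1, b2, b3 = buf[:4]
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)


def combine_be(buf):
    """Byte-wise pattern that assembles a big-endian 32-bit value."""
    b0, b1, b2, b3 = buf[:4]
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3


def wide_load(buf, byteorder):
    """Model of a single 32-bit load on a target with the given byte order."""
    return int.from_bytes(buf[:4], byteorder)


def bswap32(x):
    """Reverse the four bytes of a 32-bit value."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")


mem = bytes([0x11, 0x22, 0x33, 0x44])

# The pattern folds to a plain wide load only if byte orders match.
print(hex(combine_le(mem)), hex(wide_load(mem, "little")))
# A big-endian pattern on a little-endian target needs load + bswap.
print(hex(combine_be(mem)), hex(bswap32(wide_load(mem, "little"))))
```

Any widening rule, wherever it lives, would have to pick between these two foldings based on the target's data layout.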

[1] http://lists.llvm.org/pipermail/llvm-dev/2016-June/101789.html

Anna

On Sep 29, 2016, at 2:25 PM, Artur Pilipenko <apilipenko at azulsystems.com> wrote:

On 29 Sep 2016, at 21:16, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:

Hi Artur,

Artur Pilipenko wrote:
On 29 Sep 2016, at 21:01, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:

Hi Artur,

Artur Pilipenko wrote:

BTW, do we really need to emit an atomic load if all the individual
components are bytes?
Depends -- do you mean at the hardware level or at the IR
level?

If you mean at the IR level, then I think yes; since otherwise it is
legal to do transforms that break byte-wise atomicity in the IR, e.g.:

 i32* ptr = ...
 i32  val = *ptr

=>   // Since no threads can be legally racing on *ptr

 i32* ptr = ...
 i32 val0 = *ptr
 i32 val1 = *ptr
 i32 val = (val0 & 1) | (val1 & ~1);

If you're talking about the hardware level, then I'm not sure; and my
guess is that the answer is almost certainly arch-dependent.
I meant the case when we have a load by bytes pattern like this:
i8* p = ...
i8 b0 = *p++;
i8 b1 = *p++;
i8 b2 = *p++;
i8 b3 = *p++;
i32 result = b0 << 24 | b1 << 16 | b2 << 8 | b3 << 0;

When we fold it to an i32 load, should this load be atomic?

If we do fold it to a non-atomic i32 load, then it would be legal for
LLVM to do the IR transform I mentioned above.  That breaks the
byte-wise atomicity you had in the original program.

That is, in:

i8* p = ...
i8 b0 = *p++;
i8 b1 = *p++;
i8 b2 = *p++;
i8 b3 = *p++;
// Note: I changed this to be little endian, and I've assumed
// that we're compiling for a little endian system
i32 result = b3 << 24 | b2 << 16 | b1 << 8 | b0 << 0;

say all of p[0..3] are 0, and you have a thread racing to set p[0] to
-1.  Then result can be either 0 or 255.
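As a sanity check on the "0 or 255" claim, here is a small Python sketch that enumerates the outcomes. It assumes a simplified model in which the racing store to p[0] lands at exactly one point relative to the four byte loads; that model and the function name bytewise_results are assumptions for illustration, not the LLVM memory model.

```python
def bytewise_results():
    """Enumerate results of the byte-wise pattern with one racing store to p[0]."""
    results = set()
    # store_at == k: the store of 0xFF to p[0] lands just before the k-th
    # byte load; store_at == 4 means it lands after all four loads.
    for store_at in range(5):
        mem = [0, 0, 0, 0]
        loaded = []
        for i in range(4):
            if store_at == i:
                mem[0] = 0xFF
            loaded.append(mem[i])
        b0, b1, b2, b3 = loaded
        results.add(b3 << 24 | b2 << 16 | b1 << 8 | b0)
    return results


print(sorted(bytewise_results()))
```

Under this model the only observable values are 0 and 255, because p[0] is read exactly once.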

However, say you first transform this to a non-atomic i32 load:

i8* p = ...
i32* p.i32 = (i32*)p
i32 result = *p.i32

and then apply the transform above:

i8* p = ...
i32* p.i32 = (i32*)p
i32 result0 = *p.i32
i32 result1 = *p.i32
i32 result = (result0 & 1) | (result1 & ~1);

then it is possible for result to be 254 (by result0 observing 0 and
result1 observing 255).
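The same kind of enumeration can be done for the split wide load. This is a minimal Python sketch under the same simplified interleaving assumption (the racing store lands before, between, or after the two i32 loads); split_load_results is a made-up name, and the model is only an illustration of the argument, not of how LLVM defines racing loads.

```python
MASK32 = 0xFFFFFFFF


def split_load_results():
    """Enumerate results when one i32 load is duplicated and recombined."""
    results = set()
    # store_at == 0: store lands before both loads; 1: between them;
    # 2: after both loads.
    for store_at in range(3):
        word = 0  # the four bytes p[0..3] viewed as a little-endian i32
        loads = []
        for i in range(2):
            if store_at == i:
                word = 0xFF  # racing store sets p[0] to -1 (0xFF)
            loads.append(word)
        result0, result1 = loads
        results.add((result0 & 1) | (result1 & ~1 & MASK32))
    return results


print(sorted(split_load_results()))
```

The value 254 shows up only when the store lands between the two loads, which is exactly the outcome the byte-wise original could never produce.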
I see. For some reason I was assuming byte-wise atomicity for non-atomic loads.

So, if any of the components are atomic, the resulting load must be atomic as well.

Artur

-- Sanjoy
