[LLVMdev] Optimization hints for "constant" loads

Sat Oct 11 18:01:41 PDT 2014

----- Original Message -----
> From: "Philip Reames" <listmail at philipreames.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: nicholas at mxc.ca, llvmdev at cs.uiuc.edu
> Sent: Friday, October 10, 2014 8:07:55 PM
> Subject: Re: Optimization hints for "constant" loads
> 
> 
> On 09/28/2014 01:22 AM, Hal Finkel wrote:
> > ----- Original Message -----
> >> From: "Philip Reames" <listmail at philipreames.com>
> >> To: llvmdev at cs.uiuc.edu
> >> Cc: "Hal Finkel" <hfinkel at anl.gov>, nicholas at mxc.ca
> >> Sent: Wednesday, September 10, 2014 12:11:28 AM
> >> Subject: Optimization hints for "constant" loads
> >>
> >> I'm looking at how to optimize IR which contains reads from a
> >> field
> >> which is known to be initialized exactly once.  I've got an idea
> >> on
> >> how
> >> to approach this, but wanted to see if others have alternate ideas
> >> or
> >> similar problems which might spark discussion.  It feels like
> >> there's
> >> a
> >> potentially generally useful optimization hint here if we can
> >> generalize
> >> it sufficiently without loosing optimization potential.
> >>
> >> The problem:
> >> struct array {
> >> private:
> >>     // once initialized 'len' never changes
> >>     int len;
> >>     // data can be modified at will
> >>     char data[0];
> >> public:
> >>     static array* make(int len) {
> >>       array* a = ... allocate uninitialized space
> >>       a->len = len;
> >>       return a;
> >>     }
> >> };
> >> void access(array* a, int idx) {
> >>     if( idx >= 0 && idx <- a->len ) {
> >>       a->data[idx] = 5;
> >>     }
> >> }
> >> void foo(array* a) {
> >>     for(int i = 0; i < a->len; i++) {
> >>       access(a, i);
> >>     }
> >> }
> >> // assume 'access' is inlined into 'foo' and the loop is unrolled
> >> a
> >> time
> >> or two
> >>
> >> To phrase that again in english, I've got a field which is
> >> initialized
> >> once, with naive code which reads from it many times.  I know at
> >> IR
> >> generation time that a load from array::len is special, but I
> >> loose
> >> this
> >> information as soon as I generate IR.  In particular, I run into
> >> aliasing problems where the location a->len is considered
> >> 'mayalias'
> >> with unrelated stores thus preventing value forwarding, LICM, and
> >> other
> >> desirable optimizations.
> >>
> >> Existing Approaches:
> > I think that, at least in theory, we already have a solution to
> > this problem. If, after the initialization is complete, you insert
> > a call to the llvm.invariant.start intrinsic (and perhaps also do
> > so at the start of any routine you know can be called only after
> > initialization is complete), that should convey the information
> > you want.
> >
> > I've never worked with this intrinsic before, but if this does not
> > work, I'd be really curious to know why.
>  From some experiments, the existing invariant intrinsics appear
>  nearly
> useless for my purposes.  The problem is that any call can hide a
> invariant.end intrinsic.  As a result, the optimizer must
> conservatively
> assume that any call clobbers the "invariant" location.  This makes
> the
> intrinsic a non-starter in it's current form.
> 
> ; Function Attrs: uwtable
> define zeroext i1 @_Z4testv() #0 {
>    %1 = tail call noalias i8* @_Znwm(i64 4) #3
>    %2 = bitcast i8* %1 to i32*
>    store i32 5, i32* %2, align 4, !tbaa !1
>    %discard = call {}* @llvm.invariant.start(i64 4, i8* %1)
>    tail call void @_Z6escapePi(i32* %2)
>    %3 = load i32* %2, align 4, !tbaa !1
>    %4 = icmp eq i32 %3, 5 <-- this conditional should be folded away
>    and is not
>    ret i1 %4
> }
> 
> declare {}* @llvm.invariant.start(i64, i8*)
> 
> ; Function Attrs: nobuiltin
> declare noalias i8* @_Znwm(i64) #1
> 
> declare void @_Z6escapePi(i32*) #2
> 
> It also appears that the intrinsic has limited implementation in the
> optimizer.  Even surprisingly simple cases don't appear to kick in.
> Consider:
> define zeroext i1 @_Z4testv() #0 {
>    %1 = tail call noalias i8* @_Znwm(i64 4) #4
>    %2 = bitcast i8* %1 to i32*
>    store i32 5, i32* %2, align 4, !tbaa !1
>    %discard = tail call {}* @llvm.invariant.start(i64 4, i8* %1)
>    %3 = load i32* %2, align 4, !tbaa !1
>    %4 = icmp eq i32 %3, 5 <-- This conditional should be folded
>    tail call void @_Z6escapePi(i32* %2, i1 %4)
>    %5 = load i32* %2, align 4, !tbaa !1
>    %6 = icmp eq i32 %5, 5
>    ret i1 %6
> }
> 
> 
> We could extend the existing intrinsic with a notion of
> invariant.start
> that has no end.  This could be as simple as adding a boolean
> parameter
> to the intrinsic.  I think this could be made to work.  There'd be
> other
> implementation work needed to make it actually useful, the validity
> based on dominance model used by assumes seems like an obvious
> candidate.
> 
> Thinking through this, I see a couple of places that we'd change:
> - isLocationKnownInvariant(Value* Loc, Instruction* Cntx) and related
> analysis pass (analogous to assumptions)
>    - The new "no end" version is easy (dominance)
>    - The older version is a graph walk looking paths which include
> either an end, or call.
> - eagerly propagate known invariant location store values in EarlyCSE
> (strictly by dominance for the new intrinsic form)
> - Add an InstCombine rule to fold a store to a known invariant
> location
> to undef
> - Teach GVN about it (gets the non dominance cases), could also do
> fun
> things which phis of invariant locations, but not sure that's
> actually
> useful
> - Teach LICM to lift loads of known invariant locations out of loops
> 
> Does this seem like a workable approach?

Yes, I think this sounds quite reasonable.

 -Hal

> 
> Philip
> >> 1) Use TBAA -  By tagging loads and stores to the two fields of
> >> the
> >> array struct as disjoint branches in the TBAA tree, I can inform
> >> LLVM
> >> that a load of 'len' never aliases with a store through 'data'.
> >>  This
> >> mostly works, and enables many loads to be forwarded by GVN, but
> >> (due
> >> to
> >> the intervening stores) is a complete loss in EarlyCSE and (due to
> >> intervening calls) LICM.
> >>      a) Things like http://llvm.org/bugs/show_bug.cgi?id=20805
> >>      could
> >> improve the situation in EarlyCSE.
> >> 2) Use "invariant.load" metadata - This metadata indicates that
> >> the
> >> field loaded from is initialized before the execution of the code
> >> being
> >> compiled.  In particular, "invariant.load" implies that the load
> >> is
> >> not
> >> control dependent on any branch, is safe to speculate, and that no
> >> write
> >> aliases the location read from.  This mostly works, but only if
> >> 'array::make' is never visible in the IR.   As soon as
> >> 'array::make'
> >> gets inlined, all bets are off and mis-compiles may result.
> >>      a) Also, in practice, only LICM really knows about
> >> "invariant.load".  It would be pretty straight forward to teach
> >> both
> >> EarlyCSE and GVN about them though.
> >> http://llvm.org/bugs/show_bug.cgi?id=20806
> >>
> >> New Approaches:
> >> (This isn't so much "being proposed" as "being put forward for
> >> discussion".)
> >>
> >> 1) Introduce a new metadata type "initialized-before-load" which
> >> implies
> >> that the value of any two loads tagged with the metadata along any
> >> execution path must yield the same result.
> >>
> >> This doesn't give much freedom to the 'first' load; it's still
> >> control
> >> dependent, can't be reordered with preceding stores, can't be
> >> lifted
> >> out
> >> of loops, etc...  It can however be reordered with following
> >> stores
> >> or
> >> sunk out of a loop provided the loop body is known to execute at
> >> least
> >> once.
> >>
> >> The second load has a lot more freedom.  Provided that there is
> >> always
> >> another load to the same location (with the metadata) provable
> >> preceding
> >> it on all paths, it can be reordered freely, lifted over control
> >> branches, lifted out of loops, etc... Importantly, it is also
> >> legal
> >> to
> >> forward the value of a preceding load to a later load provided
> >> that
> >> a)
> >> both have the metadata and b) that one load executes strictly
> >> before
> >> the
> >> other.
> >>
> >> A store marked "initialized-before-load" is undefined if there is
> >> a
> >> load
> >> with the same metadata on the same location preceding it. There
> >> may
> >> be
> >> *multiple* stores to the location along a path, provided that the
> >> first
> >> load is strictly after *all* of them.
> >>
> >> This seems staight forward to implement in EarlyCSE and LICM.  I
> >> haven't
> >> looked closely at GVN, but expect it's probably not hard either.
> >>
> >> 2) Introduce a slightly different metadata "initialized-once".
> >> Semantics
> >> are very similar to the preceding except that there can only be a
> >> single
> >> store to the location along any path.
> >>
> >> Value forwarding from the store to a following load (with
> >> metadata)
> >> is
> >> allowed regardless of potentially aliasing intervening stores.
> >>
> >> This was actually my original idea, but it has a couple of
> >> problems.
> >> First, it breaks on surprisingly common initialization patterns
> >> such
> >> as
> >> default initialization followed by real initialization.  Secondly,
> >> I'm
> >> not sure the optimizer would always preserve the "write once"
> >> property.
> >> In particular, the optimizer is free to divide a large write into
> >> several smaller ones (assuming the write is not atomic.)
> >>
> >>
> >>
> >> Thoughts?  Suggestions?  Similar sounding problems this might be
> >> able
> >> to
> >> solve?
> >>
> >> Philip
> >>
> >>
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory