[LLVMdev] Optimization hints for "constant" loads

Mon Oct 13 14:59:08 PDT 2014

On 10/11/2014 06:01 PM, Hal Finkel wrote:
> ----- Original Message -----
>> From: "Philip Reames" <listmail at philipreames.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: nicholas at mxc.ca, llvmdev at cs.uiuc.edu
>> Sent: Friday, October 10, 2014 8:07:55 PM
>> Subject: Re: Optimization hints for "constant" loads
>>
>>
>> On 09/28/2014 01:22 AM, Hal Finkel wrote:
>>> ----- Original Message -----
>>>> From: "Philip Reames" <listmail at philipreames.com>
>>>> To: llvmdev at cs.uiuc.edu
>>>> Cc: "Hal Finkel" <hfinkel at anl.gov>, nicholas at mxc.ca
>>>> Sent: Wednesday, September 10, 2014 12:11:28 AM
>>>> Subject: Optimization hints for "constant" loads
>>>>
>>>> I'm looking at how to optimize IR which contains reads from a
>>>> field
>>>> which is known to be initialized exactly once.  I've got an idea
>>>> on
>>>> how
>>>> to approach this, but wanted to see if others have alternate ideas
>>>> or
>>>> similar problems which might spark discussion.  It feels like
>>>> there's
>>>> a
>>>> potentially generally useful optimization hint here if we can
>>>> generalize
>>>> it sufficiently without loosing optimization potential.
>>>>
>>>> The problem:
>>>> struct array {
>>>> private:
>>>>      // once initialized 'len' never changes
>>>>      int len;
>>>>      // data can be modified at will
>>>>      char data[0];
>>>> public:
>>>>      static array* make(int len) {
>>>>        array* a = ... allocate uninitialized space
>>>>        a->len = len;
>>>>        return a;
>>>>      }
>>>> };
>>>> void access(array* a, int idx) {
>>>>      if( idx >= 0 && idx <- a->len ) {
>>>>        a->data[idx] = 5;
>>>>      }
>>>> }
>>>> void foo(array* a) {
>>>>      for(int i = 0; i < a->len; i++) {
>>>>        access(a, i);
>>>>      }
>>>> }
>>>> // assume 'access' is inlined into 'foo' and the loop is unrolled
>>>> a
>>>> time
>>>> or two
>>>>
>>>> To phrase that again in english, I've got a field which is
>>>> initialized
>>>> once, with naive code which reads from it many times.  I know at
>>>> IR
>>>> generation time that a load from array::len is special, but I
>>>> loose
>>>> this
>>>> information as soon as I generate IR.  In particular, I run into
>>>> aliasing problems where the location a->len is considered
>>>> 'mayalias'
>>>> with unrelated stores thus preventing value forwarding, LICM, and
>>>> other
>>>> desirable optimizations.
>>>>
>>>> Existing Approaches:
>>> I think that, at least in theory, we already have a solution to
>>> this problem. If, after the initialization is complete, you insert
>>> a call to the llvm.invariant.start intrinsic (and perhaps also do
>>> so at the start of any routine you know can be called only after
>>> initialization is complete), that should convey the information
>>> you want.
>>>
>>> I've never worked with this intrinsic before, but if this does not
>>> work, I'd be really curious to know why.
>>   From some experiments, the existing invariant intrinsics appear
>>   nearly
>> useless for my purposes.  The problem is that any call can hide a
>> invariant.end intrinsic.  As a result, the optimizer must
>> conservatively
>> assume that any call clobbers the "invariant" location.  This makes
>> the
>> intrinsic a non-starter in it's current form.
>>
>> ; Function Attrs: uwtable
>> define zeroext i1 @_Z4testv() #0 {
>>     %1 = tail call noalias i8* @_Znwm(i64 4) #3
>>     %2 = bitcast i8* %1 to i32*
>>     store i32 5, i32* %2, align 4, !tbaa !1
>>     %discard = call {}* @llvm.invariant.start(i64 4, i8* %1)
>>     tail call void @_Z6escapePi(i32* %2)
>>     %3 = load i32* %2, align 4, !tbaa !1
>>     %4 = icmp eq i32 %3, 5 <-- this conditional should be folded away
>>     and is not
>>     ret i1 %4
>> }
>>
>> declare {}* @llvm.invariant.start(i64, i8*)
>>
>> ; Function Attrs: nobuiltin
>> declare noalias i8* @_Znwm(i64) #1
>>
>> declare void @_Z6escapePi(i32*) #2
>>
>> It also appears that the intrinsic has limited implementation in the
>> optimizer.  Even surprisingly simple cases don't appear to kick in.
>> Consider:
>> define zeroext i1 @_Z4testv() #0 {
>>     %1 = tail call noalias i8* @_Znwm(i64 4) #4
>>     %2 = bitcast i8* %1 to i32*
>>     store i32 5, i32* %2, align 4, !tbaa !1
>>     %discard = tail call {}* @llvm.invariant.start(i64 4, i8* %1)
>>     %3 = load i32* %2, align 4, !tbaa !1
>>     %4 = icmp eq i32 %3, 5 <-- This conditional should be folded
>>     tail call void @_Z6escapePi(i32* %2, i1 %4)
>>     %5 = load i32* %2, align 4, !tbaa !1
>>     %6 = icmp eq i32 %5, 5
>>     ret i1 %6
>> }
>>
>>
>> We could extend the existing intrinsic with a notion of
>> invariant.start
>> that has no end.  This could be as simple as adding a boolean
>> parameter
>> to the intrinsic.  I think this could be made to work.  There'd be
>> other
>> implementation work needed to make it actually useful, the validity
>> based on dominance model used by assumes seems like an obvious
>> candidate.
>>
>> Thinking through this, I see a couple of places that we'd change:
>> - isLocationKnownInvariant(Value* Loc, Instruction* Cntx) and related
>> analysis pass (analogous to assumptions)
>>     - The new "no end" version is easy (dominance)
>>     - The older version is a graph walk looking paths which include
>> either an end, or call.
>> - eagerly propagate known invariant location store values in EarlyCSE
>> (strictly by dominance for the new intrinsic form)
>> - Add an InstCombine rule to fold a store to a known invariant
>> location
>> to undef
>> - Teach GVN about it (gets the non dominance cases), could also do
>> fun
>> things which phis of invariant locations, but not sure that's
>> actually
>> useful
>> - Teach LICM to lift loads of known invariant locations out of loops
>>
>> Does this seem like a workable approach?
> Yes, I think this sounds quite reasonable.
I'll start throwing together some patches.  It'll take a couple of weeks 
due to current distractions, but I'll get something up for review when I 
can.

Philip