[llvm-dev] persuading licm to do the right thing

Wed Dec 9 14:25:38 PST 2015

I understand that GEPs do not access memory.
They do a (possibly expensive) address calculation,
perhaps adding a few values to a label and leaving the result in a register.
Getting a label into a register is (to me) just like loading a 64-bit
integer value
into a register. It can happen in many places and it can cost a few
instructions
and several bytes. I'd like to see such things commoned and hoisted when
possible.

I agree with you that I may be fighting the spirit of the IR.

I also agree with Daniel that the classic solution is to lower and
re-optimize.
Indeed, I tried it, sort of, by rewriting GEPs into a sequence of adds and
multiplies,
with bitcasts to gets the types right. This helps in some places, but the
re-canonicalization defeats my efforts in others places.

So now I feel like I'm going to have to write my own LICM and VN passes.
I can do it and it won't take that long, but I had imagined that I could
pick and choose from the tools already available to achieve the desired
effect.
Seems not, and I'm surprised.

Preston

On Wed, Dec 9, 2015 at 1:10 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
> On Dec 9, 2015, at 11:16 AM, Preston Briggs <briggs at reservoir.com> wrote:
>
> A GEP can represent a potentially large tree of instructions.
> Seems like all the sub-trees are hidden from optimization;
> that is, I never see licm or value numbering doing anything with them.
> If I rewrite the GEPs as lots of little adds and multiplies,
> then opt will do a better job (I speculate this happens during lowering).
>
> One of the computations that's hidden in the GEP in my example
> is the non-zero effort required to get the value of a label into a
> register.
> I'd like to expose that effort to the optimizer; instead, it seems
> hidden, papered over with the GEP.
>
> Your point about putting labels in global variables seems to work
> if I do it by hand. Kind of embarassing though, don't you think,
> introducing an indirection to achieve better code?
>
>
> Isn’t this indirection just how your target work?
> It seems to me that this indirection just achieve exactly what you want: a
> more risc-like and less implicit representation!
>
> As far as I can tell, accessing a “label” does not require a load at the
> IR level, (and indeed it does not on most targets). It is baked in
> everywhere in the code that operates on the IR. You will fight the model
> all the time if you try to circumvent that.
> For instance the bitcast of the global label I used in my first answer is
> a compile time constant expression for LLVM.
> On the same topic, integer literals are assumed to be inlined in the code,
> we don’t expose any load in the IR to operate on a literal value.
>
> Also, you are relying on an implicit load (only on your target) for global
> labels, which I believe is not valid IR. For LLVM IR, a GEP is nothing more
> than pointer arithmetic.
> Per LangRef:
>
>     The ‘getelementptr‘ instruction is used to get the address of a
> subelement of an aggregate data structure. It performs address calculation
> only and does not access memory. The instruction can also be used to
> calculate a vector of such addresses.
>
> Note the *does not access memory* part (now, you may argue that the “load”
> you are emitting is not accessing memory exposed in the optimizer memory
> model and this does not apply).
>
> —
> Mehdi
>
>
>
>
>
> Thanks,
> Preston
>
>
>
>
>
> On Wed, Dec 9, 2015 at 9:17 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>>
>> On Dec 9, 2015, at 9:00 AM, Preston Briggs <briggs at reservoir.com> wrote:
>>
>> I suppose your view is reasonable, and perhaps common.
>> My own "taste" has always preferred machine-independent code
>> that is as simple as possible, so GEPs reduced to nothing more than an
>> add, etc, i.e., quite risc-like. Then optimize it to reduce the total
>> number
>> of operations (as best we can), then raise the level during instruction
>> selection, taking advantage of available instructions.
>>
>>
>> I’m not sure I see something related to risc-like here, it seems to me
>> that  your problem is not GEP vs ADD but rather that you want to expose a
>> mode where global addresses need to be loaded and can’t be referenced
>> directly.
>> (Unless I misunderstood the problem which is very possible as well)
>>
>> Maybe you could do this with a transformation that would put all the
>> global variable addresses in a global array and reference them through the
>> array. That’s the only workaround I could see.
>>
>> —
>> Mehdi
>>
>>
>>
>>
>> I guess my whole scheme of using opt in this context is probably wrong
>> headed.
>>
>> Thanks
>>
>>
>>
>> On Wed, Dec 9, 2015 at 8:45 AM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>> On Dec 9, 2015, at 7:58 AM, Preston Briggs <briggs at reservoir.com> wrote:
>>>
>>> I'm trying to make the IR "better", in a machine-independent fashion,
>>> without having to do any lowering.
>>>
>>>
>>> The question is “would the IR be more canonical” with the representation
>>> you suggest? Why would the optimizer benefit from this representation
>>> instead of the current one in general?
>>> Right now this GEP reads as an offset from a constant global, which
>>> seems pretty optimal to me.
>>>
>>> My impression is that when you reach a point where the “better” is
>>> target specific, this is part of the lowering (I’m using lowering in the
>>> sense that you go away from the canonical representation the optimizer
>>> expects). I believe it is pretty common that targets need to do this kind
>>> of lowering.
>>>
>>> —
>>> Mehdi
>>>
>>>
>>>
>>> I've written code that rewrites GEPs as simple adds and multiplies,
>>> which helps a lot, but there's still some sort of re-canonicalization
>>> that's getting in my way. Is there perhaps a way to suppress it?
>>>
>>>
>>> Thanks,
>>> Preston
>>>
>>>
>>> On Wed, Dec 9, 2015 at 7:47 AM, Mehdi Amini <mehdi.amini at apple.com>
>>> wrote:
>>>
>>>> I guess is has to be done as part of the lowering for such a target,
>>>> either during CodegenPrepare or during something like MachineLICM.
>>>>
>>>> —
>>>> Mehdi
>>>>
>>>>
>>>>
>>>> On Dec 9, 2015, at 7:13 AM, Preston Briggs <briggs at reservoir.com>
>>>> wrote:
>>>>
>>>> On some targets with limited addressing modes,
>>>> getting that 64-bit relocatable but loop-invariant value into a register
>>>> requires several instructions. I'd like those several instruction
>>>> outside
>>>> the loop, where they belong.
>>>>
>>>> Yes, my experience is that something (I assume instcombine)
>>>> recanonicalizes.
>>>>
>>>> Thanks,
>>>> Preston
>>>>
>>>>
>>>> On Tue, Dec 8, 2015 at 11:21 PM, Mehdi Amini <mehdi.amini at apple.com>
>>>> wrote:
>>>>
>>>>> Hi Preston,
>>>>>
>>>>> On Dec 8, 2015, at 10:56 PM, Preston Briggs via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>> When I compile two different modules using
>>>>>
>>>>> clang -O -S -emit-llvm
>>>>>
>>>>>
>>>>> I get different .ll files, no surprise.
>>>>>
>>>>> The first looks like
>>>>>
>>>>> double *v;
>>>>>
>>>>> double zap(long n) {
>>>>>   double sum = 0;
>>>>>   for (long i = 0; i < n; i++)
>>>>>     sum += v[i];
>>>>>   return sum;
>>>>> }
>>>>>
>>>>>
>>>>> yielding
>>>>>
>>>>> @v = common global double* null, align 8
>>>>>
>>>>> ; Function Attrs: nounwind readonly uwtable
>>>>> define double @zap(i64 %n) #0 {
>>>>> entry:
>>>>>   %cmp4 = icmp sgt i64 %n, 0
>>>>>   br i1 %cmp4, label %for.body.lr.ph, label %for.end
>>>>>
>>>>> for.body.lr.ph:                                   ; preds = %entry
>>>>>   %0 = load double** @v, align 8, !tbaa !1
>>>>>   br label %for.body
>>>>>
>>>>> for.body:                                         ; preds = %for.body,
>>>>> %for.body.lr.ph
>>>>>   %i.06 = phi i64 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
>>>>>   %sum.05 = phi double [ 0.000000e+00, %for.body.lr.ph ], [ %add,
>>>>> %for.body ]
>>>>>   %arrayidx = getelementptr inbounds double* %0, i64 %i.06
>>>>>   %1 = load double* %arrayidx, align 8, !tbaa !5
>>>>>   %add = fadd double %sum.05, %1
>>>>>   %inc = add nsw i64 %i.06, 1
>>>>>
>>>>> %exitcond = icmp eq i64 %inc, %n
>>>>>   br i1 %exitcond, label %for.end, label %for.body
>>>>>
>>>>> for.end:                                          ; preds = %for.body,
>>>>> %entry
>>>>>   %sum.0.lcssa = phi double [ 0.000000e+00, %entry ], [ %add,
>>>>> %for.body ]
>>>>>   ret double %sum.0.lcssa
>>>>> }
>>>>>
>>>>>
>>>>> and the second looks like
>>>>>
>>>>> double v[10000];
>>>>>
>>>>> double zap(long n) {
>>>>>   double sum = 0;
>>>>>   for (long i = 0; i < n; i++)
>>>>>     sum += v[i];
>>>>>   return sum;
>>>>> }
>>>>>
>>>>>
>>>>> yielding
>>>>>
>>>>> ; ModuleID = 'z.c'
>>>>> target datalayout =
>>>>> "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-f128:128:128-n8:16:32:64-S128"
>>>>> target triple = "x86_64-unknown-linux-gnu"
>>>>>
>>>>> @v = common global [10000 x double] zeroinitializer, align 16
>>>>>
>>>>> ; Function Attrs: nounwind readonly uwtable
>>>>> define double @zap(i64 %n) #0 {
>>>>> entry:
>>>>>   %cmp4 = icmp sgt i64 %n, 0
>>>>>   br i1 %cmp4, label %for.body, label %for.end
>>>>>
>>>>> for.body:                                         ; preds = %entry,
>>>>> %for.body
>>>>>   %i.06 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
>>>>>   %sum.05 = phi double [ %add, %for.body ], [ 0.000000e+00, %entry ]
>>>>>   %arrayidx = getelementptr inbounds [10000 x double]* @v, i64 0, i64
>>>>> %i.06
>>>>>   %0 = load double* %arrayidx, align 8, !tbaa !1
>>>>>   %add = fadd double %sum.05, %0
>>>>>   %inc = add nsw i64 %i.06, 1
>>>>>   %exitcond = icmp eq i64 %inc, %n
>>>>>   br i1 %exitcond, label %for.end, label %for.body
>>>>>
>>>>> for.end:                                          ; preds = %for.body,
>>>>> %entry
>>>>>   %sum.0.lcssa = phi double [ 0.000000e+00, %entry ], [ %add,
>>>>> %for.body ]
>>>>>   ret double %sum.0.lcssa
>>>>> }
>>>>>
>>>>> attributes #0 = { nounwind readonly uwtable
>>>>> "less-precise-fpmad"="false" "no-frame-pointer-elim"="false"
>>>>> "no-infs-fp-math"="false" "no-nans-fp-math"="false"
>>>>> "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
>>>>> "use-soft-float"="false" }
>>>>>
>>>>> !llvm.ident = !{!0}
>>>>>
>>>>> !0 = metadata !{metadata !"Clang Front-End version 3.4.1
>>>>> (tags/RELEASE_34/final)"}
>>>>> !1 = metadata !{metadata !2, metadata !2, i64 0}
>>>>> !2 = metadata !{metadata !"double", metadata !3, i64 0}
>>>>> !3 = metadata !{metadata !"omnipotent char", metadata !4, i64 0}
>>>>> !4 = metadata !{metadata !"Simple C/C++ TBAA"}
>>>>>
>>>>>
>>>>> (I included all the metadata and such for the 2nd case, on the off
>>>>> chance it matters.)
>>>>>
>>>>> Is there any way I can convince licm (or something) to rip open the
>>>>> GEP and hoist the reference to @v outside the loop, similar to the first
>>>>> example?
>>>>>
>>>>>
>>>>>
>>>>> I believe that in the second case, there is no need to load the
>>>>> address of v as it is constant. However you have a constant address to an
>>>>> array, which is represented by [10000 x double]* @v in the IR, which
>>>>> requires to use the two-level GEP.
>>>>>
>>>>> You “could” manage to represent it this way:
>>>>>
>>>>> define double @zap(i64 %n) #0 {
>>>>> entry:
>>>>>   %cmp6 = icmp sgt i64 %n, 0
>>>>>   %hoisted = bitcast [10000 x double]* @v to double*
>>>>>   br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup
>>>>>
>>>>> for.body.preheader:                               ; preds = %entry
>>>>>   br label %for.body
>>>>>
>>>>> for.cond.cleanup.loopexit:                        ; preds = %for.body
>>>>>   %add.lcssa = phi double [ %add, %for.body ]
>>>>>   br label %for.cond.cleanup
>>>>>
>>>>> for.cond.cleanup:                                 ; preds =
>>>>> %for.cond.cleanup.loopexit, %entry
>>>>>   %sum.0.lcssa = phi double [ 0.000000e+00, %entry ], [ %add.lcssa,
>>>>> %for.cond.cleanup.loopexit ]
>>>>>   ret double %sum.0.lcssa
>>>>>
>>>>> for.body:                                         ; preds =
>>>>> %for.body.preheader, %for.body
>>>>>   %i.08 = phi i64 [ %inc, %for.body ], [ 0, %for.body.preheader ]
>>>>>   %sum.07 = phi double [ %add, %for.body ], [ 0.000000e+00,
>>>>> %for.body.preheader ]
>>>>>   %arrayidx = getelementptr double, double* %hoisted, i64 %i.08
>>>>>   %0 = load double, double* %arrayidx, align 8, !tbaa !2
>>>>>   %add = fadd double %sum.07, %0
>>>>>   %inc = add nuw nsw i64 %i.08, 1
>>>>>   %exitcond = icmp eq i64 %inc, %n
>>>>>   br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
>>>>> }
>>>>>
>>>>>
>>>>> However instcombine will recanonicalize it like it was originally.
>>>>>
>>>>> Since it is a GEP that operate on a constant address, this shouldn’t
>>>>> matter, why would you want to split this?
>>>>>
>>>>> Best,
>>>>>
>>>>> —
>>>>> Mehdi
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20151209/51bfea6b/attachment-0001.html>