[llvm-dev] LoopStrengthReduction generates false code

Eli Friedman via llvm-dev llvm-dev at lists.llvm.org
Wed Jun 10 12:04:24 PDT 2020


" getelementptr i8" is a GEP over byte-size elements, so it is in fact just "@buffer + %lsr.iv".  Note we bitcast the operand to i8*, then bitcast the result from i8* to i32*.

-Eli

> -----Original Message-----
> From: Boris Boesler <baembel at gmx.de>
> Sent: Wednesday, June 10, 2020 1:59 AM
> To: Eli Friedman <efriedma at quicinc.com>
> Cc: llvm-dev at lists.llvm.org
> Subject: [EXT] Re: [llvm-dev] LoopStrengthReduction generates false code
>
> The IR after LSR is:
>
> *** IR Dump After Loop Strength Reduction ***
> ; Preheader:
> entry:
>   tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]*
> @buffer, i32 0, i32 0)) #2
>   br label %while.body
>
> ; Loop:
> while.body:                                       ; preds = %while.body, %entry
>   %lsr.iv = phi i32 [ %lsr.iv.next, %while.body ], [ 0, %entry ]
>   %uglygep = getelementptr i8, i8* bitcast ([10 x i32]* @buffer to i8*), i32
> %lsr.iv
>   %uglygep1 = bitcast i8* %uglygep to i32*
>   %0 = load i32, i32* %uglygep1, align 4, !tbaa !2
>   %cmp1 = icmp ne i32 %0, -559038737
>   %cmp11 = icmp eq i32 %lsr.iv, 0
>   %cmp = or i1 %cmp11, %cmp1
>   %lsr.iv.next = add nuw i32 %lsr.iv, 8
>   br i1 %cmp, label %while.body, label %while.end
>
> ; Exit blocks
> while.end:                                        ; preds = %while.body
>   store volatile i32 %0, i32* %result, align 4, !tbaa !2
>   ret void
>
> I guess "%uglygep = getelementptr.." will be lowered to @buffer + (%lsr.iv *
> StoreSize(i32)). That's what I see in the final code. But then %lsr.iv.next
> should be incremented by 1; BUT it is incremented by 8.
>
> Incrementing %lsr.iv.next by 8 would make sense if getelementptr were
> lowered to @buffer + %lsr.iv.
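>
> For concreteness, spelling out that second reading (with 8 being the element
> size used by LSR on this target):
>
>   %lsr.iv :  0,          8,           16,          ...
>   %uglygep:  @buffer+0,  @buffer+8,   @buffer+16,  ...
>
> i.e. it would already advance one i32 element per iteration, without any
> additional StoreSize(i32) scaling at lowering.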
>
> Thanks for your help,
> Boris
>
>
>
>
> > On 09.06.2020 at 21:56, Eli Friedman <efriedma at quicinc.com> wrote:
> >
> > Hmm.  Then I'm not sure; at first glance, the debug output looks fine either
> > way.  Could you show the IR after LSR, and explain why it's wrong?
> >
> > -Eli
> >
> >> -----Original Message-----
> >> From: Boris Boesler <baembel at gmx.de>
> >> Sent: Tuesday, June 9, 2020 11:59 AM
> >> To: Eli Friedman <efriedma at quicinc.com>
> >> Cc: llvm-dev at lists.llvm.org
> >> Subject: [EXT] Re: [llvm-dev] LoopStrengthReduction generates false code
> >>
> >> Hm, no. I expect byte addresses - everywhere. The compiler should not know
> >> that the arch needs word addresses. During lowering, LOAD and STORE get
> >> explicit conversion operations for the memory address. Even if my arch
> >> were byte addressed, the code would be incorrect/illegal.
> >>
> >> Boris
> >>
> >>> On 09.06.2020 at 19:36, Eli Friedman <efriedma at quicinc.com> wrote:
> >>>
> >>> Blindly guessing here, "memory is not byte addressed", but you never fixed
> >>> ScalarEvolution to handle that, so it's modeling the GEP in a way you're not
> >>> expecting.
> >>>
> >>> -Eli
> >>>
> >>>> -----Original Message-----
> >>>> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Boris
> >>>> Boesler via llvm-dev
> >>>> Sent: Tuesday, June 9, 2020 1:17 AM
> >>>> To: llvm-dev at lists.llvm.org
> >>>> Subject: [EXT] [llvm-dev] LoopStrengthReduction generates false code
> >>>>
> >>>> Hi.
> >>>>
> >>>> In my backend I get incorrect code after using LoopStrengthReduction. In the
> >>>> generated code the loop index variable is multiplied by 8 (correct, everything
> >>>> is 64-bit aligned) to get an address offset, and the index variable is
> >>>> incremented by 1*8, which is not correct: it should be incremented by 1
> >>>> only. The factor 8 appears again.
> >>>>
> >>>> I compared the debug output (-debug-only=scalar-evolution,loop-reduce) for
> >>>> my backend and the ARM backend, but simply can't read/understand it.
> >>>> They differ in factors 4 vs 8 (ok), but there are more differences, probably
> >>>> caused by the implementation of TargetTransformInfo for ARM, which I
> >>>> haven't implemented for my arch yet.
> >>>>
> >>>> How can I debug this further? In my arch everything is 64-bit aligned
> >>>> (factor 8 in many passes) and the memory is not byte addressed.
> >>>>
> >>>> Thanks,
> >>>> Boris
> >>>>
> >>>> ----8<----
> >>>>
> >>>> LLVM assembly:
> >>>>
> >>>> @buffer = common dso_local global [10 x i32] zeroinitializer, align 4
> >>>>
> >>>> ; Function Attrs: nounwind
> >>>> define dso_local void @some_main(i32* %result) local_unnamed_addr #0 {
> >>>> entry:
> >>>> tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]* @buffer, i32 0, i32 0)) #2
> >>>> br label %while.body
> >>>>
> >>>> while.body:                                       ; preds = %entry, %while.body
> >>>> %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
> >>>> %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
> >>>> %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
> >>>> %cmp1 = icmp ne i32 %0, -559038737
> >>>> %inc = add nuw nsw i32 %i.010, 1
> >>>> %cmp11 = icmp eq i32 %i.010, 0
> >>>> %cmp = or i1 %cmp11, %cmp1
> >>>> br i1 %cmp, label %while.body, label %while.end
> >>>>
> >>>> while.end:                                        ; preds = %while.body
> >>>> %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
> >>>> %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
> >>>> store volatile i32 %1, i32* %result, align 4, !tbaa !2
> >>>> ret void
> >>>> }
> >>>>
> >>>> declare dso_local void @fill_array(i32*) local_unnamed_addr #1
> >>>>
> >>>> attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="false"
> >>>> "disable-tail-calls"="false" "less-precise-fpmad"="false"
> >>>> "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
> >>>> "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false"
> >>>> "no-signed-zeros-fp-math"="false" "no-trapping-math"="false"
> >>>> "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
> >>>> "use-soft-float"="false" }
> >>>> attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false"
> >>>> "disable-tail-calls"="false" "less-precise-fpmad"="false"
> >>>> "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
> >>>> "no-infs-fp-math"="false" "no-nans-fp-math"="false"
> >>>> "no-signed-zeros-fp-math"="false" "no-trapping-math"="false"
> >>>> "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
> >>>> "use-soft-float"="false" }
> >>>> attributes #2 = { nounwind }
> >>>>
> >>>> !llvm.module.flags = !{!0}
> >>>> !llvm.ident = !{!1}
> >>>>
> >>>> !0 = !{i32 1, !"wchar_size", i32 4}
> >>>> !1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
> >>>> !2 = !{!3, !3, i64 0}
> >>>> !3 = !{!"int", !4, i64 0}
> >>>> !4 = !{!"omnipotent char", !5, i64 0}
> >>>> !5 = !{!"Simple C/C++ TBAA"}
> >>>>
> >>>>
> >>>> (-debug-only=scalar-evolution,loop-reduce) for my arch:
> >>>>
> >>>> LSR on loop %while.body:
> >>>> Collecting IV Chains.
> >>>> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
> >>>> IV={@buffer,+,8}<nsw><%while.body>
> >>>> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
> >>>> IV={0,+,1}<nuw><nsw><%while.body>
> >>>> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
> >>>> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
> >>>> LSR has identified the following interesting factors and types: *8
> >>>> LSR is examining the following fixup sites:
> >>>> UserInst=%cmp11, OperandValToReplace=%i.010
> >>>> UserInst=%0, OperandValToReplace=%arrayidx
> >>>> LSR found 2 uses:
> >>>> LSR is examining the following uses:
> >>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>>   reg({0,+,-1}<nw><%while.body>)
> >>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>>   reg({@buffer,+,8}<nsw><%while.body>)
> >>>>
> >>>> After generating reuse formulae:
> >>>> LSR is examining the following uses:
> >>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>>   reg({0,+,-1}<nw><%while.body>)
> >>>>   reg({0,+,8}<nuw><nsw><%while.body>)
> >>>>   reg({0,+,1}<nuw><nsw><%while.body>)
> >>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>>   reg({@buffer,+,8}<nsw><%while.body>)
> >>>>   reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
> >>>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>> Filtering out formula reg({0,+,1}<nuw><nsw><%while.body>)
> >>>>   in favor of formula reg({0,+,-1}<nw><%while.body>)
> >>>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>>
> >>>> After filtering out undesirable candidates:
> >>>> LSR is examining the following uses:
> >>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>>   reg({0,+,-1}<nw><%while.body>)
> >>>>   reg({0,+,8}<nuw><nsw><%while.body>)
> >>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>>   reg({@buffer,+,8}<nsw><%while.body>)
> >>>>   reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
> >>>> New best at 2 instructions 2 regs, with addrec cost 2.
> >>>> Regs: {0,+,-1}<nw><%while.body> {@buffer,+,8}<nsw><%while.body>
> >>>> New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
> >>>> Regs: {0,+,8}<nuw><nsw><%while.body> @buffer
> >>>>
> >>>> The chosen solution requires 2 instructions 2 regs, with addrec cost 1, plus 1 base add:
> >>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>>   reg({0,+,8}<nuw><nsw><%while.body>)
> >>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>>   reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
> >>>>
> >>>>
> >>>> (-debug-only=scalar-evolution,loop-reduce) for ARM:
> >>>>
> >>>> LSR on loop %while.body:
> >>>> Collecting IV Chains.
> >>>> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
> >>>> IV={@buffer,+,4}<nsw><%while.body>
> >>>> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
> >>>> IV={0,+,1}<nuw><nsw><%while.body>
> >>>> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
> >>>> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
> >>>> LSR has identified the following interesting factors and types: *4
> >>>> LSR is examining the following fixup sites:
> >>>> UserInst=%cmp11, OperandValToReplace=%i.010
> >>>> UserInst=%0, OperandValToReplace=%arrayidx
> >>>> LSR found 2 uses:
> >>>> LSR is examining the following uses:
> >>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>>   reg({0,+,-1}<nw><%while.body>)
> >>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>>   reg({@buffer,+,4}<nsw><%while.body>)
> >>>>
> >>>> After generating reuse formulae:
> >>>> LSR is examining the following uses:
> >>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>>   reg({0,+,-1}<nw><%while.body>)
> >>>>   reg({0,+,4}<nuw><nsw><%while.body>)
> >>>>   reg({0,+,1}<nuw><nsw><%while.body>)
> >>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>>   reg({@buffer,+,4}<nsw><%while.body>)
> >>>>   reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
> >>>>   -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
> >>>>   reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
> >>>>   reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
> >>>>   reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
> >>>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>> Filtering out formula -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
> >>>>   in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
> >>>> Filtering out formula reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
> >>>>   in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
> >>>>
> >>>> After filtering out undesirable candidates:
> >>>> LSR is examining the following uses:
> >>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>>   reg({0,+,-1}<nw><%while.body>)
> >>>>   reg({0,+,4}<nuw><nsw><%while.body>)
> >>>>   reg({0,+,1}<nuw><nsw><%while.body>)
> >>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>>   reg({@buffer,+,4}<nsw><%while.body>)
> >>>>   reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
> >>>>   reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
> >>>>   reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
> >>>> New best at 1 instruction 2 regs, with addrec cost 1.
> >>>> Regs: {0,+,-1}<nw><%while.body> @buffer
> >>>>
> >>>> The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
> >>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> >>>>   reg({0,+,-1}<nw><%while.body>)
> >>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
> >>>>   reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
> >>>>
> >>>> _______________________________________________
> >>>> LLVM Developers mailing list
> >>>> llvm-dev at lists.llvm.org
> >>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >


