[llvm-dev] LoopStrengthReduction generates false code

Tue Jun 9 10:36:59 PDT 2020

Blindly guessing here, "memory is not byte addressed", but you never fixed ScalarEvolution to handle that, so it's modeling the GEP in a way you're not expecting.

-Eli

> -----Original Message-----
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Boris Boesler
> via llvm-dev
> Sent: Tuesday, June 9, 2020 1:17 AM
> To: llvm-dev at lists.llvm.org
> Subject: [EXT] [llvm-dev] LoopStrengthReduction generates false code
>
> Hi.
>
> In my backend I get false code after using StrengthLoopReduction. In the
> generated code the loop index variable is multiplied by 8 (correct, everything
> is 64 bit aligned) to get an address offset, and the index variable is
> incremented by 1*8, which is not correct. It should be incremented by 1
> only. The factor 8 appears again.
>
> I compared the debug output (-debug-only=scalar-evolution,loop-reduce) for
> my backend and the ARM backend, but simply can't read/understand it.
> They differ in factors 4 vs 8 (ok), but there are more differences, probably
> caused by the implementation of TargetTransformInfo for ARM, while I
> haven't implemented it for my arch, yet.
>
> How can I debug this further? In my arch everything is 64 bit aligned (factor 8
> in many passes) and the memory is not byte addressed.
>
> Thanks,
> Boris
>
> ----8<----
>
> LLVM assembly:
>
> @buffer = common dso_local global [10 x i32] zeroinitializer, align 4
>
> ; Function Attrs: nounwind
> define dso_local void @some_main(i32* %result) local_unnamed_addr #0 {
> entry:
>   tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]*
> @buffer, i32 0, i32 0)) #2
>   br label %while.body
>
> while.body:                                       ; preds = %entry, %while.body
>   %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
>   %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0,
> i32 %i.010
>   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
>   %cmp1 = icmp ne i32 %0, -559038737
>   %inc = add nuw nsw i32 %i.010, 1
>   %cmp11 = icmp eq i32 %i.010, 0
>   %cmp = or i1 %cmp11, %cmp1
>   br i1 %cmp, label %while.body, label %while.end
>
> while.end:                                        ; preds = %while.body
>   %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0,
> i32 %i.010
>   %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
>   store volatile i32 %1, i32* %result, align 4, !tbaa !2
>   ret void
> }
>
> declare dso_local void @fill_array(i32*) local_unnamed_addr #1
>
> attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="false"
> "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-
> elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false"
> "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-
> math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8"
> "unsafe-fp-math"="false" "use-soft-float"="false" }
> attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-
> tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-
> elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false"
> "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-
> math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
> "use-soft-float"="false" }
> attributes #2 = { nounwind }
>
> !llvm.module.flags = !{!0}
> !llvm.ident = !{!1}
>
> !0 = !{i32 1, !"wchar_size", i32 4}
> !1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
> !2 = !{!3, !3, i64 0}
> !3 = !{!"int", !4, i64 0}
> !4 = !{!"omnipotent char", !5, i64 0}
> !5 = !{!"Simple C/C++ TBAA"}
>
>
> (-debug-only=scalar-evolution,loop-reduce) for my arch:
>
> LSR on loop %while.body:
> Collecting IV Chains.
> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
> IV={@buffer,+,8}<nsw><%while.body>
> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
> IV={0,+,1}<nuw><nsw><%while.body>
> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
> LSR has identified the following interesting factors and types: *8
> LSR is examining the following fixup sites:
>   UserInst=%cmp11, OperandValToReplace=%i.010
>   UserInst=%0, OperandValToReplace=%arrayidx
> LSR found 2 uses:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
> i32*
>     reg({@buffer,+,8}<nsw><%while.body>)
>
> After generating reuse formulae:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>     reg({0,+,8}<nuw><nsw><%while.body>)
>     reg({0,+,1}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
> i32*
>     reg({@buffer,+,8}<nsw><%while.body>)
>     reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>   Filtering out formula reg({0,+,1}<nuw><nsw><%while.body>)
>     in favor of formula reg({0,+,-1}<nw><%while.body>)
> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0},
> widest fixup type: i32*
>
> After filtering out undesirable candidates:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>     reg({0,+,8}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
> i32*
>     reg({@buffer,+,8}<nsw><%while.body>)
>     reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
> New best at 2 instructions 2 regs, with addrec cost 2.
>  Regs: {0,+,-1}<nw><%while.body> {@buffer,+,8}<nsw><%while.body>
> New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
>  Regs: {0,+,8}<nuw><nsw><%while.body> @buffer
>
> The chosen solution requires 2 instructions 2 regs, with addrec cost 1, plus 1
> base add:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,8}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
> i32*
>     reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>
>
> (-debug-only=scalar-evolution,loop-reduce) for ARM:
>
> LSR on loop %while.body:
> Collecting IV Chains.
> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
> IV={@buffer,+,4}<nsw><%while.body>
> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
> IV={0,+,1}<nuw><nsw><%while.body>
> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
> LSR has identified the following interesting factors and types: *4
> LSR is examining the following fixup sites:
>   UserInst=%cmp11, OperandValToReplace=%i.010
>   UserInst=%0, OperandValToReplace=%arrayidx
> LSR found 2 uses:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
> i32*
>     reg({@buffer,+,4}<nsw><%while.body>)
>
> After generating reuse formulae:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>     reg({0,+,4}<nuw><nsw><%while.body>)
>     reg({0,+,1}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
> i32*
>     reg({@buffer,+,4}<nsw><%while.body>)
>     reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>     -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>     reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>     reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>     reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0},
> widest fixup type: i32*
>   Filtering out formula -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>     in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>   Filtering out formula reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>     in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>
> After filtering out undesirable candidates:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>     reg({0,+,4}<nuw><nsw><%while.body>)
>     reg({0,+,1}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
> i32*
>     reg({@buffer,+,4}<nsw><%while.body>)
>     reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>     reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>     reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
> New best at 1 instruction 2 regs, with addrec cost 1.
>  Regs: {0,+,-1}<nw><%while.body> @buffer
>
> The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
> i32*
>     reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev