[llvm-dev] LoopStrengthReduction generates false code

Tue Jun 9 01:17:20 PDT 2020

Hi.

In my backend I get false code after using StrengthLoopReduction. In the generated code the loop index variable is multiplied by 8 (correct, everything is 64 bit aligned) to get an address offset, and the index variable is incremented by 1*8, which is not correct. It should be incremented by 1 only. The factor 8 appears again.

I compared the debug output (-debug-only=scalar-evolution,loop-reduce) for my backend and the ARM backend, but simply can't read/understand it. They differ in factors 4 vs 8 (ok), but there are more differences, probably caused by the implementation of TargetTransformInfo for ARM, while I haven't implemented it for my arch, yet.

How can I debug this further? In my arch everything is 64 bit aligned (factor 8 in many passes) and the memory is not byte addressed.

Thanks,
Boris

----8<----

LLVM assembly:

@buffer = common dso_local global [10 x i32] zeroinitializer, align 4

; Function Attrs: nounwind
define dso_local void @some_main(i32* %result) local_unnamed_addr #0 {
entry:
  tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]* @buffer, i32 0, i32 0)) #2
  br label %while.body

while.body:                                       ; preds = %entry, %while.body
  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
  %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
  %cmp1 = icmp ne i32 %0, -559038737
  %inc = add nuw nsw i32 %i.010, 1
  %cmp11 = icmp eq i32 %i.010, 0
  %cmp = or i1 %cmp11, %cmp1
  br i1 %cmp, label %while.body, label %while.end

while.end:                                        ; preds = %while.body
  %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
  %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
  store volatile i32 %1, i32* %result, align 4, !tbaa !2
  ret void
}

declare dso_local void @fill_array(i32*) local_unnamed_addr #1

attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #2 = { nounwind }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
!2 = !{!3, !3, i64 0}
!3 = !{!"int", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C/C++ TBAA"}

(-debug-only=scalar-evolution,loop-reduce) for my arch:

LSR on loop %while.body:
Collecting IV Chains.
IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2) IV={@buffer,+,8}<nsw><%while.body>
IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0) IV={0,+,1}<nuw><nsw><%while.body>
IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
LSR has identified the following interesting factors and types: *8
LSR is examining the following fixup sites:
  UserInst=%cmp11, OperandValToReplace=%i.010
  UserInst=%0, OperandValToReplace=%arrayidx
LSR found 2 uses:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
    reg({@buffer,+,8}<nsw><%while.body>)

After generating reuse formulae:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
    reg({0,+,8}<nuw><nsw><%while.body>)
    reg({0,+,1}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
    reg({@buffer,+,8}<nsw><%while.body>)
    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
  Filtering out formula reg({0,+,1}<nuw><nsw><%while.body>)
    in favor of formula reg({0,+,-1}<nw><%while.body>)
Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*

After filtering out undesirable candidates:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
    reg({0,+,8}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
    reg({@buffer,+,8}<nsw><%while.body>)
    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
New best at 2 instructions 2 regs, with addrec cost 2.
 Regs: {0,+,-1}<nw><%while.body> {@buffer,+,8}<nsw><%while.body>
New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
 Regs: {0,+,8}<nuw><nsw><%while.body> @buffer

The chosen solution requires 2 instructions 2 regs, with addrec cost 1, plus 1 base add:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,8}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)

(-debug-only=scalar-evolution,loop-reduce) for ARM:

LSR on loop %while.body:
Collecting IV Chains.
IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2) IV={@buffer,+,4}<nsw><%while.body>
IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0) IV={0,+,1}<nuw><nsw><%while.body>
IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
LSR has identified the following interesting factors and types: *4
LSR is examining the following fixup sites:
  UserInst=%cmp11, OperandValToReplace=%i.010
  UserInst=%0, OperandValToReplace=%arrayidx
LSR found 2 uses:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
    reg({@buffer,+,4}<nsw><%while.body>)

After generating reuse formulae:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
    reg({0,+,4}<nuw><nsw><%while.body>)
    reg({0,+,1}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
    reg({@buffer,+,4}<nsw><%while.body>)
    reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
    -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
    reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
    reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
  Filtering out formula -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
  Filtering out formula reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)

After filtering out undesirable candidates:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
    reg({0,+,4}<nuw><nsw><%while.body>)
    reg({0,+,1}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
    reg({@buffer,+,4}<nsw><%while.body>)
    reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
    reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
New best at 1 instruction 2 regs, with addrec cost 1.
 Regs: {0,+,-1}<nw><%while.body> @buffer

The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)