[llvm-dev] LoopStrengthReduction generates false code

Tue Jun 9 11:59:09 PDT 2020

Hm, no. I expect byte addresses everywhere. The compiler should not know that the architecture needs word addresses; during lowering, LOAD and STORE get explicit conversion operations for the memory address. Even if my architecture were byte-addressed, the generated code would be incorrect.
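
A minimal sketch of the scheme described above, for illustration only: all address arithmetic stays in bytes, and a conversion to a word address happens only where a LOAD or STORE is lowered. The 8-byte word size and the helper name `byte_to_word_address` are assumptions, not taken from the backend.

```python
WORD_BYTES = 8  # assumption: one memory word holds 64 bits


def byte_to_word_address(byte_addr: int) -> int:
    """Convert a byte address to a word address at the point of a memory access."""
    if byte_addr % WORD_BYTES != 0:
        raise ValueError("byte address is not word aligned")
    return byte_addr // WORD_BYTES


# A byte stride of 8 between elements becomes a word stride of 1 after conversion.
byte_addrs = [0x1000 + 8 * i for i in range(4)]
word_addrs = [byte_to_word_address(a) for a in byte_addrs]
```

Under these assumptions the optimizer only ever sees byte offsets, and the division happens once per memory access.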

Boris

> On 09.06.2020, at 19:36, Eli Friedman <efriedma at quicinc.com> wrote:
> 
> Blindly guessing here: you say "memory is not byte addressed", but you never fixed ScalarEvolution to handle that, so it's modeling the GEP in a way you're not expecting.
> 
> -Eli
> 
>> -----Original Message-----
>> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Boris Boesler
>> via llvm-dev
>> Sent: Tuesday, June 9, 2020 1:17 AM
>> To: llvm-dev at lists.llvm.org
>> Subject: [EXT] [llvm-dev] LoopStrengthReduction generates false code
>> 
>> Hi.
>> 
>> In my backend I get incorrect code after running LoopStrengthReduction. In the
>> generated code the loop index variable is multiplied by 8 (correct, everything
>> is 64 bit aligned) to get an address offset, and the index variable is
>> incremented by 1*8, which is not correct. It should be incremented by 1
>> only. The factor 8 is applied a second time.
>> 
>> I compared the debug output (-debug-only=scalar-evolution,loop-reduce) for
>> my backend and the ARM backend, but simply can't read/understand it.
>> They differ in factors 4 vs 8 (ok), but there are more differences, probably
>> caused by the implementation of TargetTransformInfo for ARM, while I
>> haven't implemented it for my arch, yet.
>> 
>> How can I debug this further? In my arch everything is 64 bit aligned (factor 8
>> in many passes) and the memory is not byte addressed.
>> 
>> Thanks,
>> Boris
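
The factor 8 versus 4 in the two logs is what one would expect if the data layout gives i32 a 64-bit ABI alignment: LLVM strides an array by the element's allocation size, which is the store size rounded up to the ABI alignment. A minimal sketch of that arithmetic follows. It assumes (not confirmed in this thread) that the backend's data layout declares i32 with 64-bit alignment; `alloc_size_bytes` is a hypothetical helper, not an LLVM API.

```python
def alloc_size_bytes(size_bits: int, abi_align_bits: int) -> int:
    """Store size in bytes, rounded up to a multiple of the ABI alignment."""
    store_bytes = (size_bits + 7) // 8
    align_bytes = abi_align_bits // 8
    # Round store_bytes up to the next multiple of align_bytes.
    return ((store_bytes + align_bytes - 1) // align_bytes) * align_bytes


# Assumed layout for the custom backend: i32 aligned to 64 bits.
stride_custom = alloc_size_bytes(32, 64)
# ARM: i32 aligned to 32 bits.
stride_arm = alloc_size_bytes(32, 32)
```

With these inputs the per-element stride is 8 bytes for the assumed layout and 4 bytes for ARM, matching the `+,8` and `+,4` steps in the logs.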
>> 
>> ----8<----
>> 
>> LLVM assembly:
>> 
>> @buffer = common dso_local global [10 x i32] zeroinitializer, align 4
>> 
>> ; Function Attrs: nounwind
>> define dso_local void @some_main(i32* %result) local_unnamed_addr #0 {
>> entry:
>>  tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]* @buffer, i32 0, i32 0)) #2
>>  br label %while.body
>> 
>> while.body:                                       ; preds = %entry, %while.body
>>  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
>>  %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
>>  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
>>  %cmp1 = icmp ne i32 %0, -559038737
>>  %inc = add nuw nsw i32 %i.010, 1
>>  %cmp11 = icmp eq i32 %i.010, 0
>>  %cmp = or i1 %cmp11, %cmp1
>>  br i1 %cmp, label %while.body, label %while.end
>> 
>> while.end:                                        ; preds = %while.body
>>  %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
>>  %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
>>  store volatile i32 %1, i32* %result, align 4, !tbaa !2
>>  ret void
>> }
>> 
>> declare dso_local void @fill_array(i32*) local_unnamed_addr #1
>> 
>> attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
>> attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
>> attributes #2 = { nounwind }
>> 
>> !llvm.module.flags = !{!0}
>> !llvm.ident = !{!1}
>> 
>> !0 = !{i32 1, !"wchar_size", i32 4}
>> !1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
>> !2 = !{!3, !3, i64 0}
>> !3 = !{!"int", !4, i64 0}
>> !4 = !{!"omnipotent char", !5, i64 0}
>> !5 = !{!"Simple C/C++ TBAA"}
>> 
>> 
>> (-debug-only=scalar-evolution,loop-reduce) for my arch:
>> 
>> LSR on loop %while.body:
>> Collecting IV Chains.
>> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
>> IV={@buffer,+,8}<nsw><%while.body>
>> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
>> IV={0,+,1}<nuw><nsw><%while.body>
>> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
>> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>> LSR has identified the following interesting factors and types: *8
>> LSR is examining the following fixup sites:
>>  UserInst=%cmp11, OperandValToReplace=%i.010
>>  UserInst=%0, OperandValToReplace=%arrayidx
>> LSR found 2 uses:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
>> i32*
>>    reg({@buffer,+,8}<nsw><%while.body>)
>> 
>> After generating reuse formulae:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,8}<nuw><nsw><%while.body>)
>>    reg({0,+,1}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
>> i32*
>>    reg({@buffer,+,8}<nsw><%while.body>)
>>    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>  Filtering out formula reg({0,+,1}<nuw><nsw><%while.body>)
>>    in favor of formula reg({0,+,-1}<nw><%while.body>)
>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0},
>> widest fixup type: i32*
>> 
>> After filtering out undesirable candidates:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,8}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
>> i32*
>>    reg({@buffer,+,8}<nsw><%while.body>)
>>    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>> New best at 2 instructions 2 regs, with addrec cost 2.
>> Regs: {0,+,-1}<nw><%while.body> {@buffer,+,8}<nsw><%while.body>
>> New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
>> Regs: {0,+,8}<nuw><nsw><%while.body> @buffer
>> 
>> The chosen solution requires 2 instructions 2 regs, with addrec cost 1, plus 1
>> base add:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,8}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
>> i32*
>>    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>> 
>> 
>> (-debug-only=scalar-evolution,loop-reduce) for ARM:
>> 
>> LSR on loop %while.body:
>> Collecting IV Chains.
>> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
>> IV={@buffer,+,4}<nsw><%while.body>
>> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
>> IV={0,+,1}<nuw><nsw><%while.body>
>> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
>> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>> LSR has identified the following interesting factors and types: *4
>> LSR is examining the following fixup sites:
>>  UserInst=%cmp11, OperandValToReplace=%i.010
>>  UserInst=%0, OperandValToReplace=%arrayidx
>> LSR found 2 uses:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
>> i32*
>>    reg({@buffer,+,4}<nsw><%while.body>)
>> 
>> After generating reuse formulae:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,4}<nuw><nsw><%while.body>)
>>    reg({0,+,1}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
>> i32*
>>    reg({@buffer,+,4}<nsw><%while.body>)
>>    reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>>    -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>>    reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>>    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>    reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0},
>> widest fixup type: i32*
>>  Filtering out formula -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>>    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>>  Filtering out formula reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>>    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>> 
>> After filtering out undesirable candidates:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,4}<nuw><nsw><%while.body>)
>>    reg({0,+,1}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
>> i32*
>>    reg({@buffer,+,4}<nsw><%while.body>)
>>    reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>>    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>    reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>> New best at 1 instruction 2 regs, with addrec cost 1.
>> Regs: {0,+,-1}<nw><%while.body> @buffer
>> 
>> The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
>> i32*
>>    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
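
For reading the logs above: ScalarEvolution prints an affine add recurrence as {start,+,step}, meaning the value is start on the first iteration and grows by step on each later one. An LSR formula such as reg(A) + k*reg(B) is a base register plus a scaled register. The sketch below evaluates the two chosen solutions on that reading. It ignores 32-bit wraparound, and the base address `BUFFER` and the function names are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddRec:
    """Affine add recurrence {start,+,step}."""
    start: int
    step: int

    def at(self, n: int) -> int:
        # Value on loop iteration n (n = 0 is the first iteration).
        return self.start + self.step * n


BUFFER = 0x1000  # hypothetical byte address standing in for @buffer


def address_custom(n: int) -> int:
    # Chosen for the custom backend: reg(@buffer) + 1*reg({0,+,8})
    return BUFFER + 1 * AddRec(0, 8).at(n)


def address_arm(n: int) -> int:
    # Chosen for ARM: reg(@buffer) + -4*reg({0,+,-1})
    return BUFFER + -4 * AddRec(0, -1).at(n)


def icmp_zero_custom(n: int) -> bool:
    # ICmpZero use on {0,+,8}: the scaled value is compared against zero.
    return AddRec(0, 8).at(n) == 0
```

On this reading, the custom-backend solution keeps a single register that steps by 8 and serves as the byte offset directly, with a scale of 1 in the address formula. The zero test on that register agrees with `%i.010 == 0` as long as nothing wraps. ARM instead keeps a register stepping by -1 and applies the factor -4 in the address. The sketch only restates what the logs print; it does not show what the instruction selector later does with these formulae.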
>> 