[llvm-dev] LoopStrengthReduction generates false code
Boris Boesler via llvm-dev
llvm-dev at lists.llvm.org
Wed Jun 10 01:58:54 PDT 2020
The IR after LSR is:
*** IR Dump After Loop Strength Reduction ***
; Preheader:
entry:
tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]* @buffer, i32 0, i32 0)) #2
br label %while.body
; Loop:
while.body: ; preds = %while.body, %entry
%lsr.iv = phi i32 [ %lsr.iv.next, %while.body ], [ 0, %entry ]
%uglygep = getelementptr i8, i8* bitcast ([10 x i32]* @buffer to i8*), i32 %lsr.iv
%uglygep1 = bitcast i8* %uglygep to i32*
%0 = load i32, i32* %uglygep1, align 4, !tbaa !2
%cmp1 = icmp ne i32 %0, -559038737
%cmp11 = icmp eq i32 %lsr.iv, 0
%cmp = or i1 %cmp11, %cmp1
%lsr.iv.next = add nuw i32 %lsr.iv, 8
br i1 %cmp, label %while.body, label %while.end
; Exit blocks
while.end: ; preds = %while.body
store volatile i32 %0, i32* %result, align 4, !tbaa !2
ret void
I guess "%uglygep = getelementptr.." will be lowered to @buffer + (%lsr.iv * StoreSize(i32)). That's what I see in the final code. But then %lsr.iv.next should be incremented by 1; BUT it is incremented by 8.
Incrementing %lsr.iv.next by 8 would make sense if getelementptr were lowered to @buffer + %lsr.iv.
Thanks for your help,
Boris
> Am 09.06.2020 um 21:56 schrieb Eli Friedman <efriedma at quicinc.com>:
>
> Hmm. Then I'm not sure; at first glance, the debug output looks fine either way. Could you show the IR after LSR, and explain why it's wrong?
>
> -Eli
>
>> -----Original Message-----
>> From: Boris Boesler <baembel at gmx.de>
>> Sent: Tuesday, June 9, 2020 11:59 AM
>> To: Eli Friedman <efriedma at quicinc.com>
>> Cc: llvm-dev at lists.llvm.org
>> Subject: [EXT] Re: [llvm-dev] LoopStrengthReduction generates false code
>>
>> Hm, no. I expect byte addresses - everywhere. The compiler should not know
>> that the arch needs word addresses. During lowering LOAD and STORE get
>> explicit conversion operations for the memory address. Even if my arch was
>> byte addressed the code would be false/illegal.
>>
>> Boris
>>
>>> Am 09.06.2020 um 19:36 schrieb Eli Friedman <efriedma at quicinc.com>:
>>>
>>> Blindly guessing here, "memory is not byte addressed", but you never fixed
>> ScalarEvolution to handle that, so it's modeling the GEP in a way you're not
>> expecting.
>>>
>>> -Eli
>>>
>>>> -----Original Message-----
>>>> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Boris
>> Boesler
>>>> via llvm-dev
>>>> Sent: Tuesday, June 9, 2020 1:17 AM
>>>> To: llvm-dev at lists.llvm.org
>>>> Subject: [EXT] [llvm-dev] LoopStrengthReduction generates false code
>>>>
>>>> Hi.
>>>>
>>>> In my backend I get false code after using StrengthLoopReduction. In the
>>>> generated code the loop index variable is multiplied by 8 (correct,
>> everything
>>>> is 64 bit aligned) to get an address offset, and the index variable is
>>>> incremented by 1*8, which is not correct. It should be incremented by 1
>>>> only. The factor 8 appears again.
>>>>
>>>> I compared the debug output (-debug-only=scalar-evolution,loop-reduce)
>> for
>>>> my backend and the ARM backend, but simply can't read/understand it.
>>>> They differ in factors 4 vs 8 (ok), but there are more differences, probably
>>>> caused by the implementation of TargetTransformInfo for ARM, while I
>>>> haven't implemented it for my arch, yet.
>>>>
>>>> How can I debug this further? In my arch everything is 64 bit aligned
>> (factor 8
>>>> in many passes) and the memory is not byte addressed.
>>>>
>>>> Thanks,
>>>> Boris
>>>>
>>>> ----8<----
>>>>
>>>> LLVM assembly:
>>>>
>>>> @buffer = common dso_local global [10 x i32] zeroinitializer, align 4
>>>>
>>>> ; Function Attrs: nounwind
>>>> define dso_local void @some_main(i32* %result) local_unnamed_addr #0
>> {
>>>> entry:
>>>> tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x
>> i32]*
>>>> @buffer, i32 0, i32 0)) #2
>>>> br label %while.body
>>>>
>>>> while.body: ; preds = %entry, %while.body
>>>> %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
>>>> %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0,
>>>> i32 %i.010
>>>> %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
>>>> %cmp1 = icmp ne i32 %0, -559038737
>>>> %inc = add nuw nsw i32 %i.010, 1
>>>> %cmp11 = icmp eq i32 %i.010, 0
>>>> %cmp = or i1 %cmp11, %cmp1
>>>> br i1 %cmp, label %while.body, label %while.end
>>>>
>>>> while.end: ; preds = %while.body
>>>> %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32
>> 0,
>>>> i32 %i.010
>>>> %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
>>>> store volatile i32 %1, i32* %result, align 4, !tbaa !2
>>>> ret void
>>>> }
>>>>
>>>> declare dso_local void @fill_array(i32*) local_unnamed_addr #1
>>>>
>>>> attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-
>> math"="false"
>>>> "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-
>> pointer-
>>>> elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false"
>>>> "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-
>>>> math"="false" "no-trapping-math"="false" "stack-protector-buffer-
>> size"="8"
>>>> "unsafe-fp-math"="false" "use-soft-float"="false" }
>>>> attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-
>>>> tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-
>>>> elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false"
>>>> "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-
>> trapping-
>>>> math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
>>>> "use-soft-float"="false" }
>>>> attributes #2 = { nounwind }
>>>>
>>>> !llvm.module.flags = !{!0}
>>>> !llvm.ident = !{!1}
>>>>
>>>> !0 = !{i32 1, !"wchar_size", i32 4}
>>>> !1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
>>>> !2 = !{!3, !3, i64 0}
>>>> !3 = !{!"int", !4, i64 0}
>>>> !4 = !{!"omnipotent char", !5, i64 0}
>>>> !5 = !{!"Simple C/C++ TBAA"}
>>>>
>>>>
>>>> (-debug-only=scalar-evolution,loop-reduce) for my arch:
>>>>
>>>> LSR on loop %while.body:
>>>> Collecting IV Chains.
>>>> IV Chain#0 Head: ( %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
>>>> IV={@buffer,+,8}<nsw><%while.body>
>>>> IV Chain#1 Head: ( %cmp11 = icmp eq i32 %i.010, 0)
>>>> IV={0,+,1}<nuw><nsw><%while.body>
>>>> IV Chain#1 Inc: ( %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
>>>> Chain: %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>>>> LSR has identified the following interesting factors and types: *8
>>>> LSR is examining the following fixup sites:
>>>> UserInst=%cmp11, OperandValToReplace=%i.010
>>>> UserInst=%0, OperandValToReplace=%arrayidx
>>>> LSR found 2 uses:
>>>> LSR is examining the following uses:
>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>> reg({0,+,-1}<nw><%while.body>)
>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
>> type:
>>>> i32*
>>>> reg({@buffer,+,8}<nsw><%while.body>)
>>>>
>>>> After generating reuse formulae:
>>>> LSR is examining the following uses:
>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>> reg({0,+,-1}<nw><%while.body>)
>>>> reg({0,+,8}<nuw><nsw><%while.body>)
>>>> reg({0,+,1}<nuw><nsw><%while.body>)
>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
>> type:
>>>> i32*
>>>> reg({@buffer,+,8}<nsw><%while.body>)
>>>> reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>>>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type:
>> i32
>>>> Filtering out formula reg({0,+,1}<nuw><nsw><%while.body>)
>>>> in favor of formula reg({0,+,-1}<nw><%while.body>)
>>>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0},
>>>> widest fixup type: i32*
>>>>
>>>> After filtering out undesirable candidates:
>>>> LSR is examining the following uses:
>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>> reg({0,+,-1}<nw><%while.body>)
>>>> reg({0,+,8}<nuw><nsw><%while.body>)
>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
>> type:
>>>> i32*
>>>> reg({@buffer,+,8}<nsw><%while.body>)
>>>> reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>>>> New best at 2 instructions 2 regs, with addrec cost 2.
>>>> Regs: {0,+,-1}<nw><%while.body> {@buffer,+,8}<nsw><%while.body>
>>>> New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
>>>> Regs: {0,+,8}<nuw><nsw><%while.body> @buffer
>>>>
>>>> The chosen solution requires 2 instructions 2 regs, with addrec cost 1,
>> plus 1
>>>> base add:
>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>> reg({0,+,8}<nuw><nsw><%while.body>)
>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
>> type:
>>>> i32*
>>>> reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>>>>
>>>>
>>>> (-debug-only=scalar-evolution,loop-reduce) for ARM:
>>>>
>>>> LSR on loop %while.body:
>>>> Collecting IV Chains.
>>>> IV Chain#0 Head: ( %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
>>>> IV={@buffer,+,4}<nsw><%while.body>
>>>> IV Chain#1 Head: ( %cmp11 = icmp eq i32 %i.010, 0)
>>>> IV={0,+,1}<nuw><nsw><%while.body>
>>>> IV Chain#1 Inc: ( %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
>>>> Chain: %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>>>> LSR has identified the following interesting factors and types: *4
>>>> LSR is examining the following fixup sites:
>>>> UserInst=%cmp11, OperandValToReplace=%i.010
>>>> UserInst=%0, OperandValToReplace=%arrayidx
>>>> LSR found 2 uses:
>>>> LSR is examining the following uses:
>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>> reg({0,+,-1}<nw><%while.body>)
>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
>> type:
>>>> i32*
>>>> reg({@buffer,+,4}<nsw><%while.body>)
>>>>
>>>> After generating reuse formulae:
>>>> LSR is examining the following uses:
>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>> reg({0,+,-1}<nw><%while.body>)
>>>> reg({0,+,4}<nuw><nsw><%while.body>)
>>>> reg({0,+,1}<nuw><nsw><%while.body>)
>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
>> type:
>>>> i32*
>>>> reg({@buffer,+,4}<nsw><%while.body>)
>>>> reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>>>> -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>>>> reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>>>> reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>>> reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>>>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type:
>> i32
>>>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0},
>>>> widest fixup type: i32*
>>>> Filtering out formula -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>>>> in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>>>> Filtering out formula reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>>>> in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>>>>
>>>> After filtering out undesirable candidates:
>>>> LSR is examining the following uses:
>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>> reg({0,+,-1}<nw><%while.body>)
>>>> reg({0,+,4}<nuw><nsw><%while.body>)
>>>> reg({0,+,1}<nuw><nsw><%while.body>)
>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
>> type:
>>>> i32*
>>>> reg({@buffer,+,4}<nsw><%while.body>)
>>>> reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>>>> reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>>> reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>>>> New best at 1 instruction 2 regs, with addrec cost 1.
>>>> Regs: {0,+,-1}<nw><%while.body> @buffer
>>>>
>>>> The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>> reg({0,+,-1}<nw><%while.body>)
>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
>> type:
>>>> i32*
>>>> reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
More information about the llvm-dev
mailing list