[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call

Thu Apr 9 15:52:44 PDT 2020

Hi John,
Thanks for the suggestions! I will start looking at adding
foldMemoryOperand to ARMInstrInfo.

Thanks,
Prathamesh

On Tue, 7 Apr 2020 at 23:55, John Brawn <John.Brawn at arm.com> wrote:
>
> If I'm understanding what's going on in this test correctly, what's happening is:
>  * ARMTargetLowering::LowerCall prefers indirect calls when a function is called at least 3 times in minsize
>  * In thumb 1 (without -fno-omit-frame-pointer) we have effectively only 3 callee-saved registers (r4-r6)
>  * The function has three arguments, so those three plus the register we need to hold the function address is more than our callee-saved registers
>  * Therefore something needs to be spilt
>  * The function address can be rematerialized, so we spill that and insert and LDR before each call
>
> If we didn't have this spilling happening (e.g. if the function had one less argument) then the code size of using BL vs BLX
>  * BL: 3*4-byte BL = 12 bytes
>  * BX: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool = 12 bytes
> (So maybe even not considering spilling, LowerCall should be adjusted to do this for functions called 4 or more times)
>
> When we have to spill, if we compare spilling the functions address vs spilling an argument:
>  * BX with spilt fn: 3*2-byte BX + 3*2-byte LDR + 4-byte litpool = 16 bytes
>  * BX with spilt arg: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool + 1*2-byte STR + 2*2-byte LDR = 18 bytes
> So just changing the spilling heuristic won't work.
>
> The two ways I see of fixing this:
>  * In LowerCall only prefer an indirect call if the number of integer register arguments is less than the number of callee-saved registers.
>  * When the load of the function address is spilled, instead of just rematerializing the load instead convert the BX back into BL.
>
> The first of these would be easier, but there will be situations where we need to use less than three callee-saved registers (e.g. arguments are loaded from a pointer) and there are situations where we will spill the function address for reasons entirely unrelated to the function arguments (e.g. if we have enough live local variables).
>
> For the second, looking at InlineSpiller.cpp it does have the concept of rematerializing by folding a memory operand into another instruction, so I think we could make use of that to do this. It looks like it would involve adding a foldMemoryOperand function to ARMInstrInfo and then have this fold a LDR into a BX by turning it into a BL.
>
> John
>
> ________________________________
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Prathamesh Kulkarni via llvm-dev <llvm-dev at lists.llvm.org>
> Sent: 07 April 2020 21:07
> To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
>
> On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni
> <prathamesh.kulkarni at linaro.org> wrote:
> >
> > Hi,
> > Compiling attached test-case, which is reduced version of of
> > uECC_shared_secret from tinycrypt library [1], with
> > --target=arm-linux-gnueabi -march=armv6-m -Oz -S
> > results in reloading of register holding function's address before
> > every call to blx:
> >
> >         ldr       r3, .LCPI0_0
> >         blx      r3
> >         mov    r0, r6
> >         mov    r1, r5
> >         mov    r2, r4
> >         ldr       r3, .LCPI0_0
> >         blx       r3
> >         ldr        r3, .LCPI0_0
> >         mov     r0, r6
> >         mov     r1, r5
> >         mov     r2, r4
> >         blx       r3
> >
> > .LCPI0_0:
> >         .long   foo
> >
> > From dump of regalloc (attached), AFAIU, what seems to happen during
> > greedy allocator is, all virt regs %0 to %3 are live across first two
> > calls to foo. Thus %0, %1 and %2 get assigned r6, r5 and r4
> > respectively, and %3 which holds foo's address doesn't have any
> > register left.
> > Since it's live-range has least weight, it does not evict any existing interval,
> > and gets split. Eventually we have the following allocation:
> >
> > [%0 -> $r6] tGPR
> > [%1 -> $r5] tGPR
> > [%2 -> $r4] tGPR
> > [%6 -> $r3] tGPR
> > [%11 -> $r3] tGPR
> > [%16 -> $r3] tGPR
> > [%17 -> $r3] tGPR
> >
> > where %6, %11, %16 and %17 all are derived from %3.
> > And since r3 is a call-clobbered register, the compiler is forced to
> > reload foo's address
> > each time before blx.
> >
> > To fix this, I thought of following approaches:
> > (a) Disable the heuristic to prefer indirect call when there are at
> > least 3 calls to
> > same function in basic block in ARMTargetLowering::LowerCall for Thumb-1 ISA.
> >
> > (b) In ARMTargetLowering::LowerCall, put another constraint like
> > number of arguments, as a proxy for register pressure for Thumb-1, but
> > that's bound to trip another cases.
> >
> > (c) Give higher priority to allocate vrit reg used for indirect calls
> > ? However, if that
> > results in spilling of some other register, it would defeat the
> > purpose of saving code-size. I suppose ideally we want to trigger the
> > heuristic of using indirect call only when we know beforehand that it
> > will not result in spilling. But I am not sure if it's possible to
> > estimate that during isel ?
> >
> > I would be grateful for suggestions on how to proceed further.
> ping ?
>
> Thanks,
> Prathamesh
> >
> > [1] https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139
> >
> > Thanks,
> > Prathamesh
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev