[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call

Tue Apr 14 15:06:48 PDT 2020

> Could you please point out what am I doing wrong in the patch ?

It's because you're getting the function name by doing
  callee->getName().str().c_str()
The str() call generates a temporary copy of the name which ceases to exist outside of this expression
causing the c_str() to return a pointer to no-longer-valid memory.

I'm not sure what the correct way of getting the name as a char* is here. Doing
  callee->getName().data()
appears to work, though I don't know if we can rely on the StringReg that getName() returns being
appropriately null-terminated.

> However for above case, IIUC, we would want all calls to be converted to bl  ?
It would be better yes, though I'm not sure how you'd go about making that happen. It's probably not
worth worrying too much about though, as the new behaviour is still better than the old.

John

________________________________
From: Prathamesh Kulkarni <prathamesh.kulkarni at linaro.org>
Sent: 15 April 2020 01:44
To: John Brawn <John.Brawn at arm.com>
Cc: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call

Hi,
I have attached WIP patch for adding foldMemoryOperand to Thumb1InstrInfo.
For the following case:

void f(int x, int y, int z)
{
  void bar(int, int, int);

  bar(x, y, z);
  bar(x, z, y);
  bar(y, x, z);
  bar(y, y, x);
}

it calls foldMemoryOperand twice, and thus converts two calls from blx to bl.
callMI->dump() shows the function name "bar" correctly, however in
generated assembly call to bar is garbled:
(compiled with -Oz --target=arm-linux-gnueabi -marcha=armv6-m):

        add     r7, sp, #16
        mov     r6, r2
        mov     r5, r1
        mov     r4, r0
        bl      "<90>w\n        "
        mov     r1, r2
        mov     r2, r5
        bl      "<90>w\n        "
        mov     r0, r5
        mov     r1, r4
        mov     r2, r6
        ldr     r6, .LCPI0_0
        blx     r6
        mov     r0, r5
        mov     r1, r5
        mov     r2, r4
        blx     r6

regalloc dump (attached) shows:
Inline spilling tGPR:%9 [80r,152r:0)  0 at 80r weight:3.209746e-03
>From original %3
        also spill snippet %8 [152r,232r:0)  0 at 152r weight:2.104167e-03
  tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp, implicit
killed $r0, implicit killed $r1, implicit killed $r2
        folded:   144r  tBL 14, $noreg, &"\E0\9C\06\A0\FC\7F",
implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed
$r1, implicit killed $r2 :: (load 4 from constant-pool)
        remat:  228r    %10:tgpr = tLDRpci %const.0, 14, $noreg ::
(load 4 from constant-pool)
                232e    %7:tgpr = COPY killed %10:tgpr

Could you please point out what am I doing wrong in the patch ?

Also, I guess, it only converted two calls to bl because further
spilling wasn't necessary.
However for above case, IIUC, we would want all calls to be converted to bl  ?
Since,
4 bl == 16 bytes
2 bl + 2 blx + 1 lr == 2 * 4 (bl) + 2 * 2 (blx) + 1 * 2 (ldr) + 4
bytes (litpool) == 18 bytes

Thanks,
Prathamesh

On Fri, 10 Apr 2020 at 04:22, Prathamesh Kulkarni
<prathamesh.kulkarni at linaro.org> wrote:
>
> Hi John,
> Thanks for the suggestions! I will start looking at adding
> foldMemoryOperand to ARMInstrInfo.
>
> Thanks,
> Prathamesh
>
> On Tue, 7 Apr 2020 at 23:55, John Brawn <John.Brawn at arm.com> wrote:
> >
> > If I'm understanding what's going on in this test correctly, what's happening is:
> >  * ARMTargetLowering::LowerCall prefers indirect calls when a function is called at least 3 times in minsize
> >  * In thumb 1 (without -fno-omit-frame-pointer) we have effectively only 3 callee-saved registers (r4-r6)
> >  * The function has three arguments, so those three plus the register we need to hold the function address is more than our callee-saved registers
> >  * Therefore something needs to be spilt
> >  * The function address can be rematerialized, so we spill that and insert and LDR before each call
> >
> > If we didn't have this spilling happening (e.g. if the function had one less argument) then the code size of using BL vs BLX
> >  * BL: 3*4-byte BL = 12 bytes
> >  * BX: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool = 12 bytes
> > (So maybe even not considering spilling, LowerCall should be adjusted to do this for functions called 4 or more times)
> >
> > When we have to spill, if we compare spilling the functions address vs spilling an argument:
> >  * BX with spilt fn: 3*2-byte BX + 3*2-byte LDR + 4-byte litpool = 16 bytes
> >  * BX with spilt arg: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool + 1*2-byte STR + 2*2-byte LDR = 18 bytes
> > So just changing the spilling heuristic won't work.
> >
> > The two ways I see of fixing this:
> >  * In LowerCall only prefer an indirect call if the number of integer register arguments is less than the number of callee-saved registers.
> >  * When the load of the function address is spilled, instead of just rematerializing the load instead convert the BX back into BL.
> >
> > The first of these would be easier, but there will be situations where we need to use less than three callee-saved registers (e.g. arguments are loaded from a pointer) and there are situations where we will spill the function address for reasons entirely unrelated to the function arguments (e.g. if we have enough live local variables).
> >
> > For the second, looking at InlineSpiller.cpp it does have the concept of rematerializing by folding a memory operand into another instruction, so I think we could make use of that to do this. It looks like it would involve adding a foldMemoryOperand function to ARMInstrInfo and then have this fold a LDR into a BX by turning it into a BL.
> >
> > John
> >
> > ________________________________
> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Prathamesh Kulkarni via llvm-dev <llvm-dev at lists.llvm.org>
> > Sent: 07 April 2020 21:07
> > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
> >
> > On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni
> > <prathamesh.kulkarni at linaro.org> wrote:
> > >
> > > Hi,
> > > Compiling attached test-case, which is reduced version of of
> > > uECC_shared_secret from tinycrypt library [1], with
> > > --target=arm-linux-gnueabi -march=armv6-m -Oz -S
> > > results in reloading of register holding function's address before
> > > every call to blx:
> > >
> > >         ldr       r3, .LCPI0_0
> > >         blx      r3
> > >         mov    r0, r6
> > >         mov    r1, r5
> > >         mov    r2, r4
> > >         ldr       r3, .LCPI0_0
> > >         blx       r3
> > >         ldr        r3, .LCPI0_0
> > >         mov     r0, r6
> > >         mov     r1, r5
> > >         mov     r2, r4
> > >         blx       r3
> > >
> > > .LCPI0_0:
> > >         .long   foo
> > >
> > > From dump of regalloc (attached), AFAIU, what seems to happen during
> > > greedy allocator is, all virt regs %0 to %3 are live across first two
> > > calls to foo. Thus %0, %1 and %2 get assigned r6, r5 and r4
> > > respectively, and %3 which holds foo's address doesn't have any
> > > register left.
> > > Since it's live-range has least weight, it does not evict any existing interval,
> > > and gets split. Eventually we have the following allocation:
> > >
> > > [%0 -> $r6] tGPR
> > > [%1 -> $r5] tGPR
> > > [%2 -> $r4] tGPR
> > > [%6 -> $r3] tGPR
> > > [%11 -> $r3] tGPR
> > > [%16 -> $r3] tGPR
> > > [%17 -> $r3] tGPR
> > >
> > > where %6, %11, %16 and %17 all are derived from %3.
> > > And since r3 is a call-clobbered register, the compiler is forced to
> > > reload foo's address
> > > each time before blx.
> > >
> > > To fix this, I thought of following approaches:
> > > (a) Disable the heuristic to prefer indirect call when there are at
> > > least 3 calls to
> > > same function in basic block in ARMTargetLowering::LowerCall for Thumb-1 ISA.
> > >
> > > (b) In ARMTargetLowering::LowerCall, put another constraint like
> > > number of arguments, as a proxy for register pressure for Thumb-1, but
> > > that's bound to trip another cases.
> > >
> > > (c) Give higher priority to allocate vrit reg used for indirect calls
> > > ? However, if that
> > > results in spilling of some other register, it would defeat the
> > > purpose of saving code-size. I suppose ideally we want to trigger the
> > > heuristic of using indirect call only when we know beforehand that it
> > > will not result in spilling. But I am not sure if it's possible to
> > > estimate that during isel ?
> > >
> > > I would be grateful for suggestions on how to proceed further.
> > ping ?
> >
> > Thanks,
> > Prathamesh
> > >
> > > [1] https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139
> > >
> > > Thanks,
> > > Prathamesh
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200414/386a9a63/attachment.html>