<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

</head>

<body dir="ltr">

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

If I'm understanding what's going on in this test correctly, what's happening is:</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

 * ARMTargetLowering::LowerCall prefers indirect calls when a function is called at least 3 times in minsize</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

 * In Thumb-1 (without -fno-omit-frame-pointer) we have effectively only 3 callee-saved registers (r4-r6)</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

 * The function has three arguments, so those three plus the register we need to hold the function address is more than the number of available callee-saved registers</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

 * Therefore something needs to be spilt</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

 * The function address can be rematerialized, so we spill that and insert an LDR before each call</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

If we didn't have this spilling happening (e.g. if the function had one less argument) then the code size of using BL vs BLX would be:</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

 * BL: 3*4-byte BL = 12 bytes</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

 * BLX: 3*2-byte BLX + 1*2-byte LDR + 4-byte litpool = 12 bytes</div>

<div>

<div id="appendonsend"></div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

(So maybe, even without considering spilling, LowerCall should be adjusted to only do this for functions called 4 or more times, since 3 calls merely break even.)</div>
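<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
The byte counts above can be checked with a small C++ sketch. The function and constant names are hypothetical and nothing here is LLVM code; it only assumes the Thumb-1 sizes already used in the list (4-byte BL, 2-byte BLX and LDR, 4-byte literal pool entry).</div>
<pre>
```cpp
// Thumb-1 sizes assumed from the comparison above (in bytes).
constexpr int kBL = 4;       // direct call
constexpr int kBLX = 2;      // indirect call through a register
constexpr int kLDR = 2;      // load of the function address
constexpr int kLitPool = 4;  // literal pool entry holding the address

// Direct calls: one BL per call.
constexpr int directCallBytes(int calls) { return calls * kBL; }

// Indirect calls with the address kept in a register across all calls:
// one LDR from the literal pool, then one BLX per call.
constexpr int indirectCallBytes(int calls) {
  return calls * kBLX + kLDR + kLitPool;
}

// Three calls break even; the indirect form only wins from four calls on.
static_assert(directCallBytes(3) == 12, "3 direct calls");
static_assert(indirectCallBytes(3) == 12, "3 indirect calls");
static_assert(directCallBytes(4) == 16, "4 direct calls");
static_assert(indirectCallBytes(4) == 14, "4 indirect calls");
```
</pre>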

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

<br>

</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

When we have to spill, if we compare spilling the function's address vs spilling an argument:</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

 * BLX with spilt fn: 3*2-byte BLX + 3*2-byte LDR + 4-byte litpool = 16 bytes</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

 * BLX with spilt arg: 3*2-byte BLX + 1*2-byte LDR + 4-byte litpool + 1*2-byte STR + 2*2-byte LDR = 18 bytes</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

So just changing the spilling heuristic won't work.</div>
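<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
As a rough check of the two spill scenarios, here is a sketch with hypothetical names. Generalizing the spilt-argument case to "one store plus a reload before every call after the first" is my assumption; the figures above only cover 3 calls.</div>
<pre>
```cpp
// Thumb-1 sizes assumed from the comparison above (in bytes).
constexpr int kBL = 4, kBLX = 2, kLDR = 2, kSTR = 2, kLitPool = 4;

// Function address spilt: it is rematerialized by an LDR before every call.
constexpr int spiltAddressBytes(int calls) {
  return calls * kBLX + calls * kLDR + kLitPool;
}

// One argument spilt instead: the address is loaded once, the argument is
// stored once and reloaded before each call after the first (assumption).
constexpr int spiltArgumentBytes(int calls) {
  return calls * kBLX + kLDR + kLitPool + kSTR + (calls - 1) * kLDR;
}

// Plain direct calls, for reference.
constexpr int directCallBytes(int calls) { return calls * kBL; }

static_assert(spiltAddressBytes(3) == 16, "spilt function address");
static_assert(spiltArgumentBytes(3) == 18, "spilt argument");
static_assert(directCallBytes(3) == 12, "direct calls");
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Either spill choice is larger than the 12 bytes of three plain BLs.</div>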

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

<br>

</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

The two ways I see of fixing this:</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

 * In LowerCall only prefer an indirect call if the number of integer register arguments is less than the number of callee-saved registers.</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

 * When the load of the function address is spilled, instead of just rematerializing the load, convert the BLX back into a BL.</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

<br>

</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

The first of these would be easier, but it is only a proxy for register pressure: there will be situations where we need fewer than three callee-saved registers (e.g. arguments are loaded from a pointer), and there are situations where we will spill the function address for reasons entirely unrelated to the function arguments (e.g. if we have enough live local variables).</div>
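<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
For illustration only, the first option amounts to a predicate along these lines. The function and its parameters are hypothetical stand-ins, not the actual code or signature in ARMTargetLowering::LowerCall.</div>
<pre>
```cpp
// Hypothetical model of the proposed check; not the real LowerCall logic.
// Prefer an indirect call only when optimizing for minimum size, the callee
// is called at least 3 times, and the integer register arguments leave at
// least one callee-saved register free to hold the function address.
constexpr bool preferIndirectCall(bool minSize, int numCalls,
                                  int numIntRegArgs, int numCalleeSaved) {
  return minSize && numCalls >= 3 && numCalleeSaved > numIntRegArgs;
}

// The test case in this thread: 3 arguments and 3 usable callee-saved
// registers, so the indirect call would no longer be preferred.
static_assert(!preferIndirectCall(true, 3, 3, 3), "no register left");
// With one argument less there is room for the function address.
static_assert(preferIndirectCall(true, 3, 2, 3), "one register free");
```
</pre>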

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

<br>

</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

For the second, looking at InlineSpiller.cpp it does have the concept of rematerializing by folding a memory operand into another instruction, so I think we could make use of that to do this. It looks like it would involve adding a foldMemoryOperand function

 to ARMInstrInfo and then have this fold an LDR into a BLX by turning it into a BL.</div>
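<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
To show the intended effect of such a fold, here is a toy model on a flat instruction array. It uses no LLVM API: the Op enum, the Inst struct and both functions are made up for this sketch, and the real change would have to go through the spiller's folding hook instead.</div>
<pre>
```cpp
// Toy instruction model; not LLVM's MachineInstr.
enum Op { LDR_LIT, BLX_REG, BL_DIRECT, MOV, ERASED };

struct Inst {
  Op op;
  int reg;  // register written by LDR_LIT/MOV, or called through by BLX_REG
};

// Thumb-1 sizes assumed: BL is 4 bytes, everything else here is 2 bytes.
int sizeInBytes(Op op) {
  if (op == ERASED) return 0;
  return op == BL_DIRECT ? 4 : 2;
}

int codeBytes(const Inst *insts, int n) {
  int total = 0;
  for (int i = 0; i != n; ++i) total += sizeInBytes(insts[i].op);
  return total;
}

// Fold each reload of the function address into the BLX it feeds, turning
// the pair into a single BL. Returns the number of folds performed.
int foldReloadsIntoCalls(Inst *insts, int n) {
  int folds = 0;
  for (int i = 0; i != n; ++i) {
    if (insts[i].op != LDR_LIT) continue;
    for (int j = i + 1; j != n; ++j) {
      bool isCall = insts[j].op == BLX_REG || insts[j].op == BL_DIRECT;
      if (insts[j].op == BLX_REG && insts[j].reg == insts[i].reg) {
        insts[i].op = ERASED;     // the reload is no longer needed
        insts[j].op = BL_DIRECT;  // call the function directly instead
        ++folds;
        break;
      }
      // Give up if another call clobbers the register or it is overwritten.
      if (isCall || insts[j].reg == insts[i].reg) break;
    }
  }
  return folds;
}
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
On the sequence from the original report (three LDR/BLX pairs and six MOVs) the instruction bytes stay at 24 after folding, but the 4-byte literal pool entry is no longer needed, which matches the 16 vs 12 byte comparison for the calls alone.</div>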

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

<br>

</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

John</div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

<br>

</div>

<hr tabindex="-1" style="display:inline-block; width:98%">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> llvm-dev &lt;llvm-dev-bounces@lists.llvm.org&gt; on behalf of Prathamesh Kulkarni via llvm-dev &lt;llvm-dev@lists.llvm.org&gt;<br>

<b>Sent:</b> 07 April 2020 21:07<br>

<b>To:</b> llvm-dev@lists.llvm.org &lt;llvm-dev@lists.llvm.org&gt;<br>

<b>Subject:</b> Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call</font>

<div> </div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt">

<div class="PlainText">On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni<br>

&lt;prathamesh.kulkarni@linaro.org&gt; wrote:<br>

><br>

> Hi,<br>

> Compiling the attached test-case, which is a reduced version of<br>

> uECC_shared_secret from tinycrypt library [1], with<br>

> --target=arm-linux-gnueabi -march=armv6-m -Oz -S<br>

> results in reloading of register holding function's address before<br>

> every call to blx:<br>

><br>

>         ldr       r3, .LCPI0_0<br>

>         blx      r3<br>

>         mov    r0, r6<br>

>         mov    r1, r5<br>

>         mov    r2, r4<br>

>         ldr       r3, .LCPI0_0<br>

>         blx       r3<br>

>         ldr        r3, .LCPI0_0<br>

>         mov     r0, r6<br>

>         mov     r1, r5<br>

>         mov     r2, r4<br>

>         blx       r3<br>

><br>

> .LCPI0_0:<br>

>         .long   foo<br>

><br>

> From dump of regalloc (attached), AFAIU, what seems to happen during<br>

> greedy allocator is, all virt regs %0 to %3 are live across first two<br>

> calls to foo. Thus %0, %1 and %2 get assigned r6, r5 and r4<br>

> respectively, and %3 which holds foo's address doesn't have any<br>

> register left.<br>

> Since its live-range has least weight, it does not evict any existing interval,<br>

> and gets split. Eventually we have the following allocation:<br>

><br>

> [%0 -> $r6] tGPR<br>

> [%1 -> $r5] tGPR<br>

> [%2 -> $r4] tGPR<br>

> [%6 -> $r3] tGPR<br>

> [%11 -> $r3] tGPR<br>

> [%16 -> $r3] tGPR<br>

> [%17 -> $r3] tGPR<br>

><br>

> where %6, %11, %16 and %17 all are derived from %3.<br>

> And since r3 is a call-clobbered register, the compiler is forced to<br>

> reload foo's address<br>

> each time before blx.<br>

><br>

> To fix this, I thought of following approaches:<br>

> (a) Disable the heuristic to prefer indirect call when there are at<br>

> least 3 calls to<br>

> same function in basic block in ARMTargetLowering::LowerCall for Thumb-1 ISA.<br>

><br>

> (b) In ARMTargetLowering::LowerCall, put another constraint like<br>

> number of arguments, as a proxy for register pressure for Thumb-1, but<br>

> that's bound to trip other cases.<br>

><br>

> (c) Give higher priority to allocate virt reg used for indirect calls<br>

> ? However, if that<br>

> results in spilling of some other register, it would defeat the<br>

> purpose of saving code-size. I suppose ideally we want to trigger the<br>

> heuristic of using indirect call only when we know beforehand that it<br>

> will not result in spilling. But I am not sure if it's possible to<br>

> estimate that during isel ?<br>

><br>

> I would be grateful for suggestions on how to proceed further.<br>

ping ?<br>

<br>

Thanks,<br>

Prathamesh<br>

><br>

> [1] <a href="https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139">

https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139</a><br>

><br>

> Thanks,<br>

> Prathamesh<br>


</div>

</span></font></div>

</div>

</body>

</html>