<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Customized return-value calling conventions for private functions could make better choices"

   href="https://bugs.llvm.org/show_bug.cgi?id=34840">34840</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Customized return-value calling conventions for private functions could make better choices

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Keywords</th>

          <td>performance

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>peter@cordes.ca

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>In the x86-64 SysV ABI, small aggregate return values are packed into rax and

rdx with multiple members per register, but a single member won't split across

register boundaries.  So pair<int,int> packs into RAX, but pair<long,int> uses

rdx and rax separately.  This somehow confuses clang's custom-return-value

optimization for private functions.

// <a href="https://godbolt.org/g/WCQMbf">https://godbolt.org/g/WCQMbf</a>

#include <utility>

using mypair1 = std::pair<int, int>;  // long, int makes better code

static __attribute((noinline))

mypair1 getpair1() { return {11, 22};}

        movabsq $94489280523, %rax      # imm = 0x160000000B

        retq                            # Still returns both values

int dst;

void caller1() {

    dst = getpair1().second;

}

        pushq   %rax

        callq   getpair1()

        shrq    $32, %rax          # extra instruction because we didn't

optimize the calling convention 

        movl    %eax, dst(%rip)

        popq    %rax

        retq

But with std::pair<long, int> we get non-ABI-compliant private functions (still

with the standard name, though?  Maybe better to rename them to give

compile-time errors if people have inline asm that calls them?  That's

horrible, but there have been stackoverflow questions...).  Anyway, with

pair<long,int>

getpair2():

        movl    $22, %eax       # This would go in edx in a non-private version

        retq

caller2:

       ...

       callq   getpair1()

       movl    %eax, dst(%rip)

       ...

So changing the type of the member we're *not* using from int to long allows it

to optimize away, and return .second in eax.  If bending the return-value type

/ convention is allowed at all, we should definitely be doing it to optimize

the <int, int> and <int, bool> case.

In most cases, returning separate members in separate registers is going to be

a win, unless we're about to store the whole struct to memory.  (Even then, if

it saves insns overall in callee + caller it might be better to do separate MOV

stores instead of ALU packing into a single reg.)  Up to 4 registers might be a

reasonable threshold, for callers that use most of the members of a returned

object.  If all the callers actually want the object stored in memory and don't

read it directly right away / at all, passing a hidden pointer is good.

Anyway, consider std::optional<int>.  The current libc++ implementation ends up

returning like pair<int, bool>, so testing the boolean part with a TEST

instruction takes a 64-bit constant.  (And clang doesn't think of using  bt

$32, %rax, but that doesn't macro-fuse with jcc like test %dl,%dl will, and

it's bigger.)  (See <a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Missed constant propagation of aggregate return values of non-inlined static functions"

   href="show_bug.cgi?id=34839">bug 34839</a> for an example, or change the #if 0 in the

godbolt link above.)

Returning the boolean in %dl instead of packed in with the T value member in

%rax would be better, and usually easier for the callee.  We get this with

optional<long>.  Hopefully LTO and static functions can do this whenever

possible.

related: libstdc++ has an optional<> that's not trivially copyable, so

optimizing the return value passing for that case would be even more valuable. 

(With LTO or for private functions, optimize away the hidden pointer and

stores/reloads, as suggested by <a href="https://stackoverflow.com/a/46549978/224132">https://stackoverflow.com/a/46549978/224132</a>).

-------

To get better code for non-private calls where we can't bend the SysV ABI, it

might be better to have std::optional<T> store the bool first, like

pair<bool,int>.  (On balance, I think this is *not* a good idea; see downsides

below.)

You can't access the value of an optional<T> without checking the boolean

(unless <a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Missed constant propagation of aggregate return values of non-inlined static functions"

   href="show_bug.cgi?id=34839">bug 34839</a> is fixed and constant-propagation for aggregate return values

lets it optimize away).  So putting the boolean in an easy-to-access place

would help some cases.

For 32-bit T, <bool, int> would allow test %al,%al / shr $32, %rax.  (Or rorx

$32, %rax, %rcx).  The callee can create it cheaply after generating the value

in eax with shl $32, %rax / mov $1, %al  (or setcc %al without any extra

zero-extension).  On CPUs with partial-reg stalls

(tune=generic/nehalem/whatever): OR $1,%rax is still only 4 bytes with an imm8.

 (Although bts $32, %rax is also cheap for returning the current optional<int>

with a variable value and a constant has_value.)

For 64-bit T, the bool goes in %al and T goes in %rdx in the x86-64 SysV ABI. 

For 8, 16, 24, or whatever T with multiple small members, it's still just

shifts / movzx to extract them.

Downsides to using pair<bool,T> for optional<T>:

- Puts an extra shift on the critical path for the value.  The has_value is

often an input to a control dependency, which is off the critical path because

of speculative execution + branch prediction.  Using a 32-bit value in a

register with high garbage is often not a problem.  If the bottleneck isn't the

front-end, then more bloated code to test the has_value might not hurt.

 Especially if the 64-bit constant can be hoisted out of a loop.  Looping over

an array of packed optional<int>, loading the whole thing with a 64b load,

testing the bool in the upper 32 bits, then using the value in the lower 32

bits is actually really efficient.

- Padding inside the object in memory, instead of at the end of the object. 

IDK if that matters, though.  You can't pack another object into trailing

padding in a function that lets a non-const pointer escape (because writes to

objects are allowed to clobber the padding).  Actually since casting away const

is allowed, I guess any pointer.  And if a pointer doesn't escape, you can

spill the object in whatever format you like.

- optional<96_bit_type> (like struct {long; int;}) will be larger if the bool

is first instead of last.  There's never padding at the beginning of a struct,

but there sometimes is at the end.  This downside applies to all architectures,

not just x86-64, and we don't want to hurt everything else for a small benefit

on x86-64.

So on second though, class optiona{ T value; bool has_value; } makes more

sense.

Hopefully LTO can tweak the calling convention of functions returning

optional<int> or other 32-bit types in most cases, so we get the bool in a

separate register for easy testing.  Otherwise use BT instead of movabs/test

unless the constant can be hoisted.

---

Can we design optional<T> such that the has_value flag is all-ones or all-zeros

and extends to the next alignment boundary?  That would allow test %rax,%rax /

jl  to test the sign bit.  Or test %eax,%eax for optional<short> or

optional<three_bytes>.

(jl is more efficient than js, because it macro-fuses with test on more CPUs. 

TEST same,same always clears OF so they're equivalent.)

I think you could implement an array like uint8_t has_value[ func(sizeof(T)) ],

where func() is constexpr returning next_power_of_2(sizeof(T)) - sizeof(T) (and

clamp to 1..4 on 64-bit machines).  Maybe some kind of union hack could let it

compile efficiently, to set / clear all the has_value bytes with AND or OR, but

that might be worse than BTS / BTR in code that needs to modify the has_value a

lot.

This might need compiler support to actually not suck, and that sounds like

more work than std::optional is worth unless it *really* catches on.  But if it

does, test/jl is extremely good for reading has_value.</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>