[llvm-bugs] [Bug 34840] New: Customized return-value calling conventions for private functions could make better choices

Wed Oct 4 22:40:49 PDT 2017

https://bugs.llvm.org/show_bug.cgi?id=34840

            Bug ID: 34840
           Summary: Customized return-value calling conventions for
                    private functions could make better choices
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Keywords: performance
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: llvm-bugs at lists.llvm.org

In the x86-64 SysV ABI, small aggregate return values are packed into rax and
rdx with multiple members per register, but a single member won't split across
register boundaries.  So pair<int,int> packs into RAX, but pair<long,int> uses
rdx and rax separately.  This somehow confuses clang's custom-return-value
optimization for private functions.

// https://godbolt.org/g/WCQMbf
#include <utility>
using mypair1 = std::pair<int, int>;  // long, int makes better code
static __attribute((noinline))
mypair1 getpair1() { return {11, 22};}

        movabsq $94489280523, %rax      # imm = 0x160000000B
        retq                            # Still returns both values

int dst;
void caller1() {
    dst = getpair1().second;
}

        pushq   %rax
        callq   getpair1()
        shrq    $32, %rax          # extra instruction because we didn't
optimize the calling convention 
        movl    %eax, dst(%rip)
        popq    %rax
        retq

But with std::pair<long, int> we get non-ABI-compliant private functions (still
with the standard name, though?  Maybe better to rename them to give
compile-time errors if people have inline asm that calls them?  That's
horrible, but there have been stackoverflow questions...).  Anyway, with
pair<long,int>

getpair2():
        movl    $22, %eax       # This would go in edx in a non-private version
        retq
caller2:
       ...
       callq   getpair1()
       movl    %eax, dst(%rip)
       ...

So changing the type of the member we're *not* using from int to long allows it
to optimize away, and return .second in eax.  If bending the return-value type
/ convention is allowed at all, we should definitely be doing it to optimize
the <int, int> and <int, bool> case.

In most cases, returning separate members in separate registers is going to be
a win, unless we're about to store the whole struct to memory.  (Even then, if
it saves insns overall in callee + caller it might be better to do separate MOV
stores instead of ALU packing into a single reg.)  Up to 4 registers might be a
reasonable threshold, for callers that use most of the members of a returned
object.  If all the callers actually want the object stored in memory and don't
read it directly right away / at all, passing a hidden pointer is good.

Anyway, consider std::optional<int>.  The current libc++ implementation ends up
returning like pair<int, bool>, so testing the boolean part with a TEST
instruction takes a 64-bit constant.  (And clang doesn't think of using  bt
$32, %rax, but that doesn't macro-fuse with jcc like test %dl,%dl will, and
it's bigger.)  (See bug 34839 for an example, or change the #if 0 in the
godbolt link above.)

Returning the boolean in %dl instead of packed in with the T value member in
%rax would be better, and usually easier for the callee.  We get this with
optional<long>.  Hopefully LTO and static functions can do this whenever
possible.

related: libstdc++ has an optional<> that's not trivially copyable, so
optimizing the return value passing for that case would be even more valuable. 
(With LTO or for private functions, optimize away the hidden pointer and
stores/reloads, as suggested by https://stackoverflow.com/a/46549978/224132).

-------

To get better code for non-private calls where we can't bend the SysV ABI, it
might be better to have std::optional<T> store the bool first, like
pair<bool,int>.  (On balance, I think this is *not* a good idea; see downsides
below.)

You can't access the value of an optional<T> without checking the boolean
(unless bug 34839 is fixed and constant-propagation for aggregate return values
lets it optimize away).  So putting the boolean in an easy-to-access place
would help some cases.

For 32-bit T, <bool, int> would allow test %al,%al / shr $32, %rax.  (Or rorx
$32, %rax, %rcx).  The callee can create it cheaply after generating the value
in eax with shl $32, %rax / mov $1, %al  (or setcc %al without any extra
zero-extension).  On CPUs with partial-reg stalls
(tune=generic/nehalem/whatever): OR $1,%rax is still only 4 bytes with an imm8.
 (Although bts $32, %rax is also cheap for returning the current optional<int>
with a variable value and a constant has_value.)

For 64-bit T, the bool goes in %al and T goes in %rdx in the x86-64 SysV ABI. 
For 8, 16, 24, or whatever T with multiple small members, it's still just
shifts / movzx to extract them.

Downsides to using pair<bool,T> for optional<T>:

- Puts an extra shift on the critical path for the value.  The has_value is
often an input to a control dependency, which is off the critical path because
of speculative execution + branch prediction.  Using a 32-bit value in a
register with high garbage is often not a problem.  If the bottleneck isn't the
front-end, then more bloated code to test the has_value might not hurt.

 Especially if the 64-bit constant can be hoisted out of a loop.  Looping over
an array of packed optional<int>, loading the whole thing with a 64b load,
testing the bool in the upper 32 bits, then using the value in the lower 32
bits is actually really efficient.

- Padding inside the object in memory, instead of at the end of the object. 
IDK if that matters, though.  You can't pack another object into trailing
padding in a function that lets a non-const pointer escape (because writes to
objects are allowed to clobber the padding).  Actually since casting away const
is allowed, I guess any pointer.  And if a pointer doesn't escape, you can
spill the object in whatever format you like.

- optional<96_bit_type> (like struct {long; int;}) will be larger if the bool
is first instead of last.  There's never padding at the beginning of a struct,
but there sometimes is at the end.  This downside applies to all architectures,
not just x86-64, and we don't want to hurt everything else for a small benefit
on x86-64.

So on second though, class optiona{ T value; bool has_value; } makes more
sense.

Hopefully LTO can tweak the calling convention of functions returning
optional<int> or other 32-bit types in most cases, so we get the bool in a
separate register for easy testing.  Otherwise use BT instead of movabs/test
unless the constant can be hoisted.

---

Can we design optional<T> such that the has_value flag is all-ones or all-zeros
and extends to the next alignment boundary?  That would allow test %rax,%rax /
jl  to test the sign bit.  Or test %eax,%eax for optional<short> or
optional<three_bytes>.

(jl is more efficient than js, because it macro-fuses with test on more CPUs. 
TEST same,same always clears OF so they're equivalent.)

I think you could implement an array like uint8_t has_value[ func(sizeof(T)) ],
where func() is constexpr returning next_power_of_2(sizeof(T)) - sizeof(T) (and
clamp to 1..4 on 64-bit machines).  Maybe some kind of union hack could let it
compile efficiently, to set / clear all the has_value bytes with AND or OR, but
that might be worse than BTS / BTR in code that needs to modify the has_value a
lot.

This might need compiler support to actually not suck, and that sounds like
more work than std::optional is worth unless it *really* catches on.  But if it
does, test/jl is extremely good for reading has_value.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20171005/ec86fd54/attachment-0001.html>