[llvm-bugs] [Bug 34840] New: Customized return-value calling conventions for private functions could make better choices
via llvm-bugs
llvm-bugs at lists.llvm.org
Wed Oct 4 22:40:49 PDT 2017
https://bugs.llvm.org/show_bug.cgi?id=34840
Bug ID: 34840
Summary: Customized return-value calling conventions for
private functions could make better choices
Product: new-bugs
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Keywords: performance
Severity: enhancement
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: peter at cordes.ca
CC: llvm-bugs at lists.llvm.org
In the x86-64 SysV ABI, small aggregate return values are packed into rax and
rdx with multiple members per register, but a single member won't split across
register boundaries. So pair<int,int> packs into RAX, but pair<long,int> uses
rdx and rax separately. This somehow confuses clang's custom-return-value
optimization for private functions.
// https://godbolt.org/g/WCQMbf
#include <utility>
using mypair1 = std::pair<int, int>; // long, int makes better code
static __attribute((noinline))
mypair1 getpair1() { return {11, 22};}
movabsq $94489280523, %rax # imm = 0x160000000B
retq # Still returns both values
int dst;
void caller1() {
dst = getpair1().second;
}
pushq %rax
callq getpair1()
shrq $32, %rax # extra instruction because we didn't
optimize the calling convention
movl %eax, dst(%rip)
popq %rax
retq
But with std::pair<long, int> we get non-ABI-compliant private functions (still
with the standard name, though? Maybe better to rename them to give
compile-time errors if people have inline asm that calls them? That's
horrible, but there have been stackoverflow questions...). Anyway, with
pair<long,int>
getpair2():
movl $22, %eax # This would go in edx in a non-private version
retq
caller2:
...
callq getpair1()
movl %eax, dst(%rip)
...
So changing the type of the member we're *not* using from int to long allows it
to optimize away, and return .second in eax. If bending the return-value type
/ convention is allowed at all, we should definitely be doing it to optimize
the <int, int> and <int, bool> case.
In most cases, returning separate members in separate registers is going to be
a win, unless we're about to store the whole struct to memory. (Even then, if
it saves insns overall in callee + caller it might be better to do separate MOV
stores instead of ALU packing into a single reg.) Up to 4 registers might be a
reasonable threshold, for callers that use most of the members of a returned
object. If all the callers actually want the object stored in memory and don't
read it directly right away / at all, passing a hidden pointer is good.
Anyway, consider std::optional<int>. The current libc++ implementation ends up
returning like pair<int, bool>, so testing the boolean part with a TEST
instruction takes a 64-bit constant. (And clang doesn't think of using bt
$32, %rax, but that doesn't macro-fuse with jcc like test %dl,%dl will, and
it's bigger.) (See bug 34839 for an example, or change the #if 0 in the
godbolt link above.)
Returning the boolean in %dl instead of packed in with the T value member in
%rax would be better, and usually easier for the callee. We get this with
optional<long>. Hopefully LTO and static functions can do this whenever
possible.
related: libstdc++ has an optional<> that's not trivially copyable, so
optimizing the return value passing for that case would be even more valuable.
(With LTO or for private functions, optimize away the hidden pointer and
stores/reloads, as suggested by https://stackoverflow.com/a/46549978/224132).
-------
To get better code for non-private calls where we can't bend the SysV ABI, it
might be better to have std::optional<T> store the bool first, like
pair<bool,int>. (On balance, I think this is *not* a good idea; see downsides
below.)
You can't access the value of an optional<T> without checking the boolean
(unless bug 34839 is fixed and constant-propagation for aggregate return values
lets it optimize away). So putting the boolean in an easy-to-access place
would help some cases.
For 32-bit T, <bool, int> would allow test %al,%al / shr $32, %rax. (Or rorx
$32, %rax, %rcx). The callee can create it cheaply after generating the value
in eax with shl $32, %rax / mov $1, %al (or setcc %al without any extra
zero-extension). On CPUs with partial-reg stalls
(tune=generic/nehalem/whatever): OR $1,%rax is still only 4 bytes with an imm8.
(Although bts $32, %rax is also cheap for returning the current optional<int>
with a variable value and a constant has_value.)
For 64-bit T, the bool goes in %al and T goes in %rdx in the x86-64 SysV ABI.
For 8, 16, 24, or whatever T with multiple small members, it's still just
shifts / movzx to extract them.
Downsides to using pair<bool,T> for optional<T>:
- Puts an extra shift on the critical path for the value. The has_value is
often an input to a control dependency, which is off the critical path because
of speculative execution + branch prediction. Using a 32-bit value in a
register with high garbage is often not a problem. If the bottleneck isn't the
front-end, then more bloated code to test the has_value might not hurt.
Especially if the 64-bit constant can be hoisted out of a loop. Looping over
an array of packed optional<int>, loading the whole thing with a 64b load,
testing the bool in the upper 32 bits, then using the value in the lower 32
bits is actually really efficient.
- Padding inside the object in memory, instead of at the end of the object.
IDK if that matters, though. You can't pack another object into trailing
padding in a function that lets a non-const pointer escape (because writes to
objects are allowed to clobber the padding). Actually since casting away const
is allowed, I guess any pointer. And if a pointer doesn't escape, you can
spill the object in whatever format you like.
- optional<96_bit_type> (like struct {long; int;}) will be larger if the bool
is first instead of last. There's never padding at the beginning of a struct,
but there sometimes is at the end. This downside applies to all architectures,
not just x86-64, and we don't want to hurt everything else for a small benefit
on x86-64.
So on second though, class optiona{ T value; bool has_value; } makes more
sense.
Hopefully LTO can tweak the calling convention of functions returning
optional<int> or other 32-bit types in most cases, so we get the bool in a
separate register for easy testing. Otherwise use BT instead of movabs/test
unless the constant can be hoisted.
---
Can we design optional<T> such that the has_value flag is all-ones or all-zeros
and extends to the next alignment boundary? That would allow test %rax,%rax /
jl to test the sign bit. Or test %eax,%eax for optional<short> or
optional<three_bytes>.
(jl is more efficient than js, because it macro-fuses with test on more CPUs.
TEST same,same always clears OF so they're equivalent.)
I think you could implement an array like uint8_t has_value[ func(sizeof(T)) ],
where func() is constexpr returning next_power_of_2(sizeof(T)) - sizeof(T) (and
clamp to 1..4 on 64-bit machines). Maybe some kind of union hack could let it
compile efficiently, to set / clear all the has_value bytes with AND or OR, but
that might be worse than BTS / BTR in code that needs to modify the has_value a
lot.
This might need compiler support to actually not suck, and that sounds like
more work than std::optional is worth unless it *really* catches on. But if it
does, test/jl is extremely good for reading has_value.
--
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20171005/ec86fd54/attachment-0001.html>
More information about the llvm-bugs
mailing list