[cfe-dev] [RFC] Per-callsite inline intrinsics

Tue Sep 4 11:32:14 PDT 2018

Hi folks,

TL;DR: I propose adding three new C/C++ intrinsics for controlling inlining at
individual callsites:
* __builtin_no_inline(Foo()) -- prevents the call to Foo() from being
inlined at that particular callsite.
* __builtin_always_inline(Foo()) -- inlines this call to Foo(), if possible.
* __builtin_flatten_inline(Foo()) -- inlines this call to Foo() and
(transitively) everything called within Foo’s body.

These intrinsics apply to the outermost call-like expression and it will be
possible to use them with: function calls, member function calls, operator
calls, constructor calls, indirect calls (with function pointers, member
function pointers, virtual calls).

I have posted a patch implementing the first two intrinsics here:
https://reviews.llvm.org/D51200. I would really appreciate feedback on the
proposed semantics and implementation. I don’t have much experience with
Clang, and I’d appreciate any help with the technical problems I mentioned
in the code review. Details below.

Motivation:
It’s often the case that the compiler misses an inlining opportunity or
inlines a function call too aggressively. In a lot of cases, it’s possible to
map a performance regression to a few wrong inlining decisions. When that
happens, we can manually enforce the correct inlining decisions by:
1. Marking the callees of interest with __attribute__ ((noinline)),
__attribute__ ((always_inline)), or gnu::flatten. This affects all call
sites with such callees. For more fine-grained control over inlining, one
workaround is to create a few copies (or proxies), each marked with a
different attribute.
2. Globally changing the inline thresholds (e.g., -mllvm
-inline-threshold=K).
3. Manually modifying the source in order to change the calculated inlining
cost (e.g., splitting a function into a few smaller ones), or even inlining a
function by hand by copy-pasting it into the callsite.
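
To make workaround (1) concrete, here is a minimal sketch of the proxy
approach using only the existing GNU-style attributes. All function names
(compute_impl, compute_noinline, compute, hot_path, cold_path, default_path)
are made up for illustration and do not come from the patch.

```cpp
// Shared body. always_inline asks the compiler to inline it into every
// caller, including the wrappers below.
__attribute__((always_inline)) static inline int compute_impl(int x) {
  return x * x + 1;
}

// Proxy for callsites where the call must stay out of line.
__attribute__((noinline)) static int compute_noinline(int x) {
  return compute_impl(x);
}

// Proxy for callsites that should follow the default inliner heuristics.
static int compute(int x) { return compute_impl(x); }

// Callsite where we want the body inlined.
int hot_path(int x) { return compute_impl(x); }

// Callsite where we want a real call.
int cold_path(int x) { return compute_noinline(x); }

// Callsite where the inliner decides.
int default_path(int x) { return compute(x); }
```

Every callsite has to pick the right proxy by name, and each new inlining
policy needs another wrapper. That is the maintenance cost the per-callsite
intrinsics are meant to avoid.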

Problems with the existing solutions:
* (1) and (2) can affect inlining globally instead of only at
the places where it matters.
* (1) and (3) can have the disadvantage of duplicating code and thus making
it less maintainable.
* (1) and (3) sometimes cannot be applied if for some reason we cannot
modify the inlined functions. This can be the case when these functions are
declared in an external library.

Proposed solution:
I propose to introduce new Clang intrinsics for controlling inlining at the
call-site level. This way, it’s possible to cleanly tell the compiler what
should happen to one particular function call only. These intrinsics are also
self-documenting, in the sense that they are easy to reason about for
humans and appear directly in source code.

The proposed intrinsics are __builtin_no_inline, __builtin_always_inline,
and __builtin_flatten_inline.

Example:
int foo(int) { /* ... */ }

void baz(int) { /* ... */ }

struct S {
  S();
  void bar(int);
  virtual void virt();
  S operator++();
  friend S operator+(const S &, const S &);
};

S *GetS();

int main() {
  // Inline the function call to foo(0) into main.
  int x = __builtin_always_inline(foo(0));

  // Prevent the constructor from being inlined into main.
  S s = __builtin_no_inline(S());

  // Force inline S::bar into main without forcing foo to be inlined.
  __builtin_always_inline(s.bar(foo(x)));

  // Force inline foo into main without forcing S::bar to be inlined.
  s.bar(__builtin_always_inline(foo(x)));

  // Force the outer call to baz to be inlined, then try to
  // transitively inline every function call from baz's body.
  // Does not force foo to be inlined.
  __builtin_flatten_inline(baz(foo(x)));

  // Force the operator call S + S to be inlined.
  ++__builtin_always_inline(s + s);

  // Try to inline the virtual call to virt, if possible.
  __builtin_always_inline(GetS()->virt());
}

Syntax and semantics:
The inline intrinsics can be applied to function calls, member function
calls, constructor calls, virtual calls, function pointer and member
function pointer calls, and operator calls.  They always affect the
outermost call and not subexpressions.

All the intrinsics work on a “best-effort” basis, and make the specified
inline decisions happen whenever possible. This may not always be the case,
e.g. if you wrap indirect calls with __builtin_always_inline and the target
doesn’t happen to be resolved during compilation.

One thing I’m not sure about is what to do when the expression inside an
inline intrinsic doesn’t happen to be any kind of call. It doesn’t make
much sense to be able to write something like
__builtin_always_inline(1 + 3), but in a generic context (e.g.,
__builtin_always_inline(t + u)) it’s not known whether the expression will
end up operating on primitive types or on user-defined ones that actually
make function calls. In my opinion, it would make life easier if inline
intrinsics over non-call-like expressions were treated as no-ops, in any
context, as the compiler can already reason about them and won’t perform
any function calls. One option is to silently do nothing when the compiler
resolves the expression to a built-in operation, which would be consistent
with the behavior of silently not inlining calls it cannot resolve.
Alternatively, we may emit warnings, which would make maintaining code with
these intrinsics easier.
I’d really like to get feedback on this issue.
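
To illustrate the generic-context case, here is a small sketch. The macro
ALWAYS_INLINE_CALL is a hypothetical stand-in that just expands to its
argument; it does not affect inlining and only marks where
__builtin_always_inline would be written. Vec2 and add are likewise
made-up names.

```cpp
// Hypothetical stand-in for __builtin_always_inline: expands to the wrapped
// expression unchanged, so it has no effect on inlining.
#define ALWAYS_INLINE_CALL(expr) (expr)

// User-defined type: t + u resolves to a call to operator+.
struct Vec2 {
  int x, y;
  friend Vec2 operator+(const Vec2 &a, const Vec2 &b) {
    return Vec2{a.x + b.x, a.y + b.y};
  }
};

// Generic code: whether t + u is a function call depends on T and U.
template <typename T, typename U>
auto add(const T &t, const U &u) -> decltype(t + u) {
  return ALWAYS_INLINE_CALL(t + u);
}

// add(1, 3): built-in addition, no call to inline.
// add(Vec2{1, 2}, Vec2{3, 4}): calls Vec2's operator+, a real inlining candidate.
```

If wrapping a non-call expression were a hard error, add could not be
instantiated with int, which is why treating it as a no-op (possibly with a
warning) looks attractive to me.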

Implementation:
I have already partially implemented the first two intrinsics
(__builtin_no_inline and __builtin_always_inline) here:
https://reviews.llvm.org/D51200. Calls wrapped with the inline intrinsics
are annotated with appropriate attributes during code generation. LLVM
seems to already take care of callsites attributed with alwaysinline and
noinline. I think it should also be possible to implement some appropriate
attribute for flattening, as there’s already a gnu::flatten attribute for
function declarations.

Let me know what you think,
Kuba

-- 
Jakub Kuderski