[cfe-dev] Varying per function optimisation based on include path?

Tue Aug 20 09:20:25 PDT 2019

On Tue, Aug 20, 2019 at 9:42 AM via cfe-dev <cfe-dev at lists.llvm.org> wrote:

> > In -Og mode, it seems that it would equally make sense to take "a very
> > big slice around system headers specifically to avoid" debug symbols for
> > code that users can't debug.
>
>
>
> Our users seem to like to be able to dump their STL containers, which
> definitely requires debug symbols for "code they can't debug."
>

Hmm, I may have muddled things up by mentioning "debug symbols" without
fully understanding what people mean by that phrase precisely. I meant
"line-by-line debugging information enabling single-step through a bunch of
templates that the user doesn't care about and would prefer to see inlined
away." Forget debug symbols and focus on inlining, if that'll help avoid my
confusion. :)

> OTOH being able to more aggressively optimize system-header code even in
> -Og mode seems reasonable.
>
> OTOOH most of the system-header code is templates or otherwise inlineable
> early, and after inlining the distinction between app and sys code really
> goes away.
>
>
I believe we'd like to get "inlining early," but the problem is that `-Og`
disables inlining. So there is no "after inlining" at the moment.
Here's a very concrete example: https://godbolt.org/z/5tTgO4

#include <tuple>

int foo(std::tuple<int, int> t) {
    return std::get<0>(t);
}

At `-Og` this produces the assembly code

_Z3fooSt5tupleIJiiEE:
  pushq %rax
  callq _ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_
  movl (%rax), %eax
  popq %rcx
  retq

_ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_:
  jmp _ZSt12__get_helperILm0EiJiEERT0_RSt11_Tuple_implIXT_EJS0_DpT1_EE

_ZSt12__get_helperILm0EiJiEERT0_RSt11_Tuple_implIXT_EJS0_DpT1_EE:
  jmp _ZNSt11_Tuple_implILm0EJiiEE7_M_headERS0_

_ZNSt11_Tuple_implILm0EJiiEE7_M_headERS0_:
  addq $4, %rdi
  jmp _ZNSt10_Head_baseILm0EiLb0EE7_M_headERS0_

_ZNSt10_Head_baseILm0EiLb0EE7_M_headERS0_:
  movq %rdi, %rax
  retq

I believe that if John McFarlane's proposal were adopted by Clang, so that
inlining into functions defined in system headers were allowed at `-Og`,
then the resulting assembly code would look like this instead, for a much
better experience in both debugging and runtime performance:

_Z3fooSt5tupleIJiiEE:
  pushq %rax
  callq _ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_
  movl (%rax), %eax
  popq %rcx
  retq

_ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_:
  leaq 4(%rdi), %rax
  retq

Notice that we still aren't inlining `std::get` into `foo`, because `foo`
(as a user function) gets no inlining optimizations at `-Og`. But we do
inline and collapse the whole chain of function-template helpers into
`std::get` (because `std::get` is a function *defined* in a system header).
This inlining creates new optimization opportunities, such as combining the
`add` and `mov` into a single `lea`.
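
For anyone who hasn't looked inside a tuple implementation, the shape of that
helper chain can be sketched roughly as follows. This is a simplified model,
not libstdc++'s actual code: `head_base`, `tuple_impl`, `get_helper`,
`my_tuple`, `my_get` and `foo_model` are all hypothetical names I made up to
mirror the four layers visible in the mangled symbols above.

```cpp
// Hypothetical, simplified model of the layers behind std::get<0>.
// None of these names are libstdc++'s real definitions.
#include <cstddef>

// Innermost layer: stores one element and returns a reference to it.
template <std::size_t I, class T>
struct head_base {
    T value;
    explicit head_base(T v) : value(v) {}
    static T& head(head_base& b) { return b.value; }
};

template <std::size_t I, class... Ts>
struct tuple_impl;

// Last element: only a head, no tail.
template <std::size_t I, class T>
struct tuple_impl<I, T> : head_base<I, T> {
    explicit tuple_impl(T v) : head_base<I, T>(v) {}
    static T& head(tuple_impl& t) { return head_base<I, T>::head(t); }
};

// Recursive layer: the tail is a base class, the head is another base class.
template <std::size_t I, class T, class... Rest>
struct tuple_impl<I, T, Rest...> : tuple_impl<I + 1, Rest...>, head_base<I, T> {
    tuple_impl(T v, Rest... rest)
        : tuple_impl<I + 1, Rest...>(rest...), head_base<I, T>(v) {}
    static T& head(tuple_impl& t) { return head_base<I, T>::head(t); }
};

// Helper: deduces which tuple_impl base holds element I, then forwards.
template <std::size_t I, class T, class... Rest>
T& get_helper(tuple_impl<I, T, Rest...>& t) {
    return tuple_impl<I, T, Rest...>::head(t);
}

template <class... Ts>
struct my_tuple : tuple_impl<0, Ts...> {
    explicit my_tuple(Ts... vs) : tuple_impl<0, Ts...>(vs...) {}
};

// Public entry point: forwards to the helper.
template <std::size_t I, class... Ts>
auto my_get(my_tuple<Ts...>& t) -> decltype(get_helper<I>(t)) {
    return get_helper<I>(t);
}

// User-level function, analogous to foo() in the example above.
int foo_model(my_tuple<int, int> t) { return my_get<0>(t); }
```

Every function between `my_get` and `head_base::head` is a trivial forwarder.
That is why, when nothing is inlined, I'd expect a compiler to emit one tiny
out-of-line function per layer, much like the first listing, and why
collapsing them leaves little more than a single address computation.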

HTH,
–Arthur