<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/56547>56547</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Missed devirtualization of hot function call makes Binary-Trees C++ benchmark 60% slower than on GCC (Benchmarks Game)
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            llvm:codegen,
            performance
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          yurai007
      </td>
    </tr>
</table>

<pre>
**Problem description**

Consider the following C++ snippet: https://godbolt.org/z/7fxz1rhbo. The most important part of the example is the recursive function _make_, which allocates a Node through the _allocate_ function before calling itself:

`void* mem = store.allocate(sizeof(Node), alignof(Node));`

The _allocate_ function contains a call to _do_allocate_, which is virtual:

```
void* allocate(size_t bytes, size_t alignment = alignof(max_align_t)) {
        return ::operator new(bytes, do_allocate(bytes, alignment));
}
```
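For readers who do not want to open the Godbolt link, the overall shape can be sketched as below. This is a minimal reconstruction under assumptions, not the code from the link: the `Node` layout, the `memory_resource` base class, the bump-pointer body of `do_allocate` and the `count_nodes` helper are all hypothetical, and only the `allocate` body and the `make`/`store.allocate` call mirror what is quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical node type; the real benchmark's Node may differ.
struct Node {
    Node* left;
    Node* right;
};

// Hypothetical base class: allocate() is non-virtual and forwards to the
// virtual do_allocate(), as in the snippet quoted above.
class memory_resource {
public:
    virtual ~memory_resource() = default;

    void* allocate(std::size_t bytes,
                   std::size_t alignment = alignof(std::max_align_t)) {
        return ::operator new(bytes, do_allocate(bytes, alignment));
    }

private:
    virtual void* do_allocate(std::size_t bytes, std::size_t alignment) = 0;
};

// Hypothetical bump allocator over a caller-supplied buffer. It assumes the
// buffer itself is aligned at least as strictly as any requested alignment.
class monotonic_buffer_resource : public memory_resource {
public:
    monotonic_buffer_resource(char* buffer, std::size_t size)
        : base_(buffer), size_(size), used_(0) {}

private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        // Round the current offset up to the requested power-of-two alignment.
        std::size_t start = (used_ + alignment - 1) & ~(alignment - 1);
        assert(start + bytes <= size_ && "buffer exhausted");
        used_ = start + bytes;
        return base_ + start;
    }

    char* base_;
    std::size_t size_;
    std::size_t used_;
};

// Recursive builder: every node goes through the virtual call hidden
// inside allocate(), which is the hot path discussed in this report.
Node* make(int depth, monotonic_buffer_resource& store) {
    void* mem = store.allocate(sizeof(Node), alignof(Node));
    Node* node = new (mem) Node{nullptr, nullptr};
    if (depth > 0) {
        node->left = make(depth - 1, store);
        node->right = make(depth - 1, store);
    }
    return node;
}

// Hypothetical helper that counts the nodes of a tree.
int count_nodes(const Node* n) {
    return n ? 1 + count_nodes(n->left) + count_nodes(n->right) : 0;
}
```

Calling `make(3, store)` on a `store` backed by a sufficiently large, suitably aligned buffer builds a complete tree of 15 nodes, with each node costing one trip through `do_allocate`.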
In the good case, the binary produced by GCC, _do_allocate_ is devirtualized and then inlined together with _allocate_. As a result, the _make_ function contains only direct calls to the overriding function, _do_allocate_impl_, and no recursion:

```
make(int, monotonic_buffer_resource&) [clone .constprop.1]:
        push    r12
        push    rbp
        mov     rbp, rdi
        sub     rsp, 8
        mov     rsi, QWORD PTR [rdi+8]
        call    monotonic_buffer_resource::do_allocate_impl(void*)
        mov     rsi, QWORD PTR [rbp+8]
        ...
```
Unfortunately, the assembly produced by Clang is much worse. In the _make_ output, _do_allocate_ is **not** devirtualized to _do_allocate_impl_, and an indirect call through the vtable can be seen:

```
make(int, monotonic_buffer_resource&):  # @make(int, monotonic_buffer_resource&)
        push    rbp
        push    r14
        push    rbx
        mov     rbx, rsi
        mov     ebp, edi
        mov     rax, qword ptr [rsi]
        mov     esi, 16
        mov     edx, 8
        mov     rdi, rbx
        call    qword ptr [rax + 16]
```
        
I'm not 100% sure, but I believe that the missed devirtualization opportunity is what preserves the recursion. In the good case, GCC was able to get rid of the recursion in _make_. That is not the case for Clang, where _make_ still calls itself:

```
        mov     rsi, rbx
        call    make(int, monotonic_buffer_resource&)
        mov     qword ptr [r14], rax
        mov     edi, ebp
        mov     rsi, rbx
        call    make(int, monotonic_buffer_resource&)
```

**Impact on Benchmarks Game Binary-Trees benchmark**

The _allocate_ call is a hotspot in one of the C++ Benchmarks Game programs, Binary-Trees (currently the top one): https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-7.html The relevant difference between the compilers' output (the _make_ function) is easy to spot in the benchmark assembly: https://godbolt.org/z/813nMn7Pd

After building the binarytrees-gpp-7 benchmark (using the exact command from https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-7.html) and running it (in my case on an x86_64 Skylake box), it's clear that the Clang binary is ~60% slower than the GCC binary:

```
[yurai@archlinux release]$ time ./binarytrees-clang 21
stretch tree of depth 22         check: 8388607
...
long lived tree of depth 21      check: 4194303

real    0m4.646s

[yurai@archlinux release]$ time ./binarytrees-g++ 21
stretch tree of depth 22         check: 8388607
...
long lived tree of depth 21      check: 4194303

real    0m2.915s
```

**Potential root cause** 

As far as I can tell, the missed devirtualization is connected to the overriding function not being emitted in CodeGen, just after the frontend parses the AST. It can be narrowed down to CodeGen::CodeGenModule::EmitTopLevelDecl.
When it runs for the virtual _do_allocate_, its callee, CodeGenModule::EmitGlobalDefinition, does not seem to emit any thunks, and later ScalarExprEmitter::VisitCallExpr does not visit the overriding _do_allocate_impl_.

**Workaround attempts**

I couldn't find any easy way to persuade Clang to generate better code (in particular, to force devirtualization of do_allocate) for either the original example or the Binary-Trees benchmark.
Using the 'final' specifier doesn't change anything. Enabling more optimizations via -Ofast/-flto/-flto=thin doesn't help either, which makes sense given that the issue probably has nothing to do with the middle-end.
Maybe the only way is a more extensive code change, but that is something I wanted to avoid.
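
To make the options above concrete, here is a small sketch of where 'final' can be placed and of what a more invasive source change could look like. It is an illustration under assumptions, not the benchmark code: `bump_resource` and `allocate_direct` are hypothetical names, and whether either variant changes Clang's output for the real benchmark is not verified here. The only language fact relied on is that a call through a qualified name (`bump_resource::do_allocate`) does not use virtual dispatch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical base class with the same allocate()/do_allocate() split.
class memory_resource {
public:
    virtual ~memory_resource() = default;
    void* allocate(std::size_t bytes) { return do_allocate(bytes); }

protected:
    virtual void* do_allocate(std::size_t bytes) = 0;
};

// 'final' on the class: nothing can derive from it, so the dynamic type
// behind a bump_resource& is always bump_resource itself.
class bump_resource final : public memory_resource {
public:
    bump_resource(char* buffer, std::size_t size)
        : cur_(buffer), left_(size) {}

    // Hypothetical non-virtual entry point. The qualified name selects
    // bump_resource::do_allocate statically, bypassing the vtable.
    void* allocate_direct(std::size_t bytes) {
        return bump_resource::do_allocate(bytes);
    }

protected:
    // 'final' on the overrider: no later class may override it again.
    void* do_allocate(std::size_t bytes) final {
        assert(bytes <= left_ && "buffer exhausted");
        char* result = cur_;
        cur_ += bytes;
        left_ -= bytes;
        return result;
    }

private:
    char* cur_;
    std::size_t left_;
};
```

Both `allocate` and `allocate_direct` return consecutive chunks of the same buffer; the difference is only in how the call to `do_allocate` is resolved in the source, which is the kind of code change this report was hoping to avoid.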
</pre>