<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/56547>56547</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Missed devirtualization of hot function call makes Binary-Trees C++ benchmark 60% slower than on GCC (Benchmarks Game)
</td>
</tr>
<tr>
<th>Labels</th>
<td>
llvm:codegen,
performance
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
yurai007
</td>
</tr>
</table>
<pre>
**Problem description**
Consider the following C++ snippet: https://godbolt.org/z/7fxz1rhbo. The most important part of the example is the recursive function _make_, which allocates a Node through _allocate_ before calling itself:
`void* mem = store.allocate(sizeof(Node), alignof(Node));`
_allocate_ in turn calls _do_allocate_, which is virtual:
```
void* allocate(size_t bytes, size_t alignment = alignof(max_align_t)) {
return ::operator new(bytes, do_allocate(bytes, alignment));
}
```
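For readers who can't open the Godbolt link, the pattern can be sketched roughly as below. This is a simplified stand-in under my own assumptions, not the exact code from the link: _memory_resource_, _bump_resource_ and _check_ are hypothetical names, and the allocator is a plain bump allocator over a caller-supplied buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical base class: the non-virtual allocate() forwards to the
// virtual do_allocate(), the same shape as the function quoted above.
class memory_resource {
public:
    virtual ~memory_resource() = default;

    void* allocate(std::size_t bytes,
                   std::size_t alignment = alignof(std::max_align_t)) {
        // Placement form of operator new: it just returns the given pointer.
        return ::operator new(bytes, do_allocate(bytes, alignment));
    }

private:
    virtual void* do_allocate(std::size_t bytes, std::size_t alignment) = 0;
};

// Hypothetical bump allocator (assumption: stands in for the resource
// used in the linked snippet).
class bump_resource : public memory_resource {
public:
    bump_resource(void* buffer, std::size_t size)
        : cur_(static_cast<char*>(buffer)), end_(cur_ + size) {}

private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        void* p = cur_;
        std::size_t space = static_cast<std::size_t>(end_ - cur_);
        // std::align bumps p up to the requested alignment if the block fits.
        if (std::align(alignment, bytes, p, space) == nullptr)
            throw std::bad_alloc();
        cur_ = static_cast<char*>(p) + bytes;
        return p;
    }

    char* cur_;
    char* end_;
};

struct Node {
    Node* left;
    Node* right;
};

// Recursive builder: every call goes through allocate() -> do_allocate().
Node* make(int depth, bump_resource& store) {
    assert(depth >= 0);
    void* mem = store.allocate(sizeof(Node), alignof(Node));
    Node* node = ::new (mem) Node{nullptr, nullptr};
    if (depth > 0) {
        node->left = make(depth - 1, store);
        node->right = make(depth - 1, store);
    }
    return node;
}

// Counts the nodes in the tree.
int check(const Node* n) {
    return n ? 1 + check(n->left) + check(n->right) : 0;
}
```

The virtual call sits on the hottest path of _make_, so whether the compiler can resolve it statically decides whether the allocation can be inlined.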
In the good case, the binary produced by GCC, _do_allocate_ is devirtualized and then inlined together with _allocate_. As a result, _make_ contains only direct calls to the overriding function, _do_allocate_impl_, and no recursion:
```
make(int, monotonic_buffer_resource&) [clone .constprop.1]:
push r12
push rbp
mov rbp, rdi
sub rsp, 8
mov rsi, QWORD PTR [rdi+8]
call monotonic_buffer_resource::do_allocate_impl(void*)
mov rsi, QWORD PTR [rbp+8]
...
```
Unfortunately, the assembly produced by Clang is much worse. In the _make_ output, _do_allocate_ is **not** devirtualized to _do_allocate_impl_; an indirect call through the vtable can be seen:
```
make(int, monotonic_buffer_resource&): # @make(int, monotonic_buffer_resource&)
push rbp
push r14
push rbx
mov rbx, rsi
mov ebp, edi
mov rax, qword ptr [rsi]
mov esi, 16
mov edx, 8
mov rdi, rbx
call qword ptr [rax + 16]
```
I'm not 100% sure, but I believe that the missed devirtualization opportunity is what preserves the recursion. In the good case, GCC was able to get rid of the recursion in _make_. That is not the case for Clang, where _make_ still calls itself:
```
mov rsi, rbx
call make(int, monotonic_buffer_resource&)
mov qword ptr [r14], rax
mov edi, ebp
mov rsi, rbx
call make(int, monotonic_buffer_resource&)
```
**Impact on Benchmarks Game Binary-Trees benchmark**
The _allocate_ call is a hotspot in one of the C++ Benchmarks Game programs, Binary-Trees (currently the top one, see https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-7.html ). You can easily spot the relevant difference between the compilers' output (the _make_ function) in the benchmark assembly: https://godbolt.org/z/813nMn7Pd
After building (using the exact command from https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-7.html ) and running the binarytrees-gpp-7 benchmark (in my case on an x86_64 Skylake box), it's clear that the Clang binary is ~60% slower than the GCC binary (4.646s vs. 2.915s wall-clock time):
```
[yurai@archlinux release]$ time ./binarytrees-clang 21
stretch tree of depth 22 check: 8388607
...
long lived tree of depth 21 check: 4194303
real 0m4.646s
[yurai@archlinux release]$ time ./binarytrees-g++ 21
stretch tree of depth 22 check: 8388607
...
long lived tree of depth 21 check: 4194303
real 0m2.915s
```
**Potential root cause**
As far as I can tell, the missed devirtualization is connected to the overriding function not being emitted in CodeGen, just after the frontend parses the AST. It can be narrowed down to CodeGen::CodeGenModule::EmitTopLevelDecl.
When run for the virtual _do_allocate_, its callee CodeGenModule::EmitGlobalDefinition doesn't seem to emit any thunks, and later ScalarExprEmitter::VisitCallExpr doesn't visit the overriding _do_allocate_impl_.
**Workaround attempts**
I couldn't find any easy way to persuade Clang to generate better code (in particular, to force devirtualization of do_allocate) for either the original example or the Binary-Trees benchmark.
Using the 'final' specifier doesn't change anything. Enabling more optimizations via -Ofast/-flto/-flto=thin doesn't help either, which makes sense given that the issue probably has nothing to do with the middle-end.
Maybe the only way is a more extensive code change, but that's something I wanted to avoid.
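To illustrate what the 'final' attempt looks like, here is a reduced sketch. The names _resource_base_, _arena_ and _grab_ are hypothetical, and where exactly 'final' goes (on the class, on the override, or both, as here) is my assumption for the example. In principle 'final' guarantees that no further override can exist, so a call through an _arena&_ could be resolved statically.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical base: non-virtual allocate() forwarding to a virtual hook.
class resource_base {
public:
    virtual ~resource_base() = default;
    void* allocate(std::size_t bytes) { return do_allocate(bytes); }

protected:
    virtual void* do_allocate(std::size_t bytes) = 0;
};

// 'final' on the class and on the override: nothing can derive from arena
// or override do_allocate again.
class arena final : public resource_base {
public:
    arena(char* buffer, std::size_t size)
        : begin_(buffer), cur_(buffer), end_(buffer + size) {}

    // Number of bytes handed out so far.
    std::size_t used() const { return static_cast<std::size_t>(cur_ - begin_); }

private:
    void* do_allocate(std::size_t bytes) override final {
        // Simplification: no alignment handling; returns nullptr when full.
        if (bytes > static_cast<std::size_t>(end_ - cur_)) return nullptr;
        void* p = cur_;
        cur_ += bytes;
        return p;
    }

    char* begin_;
    char* cur_;
    char* end_;
};

// The static type is the final class, so the dynamic type is known here.
void* grab(arena& a, std::size_t bytes) { return a.allocate(bytes); }
```

Whether a given compiler actually turns the call in _grab_ into a direct call has to be checked in the generated assembly.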
</pre>