<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/129764>129764</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
<code>-fzero-call-used-regs</code> should not trigger before tail calls
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
nelhage
</td>
</tr>
</table>
<pre>
I believe that `-fzero-call-used-regs` should be modified to not clear registers prior to a tail call. Here's my reasoning:

Since `clang::musttail` landed, there has been a trend toward using indirect tail calls to implement efficient interpreters and parsers; see [the original post about protobuf][proto-tc] and [CPython's new tail-call interpreter][python-tc]. This pattern is, in part, an alternative to using computed gotos to implement dispatch within a single large interpreter function.

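A minimal sketch of the tail-call style, assuming a hypothetical four-opcode bytecode (`OP_HALT`, `OP_INC`, `OP_DEC`, `OP_DOUBLE`) and illustrative names (`dispatch_table`, `DISPATCH`, `run_tailcall`) that do not come from protobuf or CPython. Every handler shares one signature and ends by tail-calling the next handler through a table of function pointers. The `MUSTTAIL` macro is also an assumption of this sketch: under clang it uses the GNU spelling of `clang::musttail`, and elsewhere it expands to nothing, so the call may or may not become a jump.

```c
/* Enable musttail only under clang (the compiler this issue is about);
   elsewhere the macro is empty and the compiler may or may not turn the
   call into a jump on its own. */
#if defined(__clang__) && defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

/* Hypothetical toy bytecode: one byte per opcode. */
enum { OP_HALT, OP_INC, OP_DEC, OP_DOUBLE };

/* Every handler has the same signature, as musttail requires. */
typedef long (*handler_fn)(const unsigned char *pc, long acc);

static long op_halt(const unsigned char *pc, long acc);
static long op_inc(const unsigned char *pc, long acc);
static long op_dec(const unsigned char *pc, long acc);
static long op_double(const unsigned char *pc, long acc);

static const handler_fn dispatch_table[] = {
    [OP_HALT] = op_halt,
    [OP_INC] = op_inc,
    [OP_DEC] = op_dec,
    [OP_DOUBLE] = op_double,
};

/* Look up the handler for the opcode at pc and tail-call it. On x86-64 this
   can compile to an indirect jump such as `jmpq *(%REG1, %REG2, 8)`. */
#define DISPATCH(pc, acc) \
    MUSTTAIL return dispatch_table[*(pc)]((pc), (acc))

/* OP_HALT ends the program and returns the accumulator. */
static long op_halt(const unsigned char *pc, long acc) {
    (void)pc;
    return acc;
}

static long op_inc(const unsigned char *pc, long acc) {
    DISPATCH(pc + 1, acc + 1);
}

static long op_dec(const unsigned char *pc, long acc) {
    DISPATCH(pc + 1, acc - 1);
}

static long op_double(const unsigned char *pc, long acc) {
    DISPATCH(pc + 1, acc * 2);
}

/* Entry point: start dispatching at the first opcode with acc = 0.
   Opcodes are not validated; a byte outside the table is undefined. */
long run_tailcall(const unsigned char *prog) {
    return dispatch_table[*prog](prog, 0);
}
```

For example, `run_tailcall` on the bytes `{OP_INC, OP_INC, OP_DOUBLE, OP_DEC, OP_HALT}` returns 3. The entry point itself is an ordinary call because its signature differs from the handlers'.
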
In both cases (computed gotos and indirect tail calls), each opcode or parser-state handler compiles to fairly similar code that ends with an indirect jump through a dispatch table. Depending on compiler choices, on x86 this becomes something like `jmpq *%REG` or `jmpq *(%REG1, %REG2, 8)`.

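For comparison, a sketch of computed-goto dispatch for a hypothetical four-opcode bytecode (`OP_HALT`, `OP_INC`, `OP_DEC`, `OP_DOUBLE`); `run_goto`, `DISPATCH`, and the label names are illustrative. The whole interpreter lives in one function, and `&&label` / `goto *ptr` are the GNU C "labels as values" extension, which both GCC and clang support. The `goto *labels[*pc]` is the indirect jump that plays the role of the tail call.

```c
/* Hypothetical toy bytecode: one byte per opcode. */
enum { OP_HALT, OP_INC, OP_DEC, OP_DOUBLE };

/* The whole interpreter is one function; dispatch jumps between labels
   using the GNU C "labels as values" extension (&&label, goto *ptr). */
long run_goto(const unsigned char *prog) {
    static void *const labels[] = {
        [OP_HALT] = &&do_halt,
        [OP_INC] = &&do_inc,
        [OP_DEC] = &&do_dec,
        [OP_DOUBLE] = &&do_double,
    };
    const unsigned char *pc = prog;
    long acc = 0;

/* Jump to the label for the opcode at pc; this indirect jump takes the
   place of the tail call. Opcodes are not validated. */
#define DISPATCH() goto *labels[*pc]

    DISPATCH();

do_inc:
    acc += 1;
    pc++;
    DISPATCH();

do_dec:
    acc -= 1;
    pc++;
    DISPATCH();

do_double:
    acc *= 2;
    pc++;
    DISPATCH();

do_halt:
    return acc;

#undef DISPATCH
}
```

On the bytes `{OP_INC, OP_INC, OP_DOUBLE, OP_DEC, OP_HALT}`, `run_goto` returns 3. Compiling this function and a tail-call equivalent with `-O2 -fzero-call-used-regs=used-gpr` and comparing the assembly is one way to check how each style is treated.
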
With `-fzero-call-used-regs` enabled, clang/LLVM currently emit `xor`s that clear call-used registers before the indirect tail call, but not before a computed goto, even one that produces nearly identical machine code ([example on Godbolt](https://godbolt.org/z/dxh754E49), showing the stylized core of an interpreter loop).

Such interpreter loops tend to be extreme hot spots. With CPython's tail-call interpreter, I measured the cost of applying `-fzero-call-used-regs=used-gpr` to **only** the opcode functions at about 2% on [the pyperformance suite](https://github.com/python/pyperformance/). It seems surprising and "unfair" to impose this cost on the tail-call style of interpreter but not on the computed-goto style when, again, they emit very similar machine code containing similar indirect jumps (and potential JOP gadgets).

Also, GCC's implementation already behaves the way I describe: it elides the clearing before tail calls. See this [Godbolt example](https://godbolt.org/z/3KTYzWoWb); if you remove the `clang::musttail` and add `-fno-optimize-sibling-calls` to the GCC options, the `xor`s reappear.

[proto-tc]: https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html
[python-tc]: https://github.com/python/cpython/pull/128718
</pre>