<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/55153>55153</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Bad codegen for crc32c
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
MatzeB
</td>
</tr>
</table>
<pre>
crc32c (https://github.com/htot/crc32c) provides an optimized CRC32 computation (with a focus on x86 / SSE 4.2). Variants of the code are used in projects like RocksDB, LevelDB or Folly. The code produced by clang-12 is 2x slower than the code produced by GCC. I am filing this to document my work on the issue.
The critical part of the code is a "Duff's device" that looks roughly like this:
```
#define CRCtriplet(crc, buf, offset) \
crc ## 0 = __builtin_ia32_crc32di(crc ## 0, *(buf ## 0 + offset)); \
crc ## 1 = __builtin_ia32_crc32di(crc ## 1, *(buf ## 1 + offset)); \
crc ## 2 = __builtin_ia32_crc32di(crc ## 2, *(buf ## 2 + offset));
...
switch ( block_size ) {
case 128:
do {
CRCtriplet ( crc, next, -128 ); // jumps here for a full block of len 128
case 127:
CRCtriplet ( crc, next, -127 ); // jumps here or below for the first block smaller
case 126:
CRCtriplet ( crc, next, -126 );
...
case 1:
...
if (...) { block_size = 128; }
default:
;
} while ( n > 0 );
}
```
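For reference, here is a minimal self-contained sketch of the same control-flow pattern, shrunk to a block size of 4 so that every case label is visible. It is an illustration only, not the code from the crc32c sources: `crc_step` is a software stand-in for `__builtin_ia32_crc32di` (a bitwise CRC32C update over one 64-bit word), and `crc3_duff` and `crc3_reference` are hypothetical names introduced here. The switch jumps into the middle of the do-while loop for the first, possibly shorter, block, and every later iteration runs the full block from the top.
```c
typedef unsigned long long u64; /* assumed to be 64 bits wide */

/* Assumption: software stand-in for __builtin_ia32_crc32di. Folds one
   64-bit word into a CRC32C value, least significant bit first. */
static u64 crc_step(u64 crc, u64 word) {
    u64 x = crc ^ word;
    for (int i = 0; i < 64; i++)
        x = (x >> 1) ^ ((x & 1) ? 0x82F63B78ULL : 0);
    return x;
}

#define CRCtriplet(crc, buf, offset) \
    crc ## 0 = crc_step(crc ## 0, *(buf ## 0 + (offset))); \
    crc ## 1 = crc_step(crc ## 1, *(buf ## 1 + (offset))); \
    crc ## 2 = crc_step(crc ## 2, *(buf ## 2 + (offset)));

/* Updates three independent CRCs over n words from each of three streams.
   The nextN pointers always point at the end of the current block, so the
   offsets are negative, as in the snippet above. */
static void crc3_duff(const u64 *s0, const u64 *s1, const u64 *s2,
                      unsigned long n, u64 crc[3]) {
    if (n == 0)
        return;
    u64 crc0 = crc[0], crc1 = crc[1], crc2 = crc[2];
    unsigned long block = (n % 4) ? (n % 4) : 4; /* first block may be short */
    unsigned long left = n;
    const u64 *next0 = s0 + block, *next1 = s1 + block, *next2 = s2 + block;

    switch (block) {
    case 4:
        do {
            CRCtriplet(crc, next, -4); /* entry point for a full block */
    case 3:
            CRCtriplet(crc, next, -3); /* entry points for a short first block */
    case 2:
            CRCtriplet(crc, next, -2);
    case 1:
            CRCtriplet(crc, next, -1);
            left -= block;
            if (left > 0) { /* all remaining blocks are full */
                block = 4;
                next0 += 4; next1 += 4; next2 += 4;
            }
        } while (left > 0);
    }
    crc[0] = crc0; crc[1] = crc1; crc[2] = crc2;
}

/* Straightforward loop used to cross-check the unrolled version. */
static void crc3_reference(const u64 *s0, const u64 *s1, const u64 *s2,
                           unsigned long n, u64 crc[3]) {
    for (unsigned long i = 0; i < n; i++) {
        crc[0] = crc_step(crc[0], s0[i]);
        crc[1] = crc_step(crc[1], s1[i]);
        crc[2] = crc_step(crc[2], s2[i]);
    }
}
```
In this sketch the three accumulators `crc0`, `crc1` and `crc2` are live across the whole loop, just like in the real code, which is why one would expect them to stay in registers for the duration of the loop.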
Code produced by clang-12 looks like this:
```
# switch-bb:
movl $1, %eax
movq %rax, 2016(%rsp) # 8-byte Spill
movl $2, %eax
movq %rax, 2048(%rsp) # 8-byte Spill
movl $3, %eax
movq %rax, 2040(%rsp) # 8-byte Spill
...
xorl %eax, %eax
movq %rax, -112(%rsp) # 8-byte Spill
xorl %eax, %eax
movq %rax, 1040(%rsp) # 8-byte Spill
xorl %eax, %eax
movq %rax, 1032(%rsp) # 8-byte Spill
...
# and for each switch-case we get this pattern:
.LBB2_11:
movq -128(%rsp), %rax # 8-byte Reload
crc32q -1016(%rax), %rcx
movq -112(%rsp), %rax # 8-byte Reload
crc32q -1016(%r15), %rax
crc32q -1016(%r10), %rdi
movq %rax, 1040(%rsp) # 8-byte Spill
movq -104(%rsp), %rax # 8-byte Reload
movq %rax, 1048(%rsp) # 8-byte Spill
```
The code should be generated without any of those spills and reloads.
</pre>