<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/68810>68810</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Regression: no longer generating `vdpbf16ps` with `m32bcst` RHS
</td>
</tr>
<tr>
<th>Labels</th>
<td>
backend:X86,
llvm:codegen,
regression,
performance
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
bjacob
</td>
</tr>
</table>
<pre>
This is a performance regression in the generated code: a peephole optimization that targets a useful instruction is no longer applied. Compiler Explorer: https://godbolt.org/z/xsGGsov5W
Summary:
AVX-512 has multiply-accumulate instructions with a broadcast-scalar memory-operand variant for the RHS operand ("m32bcst"). These are not reflected 1:1 in intrinsics; instead, the expectation is that the compiler will turn a sequence of intrinsics (broadcast RHS, then vector-to-vector FMA) into this instruction. This issue is about that peephole optimization no longer happening.
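For readers less familiar with the broadcast form, here is a rough scalar sketch of what `vdpbf16ps` with an `m32bcst` RHS is expected to compute. The function names are hypothetical and not part of any library, and the model uses ordinary `float` arithmetic, so it does not reproduce the instruction's exact rounding or denormal handling.

```c
#include <stdint.h>
#include <string.h>

/* Widen one bf16 value (the upper 16 bits of an IEEE-754 binary32) to float. */
static float bf16_to_f32(uint16_t v) {
    uint32_t bits = (uint32_t)v << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical scalar model of vdpbf16ps with a broadcast RHS.
 *   acc: 16 float lanes
 *   lhs: 32 bf16 values, two per 32-bit lane
 *   rhs: one 32-bit memory element (two bf16 values) reused by every lane
 * Assumption: plain float math, not the hardware's exact rounding. */
static void dpbf16_ps_broadcast_rhs_ref(float acc[16],
                                        const uint16_t lhs[32],
                                        const uint16_t rhs[2]) {
    for (int lane = 0; lane < 16; ++lane) {
        acc[lane] += bf16_to_f32(lhs[2 * lane]) * bf16_to_f32(rhs[0]);
        acc[lane] += bf16_to_f32(lhs[2 * lane + 1]) * bf16_to_f32(rhs[1]);
    }
}
```

The point of the `m32bcst` encoding is that the same 32-bit `rhs` element feeds all 16 lanes directly from memory, so no separate broadcast into a vector register is needed.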
This is a follow-up to the recently fixed Issue #68117, which was about a compiler crash on the same source code. Now that it is fixed, we can test again, but because Clang 16 and 17 crash on this testcase, the regression window cannot easily be narrowed beyond "after Clang 15".
Results by Clang version:
Clang version | result
--- | ---
Clang 15 | Generates optimal code
Clang 16 | Compiler crash (#68117)
Clang 17 | Compiler crash (#68117)
Clang 18 (trunk after #68117 fixed) | Generates sub-optimal code
Testcase (see it in Compiler Explorer: https://godbolt.org/z/xsGGsov5W)
```c
#include <immintrin.h>
#include <stdint.h>
static __m512bh bitcast_16xf32_to_32xbf16(__m512 a) {
  return *(const __m512bh*)(&a);
}
__m512 iree_mm512_dpbf16_ps_broadcast_rhs(
    __m512 acc, __m512bh lhs, const uint16_t* rhs) {
  return _mm512_dpbf16_ps(acc, lhs,
                          bitcast_16xf32_to_32xbf16(_mm512_set1_ps(
                              *(const float*)rhs)));
}
```
Compile with these flags: `-O2 -mavx512f -mavx512bf16`
Sub-optimal result with current trunk (Clang 18):
```asm
iree_mm512_dpbf16_ps_broadcast_rhs: # @iree_mm512_dpbf16_ps_broadcast_rhs
        vbroadcastss zmm2, dword ptr [rdi]
        vdpbf16ps zmm0, zmm1, zmm2
        ret
```
Optimal result with Clang 15:
```asm
iree_mm512_dpbf16_ps_broadcast_rhs: # @iree_mm512_dpbf16_ps_broadcast_rhs
        vdpbf16ps zmm0, zmm1, dword ptr [rdi]{1to16}
        ret
```
@RKSimon @alexey-bataev
</pre>