<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/68117>68117</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Regression: Infinite recursion in x86-64 AVX-512 shuffle optimization
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:X86,
            clang:codegen,
            regression,
            crash-on-valid
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          bjacob
      </td>
    </tr>
</table>

<pre>
# Summary

A minimized, valid C testcase using AVX-512 intrinsics crashes Clang with infinite recursion at optimization levels >= `-O1`.

# Regression window:

(See Compiler Explorer experiment: https://godbolt.org/z/zbGoT3Gc9).
- Crashes in Clang >= 16 (including current trunk).
- Worked in Clang <= 15.

# Minimized testcase

Compiler Explorer link:
https://godbolt.org/z/zbGoT3Gc9

```c
#include <immintrin.h>
#include <stdint.h>

static __m512bh bitcast_16xf32_to_32xbf16(__m512 a) {
  return *(const __m512bh *)(&a);
}

static __m512bh load_32xbf16(const uint16_t *src) {
  return bitcast_16xf32_to_32xbf16(_mm512_loadu_ps((const float *)src));
}

static __m512bh broadcast_load_2xbf16(const uint16_t *src) {
  return bitcast_16xf32_to_32xbf16(_mm512_set1_ps(*(const float *)src));
}

void dotprod_16x2xbf16_times_broadcasted_2xbf16_into_16xf32(
    float *out_ptr, const uint16_t *lhs_ptr, const uint16_t *rhs_ptr) {
  __m512 acc = _mm512_loadu_ps(out_ptr);

  __m512bh rhs = load_32xbf16(rhs_ptr);
  rhs_ptr += 32;
  acc = _mm512_dpbf16_ps(acc, rhs, broadcast_load_2xbf16(lhs_ptr));
  lhs_ptr += 2;

  _mm512_storeu_ps(out_ptr, acc);
}
```

Compile with these flags:

```
clang -O3 -mavx -mavx2 -mfma -mf16c -mavx512f -mavx512vl -mavx512cd -mavx512bw -mavx512dq -mavx512bf16
```

Note: the crash already reproduces with `-emit-llvm`, so it's not specific to object-code generation.

# Explanation of the testcase

AVX-512-BF16 adds the `vdpbf16ps` instruction, which has a variant where the RHS operand is an `m32bcst`: a 32-bit memory operand that the instruction broadcasts across all 32-bit lanes. With intrinsics, this variant is requested by passing the result of a broadcast as that operand, which is the intent of this C code. It's a very typical pattern, and it works perfectly with Clang <= 15, as shown by the Compiler Explorer link above (https://godbolt.org/z/zbGoT3Gc9).
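
For readers unfamiliar with the instruction, here is a rough scalar sketch of what the broadcast pattern above is meant to compute. The names `bf16_to_f32` and `dpbf16_broadcast_ref` are hypothetical helpers, not part of any header, and the sketch ignores the instruction's exact rounding and denormal behavior, so treat it as an approximation of the arithmetic only.

```c
#include <stdint.h>
#include <string.h>

/* Widen one bf16 value (given as its raw 16 bits) to float.
   bf16 is the upper 16 bits of an IEEE-754 binary32. */
float bf16_to_f32(uint16_t bits) {
  uint32_t u = (uint32_t)bits << 16;
  float f;
  memcpy(&f, &u, sizeof f);
  return f;
}

/* Hypothetical scalar model of the broadcast form: the same pair of
   bf16 values b[0], b[1] is applied to every one of the 16 lanes.
   Each lane accumulates a 2-element dot product into an f32.
   Assumption: plain float arithmetic stands in for the instruction's
   own rounding rules. */
void dpbf16_broadcast_ref(float acc[16], const uint16_t a[32],
                          const uint16_t b[2]) {
  for (int lane = 0; lane < 16; ++lane) {
    acc[lane] += bf16_to_f32(a[2 * lane + 1]) * bf16_to_f32(b[1]);
    acc[lane] += bf16_to_f32(a[2 * lane]) * bf16_to_f32(b[0]);
  }
}
```

In the testcase, `a` corresponds to the 32 bf16 values loaded by `load_32xbf16` and `b` to the two bf16 values read by `broadcast_load_2xbf16`.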

# Backtrace in LLDB shows infinite recursion in codegen:

```
(lldb) bt 100
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f5ffff8)
  * frame #0: 0x000000010011d834 clang-18`llvm::TargetLoweringBase::getValueType(this=0x00000001281183c0, DL=0x000000011d4312e0, Ty=<unavailable>, AllowUnknown=<unavailable>) const at TargetLowering.h:1567
    frame #1: 0x000000010055f35c clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getTypeLegalizationCost(this=0x000000011d2257d8, Ty=0x000000011dc73d28) const at BasicTTIImpl.h:822:25
    frame #2: 0x000000010056ecfc clang-18`llvm::X86TTIImpl::getVectorInstrCost(this=0x000000011d2257d8, Opcode=62, Val=0x000000011dc73d28, CostKind=TCK_RecipThroughput, Index=0, Op0=0x0000000000000000, Op1=0x0000000000000000) at X86TargetTransformInfo.cpp:4381:42
    frame #3: 0x0000000100580c18 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getPermuteShuffleOverhead(this=0x000000011d2257d8, VTy=0x000000011dc73d28, CostKind=TCK_RecipThroughput) at BasicTTIImpl.h:117:24
    frame #4: 0x0000000100568710 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, Tp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f600610, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f600600) at BasicTTIImpl.h:995:16
    frame #5: 0x0000000100567e7c clang-18`llvm::X86TTIImpl::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, BaseTp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f600e30, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f600e20) at X86TargetTransformInfo.cpp:2080:17
    frame #6: 0x000000010056f2a0 clang-18`llvm::X86TTIImpl::getVectorInstrCost(this=0x000000011d2257d8, Opcode=62, Val=0x000000011dc73d28, CostKind=TCK_RecipThroughput, Index=1, Op0=0x0000000000000000, Op1=0x0000000000000000) at X86TargetTransformInfo.cpp:4465:21
    frame #7: 0x0000000100580c18 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getPermuteShuffleOverhead(this=0x000000011d2257d8, VTy=0x000000011dc73d28, CostKind=TCK_RecipThroughput) at BasicTTIImpl.h:117:24
    frame #8: 0x0000000100568710 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, Tp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f6013d0, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f6013c0) at BasicTTIImpl.h:995:16
    frame #9: 0x0000000100567e7c clang-18`llvm::X86TTIImpl::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, BaseTp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f601bf0, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f601be0) at X86TargetTransformInfo.cpp:2080:17
    frame #10: 0x000000010056f2a0 clang-18`llvm::X86TTIImpl::getVectorInstrCost(this=0x000000011d2257d8, Opcode=62, Val=0x000000011dc73d28, CostKind=TCK_RecipThroughput, Index=1, Op0=0x0000000000000000, Op1=0x0000000000000000) at X86TargetTransformInfo.cpp:4465:21
    frame #11: 0x0000000100580c18 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getPermuteShuffleOverhead(this=0x000000011d2257d8, VTy=0x000000011dc73d28, CostKind=TCK_RecipThroughput) at BasicTTIImpl.h:117:24
    frame #12: 0x0000000100568710 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, Tp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f602190, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f602180) at BasicTTIImpl.h:995:16
```
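
The repeating frames suggest a cycle of the form `getVectorInstrCost` -> `getShuffleCost` -> `getPermuteShuffleOverhead` -> `getVectorInstrCost`. The toy model below is not LLVM code: every function is a hypothetical stand-in named after a frame, the two `getShuffleCost` frames are collapsed into one function, and a depth limit is added so the sketch terminates. It assumes, based only on the trace, that the `Index=0` query returns directly while the `Index=1` query asks for a shuffle cost.

```c
/* Toy model of the call cycle seen in the backtrace. Hypothetical
   stand-ins only; the real cost functions take many more parameters. */
enum { DEPTH_LIMIT = 64 };
int depth_limit_hit = 0;

int toy_shuffle_cost(int depth);

/* Stand-in for getVectorInstrCost. */
int toy_vector_instr_cost(int index, int depth) {
  if (depth >= DEPTH_LIMIT) { /* guard that the real code path lacks */
    depth_limit_hit = 1;
    return 0;
  }
  if (index == 0)
    return 1; /* assumed: index 0 needs no shuffle */
  return 1 + toy_shuffle_cost(depth + 1); /* index 1 asks for a shuffle cost */
}

/* Stand-in for getPermuteShuffleOverhead: sums per-element costs. */
int toy_permute_shuffle_overhead(int depth) {
  int cost = 0;
  for (int index = 0; index < 2 && !depth_limit_hit; ++index)
    cost += toy_vector_instr_cost(index, depth);
  return cost;
}

/* Stand-in for getShuffleCost falling back to the generic overhead. */
int toy_shuffle_cost(int depth) {
  return toy_permute_shuffle_overhead(depth);
}
```

Calling `toy_vector_instr_cost(1, 0)` keeps re-entering itself through the other two functions until the depth limit trips, which is the shape of the recursion in the trace.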
</pre>
H7Zoi27Vfz3Gn16tkf8c1NnF-uF8SvUzuRCCyXn9JFMe_LAQDNzfUlsDr_tf10PcVo_m3nYntlXzakR-k-6Bhcu5tXL3EXN6uKJn0nfAJsEJsTiPgyDmwXdU5n2bfsX_3NKp7NR_v46Pgem0szvzv4LqUMhniZvNIkpffCFx0Xnipjh9-8L_MYmhKvhBycHwZ0wOijepjAgS2mr5pW0oPs9ULuQrS-wnlWj-P5XoSUzVLPiF4E3_lujkLyTRPMx-QhXg3eH2myV69ieTaJ7mP2NyUvyvJZqfPSf9rdFv12h-_qD1lxNpfuFB7E-r0oLPfkIhEDx5o0ofXulcZddhNgtn8gqveTyLw0nMo-iquI7jaYgqSvMIZ1mkVCbjfDpJEGOeZEEWX-lrEZBuBCGPeBRF4zRMJlkyneI0DCYSAzYJsJK6HBP-sbGbK-1ci9dxwvn0qpQplq77fihEKtUDUhQp9UyQJDAhusJh4Xz_Vmrfbg-f7I5DrXTFyNSj7qMiNUfLK3tNnkdpu3FsEpTaeXfE4rUv8fr49Y-K9-7iS7GnJB7Fk8MnSteX4bOvjVetLa9fvKfTvmjTsTIVE7ddBvufUWPNF1SeidsuHI6J2y4i_w4AAP__Nc-TjQ">