<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/62026>62026</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Bad replacement of `vpmaddwd` + `vpaddd` with `vpdpwssd`.
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          TheBlackPlague
      </td>
    </tr>
</table>

<pre>
Consider the following instructions:
```asm
vpminsw ymm3,ymm0,YMMWORD PTR [rcx+0x20]
vpmaxsw ymm3,ymm3,ymm1
vpmaddwd ymm3,ymm3,YMMWORD PTR [r8+0x20]
vpaddd ymm2,ymm2,ymm3
```

The sequence loads a `YMMWORD` (16x `int16`) from memory, takes the lane-wise signed minimum with `ymm0`, and stores it in `ymm3`. It then takes the lane-wise signed maximum of `ymm3` and `ymm1`, again storing the result in `ymm3`. Next, it multiplies `ymm3` lane-wise by a `YMMWORD` from another address and adds adjacent pairs of products, so that `ymm3` now holds 8x `int32` instead of 16x `int16`. Finally, it adds `ymm3` to `ymm2` lane-wise and stores the result in `ymm2`.
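
To make the data flow concrete, here is a minimal scalar sketch of the four instructions in Python. It is a model for illustration only, not what the compiler emits. The helper names mirror the mnemonics, while `unfused_step`, `upper`, and `lower` are hypothetical names, and treating `ymm0` and `ymm1` as clamp bounds is an assumption about the surrounding code.

```python
def wrap32(x):
    """Wrap an integer to signed 32-bit two's complement."""
    return (x + 2**31) % 2**32 - 2**31


def vpminsw(a, b):
    """Lane-wise signed minimum of two int16 vectors."""
    return [min(x, y) for x, y in zip(a, b)]


def vpmaxsw(a, b):
    """Lane-wise signed maximum of two int16 vectors."""
    return [max(x, y) for x, y in zip(a, b)]


def vpmaddwd(a, b):
    """Multiply int16 lanes, then add each adjacent pair of products into one int32 lane."""
    return [wrap32(a[i] * b[i] + a[i + 1] * b[i + 1]) for i in range(0, len(a), 2)]


def vpaddd(a, b):
    """Lane-wise 32-bit addition with wraparound."""
    return [wrap32(x + y) for x, y in zip(a, b)]


def unfused_step(acc, inputs, weights, upper, lower):
    """Model of the four-instruction sequence; `acc` plays the role of ymm2."""
    t = vpminsw(upper, inputs)   # vpminsw ymm3, ymm0, [rcx+0x20]
    t = vpmaxsw(t, lower)        # vpmaxsw ymm3, ymm3, ymm1
    t = vpmaddwd(t, weights)     # vpmaddwd ymm3, ymm3, [r8+0x20]
    return vpaddd(acc, t)        # vpaddd ymm2, ymm2, ymm3


# Four int16 lanes instead of sixteen, to keep the example readable.
print(unfused_step([10, 20], [-5, 3, 200, 7], [2, 4, 1, -1], [127] * 4, [0] * 4))
# prints [22, 140]
```

The inputs are clamped to `[0, 3, 127, 7]`, the pairwise products sum to `[12, 120]`, and adding the accumulator gives `[22, 140]`.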

This is a simple part of a neural network's forward propagation. See the [uiCA uops code analyzer](https://bit.ly/3ZUJUd5) output for the instructions above. The sequence requires only `AVX2`.

On machines that support the `AVX-VNNI` instruction set, Clang (LLVM) 16.0 replaces that sequence with the following:
```asm
vpminsw ymm3,ymm0,YMMWORD PTR [rcx+0x20]
vpmaxsw ymm3,ymm3,ymm1
vpdpwssd ymm2,ymm3,YMMWORD PTR [r8+0x20]
```
It uses the `vpdpwssd` instruction from the `AVX-VNNI` instruction set, which fuses the work of `vpmaddwd` and `vpaddd` into a single instruction and so replaces both.
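
As a sanity check that the fused form computes the same values, the following Python sketch models `vpdpwssd` per 32-bit lane and compares it with `vpmaddwd` followed by `vpaddd`. It assumes the non-saturating semantics of `vpdpwssd` (wraparound modulo 2^32, unlike `vpdpwssds`). The test vectors are made up for illustration.

```python
def wrap32(x):
    """Wrap an integer to signed 32-bit two's complement."""
    return (x + 2**31) % 2**32 - 2**31


def vpmaddwd(a, b):
    """Multiply int16 lanes, then add each adjacent pair of products into one int32 lane."""
    return [wrap32(a[i] * b[i] + a[i + 1] * b[i + 1]) for i in range(0, len(a), 2)]


def vpaddd(a, b):
    """Lane-wise 32-bit addition with wraparound."""
    return [wrap32(x + y) for x, y in zip(a, b)]


def vpdpwssd(acc, a, b):
    """Fused form: accumulator plus both products, wrapped once (assumed non-saturating)."""
    return [
        wrap32(acc[i // 2] + a[i] * b[i] + a[i + 1] * b[i + 1])
        for i in range(0, len(a), 2)
    ]


# Illustrative values, including lanes that overflow 32 bits.
acc = [1, -7, 2**31 - 1, 5]
a = [3, -4, 32767, 32767, 100, 200, -32768, -32768]
b = [5, 6, 32767, -32768, 300, 400, -32768, -32768]

fused = vpdpwssd(acc, a, b)
unfused = vpaddd(acc, vpmaddwd(a, b))
print(fused == unfused)
# prints True
```

Because both paths reduce modulo 2^32, the results agree even where an intermediate sum wraps, so the concern here is performance, not correctness.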

The [Intel Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) lists `vpdpwssd` with a latency of 5 and a reciprocal throughput of 0.5, the same as `vpmaddwd`. Since the non-fused version also needs `vpaddd`, which has a latency of 1 and a reciprocal throughput of 0.33, the fused instruction should in theory finish first. So it is a good replacement... right?

**It is not a good replacement!** See the [uiCA uops code analyzer](https://bit.ly/43hpSw8) output for this version. In an otherwise inexpensive loop, this replacement slows the code down by 100%, effectively doubling its run time. What can be done about this?
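
One possible explanation, offered as an assumption and not a confirmed diagnosis, is the loop-carried dependency on the accumulator `ymm2`. In the non-fused version only `vpaddd` reads and writes `ymm2`, whereas in the fused version `vpdpwssd` does. The Python sketch below adds up the latencies quoted above under that assumption. It ignores throughput limits, loads, and the min/max instructions, so it does not try to reproduce the measured 2x figure.

```python
# Latencies in cycles, as quoted in the report from the Intel Intrinsics Guide.
# Assumption: the accumulator operand of vpdpwssd sees the full 5-cycle latency.
LATENCY = {"vpmaddwd": 5, "vpaddd": 1, "vpdpwssd": 5}

# Instructions a value passes through between the clamped input and the updated ymm2.
SINGLE_PASS = {
    "unfused": ["vpmaddwd", "vpaddd"],
    "fused": ["vpdpwssd"],
}

# Instructions that both read and write the accumulator ymm2 (the loop-carried chain).
LOOP_CARRIED = {
    "unfused": ["vpaddd"],
    "fused": ["vpdpwssd"],
}


def chain_latency(chain):
    """Sum the latencies of a serial chain of instructions."""
    return sum(LATENCY[name] for name in chain)


for version in ("unfused", "fused"):
    print(version,
          "single pass:", chain_latency(SINGLE_PASS[version]),
          "loop-carried:", chain_latency(LOOP_CARRIED[version]))
# prints:
# unfused single pass: 6 loop-carried: 1
# fused single pass: 5 loop-carried: 5
```

Under these assumptions the fused form wins on a single pass (5 cycles against 6), but each iteration must wait 5 cycles for the previous `ymm2` where the non-fused form waits 1, which would make the fused loop latency-bound.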
</pre>