<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/54889">54889</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Inaccuracies in the znver1 scheduling model: `vpmov*`, `vtestp*`, `vps*v*`, `vcmp*`
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          fabian-r
      </td>
    </tr>
</table>

<pre>
    We encountered several more inaccuracies in the znver1 scheduling model:

- The `vpmov(s|z)x(b|w|q)` instructions that write to ymm registers are predicted by llvm-mca to be faster than they actually run, e.g. (all numbers are inverse throughputs):
```
# LLVM-EXEGESIS-DEFREG YMM10 42
# LLVM-EXEGESIS-DEFREG YMM7 42
.intel_syntax noprefix
vpmovsxbw ymm10, xmm7  # llvm-mca: 0.5cy, llvm-exegesis: 2.0cy (3 uops)
```

The model seems to reuse the scheduling information for the xmm version, which is faster according to [uops.info](https://uops.info/table.html?search=vpmovsxbw&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ZENp=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_avx=on&cb_avx2=on).
AMD's table doesn't include the ymm versions of these instructions.

- Similarly, the `vtestp(s|d)` instructions with ymm operands are predicted by llvm-mca to be faster than they actually run, e.g.:
```
# LLVM-EXEGESIS-DEFREG YMM10 42
# LLVM-EXEGESIS-DEFREG YMM7 43
.intel_syntax noprefix
vtestps ymm10, ymm7  # llvm-mca: 0.28cy, llvm-exegesis: 2.0cy (3 uops)
```
For the xmm version, llvm-mca predicts the same value, whereas llvm-exegesis measures an inverse throughput of 1.0.
The AMD table claims a throughput of 2, i.e. an inverse throughput of 0.5, which agrees with neither of those.
[uops.info](https://uops.info/table.html?search=vtestp&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ZENp=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_bmi=on&cb_fma=on&cb_mmx=on&cb_sse=on) agrees with the llvm-exegesis measurements.
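
To make the units explicit, here is a minimal sketch of the conversion between the throughput that the AMD table lists (instructions per cycle) and the inverse throughput that llvm-mca and llvm-exegesis report (cycles per instruction). The helper name `inverse_throughput` is our own and not part of any LLVM tool; the numbers are the ones quoted above for the xmm form of `vtestps`.

```python
from fractions import Fraction


def inverse_throughput(throughput):
    """Cycles per instruction for a throughput given in instructions per cycle."""
    return 1 / Fraction(throughput)


# Numbers quoted above for the xmm form of vtestps.
amd_table = inverse_throughput(2)  # AMD table: throughput of 2
measured = Fraction(1)             # llvm-exegesis: inverse throughput of 1.0

print(float(amd_table))             # cycles per instruction implied by the AMD table
print(float(measured / amd_table))  # how much slower the measurement is than the table
```

With these inputs, the AMD table implies 0.5 cycles per instruction, so the measured 1.0 cycles is a factor of two slower than the table claims.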

- The AVX2 variable shifts `vps(llvd|llvq|ravd|ravq|rlvd|rlvq)` with three register operands, or with two register operands and a memory operand, are predicted too fast by llvm-mca, e.g.:
```
# LLVM-EXEGESIS-DEFREG XMM13 43
# LLVM-EXEGESIS-DEFREG XMM15 42
# LLVM-EXEGESIS-LIVEIN RDI
.intel_syntax noprefix
vpsllvd xmm13, xmm15, xmmword ptr [rdi+42]  # llvm-mca: 0.6cy, llvm-exegesis: 2.0cy (1*ZnFPU1)
```
and
```
# LLVM-EXEGESIS-DEFREG YMM13 43
# LLVM-EXEGESIS-DEFREG YMM15 42
# LLVM-EXEGESIS-DEFREG YMM7 8
.intel_syntax noprefix
vpsllvd ymm13, ymm15, ymm7  # llvm-mca: 0.5cy, llvm-exegesis: 4.0cy (2*ZnFPU1)
```
The AMD table does not mention those instructions. [The uops.info measurements](https://uops.info/table.html?search=vpsllvd&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ZENp=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_bmi=on&cb_fma=on&cb_mmx=on&cb_sse=on) agree with llvm-exegesis on the throughput, but not on the port usage.


- The SSE and AVX CMP instructions `(V)CMPcc(SS|PS|PD|SD)` have an incorrect (inverse) throughput / resource usage and an incorrect latency.
For throughput, e.g.:
```
# LLVM-EXEGESIS-DEFREG XMM1 43
# LLVM-EXEGESIS-DEFREG XMM2 42
# LLVM-EXEGESIS-DEFREG XMM3 42
.intel_syntax noprefix
vcmpss xmm3, xmm1, xmm2, 1  # llvm-mca: 1.0cy, llvm-exegesis: 0.5cy
```
and for latency:
```
# LLVM-EXEGESIS-DEFREG XMM1 43
# LLVM-EXEGESIS-DEFREG XMM2 42
.intel_syntax noprefix
vcmpss xmm1, xmm1, xmm2, 1  # llvm-mca: 3.0cy, llvm-exegesis: 1.0cy
```
AMD's table is consistent with llvm-exegesis here: it reports a latency of 1 and a throughput of 2 for the xmm version, since it uses only one FPU0/1 uop. For the ymm version, the throughput is 1 with two such uops.
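
The AMD numbers above follow from a simple port-pressure bound, sketched below. This assumes that uops are spread evenly over the eligible pipes and that nothing else limits throughput; the function name and the port tuple are our own illustration and are not taken from the znver1 model.

```python
from fractions import Fraction


def port_bound_inverse_throughput(uops, ports):
    """Lower bound in cycles per instruction when `uops` are spread evenly over `ports`."""
    return Fraction(uops, len(ports))


# Assumption for illustration: "FPU0/1" means either of these two pipes can take the uop.
FPU01 = ("ZnFPU0", "ZnFPU1")

xmm = port_bound_inverse_throughput(1, FPU01)  # one uop on two pipes
ymm = port_bound_inverse_throughput(2, FPU01)  # two uops on two pipes

print(float(xmm), float(1 / xmm))  # inverse throughput and throughput, xmm
print(float(ymm), float(1 / ymm))  # inverse throughput and throughput, ymm
```

Under these assumptions, the xmm form is bounded at 0.5 cycles (throughput 2) and the ymm form at 1 cycle (throughput 1), which matches the AMD table and the llvm-exegesis measurement for `vcmpss`.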

Sorry for the long issue; sadly, there seems to be a lot to find.
Please do tell me if I should separate these into multiple issues!

</pre>