<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/54889>54889</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            Inaccuracies in the znver1 scheduling model: `vpmov*`, `vtestp*`, `vps*v*`, `vcmp*`

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          fabian-r

      </td>

    </tr>

</table>

<pre>

We encountered several more inaccuracies in the znver1 scheduling model:

- llvm-mca predicts the `vpmov(s|z)x(b|w|q)` instructions that write to ymm registers to be faster than they actually run, e.g. (numbers are inverse throughputs):

```

# LLVM-EXEGESIS-DEFREG YMM10 42

# LLVM-EXEGESIS-DEFREG YMM7 42

.intel_syntax noprefix

vpmovsxbw ymm10, xmm7  # llvm-mca: 0.5cy, llvm-exegesis: 2.0cy (3 uops)

```

The model appears to reuse the data for the xmm version, which is faster according to [uops.info](https://uops.info/table.html?search=vpmovsxbw&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ZENp=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_avx=on&cb_avx2=on).

AMD's table doesn't include these versions of the instructions.

- Similarly, llvm-mca predicts the `vtestp(s|d)` instructions with ymm operands to be faster than they actually run, e.g.:

```

# LLVM-EXEGESIS-DEFREG YMM10 42

# LLVM-EXEGESIS-DEFREG YMM7 43

.intel_syntax noprefix

vtestps ymm10, ymm7  # llvm-mca: 0.28cy, llvm-exegesis: 2.0cy (3 uops)

```

For the xmm version, llvm-mca predicts the same inverse throughput as for the ymm version, whereas llvm-exegesis measures an inverse throughput of 1.0cy.

The AMD table claims a throughput of 2, i.e. an inverse throughput of 0.5, which agrees with neither of those.

[uops.info](https://uops.info/table.html?search=vtestp&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ZENp=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_bmi=on&cb_fma=on&cb_mmx=on&cb_sse=on) agrees with the llvm-exegesis measurements.
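
To make the throughput versus inverse throughput comparison for `vtestps` explicit, here is a minimal Python sketch. It only uses the figures quoted above; the helper name `inverse_throughput` is made up for illustration and is not part of any LLVM tool.

```python
def inverse_throughput(instructions_per_cycle):
    """Convert a throughput (instructions per cycle) to cycles per instruction."""
    return 1.0 / instructions_per_cycle

# vtestps figures quoted in this issue, all in cycles per instruction.
amd_table = inverse_throughput(2)  # AMD's table lists a throughput of 2
llvm_mca = 0.28                    # llvm-mca prediction (xmm and ymm)
exegesis = {"xmm": 1.0, "ymm": 2.0}  # llvm-exegesis measurements

for variant, measured in exegesis.items():
    print(f"{variant}: measured is {measured / llvm_mca:.1f}x the llvm-mca value "
          f"and {measured / amd_table:.1f}x the AMD table value")
```

Both reference values underestimate the measured cost, and the gap doubles for the ymm variant.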

- llvm-mca predicts the AVX2 variable shifts `vps(llvd|llvq|ravd|ravq|rlvd|rlvq)` to be too fast, both with three register operands and with two register operands plus a memory operand, e.g.:

```

# LLVM-EXEGESIS-DEFREG XMM13 43

# LLVM-EXEGESIS-DEFREG XMM15 42

# LLVM-EXEGESIS-LIVEIN RDI

.intel_syntax noprefix

vpsllvd xmm13, xmm15, xmmword ptr [rdi+42]  # llvm-mca: 0.6cy, llvm-exegesis: 2.0cy (1*ZnFPU1)

```

and

```

# LLVM-EXEGESIS-DEFREG YMM13 43

# LLVM-EXEGESIS-DEFREG YMM15 42

# LLVM-EXEGESIS-DEFREG YMM7 8

.intel_syntax noprefix

vpsllvd ymm13, ymm15, ymm7  # llvm-mca: 0.5cy, llvm-exegesis: 4.0cy (2*ZnFPU1)

```

The AMD table does not mention those instructions. [The uops.info measurements](https://uops.info/table.html?search=vpsllvd&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ZENp=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_bmi=on&cb_fma=on&cb_mmx=on&cb_sse=on) agree with llvm-exegesis on the throughput, but not on the port usage.

- The SSE and AVX CMP instructions `(V)CMPcc(SS|PS|PD|SD)` are modeled with incorrect (inverse) throughput, resource usage, and latency.

For throughput, e.g.:

```

# LLVM-EXEGESIS-DEFREG XMM1 43

# LLVM-EXEGESIS-DEFREG XMM2 42

# LLVM-EXEGESIS-DEFREG XMM3 42

.intel_syntax noprefix

vcmpss xmm3, xmm1, xmm2, 1  # llvm-mca: 1.0cy, llvm-exegesis: 0.5cy

```

and for latency:

```

# LLVM-EXEGESIS-DEFREG XMM1 43

# LLVM-EXEGESIS-DEFREG XMM2 42

.intel_syntax noprefix

vcmpss xmm1, xmm1, xmm2, 1  # llvm-mca: 3.0cy, llvm-exegesis: 1.0cy

```

Consistent with llvm-exegesis, AMD's table reports them as having a latency of 1 and a throughput of 2, since they use only one FPU0/1 uop. That holds for the xmm version; for the ymm version, the throughput is 1, with two such uops.
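
The relation between uop count and throughput in AMD's table can be sketched with a deliberately simplified model: the uops are spread evenly over the pipes that can execute them, and nothing else limits execution. The function name and the two-pipe constant are illustrative assumptions, not values taken from the LLVM scheduling model.

```python
def estimated_inverse_throughput(uops, pipes):
    """Cycles per instruction if `uops` uops are spread evenly over `pipes` pipes."""
    return uops / pipes

# Assumption: a "FPU0/1" uop can execute on either of two pipes (FPU0 or FPU1).
FPU01_PIPES = 2

xmm = estimated_inverse_throughput(uops=1, pipes=FPU01_PIPES)  # one uop
ymm = estimated_inverse_throughput(uops=2, pipes=FPU01_PIPES)  # two uops

print(f"xmm: {xmm}cy inverse throughput, throughput {1 / xmm}")
print(f"ymm: {ymm}cy inverse throughput, throughput {1 / ymm}")
```

Under these assumptions the xmm estimate matches the 0.5cy that llvm-exegesis measures for `vcmpss`, rather than the 1.0cy llvm-mca predicts.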

Sorry for the long issue; sadly, there seems to be a lot to find.

Please do tell me if I should separate these into multiple issues!

</pre>
