<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/152008">152008</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [MCA][X86] llvm-mca very inaccurate for pop instructions
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            tools:llvm-mca,
            backend:X86 Scheduler Models
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
            boomanaiden154
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          boomanaiden154
      </td>
    </tr>
</table>

<pre>
(All numbers below are reciprocal throughputs, in cycles per iteration of the snippet, to make it easier to compare results from different tools and benchmarks.)

For the following snippet of code:
```asm
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12
```

llvm-mca predicts a reciprocal throughput of 30 cycles per iteration, based on runs with 1000 and 2000 iterations:
```
Iterations: 1000
Instructions:      5000
Total Cycles:      30003
Total uOps: 10000

Dispatch Width:    6
uOps Per Cycle:    0.33
IPC: 0.17
Block RThroughput: 2.5
```
```
Iterations: 2000
Instructions:      10000
Total Cycles:      60003
Total uOps: 20000
```

Taking the difference between the two runs: (60003 - 30003) / (2000 - 1000) = 30 cycles per iteration.
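
Differencing two runs cancels the fixed 3-cycle overhead that appears in both totals, so the result is the steady-state cost per iteration. A minimal Python sketch of that arithmetic, using only the totals reported above (the function name is hypothetical, not part of any LLVM tool):

```python
def steady_state_cycles_per_iteration(iters_a, cycles_a, iters_b, cycles_b):
    """Slope of total cycles vs. iterations between two llvm-mca runs."""
    return (cycles_b - cycles_a) / (iters_b - iters_a)

# Totals copied from the two llvm-mca summaries above.
mca_rthroughput = steady_state_cycles_per_iteration(1000, 30003, 2000, 60003)
print(mca_rthroughput)  # 30.0
```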

`llvm-exegesis` measures a reciprocal throughput of about 6.5-7 cycles per iteration, but it incurs many L1D cache misses because the cache lines the snippet touches have not been loaded yet:
```asm
# LLVM-EXEGESIS-DEFREG RSP 20000
# LLVM-EXEGESIS-MEM-DEF test1 131072 7fffffff
# LLVM-EXEGESIS-MEM-MAP test1 131072
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12
```
```
taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=10000 -validation-counter=l1d-cache-load-misses
---
mode: latency
key:
  instructions:
    - 'POP64r RAX'
    - 'POP64r RCX'
    - 'POP64r RDX'
    - 'POP64r RBX'
    - 'POP64r R12'
  config:          ''
  register_initial_values:
    - 'RSP=0x20000'
cpu_name: skylake-avx512
llvm_triple:     x86_64-grtev4-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 1.3718, per_snippet_value: 6.859, validation_counters:
      l1d-cache-load-misses: 3744 }
error: ''
info:            ''
assembled_snippet: 41554154534989FC4989F548BF0000000000000000488D350000000048C1EE0C48C1E60C4881EE0010000048B80B000000000000000F054C8D05000000004C89E74C01C748C1EF0C48C1E70C4881C70010000048BE00F0FFFFFF7F00004829FE48B80B000000000000000F0548BF00E0FFFFFF7F000048BE001000000000000048BA030000000000000049BA11000000000000004D89E849B9000000000000000048B809000000000000000F0548BF000002000000000048BE000002000000000048BA030000000000000049BA110000000000000049B804E0FFFFFF7F0000458B0049B9000000000000000048B809000000000000000F0548BC00F0FFFFFF7F00005141535057565248BF00E0FFFFFF7F00008B3F48BE032400000000000048BA010000000000000048B810000000000000000F055A5E5F58415B5948BC000002000000000049B8020000000000000058595A5B415C58595A5B415C4983C0FF75EE48BF00E0FFFFFF7F00008B3F48BE012400000000000048BA010000000000000048B810000000000000000F0548BF000000000000000048B83C000000000000000F055B415C415DC3
...
```
```
taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=5000 -validation-counter=l1d-cache-load-misses
---
mode: latency
key:
  instructions:
    - 'POP64r RAX'
    - 'POP64r RCX'
    - 'POP64r RDX'
    - 'POP64r RBX'
    - 'POP64r R12'
  config: ''
  register_initial_values:
    - 'RSP=0x20000'
cpu_name: skylake-avx512
llvm_triple:     x86_64-grtev4-linux-gnu
min_instructions: 5000
measurements:
  - { key: latency, value: 1.4052, per_snippet_value: 7.026, validation_counters:
      l1d-cache-load-misses: 2169 }
error: ''
info:            ''
assembled_snippet: 41554154534989FC4989F548BF0000000000000000488D350000000048C1EE0C48C1E60C4881EE0010000048B80B000000000000000F054C8D05000000004C89E74C01C748C1EF0C48C1E70C4881C70010000048BE00F0FFFFFF7F00004829FE48B80B000000000000000F0548BF00E0FFFFFF7F000048BE001000000000000048BA030000000000000049BA11000000000000004D89E849B9000000000000000048B809000000000000000F0548BF000002000000000048BE000002000000000048BA030000000000000049BA110000000000000049B804E0FFFFFF7F0000458B0049B9000000000000000048B809000000000000000F0548BC00F0FFFFFF7F00005141535057565248BF00E0FFFFFF7F00008B3F48BE032400000000000048BA010000000000000048B810000000000000000F055A5E5F58415B5948BC000002000000000049B8020000000000000058595A5B415C58595A5B415C4983C0FF75EE48BF00E0FFFFFF7F00008B3F48BE012400000000000048BA010000000000000048B810000000000000000F0548BF000000000000000048B83C000000000000000F055B415C415DC3
...
```
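
As a sanity check on the two reports above, `per_snippet_value` is the per-instruction `value` multiplied by the five instructions in the snippet. The sketch below redoes that multiplication and compares it with the llvm-mca figure of 30 cycles per iteration. The helper name is hypothetical, and the ratio is only a rough indication, since the measurements include cache misses:

```python
SNIPPET_INSTRUCTIONS = 5   # five popq instructions in the snippet
MCA_CYCLES_PER_ITER = 30   # llvm-mca prediction from the runs above

def per_snippet(per_instruction_value, n=SNIPPET_INSTRUCTIONS):
    """Scale a per-instruction measurement up to one pass over the snippet."""
    return round(per_instruction_value * n, 3)

# min_instructions -> per-instruction value, copied from the two reports.
runs = {10000: 1.3718, 5000: 1.4052}
per_snippet_values = {k: per_snippet(v) for k, v in runs.items()}

for min_instructions, cycles in per_snippet_values.items():
    ratio = MCA_CYCLES_PER_ITER / cycles
    print(min_instructions, cycles, f"llvm-mca is {ratio:.1f}x higher")
```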

uiCA predicts a reciprocal throughput of 2.5 cycles per iteration, although I'm not (currently) convinced that is completely accurate.

I'm pretty sure MCA sees the dependency on `%rsp` between consecutive pops and serializes the instructions because of it, whereas the hardware appears able to resolve that dependency without delaying execution. If we reset `%rsp` every iteration, MCA seems to do much better. Since this is a real register dependency, I'm not sure it is easy to fix.
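
To make the hypothesis concrete, here is a toy model in Python. It is not llvm-mca's actual algorithm. It assumes that a fully serialized `%rsp` chain costs a fixed latency per pop (6 cycles, inferred from 30 cycles / 5 pops rather than read from the scheduling model) and that, without the chain, pops are limited to 2 per cycle (inferred from the reported Block RThroughput of 2.5 for 5 pops):

```python
N_POPS = 5

def serialized_cycles(n_pops, rsp_latency):
    """Each pop waits for the previous pop's %rsp update to complete."""
    return n_pops * rsp_latency

def throughput_bound_cycles(n_pops, pops_per_cycle):
    """Pops are independent and limited only by execution resources."""
    return n_pops / pops_per_cycle

# Assumption: 6 cycles per pop, inferred from the llvm-mca result (30 / 5).
chain_bound = serialized_cycles(N_POPS, rsp_latency=6)
# Assumption: 2 pops per cycle, inferred from Block RThroughput = 2.5.
resource_bound = throughput_bound_cycles(N_POPS, pops_per_cycle=2)

print(chain_bound)     # 30
print(resource_bound)  # 2.5
```

Under these assumptions the two bounds reproduce the llvm-mca prediction (30) and the uiCA prediction (2.5), and the llvm-exegesis measurements fall between them.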
</pre>