<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/152008>152008</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[MCA][X86] llvm-mca very inaccurate for pop instructions
</td>
</tr>
<tr>
<th>Labels</th>
<td>
tools:llvm-mca,
backend:X86 Scheduler Models
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
boomanaiden154
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
boomanaiden154
</td>
</tr>
</table>
<pre>
(Everything below is expressed as reciprocal throughput, i.e. cycles per iteration of the snippet, to make it easier to compare results across tools and benchmarks.)
For the following snippet of code:
```asm
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12
```
llvm-mca predicts a reciprocal throughput of 30 cycles per iteration. Output for 1000 and then 2000 iterations:
```
Iterations: 1000
Instructions: 5000
Total Cycles: 30003
Total uOps: 10000
Dispatch Width: 6
uOps Per Cycle: 0.33
IPC: 0.17
Block RThroughput: 2.5
```
```
Iterations: 2000
Instructions: 10000
Total Cycles: 60003
Total uOps: 20000
```
Taking the difference between the two runs to cancel out the fixed startup cost: (60003-30003)/(2000-1000) = 30 cycles per iteration.
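The same arithmetic as a small Python sketch, in case anyone wants to repeat it with other iteration counts. The function name is made up for illustration; the inputs are the `Total Cycles` and `Iterations` values from the two llvm-mca runs above.

```python
def steady_state_cycles_per_iteration(cycles_a, iters_a, cycles_b, iters_b):
    """Cycles per iteration from two runs, cancelling the fixed startup cost."""
    return (cycles_b - cycles_a) / (iters_b - iters_a)


# Total Cycles / Iterations taken from the two llvm-mca runs in this report.
rthroughput = steady_state_cycles_per_iteration(30003, 1000, 60003, 2000)
print(rthroughput)  # 30.0
```

Differencing two runs avoids counting the few cycles of pipeline fill at the start, which is why this gives exactly 30 rather than 30.003.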
`llvm-exegesis` measures a reciprocal throughput of about 6.5-7 cycles, but it runs into a number of L1D cache misses because none of the cache lines it touches have been loaded yet:
```asm
# LLVM-EXEGESIS-DEFREG RSP 20000
# LLVM-EXEGESIS-MEM-DEF test1 131072 7fffffff
# LLVM-EXEGESIS-MEM-MAP test1 131072
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12
```
```
taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=10000 -validation-counter=l1d-cache-load-misses
---
mode: latency
key:
instructions:
- 'POP64r RAX'
- 'POP64r RCX'
- 'POP64r RDX'
- 'POP64r RBX'
- 'POP64r R12'
config: ''
register_initial_values:
- 'RSP=0x20000'
cpu_name: skylake-avx512
llvm_triple: x86_64-grtev4-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 1.3718, per_snippet_value: 6.859, validation_counters:
l1d-cache-load-misses: 3744 }
error: ''
info: ''
assembled_snippet
...
```
```
taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=5000 -validation-counter=l1d-cache-load-misses
---
mode: latency
key:
instructions:
- 'POP64r RAX'
- 'POP64r RCX'
- 'POP64r RDX'
- 'POP64r RBX'
- 'POP64r R12'
config: ''
register_initial_values:
- 'RSP=0x20000'
cpu_name: skylake-avx512
llvm_triple: x86_64-grtev4-linux-gnu
min_instructions: 5000
measurements:
- { key: latency, value: 1.4052, per_snippet_value: 7.026, validation_counters:
l1d-cache-load-misses: 2169 }
error: ''
info: ''
assembled_snippet
...
```
uiCA predicts a reciprocal throughput of 2.5 cycles per iteration, although I'm not (currently) convinced that is completely accurate.
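To put the three tools side by side, here is a rough Python sketch that converts the per-instruction `value` reported by `llvm-exegesis` into a per-snippet number and prints how far each prediction is from it. The helper name is hypothetical; the numbers are copied from the outputs above, and the measured values still include the cache-miss noise already mentioned.

```python
SNIPPET_LENGTH = 5  # five popq instructions per iteration of the snippet


def per_snippet(per_instruction_value, snippet_length=SNIPPET_LENGTH):
    """Scale a per-instruction measurement up to a whole snippet iteration."""
    return per_instruction_value * snippet_length


# `value` fields from the two llvm-exegesis runs in this report.
measured = [per_snippet(1.3718), per_snippet(1.4052)]

# Predicted cycles per iteration from the other two tools.
predictions = {"llvm-mca": 30.0, "uiCA": 2.5}

for tool, predicted in predictions.items():
    for value in measured:
        ratio = predicted / value
        print(f"{tool}: predicted {predicted}, measured {value:.3f}, ratio {ratio:.2f}")
```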
I'm pretty sure MCA sees the dependency on `%rsp` between consecutive pops and serializes the instructions because of it, whereas the hardware appears to resolve that dependency without delaying execution. If we reset `%rsp` every iteration, MCA seems to do much better. Since this is a real register dependency, I'm not sure this is easy to fix.
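One way to sanity-check the `%rsp` hypothesis: if each pop had to wait for the previous pop's `%rsp` update, an iteration would cost roughly the number of pops times the latency of one link in that chain. The sketch below just inverts that relation for the 30-cycle prediction. The function names are made up for illustration, and the resulting per-link figure is inferred from the numbers in this report, not read out of the scheduling model.

```python
def serialized_chain_cycles(num_instructions, per_link_latency):
    """Cycles per iteration if every instruction waits on the previous one."""
    return num_instructions * per_link_latency


def implied_link_latency(cycles_per_iteration, num_instructions):
    """Per-instruction chain latency implied by a fully serialized iteration."""
    return cycles_per_iteration / num_instructions


# 30 cycles per iteration predicted by llvm-mca for a 5-pop snippet.
latency = implied_link_latency(30, 5)
print(latency)  # 6.0
print(serialized_chain_cycles(5, latency))  # 30.0
```

If the serialization assumption holds, the prediction corresponds to about 6 cycles per pop, which would be consistent with each pop waiting on the one before it rather than issuing at the port-limited rate.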
</pre>