[llvm] [llvm-exegesis] [AArch64] Add support for Load Instructions in subprocess execution mode (PR #144895)

Thu Sep 25 02:05:21 PDT 2025

lakshayk-nv wrote:

> Are we now able to measure a load/store correctly? I.e., is the measured value close to the SWOG number?
>>
No.
I revisited the measurements after this comment and checked the latency of `LD1B` in subprocess mode, and of CMPHS (an arbitrary opcode) in both subprocess and inprocess mode.
Resetting the counters (`configurePerfCounter`) does not appear to work correctly, because subprocess and inprocess mode report different latencies for the same opcode.

I also verified that the mmap implementations work: both the auxiliary mmap and the manual snippet mmap behave as expected, so that is a win at least.

>
> Is there a way to dump the setup code ?
>>
Yes, we can inspect the generated assembly with `--debug-only="print-gen-assembly"` or dump the object file itself with `--dump-object-to-disk=a.o`. For a build without libpfm, use `--benchmark-phase=assemble-measured-code` in subprocess execution mode, because `--use-dummy-perf-counters` is not supported with `--execution-mode=subprocess`.

> add a test that doesn't require running it? ... test which requires running it
>>
Testcases for the added functionality (please tell me if I am missing anything below):
1. A testcase checking that LD1B executes without error, i.e. without a segfault (added previously in d6f2371).
2. A test of the generated assembly that checks for the fixed-up setup code.
This could additionally use a manual snippet, so that it covers the manual snippet mmap, the auxiliary mmap, and the ioctl syscalls together.
```
.arch armv8-a+sve
# LLVM-EXEGESIS-MEM-DEF test_mem 8 4096
# LLVM-EXEGESIS-MEM-MAP test_mem 65536

# LLVM-EXEGESIS-DEFREG X0 0
# LLVM-EXEGESIS-DEFREG X1 0

ldr x1, [x0, #0]
```

3. And yes, we would not be able to have a statistical check (i.e. a check of the measured latency), as the upstream build is without libpfm.
Still, we could have a testcase checking that the measured latency of LD1B is close to the SWOG value, provided it only runs when the libpfm requirement is fulfilled.

Would this be enough to test the PR changes?
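
To make "nearly equal to SWOG" concrete, here is a minimal Python sketch of such a check. The helper name `latency_matches_swog` and the 10% relative tolerance are assumptions for illustration only, not part of the PR; an actual test would be a lit test gated on libpfm.

```python
import math

# SWOG latency for LD1B as quoted in this comment.
SWOG_LD1B_LATENCY = 6.0


def latency_matches_swog(measured, expected=SWOG_LD1B_LATENCY, rel_tol=0.10):
    """Return True if `measured` is within `rel_tol` of `expected`.

    The 10% tolerance is an assumed value chosen for illustration.
    """
    return math.isclose(measured, expected, rel_tol=rel_tol)


# The 1,000,000-instruction value from Table 1 would fail such a check.
print(latency_matches_swog(1.56199))
```

With the current numbers this check would fail for every row of Table 1, which is consistent with the "No" above.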

[For completeness] 
`Table 1`: Latency measurements for LD1B (with CMPHS as the consumer instruction); LD1B SWOG latency = 6

| Min Instructions | Combined Latency (per_snippet_value) | CMPHS_PPzZZ_H Baseline | Calculated LD1B Latency |
|------------------|--------------------------------------|------------------------|--------------------------|
| 10,000           | 16.8298                              | 2.005                  | 14.8248                  |
| 100,000          | 4.0312                               | 2.005                  | 2.0262                   |
| 1,000,000        | 3.56699                              | 2.005                  | 1.56199                  |
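
For reference, the last column of Table 1 is the combined value minus the consumer baseline. A short Python sketch of that arithmetic follows; the variable and function names are illustrative, and the numbers are copied from the table.

```python
SWOG_LD1B_LATENCY = 6.0  # SWOG value quoted in the Table 1 caption
CMPHS_BASELINE = 2.005   # CMPHS_PPzZZ_H baseline from Table 1

# Min instructions -> combined per_snippet_value (LD1B + CMPHS).
combined = {10_000: 16.8298, 100_000: 4.0312, 1_000_000: 3.56699}


def derived_latency(combined_value, baseline=CMPHS_BASELINE):
    """Subtract the consumer baseline from the combined measurement."""
    return combined_value - baseline


ld1b = {n: derived_latency(v) for n, v in combined.items()}
for n, lat in ld1b.items():
    gap = lat - SWOG_LD1B_LATENCY
    print(f"{n:>9,}: LD1B = {lat:.5f}, gap to SWOG = {gap:+.5f}")
```

None of the derived values is close to 6, and they keep changing with the minimum instruction count.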

`Table 2`: Latency measurements for CMPGT_PPzZZ_H

| Min Instructions | Execution Mode | Latency Value |
|------------------|----------------|---------------|
| 10,000           | inprocess      | 2.0052        |
| 1,000,000        | inprocess      | 2.0314        |
| 10,000           | subprocess     | 9.9627        |
| 100,000          | subprocess     | 3.35004       |
| 1,000,000        | subprocess     | 2.61047       |
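
The gap between the two execution modes in Table 2 can be computed directly. This is a small Python sketch with illustrative names, using only the values from the table and only the instruction counts measured in both modes.

```python
# Min instructions -> measured latency, copied from Table 2.
inprocess = {10_000: 2.0052, 1_000_000: 2.0314}
subprocess_mode = {10_000: 9.9627, 100_000: 3.35004, 1_000_000: 2.61047}


def mode_gap(sub, inproc):
    """Subprocess minus inprocess latency for counts present in both."""
    return {n: sub[n] - inproc[n] for n in sorted(sub.keys() & inproc.keys())}


gaps = mode_gap(subprocess_mode, inprocess)
for n, gap in gaps.items():
    print(f"{n:>9,}: subprocess - inprocess = {gap:.5f}")
```

The gap shrinks as the instruction count grows but does not vanish, which is the discrepancy described above.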

https://github.com/llvm/llvm-project/pull/144895