[llvm] [BOLT][AArch64] Introduce SPE mode in BasicAggregation (PR #120741)

Fri Jan 17 10:05:47 PST 2025

paschalis-mpeis wrote:

Hey Amir and Maks,

Thank you for taking a look at this!

> Am I reading it right that you didn't see a perf difference between registering SPE as two basic events or one branch event? In this case, can you please try -infer-fall-throughs option with the latter?

Correct, in some preliminary internal tests we found the two approaches to perform similarly.

Thanks for your suggestion to use `-infer-fall-throughs`. I thought LBR mode inferred fall-through branches by default, but it looks like this has to be specified manually?

**Let me share my understanding of the LBR format to see if I got this right:**
Each LBR event gets a contiguous stack of taken branches. Any other branches that may lie in between them are **known** to be fall-throughs, which BOLT can infer. E.g., if we have:
- $\textsf{\color{blue}TK1}$ -> $\textsf{\color{blue}TK2}$ -> $\textsf{\color{blue}TK3}$, then BOLT can propagate CFG hotness to:
- $\textsf{\color{blue}TK1}$ -> FT1a -> FT1b .. -> $\textsf{\color{blue}TK2}$ -> FT2a -> FT2b .. -> $\textsf{\color{blue}TK3}$
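To make the LBR case concrete, here is a minimal Python sketch of the idea (not BOLT's actual implementation). It assumes the stack is given oldest-first as `(source, target)` address pairs; the name `fallthrough_ranges` and the addresses are hypothetical. Between the target of one taken branch and the source of the next, execution was sequential, so any branch inside that range must have fallen through.

```python
from typing import List, Tuple

# (source address, target address) of one taken branch
Branch = Tuple[int, int]

def fallthrough_ranges(stack: List[Branch]) -> List[Tuple[int, int]]:
    """Return the address ranges executed sequentially between consecutive
    taken branches. Assumption: `stack` is ordered oldest-first."""
    ranges = []
    for (_, prev_tgt), (next_src, _) in zip(stack, stack[1:]):
        # Execution landed at prev_tgt and ran straight to next_src,
        # so every branch inside [prev_tgt, next_src) was not taken.
        if prev_tgt <= next_src:  # ignore pairs that cannot be sequential
            ranges.append((prev_tgt, next_src))
    return ranges

# Hypothetical stack for TK1 -> TK2 -> TK3
stack = [(0x100, 0x200), (0x240, 0x300), (0x310, 0x400)]
print([(hex(a), hex(b)) for a, b in fallthrough_ranges(stack)])
# prints [('0x200', '0x240'), ('0x300', '0x310')]
```

This only works because the entries are known to be contiguous; with isolated samples there is no "next" branch to bound the range.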

SPE, on the other hand, is a statistical sampling method, meaning the collected packets are not captured contiguously. Each src/tgt pair comes from a packet that looks like:

```
.  00000040:         PC 0xABC el2 ns=1
.  00000049:         PAD
.  00000053:         B COND
.  00000055:         EV RETIRED NOT-TAKEN
.  0000005a:         LAT 7 ISSUE
.  0000005d:         LAT 8 TOT
.  00000060:         TGT 0xDEF el2 ns=1
.  00000069:         PAD
.  00000077:         TS 1234
```
(note: you can inspect native SPE packets w/ `perf script -D`)

From this example we have `0xABC` -> `0xDEF` (a src/tgt pair), where `0xABC` is a branch that was NOT-TAKEN.
The tgt `0xDEF` is the target address of some block (i.e., not a branch). We have no information on whether the branch of that target block will be taken or not. Therefore, my understanding is that we cannot infer any branches in between src/tgt. And I believe that is why we found the two approaches to be close to each other.
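As an illustration of what a single packet gives us, here is a rough Python sketch that pulls the src/tgt pair and the taken/not-taken flag out of a text dump shaped like the one above. The names `SpeBranch` and `parse_spe_packet` are hypothetical, and treating a packet without a `NOT-TAKEN` event as taken is an assumption made for this sketch.

```python
from typing import NamedTuple, Optional

class SpeBranch(NamedTuple):
    src: int      # address from the PC line
    tgt: int      # address from the TGT line
    taken: bool   # False when the EV line carries NOT-TAKEN

def parse_spe_packet(text: str) -> Optional[SpeBranch]:
    """Parse one dumped packet; return None if PC or TGT is missing."""
    src = tgt = None
    taken = True  # assumption: no NOT-TAKEN event means the branch was taken
    for line in text.splitlines():
        # Drop the leading ".  00000040:" offset column.
        _, _, body = line.partition(":")
        fields = body.split()
        if not fields:
            continue
        if fields[0] == "PC":
            src = int(fields[1], 16)
        elif fields[0] == "TGT":
            tgt = int(fields[1], 16)
        elif fields[0] == "EV" and "NOT-TAKEN" in fields:
            taken = False
    if src is None or tgt is None:
        return None
    return SpeBranch(src, tgt, taken)

PACKET = """\
.  00000040:         PC 0xABC el2 ns=1
.  00000053:         B COND
.  00000055:         EV RETIRED NOT-TAKEN
.  00000060:         TGT 0xDEF el2 ns=1
.  00000077:         TS 1234
"""
print(parse_spe_packet(PACKET))
# prints SpeBranch(src=2748, tgt=3567, taken=False)
```

Each packet yields exactly one such record, with nothing tying it to the previous or next sample, which is the gap compared to an LBR stack.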

Please do share your thoughts on this.

Do you think there are any other benefits to using the LBR format? It can additionally utilize prediction information (miss/hit), but we haven't found this to be that beneficial for the quite limited SPE branch data (when compared to LBR traces).

https://github.com/llvm/llvm-project/pull/120741