[llvm] [AArch64][BOLT] Ensure tentative code layout for cold BBs runs. (PR #96609)

Tue Jun 25 05:57:26 PDT 2024

paschalis-mpeis wrote:

This PR is a draft to initiate a discussion on the following two issues:
1. On a large input binary (~60MiB of code), we encountered a case where LongJmp was expanding some branches it shouldn't (**the current patch aims to solve this**).
2. This expansion seemed to be leading to a runtime crash.

A more detailed discussion follows.

---

## (1) Expanding branches that shouldn't have

On AArch64, when using `-lite=1` and `-split-functions` in relocation mode, we encountered cases where functions that come after the hot/cold frontier are not 'emittable', so when LongJmp needs to generate stubs between hot/cold basic blocks, it has incomplete information.

This happens because `tentativeLayoutRelocMode` never calls `tentativeLayoutRelocColdPart`, causing the estimated dot address of the first cold BB for such functions to be `0x0`. As a result, when emitting stubs to jump from/to cold BBs, BOLT estimates much larger PC-relative addresses.

**Example:** one such function that had 6 hot and 7 cold basic blocks. The tentative addresses computed per basic block were:
```
 HOT BB: .LBB01750    0x8000054
 HOT BB: .Ltmp26442   0x8000074
 HOT BB: .Ltmp26443   0x800007c
 HOT BB: .Ltmp26444   0x8000090
 HOT BB: .LFT24377    0x800009c
 HOT BB: .LLP3525     0x80000a4
COLD BB: .LFT24370    0x0
COLD BB: .LFT24372    0x14
COLD BB: .LFT24378    0x30
COLD BB: .LFT24380    0x40
COLD BB: .LFT24381    0x48
COLD BB: .Ltmp26445   0x5c
COLD BB: .LFT24382    0x68
```

So for a jump from hot BB `.Ltmp26443` to cold `.LFT24378`, we get:
- Dot Target Address: `0x800007c`
- Dot Source Address: `0x30` (ie bb where we jump from)
This makes the computed `PCRelTgtAddress` `0x800004c`, which is above `0x8000000` ($2^{27}$), forcing `relaxStub` to emit a short-jump as it does not fit in the `b` instruction.

The same happens vice versa, e.g., when jumping from a cold to a hot BB.

A third example, which did not expand by luck, was:
- Dot Target Address: `0x8000090`
- Dot Source Address: `0x96`
This makes `PCRelTgtAddress=0x7fffffa` (`0x8000090-0x96`), which falls just under the limit and fits into a single instruction.

Please note that without lite mode, `tentativeLayoutRelocColdPart` runs at least once as all functions are forced to be 'emittable'. Therefore, we don't encounter any wrong dot address estimations. Despite having to perform longer jumps (as each function is duplicated and optionally split into hot/cold), it just so happens that for the same binary all jumps fit in a single instruction and no 'relaxation' is needed.

**The current patch**, forces tentative address estimations on cold BBs to run at least once, otherwise it exits with a 'BOLT-ERROR'. This change should decrease stub 'relaxations' with lite mode. It can lead to slight decreased code size (in our example 9% and 28% of hot and cold sections respectively) and may have a very slight impact on performance as well (given this concerns cold blocks; mostly if the profile is not good or if CPU caches are affected).

---

## (2) Relaxing shouldn't have caused a runtime crash

Due to the wrong estimations described in (1), some branches are expanded/relaxed$^1$. However, it is not clear how such 'relaxation' could be responsible for a runtime crash. Despite being unnecessary, the below replacement looks like legit code. Given that, **any ideas on what may have caused this?**

**Example:** in the stub below, the branch was 'relaxed' from:
   ```armasm
    b       .Ltmp1234
```

to:
```armasm
adrp    x16, .Ltmp1234
add     x16, x16, :lo12:.Ltmp1234
br      x16
```

**NOTE:** The issue described in **(1)** ('relaxing' jumps where it shouldn't), seems to be found in other binaries that use `-lite=1 -split-functions`. However, the crash described here was not observed in those cases. It is currently unclear if this was due to pure luck (i.e., those cold stubs were never reached) or if there is something else that affects only larger binaries.

---

### $^1$ Concept of Relaxation vs Expanding?

Actually, the `relaxStub` emits a stub, which might then be 'expanded' to a short jump or (for very long distances on non-PIC executables) to a long jump. Is the naming here 'stale' due to several refactorings that happened, or am I missing any theoretical concepts? In my mind, 'relaxing' would have meant using the safest/slowest jump first, and then adjusting to the lightest that still gets the job done.

https://github.com/llvm/llvm-project/pull/96609