[llvm] [AArch64][BOLT] Ensure tentative code layout for cold BBs runs. (PR #96609)
Paschalis Mpeis via llvm-commits
llvm-commits at lists.llvm.org
Tue Jun 25 05:57:26 PDT 2024
paschalis-mpeis wrote:
This PR is a draft to initiate a discussion on the following two issues:
1. On a large input binary (~60MiB of code), we encountered a case where LongJmp was expanding some branches it shouldn't (**the current patch aims to solve this**).
2. This expansion seemed to be leading to a runtime crash.
A more detailed discussion follows.
---
## (1) Expanding branches that shouldn't have
On AArch64, when using `-lite=1` and `-split-functions` in relocation mode, we encountered cases where functions that come after the hot/cold frontier are not 'emittable', so when LongJmp needs to generate stubs between hot/cold basic blocks, it has incomplete information.
This happens because `tentativeLayoutRelocMode` never calls `tentativeLayoutRelocColdPart`, causing the estimated dot address of the first cold BB for such functions to be `0x0`. As a result, when emitting stubs to jump from/to cold BBs, BOLT estimates much larger PC-relative addresses.
**Example:** one such function that had 6 hot and 7 cold basic blocks. The tentative addresses computed per basic block were:
```
HOT BB: .LBB01750 0x8000054
HOT BB: .Ltmp26442 0x8000074
HOT BB: .Ltmp26443 0x800007c
HOT BB: .Ltmp26444 0x8000090
HOT BB: .LFT24377 0x800009c
HOT BB: .LLP3525 0x80000a4
COLD BB: .LFT24370 0x0
COLD BB: .LFT24372 0x14
COLD BB: .LFT24378 0x30
COLD BB: .LFT24380 0x40
COLD BB: .LFT24381 0x48
COLD BB: .Ltmp26445 0x5c
COLD BB: .LFT24382 0x68
```
So for a jump from hot BB `.Ltmp26443` to cold `.LFT24378`, we get:
- Dot Target Address: `0x800007c`
- Dot Source Address: `0x30` (ie bb where we jump from)
This makes the computed `PCRelTgtAddress` `0x800004c`, which is above `0x8000000` ($2^{27}$), forcing `relaxStub` to emit a short-jump as it does not fit in the `b` instruction.
The same happens vice versa, e.g., when jumping from a cold to a hot BB.
A third example, which did not expand by luck, was:
- Dot Target Address: `0x8000090`
- Dot Source Address: `0x96`
This makes `PCRelTgtAddress=0x7fffffa` (`0x8000090-0x96`), which falls just under the limit and fits into a single instruction.
Please note that without lite mode, `tentativeLayoutRelocColdPart` runs at least once as all functions are forced to be 'emittable'. Therefore, we don't encounter any wrong dot address estimations. Despite having to perform longer jumps (as each function is duplicated and optionally split into hot/cold), it just so happens that for the same binary all jumps fit in a single instruction and no 'relaxation' is needed.
**The current patch**, forces tentative address estimations on cold BBs to run at least once, otherwise it exits with a 'BOLT-ERROR'. This change should decrease stub 'relaxations' with lite mode. It can lead to slight decreased code size (in our example 9% and 28% of hot and cold sections respectively) and may have a very slight impact on performance as well (given this concerns cold blocks; mostly if the profile is not good or if CPU caches are affected).
---
## (2) Relaxing shouldn't have caused a runtime crash
Due to the wrong estimations described in (1), some branches are expanded/relaxed$^1$. However, it is not clear how such 'relaxation' could be responsible for a runtime crash. Despite being unnecessary, the below replacement looks like legit code. Given that, **any ideas on what may have caused this?**
**Example:** in the stub below, the branch was 'relaxed' from:
```armasm
b .Ltmp1234
```
to:
```armasm
adrp x16, .Ltmp1234
add x16, x16, :lo12:.Ltmp1234
br x16
```
**NOTE:** The issue described in **(1)** ('relaxing' jumps where it shouldn't), seems to be found in other binaries that use `-lite=1 -split-functions`. However, the crash described here was not observed in those cases. It is currently unclear if this was due to pure luck (i.e., those cold stubs were never reached) or if there is something else that affects only larger binaries.
---
### $^1$ Concept of Relaxation vs Expanding?
Actually, the `relaxStub` emits a stub, which might then be 'expanded' to a short jump or (for very long distances on non-PIC executables) to a long jump. Is the naming here 'stale' due to several refactorings that happened, or am I missing any theoretical concepts? In my mind, 'relaxing' would have meant using the safest/slowest jump first, and then adjusting to the lightest that still gets the job done.
https://github.com/llvm/llvm-project/pull/96609
More information about the llvm-commits
mailing list