[llvm] [aarch64][win] Add support for import call optimization (equivalent to MSVC /d2ImportCallOptimization) (PR #121516)

Fri Jan 3 02:58:04 PST 2025

mstorsjo wrote:

A couple questions on the mechanism:
- As the flag `/d2ImportCallOptimization` is undocumented, I presume that this format is undocumented too. This covers two separate formats as far as I can see. The first is the `.impcall` section contents, which this PR generates and which the linker consumes. This format is mostly a convention, an agreement between compiler and linker; this currently uses the `Imp_Call_V1` identifier. The second is the dynamic relocations that the linker generates, which end up in the final binary and are consumed by the Windows loader. This format isn't touched upon here (as this only covers code generation, not linking). The format of those dynamic relocations is much more fixed, as it's handled by the OS. Although this is only an optimization, so running on older Windows versions which don't recognize it should be fine too?
- When the linker uses these dynamic relocations, it changes a `br` or `blr` indirect branch instruction into a direct `b` or `bl`, if the target address is close enough within the address space - right? (The other instructions for loading the address are left untouched, as those instructions can be anywhere, detached from the branch, and we can't speculate on whether the register contents are needed elsewhere.)
- As the 64-bit address space is kinda large, and the `b` and `bl` instructions only have a +/- 128 MB range, I guess that this optimization simply can't be applied if the target is too far away? Intuitively, it feels like this wouldn't end up being effective all that often? Then again, I guess most DLLs are loaded close to each other in the virtual memory layout, and the base EXE, with dynamic base enabled (as always on aarch64), would also end up somewhere close.
- Is the only thing we gain here the performance benefit of a direct branch, which is easier for the branch predictor than an indirect branch? While that obviously is better, this feels like a whole lot of extra work and structures for something that feels like relatively small gains. Is there something else to be gained in relation to this that I'm missing (and/or that isn't mentioned yet)? Or are the gains from better branch prediction much bigger than what I'm thinking here?
- For cases when DLL-imported functions are called without being marked as dllimport in headers, we jump via a thunk (from the import library, and/or linker generated). Can the same thing be applied to them? Does that require including similar `.impcall` sections in the import libraries for the thunks? (In the case of lld, it actually doesn't use the import library contents here but just synthesizes the thunks on its own - there we could do the same right away without needing to update import libraries with this metadata.)
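
To make the range question above concrete, here is a minimal sketch of the check I have in mind. It only illustrates the AArch64 encodings involved; the function names are hypothetical, and this is not taken from the PR, the linker, or the Windows loader. It assumes the rewrite replaces a single `br`/`blr` instruction with a `b`/`bl` at the same address.

```cpp
#include <cstdint>
#include <optional>

// AArch64 B/BL carry a signed 26-bit word offset, i.e. +/-128 MiB
// relative to the address of the branch instruction itself.
constexpr int64_t kBranchRange = int64_t{1} << 27; // 128 MiB

// Hypothetical helper: true if `insn` is `br Xn` or `blr Xn`.
// Bits [9:5] hold the register number and are masked out.
bool isIndirectBranch(uint32_t insn) {
  uint32_t opcode = insn & 0xFFFFFC1Fu;
  return opcode == 0xD61F0000u /* br */ || opcode == 0xD63F0000u /* blr */;
}

// Hypothetical helper: returns the encoding of a direct B (link == false)
// or BL (link == true) from `site` to `target`, or std::nullopt if the
// target is not 4-byte aligned relative to the site or is out of range.
std::optional<uint32_t> encodeDirectBranch(uint64_t site, uint64_t target,
                                           bool link) {
  int64_t delta = static_cast<int64_t>(target - site);
  if ((delta & 3) != 0 || delta < -kBranchRange || delta >= kBranchRange)
    return std::nullopt;
  // Keep bits [27:2] of the two's complement offset as imm26.
  uint32_t imm26 =
      static_cast<uint32_t>((static_cast<uint64_t>(delta) >> 2) & 0x03FFFFFFu);
  return (link ? 0x94000000u : 0x14000000u) | imm26;
}
```

With something like this, a target more than 128 MiB away simply yields no replacement, and the original indirect branch would stay in place.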


https://github.com/llvm/llvm-project/pull/121516