[PATCH] D100818: [lld-macho] Implement branch-range-extension thunks

Mon May 10 17:12:53 PDT 2021

int3 added a comment.

Hopefully the last round of comments :)

Request: Can we profile this against some arm64 program that's not large enough to actually need thunks? Just want to be sure we're not creating lots of overhead when we don't need it. Though if it's a small regression, we can fine-tune it in follow-up diffs.

================
Comment at: lld/MachO/Arch/ARM64.cpp:80

+// A thunk is the always-relaxed variation of stubCode
+static constexpr uint32_t thunkCode[] = {
----------------
nit: 'always-relaxed' sounds like stubCode is sometimes itself relaxed... which I don't think is true. How about just 'relaxed'?

Also it would be nice to spell out why this is the case (since the destination address of a thunk is always statically known)

================
Comment at: lld/MachO/Arch/ARM64.cpp:81
+// A thunk is the always-relaxed variation of stubCode
+static constexpr uint32_t thunkCode[] = {
+    0x90000010, // 00: adrp  x16, <thunk.ptr>@page
----------------
nit: can we put this directly above `populateThunk`?

================
Comment at: lld/MachO/Arch/ARM64.cpp:115

 ARM64::ARM64() : ARM64Common(LP64()) {
   cpuType = CPU_TYPE_ARM64;
----------------
nit: can we keep the constructor at the bottom, right above `createARM64TargetInfo`?

================
Comment at: lld/MachO/MergedOutputSection.cpp:58
+// * Exploits the full range of call instructions - forward & backward
+//
+// Data:
----------------
might be worth mentioning somewhere that thunks themselves are presumed to have an effectively unlimited range, so thunks do not need to jump into other thunks

================
Comment at: lld/MachO/MergedOutputSection.cpp:76
+//   pointing to (b) an InputSection holding machine instructions
+//   (same code as a MachO stub), and (c) Reloc(s) that reference the
+//   real function for fixing-up the stub code.
----------------

================
Comment at: lld/MachO/MergedOutputSection.cpp:121
+// beyond which stubs are within range of a simple forward branch.
+
+uint64_t MergedOutputSection::estimateStubsInRangeVA(size_t callIdx) const {
----------------
nit: rm the newline? I usually take a newline to indicate "the above comment may apply to more than just the function immediately below"

================
Comment at: lld/MachO/MergedOutputSection.cpp:131
+  size_t maxPotentialThunks = 0;
+  for (ThunkMapPair &tp : thunkMap) {
+    ThunkInfo &ti = tp.second;
----------------
hm so we are calling `estimateStubsInRangeVA` itself in a loop... is this potentially expensive? I guess `thunkMap` and `endIdx - callIdx` aren't typically large, so the nested loops are probably fine... but might be worth calling out explicitly in a comment

================
Comment at: lld/MachO/MergedOutputSection.cpp:133
+    ThunkInfo &ti = tp.second;
+    maxPotentialThunks += (!ti.hasStub && ti.callSitesUsed < ti.callSiteCount);
+  }
----------------
unnecessary parens

================
Comment at: lld/MachO/MergedOutputSection.cpp:145-151
+  log("thunks = " + std::to_string(thunkMap.size()) +
+      ", potential = " + std::to_string(maxPotentialThunks) +
+      ", stubs = " + std::to_string(in.stubs->getSize()) + ", isecVA = " +
+      to_hexString(isecVA) + ", threshold = " + to_hexString(stubsInRangeVA) +
+      ", isecEnd = " + to_hexString(isecEnd) +
+      ", tail = " + to_hexString(isecEnd - isecVA) +
+      ", slop = " + to_hexString(branchRange - (isecEnd - isecVA)));
----------------
all these `to_hexString()` calls makes me wonder if we should have a `LOG` macro that doesn't actually evaluate its arguments till they are needed...

(I still think we should avoid adding `Config::verbose`, we can just extend ErrorHandler)

if this turns out to be necessary let's do it in a stacked diff... there's too much going on in this one already

================
Comment at: lld/MachO/MergedOutputSection.cpp:183
+  // call site can reach a stub with a forward brach.
+  bool isStubsFinal = in.stubs->isFinal;
+  uint64_t branchRange = target->branchRange;
----------------
can we drop this functionality for now, until if/when we decide to make stubs-before-text an option?

================
Comment at: lld/MachO/MergedOutputSection.cpp:217
+    std::vector<Reloc> &relocs = isec->relocs;
+    assert(std::is_sorted(relocs.begin(), relocs.end(), [](Reloc &a, Reloc &b) {
+      return a.offset > b.offset;
----------------
from STLExtras.h

================
Comment at: lld/MachO/MergedOutputSection.cpp:225
+      ++callSiteCount;
+      // calculate branch reachability boundaries
+      uint64_t callVA = isecVA + r.offset;
----------------
since you capitalized everything else

================
Comment at: lld/MachO/MergedOutputSection.cpp:229
+      uint64_t highVA = callVA + branchRange;
+      // Calculate our call referent address
+      auto *funcSym = r.referent.get<Symbol *>();
----------------
nit aside... I'm a bit confused by this comment. What address computation is going on? Isn't it more like "Determine whether the call's referent is reachable" or "Determine if the referent should be replaced by a thunk"?

================
Comment at: lld/MachO/MergedOutputSection.cpp:241
+      if (lowVA < funcVA && funcVA < highVA) {
+        // The is referent reachable with a simple call instruction.
+        continue;
----------------

================
Comment at: lld/MachO/MergedOutputSection.cpp:263
+        // unfinalized inputs[finalIdx].
+        fatal(Twine(__FUNCTION__) + ": FIXME: thunk range overrun");
+      }
----------------
I think this should work

================
Comment at: lld/MachO/MergedOutputSection.h:67
+//
+// The remaining members--bools and counters--apply to the collection
+// of thunks associated with the real function.
----------------
nit :p 

================
Comment at: lld/MachO/MergedOutputSection.h:71
+struct ThunkInfo {
+  // These denote the active thunk. isec
+  Defined *sym = nullptr;       // private-extern symbol for active thunk
----------------

================
Comment at: lld/MachO/MergedOutputSection.h:80
+  uint8_t sequence = 0;        // how many thunks created so-far?
+  bool hasStub = false;        // does the real function have a stub?
+};
----------------
is this different from `Symbol::isInStubs()`?

================
Comment at: lld/MachO/MergedOutputSection.h:83
+
+using ThunkMapPair = std::pair<Symbol *, ThunkInfo>;
+extern llvm::DenseMap<Symbol *, ThunkInfo> thunkMap;
----------------
looks like this is only used in one place... IMO we could just use `auto` there, but up to you

================
Comment at: lld/MachO/Target.h:93-95
+  // We contrive this value as sufficiently far from any valid address
+  // that it will always be out-of-range for any architecture.
+  static constexpr uint64_t outOfRangeVA = 0xfull << 60;
----------------
This seems unnecessarily cute... Can we just use `numeric_limits<uint64_t>::max()`? Also it's not really target-specific so I'd prefer it not be put in Target

================
Comment at: lld/MachO/Writer.cpp:516
 // DylibSymbol.)
+
 static void prepareBranchTarget(Symbol *sym) {
----------------
rm newline plz

================
Comment at: lld/MachO/Writer.cpp:557-560
+    ThunkInfo &thunkInfo = thunkMap[sym];
+    thunkInfo.hasStub = needsBinding(sym);
+    ++thunkInfo.callSiteCount;
+    ++isec->callSiteCount;
----------------
should we wrap this in an `if (target->needsThunks())`?

================
Comment at: lld/test/MachO/arm64-thunks.s:12
+# RUN: rm -rf %t; split-file %s %t
+# RUN: for x in {a..i}; do \
+# RUN:   llvm-mc -filetype=obj -triple=arm64-apple-darwin %t/$x.s -o %t/$x.o; \
----------------
looks like this is failing on Windows. Not sure there's a cross-platform way to loop, but writing 9 `llvm-mc` invocations doesn't seem too terrible either

out of curiosity: what happens if we put everything into the same file, instead of multiple files -- what does `llvm-mc` generate?

================
Comment at: lld/test/MachO/arm64-thunks.s:17-18
+
+# RUN: llvm-objdump -d --no-show-raw-insn %t/thunk > %t/thunk.d
+# RUN: FileCheck %s < %t/thunk.d
+
----------------
seems like you could pipe it directly into FileCheck

================
Comment at: lld/test/MachO/arm64-thunks.s:22
+
+CHECK: {{0*}}[[#%.6x, A_PAGE:]][[#%.3x, A_OFFSET:]] <_a>:
+CHECK:  bl 0x[[#%x, A:]] <_a>
----------------
since you're using the `%.6x` syntax, the `{{0*}}` seems unnecessary... (you might have to change `.6` to `.8` though)

================
Comment at: lld/test/MachO/tools/generate-thunkable-program.py:15-17
+# This list comes from libSystem.tbd and contains a sizeable subset
+# of dylib calls available for all MacOS target archs.
+libSystem_calls = (
----------------
Couldn't we just generate a bunch of random strings for the symbols? This list isn't really helping us exercise new code paths in LLD...

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100818/new/

https://reviews.llvm.org/D100818