[lld] [lld][InstrProf] Profile guided function order (PR #96268)
Ellis Hoag via llvm-commits
llvm-commits at lists.llvm.org
Mon Nov 25 11:10:39 PST 2024
================
@@ -0,0 +1,413 @@
+//===- BPSectionOrderer.cpp -------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "BPSectionOrderer.h"
+#include "InputSection.h"
+#include "lld/Common/ErrorHandler.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/StringMap.h"
+#include "llvm/ProfileData/InstrProfReader.h"
+#include "llvm/Support/BalancedPartitioning.h"
+#include "llvm/Support/TimeProfiler.h"
+#include "llvm/Support/VirtualFileSystem.h"
+#include "llvm/Support/xxhash.h"
+
+#define DEBUG_TYPE "bp-section-orderer"
+using namespace llvm;
+using namespace lld::macho;
+
+/// Symbol names may be suffixed with "(.__uniq.xxxx)?.llvm.yyyy", where "xxxx"
+/// and "yyyy" are numbers that can change between builds. We need the root
+/// symbol name (everything before this suffix) so these symbols can be matched
+/// against profiles whose suffixes may differ.
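+/// For example, "_Z3foov.__uniq.123.llvm.456" has the root "_Z3foov".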
+static StringRef getRootSymbol(StringRef Name) {
+ auto [P0, S0] = Name.rsplit(".llvm.");
+ auto [P1, S1] = P0.rsplit(".__uniq.");
+ return P1;
+}
+
+static uint64_t getRelocHash(StringRef kind, uint64_t sectionIdx,
+ uint64_t offset, uint64_t addend) {
+ return xxHash64((kind + ": " + Twine::utohexstr(sectionIdx) + " + " +
+ Twine::utohexstr(offset) + " + " + Twine::utohexstr(addend))
+ .str());
+}
+
+static uint64_t
+getRelocHash(const Reloc &reloc,
+             const DenseMap<const InputSection *, uint64_t> &sectionToIdx) {
+ auto *isec = reloc.getReferentInputSection();
+ std::optional<uint64_t> sectionIdx;
+ auto sectionIdxIt = sectionToIdx.find(isec);
+ if (sectionIdxIt != sectionToIdx.end())
+ sectionIdx = sectionIdxIt->getSecond();
+ std::string kind;
+ if (isec)
+ kind = ("Section " + Twine(isec->kind())).str();
+ if (auto *sym = reloc.referent.dyn_cast<Symbol *>()) {
+ kind += (" Symbol " + Twine(sym->kind())).str();
+ if (auto *d = dyn_cast<Defined>(sym)) {
+ if (isa_and_nonnull<CStringInputSection>(isec))
+ return getRelocHash(kind, 0, isec->getOffset(d->value), reloc.addend);
+ return getRelocHash(kind, sectionIdx.value_or(0), d->value, reloc.addend);
+ }
+ }
+ return getRelocHash(kind, sectionIdx.value_or(0), 0, reloc.addend);
+}
+
+static void constructNodesForCompression(
+    const SmallVector<const InputSection *> &sections,
+    const DenseMap<const InputSection *, uint64_t> &sectionToIdx,
+ const SmallVector<unsigned> §ionIdxs,
+ std::vector<BPFunctionNode> &nodes,
+ DenseMap<unsigned, SmallVector<unsigned>> &duplicateSectionIdxs,
+ BPFunctionNode::UtilityNodeT &maxUN) {
+ TimeTraceScope timeScope("Build nodes for compression");
+
+ SmallVector<std::pair<unsigned, SmallVector<uint64_t>>> sectionHashes;
+ sectionHashes.reserve(sectionIdxs.size());
+ SmallVector<uint64_t> hashes;
+ for (unsigned sectionIdx : sectionIdxs) {
+ const auto *isec = sections[sectionIdx];
+ constexpr unsigned windowSize = 4;
+
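+    // (1) Hash every windowSize-byte window (k-mer) of the section data.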
+ for (size_t i = 0; i < isec->data.size(); i++) {
+ auto window = isec->data.drop_front(i).take_front(windowSize);
+ hashes.push_back(xxHash64(window));
+ }
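+    // (2) Hash each relocation by its target rather than its raw bits.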
+ for (const auto &r : isec->relocs) {
----------------
ellishg wrote:
The BP algorithm uses a set of hashes, not an ordered list, which is why we sort and remove duplicates on lines 94-95. This may be confusing because two functions can have the same set of hashes even if their instructions appear in different orders, but that is in fact what we want: compression algorithms compress well when two nearby functions share instructions (k-mers), regardless of the exact instruction order.
This code collects two "types" of hashes (a standalone sketch follows this comment).
1. A sliding window of hashes over the section data. You can think of this as roughly one hash per instruction, although that is not technically accurate.
2. A hash for each relocation. I try to hash the target of the relocation rather than the bits themselves.
In an earlier version I computed 1 and 2 at the same time, folding the relocation hashes into the data hashes, but the code was more complicated and the results were worse.
I suspect that separating 1 and 2 improves compression because some functions are similar except that their call, load, or store targets differ. These functions should be ordered close together even if their instructions aren't identical.
You are welcome to experiment with this. The nice thing about BP is that using a different set of hashes can never break correctness; at worst it can regress startup performance or compression, and at best it can improve them. Let me know if you have ideas and I can also try them on our apps.
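To make this concrete, here is a minimal standalone sketch of the scheme. It is not the patch's code: toyHash, computeHashSet, and the toy inputs are illustrative stand-ins, while the patch itself uses llvm::xxHash64 over the section bytes and hashes relocation targets via getRelocHash.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for llvm::xxHash64 (FNV-1a); any 64-bit hash works here.
static uint64_t toyHash(const std::string &bytes) {
  uint64_t h = 1469598103934665603ull;
  for (unsigned char c : bytes) {
    h ^= c;
    h *= 1099511628211ull;
  }
  return h;
}

// Build the utility-node hashes for one "function": (1) a hash per
// windowSize-byte window of its bytes, (2) a hash per relocation target,
// then sort + dedup because BP treats the result as a set.
static std::vector<uint64_t>
computeHashSet(const std::string &data,
               const std::vector<uint64_t> &relocTargetHashes) {
  constexpr size_t windowSize = 4; // same k-mer width as the patch
  std::vector<uint64_t> hashes;
  for (size_t i = 0; i < data.size(); ++i)
    hashes.push_back(toyHash(data.substr(i, windowSize)));
  hashes.insert(hashes.end(), relocTargetHashes.begin(),
                relocTargetHashes.end());
  std::sort(hashes.begin(), hashes.end());
  hashes.erase(std::unique(hashes.begin(), hashes.end()), hashes.end());
  return hashes;
}

int main() {
  // Two "functions" whose bytes differ only in order share most k-mers,
  // so their hash sets overlap heavily and BP can place them together.
  auto a = computeHashSet("pushq movq popq retq", {toyHash("callq _foo")});
  auto b = computeHashSet("movq pushq popq retq", {toyHash("callq _bar")});
  return (a.size() && b.size()) ? 0 : 1;
}

Because the result is a set, reordering instructions within a function barely changes its utility nodes, which is exactly the order-insensitivity described above.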
https://github.com/llvm/llvm-project/pull/96268