[PATCH] D93267: [lld-macho] Handle overflow beyond the 127 common encodings limit

Wed Dec 16 19:19:26 PST 2020

int3 added inline comments.

================
Comment at: lld/MachO/UnwindInfoSection.cpp:142-146
               if (a.second == b.second)
                 // When frequencies match, secondarily sort on encoding
                 // to maintain parity with validate-unwind-info.py
                 return a.first > b.first;
               return a.second > b.second;
----------------
I know this is existing code, but isn't this the default behavior of `std::pair::operator>()`?

================
Comment at: lld/MachO/UnwindInfoSection.cpp:149

-  // Split folded encodings into pages, limited by capacity of a page
-  // and the 24-bit range of function offset
-  //
-  // Record the page splits as a vector of iterators on cuPtrVector
-  // such that successive elements form a semi-open interval. E.g.,
-  // page X's bounds are thus: [ pageBounds[X] .. pageBounds[X+1] )
-  //
-  // Note that pageBounds.size() is one greater than the number of
-  // pages, and pageBounds.back() holds the sentinel cuPtrVector.cend()
-  pageBounds.push_back(cuPtrVector.cbegin());
-  // TODO(gkm): cut 1st page entries short to accommodate section headers ???
-  CompactUnwindEntry64 cuEntryKey;
-  for (size_t i = 0;;) {
-    // Limit the search to entries that can fit within a 4 KiB page.
-    const auto pageBegin = pageBounds[0] + i;
-    const auto pageMax =
-        pageBounds[0] +
-        std::min(i + UNWIND_INFO_COMPRESSED_SECOND_LEVEL_ENTRIES_MAX,
-                 cuPtrVector.size());
-    // Exclude entries with functionOffset that would overflow 24 bits
-    cuEntryKey.functionAddress = (*pageBegin)->functionAddress +
-                                 UNWIND_INFO_COMPRESSED_ENTRY_FUNC_OFFSET_MASK;
-    const auto pageBreak = std::lower_bound(
-        pageBegin, pageMax, &cuEntryKey,
-        [](const CompactUnwindEntry64 *a, const CompactUnwindEntry64 *b) {
-          return a->functionAddress < b->functionAddress;
-        });
-    pageBounds.push_back(pageBreak);
-    if (pageBreak == cuPtrVector.cend())
-      break;
-    i = pageBreak - cuPtrVector.cbegin();
+  // Truncate the vector to 127 elements.
+  // Common encoding indexes are limited to 0..126, while enconding
----------------
depending on how large the vector is, might be a useful perf optimization to use quickselect to find the 127th largest element and just sort over those smaller than it... but we can consider that later

================
Comment at: lld/MachO/UnwindInfoSection.cpp:187
+        bytesRemaining -= sizeof(uint32_t);
+      } else if (bytesRemaining >= 2 * sizeof(uint32_t) &&
+                 n < UNWIND_INFO_COMPACT_ENCODINGS_MAX) {
----------------
`2 * sizeof(uint32_t)` is due to the size of `unwind_info_regular_second_level_entry`, right? how about using `sizeof(unwind_info_regular_second_level_entry)` instead?

================
Comment at: lld/MachO/UnwindInfoSection.cpp:193-194
+        bytesRemaining -= 2 * sizeof(uint32_t);
+      } else
+        break;
+    }
----------------
let's use braces here too since the other if-else clauses have them

================
Comment at: lld/MachO/UnwindInfoSection.cpp:197
+    page.entryCount = i - page.entryIndex;
+    if (i < cuPtrVector.size() &&
+        page.entryCount < UNWIND_INFO_REGULAR_SECOND_LEVEL_ENTRIES_MAX) {
----------------
why does it matter if `i < cuPtrVector.size()`?

================
Comment at: lld/MachO/UnwindInfoSection.cpp:217
       personalities.size() * sizeof(uint32_t) +
-      pageBounds.size() * sizeof(unwind_info_section_header_index_entry) +
+      (secondLevelPages.size() + 1) *
+          sizeof(unwind_info_section_header_index_entry) +
----------------
why the `+ 1`?

================
Comment at: lld/MachO/UnwindInfoSection.cpp:255
   auto *iep = reinterpret_cast<unwind_info_section_header_index_entry *>(i32p);
-  for (size_t i = 0; i < pageBounds.size() - 1; i++) {
-    iep->functionOffset = (*pageBounds[i])->functionAddress;
+  for (const auto &page : secondLevelPages) {
+    iep->functionOffset = cuPtrVector[page.entryIndex]->functionAddress;
----------------
avoid `auto` for types with simple names

================
Comment at: lld/MachO/UnwindInfoSection.cpp:279
+  auto *ep = reinterpret_cast<uint32_t *>(lep);
+  for (const auto &page : secondLevelPages) {
+    if (page.kind == UNWIND_SECOND_LEVEL_COMPRESSED) {
----------------
avoid `auto`

================
Comment at: lld/MachO/UnwindInfoSection.cpp:303-304
+      }
+      for (auto &encoding : page.localEncodings)
+        *ep++ = encoding;
+    } else {
----------------
could use `memcpy`

================
Comment at: lld/MachO/UnwindInfoSection.cpp:319
     }
-    p2p =
-        reinterpret_cast<unwind_info_compressed_second_level_page_header *>(ep);
+    memset(ep, 0, page.padByteCount);
+    ep += (page.padByteCount / sizeof(uint32_t));
----------------
so I don't know if this is spec-guaranteed behavior, but there are already several places in LLD where we assume that an unwritten section of an `mmap`'ed buffer will be zero. I.e. we don't zero things out explicitly, and it seems to work in practice. I think doing that would allow us to eliminate `padByteCount`?

================
Comment at: lld/MachO/UnwindInfoSection.cpp:322-323
   }
-  assert(getSize() ==
-         static_cast<size_t>((reinterpret_cast<uint8_t *>(p2p) - buf)));
+  // assert(getSize() ==
+  // static_cast<size_t>((reinterpret_cast<uint8_t *>(p2p) - buf)));
 }
----------------
rm / uncomment?

================
Comment at: lld/MachO/UnwindInfoSection.h:17
 #include "llvm/ADT/DenseMap.h"
+#include <unordered_map>

----------------
doesn't seem like `unordered_map` is actually used

================
Comment at: lld/MachO/UnwindInfoSection.h:42-47
+// This is a wrapper class that is otherwise like size_t, but has a
+// default constructor that yields ~0 rather than 0. I need it so that
+// lookup() of absent keys from DenseMap<..., size_tx> return ~0,
+// since 0 is a valid value.
+// TODO: If someone knows nicer way to accomplish this, please advise.
+// A simpler alternative: bias all values by 1.
----------------
why not use `find()` instead of `lookup()`?

================
Comment at: lld/MachO/UnwindInfoSection.h:97-112
 #define UNWIND_INFO_COMMON_ENCODINGS_MAX 127
+#define UNWIND_INFO_COMPACT_ENCODINGS_MAX 256

 #define UNWIND_INFO_SECOND_LEVEL_PAGE_SIZE 4096
 #define UNWIND_INFO_REGULAR_SECOND_LEVEL_ENTRIES_MAX                           \
   ((UNWIND_INFO_SECOND_LEVEL_PAGE_SIZE -                                       \
     sizeof(unwind_info_regular_second_level_page_header)) /                    \
----------------
general aesthetic suggestion: if we moved this into the .cpp file, we could do away with all the `UNWIND_INFO_` prefixes and reduce some of the (extensive) line wrapping. Probably better in a separate diff though

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93267/new/

https://reviews.llvm.org/D93267