[PATCH] D77283: scudo: Add support for diagnosing memory errors when memory tagging is enabled.

Fri Apr 3 20:02:28 PDT 2020

pcc added inline comments.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:39
+// level 10000).
+extern "C" size_t android_unsafe_frame_pointer_chase(scudo::uptr *buf,
+                                                     size_t num_entries);
----------------
hctim wrote:
> QQ - do we want to use per-plaftorm implementations, or do we want to cargo-cult the FP unwinder here? We'd then be able to provide stack traces (even without MTE) to all platforms. My gut says per-platform implementations should be the special case, not the generic. I don't feel strongly about this, but think it's worth raising the question.
This function needs to know the stack bounds, which is not available in general. Unless we change that, it needs to be implemented in libc.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:236
+#if SCUDO_ANDROID && __ANDROID_API__ == 10000
+    enum { kStackSize = 64 };
+    uptr Stack[kStackSize];
----------------
hctim wrote:
> Why not `constexpr`?
Oops, used the wrong style here. Fixed.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:237
+    enum { kStackSize = 64 };
+    uptr Stack[kStackSize];
+    uptr Size = android_unsafe_frame_pointer_chase(Stack, kStackSize);
----------------
cryptoad wrote:
> Style wise, the whole code base avoids the `kConstant` naming.
Ditto.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:385
           resizeTaggedChunk(PrevEnd, TaggedUserPtr + Size, BlockEnd);
+          if (ZeroContents && Size) {
+            // Clear any stack metadata that may have previously been stored in
----------------
hctim wrote:
> Zeroing the contents seems too important to be default off now that we store the previous tag (and thus the previous tag is accessible through an uninitialized-heap read). IMHO this should be done by default for MTE.
I'm fine either way, so I changed this to `if (Size)`.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:388
+            // the chunk data.
+            memset(TaggedPtr, 0, 16);
+          }
----------------
cryptoad wrote:
> Should the `16` be a constant?
Yes, we can use the granule size since we probably can't realistically fit more than a granule's worth of data here. I made this clear in the `MemTag*Index` comment as well.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:723

+  void setTrackAllocationStacks(bool Track) {
+    Options.TrackAllocationStacks = Track;
----------------
eugenis wrote:
> There is no synchronization. Is this meant to be used in single thread or stop-the-world context only?
> If we turn Options into a relaxed atomic, this could be made thread-safe, with the understanding that the other threads will change their behavior "eventually".
> 
> Either way this should be mentioned in the declaration comment.
It's single threaded for now at least, and I've added a comment in wrappers_c.inc. It seems probable that we will only need this in a stop-the-world context, so I don't think we should convert it to use atomics for now.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:724
+  void setTrackAllocationStacks(bool Track) {
+    Options.TrackAllocationStacks = Track;
+  }
----------------
hctim wrote:
> `TrackAllocationStacks` is also used to track deallocation stacks. Behaviour seems reasonable, but consider a more descriptive name?
I read "allocation stacks" as "stacks relating to an allocation" which covers deallocation stacks. Something like "allocation info" is slightly more accurate but also vague. I'll think about the name.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:739
+
+  static void getErrorInfo(struct scudo_error_info *ErrorInfo, uintptr_t Ptr,
+                           const char *DepotPtr, const char *RegionInfoPtr,
----------------
hctim wrote:
> This function makes me a little nervous, as it makes the assumption that the pointers we're provided are valid structs.
> 
> Especially in async mode, we can't make that assumption. Given that an attacker may have arbitrary write for some amount of time between fault and trap time, they have full access to whatever structs the crash handler is about to blindly use.
> 
> We should validate all these input structs for validity before operating on them. This function should be highly tested, and highly fuzzed. Thankfully, the crash handler on Android is always spawned with less privileges than the crashed process, but RCE in the crash handler is still no fun.
I think it makes sense to set up a fuzzer for this. As for testing, I feel that integration tests (like the ones that I'm adding to system/core in AOSP) are the appropriate level of testing here, because of the number of moving pieces that need to be set up for this function to work.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:751
+
+    auto GetGranule = [&](uptr Addr, const char **Data, uint8_t *Tag) -> bool {
+      if (Addr < MemoryAddr || Addr >= MemoryAddr + MemorySize)
----------------
cryptoad wrote:
> Not sure if the lambda should abide by the usual style, eg `getGranule`, up to you.
I think the style for these normally matches the style for local variables, because that's what they are in the end.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:799
+          Header.State != Chunk::State::Allocated &&
+          Data[MemTagPrevTagIndex] == PtrTag) {
+        auto *R = &ErrorInfo->reports[NextErrorReport++];
----------------
eugenis wrote:
> TBH, I'm not sure about relying on delayed block reuse for use-after-free reports. Not aggressively reusing the most recently freed block will hurt both RAM and CPU (because of worse heap locality). It feels that for any given target probability (however defined) of deallocation info being available in the event of a use-after-free, persisting smaller records (like in hwasan history buffer) would be a better trade-off.
> 
> Of course, the exception from this would be use-immediately-after-free.
That certainly seems like a concern. I thought that the quarantine might save us but it looks like it's disabled by default (due to 
https://cs.android.com/android/platform/superproject/+/master:external/scudo/standalone/flags.inc;l=13
https://cs.android.com/android/platform/superproject/+/master:external/scudo/standalone/flags.inc;l=18
both defaulting to 0). We would at least diagnose immediate UAF with this which is better than nothing.

I'm inclined to start with this since it's the simplest approach and we can experiment with other approaches in followups.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:1016
+  void storeAllocationStackMaybe(void *Ptr) {
+    if (!UNLIKELY(Options.TrackAllocationStacks))
+      return;
----------------
hctim wrote:
> Why `UNLIKELY` here? (and `storeDeallocationStackMaybe`)
Because in the usual case we aren't collecting stack traces.

================
Comment at: compiler-rt/lib/scudo/standalone/combined.h:1031
+    auto *Ptr32 = reinterpret_cast<u32 *>(Ptr);
+    Ptr32[MemTagDeallocationTraceIndex] = collectStackTrace();
+    Ptr32[MemTagDeallocationTidIndex] = getThreadID();
----------------
eugenis wrote:
> Why not store the entire deallocation stack trace here if it fits, and reduce the stack depot load by 2x?
That's an interesting idea. Since it's not clear at this point whether we're going to go with inline deallocation info this seems like followup territory.

================
Comment at: compiler-rt/lib/scudo/standalone/include/scudo/interface.h:30
+// the error_info data structure. Up to three possible causes are returned in
+// the reports array, in decreasing order of probability. The remaining elements
+// of reports are zero-initialized.
----------------
hctim wrote:
> Sorry, 'remaining elements' here meaning any `scudo_error_report` struct that's unused?
Yes.

================
Comment at: compiler-rt/lib/scudo/standalone/include/scudo/interface.h:37
+//
+// ptr is the fault address. On aarch64 this is available in the system register
+// FAR_ELx, or far_context.far in an upcoming release of the Linux kernel. This
----------------
hctim wrote:
> Why `ptr` then? Why not `fault_address`?
Yes, that would be clearer. Done (`fault_addr` to be consistent).

================
Comment at: compiler-rt/lib/scudo/standalone/include/scudo/interface.h:45
+// obtained by calling the function __scudo_get_stack_depot_addr() in the
+// crashing process. The size of the stack depot is available by calling the
+// function __scudo_get_stack_depot_size().
----------------
hctim wrote:
> "in the crashing process, generally ahead-of-time (at libc init, for example)."
It can be done at any time, even in the signal handler (these functions are just returning pointers into globals, which is safe to do anywhere). It's not clear that any particular time is better (we do it early on Android because of circumstances that are specific to bionic, and arguably later is better in order to reduce the window of memory corruption) so I wouldn't make a specific recommendation here.

================
Comment at: compiler-rt/lib/scudo/standalone/include/scudo/interface.h:55
+// The more memory available via this pointer, the more likely it is that the
+// function will be able to analyze a crash correctly.
+//
----------------
hctim wrote:
> Can you make a recommendation on how much memory we need? Should that be defined by the platform, or by Scudo (via. a callback)? I assume we need at least 3 allocations (one to the left and one to the right) to make a reasonable guess, but how many do you see us having? A constant 16? Is it some function on the sizeof the allocation?
> 
> Our top primary size class on Android is 64KiB + 16B. Assuming a static 16 allocations (8 on either side) for the crash handler to do its thing, that's a megabyte per crash. Not too bad.
> 
> Once we support secondary tagged allocations though (assuming we intend on doing this), we'd have to rethink this, no? Would it be worth having a callback of `__scudo_get_surrounding_metadata()` that, for some fault address, returns the metadata (header + any deallocation data left over) for some number of surrounding allocations alone?
I've changed this to recommend a constant 16 * the largest primary allocation size. It's true that it may be better to introduce an API here to allow scudo to specify the size of the memory region, but let's save that for a followup.

================
Comment at: compiler-rt/lib/scudo/standalone/include/scudo/interface.h:59
+// via memory. Each byte of this array corresponds to a region of memory of size
+// equal to the architecturally defined memory tag granule size (16 on aarch64).
+//
----------------
hctim wrote:
> Maybe add a comment where this is normally obtained from?
Hmm, there isn't really one at the moment. On Android I'm using an unsubmitted kernel patch to read the tags via ptrace: https://github.com/pcc/linux/commit/969d2d71099372ef68c05f29545b88bd90b2a528

Catalin will be proposing an official ptrace API for this, and at that point I will update the comment.

================
Comment at: compiler-rt/lib/scudo/standalone/include/scudo/interface.h:65
+// pointer.
+void __scudo_get_error_info(struct scudo_error_info *error_info, uintptr_t ptr,
+                            const char *stack_depot, const char *region_info,
----------------
hctim wrote:
> Meta note - I'd also add an explicit disclaimer here that this is not intended to work across API level boundaries (version_of_scudo_allocator === version_of_scudo_crash_handler).
I added it near the top.

================
Comment at: compiler-rt/lib/scudo/standalone/include/scudo/interface.h:70
+
+enum scudo_error_type {
+  UNKNOWN,
----------------
hctim wrote:
> We now have two ways that scudo reports errors. For UaF/overflow, as per this, you get the report from `__scudo_get_error_info`. For internally detected errors (invalid free, double free), you still get an abort message. We should probably unify the behaviour here, because:
> 
>  1. The metadata we collect about tags/threads/traces is useful in debugging double frees, etc.
>  2. Having different exit conditions is harder to keep track of. I can easily see a double-free that ends up doing some load/store with incorrect tags before it hits the internal trap, and subsequently we report it incorrectly (i.e. a double free now looks like an UNKNOWN or BUFFER_OVERFLOW, etc.).
This may be a good idea, but let's save it for a follow up because there would likely need to be an orthogonal mechanism for communicating the crashes to the crash handler process even if the end result looks the same to the user.

> easily see a double-free that ends up doing some load/store with incorrect tags

An invalid free maybe, but not a double free I don't think (except, possibly, for secondary allocations, but that would be a different fault). A double free passes the same pointer twice, and at both times the metadata, which is untagged, will be in the same location and accessed via a pointer which will be untagged before accessing it. Accesses to user memory, e.g. to zero the memory or store deallocation metadata, are unchecked.

================
Comment at: compiler-rt/lib/scudo/standalone/memtag.h:185
+inline uint8_t extractTag(uptr Ptr) {
+  return Ptr >> 56;
+}
----------------
hctim wrote:
> Mask TBI bits here? We don't need them, and we should allow users to have them.
I'm not sure that that's the only change that we would need to make in order to support passing TBI bits into the allocator, so I'd prefer not to unless we have some kind of test for that capability.

================
Comment at: compiler-rt/lib/scudo/standalone/stack_depot.h:42
+  HybridMutex ring_end_mu;
+  u32 ring_end;
+
----------------
hctim wrote:
> I know that this is zero-initialized by the linker in its current inclusion in the allocator class, but it's a bit roundabout to detemine this. Can you explicitly initialize these varaiables?
I think the style in scudo is to let fields be zero-initialized and only explicitly initialize (e.g. via `initLinkerInitialized`) if necessary.

================
Comment at: compiler-rt/lib/scudo/standalone/stack_depot.h:44
+
+  static const uptr kTabBits = 16;
+  static const uptr kTabSize = 1 << kTabBits;
----------------
eugenis wrote:
> hctim wrote:
> > I worry that these values aren't configurable. We might want to have a larger stack depot on some devices, if:
> > 
> >   # Scudo ends up using a stack depot for a MTE-similar feature on another architecture.
> >   # Someone ends up using stack trace collection for non-MTE scenarios (I mean, it's a cheap way to do heap profiling).
> >   # We have different devices that have more/less sensitivity to memory overhead.
> > 
> > Should we make this templated?
> > 
> > 
> Not before there are actual users of this.
Agreed.

================
Comment at: compiler-rt/lib/scudo/standalone/stack_depot.h:50
+  static const uptr kRingBits = 19;
+  static const uptr kRingSize = 1 << kRingBits;
+  static const uptr kRingMask = kRingSize - 1;
----------------
hctim wrote:
> ```
> #ifdef __LP64__
>   static const uptr kRingSize = kRingBits * 8;
> #else
>   static const uptr kRingSize = kRingBits * 4;
> #endif
> ```
> 
> ... as we can ask for trace collection on 32-bit?
You mean `static const uptr kRingSize = (1 << kRingBits) * 8;` etc.? This is the number of elements, not the size in bytes. If we wanted to shrink this on 32-bit there would be more needed than this. We would probably start by using `uptr` for the type of `ring`, but that's not enough because we're using the upper 32 bits of ring header entries to store the size. This code isn't being used on 32-bit so I'd prefer to waste most of the upper 32 bits on 32-bit for now, rather than write dead code.

================
Comment at: compiler-rt/lib/scudo/standalone/stack_depot.h:64
+    u64 entry = ring[ring_pos];
+    u64 id = (u64(end - begin) << 33) | (u64(hash) << 1) | 1;
+    if (entry == id)
----------------
hctim wrote:
> Can we please add some more documentation to this class? I get that this is saving some unique-ID to prevent us from saving the same allocation trace twice, but it would be nice if that was immediately obvious.
I've added a large comment near the top and the individual functions.

================
Comment at: compiler-rt/lib/scudo/standalone/stack_depot.h:85
+    u64 hash_with_tag_bit = (u64(hash) << 1) | 1;
+    if ((entry & 0x1ffffffff) != hash_with_tag_bit)
+      return false;
----------------
hctim wrote:
> Same thing here. What makes up this cookie should be more obvious IMHO. I get the top 31 bits are roughly a UID for the stack trace, the middle 32 bits are hash (and thus these two combined provide some hash-collision resistance), and the bottom bit is to uniquely identifiy cookies as the FP-address shouldn't have the LSB set (except for in thumb...), but putting this somewhere and typedefing this u64 seems like a good idea.
This is now covered in  the documentation. Note that the top 31 bits (the size) cannot act as a "UID" as such, they are actually necessary in order to interpret the hash, as there is no other way to determine the size.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D77283/new/

https://reviews.llvm.org/D77283