[PATCH] D64903: [ELF] Pad the last page of last PF_X PT_LOAD with traps when -z separate-code is specified

Sun Jul 21 20:09:53 PDT 2019

MaskRay added a comment.

> First, all the differences are demonstrated on uncompressed binaries, but if the binary size was really important for you, you'd always use compression in practice.

This works on platforms where you control the whole stack, but the practice may not apply elsewhere, e.g. if you provide components to another Android package when they require the pre-compression component to be smaller than 1M. This is a fairly reasonable request because the package itself will be zipped, so they ask you to control the pre-compression size. You also pay the cost to uncompress the binaries before executing them. It the binary is not uncompressed on the disk, if you run multiple instances of the same executable, the image cannot be shared.

> With D64906 <https://reviews.llvm.org/D64906> and D64930 <https://reviews.llvm.org/D64930> applied:

Thanks for your testing! I think the result was measured before I updated a newer revision where I removed one more alignment boundary (after I realized the two RW (relro and non-relro) can overlap as well).

In my experiments, the size difference is now 200760-6672 bytes, which is close to 3*65536, the theoretical maximum amount of saving due the removal of 3 alignment boundaries.

If we can only utilize half of the maxmimum saving, 3*65536/2 = 96k for each binary will still be significant for a distribution (think: every /usr/bin/ and /usr/lib/ binary can benefit from it).

> Second, with -z noseparate-code, you'll end up with a partial page wasting additional memory at runtime So the difference between -z separate-code and -z noseparate-code isn't really about security vs space as is presented in the description, it's primarily a trade-off between memory and disk space.

The -z noseparate-code case may or may not load one more page. If the map aligned to a page boundary occupies [0x10000, 0x10200), with an offset, it occupies [0x11000, 0x11200), both are mapped as [0x10000, 0x20000) at runtime. If you uncompress your binary before loading it, I bet there is no timing difference at all because the uncompressed image is already in the buffer cache. The PT_LOAD segments overlap. Moreover, the binary size with the -z noseparate-code case is smaller, therefore the amount of disk IO is less.

> Third, there's the security aspect. I agree that the reduction of the number of ROP gadgets is limited, but that doesn't mean we shouldn't do it. I've consulted this change with our security team today and their response is that we always want -z separate-code! Why? It's a basic security hygiene, best practice, defense in depth kind of measure. It's the same reason why we fill space between functions with traps, why we use stack protectors, why we use ASLR, etc. We have known attacks against each of those mechanisms, and not all of them have proven as effective as was believed in the past, but that doesn't mean we're going to disable them. The same applies here.

I am concerned with security and I know a little writing ROP gadgets, but I am not sure the the protection provided by -z separate-code can be compared with other time-tested mitigations like ASLR/stack protector/read_implies_exec (PT_GNU_STACK). My CTF team (venerable on ctftime.org) has lots of experience exploiting binaries. I have consulted them and the feedback is that one does not to exploit the traditional RO part (mapped as RX) to obtain various types of gadgets. Only in one case they had to use a svc instruction in the RO part in thumb mode. However, I am still in favor of lld's current R & RX layout: data traps immediately if you try to call it as functions.

> With all that being said, if there's a combination of flag that would allow preserving the current behavior, we'd set those in our driver like we already do for -z rodynamic, so I'm not going to block this change.

There is one of my favorite lld specific option :) If gdb doesn't use DT_DEBUG, I really hope we can switch to -z rodynamic for all architectures. The writable .dynamic was pretty much a mistake.

> One potential solution might be to control the default behavior with a CMake option which is similar to the solution also used in binutils by ld (--enable-separate-code/--disable-separate-code). While this isn't being used to control the defaults in lld at the moment, it's used elsewhere in LLVM for that purpose (e.g. ENABLE_LINKER_BUILD_ID or ENABLE_X86_RELAX_RELOCATIONS in Clang) and enables each toolchain vendor to choose their defaults.

Making clangDriver handle the option sounds a good idea. Doing it in lld may not be a very good solution (see more discussions on D56650 <https://reviews.llvm.org/D56650>).

You have a great understanding on the tooling and what defaults are the best on your platform. As a linker default, I think we should probably care more about what regular users want (you have a platform where you control the whole stack and easily tune llvm/clang/libcxx/compiler-rt default whenever applicable but a regular user/distribution doesn't have to motivation/interests to tune the defaults). Increasing every /usr/bin /usr/lib binary by 192kb (or 96kb) for the reduction in the potential (hypothetical) gadgets in RO regions, IMHO is not a worthwhile trade-off for many users. (That said, I do appreciate your efforts to make your platform as secure as possible.) This -z noseparate choice, as the most practical default today, can of course by flipped if it is proven to be a bad choice in the future.

Repository:
  rLLD LLVM Linker

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D64903/new/

https://reviews.llvm.org/D64903