[llvm-dev] [RFC] Pagerando: Page-granularity code randomization

Wed Jun 14 16:03:11 PDT 2017

Thanks for the ideas. I particularly like the GOT access via masking.
However, I do have some security concerns about completely eliminating
the POT.

On Mon, Jun 12, 2017 at 5:48 PM, Sean Silva <chisophugis at gmail.com> wrote:
> As long as the DSO is under some fixed size (say 1GB or 4GB or whatever)
> then with dynamic linker collaboration you can find the GOT by rounding down
> the current instruction pointer, eliminating the need for the POT. This
> should save the need for the internal ABI stuff. As long as you are
> shuffling sections and not spewing them all over memory you can implement
> the randomization as an in-place shuffling of the pages and thus not
> increase the maximal distance to the GOT.

I think this is a great idea for referencing the GOT and global data.
We should be careful that keeping the DSO in a fixed range and placing
.rodata at a fixed alignment still allows sufficient entropy to
mitigate guessing and disclosure attacks. Shuffling in place is
problematic without execute-only (non-readable) code page permissions,
since an attacker could simply do a linear scan of the DSO's code,
disassemble and reuse code in that DSO. On platforms that support
execute-only permissions, I think an in-place shuffle is fine.
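
As a rough illustration of what "in place" means here, the sketch below
permutes page slots with a Fisher-Yates shuffle. Everything in it is
hypothetical (the slot array, the function names, the seed parameter),
and the xorshift generator is only a placeholder: a real loader would
need a cryptographically secure source and an unbiased range reduction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder PRNG (xorshift32). NOT suitable for real randomization. */
static uint32_t next_u32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* slot[i] is the page-sized slot that page i gets loaded into.
 * Pages only trade places, so the set of occupied slots, and with it
 * the DSO's address range, is the same before and after the shuffle. */
static void shuffle_pages_in_place(size_t *slot, size_t n, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;  /* xorshift state must be nonzero */

    for (size_t i = n; i > 1; i--) {
        size_t j = next_u32(&state) % i;  /* 0 <= j < i, slightly biased */
        assert(j < i);
        size_t tmp = slot[i - 1];
        slot[i - 1] = slot[j];
        slot[j] = tmp;
    }
}
```

Because the occupied range does not change, the distance to the GOT
stays bounded, but for the same reason a linear scan over that range
still reaches every code page if the pages are readable.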

I'm not sure we can keep code page pointers in the GOT/global segment
and still keep them hidden from an attacker with a read primitive. An
attacker who has any global data pointer can trivially find the GOT
and thus code page addresses if we keep them in the GOT. Even if we
were to decouple the GOT from other address-taken global data but
still place the GOT at a predictable location (masking off low bits),
then it should still be fairly easy for an attacker to locate it.

Even if we have to keep the POT, eliminating the extra load from the
POT for global access by masking the PC address should be a
significant performance optimization.
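
A minimal sketch of the masking step, written as plain C rather than
the instruction sequence a backend would emit. It assumes the loader
keeps the whole DSO inside one naturally aligned power-of-two window
and puts the GOT at the window base; DSO_WINDOW_SIZE and
got_base_from_pc are hypothetical names, and 1 GiB is just one of the
sizes floated above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: the DSO lives in one naturally aligned 1 GiB window
 * and the GOT sits at the base of that window. */
#define DSO_WINDOW_SIZE ((uintptr_t)1 << 30)

/* Round a code address down to the window base. No POT load needed. */
static uintptr_t got_base_from_pc(uintptr_t pc)
{
    /* Masking only works when the window size is a power of two. */
    assert((DSO_WINDOW_SIZE & (DSO_WINDOW_SIZE - 1)) == 0);
    return pc & ~(DSO_WINDOW_SIZE - 1);
}
```

Any address inside the window yields the same base, which is what
makes the access cheap. It is also why a leaked pointer into the
window would give an attacker the same answer.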

> So in the end the needed changes would be:
> 1. compiler change to have it break up sections into 4K (or whatever)
> chunks, inserting appropriate round-down-PC sequences for GOT access and
> possibly a new relocation type for such GOT accesses. Add a new section flag
> to indicate that sections should be placed in output sections of at most 4K
> (or whatever is appropriate for the target). For -ffunction-sections
> -fdata-sections this should only require splitting a small number of
> sections (i.e. larger than 4K sections). There is no binning in the
> compiler.
> 2. linker change to respect the section flag and split output sections
> containing input sections with such flags into multiple 4K output sections.
> Also, set the PF_RAND_ADDR flag on such 4K output sections for communicating
> to the dynamic linker. (extra credit: linker optimization to relax GOT
> accesses within pages of output sections that will be split)
> 3. runtime loader change to collect the set of PT_LOAD's marked with
> PF_RAND_ADDR and perform an in-place shuffle of their load addresses (or
> some other randomization that doesn't massively expand the VA footprint so
> that round-down-PC GOT accesses will work) and also any glue needed for
> round-down-PC GOT accesses to work.
>
> Asking the linker to split an output section into multiple smaller ones
> seems like reasonably general functionality, so it should be reasonable to
> build it right into gold (and hopefully LLD! in fact you may find LLD easier
> to hack on at first). This also should interoperate fairly transparently
> with any profile-guided or other section ordering heuristics the linker is
> using as it constructs the initial output sections, eliminating the need for
> custom LTO binning passes or custom LTO integration.

My original pagerando prototype worked much like this. The linker
took individual function sections and binned them into pages,
inserting the POT indirection at call sites by appending small stubs
that looked up the function address and jumped to it. These stubs
added too much overhead (code size and runtime), so I wanted to insert
page inter-work at code generation time.
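
For readers unfamiliar with the POT indirection, here is a sketch of
the lookup such a stub has to perform before it can jump. This is not
the prototype's code. The struct layouts and names are hypothetical,
and it assumes the POT holds one loader-filled base address per
randomized page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one entry per randomized page (bin), filled in
 * by the loader with the address where that page ended up. */
struct pot {
    const uintptr_t *page_base;
    size_t num_pages;
};

/* Hypothetical link-time constant naming a function by bin and offset. */
struct func_ref {
    size_t page_index;
    uintptr_t offset;
};

/* One extra load from the POT plus an add, then the stub would jump. */
static uintptr_t resolve_via_pot(const struct pot *pot, struct func_ref f)
{
    assert(f.page_index < pot->num_pages);
    return pot->page_base[f.page_index] + f.offset;
}
```

Emitting this sequence at the call site instead of in a separate stub
avoids the extra jump and the per-callee stub code.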

As you suggest, the compiler could certainly add the indirection for
every global access and call and leave final binning up to the linker
itself. However, if the compiler does not know which functions will be
binned together, it must indirect every function call, even for
callees that will be in the same bin as the caller. Binning in the
compiler allows us to optimize function calls inside the same bin to
direct, PC-relative calls, which I think is a critical optimization
for hot call sites.
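
The decision can be sketched as below. The enum, the BIN_UNKNOWN
sentinel, and the function name are hypothetical and only illustrate
the argument; they are not taken from the actual patches.

```c
#include <assert.h>

enum call_kind { CALL_DIRECT_PCREL, CALL_INDIRECT_POT };

/* Hypothetical sentinel: bin assignment is not known at codegen time,
 * as when binning is left entirely to the linker. */
#define BIN_UNKNOWN (-1)

static enum call_kind select_call_kind(int caller_bin, int callee_bin)
{
    assert(caller_bin >= BIN_UNKNOWN && callee_bin >= BIN_UNKNOWN);

    /* Same bin means same page, so the caller-to-callee distance is
     * fixed no matter where the page lands: a direct call is safe. */
    if (caller_bin != BIN_UNKNOWN && caller_bin == callee_bin)
        return CALL_DIRECT_PCREL;

    /* Different or unknown bins: the distance changes with the
     * shuffle, so the call has to go through the POT. */
    return CALL_INDIRECT_POT;
}
```

With linker-only binning every call site sees BIN_UNKNOWN and falls
into the indirect case, which is the cost described above.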

If we could somehow teach the linker how to rewrite indirect
inter-page calls to direct intra-page calls, binning in the linker
would be perfectly viable. However, I'm concerned that we can't do
that safely in general because doing so would require correct
disassembly and rewriting of the call site. The computation of the
callee address may be spread across the function or stored in a
register (e.g. for repeated calls to the same function). To me,
rewriting these calls needs to be done at code-generation time,
although of course I'm open to alternatives.

Thanks,
Stephen