[llvm-dev] [RFC] Pagerando: Page-granularity code randomization

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Wed Jun 14 20:28:33 PDT 2017


On Wed, Jun 14, 2017 at 4:03 PM, Stephen Crane <sjc at immunant.com> wrote:

> Thanks for the ideas. I particularly like the GOT access via masking.
> However I do have some security concerns over completely eliminating
> the POT.
>

IANA security person, so take all my advice with a grain of salt!


>
> On Mon, Jun 12, 2017 at 5:48 PM, Sean Silva <chisophugis at gmail.com> wrote:
> > As long as the DSO is under some fixed size (say 1GB or 4GB or whatever)
> > then with dynamic linker collaboration you can find the GOT by rounding
> down
> > the current instruction pointer, eliminating the need for the POT. This
> > should save the need for the internal ABI stuff. As long as you are
> > shuffling sections and not spewing them all over memory you can implement
> > the randomization as an in-place shuffling of the pages and thus not
> > increase the maximal distance to the GOT.
>
> I think this is a great idea for referencing the GOT and global data.
> We should be careful that keeping the DSO in a fixed range and placing
> .rodata at a fixed alignment still allows sufficient entropy to
> mitigate guessing and disclosure attacks. Shuffling in place is
> problematic without execute-only (non-readable) code page permissions,
> since an attacker could simply do a linear scan of the DSO's code,
> disassemble and reuse code in that DSO. On platforms that support
> execute-only permissions, I think an in-place shuffle is fine.
>

The PF_RAND_PAGE flag could be defined to semantically permit the loader to
change the addresses. It wouldn't have to be exactly in place (I meant that
more as a very simple concrete thing the loader could do). In place is an
extreme case where the VA is not increased at all. At a performance cost,
you could insert unmapped pages or whatever you need for security. Just be
aware that expanding the VA footprint will cause some performance
degradation (how much remains to be measured). The cost is basically that
the hardware page table walker has to do an extra serially dependent memory
access during its page table lookups. As long as the working set fits in
the iTLB it
won't have any effect, but beyond that you will suffer some performance
hit. If you already have a prototype working, can you try taking a
measurement by just spacing the (assumed 4K) pages 2M apart vs. some
much smaller separation? That should get a reasonable measurement of the
overhead. Also, make sure to measure on a program with a substantial icache
footprint, like clang or some other large complex program.


>
> I'm not sure we can keep code page pointers in the GOT/global segment
> and still keep them hidden from an attacker with a read primitive. An
> attacker who has any global data pointer can trivially find the GOT
> and thus code page addresses if we keep them in the GOT. Even if we
> were to decouple the GOT from other address-taken global data but
> still place the GOT at a predictable location (masking off low bits),
> then it should still be fairly easy for an attacker to locate it.
>
>

This would affect POTs too, which would be easy to find from the text, no?
At the end of the day, something has to be at a predictable offset from the
text (unless you are using a TLS register or something to hold it, but that
has other, larger issues, as pointed out by one of your references).

E.g. the externally facing ABI in your proposal would be all stubs loading
the internal ABI POT register. Then the attacker would just look at what
the stub does to access it.
(Or would the stubs use TLS or something expensive to get at the POT
address?)

I'm just saying this because adding a new "GOT-like" thing is actually
pretty annoying (e.g. MIPS multi-GOT). If the changes can be limited to
putting .got.plt in a separate PT_LOAD, that would be comparatively easy
and non-invasive.
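(For reference, the separate-PT_LOAD variant is a small change. A GNU ld script fragment sketching it — illustrative only, since a real script must also enumerate the remaining headers and sections:)

```
PHDRS {
  text   PT_LOAD FLAGS(5);  /* r-x: code */
  data   PT_LOAD FLAGS(6);  /* rw-: ordinary data */
  gotplt PT_LOAD FLAGS(6);  /* rw-: .got.plt alone, so the loader
                               can map/randomize it separately */
}
SECTIONS {
  .text    : { *(.text .text.*) } :text
  .data    : { *(.data .data.*) } :data
  .got.plt : { *(.got.plt) }      :gotplt
}
```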




> Even if we have to keep the POT, eliminating the extra load from the
> POT for global access by masking the PC address should be a
> significant performance optimization.
>
> > So in the end the needed changes would be:
> > 1. compiler change to have it break up sections into 4K (or whatever)
> > chunks, inserting appropriate round-down-PC sequences for GOT access and
> > possibly a new relocation type for such GOT accesses. Add a new section
> > flag to indicate that sections should be placed in output sections of at
> > most 4K (or whatever is appropriate for the target). For
> > -ffunction-sections -fdata-sections this should only require splitting a
> > small number of sections (i.e. sections larger than 4K). There is no
> > binning in the compiler.
> > 2. linker change to respect the section flag and split output sections
> > containing input sections with such flags into multiple 4K output
> > sections. Also, set the PF_RAND_ADDR flag on such 4K output sections for
> > communicating to the dynamic linker. (extra credit: linker optimization
> > to relax GOT accesses within pages of output sections that will be split)
> > 3. runtime loader change to collect the set of PT_LOADs marked with
> > PF_RAND_ADDR and perform an in-place shuffle of their load addresses (or
> > some other randomization that doesn't massively expand the VA footprint,
> > so that round-down-PC GOT accesses will work) and also any glue needed
> > for round-down-PC GOT accesses to work.
> >
> > Asking the linker to split an output section into multiple smaller ones
> > seems like reasonably general functionality, so it should be reasonable
> > to build it right into gold (and hopefully LLD! in fact you may find LLD
> > easier to hack on at first). This also should interoperate fairly
> > transparently with any profile-guided or other section ordering
> > heuristics the linker is using as it constructs the initial output
> > sections, eliminating the need for custom LTO binning passes or custom
> > LTO integration.
>
> I originally prototyped pagerando in a way similar to this. The linker
> took individual function sections and binned them into pages,
> inserting the POT indirection at call sites by appending small stubs
> that looked up the function address and jumped to it. These stubs
> added too much overhead (code size and runtime), so I wanted to insert
> page interworking at code-generation time.
>
> As you suggest, the compiler could certainly add the indirection for
> every global access and call and leave final binning up to the linker
> itself. However, if the compiler does not know which functions will be
> binned together, it must indirect every function call, even for
> callees that will be in the same bin as the caller. Binning in the
> compiler allows us to optimize function calls inside the same bin to
> direct, PC-relative calls, which I think is a critical optimization
> for hot call sites.
>
> If we could somehow teach the linker how to rewrite indirect
> inter-page calls to direct intra-page calls, binning in the linker
> would be perfectly viable. However, I'm concerned that we can't do
> that safely in general because doing so would require correct
> disassembly and rewriting of the call site. The computation of the
> callee address may be spread across the function or stored in a
> register (e.g. for repeated calls to the same function). To me,
> rewriting these calls needs to be done at code-generation time,
> although of course I'm open to alternatives.
>
>
Eliminating GOT access is a standard linker optimization these days. Look
at e.g. R_X86_64_GOTPCRELX
(the psABI doc has examples; see "B.2 Optimize GOTPCRELX Relocations" in
https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-r252.pdf)
The linker does not need to disassemble anything because the compiler has
emitted a special relocation.
The new relocations for pagerando would indicate to the linker the
relaxation semantics for pagerando call sites.
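(For readers who haven't seen it, the relaxation in that section turns the GOT load into a direct address computation when the linker knows the target is non-preemptible; schematically, in AT&T syntax:)

```asm
# Before relaxation: load foo's address through the GOT.
movq foo@GOTPCREL(%rip), %rax    # R_X86_64_REX_GOTPCRELX
# After relaxation (foo bound locally): no GOT entry touched.
leaq foo(%rip), %rax
```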

-- Sean Silva


> Thanks,
> Stephen
>


More information about the llvm-dev mailing list