[llvm-dev] [RFC] Moving RELRO segment

Vic (Chun-Ju) Yang via llvm-dev llvm-dev at lists.llvm.org
Thu Sep 5 10:17:14 PDT 2019


On Thu, Sep 5, 2019 at 2:16 AM Rui Ueyama <ruiu at google.com> wrote:

> On Wed, Sep 4, 2019 at 2:59 AM Vic (Chun-Ju) Yang via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On 30/08/2019 13:27, David Chisnall via llvm-dev wrote:
>>
>> > On 28/08/2019 18:58, Vic (Chun-Ju) Yang via llvm-dev wrote:
>> > > This is an RFC for moving the RELRO segment. Currently, lld orders ELF
>> > > sections in the following order: R, RX, RWX, RW, and RW contains RELRO.
>> > > At run time, after RELRO is write-protected, we'd have VMAs in the order
>> > > of: R, RX, RWX, R (RELRO), RW. I'd like to propose that we move RELRO to
>> > > be immediately after the read-only sections, so that the order of VMAs
>> > > becomes: R, R (RELRO), RX, RWX, RW, and the dynamic linker would have the
>> > > option to merge the two read-only VMAs to reduce bookkeeping costs.
>> >
>> > I am not convinced by this change.  With current hardware, to make any
>> > mapping more efficient, you need both the virtual to physical
>> > translation and the permissions to be the same.
>> >
>> > Anything that is writeable at any point will be a CoW mapping that, when
>> > written, will be replaced by a different page.  Anything that is not
>> > ever writeable will be the same physical pages.  This means that the old
>> > order is (S for shared, P for private):
>> >
>> > S S P P
>> >
>> > The new order is:
>> >
>> > S P S P P
>> >
>> > This means that the translation for the shared part is *definitely* not
>> > contiguous.  Modern architectures currently (though not necessarily
>> > indefinitely) conflate protection and translation and so both versions
>> > require the same number of page table and TLB entries.
>> >
>> > This, however, is true only when you think about single-level
>> > translation.  When you consider nested paging in a VM, things get more
>> > complex because the translation is a two-stage lookup and the protection
>> > is based on the intersection of the permissions at each level.
>> >
>> > The hypervisor will typically try to use superpages for the second-level
>> > translation and so both of the shared pages have a high probability of
>> > hitting in the same PTE for the second-level translation.  The same is
>> > true for the RW and RELRO segments, because they will be allocated at
>> > the same time and any OS that does transparent superpage promotion (I
>> > think Linux does now?  FreeBSD has for almost a decade) will therefore
>> > try to allocate contiguous physical memory for the mappings if possible.
>> >
>> > I would expect your scheme to translate to more memory traffic from
>> > page-table walks in any virtualised environment and I don't see (given
>> > that you have increased address space fragmentation) where you are
>> > seeing a saving.  With RELRO as part of RW, the kernel is free to split
>> > and recombine adjacent VM objects, with the new layout it is not able to
>> > combine adjacent objects because they are backed by different storage.
>> >
>> > David
>>
>> Indeed I did not think about this case. Thanks for pointing it out! I
>> agree that with superpages this could result in worse performance and
>> memory usage. Perhaps we could consider putting this change behind a
>> build-time flag? As much as I'd like to avoid adding flags, it seems from
>> this thread that there are real-world cases that benefit from this change
>> and some that suffer from it.
>>
> If "build time" in the above sentence means a build-time configuration of
> the linker (i.e. changing a linker default setting when lld is configured
> and built), we don't have that kind of configuration in lld at all, and
> that is (I believe) considered a good thing. As long as two lld binaries
> are of the same version, they behave exactly the same however they were
> built on any OS. So, if we need to make it configurable, we should add it
> as a linker flag.
>
Yes, a linker flag is what I meant. (i.e. same lld binary)

>
> If the proposed new layout works better than the current one on a
> non-virtualized environment and behaves poorly on a virtualized environment,
> that is a tricky situation. We usually run the exact same OS and
> applications on both environments, so we have to choose one. Perhaps, we
> should first verify that the performance degradation on a VM is not
> hypothetical but real?
>
Agreed. I'll see if I can figure out how to verify this. If anyone has any
pointers on how I can get a VM to use a huge page, that'd be most welcomed.

At the same time, I'd also like to point out that there are cases where a
shared library is only used in a non-virtualized environment and never in a
virtualized environment. For example, Android has different build targets
for real devices vs virtual devices, and if we ended up with the linker
flag, one can choose to enable it on real devices and keep the current
behavior on virtual devices if the performance of virtual devices is of
concern.

>
> Vic
>>
>>
>> On Tue, Sep 3, 2019 at 10:40 AM Vic (Chun-Ju) Yang <victoryang at google.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Aug 30, 2019 at 4:54 AM Fāng-ruì Sòng <maskray at google.com>
>>> wrote:
>>>
>>>> > > Old: R RX RW(RELRO) RW
>>>> > > New: R(R+RELRO) RX RW;      R includes the traditional R part and
>>>> > > the RELRO part
>>>> > > Runtime (before relocation resolving): RW RX RW
>>>> > > Runtime (after relocation resolving): R RX RW
>>>> > >
>>>> > I actually see two ways of implementing this, and yes, what you
>>>> > mentioned here is one of them:
>>>> >   1. Move RELRO to before RX, and merge it with the R segment. This is
>>>> > what you said above.
>>>> >   2. Move RELRO to before RX, but keep it as a separate segment. This
>>>> > is what I implemented in my test.
>>>> > As I mentioned in my reply to Peter, option 1 would allow existing
>>>> > implementations to take advantage of this without any change. While I
>>>> > think this optimization is well worth it, if we go with option 1, the
>>>> > dynamic linkers won't have the choice to keep RO separate if they want
>>>> > to for whatever reason (e.g. less VM commit, finer granularity in VM
>>>> > maps, not wanting to have RO writable even for a short while.) So
>>>> > there's a trade-off to be made here (or an option to be added, even
>>>> > though we all want to avoid that if we can.)
>>>>
>>>> Then you probably meant:
>>>>
>>>> Old: R RX RW(RELRO) RW
>>>> New: R | RW(RELRO) RX RW
>>>> Runtime (before relocation resolving): R RW RX RW
>>>> Runtime (after relocation resolving): R R RX RW   ; the two R cannot be
>>>> merged
>>>>
>>>> | means a maxpagesize alignment. I am not sure whether you are going to
>>>> add it, because I still do not understand where the saving comes from.
>>>>
>>>
>>>> If the alignment is added, the R and RW maps can get contiguous
>>>> (non-overlapping) p_offset ranges. However, since the RW map is private
>>>> dirty, it cannot be merged with adjacent maps, so I am not clear how it
>>>> can save kernel memory.
>>>>
>>>
>>> My understanding (and my test result shows so) is that two VMAs can be
>>> merged even when one of them contains dirty pages. As far as I can tell
>>> from reading vma_merge() in mm/mmap.c in Linux kernel, there's nothing
>>> preventing merging consecutively mmaped regions in that case. That said, we
>>> may not care about this case too much if we decide that this change should
>>> be put behind a flag, because in that case, I think we can just go with
>>> option 1.
>>>
>>>
>>>>
>>>> If the alignment is not added, the two maps will get overlapping
>>>> p_offset ranges.
>>>>
>>>> > My test showed an overall ~1MB decrease in kernel slab memory usage
>>>> > on vm_area_struct, with about 150 processes running. For this to
>>>> > work, I had to modify the dynamic linker:
>>>>
>>>> Can you elaborate on how this decreases the kernel slab memory usage on
>>>> vm_area_struct? References to source code are very welcome :) This is
>>>> contrary to my intuition because the second R is private dirty. The
>>>> number of VMAs does not decrease.
>>>>
>>> In mm/mprotect.c, merging is done in mprotect_fixup(), which calls
>>> vma_merge() to do the actual work. In the same function you can also see
>>> that the VM_ACCOUNT flag is set for a writable VMA, which is why I had
>>> to modify the dynamic linker to make the R section temporarily writable
>>> for it to be mergeable with RELRO (they need to have the same flags to
>>> be merged.) Again, IMO all these somewhat indirect manipulations of VMAs
>>> were needed because I was hoping to give the dynamic linker an option to
>>> choose whether to take advantage of this or not. If for any reason we
>>> put this behind a build-time flag, there's no reason to jump through
>>> these hoops instead of just going with option 1.
>>>
>>>>
>>>> >   1. The dynamic linker needs to make the read-only VMA briefly
>>>> > writable in order for it to have the same VM flags as the RELRO VMA
>>>> > so that they can be merged. Specifically, VM_ACCOUNT is set when a
>>>> > VMA is made writable.
>>>>
>>>> Same question. I hope you can give a bit more details.
>>>>
>>>> > > How to lay out the segments if --no-rosegment is specified?
>>>> > > Runtime (before relocation resolving): RX RW   ;      some people
>>>> > > may be concerned about writable stuff (the relocated part) being
>>>> > > made executable
>>>> > Indeed, I think the weakening in the security aspect may be a problem
>>>> > if we are to merge RELRO into RX. Keeping the old layout would be
>>>> > preferable IMHO.
>>>>
>>>> This means the new layout conflicts with --no-rosegment.
>>>> In Driver.cpp, there should be a "... cannot be used together" error.
>>>>
>>>> > > Another problem is that in the default -z relro -z lazy (-z now not
>>>> > > specified) layout, .got and .got.plt will be separated by
>>>> > > potentially huge code sections (e.g. .text). I'm still thinking
>>>> > > about what problems this layout change may bring.
>>>> > >
>>>> > Not sure if this is the same issue as what you mentioned here, but I
>>>> > also see a comment in lld/ELF/Writer.cpp about how .rodata and
>>>> > .eh_frame should be as close to .text as possible due to fear of
>>>> > relocation overflow. If we go with option 2 above, the distance would
>>>> > have to be made larger. With option 1, we may still have some leeway
>>>> > in how to order sections within the merged RELRO segment.
>>>>
>>>> For huge executables (>2G or 3G), it may cause relocation overflows
>>>> between .text and .rodata if other large sections like .dynsym and
>>>> .dynstr are placed in between.
>>>>
>>>> I do not worry too much about overflows potentially caused by moving
>>>> PT_GNU_RELRO around. PT_GNU_RELRO is usually less than 10% of the size
>>>> of the RX PT_LOAD.
>>>>
>>> That's good to know!
>>>
>>>>
>>>> > This would be a somewhat tedious change (especially the part about
>>>> > having to update all the unit tests), but the benefit is pretty good,
>>>> > especially considering that kernel slab memory is not
>>>> > swappable/evictable. Please let me know your thoughts!
>>>>
>>>> Definitely! I have prototyped this and found that ~260 tests will need
>>>> address changes.
>>>>
>>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>