[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries
Peter Collingbourne via llvm-dev
llvm-dev at lists.llvm.org
Wed Feb 27 13:07:33 PST 2019
On Wed, Feb 27, 2019 at 4:19 AM Peter Smith <peter.smith at linaro.org> wrote:
> When I first read this I thought that this is kind of like overlays,
> but optimising for flash size rather than memory size and with the
> dynamic linker as a simplified non-evicting overlay manager. However I
> think it might be more accurate to describe as a way of integrating
> features implemented in shared libraries into an application?
Yes, this is sort of like overlays, except that they're automatic (unlike
the linker script OVERLAY feature, which relies on manually partitioning
the program between object files) and non-overlapping (so that you can load
multiple of them). The intent is exactly that features would live in the
> Do you have an idea of how the development process would work for such
> an application? It sounds to me like this isn't going to work well for
> "here is an executable that I've already written, go find the
> bits that can be split off into partitions". I'd expect that this is more
> of a, "here is an application that has been implemented with some
> independent features implemented in shared libraries (natural but
> heavy weight partitions) that I want to be optimised together at the
> cost of (realistically) losing the shared part of the libraries.". I'm
> thinking that one possible development model for you would be first
> implement your independent feature as a shared library, the entry
> points for the partition could then be extracted from it, with perhaps
> the inputs to the shared library link step being extracted for the
> main application (eliminating duplicates between the partitions would
> be interesting).
Yes, the intent is that a developer adopting this feature could start with
an application that is already split into multiple conventional DSOs, and
then adjust their cflags and ldflags in order to link both the application
and the features in a single step. That by itself could result in code size
savings from any code that was already being statically linked into
multiple binaries. Then they would be in a position to start switching
symbol visibilities over to hidden in order to remove dynsym entries and
cause more code to be moved into the loadable partitions.
> My main concern with this is that this could end up being fragile, with
> subtle bugs that would be difficult to reason about ahead of time and
> without the feature implemented on a desktop OS it could be very
> difficult to test that a new feature would work with it, or whether a
> change to an existing feature would break it. I think we'd need some
> extensive documentation to go with it at the least.
Yes, that's a concern for me as well. The situation is not too different
from the relocation packing features, which AFAIK are only supported by the
Android and ChromeOS dynamic loaders, but admittedly this proposed feature
will probably end up touching more code. I think it is mitigated by the
fact that one of the first intended users of this feature will be part of
Chromium, and Chromium is continuously tested with top-of-tree LLVM on most
platforms including Android, which means that we should find out about any
breakage quickly. I do intend to contribute user documentation since it
will be needed in order to understand how to use the feature and we can't
simply refer to GNU documentation for this.
> Some other thoughts below:
> > We could certainly consider having multiple GOTs which are allocated to
> partitions in the same way as sections are. This might be useful if for
> example one of the partitions references a DSO that is unused by the main
> program and we need to avoid having the main program depend on the DSO. But
> I consider this an optimization over the proposed approach and not
> something that would be strictly required for correctness. I chose to omit
> this for now for the sake of simplicity and because my customer does not
> require it for now.
> Introducing multiple GOTs might make it difficult to reason about GOT
> relative relocations, is it always unambiguous which GOT a relocation
> should refer to? Can we answer that question for all targets?
I think it should be unambiguous: a GOT entry would be allocated to the
partition containing all GOT generating relocations to the symbol if there
is such a partition, or the main partition if not. I'm not sure what should
happen on MIPS which has its own weird multi-GOT thing. I'll probably just
make this feature unsupported on MIPS.
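
To make the placement rule concrete, here is a minimal sketch in Python. It is not lld code: the function name and the integer partition ids are hypothetical, and it only illustrates the "owning partition if unique, otherwise main" decision described above.

```python
MAIN = 0  # hypothetical id for the main partition


def place_got_entry(referencing_partitions):
    """Pick the partition that should hold a symbol's GOT entry.

    referencing_partitions: the partition id of every section that
    contains a GOT-generating relocation against the symbol.
    """
    parts = set(referencing_partitions)
    # All references come from a single partition: the entry lives there.
    if len(parts) == 1:
        return next(iter(parts))
    # References from several partitions (or none): use the main partition.
    return MAIN


if __name__ == "__main__":
    print(place_got_entry([2, 2, 2]))  # only partition 2 refers to the symbol
    print(place_got_entry([1, 2]))     # shared between partitions 1 and 2
```

Because the decision depends only on the set of referencing partitions, each relocation can be resolved against exactly one GOT without looking at anything else.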
> I haven't thought about how this feature will interact with linker
> scripts. At least to start with we will likely need to forbid using this
> feature together with the PHDRS or SECTIONS linker script directives.
> I'd recommend not going there unless you have a really good model of
> how it will work and what type of scripts are compatible with it. It
> would be incredibly easy to write something to break the assumptions
> that this is relying on.
> An ifunc in one of the loadable partitions could be problematic as
> there is only one PLT and GOT in the application, with the ifunc
> dynamic relocation running eagerly at application launch but no
> resolver present in memory at the time.
Yes, I probably should have realised that ifuncs could be an issue. They'll
probably all need to be placed in the main partition unless we have
> You'll also need to be careful
> to prevent any sharing of things like strings between the partitions.
In my prototype all mergeable sections will end up in the main partition
but I think we'd eventually want them to also participate in the graph
colouring (a good fraction of the loadable partition is expected to be made
up of strings). Strings would be placed similarly to GOT entries: in the
same partition if all refs come from that partition or in the main
partition if not.
> On Wed, 27 Feb 2019 at 01:34, Peter Collingbourne <peter at pcc.me.uk> wrote:
> > Hi folks,
> > I'd like to propose adding a feature to ELF lld for automatically
> partitioning a program into multiple binaries. (This will also involve
> adding a feature to clang, so I've cc'd cfe-dev as well.)
> > == Problem statement ==
> > Embedded devices such as cell phones, especially lower end devices, are
> typically highly resource constrained. Users of cell phone applications
> must pay a cost (in terms of download size as well as storage space) for
> all features that the application implements, even for features that are
> only used by a minority of users. Therefore, there is a desire to split
> applications into multiple pieces that can be downloaded independently, so
> that the majority of users only pay the cost of the commonly used features.
> This can technically be achieved using traditional ELF dynamic linking: the
> main part of the program can be compiled as an executable or DSO that
> exports symbols that are then imported by a separate DSO containing the
> part of the program implementing the optional feature. However, this itself
> imposes costs:
> > - Each exported symbol by itself imposes additional binary size costs,
> as it requires the name of the symbol and a dynamic symbol table entry to
> be stored in both the exporting and importing DSO, and on the importing
> side a dynamic relocation, a GOT entry and likely a PLT entry must be
> present. These additional costs go some way towards defeating the purpose
> of splitting the program into pieces in the first place, and can also
> impact program startup and overall performance because of the additional
> > - It can result in more code needing to appear in the main part of the
> program than necessary. For example, imagine that both the feature and the
> main program make use of a common (statically linked) library, but they
> call different subsets of the functions in that library. With traditional
> ELF linking we are forced to either link and export the entire library from
> the main program (even the functions unused by either part of the program)
> or carefully maintain a list of functions that are used by the other parts
> of the program.
> > - Since the linker does not see the whole program at once and links each
> piece independently, a number of link-time optimizations and features stop
> working, such as LTO across partition boundaries, whole-program
> devirtualization and non-cross-DSO control flow integrity (control flow
> integrity has a cross-DSO mode, but that also imposes binary size costs
> because a significant amount of metadata needs to appear in each DSO).
> > There are ways around at least the first point. For example, the program
> could arrange to use a custom mechanism for binding references between the
> main program and the feature code, such as a table of entry points.
> However, this can impose maintenance costs (for example, the binding
> mechanism can be intrusive in the source code and typically has to be
> maintained manually), and it still does not address the last point.
> > == Proposed solution ==
> > I propose to extend lld so that it can perform the required partitioning
> automatically, given a set of entry points for each part of the program.
> The end product of linking will be a main program (which can be either an
> executable or a DSO) combined with a set of DSOs that must be loaded at
> fixed addresses relative to the base address of the main program. These
> binaries will all share a virtual address space so that they can refer to
> one another directly using PC-relative references or RELATIVE dynamic
> relocations as if they were all statically linked together in the first
> place, rather than via the GOT (or custom GOT-equivalent).
> > The way that it will work is that we can extend the graph reachability
> algorithm currently implemented by the linker for --gc-sections. The entry
> points for each partition are marked up with a string naming the partition,
> either at the source level with an attribute on the function or global
> variable, or by passing a flag to the compiler (this string becomes the
> partition's soname). These symbols will act as the GC roots for the
> partition and will be exported from its dynsym. Assuming that there is a
> single partition, let's call this set of symbols S2, while all other GC
> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
> call S1. Any sections reachable from S1 are allocated to the main
> partition, while sections reachable only from S2 but not from S1 are
> allocated to S2's partition. We can extend this idea to multiple loadable
> partitions by defining S3, S4 and so on, but any sections reachable from
> multiple loadable partitions are allocated to the main partition even if
> they aren’t reachable from the main partition.
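
The reachability rule above can be sketched as follows. This is an illustrative Python model, not the lld implementation: the graph representation (section name mapped to the sections it references), the function names and the example section names are all assumptions made for the sketch.

```python
from collections import deque

MAIN = "main"  # hypothetical label for the main partition


def reachable(graph, roots):
    """Return every section reachable from the given GC roots."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen


def assign_partitions(graph, main_roots, partition_roots):
    """Map each live section to the main partition or one loadable partition.

    partition_roots: partition name -> list of marked-up entry points.
    Sections reachable from no root are omitted, as with --gc-sections.
    """
    main_set = reachable(graph, main_roots)
    owners = {}
    for name, roots in partition_roots.items():
        for section in reachable(graph, roots) - main_set:
            owners.setdefault(section, set()).add(name)
    assignment = {section: MAIN for section in main_set}
    for section, names in owners.items():
        # Reachable from exactly one loadable partition: it goes there.
        # Reachable from several: it falls back to the main partition.
        assignment[section] = next(iter(names)) if len(names) == 1 else MAIN
    return assignment


if __name__ == "__main__":
    graph = {
        "main_fn": ["util_a"],
        "feat1_entry": ["util_a", "feat1_helper", "shared_helper"],
        "feat2_entry": ["shared_helper", "feat2_helper"],
        "dead": [],
    }
    result = assign_partitions(
        graph,
        ["main_fn"],
        {"libfeature1.so": ["feat1_entry"], "libfeature2.so": ["feat2_entry"]},
    )
    for section in sorted(result):
        print(section, "->", result[section])
```

In the example, shared_helper is reachable only from the two loadable partitions, so it ends up in the main partition even though nothing in the main partition refers to it.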
> > When assigning input sections to output sections, we take into account,
> in addition to the name of the input section, the partition that the input
> section is assigned to. The SHF_ALLOC output sections are first sorted by
> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
> sections appear last and are not sorted by partition. In the end we are
> left with a collection of output sections that might look like this:
> > Main partition:
> > 0x0000 ELF header, phdrs
> > 0x1000 .rodata
> > 0x2000 .dynsym
> > 0x3000 .text
> > Loadable partition 1:
> > 0x4000 ELF header, phdrs
> > 0x5000 .rodata
> > 0x6000 .dynsym
> > 0x7000 .text
> > Loadable partition 2:
> > 0x8000 ELF header, phdrs
> > 0x9000 .rodata
> > 0xa000 .dynsym
> > 0xb000 .text
> > Non-SHF_ALLOC sections from all partitions:
> > .comment
> > .debug_info
> > (etc.)
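
The ordering described above amounts to a two-level sort key. The Python sketch below is only a model of that idea: the USUAL_RANK table is a hypothetical stand-in for the linker's real section ordering rules, and sections are represented as simple tuples.

```python
# Hypothetical stand-in for the linker's usual section ordering rules.
USUAL_RANK = {".rodata": 0, ".dynsym": 1, ".text": 2}


def order_output_sections(sections):
    """Order output sections for the combined file.

    sections: list of (name, partition_index, is_alloc) tuples, where
    partition_index 0 is the main partition.
    """
    alloc = [s for s in sections if s[2]]
    non_alloc = [s for s in sections if not s[2]]
    # SHF_ALLOC sections: sort by partition first, then by the usual rules.
    alloc.sort(key=lambda s: (s[1], USUAL_RANK[s[0]]))
    # Non-SHF_ALLOC sections come last and are not sorted by partition.
    return alloc + non_alloc


if __name__ == "__main__":
    sections = [
        (".text", 1, True),
        (".comment", 0, False),
        (".rodata", 1, True),
        (".text", 0, True),
        (".debug_info", 1, False),
        (".rodata", 0, True),
    ]
    for name, partition, _ in order_output_sections(sections):
        print(partition, name)
```

Keeping each partition's allocated sections contiguous is what makes it possible to extract a partition later by taking a slice of the combined file.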
> > Now linking proceeds mostly as usual, and we’re effectively left with a
> single .so that contains all of the partitions concatenated together. This
> isn’t very useful on its own and is likely to confuse tools (e.g. due to
> the presence of multiple .dynsyms); we can add a feature to llvm-objcopy
> that will extract the individual partitions from the output file
> essentially by taking a slice of the combined .so file. These slices can
> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
> sections are left in place.
> > The envisaged usage of this feature is as follows:
> > $ clang -ffunction-sections -fdata-sections -c main.c # compile the main program
> > $ clang -ffunction-sections -fdata-sections
> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
> > $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
> -Wl,-soname,libmain.so -Wl,--gc-sections
> > $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
> > $ llvm-objcopy libcombined.so libfeature.so --extract-partition=libfeature.so
> > On Android, the loadable partitions can be loaded with the
> android_dlopen_ext function passing ANDROID_DLEXT_RESERVED_ADDRESS to force
> it to be loaded at the correct address relative to the main partition.
> Other platforms that wish to support this feature will likely either need
> to add a similar feature to their dynamic loader or (in order to support
> loading the partitions with a regular dlopen) define a custom dynamic tag
> that will cause the dynamic loader to first load the main partition and
> then the loadable partition at the correct relative address.
> > == In more detail ==
> > Each loadable partition will require its own sections to support the
> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
> with the main partition. This means that all addresses associated with
> symbols will continue to be fixed.
> > In order to cause the dynamic loader to reserve address space for the
> loadable partitions so that they can be loaded at the correct address
> later, a PT_LOAD segment is added to the main partition that allocates a
> page of bss at the address one byte past the end of the last address in the
> last partition. In the Android dynamic loader at least, this is enough to
> cause the required space to be reserved. Other platforms would need to
> ensure that their dynamic loader implements similar behaviour.
> > I haven't thought about how this feature will interact with linker
> scripts. At least to start with we will likely need to forbid using this
> feature together with the PHDRS or SECTIONS linker script directives.
> > Some sections will need to be present in each partition (e.g. .interp
> and .note sections). Probably the most straightforward way to do this will
> be to cause the linker to create a clone of these sections for each
> > == Other use cases ==
> > An example of another use case for this feature could be an operating
> system API which is exposed across multiple DSOs. Typically these DSOs will
> be implemented using private APIs that are not exposed to the application.
> This feature would allow you to create a common DSO that contains the
> shared code implementing the private APIs (i.e. the main partition),
> together with individual DSOs (i.e. the loadable partitions) that use the
> private APIs and expose the public ones, but without actually exposing the
> private APIs in the dynamic symbol table or paying the binary size cost of
> doing so.
> > == Prototype ==
> > A prototype/proof of concept of this feature has been implemented here:
> > There is a test app in the test-progs/app directory that demonstrates
> the feature on Android with a simple hello world app (based on
> https://www.hanshq.net/command-line-android.html ). I have successfully
> tested debugging the loadable partition with gdb (e.g. setting breakpoints
> and printing globals), but getting unwinding working will need a bit more
> > Note that the feature as exposed by the prototype is different from what
> I'm proposing here, e.g. it uses a linker flag to specify which symbols go
> in which partitions. I think the best place to specify this information is
> at either the source level or the compiler flag level, so that is what I
> intend to implement.
> > Thanks,
> > --
> > Peter