[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

Tue Feb 26 17:34:42 PST 2019

Hi folks,

I'd like to propose adding a feature to ELF lld for automatically
partitioning a program into multiple binaries. (This will also involve
adding a feature to clang, so I've cc'd cfe-dev as well.)

== Problem statement ==

Embedded devices such as cell phones, especially lower end devices, are
typically highly resource constrained. Users of cell phone applications
must pay a cost (in terms of download size as well as storage space) for
all features that the application implements, even for features that are
only used by a minority of users. Therefore, there is a desire to split
applications into multiple pieces that can be downloaded independently, so
that the majority of users only pay the cost of the commonly used features.
This can technically be achieved using traditional ELF dynamic linking: the
main part of the program can be compiled as an executable or DSO that
exports symbols that are then imported by a separate DSO containing the
part of the program implementing the optional feature. However, this itself
imposes costs:
- Each exported symbol by itself imposes additional binary size costs, as
it requires the name of the symbol and a dynamic symbol table entry to be
stored in both the exporting and importing DSO, and on the importing side a
dynamic relocation, a GOT entry and likely a PLT entry must be present.
These additional costs go some way towards defeating the purpose of
splitting the program into pieces in the first place, and can also impact
program startup and overall performance because of the additional
indirections.
- It can result in more code needing to appear in the main part of the
program than necessary. For example, imagine that both the feature and the
main program make use of a common (statically linked) library, but they
call different subsets of the functions in that library. With traditional
ELF linking we are forced to either link and export the entire library from
the main program (even the functions unused by either part of the program)
or carefully maintain a list of functions that are used by the other parts
of the program.
- Since the linker does not see the whole program at once and links each
piece independently, a number of link-time optimizations and features stop
working, such as LTO across partition boundaries, whole-program
devirtualization and non-cross-DSO control flow integrity (control flow
integrity has a cross-DSO mode, but that also imposes binary size costs
because a significant amount of metadata needs to appear in each DSO).
There are ways around at least the first point. For example, the program
could arrange to use a custom mechanism for binding references between the
main program and the feature code, such as a table of entry points.
However, this can impose maintenance costs (for example, the binding
mechanism can be intrusive in the source code and typically has to be
maintained manually), and it still does not address the last point.

== Proposed solution ==

I propose to extend lld so that it can perform the required partitioning
automatically, given a set of entry points for each part of the program.
The end product of linking will be a main program (which can be either an
executable or a DSO) combined with a set of DSOs that must be loaded at
fixed addresses relative to the base address of the main program. These
binaries will all share a virtual address space so that they can refer to
one another directly using PC-relative references or RELATIVE dynamic
relocations as if they were all statically linked together in the first
place, rather than via the GOT (or custom GOT-equivalent).

The way that it will work is that we can extend the graph reachability
algorithm currently implemented by the linker for --gc-sections. The entry
points for each partition are marked up with a string naming the partition,
either at the source level with an attribute on the function or global
variable, or by passing a flag to the compiler (this string becomes the
partition's soname). These symbols will act as the GC roots for the
partition and will be exported from its dynsym. Assuming that there is a
single partition, let's call this set of symbols S2, while all other GC
roots (e.g. non-marked-up exported symbols, sections in .init_array) we
call S1. Any sections reachable from S1 are allocated to the main
partition, while sections reachable only from S2 but not from S1 are
allocated to S2's partition. We can extend this idea to multiple loadable
partitions by defining S3, S4 and so on, but any sections reachable from
multiple loadable partitions are allocated to the main partition even if
they aren’t reachable from the main partition.

When assigning input sections to output sections, we take into account, in
addition to the name of the input section, the partition that the input
section is assigned to. The SHF_ALLOC output sections are first sorted by
partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
sections appear last and are not sorted by partition. In the end we are
left with a collection of output sections that might look like this:

Main partition:
0x0000 ELF header, phdrs
0x1000 .rodata
0x2000 .dynsym
0x3000 .text

Loadable partition 1:
0x4000 ELF header, phdrs
0x5000 .rodata
0x6000 .dynsym
0x7000 .text

Loadable partition 2:
0x8000 ELF header, phdrs
0x9000 .rodata
0xa000 .dynsym
0xb000 .text

Non-SHF_ALLOC sections from all partitions:
.comment
.debug_info
(etc.)

Now linking proceeds mostly as usual, and we’re effectively left with a
single .so that contains all of the partitions concatenated together. This
isn’t very useful on its own and is likely to confuse tools (e.g. due to
the presence of multiple .dynsyms); we can add a feature to llvm-objcopy
that will extract the individual partitions from the output file
essentially by taking a slice of the combined .so file. These slices can
also be fed to tools such as debuggers provided that the non-SHF_ALLOC
sections are left in place.

The envisaged usage of this feature is as follows:
$ clang -ffunction-sections -fdata-sections -c main.c # compile the main
program
$ clang -ffunction-sections -fdata-sections
-fsymbol-partition=libfeature.so -c feature.c # compile the feature
$ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
-Wl,-soname,libmain.so -Wl,--gc-sections
$ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
$ llvm-objcopy libcombined.so libfeature.so
--extract-partition=libfeature.so

On Android, the loadable partitions can be loaded with the
android_dlopen_ext <https://developer.android.com/ndk/reference/group/libdl>
function passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at
the correct address relative to the main partition. Other platforms that
wish to support this feature will likely either need to add a similar
feature to their dynamic loader or (in order to support loading the
partitions with a regular dlopen) define a custom dynamic tag that will
cause the dynamic loader to first load the main partition and then the
loadable partition at the correct relative address.

== In more detail ==

Each loadable partition will require its own sections to support the
dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
.dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
.interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
with the main partition. This means that all addresses associated with
symbols will continue to be fixed.

In order to cause the dynamic loader to reserve address space for the
loadable partitions so that they can be loaded at the correct address
later, a PT_LOAD segment is added to the main partition that allocates a
page of bss at the address one byte past the end of the last address in the
last partition. In the Android dynamic loader at least, this is enough to
cause the required space to be reserved. Other platforms would need to
ensure that their dynamic loader implements similar behaviour.

I haven't thought about how this feature will interact with linker scripts.
At least to start with we will likely need to forbid using this feature
together with the PHDRS or SECTIONS linker script directives.

Some sections will need to be present in each partition (e.g. .interp and
.note sections). Probably the most straightforward way to do this will be
to cause the linker to create a clone of these sections for each partition.

== Other use cases ==

An example of another use case for this feature could be an operating
system API which is exposed across multiple DSOs. Typically these DSOs will
be implemented using private APIs that are not exposed to the application.
This feature would allow you to create a common DSO that contains the
shared code implementing the private APIs (i.e. the main partition),
together with individual DSOs (i.e. the loadable partitions) that use the
private APIs and expose the public ones, but without actually exposing the
private APIs in the dynamic symbol table or paying the binary size cost of
doing so.

== Prototype ==

A prototype/proof of concept of this feature has been implemented here:
https://github.com/pcc/llvm-project/tree/lld-module-symbols
There is a test app in the test-progs/app directory that demonstrates the
feature on Android with a simple hello world app (based on
https://www.hanshq.net/command-line-android.html ). I have successfully
tested debugging the loadable partition with gdb (e.g. setting breakpoints
and printing globals), but getting unwinding working will need a bit more
work.

Note that the feature as exposed by the prototype is different from what
I'm proposing here, e.g. it uses a linker flag to specify which symbols go
in which partitions. I think the best place to specify this information is
at either the source level or the compiler flag level, so that is what I
intend to implement.

Thanks,
-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190226/85c880b0/attachment.html>