<div dir="ltr"><div dir="ltr"><div dir="ltr">On Tue, Feb 26, 2019 at 6:41 PM Eli Friedman <<a href="mailto:efriedma@quicinc.com">efriedma@quicinc.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


<div lang="EN-US">

<div class="gmail-m_-8133768991413943469WordSection1">

<p class="MsoNormal">This seems like a very complicated approach… do you have some numbers to give some idea how much of an improvement we’re talking about here over a more conventional solution involving shared libraries?  Or have you not gotten that far?</p></div></div></blockquote><div><br></div><div>I can talk to my internal customer to see what kind of overhead they were seeing. But I do know that at the start of the project they did evaluate using regular dynamic linking for the feature partitions, and that was quickly rejected in favour of other approaches due to the code size and maintenance overhead. And with control flow integrity the binary size of the cross-DSO metadata dwarfed the binary size savings that they were hoping to gain by splitting their program in two.</div><div><br></div><div>Furthermore, there are things that you simply cannot do with a more conventional approach, such as optimizations relying on whole-program information (like whole-program devirtualization, which helps significantly in my customer's program).</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div lang="EN-US"><div class="gmail-m_-8133768991413943469WordSection1">

<p class="MsoNormal">What’s the tradeoff involved in the specific sections you chose to split?  It seems like it would be possible to, for example, split the GOT, or avoid splitting the relocation/EH/etc. sections.  Some variation would require different runtime

 support, I guess.</p></div></div></blockquote><div><br></div><div>We could certainly consider having multiple GOTs which are allocated to partitions in the same way as sections are. This might be useful if for example one of the partitions references a DSO that is unused by the main program and we need to avoid having the main program depend on the DSO. But I consider this an optimization over the proposed approach and not something that would be strictly required for correctness. I chose to omit this for now for the sake of simplicity and because my customer does not require it for now.</div><div><br></div><div>I think we need to split the dynamic relocation section because otherwise the dynamic loader will try to relocate the unreadable memory of the other partitions and cause a SIGSEGV. Similarly, we need to split the EH sections because unwinders will generally expect to be able to find the unwind info for a function by enumerating PT_LOADs to map an address onto a DSO and then using that DSO's PT_ARM_EXIDX/PT_GNU_EH_FRAME to find the unwind info. See for example what libunwind does:</div><div><div><a href="https://github.com/llvm/llvm-project/blob/e739ac0e255597d818c907223034ddf3bc18a593/libunwind/src/AddressSpace.hpp#L523">https://github.com/llvm/llvm-project/blob/e739ac0e255597d818c907223034ddf3bc18a593/libunwind/src/AddressSpace.hpp#L523</a></div></div><div><br></div><div>As you point out, the latter part could vary based on the runtime, but I don't see a strong reason to do it another way.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div lang="EN-US"><div class="gmail-m_-8133768991413943469WordSection1">

<p class="MsoNormal">It looks like this doesn’t include a proposal for the corresponding LLVM IR extension?  I think it might be sort of complicated to define correctly… specifically, in terms of what it means to “use” a function or global from a different

 partition (so the program doesn’t try to speculatively access something which isn’t loaded).  This could come up even without LTO if you have C++ inline functions, since all functions with weak linkage have to be in the first partition.  (At least, I think

 they do, unless you invent a new kind of “partition” visibility for this.)</p></div></div></blockquote><div><br></div><div>The idea here is that for code to "use" a function or global is exactly the same thing as having a relocation pointing to it. This is the same principle that is used to implement --gc-sections.</div><div><br></div><div>So for a program to end up accessing a section (speculatively or otherwise) there needs to be a chain of relocations referring to it from the entry points. That would force the section into either the main partition or the same partition as the referent.</div><div><br></div><div>Another way to think about it is: when I load the main partition into memory, I have loaded all code that is reachable from the main partition's entry points. Now I dynamically load a feature partition. I've now loaded all code that is reachable from the combination of the main partition and the feature partition's entry points. That's pretty much the same thing as having first loaded a conventional ELF DSO linked with --gc-sections with just the main partition's entry points, and then replacing it with a second DSO linked with --gc-sections with the main partition + feature partition's entry points, except that none of the addresses in the main partition happen to have changed. So if --gc-sections works, this should work too.</div><div><br></div><div>You might be wondering: what happens if I directly reference one of the feature partition's entry points from the main partition? Well, something interesting will happen. The feature partition's dynamic symbol table will contain an entry for the entry point, but the entry's address will point inside the main partition. This should work out just fine because the main partition is guaranteed to be loaded if the feature partition is also loaded. (Which is the same reason why direct pc-relative references from the feature partition to the main partition will also work.)</div><div><br></div><div>I don't think any significant IR extensions are necessary here, except perhaps for the part involving attaching the -fsymbol-partition names to globals, but I think that part is mostly trivial and it would probably end up looking like the custom section name field.</div><div><br></div><div>I'm not sure I understand how weak linkage is impacted here. With this nothing special happens inside the linker until we start handling --gc-sections, and by that time weak/strong resolution has already happened. In ELF, dynamic loaders do not care about symbol bindings (except for weak undefined symbols), so we get the same result whether the symbols are weak or not.</div><div><br></div><div>Peter</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div lang="EN-US"><div class="gmail-m_-8133768991413943469WordSection1"><p class="MsoNormal"><u></u><u></u></p>

<p class="MsoNormal"><u></u> <u></u></p>

<p class="MsoNormal">-Eli<u></u><u></u></p>

<p class="MsoNormal"><u></u> <u></u></p>

<div style="border-top:none;border-right:none;border-bottom:none;border-left:1.5pt solid blue;padding:0in 0in 0in 4pt">

<div>

<div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(225,225,225);padding:3pt 0in 0in">

<p class="MsoNormal"><b>From:</b> cfe-dev <<a href="mailto:cfe-dev-bounces@lists.llvm.org" target="_blank">cfe-dev-bounces@lists.llvm.org</a>> <b>On Behalf Of

</b>Peter Collingbourne via cfe-dev<br>

<b>Sent:</b> Tuesday, February 26, 2019 5:35 PM<br>

<b>To:</b> llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>>; cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>><br>

<b>Cc:</b> George Rimar <<a href="mailto:grimar@accesssoftek.com" target="_blank">grimar@accesssoftek.com</a>><br>

<b>Subject:</b> [EXT] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries<u></u><u></u></p>

</div>

</div>

<p class="MsoNormal"><u></u> <u></u></p>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<p class="MsoNormal">Hi folks,<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">I'd like to propose adding a feature to ELF lld for automatically partitioning a program into multiple binaries. (This will also involve adding a feature to clang, so I've cc'd cfe-dev as well.)<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">== Problem statement ==<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Embedded devices such as cell phones, especially lower end devices, are typically highly resource constrained. Users of cell phone applications must pay a cost (in terms of download size as well as storage space) for all features that the

 application implements, even for features that are only used by a minority of users. Therefore, there is a desire to split applications into multiple pieces that can be downloaded independently, so that the majority of users only pay the cost of the commonly

 used features. This can technically be achieved using traditional ELF dynamic linking: the main part of the program can be compiled as an executable or DSO that exports symbols that are then imported by a separate DSO containing the part of the program implementing

 the optional feature. However, this itself imposes costs:<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">- Each exported symbol by itself imposes additional binary size costs, as it requires the name of the symbol and a dynamic symbol table entry to be stored in both the exporting and importing DSO, and on the importing side a dynamic relocation,

 a GOT entry and likely a PLT entry must be present. These additional costs go some way towards defeating the purpose of splitting the program into pieces in the first place, and can also impact program startup and overall performance because of the additional

 indirections.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">- It can result in more code needing to appear in the main part of the program than necessary. For example, imagine that both the feature and the main program make use of a common (statically linked) library, but they call different subsets

 of the functions in that library. With traditional ELF linking we are forced to either link and export the entire library from the main program (even the functions unused by either part of the program) or carefully maintain a list of functions that are used

 by the other parts of the program.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">- Since the linker does not see the whole program at once and links each piece independently, a number of link-time optimizations and features stop working, such as LTO across partition boundaries, whole-program devirtualization and non-cross-DSO

 control flow integrity (control flow integrity has a cross-DSO mode, but that also imposes binary size costs because a significant amount of metadata needs to appear in each DSO).<u></u><u></u></p>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">There are ways around at least the first point. For example, the program could arrange to use a custom mechanism for binding references between the main program and the feature code, such as a table of entry points. However, this can impose

 maintenance costs (for example, the binding mechanism can be intrusive in the source code and typically has to be maintained manually), and it still does not address the last point.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">== Proposed solution ==<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">I propose to extend lld so that it can perform the required partitioning automatically, given a set of entry points for each part of the program. The end product of linking will be a main program (which can be either an executable or a

 DSO) combined with a set of DSOs that must be loaded at fixed addresses relative to the base address of the main program. These binaries will all share a virtual address space so that they can refer to one another directly using PC-relative references or RELATIVE

 dynamic relocations as if they were all statically linked together in the first place, rather than via the GOT (or custom GOT-equivalent).<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">The way that it will work is that we can extend the graph reachability algorithm currently implemented by the linker for --gc-sections. The entry points for each partition are marked up with a string naming the partition, either at the

 source level with an attribute on the function or global variable, or by passing a flag to the compiler (this string becomes the partition's soname). These symbols will act as the GC roots for the partition and will be exported from its dynsym. Assuming that

 there is a single partition, let's call this set of symbols S2, while all other GC roots (e.g. non-marked-up exported symbols, sections in .init_array) we call S1. Any sections reachable from S1 are allocated to the main partition, while sections reachable

 only from S2 but not from S1 are allocated to S2's partition. We can extend this idea to multiple loadable partitions by defining S3, S4 and so on, but any sections reachable from multiple loadable partitions are allocated to the main partition even if they

 aren’t reachable from the main partition.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<div>

<p class="MsoNormal">When assigning input sections to output sections, we take into account, in addition to the name of the input section, the partition that the input section is assigned to. The SHF_ALLOC output sections are first sorted by partition, and

 then by the usual sorting rules. As usual, non-SHF_ALLOC sections appear last and are not sorted by partition. In the end we are left with a collection of output sections that might look like this:<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Main partition:<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x0000 ELF header, phdrs<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x1000 .rodata<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x2000 .dynsym<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x3000 .text<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Loadable partition 1:<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x4000 ELF header, phdrs<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x5000 .rodata<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x6000 .dynsym<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x7000 .text<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Loadable partition 2:<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x8000 ELF header, phdrs<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0x9000 .rodata<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0xa000 .dynsym<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">0xb000 .text<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Non-SHF_ALLOC sections from all partitions:<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">.comment<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">.debug_info<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">(etc.)<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Now linking proceeds mostly as usual, and we’re effectively left with a single .so that contains all of the partitions concatenated together. This isn’t very useful on its own and is likely to confuse tools (e.g. due to the presence of

 multiple .dynsyms); we can add a feature to llvm-objcopy that will extract the individual partitions from the output file essentially by taking a slice of the combined .so file. These slices can also be fed to tools such as debuggers provided that the non-SHF_ALLOC

 sections are left in place.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">The envisaged usage of this feature is as follows:<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">$ clang -ffunction-sections -fdata-sections -c main.c # compile the main program<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">$ clang -ffunction-sections -fdata-sections -fsymbol-partition=libfeature.so -c feature.c # compile the feature<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">$ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so -Wl,-soname,libmain.so -Wl,--gc-sections<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">$ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">$ llvm-objcopy libcombined.so libfeature.so --extract-partition=libfeature.so<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">On Android, the loadable partitions can be loaded with the <a href="https://developer.android.com/ndk/reference/group/libdl" target="_blank">

android_dlopen_ext</a> function passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the correct address relative to the main partition. Other platforms that wish to support this feature will likely either need to add a similar feature to their

 dynamic loader or (in order to support loading the partitions with a regular dlopen) define a custom dynamic tag that will cause the dynamic loader to first load the main partition and then the loadable partition at the correct relative address.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">== In more detail ==<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Each loadable partition will require its own sections to support the dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr, .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash, .interp, .rela.dyn, .relr.dyn),

 but will be able to share a GOT and PLT with the main partition. This means that all addresses associated with symbols will continue to be fixed.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">In order to cause the dynamic loader to reserve address space for the loadable partitions so that they can be loaded at the correct address later, a PT_LOAD segment is added to the main partition that allocates a page of bss at the address

 one byte past the end of the last address in the last partition. In the Android dynamic loader at least, this is enough to cause the required space to be reserved. Other platforms would need to ensure that their dynamic loader implements similar behaviour.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">I haven't thought about how this feature will interact with linker scripts. At least to start with we will likely need to forbid using this feature together with the PHDRS or SECTIONS linker script directives.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Some sections will need to be present in each partition (e.g. .interp and .note sections). Probably the most straightforward way to do this will be to cause the linker to create a clone of these sections for each partition.<u></u><u></u></p>

</div>

<div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">== Other use cases ==<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">An example of another use case for this feature could be an operating system API which is exposed across multiple DSOs. Typically these DSOs will be implemented using private APIs that are not exposed to the application. This feature would

 allow you to create a common DSO that contains the shared code implementing the private APIs (i.e. the main partition), together with individual DSOs (i.e. the loadable partitions) that use the private APIs and expose the public ones, but without actually

 exposing the private APIs in the dynamic symbol table or paying the binary size cost of doing so.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">== Prototype ==<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">A prototype/proof of concept of this feature has been implemented here: <a href="https://github.com/pcc/llvm-project/tree/lld-module-symbols" target="_blank">https://github.com/pcc/llvm-project/tree/lld-module-symbols</a><u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">There is a test app in the test-progs/app directory that demonstrates the feature on Android with a simple hello world app (based on <a href="https://www.hanshq.net/command-line-android.html" target="_blank">https://www.hanshq.net/command-line-android.html</a>

 ). I have successfully tested debugging the loadable partition with gdb (e.g. setting breakpoints and printing globals), but getting unwinding working will need a bit more work.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Note that the feature as exposed by the prototype is different from what I'm proposing here, e.g. it uses a linker flag to specify which symbols go in which partitions. I think the best place to specify this information is at either the

 source level or the compiler flag level, so that is what I intend to implement.<u></u><u></u></p>

</div>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">Thanks,<u></u><u></u></p>

</div>

</div>

<div>

<p class="MsoNormal">-- <u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">-- <u></u><u></u></p>

</div>

<div>

<p class="MsoNormal">Peter<u></u><u></u></p>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>


</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">-- <div>Peter</div></div></div></div></div>