[LLVMdev] Proposal: MCLinker - an LLVM integrated linker

Mon Nov 7 13:54:44 PST 2011

On Nov 2, 2011, at 11:05 PM, Don Quixote de la Mancha wrote:

> A helpful link-time optimization would be to place subroutines that
> are used close together in time also close together in the executable
> file.  That also goes for data that is in the executable file, whether
> initialized (.data segment) or zero-initialized (.bss).
> 
> If the unit of linkage of code is the function rather than the
> compilation module, and the unit of linkage of data is the individual
> data item rather than all the .data and .bss items together that are
> in a compilation unit, you could rearrange them at will.
> 
> For architectures such as ARM that cannot make jumps to faraway
> addresses, you could make the destinations of subroutine calls close
> to the caller so you would not need so many trampolines.
> 
> The locality improves the speed because the program would use the code
> and data caches more efficiently, and would page in data and code from
> disk less often.
> 
> Having fewer physically resident pages also saves on precious kernel
> memory.  I read in O'Reilly's "Understanding the Linux Kernel" that on
> the i386 architecture, the kernel's page tables consume most of the
> physical memory in the computer, leaving very little physical memory
> for user processes!
> 
> A first cut would be to start with the runtime program startup code,
> which for C program then calls main().  The subroutines that main
> calls would be placed next in the file.  Suppose main calls Foo() and
> then Bar().  One would then place each of the subroutines that Foo()
> calls all together, then each of the subroutines that Bar() calls.
This static analysis does not capture virtual calls (either C++ or
Objective-C).  It may also causing error handling code to be moved into
the "hot" area.   

> 
> It would be best if some static analysis were performed to determine
> in what order subroutines are called, and in what order .data and .bss
> memory is accessed.
> 
> Getting that analysis right for the general case would not be easy, as
> the time-order in which subroutines are called may of course depend on
> the input data.  To improve the locality, one could produce an
> instrumented executable which saved a stack trace at the entry of each
> subroutine.  Examination of all the stack traces would enable a
> post-processing tool to generate a linker script that would be used
> for a second pass of the linker.  This is a form of profiler-guided
> optimization.
At Apple we generate "order files" by running a program under dtrace. 
We use a feature of dtrace that sets a "one shot" break point on the 
start of every function.  You then run the program (under dtrace) and
you get a list of functions in the order they were first called with 
no need to build a special version of the program.  

Given that the optimal ordering is dependent on what the user does
to exercise different parts of the program, we've concluded the minimal
ordering is to just get initialization functions ordered.  This also helps
programs launch faster (less paging).

> 
> For extra credit one could prepare multiple input files (or for
> interactive programs, several distinctly different GUI robot scripts).
> Then the tool that prepared the linker script would try to optimize
> for the average case for most code.
>