[PATCH] D61613: [LLD][ELF] Add the -z ifunc-noplt option

Wed May 8 08:51:04 PDT 2019

markj added a comment.

In D61613#1494472 <https://reviews.llvm.org/D61613#1494472>, @MaskRay wrote:

> > FreeBSD makes use of the option when linking x86 kernels, which now make heavy use of ifuncs in hot paths. Without the option, each ifunc results in the generation of a PLT entry, so calls needlessly incur the extra overhead of an indirect jump. With the option, ifunc calls have the same cost as ordinary function calls.
>
> Oh I recall this one and find your efforts to push forward this.
>
> I remember I asked if you got numbers of the improvement and you didn't have an answer then. I still hope you can share the benchmark to justify this is a useful feature.

See below.

> According to my understanding, `-z ifunc-noplt` will introduce many text relocations (a kind of dynamic relocations), that will slow down the dynamic loader. The overhead resolving the text relocation pays off after the function has been called a few number of times. Surely you have hot paths, but many others are cold. The problem of `-z ifunc-noplt` is that it is too large a hammer - it applies to the whole module.

In our use case, the relocations are applied at boot time. My kernel has 3815 text relocations, all for ifuncs, which take 418672 cycles to resolve as measured by rdtsc on a Xeon E5-2630v3. The overhead is negligible and I expect it will stay that way.
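For a sense of scale, the figures above work out to roughly 110 cycles per relocation. The sketch below redoes that arithmetic; the function names are hypothetical, and the 2.4 GHz TSC frequency in the usage note is an assumption (the nominal clock of that CPU model), not a number from the measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Average cost of resolving one relocation, in TSC cycles. */
double cycles_per_reloc(uint64_t total_cycles, uint64_t nrelocs)
{
    assert(nrelocs > 0);
    return (double)total_cycles / (double)nrelocs;
}

/* Convert a cycle count to microseconds for a given TSC frequency in Hz. */
double cycles_to_usec(uint64_t total_cycles, double tsc_hz)
{
    assert(tsc_hz > 0.0);
    return (double)total_cycles * 1e6 / tsc_hz;
}
```

With the reported numbers, `cycles_per_reloc(418672, 3815)` is about 109.7, and `cycles_to_usec(418672, 2.4e9)` is about 174 microseconds of boot time under the assumed frequency.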

I agree that -zifunc-noplt is not suitable for userspace applications, which must decide between the overhead of PLT calls and text relocations. In the kernel or other freestanding environments, text relocations do not have the same downsides, so the tradeoff is different.

> If you extend the usage to userspace applications you may have to tune the runtime code of static linked programs.
> 
> If you want a fast memcpy, I wonder why you can't define a global function pointer: `void (*fast_memcpy)(void*,const void*,size_t) __attribute__((visibility("hidden"))) = slow_memcpy;` and initializes it to a fast implementation at runtime.

This still requires an indirect branch and has the added disadvantage that control flow is directed by pointers residing in writable memory.
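A minimal sketch of the function-pointer alternative being discussed, to make both objections concrete. All names other than `fast_memcpy` and `slow_memcpy` are hypothetical, and the "fast" variant simply defers to libc as a stand-in for a CPU-specific routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Baseline implementation: plain byte-by-byte copy. */
void *slow_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n-- > 0)
        *d++ = *s++;
    return dst;
}

/* Hypothetical stand-in for a CPU-specific optimized routine. */
void *fast_memcpy_impl(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* The pointer lives in writable data and starts at the slow version. */
void *(*fast_memcpy)(void *, const void *, size_t)
    __attribute__((visibility("hidden"))) = slow_memcpy;

/* Run once at startup, after CPU features are known. */
void select_memcpy(int cpu_has_fast_path)
{
    fast_memcpy = cpu_has_fast_path ? fast_memcpy_impl : slow_memcpy;
}

/* A call site: load the pointer from memory, then branch indirectly. */
void copy_via_pointer(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    fast_memcpy(dst, src, n);
}
```

Every call in `copy_via_pointer` still loads the target from memory and branches through it, and anything able to overwrite `fast_memcpy` can redirect control flow.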

> Compared with the usual ifunc: GOT+PLT, this approach is similar to `-fno-plt`. I'd like to know why you think it doesn't meet your needs and you really need `-z ifunc-noplt`.

A function pointer still has extra, unnecessary overhead. We use ifuncs as a micro-optimization to avoid having CPU feature bit tests in commonly used functions. For instance, our copyout() implementation contains special logic for processors supporting SMAP. The overhead of testing the CPU feature bit is not high, but it is measurable in micro-benchmarks, and the test consumes CPU resources (extra I-cache footprint, branch predictor cache, etc.) to compute a result that is known at boot time. Over time, as vendors introduce new CPU features, our code accrues more and more of these tests. ifuncs are an attractive mitigation for this problem, but if they use the PLT then we are still sacrificing CPU resources needlessly: the PLT occupies space in the instruction cache and the indirect call targets occupy space in a cache. -zifunc-noplt completely removes these overheads. I understand that they are small, but they accumulate as the code base grows and introduce cognitive overhead as programmers must think about the hidden cost behind what looks like an ordinary C function call.
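The contrast between an inline feature test and an ifunc can be sketched as follows. This is a userspace illustration, not the FreeBSD kernel code: it assumes GCC or Clang targeting an ELF platform with GNU ifunc support, and `cpu_has_smap()`, `copyout_std()`, `copyout_smap()` and `copyout_branchy()` are hypothetical names. The probe is a stub that always reports no SMAP.

```c
#include <stddef.h>
#include <string.h>

typedef int copy_fn(const void *src, void *dst, size_t len);

/* Hypothetical variant for CPUs without SMAP. */
static int copyout_std(const void *src, void *dst, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

/* Hypothetical SMAP variant; a kernel would add SMAP-specific handling. */
static int copyout_smap(const void *src, void *dst, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

/* Stub feature probe (assumption: always reports "no SMAP"). */
static int cpu_has_smap(void)
{
    return 0;
}

/* Feature test on every call: the branch is re-evaluated each time. */
int copyout_branchy(const void *src, void *dst, size_t len)
{
    if (cpu_has_smap())
        return copyout_smap(src, dst, len);
    return copyout_std(src, dst, len);
}

/* Resolver: runs once, when the ifunc relocation is processed. */
static copy_fn *copyout_resolver(void)
{
    return cpu_has_smap() ? copyout_smap : copyout_std;
}

/* Callers see an ordinary function; the implementation is chosen once. */
int copyout(const void *src, void *dst, size_t len)
    __attribute__((ifunc("copyout_resolver")));
```

By default a call to `copyout()` reaches the chosen implementation through a PLT entry. With `-z ifunc-noplt` the linker instead leaves a relocation at each call site, so that whoever loads the image can patch in a direct call.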

In D61613#1494486 <https://reviews.llvm.org/D61613#1494486>, @MaskRay wrote:

> There is another issue: this feature is arch-specific and probably only applies on x86:
>
>   // ELF/Relocations.cpp#L880
>       } else if (RelType Rel = Target->getDynRel(Type)) {
>         In.RelaDyn->addReloc(Rel, &Sec, Offset, &Sym, Addend, R_ADDEND, Type);
>
>
> On non-x86, you probably can't find an appropriate dynamic relocation type that resolves your desired text relocations. This looks to me a really hacky feature (linker support + textrel special case + more dynamic relocation type handling) that is limited to x86. Its benefit is also unclear (I can't find benchmarks in several of your FreeBSD patches). On the other hand, the alternative (a global function pointer) seems equally good.

I tested the feature on arm64 and it works there as well. You are right that it requires support for static relocations in the kernel linker, but I do not see this as a problem. We already implement a small "static" linker in the kernel (sys/kern/link_elf_obj.c) to handle the fact that loadable kernel modules are relocatable .o files instead of shared libs on some platforms (and the reason for this is again to avoid the PLT). IMO it is just another hint that the feature is not suitable for vanilla userspace.
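As a rough illustration of what handling one such relocation involves on x86-64, the sketch below patches the 32-bit operand of a call instruction using the standard PC-relative formula S + A - P. It is not the code in link_elf_obj.c: the function names are hypothetical, it assumes a relocation of the R_X86_64_PC32 kind and a little-endian host, and it takes the already-resolved ifunc target as an input rather than calling a resolver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Value stored for a 32-bit PC-relative relocation: S + A - P. */
int32_t pc32_value(uint64_t sym_addr, int64_t addend, uint64_t place)
{
    int64_t v = (int64_t)(sym_addr + (uint64_t)addend - place);

    /* The displacement must fit in the instruction's rel32 field. */
    assert(v >= INT32_MIN && v <= INT32_MAX);
    return (int32_t)v;
}

/* Patch the rel32 field at image[offset]; image is loaded at image_base. */
void apply_pc32(unsigned char *image, uint64_t image_base, uint64_t offset,
    uint64_t target, int64_t addend)
{
    int32_t v = pc32_value(target, addend, image_base + offset);

    memcpy(image + offset, &v, sizeof(v));
}
```

For a `call` whose opcode byte sits at address 0x1000, the operand is at 0x1001 and the usual addend is -4, so a target at 0x1100 yields a stored displacement of 0xFB: the next instruction at 0x1005 plus 0xFB is 0x1100.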

To be clear, I do agree that this is hacky. It cannot be used unless the application has some control over how it is linked; I tried to indicate this in the man page description. But the implementation is extremely simple, and I believe the goal is conceptually reasonable. In the future, if the feature becomes too much work to maintain, it can be removed without affecting application correctness. In the meantime I intend to be available to support the patch if needed and refine it if possible.

Regarding benchmarks, I do not believe the patch will have an impact on any interesting macro-benchmark, so I did not attempt to measure one. For the micro-benchmark, I used a program whose loop copies data out of the kernel at a high rate. In particular, it calls the copyout() ifunc frequently. I reran the test with the latest LLD sources. The code, data and summaries are here: https://people.freebsd.org/~markj/ifunc/

I tested with and without IBRS enabled, mostly to demonstrate that the optimization has an impact on microarchitectural resource usage. In particular, with IBRS enabled the improvement from -zifunc-noplt is more pronounced. The summaries show system CPU time consumed by the test program. "Patched" means -zifunc-noplt is configured.

https://people.freebsd.org/~markj/ifunc/ibrs.txt
https://people.freebsd.org/~markj/ifunc/noibrs.txt

The improvement is small and consistent.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D61613