[llvm-dev] GC for defsym'd symbols in LLD

Fāng-ruì Sòng via llvm-dev llvm-dev at lists.llvm.org
Thu Dec 5 14:17:06 PST 2019

I have investigated further. My conclusion is that GNU ld does not do
better than lld here, and that making the --defsym behavior ideal is
difficult in the current framework.

GNU ld has some unintended behaviors.

ld.bfd a.o --defsym 'd=foo' --gc-sections -o a => GNU ld retains .text_foo
ld.bfd a.o --defsym 'd=foo+3' --gc-sections -o a => GNU ld drops .text_foo
ld.bfd a.o --defsym 'd=bar-bar+foo' --gc-sections -o a => GNU ld drops .text_foo

I traced its logic under a debugger. The decision is made in
_bfd_elf_gc_mark_hook:

asection *
_bfd_elf_gc_mark_hook (asection *sec,
  case bfd_link_hash_defined:
  case bfd_link_hash_defweak:
    // It points to .text_foo for --defsym d=foo, but *ABS* for
    // --defsym d=bar-bar+foo or --defsym d=foo+3
    return h->root.u.def.section;

GNU ld evaluates symbol assignments in many passes, and the representation
of a symbol (section+offset) can vary among passes.
In the GC pass, its rule only works for simple expressions like --defsym
d=foo, not for anything slightly more complex.
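
To make the difference concrete, here is a toy model of that rule. It is a
sketch only, not bfd code: Section, Symbol, markHook, markLive and
fooSurvives are all names I made up, and I model *ABS* as a null section
pointer. The only thing it is meant to show is that a mark hook which
follows the symbol's current section keeps .text_foo for d=foo and loses it
once the expression makes d look absolute at GC time.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Toy model only; these types are hypothetical and do not mirror bfd.
struct Section {
  std::string name;
  std::vector<std::string> relocTargets; // symbols referenced by relocations
};

struct Symbol {
  const Section *section; // nullptr stands in for *ABS*
};

using SymbolTable = std::unordered_map<std::string, Symbol>;

// Analogue of the mark hook: a reference to a symbol keeps only the section
// the symbol points at when GC runs.
const Section *markHook(const SymbolTable &symtab, const std::string &name) {
  auto it = symtab.find(name);
  return it == symtab.end() ? nullptr : it->second.section;
}

// Worklist mark phase starting from a single root section.
std::unordered_set<std::string> markLive(const SymbolTable &symtab,
                                         const Section &root) {
  std::unordered_set<std::string> live;
  std::vector<const Section *> worklist{&root};
  while (!worklist.empty()) {
    const Section *sec = worklist.back();
    worklist.pop_back();
    if (!live.insert(sec->name).second)
      continue; // already visited
    for (const std::string &target : sec->relocTargets)
      if (const Section *next = markHook(symtab, target))
        worklist.push_back(next);
  }
  return live;
}

// Rebuilds the a.s layout from this thread and reports whether .text_foo
// survives. simpleAlias=true models d=foo; false models d=foo+3 or
// d=bar-bar+foo, where d looks absolute during the GC pass.
bool fooSurvives(bool simpleAlias) {
  Section text{".text", {"d"}}; // _start: movabs $d, %rax
  Section textFoo{".text_foo", {}};
  Section textBar{".text_bar", {}};
  SymbolTable symtab;
  symtab["_start"] = {&text};
  symtab["foo"] = {&textFoo};
  symtab["bar"] = {&textBar};
  symtab["d"] = {simpleAlias ? &textFoo : nullptr};
  return markLive(symtab, text).count(".text_foo") != 0;
}
```

In this model fooSurvives(true) keeps .text_foo and fooSurvives(false)
drops it, which matches the three ld.bfd runs above.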

In lld, it would be difficult to drop the following rule in MarkLive.cpp:

  for (StringRef s : script->referencedSymbols)

The issue can be demonstrated by the following call tree:

      // Defined::section is nullptr for `d` because the assignment d=foo
      // hasn't been evaluated yet.
          // Symbol section+offset are evaluated here.
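
The ordering problem can be sketched with another toy model. Again this is
not lld code: InputSection, Defined and referencedSymbols borrow names from
lld for readability, but LinkState, markLiveFromRoots and textFooIsLive are
hypothetical, and the real data structures are far richer. The assumption
is simply that the mark phase runs before the assignment d=foo is
evaluated, so d has no section yet when the relocation in _start is
followed.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Toy model only; a simplification, not lld's real classes.
struct InputSection {
  std::string name;
  std::vector<std::string> relocTargets; // symbols referenced by relocations
};

struct Defined {
  const InputSection *section; // nullptr until the assignment is evaluated
};

struct LinkState {
  std::unordered_map<std::string, Defined> symtab;
  // Symbols named on the right-hand side of script / --defsym assignments.
  std::vector<std::string> referencedSymbols;
};

// Mark phase: follow relocations starting from the given root symbols.
std::unordered_set<std::string>
markLiveFromRoots(const LinkState &link, const std::vector<std::string> &roots) {
  std::unordered_set<std::string> live;
  std::vector<const InputSection *> worklist;
  auto enqueue = [&](const std::string &name) {
    auto it = link.symtab.find(name);
    if (it != link.symtab.end() && it->second.section)
      worklist.push_back(it->second.section);
  };
  for (const std::string &root : roots)
    enqueue(root);
  while (!worklist.empty()) {
    const InputSection *sec = worklist.back();
    worklist.pop_back();
    if (!live.insert(sec->name).second)
      continue; // already visited
    for (const std::string &target : sec->relocTargets)
      enqueue(target);
  }
  return live;
}

// Reports whether .text_foo is live for --defsym d=foo, with or without
// treating referencedSymbols as GC roots.
bool textFooIsLive(bool keepReferencedSymbols) {
  InputSection text{".text", {"d"}}; // _start: movabs $d, %rax
  InputSection textFoo{".text_foo", {}};
  LinkState link;
  link.symtab["_start"] = {&text};
  link.symtab["foo"] = {&textFoo};
  link.symtab["d"] = {nullptr}; // d=foo has not been evaluated yet
  link.referencedSymbols = {"foo"};

  std::vector<std::string> roots{"_start"};
  if (keepReferencedSymbols)
    roots.insert(roots.end(), link.referencedSymbols.begin(),
                 link.referencedSymbols.end());
  std::unordered_set<std::string> live = markLiveFromRoots(link, roots);

  // The assignment is only evaluated after GC, too late to affect liveness.
  link.symtab["d"].section = link.symtab["foo"].section;
  return live.count(".text_foo") != 0;
}
```

With keepReferencedSymbols=false the model drops .text_foo even though
_start references d, which is why removing the rule is not a simple change.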

GitHub issues seem like a good place to record the problem, so I just
created https://github.com/llvm/llvm-project/issues/52
I wanted to mark it low priority, but there is no such label.

On Wed, Dec 4, 2019 at 8:51 AM Shoaib Meenai <smeenai at fb.com> wrote:

> I completely agree that --defsym foo=bar should keep bar (or more
> precisely the section containing bar) alive if foo is referenced.
> My mental model of how --defsym foo=bar behaves is that (assuming bar is a
> defined symbol) we create a symbol foo that points to the same location as
> bar (as in it has the same section + address within that section). Any
> reference to foo should therefore prevent that section from getting garbage
> collected. bar doesn't need to enter the picture directly (and we don't
> need to store any sort of explicit link between foo and bar); its section
> getting preserved just naturally falls out of foo getting preserved.
> For example, in Fāng-ruì's movabs example, the symbol _start (which is the
> entry point and therefore a GC root) will have a relocation against d, so d
> will be kept alive too. With --defsym d=foo, the symbol d should point to
> the same section as foo, so that section will be preserved; it doesn't
> matter if the symbol foo itself is preserved (unless there are other
> non-dead references to it, of course, but then those references should
> cause foo to be marked alive as well).
> I haven't actually studied how LLD models a defsym though, so my mental
> model might be way off. I apologize for not having done so before replying,
> but it'll be at least a few days before I have the chance to get to that.
> If my mental model is accurate, preserving the needed section for defsym
> should just fall out naturally from it (without needing to give the target
> of a defsym any special treatment), but if not, the whole thing might be
> much more complicated and not worth it.
> On 12/4/19, 1:35 AM, "Peter Smith" <peter.smith at linaro.org> wrote:
>     On Wed, 4 Dec 2019 at 07:05, Fāng-ruì Sòng <maskray at google.com> wrote:
>     >
>     > On Tue, Dec 3, 2019 at 7:02 PM Shoaib Meenai via llvm-dev
>     > <llvm-dev at lists.llvm.org> wrote:
>     > >
>     > > LLD treats any symbol referenced from a linker script as a GC
>     > > root, which makes sense. Unfortunately, it also processes --defsym
>     > > as a linker script fragment internally, so all target symbols of a
>     > > --defsym also get treated as GC roots (i.e., if you have something
>     > > like --defsym SRC=TGT, TGT will become a GC root). I believe this
>     > > to be unnecessary for defsym specifically, since you're just
>     > > aliasing a symbol, and if the original or aliased symbols are
>     > > referenced from anywhere, the symbol's section will get preserved
>     > > anyway. (There are also cases where the defsym target can be an
>     > > expression instead of just a symbol name, which I admittedly
>     > > haven't thought about too hard, but I believe the same logic should
>     > > hold in terms of any needed sections getting preserved regardless.)
>     > > I want to change defsym targets specifically to not be considered
>     > > as GC roots, so that they can be dead code eliminated. Does anyone
>     > > foresee any issues with this?
>     >
>     > % cat a.s
>     > .globl _start, foo, bar
>     > .text; _start: movabs $d, %rax
>     > .section .text_foo,"ax"; foo: ret
>     > .section .text_bar,"ax"; bar: nop
>     > % as a.s -o a.o
>     >
>     > % ld.bfd a.o --defsym d=foo --gc-sections -o a => .text_foo is retained
>     > % ld.bfd a.o --defsym d=bar --gc-sections -o a => .text_bar is retained
>     > % ld.bfd a.o --defsym d=1 --gc-sections -o a => Neither .text_foo nor
>     > .text_bar is retained
>     > % ld.bfd a.o --defsym c=foo --defsym d=1 --gc-sections -o a => Neither
>     > .text_foo nor .text_bar is retained; lld will retain .text_foo.
>     >
>     > For --defsym from=an_expression_with_to, GNU ld appears to add a
>     > reference from 'from' to 'to'. lld's behavior
>     > (https://reviews.llvm.org/D34195) is more conservative.
>     >
>     > If we stop treating script->referencedSymbols as GC roots,
>     > instructions like `movabs $d, %rax` will no longer be able to access
>     > the intended section. We can tweak our behavior to be like GNU ld,
>     > but the additional complexity may not be worthwhile.
>     I think it would be a step too far for defsym symbol=expression to
>     have no effect on GC. I'd expect that something like defsym foo=bar is
>     used because some live code refers to foo, but does not refer to bar,
>     so ideally we'd like defsym foo=bar to keep bar live. I've seen this
>     idiom used in embedded systems in the presence of binary only
>     libraries. It is true that the programmer can always go the extra mile
>     to force bar to be marked live, however I think the expectation would
>     be defsym foo=bar would do it.
>     I think the GNU ld behaviour is reasonable. If nothing refers to
>     either foo or bar then there is no reason to mark them live. On the
>     implementation cost-benefit trade off I guess we won't know until
>     there is a prototype, and some idea of what implementing it will save
>     on a real example.
>     Peter
