[llvm-dev] Switch to ld.bfd tombstone behavior by default

Mon Jul 20 12:59:10 PDT 2020

On Mon, Jul 20, 2020 at 10:32 AM Alexey Lapshin
<alapshin at accesssoftek.com> wrote:
>
>
>
> >On Fri, Jul 17, 2020 at 1:55 PM Alexey Lapshin <a.v.lapshin at mail.ru> wrote:
> >>
> >> >Пятница, 17 июля 2020, 19:42 +03:00 от David Blaikie <dblaikie at gmail.com>:
> >> >
> >> >On Fri, Jul 17, 2020 at 12:03 AM Fangrui Song <maskray at google.com> wrote:
> >> >>
> >> >> Thanks for the write-up!
> >> >>
> >> >> On 2020-07-16, David Blaikie wrote:
> >> >> >In short: Perhaps we should switch lld to the bfd-style tombstoning
> >> >> >behavior for a release or two, letting users opt-in to testing with the new
> >> >> >-1/-2 tombstoning in the interim, before switching to the new tombstone by
> >> >> >default (while still having the flag to switch back when users find
> >> >> >surprise places that can't handle the new behavior).
> >> >> >
> >> >> >In long:
> >> >> >https://reviews.llvm.org/D81784 and follow-on patches modified the behavior
> >> >> >of lld with regards to resolving relocations from debug sections to dead
> >> >> >code (either comdat deduplicated, or gc-sections use).
> >> >> >
> >> >> >A very quick summary of the situation:
> >> >> >
> >> >> >Original Behavior:
> >> >> >
> >> >> > - bfd: 1 for debug_ranges(0 would prematurely terminate the list), 0
> >> >> > elsewhere
> >> >> > - gold/lld: 0+addend everywhere
> >> >> >
> >> >> >Limitations/bugs:
> >> >> >
> >> >> > - bfd/gold/lld
> >> >> > - doesn't support 0 as a valid executable address without ambiguities
> >> >> > - gold/lld
> >> >> > - ambiguities with large gc'd functions combined with a .text mapping
> >> >> > that starts in relative low addresses
> >> >> > - premature debug_range termination with zero-length functions (Clang
> >> >> > produces these with __builtin_unreachable or non-void return
> >> >> >type functions
> >> >> > without a return statement)
> >> >> >
> >> >> >New behavior:
> >> >> >
> >> >> > - -2 for DWARFv4 debug_loc, debug_ranges (-1 is a base address specifier
> >> >> > there)
> >> >> > - -1 elsewhere
> >> >> > - linker flag to customize to other values if desired
> >> >> >
> >> >> >Known issues:
> >> >> >
> >> >> > - lldb's line table parsing can't handle -1 well at all (essentially
> >> >> > unusable)
> >> >>
> >> >> Pavel Labath will fix this soon https://reviews.llvm.org/D83957
> >> >> This is an unhandled address-space wraparound problem.
> >> >> This pattern is potentially common - and other downstream DWARF
> >> >> consumers might make similar line table handling mistakes.
> >> >
> >> >That's the thing - I'm not sure we can really classify them as
> >> >"mistakes". I think bfd.ld's tombstoning behavior is about the only
> >> >thing we can reasonably say DWARF consumers should /probably/ be
> >> >expected to handle - and I'd imagine in many cases they haven't been
> >> >written intentionally to handle it, but whatever behavior they have
> >> >has accidentally been sufficient for their needs.
> >>
> >> >> > - gdb's line table parsing ends up with different handling when breaking
> >> >> > on gc'd functions (minor functionality issue)
> >> >>
> >> >> This is just a behavior difference, not affecting users.
> >> >> It did break a test if linked with LLD (gdb intrinsically has lots of
> >> >> failing tests even if built with GCC+GNU ld).
> >> >>
> >> >> Previous behavior (when an address is zero): a breakpoint on a
> >> >> --gc-sections discarded function will be redirected to a larger line
> >> >> number with debug info, even if that line can be an unrelated different
> >> >> function.
> >> >> New behavior is that the breakpoint is on a wrapped-around small address.
> >> >>
> >> >> GDB 9.3 will restore the previous behavior
> >> >> (https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=a8caed5d7faa639a1e6769eba551d15d8ddd9510)
> >> >>
> >> >> >
> >> >> >I think there's enough risk in this work (even given the small number of
> >> >> >bugs found so far), given there's a pretty wide array of debug info
> >> >> >consumers out there, that we should change lld's default to match the
> >> >> >long-lived bfd strategy. This would address my original motivation for
> >> >> >raising all this (empty functions prematurely terminating the list), while
> >> >> >letting users who want to experiment with it, or need it (like Alexey), can
> >> >> >opt-in to the -1/-2 behavior.
> >> >>
> >> >> I think we can only confidently say that there is enough risk in using
> >> >> tombstone value -1 in .debug_line, but I'd not say tombstone value -1 in
> >> >> other .debug_* can cause problems.
> >>
> >> >Given how many DWARF parsers we've had to cleanup or migrate off in
> >> >transitioning to DWARFv5 inside Google, I think that's a fair bit of
> >> >evidence the set of parsers isn't as narrow/closed as we'd like and
> >> >thus the number of places that might have issues isn't known/easily
> >> >exhaustively tested. So I'm not so concerned about the bugs we've
> >> >seen, but what that might indicate about the things that we don't know
> >> >about and can't test (because we don't know about them). (of course,
> >> >the flipside of that is that if we don't know about them, we can't
> >> >tell people who own them to go check if they work with this opt-in
> >> >feature - so they'll break whenever we turn this on by default - but
> >> >perhaps in the interim we can get at least a few big LLD customers to
> >> >deploy the feature and flush out some of the issues - happy for Google
> >> >to use this internally, hopefully we can encourage Apple folks, Sony
> >> >of course already has this semantic so nothing to find there most
> >> >likely - Chromium, maybe Firefox, etc)
> >>
> >> From the other side — when we already switched new behavior ON as default —
> >> then it is easier to discover all these unknown cases. We have an option restoring
> >> old behavior(which should be fine for all of the current users). Thus, everybody
> >> who needs old behavior is able to continue using it.
>
> >The cost to those users figuring out that it's a problem and finding
> >the flags/adding them, etc, is non-zero.
> >But, yes, as I said - eventually some of that will happen, sooner or
> >later. But reducing how much of it happens is valuable too.
>
> I agree. Reducing the impact on customers is important.
> Though in this case, as you said, it would be done anyway(sooner or later).
> Finally, I do not know what is more important: a) get feedback faster with
> the cost of involved users and requiring them to make efforts
> to figure out the problem. or b) test with a smaller set of users,
> find more problematic cases, make new tombstone value to be default later.
> I am kind of preferring a). But, b) is more safe and is OK.
>
>
> >> That makes transition more understandable. All problems would be explicitly seen. Then,
> >> If we will know about new problems - we could adapt the current solution. Similar to this >suggestion:
> >>
> >> a) .debug_ranges&.debug_loc => -2, .debug_line => 0, other .debug_* -> -1
>
> >The issue I have with this is that it's reflective of the known
> >breakage in some tools that happen to have a quick turnaround/feedback
> >- and assumes all the other sections are significantly less likely to
> >have problems with this format - and I don't think we have enough data
> >to make that estimate.
>
> We could extend that rule when/if new problems become known.
> Or, customers could fix their code or use option returning old behavior
> and it would probably not be necessary to extend the rule.
>
> >I think having a clear flag for bfd behavior or new behavior would be
> >good, keeping it bfd-by-default for now, evangelizing (patch notes,
> >outreach to other projects and tool developers) the new functionality
> >& then changing the default shortly after an LLVM release (perhaps 12
> >- but perhaps longer) to give it time to bake for trunk-following
> >users (ample time for them to pick it up on their schedule, see what
> >other tools might break, whether they're isolated enough to expect a
> >quick turnaround or whether we should get them updated and then wait a
> >while longer.
>
> yes, doing clear flag for bfd behavior or new behavior,
> and keep it bfd-by-default for now would be OK and is more safe.
>
>
> >> This would help while dwarf users are preparing their code to the new solution.
> >>
> >> Is option restoring old behavior not enough to solve problems caused by new tombstone >value?
>
> >I don't personally think it is enough - given some of the known
> >breakage - I think that points to a fair bit of reasonably possible
> >unknown breakage. Having a few big (like Apple and Google) users
> >switch over by default & potentially flush out a few more issues (or
> >not -but build confidence that there aren't more issues) I think is an
> >appropriate way to roll out a change like this.
>
> >(minor doubt: I wonder how well the bfd tombstoning works in DWARFv5
> >(rnglists/loclists) or in debug_range/debug_loc that uses base address
> >specifiers, where the zero-without-addend doesn't have a chance to
> >make an empty range (because it's not a start/end pair, it's a
> >start+offset pair - so the offset remains non-zero)... if bfd
> >tombstoning breaks DWARFv5 parsing (or base address specifiers in
> >range/loc) in common consumers I might be more inclined to support
> >enabling new tombstoning by default sooner (though if we could enable
> >it just for DWARFv5 that might be nice - but not practical, since the
> >linker doesn't know what DWARF it's linking & could be linking
> >multiple versions))
>
> - From the point of view overlapping address ranges,
> both bfd-solution(using 1) and old lld solution(using 0) are equal.

gold and lld's solution was a bit more complicated than using zero (&
bfd's solution uses 1 in debug_ranges, but not in other debug
sections). lld and gold use 0+addend in ranges. 0 alone in ranges
would be more problematic than any of the other deployed solutions,
because it would lead to the debug_ranges contributions being
terminated early (0, 0 terminates the list - so debug_ranges for a CU
with one gc'd function mid-way through the range list would drop the
rest of the range lit) - that's why bfd uses 1 for debug_ranges rather
than 0 which it uses elsewhere (though the same fix should,
technically, be applied to debug_loc too for the same reasons as
debug_ranges). the 0+addend approach of gold/lld works OK (except for
the low address overlap) except when there's a zero-length function,
then 0+addend == 0 for the end of the range, andy ou get a 0,0 entry
with the premature range list termination issues.

> for the "start+offset pair" case they could result in overlapping address ranges.

Not quite - the way bfd does things, at least in debug_ranges without
base address specifiers (bit hard to isolate for a consumer, really -
though they could special case zero base addresses... sort of) the
empty ranges are (1, 1) - they're actually empty. Whereas lld's
approach ends up with (0+addend, 0+addend) which is usually (0, size)
which has issues with overlap on systems that have a valid low address
range (or if you have sufficiently large functions).

> - From the point of view correct parsing of DWARF 5 - it looks like they are equally good.

Yep, DWARFv5 is more complicated - bfd's approach would be marginally
better even for addr+offset encodings in debug_rnglists - since it'd
always produce a 0 for the addr, whereas gold/lld's behavior can in
some cases (eg: "void f1() { } __attribute__((nodebug)) void f2() { }
void f3() { }" with -fno-function-sections - you'll end up with the
range for f3 starting at a non-zero addend - this debug info will be
worse for bfd/ld (if the DWARF for this object was included, but the
whole .text section was discarded (eg: maybe there's a global variable
in this object which is linked in, but -gc-section is still able to
discard the whole .text section as unreferenced) then gold/lld's
approach would make f3 unidentifiable as dead code (because it doesn't
have a tombstone start) but bfd's approach would work (assuming zero
isn't a valid address))

> - From the point of view correct parsing of DWARF 4 - it looks bfd-solution(using 1)
> is better than the old lld solution(using 0). When range list entry contains
> addresses(start+end) which should be relocated and for the zero-length functions,
> bfd-solution would result in range list entry: {1, 1}, while old lld solution
> would result in {0, 0}, and match with the end of list entry.
> That is the original problem that started this thread.

Only comes up for zero-length functions, because gold/lld's approach
was 0+addend, not straight 0.

> Though it looks like there still exist case when range list could be terminated earlier:
>
> base address selection entry: {-1, address of deleted code}
> following range list entry: {0, 0} << points to the same address as set by base address selection entry and has zero size.

That's a bug in the producer (though a good point - I've probably made
that bug in LLVM) - the linker can't solve that problem, since the
linker can't touch the literal unrelocated 0, 0.

> after linker resolved relocations it would look like this, for bfd case:
>
> base address selection entry: {-1, 1}
> following range list entry: {0, 0} <<<<<<<<<<<
>
> So there still exists {0,0} entry which could be considered as the end of list entry.
>
> But old lld solution has the same problem, thus it would not be new.
>
> - Additionally, AFAIK gdb has special processing for overlapped address
> ranges starting from 0. Using bfd tombstone value could break that processing - I would check it.

Not sure I understand - presumably gdb's special processing is
intended to work with bfd's tombstoning, since it's been the most
common/prolific unix linker, the one intended to work with gdb (they
exist in the same repository) for decades, right?

 - Dave

>
> Alexey.
>
>
> >- Dave
>
> > >> With consideration for satefy for the upcoming release/11.x, we can make
> > >> two choices:
> > >>
> > >> a) .debug_ranges&.debug_loc => -2, .debug_line => 0, other .debug_* -> -1
> > >> b) .debug_ranges&.debug_loc => -2, other .debug_* => 0
> > >>
> > >> Delaying .debug_line => -1 for one or two release sounds good to me.
> > >> So LLD 11 or 12 linked binaries can be debugged by LLDB 10. This is a
> > >> nice property.
> > >>
> > >> This write-up proposes b), but I'd say a) is likely sufficient. With the
> > >> available information, I cannot yet say that a) will have more risk.
> > >
> > >Risk is about the unknowns - and it still seems like a lot of
> > >unknowns. While there are probably many more consumers that read
> > >.debug_line than other sections, reading debug_info (for instance) is
> > >necessary for inline frames in symbolizing - still probably one of the
> > >most common uses of DWARF I'd guess. (what about stack unwinding using
> > >debug_frame? that'd worry me a bit if anyone got /that/ wrong because
> > >of this change)
> > >
> > >> > - chromium/firefox have some tools that were broken:
> > >> > https://bugs.chromium.org/p/chromium/issues/detail?id=1102223#c5
> > >>
> > >> This is potentially related to other .debug_* (not .debug_line)
> > >> I hope Chromium developers can chime in here:) The breakage was
> > >> unfortunate but I don't know how we could have avoided that. IMHO this
> > >> is no different from "clang started to emit a new DW_FORM_* and a
> > >> postprocessing tool of .debug chokes on that" Whether we want to
> > >> suppress that particular DW_FORM_* definitely should depend on how
> > >> likely it can cause problems, but we can't yet say we have to hold off
> > >> on a feature for a solved (precisely, mitigated) problem.
> > >
> > >LLVM has no custom forms and I'd be super cautious about adding any
> > >that were on by default because of how bad that breakage would be.
> > >
> > >I'm not so concerned about the problems we know - but what they tell
> > >us about the problems that might arise from use cases we don't know.
> > >All the other projects out there that might have custom DWARF parsers
> > >to do some ad-hoc things.
> > >
> > >(also, ultimately - given how far-reaching this is, I think we'll want
> > >some tidier flags that are more user-focussed. I'd hope for a flag
> > >that gives BFD-like semantics (though I'd be OK with fixing debug_loc
> > >(using 1 instead of 0) to work the same as debug_ranges while we're
> > >there - a minor divergence from BFD, but highly likely to not cause
> > >problems/fall out naturally from a simple implementation of parsing
> > >that section) - something that's been in-use and tested by basically
> > >everyone for decades. And another flag for the new semantics (-2 for
> > >debug_loc/debug_ranges, -1 everywhere else). Customizable per-section
> > >expression-based support I think is a recipe for platform divergence &
> > >I'd rather it not be available/supported at all, but if you really
> > >want to keep it in, I'd at least rather it not be the feature we
> > >promote to users about how they can test/opt in/out of the behavior
> > >when they're seeing breakages or want to test the future semantics)
> > >
> > >> >I'm not sure how to get the word out to DWARF consumers that they should
> > >> >consider this new experimental behavior. Ray's done a good job
> > >> >evangelizing/discussing this with gdb and lldb at least - and of course
> > >> >having turned it on by default briefly has found some users (like Chromium)
> > >> >that we probably wouldn't have found no matter how long we left this as an
> > >> >experimental option... so some things are going to break when we switch no
> > >> >matter what.
> > >>
> > >> Thank you for following up with some GNU folks on their lists!
> > >> If folks want to follow along the thread:
> > >> https://sourceware.org/pipermail/binutils/2020-June/111376.html
> > >>
> > >> We have informed binutils, elfutils-devel (elfutils has a few debug
> > >> tools) and gdb. I don't recall that anyone has thought about problems
> > >> with a tombstone value.
> > >>
> > >> >
> > >> >P.S: Sony's already been using the -1 technique with their debugger and
> > >> >linker for a while, so they may want to keep this on by default for SCE -
> > >> >but I'm not sure how to do that in-tree.
> > >> >
> > >> >
> > >> >Clang doesn't know which lld
> > >> >version it's running, so whether the flag can be specified, I would think?
> > >> >(so it'd be hard to have Clang go "if SCE and LLD, pass the flag to use
> > >> >-1", I think) - if there is a way to make that decision in the compiler
> > >.> >driver+linker, then we'd have a question of "default new behavior except
> > >> >when tuning for LLDB and GDB" or "default bfd behavior except when tuning
> > >> >for SCE".
> > >>
> > >> I've been involed in another thread on SHF_LINK_ORDER (https://sourceware.org/pipermail/binutils/2020-
> > >July/112415.html ).
> > >> We may need a way to tell codegen about the used linker.
> > >>
> > >> pcc proposed -mbinutils-version= - This is nice in that some MC
> > >> decisions related to -fno-integrated-as can use this option as well.
> > >> jyknight proposed -mlinker-version= and syntax like -fuse-ld=bfd:2.34
> > >>
> > >> This may get more complex if the generated object file want to be linked
> > >> with more than one linker. This discussion probably deserves its own
> > >> thread.
> >
> >
> > --
> > Alexey Lapshin
> >
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev