[llvm-dev] [RFC] Generating LLD reproducers on crashes

Wed Apr 21 09:58:36 PDT 2021

On Wed, Apr 21, 2021 at 9:56 AM Fangrui Song <maskray at google.com> wrote:
>
> On 2021-04-21, Alexandre Ganea wrote:
> >Hello,
> >
> >Just chiming in :-) I think there are several options:
> >
> >1. Haowei's proposal is a good first step. A reproducer is good for 100% crashes. The reproducer archive is good for developers that need to iterate when fixing a crash. That can be achieved by mimicking the behavior of -fintegrated-cc1 and CC1Command (which is somehow equivalent to safeLldMain()), followed by something similar to generateCompilationDiagnostics(). All this controlled by an optional cmd-line flag. Also worth noting that some LLD drivers (if not all) support reproducers already.
> >
> >2. There are cases where the CrashRecoveryContext isn't perfect: a stack overflow, a severe heap corruption, corruption of any global state. These happen quite rarely, but would prevent the gen-reproducer from creating anything, however I expect we would still see an invalid return code in the build log. A 100% solution would need forking LLD at startup, along the lines of -fno-integrated-cc1. The new cmd-line flag from 1. could eventually support this multi-process/forking scenario as well.
> >
> >3. Finally, low repro crashes like Petr mentioned would need a different approach: the crash/core dump has to be saved when it occurs; creating a reproducer might not be enough to understand the issue. On our end, we use the mechanism provided by LLVM_ENABLE_CRASH_DUMPS on Windows to create full crash dumps from the global exception handler. We've been using the same thing in our runtimes and our tooling for the past 15 years, it's pretty reliable (that is, creating a .dmp from within the crashing executable). It'd be worth investigating if that cmake option could be made to work on Linux as well?
> >
> >I would go ahead with 1. for a starter, and 3. eventually if that's feasible. Option 2. is only if the other options aren't enough.
> >
> >Alex.
>
> 1 looks good to me as well, if sufficient parties find this useful.
> I am concerned with 2 due to fragility of same-process crash reporting.
> (I think the stack overflow problem can be mitigated by sigaltstack)

It seems to work pretty well for clang, I think?

> For 3: on Linux, there are many system effects which can suppress crash dumps...
> I am still abit concerned with the size of the core file..
> If users find this particular useful and want to go down this route, it
> is fine to me.
>
> >-----Message d'origine-----
> >De : llvm-dev <llvm-dev-bounces at lists.llvm.org> De la part de David Blaikie via llvm-dev
> >Envoyé : April 20, 2021 5:52 PM
> >À : Fangrui Song <maskray at google.com>
> >Cc : llvm-dev <llvm-dev at lists.llvm.org>
> >Objet : Re: [llvm-dev] [RFC] Generating LLD reproducers on crashes
> >
> >FWIW, I'll +1 the idea of support for more convenient crash repros & also for more library-izing lld (as there was a strong push for early on in lld's v2 development, and was pushed back on until it was production quality, which it is - so now any costs of library-izing & other functionality can be measured in terms of performance)
> >
> >While maintainability is important - in some ways, making the code more re-entrant friendly could improve maintainability (making it easier to unit test parts of the code, making it easier to reason about the code (by removing global shared state), etc).
> >
> >On Tue, Apr 20, 2021 at 1:00 PM Fangrui Song via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >>
> >> On 2021-04-20, Haowei Wu via llvm-dev wrote:
> >> >As Petr pointed out, we can reuse what clang uses for crash handling
> >> >and move that functionality to the llvm library if needed. Clang
> >> >crash reproducer works fairly well and it is not using multi-process
> >> >architecture. We just propose to make lld to match clang's crash
> >> >reproducer's behavior, nothing more than that.
> >> >
> >> >Given we both agree it can be an optional feature, I don't think size
> >> >is an issue. For users who don't need this functionality, they can
> >> >make this feature disabled by not using related flags. For users who
> >> >need this feature for debugging purposes, disk space is the least of
> >> >the concern. The reproducer size issue won't solve itself if we choose to use shell scripts.
> >> >If you mean implementing a size dedup/regulator using shell script,
> >> >that is a big project and would be a nightmare to maintain.
> >>
> >> The main argument here is whether implementing it in the driver
> >> simplifies things. I doubt it and I care much about the maintainability.
> >>
> >> If the logic is simply calling CrashRecoveryContext::RunSafely
> >> (actually we have already done this, just set the environment variable
> >> LLD_IN_TEST=1) and the code can meld into the existing framework for
> >> running the lld entrypoint more than once, this could look fine.
> >> But you actually needed to parse an option in the generic driver code
> >> and do some non-trivial things there.
> >>
> >> If the logic is simply `if CrashRecoveryContext::RunSafely finds a
> >> failure, rerun with --reproduce`. I don't think implementing this in
> >> the driver is simpler than `(ld.lld "$@" 2> log; code=$?; if grep 'crash pattern, maybe Stack' log; then ld.lld "$@" --reproduce=somewhere; fi ...).
> >>
> >>
> >> BTW: if exitLld is called (via fatal, or sufficient number of
> >> error()), lld may enter a less robust state. It uses longjmp to
> >> recover from a crash.  AFAIK, only some JIT style library users do
> >> this. This is not sufficiently tested in production systems.
> >>
> >> >
> >> >We rarely use shell scripts in our (Fuchsia) build system as they
> >> >introduce complexity to the build system and are difficult to
> >> >maintain and it does add 1 extra fork per invocation, which may or
> >> >may not be an issue. But we prefer to avoid it. They are also very
> >> >difficult to be properly reused across different teams or tools. If
> >> >lld can integrate a crash reproducer, it can benefit other compilers that relies on lld as well.
> >> >
> >> >On Fri, Apr 16, 2021 at 12:37 AM pawel k. via llvm-dev <
> >> >llvm-dev at lists.llvm.org> wrote:
> >> >
> >> >> Reusing existing one sounds even better.
> >> >>
> >> >> Best regards,
> >> >> Pk
> >> >>
> >> >> pt., 16.04.2021, 08:14 użytkownik Petr Hosek via llvm-dev <
> >> >> llvm-dev at lists.llvm.org> napisał:
> >> >>
> >> >>> On Thu, Apr 15, 2021 at 9:30 PM Fangrui Song via llvm-dev <
> >> >>> llvm-dev at lists.llvm.org> wrote:
> >> >>>
> >> >>>> On 2021-04-15, Manoj Gupta via llvm-dev wrote:
> >> >>>> >LLD reproducers is something we'd like to have in Chrome OS as
> >> >>>> >well, see bug
> >> >>>> >https://bugs.chromium.org/p/chromium/issues/detail?id=1134940 (no activity yet).
> >> >>>> >Our plan is to create a shell wrapper and re-exec LLD if needed
> >> >>>> >with --reproduce. Obviously, if LLD supports creating
> >> >>>> >reproducers natively, that'd be great!
> >> >>>> >
> >> >>>> >-Manoj
> >> >>>>
> >> >>>> The crash report can be easily implemented via a shell script,
> >> >>>> but is difficult to implementat reliably in the process itself.
> >> >>>> When a process crashes, naturally not everything can work very
> >> >>>> robustly. The process wants to recover some state and starts a
> >> >>>> .tar writer, collects every touched file and places them in the
> >> >>>> .tar writer. There are many steps things can go afoul. I am
> >> >>>> worrying about the robustness. Of course, this may be solved by a
> >> >>>> multiprocess architecture, but I am not sure we want to pay the
> >> >>>> complexity in the LLD entrypoint itself.
> >> >>>>
> >> >>>
> >> >>> I'm hoping that we could reuse llvm::CrashRecoveryContext just
> >> >>> like Clang does without needing multi-process architecture.
> >> >>> Furthermore, we might be able to extract some of the common
> >> >>> infrastructure from Clang to LLVM and then use it in lld. Clang
> >> >>> has already solved this so there's no need to duplicate the effort.
> >> >>>
> >> >>>
> >> >>>> (Crashing LLD is not the idea I hear a lot. For some groups it
> >> >>>> has been very stable.
> >> >>>> The crashes are more frequently from some optimizations triggered
> >> >>>> by llvm/lib/LTO.
> >> >>>> The nature of the crashes is useful, if Fuchsia/ChromeOS folks
> >> >>>> would like to provide.)
> >> >>>>
> >> >>>> On the other hand, this task seems to require a fair amount of
> >> >>>> customization to me.  First we have the tarball size problem.
> >> >>>> Then say there is a common crash and 100 links of a similar kind
> >> >>>> crash at the same time, do we write 100 tarballs?  In a
> >> >>>> controlled environment, for example when there is some
> >> >>>> deduplicater or throttling this may be feasible. The output
> >> >>>> filename may want customization as well, and different groups may
> >> >>>> have different opinions.  It feels to me that a script, whether
> >> >>>> or not LLD has the built-in crash reporting feature, is
> >> >>>> indispensable. Then the built-in C++ crash reporter code in LLD
> >> >>>> does not convince me.
> >> >>>>
> >> >>>
> >> >>> I cannot speak for others, but at least in our case we are usually
> >> >>> limited by the available memory and on the machines we have we can
> >> >>> only run single to low double digit of link jobs in parallel. We
> >> >>> also never had this issue with Clang and we usually run hundreds of parallel jobs.
> >> >>>
> >> >>> Regarding the output name, I'd again take an inspiration from
> >> >>> Clang and simply use a temporary filename generated by
> >> >>> llvm::sys::fs::createTemporaryFile. That way we don't need to
> >> >>> worry about two failing link jobs trying to write to the same
> >> >>> file. We never needed the customization for Clang and I doubt we will for lld.
> >> >>>
> >> >>>
> >> >>>> >On Thu, Apr 15, 2021 at 11:23 AM David Blaikie via llvm-dev <
> >> >>>> >llvm-dev at lists.llvm.org> wrote:
> >> >>>> >
> >> >>>> >> On Thu, Apr 15, 2021 at 1:37 AM Petr Hosek via llvm-dev
> >> >>>> >> <llvm-dev at lists.llvm.org> wrote:
> >> >>>> >> >
> >> >>>> >> > lld crashes are more rare, but they do happen. For example,
> >> >>>> >> > we see
> >> >>>> lld
> >> >>>> >> segfaulting occasionally on our bots. I'd like to fix it, but
> >> >>>> >> I don't
> >> >>>> know
> >> >>>> >> how to reproduce this issue because we never managed to
> >> >>>> >> reproduce it locally. This is primarily where the motivation
> >> >>>> >> for this feature came
> >> >>>> from.
> >> >>>> >> In the case of Clang, we already configure our build to
> >> >>>> >> generate reproducers in a dedicated directory and at the end
> >> >>>> >> of the build we
> >> >>>> upload
> >> >>>> >> its content to a dedicated (short lived) storage bucket. We
> >> >>>> >> would
> >> >>>> like to
> >> >>>> >> do the same with lld and if this feature existed, we would use
> >> >>>> >> it in
> >> >>>> our
> >> >>>> >> build.
> >> >>>> >> >
> >> >>>> >> > The size of the reproducers is not really an issue; even if
> >> >>>> >> > they
> >> >>>> are a
> >> >>>> >> few gigabytes, they're still dwarfed by the size of the debug
> >> >>>> >> info, at least in our build.
> >> >>>> >> >
> >> >>>> >> > Passing -Wl,--reproduce is something a compiler engineer can
> >> >>>> >> > do when
> >> >>>> >> debugging an issue locally, but it's not something a bot can
> >> >>>> >> do. Even
> >> >>>> most
> >> >>>> >> developers on our team wouldn't know how to do it which is why
> >> >>>> >> the automatic crash reproducer generation in Clang is so
> >> >>>> >> valuable, all
> >> >>>> that
> >> >>>> >> developers need to do is to follow the instructions without
> >> >>>> >> having to modify the build and we've had great success with it
> >> >>>> >> in the case of
> >> >>>> Clang.
> >> >>>> >>
> >> >>>> >> Probably would help (if this isn't done already) this part at
> >> >>>> >> least
> >> >>>> >> (ie: users who don't have this newly proposed feature enabled)
> >> >>>> >> if lld's crash reporter printed the command line to run with
> >> >>>> >> the extra flag "to reproduce this run <this command>" for discoverability?
> >> >>>> >>
> >> >>>> >> (not to derail the primary discussion on this thread, which I
> >> >>>> >> don't have much opinion on)
> >> >>>> >>
> >> >>>> >> > I'm leaning towards the second option, that is implementing
> >> >>>> >> > this
> >> >>>> feature
> >> >>>> >> directly in lld. The reason is that we most often see lld
> >> >>>> >> crashes when linking Rust code. If we implemented this feature
> >> >>>> >> in the Clang
> >> >>>> driver, we
> >> >>>> >> would also need to do the same inside the Rust driver (and any
> >> >>>> >> other compiler driver that supports lld). If we implement it
> >> >>>> >> in lld, we
> >> >>>> only need
> >> >>>> >> to do it once, so it's more universal.
> >> >>>> >> >
> >> >>>> >> > On Wed, Apr 14, 2021 at 3:40 PM Fāng-ruì Sòng via llvm-dev <
> >> >>>> >> llvm-dev at lists.llvm.org> wrote:
> >> >>>> >> >>
> >> >>>> >> >> On Wed, Apr 14, 2021 at 3:27 PM Haowei Wu
> >> >>>> >> >> <haowei at google.com>
> >> >>>> wrote:
> >> >>>> >> >> >
> >> >>>> >> >> > > I am skeptical that users will want to have this
> >> >>>> >> >> > > behavior by
> >> >>>> >> default.
> >> >>>> >> >> > > If this behavior is guarded by an option, it might be fine.
> >> >>>> >> >> >
> >> >>>> >> >> > That's a good point. If the reproducer will be more than
> >> >>>> >> >> > a few
> >> >>>> >> hundreds MiBs, it is definitely not suitable to be enabled by
> >> >>>> default. I
> >> >>>> >> agree it's better to be guarded by an option flag such as
> >> >>>> >> `--gen-lld-crash-reproducer`.
> >> >>>> >> >> >
> >> >>>> >> >> > On Wed, Apr 14, 2021 at 2:40 PM Fangrui Song
> >> >>>> >> >> > <maskray at google.com
> >> >>>> >
> >> >>>> >> wrote:
> >> >>>> >> >> >>
> >> >>>> >> >> >>
> >> >>>> >> >> >> On 2021-04-14, Haowei Wu via llvm-dev wrote:
> >> >>>> >> >> >> >*Background / Motivation*
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >Both clang and lld have the ability to generate a
> >> >>>> >> >> >> >reproducer
> >> >>>> (an
> >> >>>> >> archive
> >> >>>> >> >> >> >with input files and invoker script to reproduce the
> >> >>>> >> >> >> >clang/lld
> >> >>>> >> build).
> >> >>>> >> >> >> >While clang will generate a reproducer archive when a
> >> >>>> >> >> >> >crash
> >> >>>> >> happens, lld
> >> >>>> >> >> >> >only generates a reproducer when '--reproduce' flag is
> >> >>>> explicitly
> >> >>>> >> set (this
> >> >>>> >> >> >> >is equivalent to Clang's -gen-reproducer flag). This is
> >> >>>> >> >> >> >not
> >> >>>> very
> >> >>>> >> helpful
> >> >>>> >> >> >> >for debugging lld bugs, particularly when the crash
> >> >>>> >> >> >> >happens in
> >> >>>> >> building big
> >> >>>> >> >> >> >projects, since it will be unrealistic to set
> >> >>>> >> >> >> >reproducer flags
> >> >>>> to
> >> >>>> >> generate
> >> >>>> >> >> >> >reproducer archives for every lld invocation. This
> >> >>>> >> >> >> >design also
> >> >>>> >> causes
> >> >>>> >> >> >> >troubles when the crash happens on bots only, as in
> >> >>>> >> >> >> >most cases,
> >> >>>> >> developers
> >> >>>> >> >> >> >do not have access to the file system of these bots. It
> >> >>>> >> >> >> >would
> >> >>>> be
> >> >>>> >> great to
> >> >>>> >> >> >> >improve the lld reproducer generation for easier
> >> >>>> >> >> >> >debugging in
> >> >>>> these
> >> >>>> >> >> >> >scenarios.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >*Proposal*
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >Given the use cases and status of clang and lld. I
> >> >>>> >> >> >> >think there
> >> >>>> are 2
> >> >>>> >> >> >> >possible solutions.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >*Extend Clang driver*
> >> >>>> >> >> >> >In most cases, lld is invoked by the clang driver
> >> >>>> >> >> >> >instead of
> >> >>>> being
> >> >>>> >> invoked
> >> >>>> >> >> >> >by the build system directly. Therefore, the clang
> >> >>>> >> >> >> >driver can
> >> >>>> be
> >> >>>> >> changed to
> >> >>>> >> >> >> >re-invoke lld with '--reproduce' flags when it detects
> >> >>>> >> >> >> >the lld
> >> >>>> >> subprocess
> >> >>>> >> >> >> >is crashed.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >Advantages:
> >> >>>> >> >> >> >    * It probably does not require any changes to the
> >> >>>> >> >> >> >lld and
> >> >>>> might
> >> >>>> >> be
> >> >>>> >> >> >> >easier than handling the crash directly in lld.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >Disadvantages:
> >> >>>> >> >> >> >    * In case when there is a racing condition in the
> >> >>>> >> >> >> >build
> >> >>>> system,
> >> >>>> >> the
> >> >>>> >> >> >> >input files might have changed between 1st lld crash
> >> >>>> >> >> >> >and 2nd
> >> >>>> lld
> >> >>>> >> rerun with
> >> >>>> >> >> >> >'--reproduce' flag. In this case, the generated lld
> >> >>>> >> >> >> >reproducer
> >> >>>> >> archive
> >> >>>> >> >> >> >might not be able to trigger a crash, makes it less useful.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >*Improve lld reproducer* Another way would be to make
> >> >>>> >> >> >> >lld generate a reproducer archive
> >> >>>> when
> >> >>>> >> it
> >> >>>> >> >> >> >crashes, just like what clang is doing.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >Advantages:
> >> >>>> >> >> >> >    * It will work no matter if lld is invoked from
> >> >>>> >> >> >> >Clang or
> >> >>>> from
> >> >>>> >> the build
> >> >>>> >> >> >> >system.
> >> >>>> >> >> >> >    * It will catch the input file in case the crash is
> >> >>>> >> >> >> >caused
> >> >>>> by
> >> >>>> >> build
> >> >>>> >> >> >> >races.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >Disadvantages:
> >> >>>> >> >> >> >    * It might need a lot of work if lld does not
> >> >>>> >> >> >> >already have
> >> >>>> a
> >> >>>> >> >> >> >sophisticated crash handler. It might still need some
> >> >>>> >> >> >> >plumbing
> >> >>>> >> changes in
> >> >>>> >> >> >> >clang driver so lld can honor the '-fcrash-diagnostic-dir'
> >> >>>> flag.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >*Comments?*
> >> >>>> >> >> >> >Which approach do you prefer? Feel free to share your opinions.
> >> >>>> >> >> >>
> >> >>>> >> >> >> There is a resource difference between clang
> >> >>>> >> >> >> -gen-reproducer / environment variable
> >> >>>> >> >> >> "FORCE_CLANG_DIAGNOSTICS_CRASH" and ld.lld
> >> >>>> >> --reproduce.
> >> >>>> >> >> >>
> >> >>>> >> >> >> clang -gen-reproducer produces a source file and a .sh
> >> >>>> >> >> >> file for
> >> >>>> one
> >> >>>> >> >> >> single translation unit, the space consumption is low.
> >> >>>> >> >> >> ld.lld --reproduce can potentially pack a large list of
> >> >>>> >> >> >> files,
> >> >>>> which
> >> >>>> >> may
> >> >>>> >> >> >> take hundreds of megabytes or several gigabytes.
> >> >>>> >> >> >>
> >> >>>> >> >> >> I am skeptical that users will want to have this
> >> >>>> >> >> >> behavior by
> >> >>>> default.
> >> >>>> >> >> >> If this behavior is guarded by an option, it might be fine.
> >> >>>> >> >>
> >> >>>> >> >> I'll retract my words about an option. This behavior looks
> >> >>>> >> >> like it needs a fair bit of customization and is build system dependent.
> >> >>>> >> >> You can replace the proposed option with a shell script
> >> >>>> >> >> wrapper,
> >> >>>> which
> >> >>>> >> >> is more convenient than implementing the restartable action
> >> >>>> >> >> in the clang driver.
> >> >>>> >> >> When dealing with linker problems, (I doubt there are many
> >> >>>> nowadays;
> >> >>>> >> >> when there are problems, mostly are LTO problems), I will
> >> >>>> >> >> usually change compiler/linker options a bit.
> >> >>>> >> >> If you do this, you may only specify the proposed option
> >> >>>> >> >> when all
> >> >>>> the
> >> >>>> >> >> stuff has been done, but then it is only a very small extra
> >> >>>> >> >> step to invoke the link again with -Wl,--reproduce.
> >> >>>> >> >> _______________________________________________
> >> >>>> >> >> LLVM Developers mailing list llvm-dev at lists.llvm.org
> >> >>>> >> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >>>> >> >
> >> >>>> >> > _______________________________________________
> >> >>>> >> > LLVM Developers mailing list llvm-dev at lists.llvm.org
> >> >>>> >> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >>>> >> _______________________________________________
> >> >>>> >> LLVM Developers mailing list
> >> >>>> >> llvm-dev at lists.llvm.org
> >> >>>> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >>>> >>
> >> >>>>
> >> >>>> >_______________________________________________
> >> >>>> >LLVM Developers mailing list
> >> >>>> >llvm-dev at lists.llvm.org
> >> >>>> >https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >>>>
> >> >>>> _______________________________________________
> >> >>>> LLVM Developers mailing list
> >> >>>> llvm-dev at lists.llvm.org
> >> >>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >>>>
> >> >>> _______________________________________________
> >> >>> LLVM Developers mailing list
> >> >>> llvm-dev at lists.llvm.org
> >> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >>>
> >> >> _______________________________________________
> >> >> LLVM Developers mailing list
> >> >> llvm-dev at lists.llvm.org
> >> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >>
> >>
> >> >_______________________________________________
> >> >LLVM Developers mailing list
> >> >llvm-dev at lists.llvm.org
> >> >https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >_______________________________________________
> >LLVM Developers mailing list
> >llvm-dev at lists.llvm.org
> >https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev