[llvm-dev] [RFC] Error handling in LLVM libraries.

Wed Feb 10 18:58:24 PST 2016

Hi All,

Now that this thread has accumulated some feedback, I thought I'd try to
summarize my thoughts on error handling in LLVM, and why I think this
proposal is worth adopting:

(1) Failure to check an error *is* a programmatic error, and our error
system should reflect this. This makes it easy to spot mistakes in our
error handling and correct them.

(2) Error returns should describe all the information about an error that a
client could reasonably use to recover from the error. This requirement
goes beyond what a flat enum, or even an enum with a category (like
std::error_code) is capable of expressing. For example, it is possible to
express "missing_file" in an enum, but not "missing_file: /path/to/file". A
library client could meaningfully respond to the latter by re-generating
the file (if they knew how) and re-trying the operation.

(3) There may be many levels of stack between the point where an error is
raised, and the point where it can be handled. A generic error handling
system needs to bridge this gap, bringing all the information needed to
handle an error together in one place. In the process, the system should
unwind through all levels of the stack that can't handle the error allowing
resources to be freed, and ensuring execution does not continue into code
that can't handle the error. (A corollary to this is that a generic error
handling system should never respond to an error in user-supplied input by
aborting).

(4) Errors should be easy to break on in a debugger.

(5) Errors should have minimal / no overhead in the success case.

I think we have consensus that point (1) is a serious problem. My scheme
addresses this problem by including a 'checked' bit that verifies that
errors (and even success values) have been inspected before being
discarded. However, we could do something similar with a wrapper around
std::error_code without adopting the custom-error-class aspect of my
proposal.

Points (2) and (3) (which are closely related) can't be addressed with
std::error_code. My scheme does address these points by allowing users to
define custom error classes.

Point (4) is addressed in my scheme by setting breakpoints on error class
constructors. With a wrapper around std::error_code, the best you can do is
to set a conditional breakpoint looking for construction with a particular
enum value.

Point (5) is a wash - both TypedError and std::error_code are cheap in the
success case.

One of the concerns that has been raised with my system is complexity. I
don't think the system is overly complex conceptually (though the
implementation does involve some non-trivial template goop). I presented
the system in detail in my original email, but I'll try to summarize here:

This system provides an owning "error pointer" class, TypedError, that is
as cheap as std::error_code to move around, return from functions, etc.
Success values are indicated by a null "error pointer" value that is
extremely cheap to construct. Failure values are indicated by a non-null
error-pointer. The cost of constructing a failure value is just the cost of
constructing the pointee describing the error, plus the cost of handling
the error (if that happens). The pointee must be a user-defined class that
works with the typed-error RTTI system. We provide a utility class,
TypedErrorInfo, to make it easy to construct such classes:

class MyError : TypedErrorInfo<MyError> { ... };

User defined error classes must also implement a log method to render the
error to a stream, however this can usually be done in a couple of lines.
Since we expect the number of user-defined error classes to be small, this
overhead should be minimal. For usage examples on all this, please see the
error demo attached to my original email.

So, given the benefits of this proposal, I think the complexity is well
justified.

An alternative scheme that has been floated is using a combination of
error_code and diagnostic handlers, but I have two strong objections to
this as a general solution:

(1) Error returns are already required to ensure that code further up the
stack does not continue past an error. Given the fundamental necessity of
error returns, it makes sense to re-use them for diagnostics where
possible, as it avoids the need to *also* thread a diagnostic handler
through any function that could fail. If used as a general solution,
diagnostic handlers would quickly become unwieldy, especially when calling
between libraries. For example, if library A calls library B, which calls
library C, and library C can produce an error, you are now forced thread a
diagnostic handler through library A and B for library C's sake. This does
not scale well.

I recognize that diagnostic handlers have totally valid use cases. I am not
proposing that we remove any of the existing diagnostic handlers from LLVM,
and I'm not against introducing new ones where there is a use-case.
However, in use-cases where TypedError is necessary to enable recovery and
sufficient to produce good error messages, I think diagnostic handlers are
superfluous.

(2) By retaining std::error_code, the error_code + diagnostic-handler
scheme still falls short on points (2) and (3) on my original list. That
is, the combination of diagnostic handlers and std::error_code only works
if the best "recovery" you can do is report an error on a stream and then
bail out or continue, based on a coarse-grained error_code. That's good
enough for a lot of command-line utilities, but not good enough for general
library design.

Responses to this proposal seem mostly positive so far, and I hope this
summary addresses the concerns that have been raised, but I'm very happy to
continue the conversation if not.

That said, I'm really keen to start improving error handling in libObject
and the JIT, so in the interest of maintaining some momentum on this
proposal, here's how I plan to proceed if this proposal is adopted:

(1) Write up a set of unit-tests for the TypedError system, and a new
section of the LLVM Programmer's Manual describing its use.

(2) Re-submit the original prototype (with minor tweaks) plus the
unit-tests and programmers manual additions to llvm-commits fro review.

(3) Begin threading TypedError and TypedErrorOr through MachOObjectFile in
libObject, using ECError as a bridge between TypedError/TypedErrorOr and
std::error_code/ErrorOr.

The act of threading this through libObject will hit LLD, MCJIT and ORC,
DebugInfo, LTO, ProfilingData and a host of command-line utilities.
However, the ECError glue is fairly straightforward and idiomatic, and
provides us with an explicit (grepable) surface area that can be chipped
away at. Working with Kevin Enderby, my next step would be to plumb
TypedError through the MachO-specific parts of llvm-nm and llvm-objdump.
Based on our experiments with the prototype, we expect to be able to
significantly improve the diagnostics produced by these tools for broken
object files.

In the long-term I expect many parts of LLVM may want to migrate from
std::error_code to TypedError, but that's ultimately a decision for the
maintainers of the respective parts. I'll be very happy to provide input
and support, but I don't want to pre-empt anyone else's decision on whether
to adopt this.

Cheers,
Lang.

On Wed, Feb 10, 2016 at 1:36 PM, Lang Hames <lhames at gmail.com> wrote:

> Hi Rafael,
>
> What prevents you from using a diag handler in the jit server that
> sends errors/warnings over the RPCChannel?
>
>
> Sorry - this is a non-trivial problem to think through, so to speed
> things up, here are the issues:
>
> (1) A diagnostic handler can only exit or continue, neither of which are
> sufficient to deal with an error in the general case. You need to be able
> to stop and unwind to some point in the stack where the error can be
> handled, meaningfully releasing / cleaning-up resources as you go.
>
> (2) Using a diagnostic handler for structured error serialization removes
> the cohesion between error serialization / deserialization and the error
> types themselves. It also removes the cohesion between serialization and
> deserialization: serialization happens in the diagnostic handler,
> deserialization somewhere else. All this makes it very easy to forget to
> update serialization/deserialization code when error types change, and for
> serialization/deserialization to get out of sync with one another.
> Diagnostic handlers really aren't designed for this problem.
>
> In my design, anyone building a client/server system on top of ORC can
> make any error reportable over the wire by inheriting from some
> "Serializable" interface and writing their serialization/deserialization
> code in the error type itself. The server can then contain a
> checkTypedError block that boils down to:
>
> while (1) {
>   if (auto Err = handleServerCommand(Channel))
>     if (Err is handleable by the server)
>       /* handle it */
>     else if (Err.isA<Serializable>())
>       Err->serialize(Channel); /* Report it to the client, let them
> respond */
>     else
>       return Err; /* Bail out of the server loop */
> }
>
> Cheers,
> Lang.
>
> On Wed, Feb 10, 2016 at 10:55 AM, Lang Hames <lhames at gmail.com> wrote:
>
>> Hi Rafael,
>>
>> What prevents you from using a diag handler in the jit server that
>> sends errors/warnings over the RPCChannel?
>>
>>
>> What would you do with errors that can't reasonable be serialized when
>> they reach the diagnostic handler?
>>
>> And what would you do with the serialized bytes on the client end?
>>
>> - Lang.
>>
>> Sent from my iPhone
>>
>> On Feb 10, 2016, at 10:31 AM, Rafael Espíndola <
>> rafael.espindola at gmail.com> wrote:
>>
>> I recently added support for remote JITing to ORC. There are in-tree
>> library
>>
>> facilities for JITing code into a process on the other end of an abstract
>>
>> "RPCChannel". What happens if something goes wrong at one end? We want to
>> be
>>
>> able to communicate an error back across the RPCChannel (assuming it's
>> still
>>
>> intact) so the other side can recover or fail gracefully. That means we
>> need
>>
>> to be able to serialize an error with enough information to describe what
>>
>> went wrong. There's no practical way to maintain a serialization routine
>> for
>>
>> all possible std::error_codes that might come up, even if they were
>> powerful
>>
>> enough to describe everything that could go wrong (which again, being
>> static
>>
>> kinds, they're not). With my proposal however, a JITError base class can
>> be
>>
>> defined as:
>>
>>
>> class JITError : public TypedErrorInfo<JITError> {
>>
>> public:
>>
>>  virtual void serialize(RPCChannel &C) const = 0;
>>
>> };
>>
>>
>>
>> What prevents you from using a diag handler in the jit server that
>> sends errors/warnings over the RPCChannel?
>>
>> Cheers,
>> Rafael
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160210/a0e16bd2/attachment.html>