Skip to content

MultiError v2 #611

Closed
Closed
@njsmith

Description

@njsmith

MultiError is the one part of trio's core design that I'm really not satisfied with. Trying to support multiple exceptions on top of the language's firm one-exception-at-a-time stance raises (heh) all kinds of issues. Some of them can probably be fixed by changing the design. But the hardest problem is that there are lots of third-party packages that want to do custom handling of exception tracebacks (e.g. ipython, pytest, raven/sentry). And right now we have to monkeypatch all of them to work with MultiError, with more or less pain and success.

Now @1st1 wants to add nurseries to asyncio, and as part of that he'll need to add something like MultiError to the stdlib. We talked a bunch about this at PyCon, and the plan is to figure out how to do this in a way that works for both asyncio and trio. Probably we'll start by splitting MultiError off into a standalone library, that trio and uvloop can both consume, and then add that library to the stdlib in 3.8 (and then the library will remain as a backport library for those using trio on 3.7 and earlier). This way asyncio can build on our experience, and trio can get out of the monkeypatching business (because if MultiErrors are in the stdlib, then it becomes ipython/pytest/sentry's job to figure out how to cope with them).

But before we can do that we need to fix the design, and do it in writing so we (and Yury) can all see that the new design is right :-).

Current design

[If this were a PEP I'd talk more about the basic assumptions underlying the design: multiple errors cna happen concurrently, you need to preserve that fact, you need to be able to catch some-but-not-all of those errors, you need to make sure that don't accidentally throw away any errors that you didn't explicitly try to catch, etc. But this is a working doc so I'm just going to dive into the details...]

Currently, trio thinks of MultiError objects as being ephemeral things. It tries as much as possible to simulate a system where multiple exceptions just happen to propagate next to each other. So it's important to keep track of the individual errors and their tracebacks, but the MultiError objects themselves are just a detail needed to accomplish this.

So, we only create MultiErrors when there are actually multiple errors – if a MultiError has only a single exception, we "collapse" it, so MultiError([single_exc]) is single_exc. The basic primitive for working with a MultiError is the filter function, which is really a kind of weird flat-map-ish kind of thing: it runs a function over each of the "real" exceptions inside a MultiError, and can replace or remove any of them. If this results in any MultiError object that has zero or one child, then filter collapses it. And the catch helper, MultiError.catch, is a thin wrapper for filter: it catches an exception, then runs a filter over it, and then reraises whatever is left (if anything).

One more important detail: traceback handling. When you have a nested collection of MultiErrors, e.g. MultiError([RuntimeError(), MultiError([KeyError(), ValueError()])]), then the leaf exceptions' __traceback__ attr holds the traceback for the frames where they traveled independently before meeting up to become a MultiError, and then each MultiError object's __traceback__ attr holds the frames that that particular MultiError traversed. This is just how Python's __traceback__ handling works; there's no way to avoid it. But that's OK, it's actually quite convenient – when we display a traceback, we don't want to say "exception 1 went through frames A, B, C, D, and independently, exception 2 went through frames A', B, C, D" – it's more meaningful, and less cluttered, to say "exception 1 went through frame A, and exception 2 went through frame A', and then they met up and together they went through frames B, C, D". The way __traceback__ data ends up distributed over the MultiError structure makes this structure really easy to extract.

Proposal for new design

Never collapse MultiErrors. Example:

async def some_func():
    async with trio.open_nursery() as nursery:
        async with trio.open_nursery() as nursery:
            raise RuntimeError

If you do await some_func() then currently you get a RuntimeError; in this proposal, you'll instead get a MultiError([MultiError([RuntimeError()])]).

Get rid of filter, and replace it with a new primitive split. Given an exception and a predicate, split splits the exception into one half representing all the parts of the exception that match the predicate, and another half representing all the parts that don't match. Example:

match, rest = MultiError.split(
  # simple predicate: isinstance(exc, RuntimeError)
  RuntimeError,
  # The exception being split:
  MultiError([
    RuntimeError("a"),
    MultiError([
      RuntimeError("b"),
      ValueError(),
    ]),
  ])
)

# produces:

match = MultiError([RuntimeError("a"), MultiError([RuntimeError("b")])])
rest = MultiError([MultiError([ValueError()])])

The split operation always takes an exception type (or tuple of types) to match, just like an except clause. It should also take an optional arbitrary function predicate, like match=lambda exc: ....

If either match or rest is empty, it gets set to None. It's a classmethod rather than a regular method so that you can still use it in cases where you have an exception but don't know whether it's a MultiError or a regular exception, without having to check.

Catching MultiErrors is still done with a context manager, like with MultiError.catch(RuntimeError, handler). But now, catch takes a predicate + a handler (as opposed to filter, which combines these into one thing), uses the predicate to split any caught error, and then if there is a match it calls the handler exactly once, passing it the matched object.

Also, support async with MultiError.acatch(...) so you can write async handlers.

Limitations of the current design, and how this fixes them

Collapsing is not as helpful to users as you might think

A "nice" thing about collapsing out MultiErrors is that most of the time, when only one thing goes wrong, you get a nice regular exception and don't need to think about this MultiError stuff. I say "nice", but really this is... bad. When you write error handling code, you want to be prepared for everything that could happen, and this design makes it very easy to forget that MultiError is a possibility, and hard to figure out where MultiError handling is actually required. If the language made handling MultiErrors more natural/ergonomic, this might not be as big an issue, but that's just not how Python works. So Yury is strongly against the collapsing design, and he has a point.

Basically, seeing MultiError([RuntimeError()]) tells you "ah, this time it was a single exception, but it could have been multiple exceptions, so I'd better be prepared to handle that".

This also has the nice effect that it becomes much easier to teach people about MultiError, because it shows up front-and-center the first time you have an error inside a nursery.

One of my original motivations for collapsing was that trio.run has a hidden nursery (the "system nursery") that the main task runs inside, and if you do trio.run(main) and main raises RuntimeError, I wanted trio.run to raise RuntimeError as well, not MultiError([RuntimeError()]). But actually this is OK, because the way things have worked out, we never raise errors through the system nursery anyway: either we re-raise whatever main raised, or we raise TrioInternalError. So my concern was unfounded.

Collapsing makes traceback handling more complicated and fragile

Collapsing also makes the traceback handling code substantially more complicated. When filter simplifies a MultiError tree by removing intermediate nodes, it has to preserve the traceback data those nodes held, which it does by patching it into the remaining exceptions. (In our example above, if exception 2 gets caught, then we patch exception 1's __traceback__ so that it shows frames A, B, C, D after all.) This all works, but it makes the implementation much more complex. If we don't collapse, then we can throw away all the traceback patching code: the tracebacks can just continue to live on whichever object they started out on.

Collapsing also means that filter is a destructive operation: it has to mutate the underlying exception objects' __traceback__ attributes in place, so you can't like, speculatively run a filter and then change your mind and go back to using the original MultiError. That object still exists but after the filter operation it's now in an inconsistent state. Fine if you're careful, but it'd be nicer if users didn't have to be careful. If we don't collapse, then this isn't an issue: split doesn't have to mutate its input (and neither would filter, if we were still using filter).

Collapsing loses __context__ for intermediate MultiError nodes

Currently, Trio basically ignores the __context__ and __cause__ attributes on MultiError objects. They don't get assigned any semantics, they get freely discarded when collapsing, and they often end up containing garbage data. (In particular, if you catch a MultiError, handle part of it, and re-raise the remainder... the interpreter doesn't know that this is semantically a "re-raise", and insists on sticking the old MultiError object onto the new one's __context__. We have countermeasures, but it's all super annoying and messy.)

It turns out though that we do actually have a good use for __context__ on MultiErrors. It's super not obvious, but consider this scenario: you have two tasks, A and B, executing concurrently in the same nursery. They both crash. But! Task A's exception propagates faster, and reaches the nursery first. So the nursery sees this, and cancels task B. Meanwhile, the task B has blocked somewhere – maybe it's trying to send a graceful shutdown message from a finally: block or something. The cancellation interrupts this, so now task B has a Cancelled exception propagating, and that exception's __context__ is set to the original exception in task B. Eventually, the Cancelled exception reaches the nursery, which catches it. What happens to task B's original exception?

Currently, in Trio, it gets discarded. But it turns out that this is not so great – in particular, it's very possible that task A and B were working together on something, task B hit an error, and then task A crashed because task B suddenly stopped talking to it. So here task B's exception is the actual root cause, and task A's exception is detritus. At least two people have hit this in Trio (see #416 and python-trio/pytest-trio#30).

In the new design, we should declare that a MultiError object's __context__ held any exceptions that were preempted by the creation of that MultiError, i.e., by the nursery getting cancelled. We'd basically just look at the Cancelled objects, and move their __context__ attributes onto the MultiError that the nursery was raising. But this only works if we avoid collapsing.

It would be nice if tracebacks could show where exceptions jumped across task boundaries

This has been on our todo list forever. It'd be nice if we could like... annotate tracebacks somehow?

If we stopped collapsing MultiErrors, then there's a natural place to put this information: each MultiError corresponds to a jump across task boundaries, so we can put it in the exception string or something. (Oh yeah, maybe we should switch MultiErrors to having associated message strings? Currently they don't have that.)

Filtering is just an annoying abstraction to use

If I wanted to express exception catching using a weird flat-map-ish thing, I'd be writing haskell. In Python it's awkward and unidiomatic. But with filter, it's necessary, because you could have any number of exceptions you need to call it on.

With split, there's always exactly 2 outputs, so you can perform basic MultiError manipulations in straight-line code without callbacks.

Tracebacks make filtering even more annoying to use than it would otherwise be

When filter maps a function over a MultiError tree, the exceptions passed in are not really complete, standalone exceptions: they only have partial tracebacks attached. So you have to handle them carefully. You can't raise or catch them – if you did, the interpreter would start inserting new tracebacks and make a mess of things.

You might think it was natural to write a filter function using a generator, like how @contextmanager works:

def catch_and_log_handler():
    try:
        yield
    except Exception as exc:
        log.exception(exc)

def normalize_errors_handler():
    try:
        yield
    except OtherLibraryError as exc:
        raise MyLibraryError(...) from exc

But this can't work, because the tracebacks would get all corrupted. Instead, handlers take exceptions are arguments, and return either that exception object, or a new exception object (like MyLibraryError).

If a handler function does raise an exception (e.g., b/c of a typo), then there's no good way to cope with that. Currently Trio's MultiError code doesn't even try to handle this.

In the proposed design, all of these issues go away. The exceptions returned by split are always complete and self-contained. Probably for MultiError.catch we still will pass in the exception as an argument instead of using a generator and .throwing it – the only advantage of the .throw is that it lets you use an except block to say which exceptions you want to catch, and with the new MultiError.catch we've already done that before we call the handler. But we can totally allow raise as a way to replace the exception, or handle accidental exceptions. (See the code below for details.)

Async catching

Currently we don't have an async version of filter or catch (i.e., one where the user-specified handler can be async). Partly this is because when I was first implementing this I hit an interpreter bug that made it not work, but it's also because filter's is extremely complicated and maintaining two copies makes it that much worse.

With the new design, there's no need for async split, I think the new catch logic makes supporting both sync and async easy (see below).

Details

# A classic "catch-all" handler, very common in servers
with MultiError.catch(Exception, logger.exception):
    ...

# is equivalent to:

except BaseException as exc:
    caught, rest = MultiError.split(Exception, exc)
    if caught is None:
        raise
    try:
        logger.exception(caught)
    except BaseException as exc:
        # The way we set exc.__context__ here isn't quite right...
        # Ideally we should stash it the interpreter's implicit exception
        # handling context state before calling the handler.
        if rest is None:
            try:
                raise
            finally:
                exc.__context__ = caught
        else:
            exc.__context__ = caught
            new_exc = MultiError([exc, rest])
            try:
                raise new_exc
            finally:
                new_exc.__context__ = None
                # IIRC this traceback fixup doesn't actually work b/c
                # of interpreter details, but maybe we can fix that in 3.8
                new_exc.__traceback__ = new_exc.__traceback__.tb_next
    else:
        if rest is not None:
            orig_context = rest.__context__
            try:
                raise rest
            finally:
                rest.__context__ = orig_context
                # See above re: traceback fixup
                rest.__traceback__ = rest.__traceback__.tb_next

Notes:

As noted in comments, __context__ and __traceback__ handling is super finicky and has subtle bugs. Interpreter help would be very... helpful.

Notice that all the logic around the logger.exception call is always synchronous and can be factored into a context manager, so we can do something like:

def catch(type, handler, match=None):
    with _catch_impl(type, match=None) as caught:
        handler(caught)
        
async def acatch(type, handler, match=None):
    with _catch_impl(type, match=None) as caught:
        await handler(caught)

Other notes

Subclassing

Do we want to support subclassing of MultiError, like class NurseryError(MultiError)? Should we encourage it?

If so, we need to think about handling subclasses when cloning MultiErrors in .split.

I think we should not support this though. Because, we don't have, and don't want, a way to distinguish between a MultiError([MultiError([...])]) and a MultiError([NurseryError([...])]) – we preserve the structure, it contains metadata, but still it's structural. split and catch still only allow you address the leaf nodes. And that's important, because if we made it easy to match on structure, then people would do things like try to catch a MultiError([MultiError([RuntimeError()])]), when what they should be doing is trying to catch one-or-more-RuntimeErrors. The point of keeping the MultiErrors around instead of collapsing is to push you to handle this case, not continue to hard-code assumptions about there being only a single error.

Naming

MultiError isn't bad, but might as well consider other options while we're redoing everything. AggregateError? CombinedError? NurseryError?

Relevant open issues

#408, #285, #204, #56, python-trio/pytest-trio#30

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions