MultiError v2

`MultiError` is the one part of trio's core design that I'm really not satisfied with. Trying to support multiple exceptions on top of the language's firm one-exception-at-a-time stance raises (heh) all kinds of issues. Some of them can probably be fixed by changing the design. But the hardest problem is that there are lots of third-party packages that want to do custom handling of exception tracebacks (e.g. ipython, pytest, raven/sentry). And right now we have to monkeypatch all of them to work with MultiError, with more or less pain and success.

Now @1st1 wants to add nurseries to asyncio, and as part of that he'll need to add something like `MultiError` to the stdlib. We talked a bunch about this at PyCon, and the plan is to figure out how to do this in a way that works for both asyncio and trio. Probably we'll start by splitting `MultiError` off into a standalone library, that trio and uvloop can both consume, and then add that library to the stdlib in 3.8 (and then the library will remain as a backport library for those using trio on 3.7 and earlier). This way asyncio can build on our experience, and trio can get out of the monkeypatching business (because if `MultiError`s are in the stdlib, then it becomes ipython/pytest/sentry's job to figure out how to cope with them).

But before we can do that we need to fix the design, and do it in writing so we (and Yury) can all see that the new design is right :-).

# Current design

[If this were a PEP I'd talk more about the basic assumptions underlying the design: multiple errors cna happen concurrently, you need to preserve that fact, you need to be able to catch some-but-not-all of those errors, you need to make sure that don't accidentally throw away any errors that you didn't explicitly try to catch, etc. But this is a working doc so I'm just going to dive into the details...]

Currently, trio thinks of `MultiError` objects as being ephemeral things. It tries as much as possible to simulate a system where multiple exceptions just happen to propagate next to each other. So it's important to keep track of the individual errors and their tracebacks, but the `MultiError` objects themselves are just a detail needed to accomplish this.

So, we only create `MultiError`s when there are actually multiple errors – if a `MultiError` has only a single exception, we "collapse" it, so `MultiError([single_exc]) is single_exc`. The basic primitive for working with a `MultiError` is the `filter` function, which is really a kind of weird flat-map-ish kind of thing: it runs a function over each of the "real" exceptions inside a `MultiError`, and can replace or remove any of them. If this results in any `MultiError` object that has zero or one child, then `filter` collapses it. And the catch helper, `MultiError.catch`, is a thin wrapper for `filter`: it catches an exception, then runs a `filter` over it, and then reraises whatever is left (if anything).

One more important detail: traceback handling. When you have a nested collection of `MultiError`s, e.g. `MultiError([RuntimeError(), MultiError([KeyError(), ValueError()])])`, then the leaf exceptions' `__traceback__` attr holds the traceback for the frames where they traveled independently before meeting up to become a `MultiError`, and then each `MultiError` object's `__traceback__` attr holds the frames that that particular `MultiError` traversed. This is just how Python's `__traceback__` handling works; there's no way to avoid it. But that's OK, it's actually quite convenient – when we display a traceback, we don't want to say "exception 1 went through frames A, B, C, D, and independently, exception 2 went through frames A', B, C, D" – it's more meaningful, and less cluttered, to say "exception 1 went through frame A, and exception 2 went through frame A', and then they met up and together they went through frames B, C, D". The way `__traceback__` data ends up distributed over the `MultiError` structure makes this structure really easy to extract.

# Proposal for new design

Never collapse `MultiError`s. Example:

```python
async def some_func():
    async with trio.open_nursery() as nursery:
        async with trio.open_nursery() as nursery:
            raise RuntimeError
```

If you do `await some_func()` then currently you get a `RuntimeError`; in this proposal, you'll instead get a `MultiError([MultiError([RuntimeError()])])`.

Get rid of `filter`, and replace it with a new primitive `split`. Given an exception and a predicate, `split` splits the exception into one half representing all the parts of the exception that match the predicate, and another half representing all the parts that don't match. Example:

```python
match, rest = MultiError.split(
  # simple predicate: isinstance(exc, RuntimeError)
  RuntimeError,
  # The exception being split:
  MultiError([
    RuntimeError("a"),
    MultiError([
      RuntimeError("b"),
      ValueError(),
    ]),
  ])
)

# produces:

match = MultiError([RuntimeError("a"), MultiError([RuntimeError("b")])])
rest = MultiError([MultiError([ValueError()])])

```

The `split` operation always takes an exception type (or tuple of types) to match, just like an `except` clause. It should also take an optional arbitrary function predicate, like `match=lambda exc: ...`.

If either `match` or `rest` is empty, it gets set to `None`. It's a classmethod rather than a regular method so that you can still use it in cases where you have an exception but don't know whether it's a `MultiError` or a regular exception, without having to check.

Catching `MultiError`s is still done with a context manager, like `with MultiError.catch(RuntimeError, handler)`. But now, `catch` takes a predicate + a handler (as opposed to filter, which combines these into one thing), uses the predicate to `split` any caught error, and then if there is a match it calls the handler exactly once, passing it the matched object.

Also, support `async with MultiError.acatch(...)` so you can write async handlers.

# Limitations of the current design, and how this fixes them

## Collapsing is not as helpful to users as you might think

A "nice" thing about collapsing out `MultiError`s is that most of the time, when only one thing goes wrong, you get a nice regular exception and don't need to think about this `MultiError` stuff. I say "nice", but really this is... bad. When you write error handling code, you want to be prepared for everything that *could* happen, and this design makes it very easy to forget that `MultiError` is a possibility, and hard to figure out where `MultiError` handling is actually required. If the language made handling `MultiError`s more natural/ergonomic, this might not be as big an issue, but that's just not how Python works. So Yury is strongly against the collapsing design, and he has a point.

Basically, seeing `MultiError([RuntimeError()])` tells you "ah, this time it was a single exception, but *it could have been* multiple exceptions, so I'd better be prepared to handle that".

This also has the nice effect that it becomes much easier to teach people about `MultiError`, because it shows up front-and-center the first time you have an error inside a nursery.

One of my original motivations for collapsing was that `trio.run` has a hidden nursery (the "system nursery") that the main task runs inside, and if you do `trio.run(main)` and `main` raises `RuntimeError`, I wanted `trio.run` to raise `RuntimeError` as well, not `MultiError([RuntimeError()])`. But actually this is OK, because the way things have worked out, we never raise errors through the system nursery anyway: either we re-raise whatever `main` raised, *or* we raise `TrioInternalError`. So my concern was unfounded.

## Collapsing makes traceback handling more complicated and fragile

Collapsing also makes the traceback handling code substantially more complicated. When `filter` simplifies a `MultiError` tree by removing intermediate nodes, it has to preserve the traceback data those nodes held, which it does by patching it into the remaining exceptions. (In our example above, if exception 2 gets caught, then we patch exception 1's `__traceback__` so that it shows frames A, B, C, D after all.) This all works, but it makes the implementation much more complex. If we don't collapse, then we can throw away all the traceback patching code: the tracebacks can just continue to live on whichever object they started out on.

Collapsing also means that `filter` is a destructive operation: it has to mutate the underlying exception objects' `__traceback__` attributes in place, so you can't like, speculatively run a `filter` and then change your mind and go back to using the original `MultiError`. That object still exists but after the `filter` operation it's now in an inconsistent state. Fine if you're careful, but it'd be nicer if users didn't have to be careful. If we don't collapse, then this isn't an issue: `split` doesn't have to mutate its input (and neither would `filter`, if we were still using `filter`).

## Collapsing loses `__context__` for intermediate `MultiError` nodes

Currently, Trio basically ignores the `__context__` and `__cause__` attributes on `MultiError` objects. They don't get assigned any semantics, they get freely discarded when collapsing, and they often end up containing garbage data. (In particular, if you catch a `MultiError`, handle part of it, and re-raise the remainder... the interpreter doesn't know that this is semantically a "re-raise", and insists on sticking the old `MultiError` object onto the new one's `__context__`. We have countermeasures, but it's all super annoying and messy.)

It turns out though that we do actually have a good use for `__context__` on `MultiError`s. It's super not obvious, but consider this scenario: you have two tasks, A and B, executing concurrently in the same nursery. They both crash. But! Task A's exception propagates faster, and reaches the nursery first. So the nursery sees this, and cancels task B. Meanwhile, the task B has blocked somewhere – maybe it's trying to send a graceful shutdown message from a `finally:` block or something. The cancellation interrupts this, so now task B has a `Cancelled` exception propagating, and that exception's `__context__` is set to the original exception in task B. Eventually, the `Cancelled` exception reaches the nursery, which catches it. What happens to task B's original exception?

Currently, in Trio, it gets discarded. But it turns out that this is not so great – in particular, it's very possible that task A and B were working together on something, task B hit an error, and then task A crashed because task B suddenly stopped talking to it. So here task B's exception is the actual root cause, and task A's exception is detritus. At least two people have hit this in Trio (see #416 and https://github.com/python-trio/pytest-trio/issues/30).

In the new design, we should declare that a `MultiError` object's `__context__` held any exceptions that were *preempted by the creation of that `MultiError`, i.e., by the nursery getting cancelled*. We'd basically just look at the `Cancelled` objects, and move their `__context__` attributes onto the `MultiError` that the nursery was raising. But this only works if we avoid collapsing.

## It would be nice if tracebacks could show where exceptions jumped across task boundaries

This has been on our todo list forever. It'd be nice if we could like... annotate tracebacks somehow?

If we stopped collapsing `MultiError`s, then there's a natural place to put this information: each `MultiError` corresponds to a jump across task boundaries, so we can put it in the exception string or something. (Oh yeah, maybe we should switch `MultiError`s to having associated message strings? Currently they don't have that.)

## Filtering is just an annoying abstraction to use

If I wanted to express exception catching using a weird flat-map-ish thing, I'd be writing haskell. In Python it's awkward and unidiomatic. But with `filter`, it's necessary, because you could have any number of exceptions you need to call it on.

With `split`, there's always exactly 2 outputs, so you can perform basic `MultiError` manipulations in straight-line code without callbacks.

## Tracebacks make filtering even more annoying to use than it would otherwise be

When `filter` maps a function over a `MultiError` tree, the exceptions passed in are not really complete, standalone exceptions: they only have partial tracebacks attached. So you have to handle them carefully. You can't raise or catch them – if you did, the interpreter would start inserting new tracebacks and make a mess of things.

You might think it was natural to write a filter function using a generator, like how `@contextmanager` works:

```python
def catch_and_log_handler():
    try:
        yield
    except Exception as exc:
        log.exception(exc)

def normalize_errors_handler():
    try:
        yield
    except OtherLibraryError as exc:
        raise MyLibraryError(...) from exc
```

But this can't work, because the tracebacks would get all corrupted. Instead, handlers take exceptions are arguments, and return either that exception object, or a new exception object (like `MyLibraryError`).

If a handler function does raise an exception (e.g., b/c of a typo), then there's no good way to cope with that. Currently Trio's `MultiError` code doesn't even try to handle this.

In the proposed design, all of these issues go away. The exceptions returned by `split` are always complete and self-contained. Probably for `MultiError.catch` we still will pass in the exception as an argument instead of using a generator and `.throw`ing it – the only advantage of the `.throw` is that it lets you use an `except` block to say which exceptions you want to catch, and with the new `MultiError.catch` we've already done that before we call the handler. But we can totally allow `raise` as a way to replace the exception, or handle accidental exceptions. (See the code below for details.)

## Async catching

Currently we don't have an async version of `filter` or `catch` (i.e., one where the user-specified handler can be async). Partly this is because when I was first implementing this I hit an interpreter bug that made it not work, but it's also because `filter`'s is extremely complicated and maintaining two copies makes it that much worse.

With the new design, there's no need for async `split`, I think the new `catch` logic makes supporting both sync and async easy (see below).


# Details

```python
# A classic "catch-all" handler, very common in servers
with MultiError.catch(Exception, logger.exception):
    ...

# is equivalent to:

except BaseException as exc:
    caught, rest = MultiError.split(Exception, exc)
    if caught is None:
        raise
    try:
        logger.exception(caught)
    except BaseException as exc:
        # The way we set exc.__context__ here isn't quite right...
        # Ideally we should stash it the interpreter's implicit exception
        # handling context state before calling the handler.
        if rest is None:
            try:
                raise
            finally:
                exc.__context__ = caught
        else:
            exc.__context__ = caught
            new_exc = MultiError([exc, rest])
            try:
                raise new_exc
            finally:
                new_exc.__context__ = None
                # IIRC this traceback fixup doesn't actually work b/c
                # of interpreter details, but maybe we can fix that in 3.8
                new_exc.__traceback__ = new_exc.__traceback__.tb_next
    else:
        if rest is not None:
            orig_context = rest.__context__
            try:
                raise rest
            finally:
                rest.__context__ = orig_context
                # See above re: traceback fixup
                rest.__traceback__ = rest.__traceback__.tb_next
```

Notes:

As noted in comments, `__context__` and `__traceback__` handling is super finicky and has subtle bugs. Interpreter help would be very... helpful.

Notice that all the logic around the `logger.exception` call is always synchronous and can be factored into a context manager, so we can do something like:

```python
def catch(type, handler, match=None):
    with _catch_impl(type, match=None) as caught:
        handler(caught)
        
async def acatch(type, handler, match=None):
    with _catch_impl(type, match=None) as caught:
        await handler(caught)
```        

# Other notes

## Subclassing

Do we want to support subclassing of `MultiError`, like `class NurseryError(MultiError)`? Should we encourage it?

If so, we need to think about handling subclasses when cloning `MultiError`s in `.split`.

I think we should *not* support this though. Because, we don't have, and don't want, a way to distinguish between a `MultiError([MultiError([...])])` and a `MultiError([NurseryError([...])])` – we preserve the structure, it contains metadata, but still it's *structural*. `split` and `catch` still only allow you address the leaf nodes. And that's important, because if we made it easy to match on structure, then people would do things like try to catch a `MultiError([MultiError([RuntimeError()])])`, when what they should be doing is trying to catch one-or-more-`RuntimeError`s. The point of keeping the `MultiError`s around instead of collapsing is to push you to handle this case, not continue to hard-code assumptions about there being only a single error.

## Naming

`MultiError` isn't bad, but might as well consider other options while we're redoing everything. `AggregateError`? `CombinedError`? `NurseryError`?

## Relevant open issues

#408, #285, #204, #56, https://github.com/python-trio/pytest-trio/issues/30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

MultiError v2 #611

Current design

Proposal for new design

Limitations of the current design, and how this fixes them

Collapsing is not as helpful to users as you might think

Collapsing makes traceback handling more complicated and fragile

Collapsing loses `context` for intermediate `MultiError` nodes

It would be nice if tracebacks could show where exceptions jumped across task boundaries

Filtering is just an annoying abstraction to use

Tracebacks make filtering even more annoying to use than it would otherwise be

Async catching

Details

Other notes

Subclassing

Naming

Relevant open issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

MultiError v2 #611

Description

Current design

Proposal for new design

Limitations of the current design, and how this fixes them

Collapsing is not as helpful to users as you might think

Collapsing makes traceback handling more complicated and fragile

Collapsing loses __context__ for intermediate MultiError nodes

It would be nice if tracebacks could show where exceptions jumped across task boundaries

Filtering is just an annoying abstraction to use

Tracebacks make filtering even more annoying to use than it would otherwise be

Async catching

Details

Other notes

Subclassing

Naming

Relevant open issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Collapsing loses `context` for intermediate `MultiError` nodes