Skip to content

gh-118702: Implement vectorcall for BaseException #118703

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 10, 2024

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented May 7, 2024

  • BaseException_vectorcall() now creates a tuple from 'args' array.
  • Creation an exception using BaseException_vectorcall() is now a single function call, rather than having to call BaseException_new() and then BaseException_init(). Calling BaseException_init() is inefficient since it overrides the 'args' attribute.
  • _PyErr_SetKeyError() now uses PyObject_CallOneArg() to create the KeyError instance to use BaseException_vectorcall().

Micro-benchmark on creating a KeyError on accessing a non-existent dictionary key:

Mean +- std dev: 447 ns +- 31 ns -> 373 ns +- 15 ns: 1.20x faster

@vstinner
Copy link
Member Author

vstinner commented May 7, 2024

Benchmark:

import pyperf

EMPTY_DICT = {}
INNER_LOOPS = 1024
KEYS = tuple(range(INNER_LOOPS))

def bench_keyerror():
    d = EMPTY_DICT
    for key in KEYS:
        try:
            d[key]
        except KeyError:
            pass

runner = pyperf.Runner()
runner.bench_func('keyerror', bench_keyerror, inner_loops=INNER_LOOPS)

@vstinner
Copy link
Member Author

vstinner commented May 7, 2024

@erlend-aasland
Copy link
Contributor

Nice change, but you broke CI :)

@vstinner
Copy link
Member Author

vstinner commented May 7, 2024

Nice change, but you broke CI :)

Right, FAIL: testKeywordArgs (test.test_exceptions.ExceptionTests.testKeywordArgs).

It should now be fixed.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised that switching to vector call makes a difference here. How is it in comparison with the following BaseException_vectorcall implementation?

argstuple = _PyTuple_FromArray(args, PyVectorcall_NARGS(nargsf));
self = type_obj->tp_new(type_obj, argstuple, NULL);
Py_TYPE(self)->tp_init(self, argstuple, NULL);
Py_DECREF(argstuple);

If it is still faster than the current code, then there is something wrong in the non-vectorcall path.

@eendebakpt
Copy link
Contributor

I am surprised that switching to vector call makes a difference here. How is it in comparison with the following BaseException_vectorcall implementation?

argstuple = _PyTuple_FromArray(args, PyVectorcall_NARGS(nargsf));
self = type_obj->tp_new(type_obj, argstuple, NULL);
Py_TYPE(self)->tp_init(self, argstuple, NULL);
Py_DECREF(argstuple);

If it is still faster than the current code, then there is something wrong in the non-vectorcall path.

In current main there is a call to PyTuple_Pack that is not in the vectorcall path. I did not benchmark this, but I suspect that could explain part of the difference

@vstinner
Copy link
Member Author

vstinner commented May 7, 2024

If it is still faster than the current code, then there is something wrong in the non-vectorcall path.

This change avoids calling type_call() which contains more code.

@vstinner
Copy link
Member Author

vstinner commented May 7, 2024

Benchmark result with CPU isolation (this PR):

Mean +- std dev: [ref] 264 ns +- 1 ns -> [baseexc] 227 ns +- 1 ns: 1.16x faster

I also measured only the Python/errors.c changes; replace PyTuple_Pack()+PyObject_Call() with PyObject_CallOneArg():

Mean +- std dev: [ref] 264 ns +- 1 ns -> [optim_keyerror] 248 ns +- 2 ns: 1.07x faster

It's faster, but not as fast as this PR.

@serhiy-storchaka
Copy link
Member

This change avoids calling type_call() which contains more code.

But was is wrong with type_call()? On what does it spent time besides calling tp_new(), tp_init() and few trivial checks that it is so slower? Or maybe the difference is just the cost of these trivial checks?

@erlend-aasland
Copy link
Contributor

But was is wrong with type_call()? On what does it spent time besides calling tp_new(), tp_init() and few trivial checks that it is so slower? Or maybe the difference is just the cost of these trivial checks?

AFAICS, in _PyObject_VectorcallTstate, a vectorcall function is called directly, avoiding the overhead of _PyObject_MakeTpCall and type_call.

Python/errors.c Outdated
Comment on lines 260 to 266
PyObject *exc = PyObject_CallOneArg(PyExc_KeyError, arg);
if (!exc) {
/* caller will expect error to be set anyway */
return;
}
_PyErr_SetObject(tstate, PyExc_KeyError, tup);
Py_DECREF(tup);

_PyErr_SetRaisedException(tstate, exc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the same. It does not set __context__, and there may be other differences in case when it fails to create an exception object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. I fixed this regression and added an unit test.

@serhiy-storchaka
Copy link
Member

The half of the gain is due to the change in _PyErr_SetKeyError() which is not equivalent to the current code, so it perhaps should be reverted. Addition of BaseException_vectorcall is still beneficial right now, but optimizing _PyObject_MakeTpCall and type_call may reduce the gain even more. And it does not beneficial for exceptions with __new__ or __init__. In the end, it may make the benefit/cost ratio pretty low.

I do not actively oppose this change, I only suggest that there may be place for more general optimization. If you still want to add BaseException_vectorcall, with or without optimizing the other path, you are free to do this.

@vstinner vstinner force-pushed the baseexception_vectorcall branch from fcc7a2f to e689491 Compare May 10, 2024 09:16
@vstinner vstinner marked this pull request as ready for review May 10, 2024 09:39
@vstinner vstinner requested a review from iritkatriel as a code owner May 10, 2024 09:39
vstinner added 4 commits May 10, 2024 11:45
* BaseException_vectorcall() now creates a tuple from 'args' array.
* Creation an exception using BaseException_vectorcall() is now a
  single function call, rather than having to call
  BaseException_new() and then BaseException_init().
  Calling BaseException_init() is inefficient since it overrides
  the 'args' attribute.
* _PyErr_SetKeyError() now uses PyObject_CallOneArg() to create the
  KeyError instance to use BaseException_vectorcall().

Micro-benchmark on creating a KeyError on accessing a non-existent
dictionary key:

    Mean +- std dev: 447 ns +- 31 ns -> 373 ns +- 15 ns: 1.20x faster
@vstinner vstinner force-pushed the baseexception_vectorcall branch from 4424f98 to eb0b861 Compare May 10, 2024 09:45
@vstinner
Copy link
Member Author

The PR is now ready for a review. The main branch is now Python 3.14.

I rebased the PR on the main branch and squashed commit. I fixed the __context__ regression.

Updated benchmark, Python built with gcc -O3, with CPU isolation:

  • key_error: Mean +- std dev: [ref] 267 ns +- 1 ns -> [change] 243 ns +- 1 ns: 1.10x faster
  • value_error: Mean +- std dev: [ref] 288 ns +- 2 ns -> [change] 258 ns +- 2 ns: 1.12x faster

I added a benchmark on raise ValueError.

+----------------+--------+----------------------+
| Benchmark      | ref    | change               |
+================+========+======================+
| key_error      | 267 ns | 243 ns: 1.10x faster |
+----------------+--------+----------------------+
| value_error    | 288 ns | 258 ns: 1.12x faster |
+----------------+--------+----------------------+
| Geometric mean | (ref)  | 1.11x faster         |
+----------------+--------+----------------------+

Benchmark:

import pyperf

EMPTY_DICT = {}
INNER_LOOPS = 1024
KEYS = tuple(range(INNER_LOOPS))

def bench_key_error(loops):
    range_it = range(loops)
    t0 = pyperf.perf_counter()

    value = "value"
    for _ in range_it:
        d = EMPTY_DICT
        for key in KEYS:
            try:
                d[key]
            except KeyError:
                pass

    return pyperf.perf_counter() - t0


def bench_value_error(loops):
    range_it = range(loops)
    t0 = pyperf.perf_counter()

    value = "value"
    for _ in range_it:
        try: raise ValueError(value)
        except: pass
        try: raise ValueError(value)
        except: pass

        try: raise ValueError(value)
        except: pass
        try: raise ValueError(value)
        except: pass

        try: raise ValueError(value)
        except: pass
        try: raise ValueError(value)
        except: pass

        try: raise ValueError(value)
        except: pass
        try: raise ValueError(value)
        except: pass

        try: raise ValueError(value)
        except: pass
        try: raise ValueError(value)
        except: pass

    return pyperf.perf_counter() - t0

runner = pyperf.Runner()
runner.bench_time_func('key_error', bench_key_error, inner_loops=INNER_LOOPS)
runner.bench_time_func('value_error', bench_value_error, inner_loops=10)

@vstinner
Copy link
Member Author

@serhiy-storchaka:

The half of the gain is due to the change in _PyErr_SetKeyError() which is not equivalent to the current code, so it perhaps should be reverted.

I fixed the code.

Addition of BaseException_vectorcall is still beneficial right now,

10% to 12% faster to create exceptions sound appealing since raising KeyError is a common operation.

I ran benchmark on functions using METH_VARVARGS. KeyError is the clear winner in number of calls when running ./python -m test test_builtin.

but optimizing _PyObject_MakeTpCall and type_call may reduce the gain even more. And it does not beneficial for exceptions with new or init. In the end, it may make the benefit/cost ratio pretty low.

I don't see any simple optimization opportunity, do you?

I do not actively oppose this change, I only suggest that there may be place for more general optimization.

The optimization applies to 32 built-in exceptions:

  • ArithmeticError
  • AssertionError
  • BaseException
  • BufferError
  • BytesWarning
  • DeprecationWarning
  • EOFError
  • EncodingWarning
  • Exception
  • FloatingPointError
  • FutureWarning
  • GeneratorExit
  • ImportWarning
  • IndexError
  • KeyError
  • KeyboardInterrupt
  • LookupError
  • NotImplementedError
  • OverflowError
  • PendingDeprecationWarning
  • PythonFinalizationError
  • RecursionError
  • ReferenceError
  • ResourceWarning
  • RuntimeError
  • RuntimeWarning
  • StopAsyncIteration
  • SyntaxWarning
  • SystemError
  • TypeError
  • UnicodeError
  • UnicodeWarning
  • UserWarning
  • ValueError
  • Warning
  • ZeroDivisionError

@vstinner vstinner merged commit aa36f83 into python:main May 10, 2024
@vstinner vstinner deleted the baseexception_vectorcall branch May 10, 2024 19:08
@vstinner
Copy link
Member Author

Merged, thanks for reviews.

estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
* BaseException_vectorcall() now creates a tuple from 'args' array.
* Creation an exception using BaseException_vectorcall() is now a
  single function call, rather than having to call
  BaseException_new() and then BaseException_init().
  Calling BaseException_init() is inefficient since it overrides
  the 'args' attribute.
* _PyErr_SetKeyError() now uses PyObject_CallOneArg() to create the
  KeyError instance to use BaseException_vectorcall().
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants