Skip to content

decimal: Use FASTCALL and/or Argument Clinic #73487

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
vstinner opened this issue Jan 17, 2017 · 23 comments
Closed

decimal: Use FASTCALL and/or Argument Clinic #73487

vstinner opened this issue Jan 17, 2017 · 23 comments
Assignees
Labels
3.7 (EOL) end of life performance Performance or resource usage

Comments

@vstinner
Copy link
Member

BPO 29301
Nosy @rhettinger, @vstinner, @skrah, @serhiy-storchaka, @animalize
Files
  • decimal.patch
  • decimal-2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/skrah'
    closed_at = <Date 2018-05-29.22:14:26.039>
    created_at = <Date 2017-01-17.16:38:44.176>
    labels = ['3.7', 'performance']
    title = 'decimal: Use FASTCALL and/or Argument Clinic'
    updated_at = <Date 2019-03-21.00:24:41.547>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2019-03-21.00:24:41.547>
    actor = 'vstinner'
    assignee = 'skrah'
    closed = True
    closed_date = <Date 2018-05-29.22:14:26.039>
    closer = 'vstinner'
    components = []
    creation = <Date 2017-01-17.16:38:44.176>
    creator = 'vstinner'
    dependencies = []
    files = ['46320', '46435']
    hgrepos = []
    issue_num = 29301
    keywords = ['patch']
    message_count = 23.0
    messages = ['285665', '285670', '285671', '285673', '285680', '286359', '286368', '286369', '286371', '286372', '286374', '286375', '286448', '287131', '287133', '287149', '297135', '297153', '318123', '338430', '338462', '338513', '338516']
    nosy_count = 5.0
    nosy_names = ['rhettinger', 'vstinner', 'skrah', 'serhiy.storchaka', 'malin']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue29301'
    versions = ['Python 3.7']

    @vstinner
    Copy link
    Member Author

    I'm currently working on the isuse bpo-29259: "Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects". I used bm_telco of the performance benchmark suite to check which functions still require to create a temporary tuple for positional arguments and a temporary dict for keyword arguments. I found 3 remaining functions which have an impact on the result of the benchmark:

    • print(): optimized by the issue bpo-29296
    • _struct.unpack(): I just created the issue bpo-29300 "Modify the _struct module to use FASTCALL and Argument Clinic"
    • _decimal.Decimal.quantize()

    I would like to know if Stephan would be ok to modify the _decimal module to use FASTCALL. I know that recently he reverted changes to keep the same code base on Python 3.5, 3.6 and 3.7.

    With 4 changes (tp_fastcall bpo-29259, print bpo-29296, unpack bpo-29300 and this issue), bm_telco becomes 22% faster which is not negligible!

    20.9 ms +- 0.5 ms => 16.4 ms +- 0.5 ms

    Attached decimal.patch patch is the strict minimum change to optimize bm_telco, but I would prefer to change all _decimal functions and methods using METH_VARARGS and METH_VARARGS|METH_KEYWORDS to convert them to METH_FASTCALL.

    The best would be to use Argument Clinic. AC exists since Python 3.5, so it should be possible to keep the same code base on Python 3.5-3.7, only generated code would be different.

    @vstinner vstinner added 3.7 (EOL) end of life performance Performance or resource usage labels Jan 17, 2017
    @vstinner
    Copy link
    Member Author

    Oh wait, I'm not sure that attached patch has a significant impact on performances. It seems like the speedup mostly comes from the print patch:
    http://bugs.python.org/issue29296#msg285668

    But well, it is still interesting to use METH_FASTCALL in decimal ;-)

    @serhiy-storchaka
    Copy link
    Member

    See msg207652 in bpo-20177.

    @rhettinger
    Copy link
    Contributor

    This should not go in without Stephan Krah's approval. This code is finely tuned and carefully arranged.

    @vstinner
    Copy link
    Member Author

    Raymond Hettinger: "This should not go in without Stephan Krah's approval. This code is finely tuned and carefully arrang"

    Sure, that's why I opened this issue.

    Serhiy Storchaka: "See msg207652 in bpo-20177."

    Oh, I didn't know that Stefan Krah already gave his opinion.

    Since the performance impact is unclear, I prefer to close this issue. I will come back if argument parsing is more clearly identified as a performance bottleneck.

    @vstinner
    Copy link
    Member Author

    Hello, I'm back! :-)

    I almost abandonned my the tp_fastcall change (bpo-29259).

    print() now uses FASTCALL (bpo-29296), it already made bm_telco faster:
    https://speed.python.org/timeline/#/?exe=5&ben=telco&env=1&revs=50&equid=off&quarts=on&extr=on
    (see the speedup around January 16 at the right)

    For the unpack module, I'm still working on my patch in the issue bpo-29300, but it doesn't seem to have a significant impact on bm_telco.

    So the remaining question is the usage of FASTCALL or Argument Clinic in the _decimal module.

    Since the performance impact is unclear, I prefer to close this issue. I will come back if argument parsing is more clearly identified as a performance bottleneck.

    I ran a new benchmark and FASTCALL makes bm_telco benchmark 1.11x faster:

    Median +- std dev: [ref] 19.4 ms +- 0.7 ms -> [decimal-2.patch] 17.5 ms +- 0.6 ms: 1.11x faster (-10%)

    So I decided to reopen the issue.

    Stefan: what do you think of using Argument Clinic and/or FASTCALL in _decimal? Is "bm_telco 1.11x faster" significant enough for you to justify such change?

    "The best would be to use Argument Clinic. AC exists since Python 3.5, so it should be possible to keep the same code base on Python 3.5-3.7, only generated code would be different."

    @vstinner vstinner reopened this Jan 27, 2017
    @serhiy-storchaka
    Copy link
    Member

    I would dissuade from direct using of FASTCALL, but Argument Clinic is enough reliable. I would recommend to use Argument Clinic unless Stefan supports external fork of the decimal module and is limited by public API.

    @vstinner
    Copy link
    Member Author

    Oops, I forgot to attach decimal-2.patch, here you have.

    I would dissuade from direct using of FASTCALL, but Argument Clinic is enough reliable.

    I wrote decimal-2.patch in a few minutes, when I was trying to find all functions still calling _PyStack_AsTuple() in my tp_fast{new,init,call} fork. I just wrote it to identify which parts of CPython can be optimized to make bm_telco faster.

    I agree that Argument Clinic should be preferred over using directly METH_FASTCALL, especially for the _decimal module: Stefan already wrote that he wants to have the same code base on Python 3.5, 3.6 and 3.7.

    I would recommend to use Argument Clinic unless Stefan supports external fork of the decimal module and is limited by public API.

    What is the status of Argument Clinic? Is it something "public" or not?

    The status of _decimal is also unclear to me. Is it part of CPython or not? :-) I know that there is also a "backported" module for Python 2:

    https://pypi.python.org/pypi/cdecimal
    http://www.bytereef.org/mpdecimal/index.html

    IMHO it's a great tool. Maybe it's time to make it usable outside CPython in Python 3.7? Or maybe we should wait until the remaining missing features are implemented? Issues bpo-20291 and bpo-29299 for example.

    I'm now waiting for Stefan's feedback ;-)

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 27, 2017

    I'll take a look when I have the opportunity.

    AC will not happen: It makes the module too large and unreadable.

    @serhiy-storchaka
    Copy link
    Member

    What is the status of Argument Clinic? Is it something "public" or not?

    No, it is for for internal CPython use only. It lacks some features (support
    of var-positional and var-keyword parameters, optional parameters without
    default value), the syntax of positional-only parameters is not officially
    accepted still, and future optimizations can require incompatible changes.

    Only when all CPython builtins and extensions be converted to Argument Clinic,
    PEP-457 (or an alternative) be accepted, Argument Clinic issues be resolved,
    we could say it stable enough. For now even Argument Clinic tests are broken.

    @vstinner
    Copy link
    Member Author

    Stefan Krah:

    AC will not happen: It makes the module too large and unreadable.

    Ah you dislike the additional [clinic input] sections?

    It's kind of strange when you have to convert existing code, when once
    the code uses AC, I prefer AC to separated documentation variables. It
    helps to keep docstrings more up to date, and it helps to enhance the
    API (ex: allow keywords, rename parameters to better names, etc.). It
    also helps to make the documentation closer to the code, which is IMHO
    a good thing :-)

    IMHO the PyArg_ParseXXX() calls and their "kwlist" static variable are
    "unreadable", and I'm happy to be able to hide them!

    FYI decimal-2.patch replaces PyArg_ParseTupleAndKeywords() with
    _PyArg_ParseStackAndKeywords() with static _PyArg_Parser object. This
    object only decides keyword names once and is more efficient to parse
    arguments. It explains partially the speedup. Only partially because
    bm_telco only calls the .quantize() method, and it only uses
    positional arguments (no keyword arguments) ;-)

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 27, 2017

    STINNER Victor added the comment:
    > AC will not happen: It makes the module too large and unreadable.

    Ah you dislike the additional [clinic input] sections?

    Yes, they tear apart the code. I stopped reading many C files because
    of this. Brett asked why several people don't review, this is actually
    *one* of the reasons for me.

    It's kind of strange when you have to convert existing code, when once
    the code uses AC, I prefer AC to separated documentation variables. It
    helps to keep docstrings more up to date, and it helps to enhance the
    API (ex: allow keywords, rename parameters to better names, etc.). It
    also helps to make the documentation closer to the code, which is IMHO
    a good thing :-)

    Apparently it works for several people, but not me.

    IMHO the PyArg_ParseXXX() calls and their "kwlist" static variable are
    "unreadable", and I'm happy to be able to hide them!

    FYI decimal-2.patch replaces PyArg_ParseTupleAndKeywords() with
    _PyArg_ParseStackAndKeywords() with static _PyArg_Parser object. This
    object only decides keyword names once and is more efficient to parse
    arguments. It explains partially the speedup. Only partially because
    bm_telco only calls the .quantize() method, and it only uses
    positional arguments (no keyword arguments) ;-)

    Okay, thanks!

    @rhettinger
    Copy link
    Contributor

    > Ah you dislike the additional [clinic input] sections?

    Yes, they tear apart the code. I stopped reading many C files because
    of this.

    I concur with Stefan. AC is very impactful on modules, greatly affecting their readability and maintainability. The existing PyArg_ParseXXX() calls and their "kwlist" static variable are very easy to work with, easy to modify, and easy to teach (I cover them effortlessly in the Python classes I teach). In contrast, AC is much harder to learn, harder to get right, doesn't fit some existing APIs, harder to change, and it greatly expands the number of lines of code.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 6, 2017

    If the API can still change (msg287083), I think I'll wait with this until shortly before 3.7 beta.

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 6, 2017

    If the API can still change (msg287083), I think I'll wait with this until shortly before 3.7 beta.

    I do not understand why Serhiy keeps repeating that the API is going to change, I have no such plan. IMHO the FASTCALL API is now very well defined: (PyObject **args, Py_ssize_t nargs, PyObject *kwnames), and helper functions are now well tested.

    We might optimize Argument Clinic further, but for decimal, it seems like it's a no-no and that direct usage of METH_FASTCALL is preferred.

    So no, I don't plan or expect any API change.

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 6, 2017

    Oh ok, it seems like Serhiy wants to change METH_FASTCALL. I didn't know :-) He just opened the issue bpo-29464: "Specialize FASTCALL for functions with positional-only parameters". Interesting idea.

    @vstinner
    Copy link
    Member Author

    So Stefan: what do you want to do? Use FASTCALL or not? :-)

    I would prefer to not wait until the beta if you are ok to use FASTCALL.

    Note: it seems like bpo-29464 is going to be approved ;-)

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jun 28, 2017

    I think I'll wait until bpo-29464 is committed and the API is considered frozen (see msg295176?).

    @vstinner
    Copy link
    Member Author

    Sorry, I lost interest in this optimization. If someone wants to work on it, please open a new issue.

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 20, 2019

    AC will not happen: It makes the module too large and unreadable.

    AC has great performance improvements these days: bpo-35582 and bpo-36127
    It's quite worthy of reconsidering this decision.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Mar 20, 2019

    Thanks, but it is still not going to happen. Look at the increased code size
    in e.g. blake2s_impl.c.h.

    I want to know what is going on in the code. Also, the performance
    improvements are in argument parsing, but not when you have numerical
    code like a * b, where a and b are already decimals.

    @vstinner
    Copy link
    Member Author

    Also, the performance improvements are in argument parsing, but not when you have numerical code like a * b, where a and b are already decimals.

    If the function call takes around 100 ns, the benefit of FASTCALL and the new more efficient function to parse arguments becomes quite significant. Some Decimal methods are that fast if not faster.

    python3 -m perf timeit \
    -s 'import decimal; a=decimal.Decimal(1); b=decimal.Decimal(2); ctx=decimal.getcontext()' \
    'ctx.copy_sign(a, b)' \
    --duplicate=1024

    => Mean +- std dev: [ref] 148 ns +- 1 ns -> [fastcall] 98.9 ns +- 4.9 ns: 1.49x faster (-33%)

    ./python -m perf timeit \
    -s 'import decimal; a=decimal.Decimal(1); b=decimal.Decimal(2)' \
    'a.copy_sign(b)' \
    --duplicate=1024

    => Mean +- std dev: [ref] 123 ns +- 5 ns -> [fastcall] 86.5 ns +- 1.4 ns: 1.42x faster (-29%)

    I wrote a quick & dirty change using directly METH_FASTCALL with _PyArg_ParseStackAndKeywords() (dec_mpd_qcopy_sign) and _PyArg_CheckPositional() (ctx_mpd_qcopy_sign). Using Argument Clinic may be more verbose, but it generates *even more* efficient code for functions accepting keyword arguments (like Decimal.copy_sign) and generate better docstring (at least, the signature for function parameters if no text is given).

    Obviously, if your application only uses large numbers, the benefit will be invisible. But I guess that some people use decimal with "small" numbers ;-)

    Note: Sure, you're right that operators like "a * b" wouldn't benefit from FASTCALL since they don't parse explicitly arguments. BINARY_MULTIPLY instruction gets directly 2 values from the Python stack and pass them to PyNumber_Multiply() with is defined to always take exactly 2 PyObject* in C.

    @vstinner
    Copy link
    Member Author

    Hum, after reading again my previous, I'm not sure that my intent was clear. I'm fine with Stefan rejecting the optimization. He is the maintainer of decimal. I just wanted to comment what he said ;-)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants