Skip to content

Use the extened euclidian algorithm to compute scalar inverse in variable time #730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

deadalnix
Copy link

This is faster than the multiple exponentiation method that is used now. It remains way slower than libgmp, so probably not a full replacement, but it provides a nice speedup when libgmp is not available.

These are the benchmark time I measure.

scalar_inverse: min 14.4us / avg 16.4us / max 20.0us
scalar_inverse_var: min 11.5us / avg 11.6us / max 11.6us

While there is a fair bag of trick that is used to make this fast, there is still a lot of untapped potential. For instance, doing numerous additions before reducing x0 and x1. While more could be done, this is already an improvement, so it is time for a PR.

@deadalnix deadalnix force-pushed the scalarinversevar branch 2 times, most recently from 966f173 to be1ce35 Compare March 29, 2020 02:36
@elichai
Copy link
Contributor

elichai commented Mar 29, 2020

This seem like a simpler approach than #290
even though it's slower and doesn't have the jacobi implementation, but the bignum approach is quite hard to implement correctly.

(haven't reviewed the actual logic here)

@deadalnix
Copy link
Author

I wasn't aware of #290 , this is unfortunate. A few comments.

The approach is slightly different. This uses an approach with no divisions - technically, divisions by powers of 2 only, but these can be performed as simple shifts and additions while #290 tries to limit the number of bigint divisions using lehmer's optimization. I do not know which is best and it is hard to compare because the implementations details are quite different.

I was considering using a 5x64bits or 9x32bits big int for x0, x1 and negx going forward, just like #290. A lot of time is spent keeping them between 0 and p, but we could do it only once at the end. Everything else do not really needs big int. This is the next logical step here.

On a more high level note, I think proceeding is small steps is preferable. It is easier to review, avoids rebase hell and delivers value to users now. In addition, it allows for better fan out down the road, as it is possible to work and experiment with the various optimisations and variations discussed independently on top of a common base.

Implementing jacobi, further optimizing this and making it work with field element in addition to scalars is on my todo list, but I think it is better to agree on some approach before investing that much effort into this.

@real-or-random
Copy link
Contributor

Let me just note for now that I haven't had a closer look so far but this looks very interesting. Is this the algorithm described in pseudocode here? https://crypto.stackexchange.com/a/54623/12020
A pseudocode variant will make it easier to review this.

@deadalnix
Copy link
Author

deadalnix commented Mar 30, 2020

Not quite, but it is similar. The way i approached is starting with algorithm 2.23 from "Guide to Elliptic Curve Cryptography" by Hankerson, Menezes and Vanstone for which I added a few optimizations:

  • a can only be even early on, and at that time, x1 is always zero, so no need to double it.
  • when b is even, they accumulate more factors 2 in x0 and x1, but I eliminate them right away - it is worth it due to the optimized division by a power of 2 routine.

Pseudocode is as follow:

inverse a:
  b = p
  x0 = 1
  x1 = 0
  k = 0

  while a is even:
    a /= 2
    k++

  while a > 1:
    b -= a
    x1 -= x0
    while b is even:
      b /= 2
      x1 = x1 / 2 if x1 is even else (x1 + p) / 2
    if a > b:
      swap a, b
      swap x0, x1

  repeat k times:
    x0 = x0 / 2 if x0 is even else (x0 + p) / 2

  return x0

This is the hist of it, but there are various optimizations done along the way.

Not adding it to scalar.h because it is not meant to be a public API. Scalars do not have ordering per se, but this is meant to be able to implement the extended euclidian algorithm in order to compute inverse in variable time effisciently.
…scalars in variable time.

This is faster than the exponentiation method. It is still significantly slower than GMP, so we keep that option as well.
@deadalnix
Copy link
Author

I updated secp256k1_scalar_pow2_div to add zero instea of skipping the fixup with a branch as it turns out it is faster.

@mratsim
Copy link

mratsim commented Apr 15, 2020

FYI, EEA can be made constant-time with very reasonable speed (for a constant-time impl, it would still be an order of magnitude slower than variable time GCD like in GMP) following the algorithm 5 from paper: https://link.springer.com/content/pdf/10.1007%2F978-3-642-40588-4_10.pdf

image

Implementation in C are available in GMP (mirror) and Nettle, in C++ in Botan and in Nim in my own library Constantine

There is a lengthy discussion on constant-time modular inverse in Botan repo: randombit/botan#1479

In particular there is discussion on a recent paper by Bernstein and Yang that claims a very fast constant time GCD and inversion (https://eprint.iacr.org/2019/266).

@deadalnix
Copy link
Author

That's interesting, but considering the benchmark I have, I highly doubt this can be made faster that the exponentiation technique that is used today.

@mratsim
Copy link

mratsim commented Apr 16, 2020

The Bernstein-Yang paper claims that their GCD-based inversion is faster than using Little Fermat on Curve25519. Similarly, the Möller algorithm for inversion is faster than the addition-chains from secp256k1 in my own benchmark though I use a generic Montgomery mul/square algorithm and the difference is about 20% so faster/specialized mul/square will probably change the balance). Anyway if there is a dedicated thread on constant-time inversion I would be happy to discuss there instead of in this variable time EEA thread.

@roconnor-blockstream
Copy link
Contributor

roconnor-blockstream commented Apr 21, 2020

I've made a PR with some suggested amendments @ deadalnix#1

roconnor-blockstream@2446b1e replaces the scalar_complement and scalar_binadd with a scalar_sub function. I've been told that my scalar_sub function has managed to coax the compiler into emitting a SBB (subtract with borrow) instruction which is very good. I haven't verified that fact myself yet. I believe this substitution is an objective improvement.

I suppose there is an argument over which style is better, using subtraction or addition with complement as is done in scalar_negate. My opinion is that subtraction is marginally better and arguably the scalar_negate should be using this subtraction technique.

roconnor-blockstream@13e07ab revises the main secp256k1_scalar_eea_inverse a little. My main goal there is to remove the use of goto. My reasons are a little selfish. The Verifiable C project cannot reason about C code with gotos, and I would like to be able to do a formal verification of this code at some point in the future. That said, to date I've only done a very limited about of verification of libsecp256k1 using Verifiable C, only enough to evaluate the tool. Otherwise this would be the only use of goto in the entire codebase. I think this revision to secp256k1_scalar_eea_inverse is a subjective improvement.

That all said, it would be a huge improvement to generalize this so that it can be used to compute field inversions in addition to scalar inversions. ECDSA verification requires a single scalar inversion, and Schnorr verification requires no scalar inversions at all, whereas field inversion occurs in a moderate number of places during verification.

I think we can generalize this by building a static table of "remainders", as the lookup table found in secp256k1_scalar_pow2_div, for both the field order and group order, and pass a pointer to required table into a generalized eea_inverse function. For field inversion, we can compact the field value into the scalar type (or perhaps better, a new equivalent 256-bit type), perform the eea inversion, and copy the result back into the field limb format.

Let me know if I've made any errors.

@deadalnix
Copy link
Author

Awesome!

I'm not sure if it is best to comment here or in your PR, I'll do it here and we can move the discussion there if need be.

I was working myself on the next iteration of this thing and I got some more nice speedup - mostly by avoiding to reduce at every steps. Unfortunately, we stepped on each others toe a bit, but that's okay, there are no way you'd be able to know.

  • You move some variable declarations to have smaller scope. This is awesome. I was uncertain about the C89 rules and don't wanted to fight the compiler over this so I went extra large.

  • I'm not sure about the goto removal. I get it, goto is frowned upon, and for the most part for good reasons. But once in a while comes this situation where a goto really is better and I think we are in this case. This is because the goto effectively skips a good chunk of the first loop iteration - and in fact it can skip more. If you have a solution to achieve this without goto, then I'm taking, but it is unclear to me if this is possible at all.

  • Your implementation of scalar sub should be called binsub instead, because it skips the modulo reduction. It should also return the leftover so that the caller can do something with it. If nothing else, check that expectations about overflow are maintained.

  • Please do not remove the binadd operation. There are optimizations of secp256k1_scalar_pow2_div that can greatly benefit from it. It is also useful as a building block to avoid reduction at every step. I have code for all this, but not in a shape I think is good enough for review ATM. I can upload it anyways if you want to check.

  • I expect that there won't be any major roadblock to generalize this to field element and I intend to do this if there is interest. I made the choice of going for scalar only so that I can focus on extracting all the juice from there - there is a lot more to extract. Doing everything at once would make review more difficult, as well as iterating on various ways to do things to see which is faster slower, both of which are undesirable initially.

@roconnor-blockstream
Copy link
Contributor

roconnor-blockstream commented Apr 21, 2020

Regarding variable scoping. I wasn't sure either. Jonas suggested to me that it was okay to narrow the scope, so I guess we should do it unless we hear an objection.

Regarding goto: I believe my revision does no more work than your goto version. Roughly speaking the call secp256k1_scalar_negate(&v, a); was eliminated in favour of entering the loop from the top an running secp256k1_scalar_sub which does more or less the identical amount of work.

In light of your overall comments, I suggest closing my PR against your repo and I can revisit it once you have some new code. Feel free to incorporate any of my suggestions that you agree with into your next update. I have a couple of other, more questionable tricks in there, such as using *r as scratch space.

@roconnor-blockstream
Copy link
Contributor

I believe this PR is now obsolete and can be closed since https://github.com/bitcoin-core/secp256k1/pull/831/files has now been merged.

@deadalnix
Copy link
Author

Yes. Closing.

@deadalnix deadalnix closed this Mar 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants