
k256: endomorphism-aware wNAF for vartime scalar multiplication#1745

Merged
tarcieri merged 9 commits into RustCrypto:master from 42Pupusas:k256/schnorr-verify-perf
Jun 10, 2026

@42Pupusas

Contributor

Summary

Replaces the placeholder MulVartime / MulByGeneratorVartime impls
for k256::ProjectivePoint (which fell back to the constant-time path
and had TODOs to match) with a real endomorphism-aware width-5 wNAF,
then folds the combined mul_by_generator_and_mul_add_vartime into a
single shared-doublings ladder over all 4 GLV sub-scalars.

Closes #1725.
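
For readers unfamiliar with the term, the identities behind "endomorphism-aware" can be written out as follows. This is the standard GLV setup for secp256k1 rather than anything quoted from the PR; the symbols p (field modulus), n (group order), β, λ, and the sub-scalar names are notation assumed here for illustration.

```latex
\begin{aligned}
\varphi(x, y) &= (\beta x,\; y), & \beta^3 &\equiv 1 \pmod{p},\ \beta \neq 1,\\
\varphi(P) &= \lambda P, & \lambda^2 + \lambda + 1 &\equiv 0 \pmod{n},\\
k &\equiv k_1 + k_2 \lambda \pmod{n}, & |k_1|, |k_2| &< 2^{128},\\
kP &= k_1 P + k_2\,\varphi(P),\\
aG + bP &= a_1 G + a_2\,\varphi(G) + b_1 P + b_2\,\varphi(P).
\end{aligned}
```

The last line is the four-term sum that the combined ladder evaluates with one shared sequence of doublings.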

What changes

  • Commit 1 — wNAF core. GLV-decompose the scalar into two ~128-bit
    halves, compute a width-5 signed-digit NAF of each magnitude, and run
    a standard left-to-right double-and-add with precomputed odd
    multiples [P, 3P, ..., 15P]. Sign is folded into the precomputed
    points at setup.
  • Commit 2 — share doublings. Extract a small WnafSlot
    (odd-multiples table + digits) and a wnaf_ladder helper. The
    combined a*G + b*P variant runs one ladder over all 4 GLV slots
    (G + βG + P + βP), doing one double() per step instead of two
    independent ladders.
  • Commit 3 — debug_assert. Guard the fixed 130-entry wNAF digit
    buffer; the bound is currently implicit in WNAF_WIDTH = 5, this
    makes it explicit at test time.
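
The wNAF core described in commit 1 can be sketched on a toy group. The code below is an illustration only, not the PR's implementation: it assumes the additive group of `i128` integers in place of curve points (so `k * P` is ordinary multiplication), takes a `u64` scalar instead of a ~128-bit GLV half, and the names `wnaf` and `mul_vartime` are chosen here for the example.

```rust
/// Window width from the PR description.
const WNAF_WIDTH: u32 = 5;
/// Odd multiples P, 3P, ..., 15P: 2^(w-2) = 8 entries.
const TABLE_SIZE: usize = 1usize << (WNAF_WIDTH - 2);

/// Width-`w` NAF of `k`, least-significant digit first. Every non-zero digit
/// is odd and lies in [-(2^(w-1) - 1), 2^(w-1) - 1].
fn wnaf(k: u64, w: u32) -> Vec<i8> {
    assert!((2..=8).contains(&w), "digits must fit in an i8");
    // Widen so that recentering near u64::MAX cannot overflow.
    let mut k = u128::from(k);
    let modulus = 1u128 << w;
    let half = modulus >> 1;
    let mut digits = Vec::new();
    while k != 0 {
        let mut d = 0i16;
        if k & 1 == 1 {
            let r = (k & (modulus - 1)) as i16; // k mod 2^w
            d = if r >= half as i16 { r - modulus as i16 } else { r };
            // Subtract the digit so the low w bits become zero.
            k = (k as i128 - i128::from(d)) as u128;
        }
        digits.push(d as i8);
        k >>= 1;
    }
    digits
}

/// Left-to-right double-and-add over the toy group (i128, +).
fn mul_vartime(p: i128, k: u64) -> i128 {
    let mut table = [p; TABLE_SIZE];
    for i in 1..TABLE_SIZE {
        table[i] = table[i - 1] + 2 * p; // P, 3P, 5P, ...
    }
    let mut acc = 0; // group identity
    for &d in wnaf(k, WNAF_WIDTH).iter().rev() {
        acc += acc; // one doubling per digit
        let multiple = table[usize::from(d.unsigned_abs()) / 2];
        if d > 0 {
            acc += multiple;
        } else if d < 0 {
            acc -= multiple;
        }
    }
    acc
}
```

For example, `wnaf(31, 5)` yields the digits `[-1, 0, 0, 0, 0, 1]` (that is, 32 - 1), so the ladder performs two table operations instead of five plain additions.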

Perf (Schnorr verify, default features, x86_64)

| Stage | Time per verify |
| --- | --- |
| Master (constant-time fallback) | ~62 µs |
| After commit 1 (wNAF) | ~53 µs |
| After commit 2 (shared ladder) | ~50 µs |

~19% faster end-to-end. Also speeds up any other user of
MulVartime / MulByGeneratorVartime on the k256 curve.

Test plan

  • cargo test -p k256 --lib --features getrandom — 89 passed
  • New randomized tests for mul_vartime and
    mul_and_mul_add_vartime vs. the constant-time reference
    (32 iterations each with ProjectivePoint::generate() and
    Scalar::generate()).
  • Edge-case tests: scalar = 0, 1, −1, point = identity.
  • cargo bench -p k256 --bench schnorr -- verify on an idle host
    confirms the numbers above (criterion-reported change is stable
    across runs).

Notes

  • Not constant time; SECURITY: comments are on the two vartime impls.
    Only reachable via the MulVartime / MulByGeneratorVartime traits
    that callers opt into for non-secret scalars.
  • Briefly explored several further optimizations (batched-affine
    odd-multiples via Montgomery's trick; static precomputed G tables
    with mixed-add; wider window for the G side). The first two
    regressed perf at this width/scalar size; the last gave ~4% more but
    added a ~6 KB static, const-generics, and a new LazyLock path —
    not worth the complexity for a single-curve specialization. This PR
    sticks to the change that's pure upside.

@tarcieri

Member

It would be good if this could be implemented in terms of types from the (rustcrypto-)group crate and its existing group::{Wnaf, WnafBase, WnafScalar}.

That will probably require API changes to (rustcrypto-)group, e.g. potentially moving scalar multiplication to the WnafGroup trait which can provide a method but let this crate plug in its own that's aware of the endomorphism. But if done right I think it would let us reuse all the types, including the new BasepointTableVartime which many other curves in this repo are using to accelerate e.g. ECDSA verification.
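
The shape of that suggestion (a trait that provides a generic method while letting one implementor plug in its own) might look roughly like the sketch below. Everything here is hypothetical: the trait name `VartimeGroup`, its methods, and the toy types are invented for illustration and are not the `group` crate's API. The "endomorphism" is modeled as multiplication by 2^32 on integers.

```rust
/// Hypothetical trait: supplies a generic vartime multiplication that
/// implementors may override with a curve-specific strategy.
trait VartimeGroup: Copy {
    fn identity() -> Self;
    fn double(self) -> Self;
    fn add(self, rhs: Self) -> Self;

    /// Provided method: plain binary double-and-add, 64 doublings.
    fn mul_vartime(self, k: u64) -> Self {
        let mut acc = Self::identity();
        for bit in (0..64).rev() {
            acc = acc.double();
            if (k >> bit) & 1 == 1 {
                acc = acc.add(self);
            }
        }
        acc
    }
}

/// Toy group that relies on the provided method.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Plain(i128);

impl VartimeGroup for Plain {
    fn identity() -> Self { Plain(0) }
    fn double(self) -> Self { Plain(2 * self.0) }
    fn add(self, rhs: Self) -> Self { Plain(self.0 + rhs.0) }
}

/// Toy group with a cheap "endomorphism" phi(P) = 2^32 * P.
#[derive(Clone, Copy, Debug, PartialEq)]
struct WithEndo(i128);

impl VartimeGroup for WithEndo {
    fn identity() -> Self { WithEndo(0) }
    fn double(self) -> Self { WithEndo(2 * self.0) }
    fn add(self, rhs: Self) -> Self { WithEndo(self.0 + rhs.0) }

    /// Override: split k = k1 + k2 * 2^32 and share 32 doublings.
    fn mul_vartime(self, k: u64) -> Self {
        let (k1, k2) = (k & 0xffff_ffff, k >> 32);
        let phi = WithEndo(self.0 * (1 << 32));
        let mut acc = Self::identity();
        for bit in (0..32).rev() {
            acc = acc.double();
            if (k1 >> bit) & 1 == 1 {
                acc = acc.add(self);
            }
            if (k2 >> bit) & 1 == 1 {
                acc = acc.add(phi);
            }
        }
        acc
    }
}

/// Generic caller: picks up whichever implementation the type provides.
fn scale<G: VartimeGroup>(p: G, k: u64) -> G {
    p.mul_vartime(k)
}
```

Generic code such as `scale` stays unchanged while the specialized type halves its doubling count, which is the reuse the comment above is asking for.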

@42Pupusas

42Pupusas commented May 10, 2026

Contributor Author

that makes sense, I should have thought of that but was kind of tunnel visioned on this crate.

will explore this direction @tarcieri. If an API change is needed in the group crate, I am guessing a corresponding PR should be raised there?

@tarcieri

Member

Yes, and you should be able to use patch.crates-io to update this PR

@42Pupusas 42Pupusas force-pushed the k256/schnorr-verify-perf branch 3 times, most recently from d22cd8a to 577a58b Compare May 14, 2026 17:09
@42Pupusas

Contributor Author

@tarcieri the new approach maintained most of the performance gains, but still falls a bit behind the hand-rolled version due to some extra allocations that happen in the group crate.

From my initial exploration, the Wnaf primitives could add a const generic for TABLE_SIZE, basically dropping heap allocations to 0 for the mul paths. However this is much more intrusive and potentially cascades down to most other curves that use it, so I held off from moving forward on that path.
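
A minimal sketch of what a const-generic table size buys: the odd-multiples table lives in a fixed array, so building it needs no heap allocation. The type `OddMultiples` and its methods are hypothetical and use `i128` integers as stand-in points; this is not the `group` crate's `Wnaf` type.

```rust
/// Hypothetical stack-allocated odd-multiples table: [P, 3P, ..., (2N-1)P].
struct OddMultiples<const N: usize> {
    table: [i128; N],
}

impl<const N: usize> OddMultiples<N> {
    /// Builds the table with N - 1 additions and no heap allocation.
    fn new(p: i128) -> Self {
        let mut table = [p; N];
        for i in 1..N {
            table[i] = table[i - 1] + 2 * p;
        }
        Self { table }
    }

    /// Returns digit * P for an odd wNAF digit with |digit| < 2N.
    fn lookup(&self, digit: i8) -> i128 {
        debug_assert!(digit & 1 == 1, "wNAF table digits are odd");
        let multiple = self.table[usize::from(digit.unsigned_abs()) / 2];
        if digit < 0 { -multiple } else { multiple }
    }
}
```

With `OddMultiples::<8>` (width 5) the whole table is a plain array of eight entries. The cost noted in the comment above still applies: the `N` parameter would have to appear in every signature that touches the table.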

If that approach seems valuable to you, I can follow up with a PR for that

@tarcieri

Member

@42Pupusas I would suggest keeping the changes as minimal as possible, as they need to get upstreamed to https://github.com/zkcrypto/group

It does look like there are a few things in the PR you opened which could be split out into their own PRs and upstreamed directly though

@42Pupusas

Contributor Author

Appreciate the guidance @tarcieri, thanks.

I have now split the groups PR into the following:

The upstream doesn't seem to have been very active in the last year, but should I push the PRs there instead?

tarcieri pushed a commit to RustCrypto/traits that referenced this pull request Jun 6, 2026
For window size `w`, wNAF digits have max magnitude `2^(w-1) - 1`. The
table is indexed by `|digit| / 2`, so the maximum index is `(2^(w-1) -
1) / 2 = 2^(w-2) - 1`, requiring `2^(w-2)` entries.

The previous `2^(w-1)` allocation computed twice as many odd multiples
as needed, wasting point additions during table setup.
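
The index arithmetic above can be checked mechanically. The helper names below (`max_digit`, `table_index`, `table_size`, `odd_multiples`) are invented for this sketch, and `i128` integers stand in for curve points.

```rust
/// Largest wNAF digit magnitude for window width `w`: 2^(w-1) - 1.
fn max_digit(w: u32) -> usize {
    (1usize << (w - 1)) - 1
}

/// Table slot for an odd digit: |digit| / 2 (integer division).
fn table_index(digit: isize) -> usize {
    digit.unsigned_abs() / 2
}

/// Number of entries needed: the largest index plus one.
fn table_size(w: u32) -> usize {
    table_index(max_digit(w) as isize) + 1
}

/// Builds [P, 3P, 5P, ...] with exactly `table_size(w)` entries.
fn odd_multiples(p: i128, w: u32) -> Vec<i128> {
    (0..table_size(w) as i128).map(|i| (2 * i + 1) * p).collect()
}
```

For `w = 5` this gives 8 entries ending at 15P, where a `2^(w-1)` allocation would compute 16.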


Ran the `k256` Schnorr signing and verifying benches and saw around 20%
improvement on hot paths. However,
@tarcieri, please note that my benchmarks were run against [this
branch](RustCrypto/elliptic-curves#1745 (comment)),
which does GLV decomposition + shared doublings, where the larger table +
extra allocations actually hurt.

While the fix is provably correct, the current vendored wNAF path only
allocates a single table, so the effect is not really measurable.

| Stage | wNAF tables built per verify | `2^(w-2)` (fix) | `2^(w-1)` (no fix) | Fix's contribution |
| --- | --- | --- | --- | --- |
| Pre-wNAF base | 0 (constant-time `lincomb`) | — | — | 0% (can't apply) |
| GLV wNAF, separate ladders | 2 (only P; G uses basepoint table) | 53.2 µs | 57.6 µs | ~7% |
| GLV + shared doublings | 4 (G via wNAF too, fused) | 50.6 µs | 59.8 µs | ~19% |
@42Pupusas

Contributor Author

@tarcieri with #2437 being merged, I am now only missing WnafScalar::from_le_bytes(&[u8]) (128-bit half-scalar) and WnafBase::multiscalar_mul_array(&[..], &[..]) to fully refactor the PR with the new traits.

Please confirm if I should proceed with PRs for those changes to the traits crate.

@tarcieri

tarcieri commented Jun 6, 2026

Member

Yes, you can go ahead and open PRs for those

@42Pupusas

Contributor Author

@tarcieri refactored with the new traits and cleaned up a bit to match style of rest of the crate.

I've pointed elliptic-curves to master for the new APIs, but I guess at some point you can bump a version of that crate?

Also I seem to have broken something in CI, maybe the crate patch?

Updated Benchmark

| Benchmark | master (4b919e9a) | branch (k256/schnorr-verify-perf) | Δ |
| --- | --- | --- | --- |
| schnorr/verify | ~64.1 µs | ~52.4 µs | −19.0% (≈1.22× faster) |

@tarcieri

tarcieri commented Jun 9, 2026

Member

@42Pupusas as of #1779 there is now a wnaf crate in this repo you can modify in the same commit if you rebase, which should make it easier than orchestrating work across multiple repos

42Pupusas and others added 9 commits June 9, 2026 16:34
Replaces the placeholder MulVartime / MulByGeneratorVartime impls (which
just called the constant-time path and had TODOs to match) with a width-5
wNAF that uses the GLV endomorphism to split each scalar into two
~128-bit halves.

Schnorr verify: ~62 µs -> ~53 µs (14% faster, no precomputed-tables;
~55 µs with tables). Addresses RustCrypto#1725.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the combined `mul_by_generator_and_mul_add_vartime` into a single
wNAF ladder over all 4 GLV sub-scalars (s1, s2 for G and the
endomorphism; e1, e2 for P and the endomorphism). One `double()` per
step instead of two independent ladders.

Factors out a small `WnafSlot` (odd-multiples table + digits) and a
`wnaf_ladder` helper so the single-point `mul_vartime` and the combined
op share the same loop body.

Schnorr verify: ~53 µs -> ~50 µs (no precomputed-tables; ~51 µs with
tables). Total vs. pre-wNAF baseline: ~62 µs -> ~50 µs (~19% faster).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
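
The shared-doublings idea in the commit above can be sketched with a doubling counter. This is a toy model, not the commit's code: points are `i128` integers, the endomorphism is replaced by multiplication by 2^32 (so a `u64` scalar splits into two 32-bit halves), and although `WnafSlot` and `wnaf_ladder` borrow their names from the commit message, their bodies here are assumptions.

```rust
const W: u32 = 5;
const TABLE_SIZE: usize = 1usize << (W - 2);
/// Toy "endomorphism": phi(P) = 2^32 * P, standing in for lambda * P.
const LAMBDA: i128 = 1 << 32;

/// Width-5 NAF, least-significant digit first.
fn wnaf(k: u64) -> Vec<i8> {
    let mut k = i128::from(k);
    let mut digits = Vec::new();
    while k != 0 {
        let mut d = 0;
        if k & 1 == 1 {
            d = k & ((1 << W) - 1);
            if d >= 1 << (W - 1) {
                d -= 1 << W;
            }
            k -= d;
        }
        digits.push(d as i8);
        k >>= 1;
    }
    digits
}

/// Odd-multiples table plus the digits that index into it.
struct WnafSlot {
    table: [i128; TABLE_SIZE],
    digits: Vec<i8>,
}

impl WnafSlot {
    fn new(point: i128, scalar: u64) -> Self {
        let mut table = [point; TABLE_SIZE];
        for i in 1..TABLE_SIZE {
            table[i] = table[i - 1] + 2 * point;
        }
        Self { table, digits: wnaf(scalar) }
    }

    /// Adds this slot's contribution at digit position `i`, if any.
    fn apply(&self, acc: i128, i: usize) -> i128 {
        match self.digits.get(i).copied().unwrap_or(0) {
            0 => acc,
            d if d > 0 => acc + self.table[d as usize / 2],
            d => acc - self.table[d.unsigned_abs() as usize / 2],
        }
    }
}

/// One ladder over all slots; returns the sum and the number of doublings.
fn wnaf_ladder(slots: &[WnafSlot]) -> (i128, usize) {
    let len = slots.iter().map(|s| s.digits.len()).max().unwrap_or(0);
    let (mut acc, mut doublings) = (0i128, 0usize);
    for i in (0..len).rev() {
        acc += acc; // a single doubling shared by every slot
        doublings += 1;
        for slot in slots {
            acc = slot.apply(acc, i);
        }
    }
    (acc, doublings)
}

/// a*G + b*P over four half-width slots (G, phi(G), P, phi(P)).
fn mul_by_generator_and_mul_add(g: i128, a: u64, p: i128, b: u64) -> (i128, usize) {
    let low = |k: u64| k & 0xffff_ffff;
    wnaf_ladder(&[
        WnafSlot::new(g, low(a)),
        WnafSlot::new(LAMBDA * g, a >> 32),
        WnafSlot::new(p, low(b)),
        WnafSlot::new(LAMBDA * p, b >> 32),
    ])
}
```

In this model the fused ladder needs at most 33 doublings for two 64-bit scalars, while two independent full-width ladders need roughly 65 each.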
`wnaf_128` writes into a fixed 130-entry buffer; the bound holds for the
current `WNAF_WIDTH = 5` and the ≤128-bit GLV sub-scalars, but it's
implicit. Add a `debug_assert!` in the loop so that any future change to
`WNAF_WIDTH` that invalidates the bound is caught at test time rather
than silently writing out of bounds in worst-case inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`wnaf_128` tracked the residual scalar in two u64 limbs, but a negative
recentered digit adds up to 2^(W-1) − 1 to the value, which can legitimately
overflow past bit 127 when the input is close to 2^128 − 1. The old code
let `hi.wrapping_add(1)` silently wrap, losing the carried bit and
producing a NAF that reconstructs to the wrong value.

The GLV decomposition's `(r1, r2)` each have magnitude strictly less than
2^128, so values in the carry-out window are possible (though
vanishingly rare in random scalars — which is why the existing
randomized tests never caught it).

Fix by carrying the overflow bit into a third limb `top` that is absorbed
back on the next right-shift. Perf impact is in the noise: the `top`
branch is almost never taken and the predictor handles it cleanly.

Add two regression tests:

- `test_wnaf_128_reconstruction_adversarial` — reconstructs the NAF of a
  scalar with low 128 bits = 0xFF..FF and asserts it equals 2^128 − 1.
- `test_mul_vartime_adversarial_scalars` — end-to-end check that
  `mul_vartime(P, k)` matches the constant-time reference when `k`'s low
  128 bits trigger the carry window.

Also add a `debug_assert!` on `idx` in `WnafSlot::apply` to guard the
parallel invariant (`idx < WNAF_TABLE_SIZE`) if `WNAF_WIDTH` is ever
widened without growing the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
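
The carry fix described above can be reproduced in isolation. The sketch below is an assumption-laden reconstruction, not the commit's `wnaf_128`: it keeps the residual in two `u64` limbs plus a `top` limb as described, but returns a `Vec` rather than writing into a fixed buffer, and `reconstruct_mod` is a helper invented here that evaluates the digits in the toy group Z/m so that a lost carry bit would show up as a wrong result.

```rust
const W: u32 = 5;

/// Width-5 NAF of the 128-bit value `hi * 2^64 + lo`, least-significant
/// digit first. `top` absorbs a carry out of bit 127.
fn wnaf_128(lo: u64, hi: u64) -> Vec<i8> {
    let (mut lo, mut hi, mut top) = (lo, hi, 0u64);
    let mut digits = Vec::new();
    while (lo | hi | top) != 0 {
        let mut d = 0i8;
        if lo & 1 == 1 {
            let r = (lo & ((1 << W) - 1)) as i8; // value mod 2^W
            if r >= 1 << (W - 1) {
                // Negative digit: subtracting it adds |d| to the value,
                // and the carry may ripple past bit 127.
                d = r - (1 << W);
                let (l, c1) = lo.overflowing_add(u64::from(d.unsigned_abs()));
                let (h, c2) = hi.overflowing_add(u64::from(c1));
                lo = l;
                hi = h;
                top += u64::from(c2);
            } else {
                d = r;
                lo -= r as u64; // clears the low bits, cannot underflow
            }
        }
        digits.push(d);
        // Shift the three-limb value right by one bit.
        lo = (lo >> 1) | (hi << 63);
        hi = (hi >> 1) | (top << 63);
        top >>= 1;
    }
    digits
}

/// Evaluates sum(d_i * 2^i) modulo `m` with Horner's rule.
fn reconstruct_mod(digits: &[i8], m: i128) -> i128 {
    digits
        .iter()
        .rev()
        .fold(0, |acc, &d| (2 * acc + i128::from(d)).rem_euclid(m))
}
```

For the all-ones input 2^128 - 1 the digits come out as -1 followed by 127 zeros and a final 1 at position 128, which is only reachable if the carry survives in `top`.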
Replace custom wNAF implementation (wnaf_128, build_odd_multiples,
WnafSlot, wnaf_ladder) with the group crate's WnafBase/WnafScalar
types and WnafBase::multiscalar_mul_array.

A new WnafScalar::from_le_bytes constructor accepts short (128-bit)
GLV half-scalars, producing ~half the wNAF digits and ~half the
doublings in the evaluation loop. multiscalar_mul_array avoids the
two collect() heap allocations of the iterator-based multiscalar_mul.

Depends on RustCrypto/group#15 for the group
crate changes (wnaf_table size fix, from_le_bytes, multiscalar_mul_array,
pre-sized Vec allocations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The WnafScalar::from_le_bytes (#2438) and WnafBase::multiscalar_mul_array
(#2439) work this branch relied on has merged upstream, replacing the
fork-only APIs used so far:

- replace the fork-only PrimeField::to_le_repr with a local to_le_bytes
  helper (to_repr is big-endian; reverse for from_le_bytes)
- handle the new fallible from_le_bytes signature; GLV half-scalars are
  < 2^128 so the canonical-range check cannot fail (.expect)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pure cleanup of the GLV+wNAF vartime path, no behavior change:

- Fold the loose `glv_wnaf_pair` / `mul_vartime_impl` /
  `mul_and_mul_add_vartime_impl` free functions into an
  `#[cfg(alloc)] impl ProjectivePoint` as methods, alongside the
  existing `mul_by_generator`.
- Replace the in-body `#[cfg]` ladders in the `MulVartime` /
  `MulByGeneratorVartime` impls with per-fn `#[cfg]` definitions,
  matching the `mul_by_generator` twin-definition idiom. Without
  `alloc`, `mul_vartime` is plain `self * rhs` and the trait's default
  `mul_by_generator_and_mul_add_vartime` applies (no override needed).
- Drop the `to_le_bytes` helper and its redundant big-endian->little-
  endian reverse: read the scalar's little-endian bytes directly via
  `U256::to_le_byte_array`. The wNAF path needs LE, and `to_repr` is BE,
  so the previous reverse was immediately undone inside `from_le_bytes`.
- Replace the `.expect` on `from_le_bytes` with an `unwrap_or_else`
  fallback to the infallible full-width `WnafScalar::new`, removing the
  panic path from a crypto routine.

Verified: builds across the feature matrix (arithmetic / schnorr, each
with and without alloc; alloc without precomputed-tables); all k256
tests pass; schnorr verify still ~20% faster than master.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
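
The last bullet's pattern, replacing a panic with a fallback to an infallible constructor, can be shown with a stand-in type. `ToyWnafScalar`, `from_le_half`, and `half_scalar` are hypothetical names for this sketch; the real `WnafScalar` constructors may have different signatures and acceptance rules.

```rust
/// Hypothetical stand-in for a wNAF scalar: records only its bit width.
#[derive(Debug, PartialEq)]
struct ToyWnafScalar {
    bits: usize,
}

impl ToyWnafScalar {
    /// Infallible full-width constructor (always accepted, more digits).
    fn new(_le: &[u8; 32]) -> Self {
        Self { bits: 256 }
    }

    /// Fallible short constructor: accepts only values below 2^128,
    /// i.e. little-endian inputs whose upper 16 bytes are all zero.
    fn from_le_half(le: &[u8; 32]) -> Option<Self> {
        le[16..].iter().all(|&b| b == 0).then(|| Self { bits: 128 })
    }
}

/// Prefers the short form and falls back to full width instead of
/// panicking when the input is unexpectedly large.
fn half_scalar(le: &[u8; 32]) -> ToyWnafScalar {
    ToyWnafScalar::from_le_half(le).unwrap_or_else(|| ToyWnafScalar::new(le))
}
```

The fallback costs extra digits on an input that should never occur, but it keeps the routine total instead of relying on an invariant enforced elsewhere.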
The GLV rework only defined mul_by_generator_and_mul_add_vartime under
`#[cfg(feature = "alloc")]`, so without `alloc` the impl fell through to the
trait default, which computes `aG + bP` as two independent variable-time
scalar multiplications (`mul_by_generator_vartime(a) + p.mul_vartime(b)`).
That doubles the point doublings versus the pre-GLV behavior, where this was a
single linear combination sharing doublings across both terms.

Add a `#[cfg(not(feature = "alloc"))]` arm that restores the original
`Self::lincomb(&[(G, a), (b_point, b)])`. The array-based LinearCombination
impl uses stack tables and is not gated on `alloc`, so it works in no_std
verifiers (the main consumers of the no-alloc path).

Verified the fallback matches `aG + bP` and the identity/zero edge cases in
both no-alloc configs (with and without precomputed-tables + critical-section).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
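
A rough sketch of the twin-definition arrangement this commit describes, with an array-based linear combination as the no-`alloc` arm. It is a toy under stated assumptions: `i128` integers replace curve points, `G` is an arbitrary constant, the `lincomb` helper is written here for illustration, and the `alloc` arm is only a placeholder for the heap-backed wNAF path.

```rust
/// Toy generator; an arbitrary integer standing in for the base point.
const G: i128 = 7;

/// Array-based linear combination: one doubling per bit shared by all
/// terms, using only stack storage.
fn lincomb<const N: usize>(terms: &[(i128, u64); N]) -> i128 {
    let mut acc = 0i128;
    for bit in (0..64).rev() {
        acc += acc; // shared doubling
        for &(point, scalar) in terms {
            if (scalar >> bit) & 1 == 1 {
                acc += point;
            }
        }
    }
    acc
}

/// With `alloc`: placeholder for the heap-backed wNAF evaluation.
#[cfg(feature = "alloc")]
fn mul_by_generator_and_mul_add_vartime(a: u64, p: i128, b: u64) -> i128 {
    let terms: Vec<(i128, u64)> = vec![(G, a), (p, b)];
    terms.iter().map(|&(point, k)| point * i128::from(k)).sum()
}

/// Without `alloc`: a single linear combination, so the doublings are
/// shared rather than repeated in two independent multiplications.
#[cfg(not(feature = "alloc"))]
fn mul_by_generator_and_mul_add_vartime(a: u64, p: i128, b: u64) -> i128 {
    lincomb(&[(G, a), (p, b)])
}
```

Both arms expose the same signature, so callers compile unchanged whichever feature set is active.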
As of RustCrypto#1779 the forked WnafBase/WnafScalar implementation lives in the
wnaf crate in this repository, re-exported by elliptic-curve behind its
`wnaf` feature; the root re-exports are gone from traits master.
Enable elliptic-curve/wnaf from the k256 alloc feature and import via
elliptic_curve::wnaf, matching primeorder. from_le_bytes returns Option
here rather than Result; adjust the fallback closures to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@42Pupusas 42Pupusas force-pushed the k256/schnorr-verify-perf branch from 4094895 to db2db3a Compare June 9, 2026 23:45

@tarcieri tarcieri left a comment

Member


Seems like a start. There are a few other cases to wire up like linear combinations / multiscalar multiplications, but this is enough to do an initial release.

@tarcieri tarcieri merged commit 43347f7 into RustCrypto:master Jun 10, 2026
18 checks passed
@42Pupusas

Contributor Author

@tarcieri if you open issues to target the specific usecases you feel are missing I will be happy to keep contributing, but I would rather the direction came from your side to not push blindly in any direction.

We depend heavily on k256 at our company, so any help needed related to it will get attention from me.

tarcieri added a commit that referenced this pull request Jun 17, 2026
Follows suit with #1798 and removes the integration with the `wnaf`
crate originally added in #1745. This removes `wnaf` as a stabilization
blocker for the time being. We can always add it back.

The result is ~20% slower, but doesn't require `alloc`, and avoids the
need to stabilize `wnaf` to stabilize `k256`.

high-level operations/point-scalar mul (variable-time)
    time:   [33.719 µs 33.895 µs 34.123 µs]
    change: [+15.985% +18.087% +20.911%] (p = 0.00 < 0.05)
    Performance has regressed.

ecdsa/verify_prehashed
    time:   [55.777 µs 56.194 µs 56.685 µs]
    change: [+9.2821% +13.258% +17.003%] (p = 0.00 < 0.05)
    Performance has regressed.

The performance regression is fairly significant though, so we should
investigate bringing it back when stabilization blockers are resolved.
Development

Successfully merging this pull request may close these issues.

k256: endomorphism-aware wNAF implementation

2 participants