k256: endomorphism-aware wNAF for vartime scalar multiplication by 42Pupusas · Pull Request #1745 · RustCrypto/elliptic-curves

42Pupusas · 2026-04-24T01:07:39Z

Summary

Replaces the placeholder MulVartime / MulByGeneratorVartime impls
for k256::ProjectivePoint (which fell back to the constant-time path
and had TODOs to match) with a real endomorphism-aware width-5 wNAF,
then folds the combined mul_by_generator_and_mul_add_vartime into a
single shared-doublings ladder over all 4 GLV sub-scalars.

Closes #1725.

What changes

Commit 1 — wNAF core. GLV-decompose the scalar into two ~128-bit
halves, compute a width-5 signed-digit NAF of each magnitude, and run
a standard left-to-right double-and-add with precomputed odd
multiples [P, 3P, ..., 15P]. Sign is folded into the precomputed
points at setup.
Commit 2 — share doublings. Extract a small WnafSlot
(odd-multiples table + digits) and a wnaf_ladder helper. The
combined a*G + b*P variant runs one ladder over all 4 GLV slots
(G + βG + P + βP), doing one double() per step instead of two
independent ladders.
Commit 3 — debug_assert. Guard the fixed 130-entry wNAF digit
buffer; the bound is currently implicit in WNAF_WIDTH = 5, this
makes it explicit at test time.
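As a rough illustration of the signed-digit recoding described in Commit 1, here is a minimal sketch of a width-`w` NAF over a 64-bit integer. The names `wnaf` and `reconstruct` are hypothetical and are not the PR's `wnaf_128`; a plain integer accumulator stands in for the curve point, and the input is limited to 64 bits so the arithmetic stays inside `u128`/`i128`.

```rust
/// Width-`w` NAF of `k`, least-significant digit first.
/// Hypothetical helper for illustration; not the PR's `wnaf_128`.
fn wnaf(k: u64, w: u32) -> Vec<i8> {
    assert!((2..=8).contains(&w), "digits must fit in an i8");
    let modulus = 1i32 << w; // 2^w
    let half = modulus / 2; // 2^(w-1)
    // Widen so that adding back a negative digit cannot overflow.
    let mut k = k as u128;
    let mut digits = Vec::new();
    while k != 0 {
        let digit = if k & 1 == 1 {
            // Low w bits, recentred into the odd range (-2^(w-1), 2^(w-1)).
            let mut d = (k & (modulus as u128 - 1)) as i32;
            if d > half {
                d -= modulus;
            }
            // Remove the digit; the next w-1 bits of k are now zero.
            if d >= 0 {
                k -= d as u128;
            } else {
                k += (-d) as u128;
            }
            d
        } else {
            0
        };
        digits.push(digit as i8);
        k >>= 1;
    }
    digits
}

/// Left-to-right double-and-add over the digits, with a plain integer
/// accumulator standing in for the curve point.
fn reconstruct(digits: &[i8]) -> i128 {
    digits
        .iter()
        .rev()
        .fold(0i128, |acc, &d| 2 * acc + i128::from(d))
}
```

For example, `wnaf(31, 5)` yields `[-1, 0, 0, 0, 0, 1]`, i.e. 31 = 32 − 1, so the ladder needs two additions instead of five.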

Perf (Schnorr verify, default features, x86_64)

| Stage | time/verify |
| --- | --- |
| Master (constant-time fallback) | ~62 µs |
| After commit 1 (wNAF) | ~53 µs |
| After commit 2 (shared ladder) | ~50 µs |

~19% faster end-to-end. Also speeds up any other user of
MulVartime / MulByGeneratorVartime on the k256 curve.

Test plan

cargo test -p k256 --lib --features getrandom — 89 passed
New randomized tests for mul_vartime and
mul_and_mul_add_vartime vs. the constant-time reference
(32 iterations each with ProjectivePoint::generate() and
Scalar::generate()).
Edge-case tests: scalar = 0, 1, −1, point = identity.
cargo bench -p k256 --bench schnorr -- verify on an idle host
confirms the numbers above (criterion-reported change is stable
across runs).

Notes

Not constant time; SECURITY: comments are on the two vartime impls.
Only reachable via the MulVartime / MulByGeneratorVartime traits
that callers opt into for non-secret scalars.
Briefly explored several further optimizations (batched-affine
odd-multiples via Montgomery's trick; static precomputed G tables
with mixed-add; wider window for the G side). The first two
regressed perf at this width/scalar size; the last gave ~4% more but
added a ~6 KB static, const-generics, and a new LazyLock path —
not worth the complexity for a single-curve specialization. This PR
sticks to the change that's pure upside.

tarcieri · 2026-05-10T22:56:32Z

It would be good if this could be implemented in terms of types from the (rustcrypto-)group crate and its existing group::{Wnaf, WnafBase, WnafScalar}.

That will probably require API changes to (rustcrypto-)group, e.g. potentially moving scalar multiplication to the WnafGroup trait which can provide a method but let this crate plug in its own that's aware of the endomorphism. But if done right I think it would let us reuse all the types, including the new BasepointTableVartime which many other curves in this repo are using to accelerate e.g. ECDSA verification.

42Pupusas · 2026-05-10T23:11:36Z

that makes sense, I should have thought of that but was kind of tunnel visioned on this crate.

will explore this direction @tarcieri , if an API change is needed to the group crate I am guessing a corresponding PR should be raised there?

tarcieri · 2026-05-11T00:29:28Z

Yes, and you should be able to use patch.crates-io to update this PR

42Pupusas · 2026-05-14T17:41:42Z

@tarcieri the new approach maintained most of the performance gains, but still falls a bit behind the hand-rolled version due to some extra allocations that happen in the group crate.

From my initial exploration, the Wnaf primitives could add a const generic for TABLE_SIZE, basically dropping heap allocations to 0 for the mul paths. However this is much more intrusive and potentially cascades down to most other curves that use it, so I held off from moving forward on that path.

If that approach seems valuable to you, I can follow up with a PR for that

tarcieri · 2026-05-14T19:46:01Z

@42Pupusas I would suggest keeping the changes as minimal as possible, as they need to get upstreamed to https://github.com/zkcrypto/group

It does look like there are a few things in the PR you opened which could be split out into their own PRs and upstreamed directly though

42Pupusas · 2026-05-18T19:58:54Z

Appreciate the guidance @tarcieri, thanks.

I have now split the groups PR into the following:

The upstream doesn't seem very active in the last year, but should I push the PRs there instead ?

@tarcieri

For window size `w`, wNAF digits have max magnitude `2^(w-1) - 1`. The table is indexed by `|digit| / 2`, so the maximum index is `(2^(w-1) - 1) / 2 = 2^(w-2) - 1` (integer division), requiring `2^(w-2)` entries. The previous `2^(w-1)` allocation computed twice as many odd multiples as needed, wasting point additions during table setup.

I ran the `k256` Schnorr signing and verifying benches and saw around 20% improvements for hot paths. However, @tarcieri, please note that my benchmarks were run against [this branch](RustCrypto/elliptic-curves#1745 (comment)), which does GLV decomposition + shared doublings, where the larger table + extra allocations actually hurt. While the fix is provably correct, the current vendored wNAF path only allocates a single table, so the effects are not really measurable.

| Stage | wNAF tables built per verify | `2^(w-2)` (fix) | `2^(w-1)` (no fix) | Fix's contribution |
| --- | --- | --- | --- | --- |
| Pre-wNAF base | 0 (constant-time `lincomb`) | n/a | n/a | 0% (can't apply) |
| GLV wNAF, separate ladders | 2 (only P; G uses basepoint table) | 53.2 µs | 57.6 µs | ~7% |
| GLV + shared doublings | 4 (G via wNAF too, fused) | 50.6 µs | 59.8 µs | ~19% |
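The table-size argument above can be checked with a small sketch. Integers stand in for curve points, and `table_size`, `odd_multiples` and `lookup` are hypothetical names chosen for illustration, not the crate's API.

```rust
/// Number of odd multiples needed for window width `w`: 2^(w-2).
fn table_size(w: u32) -> usize {
    assert!(w >= 2);
    1 << (w - 2)
}

/// Odd multiples [P, 3P, 5P, ...], with integers standing in for points.
fn odd_multiples(p: i64, w: u32) -> Vec<i64> {
    let two_p = 2 * p; // one "doubling" during setup
    let mut table = Vec::with_capacity(table_size(w));
    table.push(p);
    for i in 1..table_size(w) {
        // One "addition" per further entry.
        table.push(table[i - 1] + two_p);
    }
    table
}

/// Look up `digit * P` for a nonzero odd wNAF digit.
fn lookup(table: &[i64], digit: i8) -> i64 {
    debug_assert!(digit % 2 != 0, "wNAF digits are odd when nonzero");
    // |digit| / 2 maps 1, 3, 5, ... to 0, 1, 2, ...
    let idx = usize::from(digit.unsigned_abs() / 2);
    if digit < 0 {
        -table[idx]
    } else {
        table[idx]
    }
}
```

With `w = 5` the largest digit magnitude is 15, which maps to index 7, so 8 entries cover every digit; a 16-entry table would spend 8 extra additions on multiples that are never looked up.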

42Pupusas · 2026-06-06T15:56:40Z

@tarcieri with #2437 being merged, I am now only missing WnafScalar::from_le_bytes(&[u8]) (128-bit half-scalar) and WnafBase::multiscalar_mul_array(&[..], &[..]) to fully refactor the PR with the new traits.

Please confirm if I should proceed with PRs for those changes to the traits crate.

tarcieri · 2026-06-06T16:00:36Z

Yes, you can go ahead and open PRs for those

42Pupusas · 2026-06-07T00:04:04Z

@tarcieri refactored with the new traits and cleaned up a bit to match style of rest of the crate.

I've pointed elliptic-curves to master for the new APIs, but I guess at some point you can bump a version of that crate?

Also I seem to have broken something in CI, maybe the crate patch?

Updated Benchmark

| Benchmark | master (`4b919e9a`) | branch (`k256/schnorr-verify-perf`) | Δ |
| --- | --- | --- | --- |
| `schnorr/verify` | ~64.1 µs | ~52.4 µs | −19.0% (≈1.22× faster) |

tarcieri · 2026-06-09T22:02:24Z

@42Pupusas as of #1779 there is now a wnaf crate in this repo you can modify in the same commit if you rebase, which should make it easier than orchestrating work across multiple repos

Replaces the placeholder MulVartime / MulByGeneratorVartime impls (which just called the constant-time path and had TODOs to match) with a width-5 wNAF that uses the GLV endomorphism to split each scalar into two ~128-bit halves.

Schnorr verify: ~62 µs -> ~53 µs (14% faster, no precomputed-tables; ~55 µs with tables).

Addresses RustCrypto#1725.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Folds the combined `mul_by_generator_and_mul_add_vartime` into a single wNAF ladder over all 4 GLV sub-scalars (s1, s2 for G and the endomorphism; e1, e2 for P and the endomorphism). One `double()` per step instead of two independent ladders.

Factors out a small `WnafSlot` (odd-multiples table + digits) and a `wnaf_ladder` helper so the single-point `mul_vartime` and the combined op share the same loop body.

Schnorr verify: ~53 µs -> ~50 µs (no precomputed-tables; ~51 µs with tables). Total vs. pre-wNAF baseline: ~62 µs -> ~50 µs (~19% faster).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
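The shared-doublings idea can be sketched as follows. This is only an illustration under simplifying assumptions: integers stand in for curve points, the digit times base product stands in for a table lookup plus point addition, and `Slot`, `ladder` and `wnaf` are hypothetical names rather than the PR's `WnafSlot` / `wnaf_ladder`.

```rust
/// Width-`w` NAF of `k`, least-significant digit first (hypothetical helper).
fn wnaf(k: u64, w: u32) -> Vec<i8> {
    assert!((2..=8).contains(&w));
    let modulus = 1i32 << w;
    let mut k = k as u128;
    let mut digits = Vec::new();
    while k != 0 {
        let mut d = 0i32;
        if k & 1 == 1 {
            d = (k & (modulus as u128 - 1)) as i32;
            if d > modulus / 2 {
                d -= modulus;
            }
            // Remove the digit from the residual.
            k = if d >= 0 { k - d as u128 } else { k + (-d) as u128 };
        }
        digits.push(d as i8);
        k >>= 1;
    }
    digits
}

/// One term of the linear combination: a base and its signed digits.
struct Slot {
    base: i128,
    digits: Vec<i8>,
}

/// Interleaved left-to-right ladder over all slots.
/// Returns (sum of scalar_i * base_i, number of doublings performed).
fn ladder(slots: &[Slot]) -> (i128, usize) {
    let len = slots.iter().map(|s| s.digits.len()).max().unwrap_or(0);
    let mut acc = 0i128;
    let mut doublings = 0;
    for i in (0..len).rev() {
        // A single doubling per step, shared by every slot.
        acc *= 2;
        doublings += 1;
        for s in slots {
            if let Some(&d) = s.digits.get(i) {
                // Stand-in for "look up |d| * base in the table, add or subtract".
                acc += i128::from(d) * s.base;
            }
        }
    }
    (acc, doublings)
}
```

Running `ladder` on several slots at once costs as many doublings as the longest digit string, whereas running it once per slot and summing the results costs the sum of the lengths; the additions are the same either way.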

`wnaf_128` writes into a fixed 130-entry buffer; the bound holds for the current `WNAF_WIDTH = 5` and the ≤128-bit GLV sub-scalars, but it's implicit. Add a `debug_assert!` in the loop so that any future change to `WNAF_WIDTH` that invalidates the bound is caught at test time rather than silently writing out of bounds in worst-case inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`wnaf_128` tracked the residual scalar in two u64 limbs, but a negative recentered digit adds up to 2^(W-1) − 1 to the value, which can legitimately overflow past bit 127 when the input is close to 2^128 − 1. The old code let `hi.wrapping_add(1)` silently wrap, losing the carried bit and producing a NAF that reconstructs to the wrong value.

The GLV decomposition's `(r1, r2)` each have magnitude strictly less than 2^128, so values in the carry-out window are possible (though vanishingly rare in random scalars, which is why the existing randomized tests never caught it).

Fix by carrying the overflow bit into a third limb `top` that is absorbed back on the next right-shift. Perf impact is in the noise: the `top` branch is almost never taken and the predictor handles it cleanly.

Add two regression tests:

- `test_wnaf_128_reconstruction_adversarial`: reconstructs the NAF of a scalar with low 128 bits = 0xFF..FF and asserts it equals 2^128 − 1.
- `test_mul_vartime_adversarial_scalars`: end-to-end check that `mul_vartime(P, k)` matches the constant-time reference when `k`'s low 128 bits trigger the carry window.

Also add a `debug_assert!` on `idx` in `WnafSlot::apply` to guard the parallel invariant (`idx < WNAF_TABLE_SIZE`) if `WNAF_WIDTH` is ever widened without growing the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
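A minimal sketch of the carry handling described in this commit, under simplifying assumptions: the residual is held in one `u128` plus a `top` flag instead of two `u64` limbs plus a third limb, and `wnaf_u128` / `eval_mod` are hypothetical names. Because 2^128 − 1 does not fit in an `i128`, the check evaluates the digits modulo a small prime, with integers mod `P` standing in for the curve group.

```rust
/// Width-`w` NAF of a full 128-bit value, least-significant digit first.
/// Hypothetical helper; the PR's `wnaf_128` uses u64 limbs instead.
fn wnaf_u128(k: u128, w: u32) -> Vec<i8> {
    assert!((2..=8).contains(&w), "digits must fit in an i8");
    let modulus = 1u128 << w; // 2^w
    let half = modulus >> 1; // 2^(w-1)
    let mut lo = k; // low 128 bits of the residual
    let mut digits = Vec::new();
    while lo != 0 {
        let mut top = false; // bit 128 of the residual
        let mut digit = 0i32;
        if lo & 1 == 1 {
            let low_bits = lo & (modulus - 1);
            if low_bits > half {
                // Negative digit: the residual grows and may carry out of bit 127.
                let (sum, carry) = lo.overflowing_add(modulus - low_bits);
                lo = sum;
                top = carry;
                digit = low_bits as i32 - modulus as i32;
            } else {
                lo -= low_bits;
                digit = low_bits as i32;
            }
        }
        digits.push(digit as i8);
        // 129-bit right shift: the carried bit re-enters at bit 127.
        lo = (lo >> 1) | (u128::from(top) << 127);
    }
    digits
}

/// A Mersenne prime used only to check the digits without 129-bit integers.
const P: i128 = (1 << 61) - 1;

/// Double-and-add over the digits, evaluated modulo `P`.
fn eval_mod(digits: &[i8]) -> i128 {
    digits
        .iter()
        .rev()
        .fold(0i128, |acc, &d| (2 * acc + i128::from(d)).rem_euclid(P))
}
```

For `k = 2^128 − 1` and `w = 5` this produces 129 digits: `-1` at position 0 and `1` at position 128. If the carry were dropped, the loop would stop after the first digit and the result would reconstruct to −1, which agrees with `k` modulo 2^128 but not as a scalar.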

Replace custom wNAF implementation (wnaf_128, build_odd_multiples, WnafSlot, wnaf_ladder) with the group crate's WnafBase/WnafScalar types and WnafBase::multiscalar_mul_array.

A new WnafScalar::from_le_bytes constructor accepts short (128-bit) GLV half-scalars, producing ~half the wNAF digits and ~half the doublings in the evaluation loop. multiscalar_mul_array avoids the two collect() heap allocations of the iterator-based multiscalar_mul.

Depends on RustCrypto/group#15 for the group crate changes (wnaf_table size fix, from_le_bytes, multiscalar_mul_array, pre-sized Vec allocations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The WnafScalar::from_le_bytes (#2438) and WnafBase::multiscalar_mul_array (#2439) work this branch relied on has merged upstream, replacing the fork-only APIs used so far:

- replace the fork-only PrimeField::to_le_repr with a local to_le_bytes helper (to_repr is big-endian; reverse for from_le_bytes)
- handle the new fallible from_le_bytes signature; GLV half-scalars are < 2^128 so the canonical-range check cannot fail (.expect)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Pure cleanup of the GLV+wNAF vartime path, no behavior change:

- Fold the loose `glv_wnaf_pair` / `mul_vartime_impl` / `mul_and_mul_add_vartime_impl` free functions into an `#[cfg(alloc)] impl ProjectivePoint` as methods, alongside the existing `mul_by_generator`.
- Replace the in-body `#[cfg]` ladders in the `MulVartime` / `MulByGeneratorVartime` impls with per-fn `#[cfg]` definitions, matching the `mul_by_generator` twin-definition idiom. Without `alloc`, `mul_vartime` is plain `self * rhs` and the trait's default `mul_by_generator_and_mul_add_vartime` applies (no override needed).
- Drop the `to_le_bytes` helper and its redundant big-endian->little-endian reverse: read the scalar's little-endian bytes directly via `U256::to_le_byte_array`. The wNAF path needs LE, and `to_repr` is BE, so the previous reverse was immediately undone inside `from_le_bytes`.
- Replace the `.expect` on `from_le_bytes` with an `unwrap_or_else` fallback to the infallible full-width `WnafScalar::new`, removing the panic path from a crypto routine.

Verified: builds across the feature matrix (arithmetic / schnorr, each with and without alloc; alloc without precomputed-tables); all k256 tests pass; schnorr verify still ~20% faster than master.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The GLV rework only defined mul_by_generator_and_mul_add_vartime under `#[cfg(feature = "alloc")]`, so without `alloc` the impl fell through to the trait default, which computes `aG + bP` as two independent variable-time scalar multiplications (`mul_by_generator_vartime(a) + p.mul_vartime(b)`). That doubles the point doublings versus the pre-GLV behavior, where this was a single linear combination sharing doublings across both terms.

Add a `#[cfg(not(feature = "alloc"))]` arm that restores the original `Self::lincomb(&[(G, a), (b_point, b)])`. The array-based LinearCombination impl uses stack tables and is not gated on `alloc`, so it works in no_std verifiers (the main consumers of the no-alloc path).

Verified the fallback matches `aG + bP` and the identity/zero edge cases in both no-alloc configs (with and without precomputed-tables + critical-section).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

As of RustCrypto#1779 the forked WnafBase/WnafScalar implementation lives in the wnaf crate in this repository, re-exported by elliptic-curve behind its `wnaf` feature; the root re-exports are gone from traits master.

Enable elliptic-curve/wnaf from the k256 alloc feature and import via elliptic_curve::wnaf, matching primeorder. from_le_bytes returns Option here rather than Result; adjust the fallback closures to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

tarcieri

Seems like a start. There are a few other cases to wire up like linear combinations / multiscalar multiplications, but this is enough to do an initial release.

42Pupusas · 2026-06-10T14:37:40Z

@tarcieri if you open issues to target the specific usecases you feel are missing I will be happy to keep contributing, but I would rather the direction came from your side to not push blindly in any direction.

We depend heavily on k256 at our company, so any help needed related to it will get attention from me.

Follows suit with #1798 and removes the integration with the `wnaf` crate originally added in #1745. This removes `wnaf` as a stabilization blocker for the time being. We can always add it back.

The result is ~20% slower, but doesn't require `alloc`, and avoids the need to stabilize `wnaf` to stabilize `k256`.

high-level operations/point-scalar mul (variable-time)
    time: [33.719 µs 33.895 µs 34.123 µs]
    change: [+15.985% +18.087% +20.911%] (p = 0.00 < 0.05)
    Performance has regressed.

ecdsa/verify_prehashed
    time: [55.777 µs 56.194 µs 56.685 µs]
    change: [+9.2821% +13.258% +17.003%] (p = 0.00 < 0.05)
    Performance has regressed.

The performance regression is fairly significant though, so we should investigate bringing it back when stabilization blockers are resolved.

42Pupusas mentioned this pull request May 14, 2026

Wnaf Optimizations RustCrypto/group#15

Closed

42Pupusas force-pushed the k256/schnorr-verify-perf branch 3 times, most recently from d22cd8a to 577a58b Compare May 14, 2026 17:09

tarcieri mentioned this pull request May 22, 2026

wNAF: support for custom scalar multiplication algorithms zkcrypto/group#79

Open

42Pupusas mentioned this pull request Jun 6, 2026

Fix wnaf_table overallocation RustCrypto/traits#2437

Merged

42Pupusas and others added 9 commits June 9, 2026 16:34

42Pupusas force-pushed the k256/schnorr-verify-perf branch from 4094895 to db2db3a Compare June 9, 2026 23:45

tarcieri approved these changes Jun 10, 2026

tarcieri merged commit 43347f7 into RustCrypto:master Jun 10, 2026
18 checks passed

tarcieri mentioned this pull request Jun 17, 2026

k256: replace wnaf with built-in vartime scalar mul/lincomb #1810

Merged
