k256: endomorphism-aware wNAF for vartime scalar multiplication#1745
Conversation
|
It would be good if this could be implemented in terms of types from the That will probably require API changes to |
|
that makes sense, I should have thought of that but was kind of tunnel visioned on this crate. will explore this direction @tarcieri , if an API change is needed to the |
|
Yes, and you should be able to use |
d22cd8a to
577a58b
Compare
|
@tarcieri the new approach maintained most of the performance gains, but still falls behind a bit behind the hand-rolled version due to some extra allocations that happen in the From my initial exploration, the If that approach seems valuable to you, I can follow up with a PR for that |
|
@42Pupusas I would suggest keeping the changes as minimal as possible, as they need to get upstreamed to https://github.com/zkcrypto/group It does look like there are a few things in the PR you opened which could be split out into their own PRs and upstreamed directly though |
|
Appreciate the guidance @tarcieri, thanks. I have now split the
The upstream doesn't seem very active in the last year, but should I push the PRs there instead ? |
For window size `w`, wNAF digits have max magnitude `2^(w-1) - 1`. The table is indexed by `|digit| / 2`, so the maximum index is `(2^(w-1) - 1) / 2 = 2^(w-2) - 1`, requiring `2^(w-2)` entries. The previous `2^(w-1)` allocation computed twice as many odd multiples as needed, wasting point additions during table setup. Ran the `k256` Schnorr signing and verifying benches and gave around 20% improvements for hot paths, however @tarcieri please note my benchmarks were ran against [this branch](RustCrypto/elliptic-curves#1745 (comment)) that does GLV decomposition + shared doublings, where the larger table + extra allocations actually hurt. While the fix is provably correct, the current vendored wNAF path only allocates a single table, so the effects a not really measurable. | Stage | wNAF tables built per verify | `2^(w-2)` (fix) | `2^(w-1)` (no fix) | Fix's contribution | | --- | --- | --- | --- | --- | | Pre-wNAF base | 0 (constant-time `lincomb`) | — | — | 0% (can't apply) | | GLV wNAF, separate ladders | 2 (only P; G uses basepoint table) | 53.2 µs | 57.6 µs | ~7% | | GLV + shared doublings | 4 (G via wNAF too, fused) | 50.6 µs | 59.8 µs | ~19% |
|
@tarcieri with #2437 being merged, I am now only missing Please confirm if I should proceed with PRs for those changes to the traits crate. |
|
Yes, you can go ahead and open PRs for those |
|
@tarcieri refactored with the new traits and cleaned up a bit to match style of rest of the crate. I've pointed Also I seem to have broken something in CI, maybe the crate patch? Updated Benchmark
|
|
@42Pupusas as of #1779 there is now a |
Replaces the placeholder MulVartime / MulByGeneratorVartime impls (which just called the constant-time path and had TODOs to match) with a width-5 wNAF that uses the GLV endomorphism to split each scalar into two ~128-bit halves. Schnorr verify: ~62 µs -> ~53 µs (14% faster, no precomputed-tables; ~55 µs with tables). Addresses RustCrypto#1725. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the combined `mul_by_generator_and_mul_add_vartime` into a single wNAF ladder over all 4 GLV sub-scalars (s1, s2 for G and the endomorphism; e1, e2 for P and the endomorphism). One `double()` per step instead of two independent ladders. Factors out a small `WnafSlot` (odd-multiples table + digits) and a `wnaf_ladder` helper so the single-point `mul_vartime` and the combined op share the same loop body. Schnorr verify: ~53 µs -> ~50 µs (no precomputed-tables; ~51 µs with tables). Total vs. pre-wNAF baseline: ~62 µs -> ~50 µs (~19% faster). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`wnaf_128` writes into a fixed 130-entry buffer; the bound holds for the current `WNAF_WIDTH = 5` and the ≤128-bit GLV sub-scalars, but it's implicit. Add a `debug_assert!` in the loop so that any future change to `WNAF_WIDTH` that invalidates the bound is caught at test time rather than silently writing out of bounds in worst-case inputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`wnaf_128` tracked the residual scalar in two u64 limbs, but a negative recentered digit adds up to 2^(W-1) − 1 to the value, which can legitimately overflow past bit 127 when the input is close to 2^128 − 1. The old code let `hi.wrapping_add(1)` silently wrap, losing the carried bit and producing a NAF that reconstructs to the wrong value. The GLV decomposition's `(r1, r2)` each have magnitude strictly less than 2^128, so values in the carry-out window are possible (though vanishingly rare in random scalars — which is why the existing randomized tests never caught it). Fix by carrying the overflow bit into a third limb `top` that is absorbed back on the next right-shift. Perf impact is in the noise: the `top` branch is almost never taken and the predictor handles it cleanly. Add two regression tests: - `test_wnaf_128_reconstruction_adversarial` — reconstructs the NAF of a scalar with low 128 bits = 0xFF..FF and asserts it equals 2^128 − 1. - `test_mul_vartime_adversarial_scalars` — end-to-end check that `mul_vartime(P, k)` matches the constant-time reference when `k`'s low 128 bits trigger the carry window. Also add a `debug_assert!` on `idx` in `WnafSlot::apply` to guard the parallel invariant (`idx < WNAF_TABLE_SIZE`) if `WNAF_WIDTH` is ever widened without growing the table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace custom wNAF implementation (wnaf_128, build_odd_multiples, WnafSlot, wnaf_ladder) with the group crate's WnafBase/WnafScalar types and WnafBase::multiscalar_mul_array. A new WnafScalar::from_le_bytes constructor accepts short (128-bit) GLV half-scalars, producing ~half the wNAF digits and ~half the doublings in the evaluation loop. multiscalar_mul_array avoids the two collect() heap allocations of the iterator-based multiscalar_mul. Depends on RustCrypto/group#15 for the group crate changes (wnaf_table size fix, from_le_bytes, multiscalar_mul_array, pre-sized Vec allocations). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The WnafScalar::from_le_bytes (#2438) and WnafBase::multiscalar_mul_array (#2439) work this branch relied on has merged upstream, replacing the fork-only APIs used so far: - replace the fork-only PrimeField::to_le_repr with a local to_le_bytes helper (to_repr is big-endian; reverse for from_le_bytes) - handle the new fallible from_le_bytes signature; GLV half-scalars are < 2^128 so the canonical-range check cannot fail (.expect) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pure cleanup of the GLV+wNAF vartime path, no behavior change: - Fold the loose `glv_wnaf_pair` / `mul_vartime_impl` / `mul_and_mul_add_vartime_impl` free functions into an `#[cfg(alloc)] impl ProjectivePoint` as methods, alongside the existing `mul_by_generator`. - Replace the in-body `#[cfg]` ladders in the `MulVartime` / `MulByGeneratorVartime` impls with per-fn `#[cfg]` definitions, matching the `mul_by_generator` twin-definition idiom. Without `alloc`, `mul_vartime` is plain `self * rhs` and the trait's default `mul_by_generator_and_mul_add_vartime` applies (no override needed). - Drop the `to_le_bytes` helper and its redundant big-endian->little- endian reverse: read the scalar's little-endian bytes directly via `U256::to_le_byte_array`. The wNAF path needs LE, and `to_repr` is BE, so the previous reverse was immediately undone inside `from_le_bytes`. - Replace the `.expect` on `from_le_bytes` with an `unwrap_or_else` fallback to the infallible full-width `WnafScalar::new`, removing the panic path from a crypto routine. Verified: builds across the feature matrix (arithmetic / schnorr, each with and without alloc; alloc without precomputed-tables); all k256 tests pass; schnorr verify still ~20% faster than master. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The GLV rework only defined mul_by_generator_and_mul_add_vartime under `#[cfg(feature = "alloc")]`, so without `alloc` the impl fell through to the trait default, which computes `aG + bP` as two independent variable-time scalar multiplications (`mul_by_generator_vartime(a) + p.mul_vartime(b)`). That doubles the point doublings versus the pre-GLV behavior, where this was a single linear combination sharing doublings across both terms. Add a `#[cfg(not(feature = "alloc"))]` arm that restores the original `Self::lincomb(&[(G, a), (b_point, b)])`. The array-based LinearCombination impl uses stack tables and is not gated on `alloc`, so it works in no_std verifiers (the main consumers of the no-alloc path). Verified the fallback matches `aG + bP` and the identity/zero edge cases in both no-alloc configs (with and without precomputed-tables + critical-section). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
As of RustCrypto#1779 the forked WnafBase/WnafScalar implementation lives in the wnaf crate in this repository, re-exported by elliptic-curve behind its `wnaf` feature; the root re-exports are gone from traits master. Enable elliptic-curve/wnaf from the k256 alloc feature and import via elliptic_curve::wnaf, matching primeorder. from_le_bytes returns Option here rather than Result; adjust the fallback closures to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
4094895 to
db2db3a
Compare
tarcieri
left a comment
There was a problem hiding this comment.
Seems like a start. There are a few other cases to wire up like linear combinations / multiscalar multiplications, but this is enough to do an initial release.
|
@tarcieri if you open issues to target the specific usecases you feel are missing I will be happy to keep contributing, but I would rather the direction came from your side to not push blindly in any direction. We depend heavily on |
Follows suit with #1798 and removes the integration with the `wnaf` crate originally added in #1745. This removes `wnaf` as a stabilization block for the time being. We can always add it back. The result is ~20% slower, but doesn't require `alloc`, and avoids the need to stabilize `wnaf` to stabilize `k256`. high-level operations/point-scalar mul (variable-time) time: [33.719 µs 33.895 µs 34.123 µs] change: [+15.985% +18.087% +20.911%] (p = 0.00 < 0.05) Performance has regressed. ecdsa/verify_prehashed time: [55.777 µs 56.194 µs 56.685 µs] change: [+9.2821% +13.258% +17.003%] (p = 0.00 < 0.05) Performance has regressed. The performance regression is fairly significant though, so we should investigate bringing it back when stabilization blockers are resolved.
Follows suit with #1798 and removes the integration with the `wnaf` crate originally added in #1745. This removes `wnaf` as a stabilization block for the time being. We can always add it back. The result is ~20% slower, but doesn't require `alloc`, and avoids the need to stabilize `wnaf` to stabilize `k256`. high-level operations/point-scalar mul (variable-time) time: [33.719 µs 33.895 µs 34.123 µs] change: [+15.985% +18.087% +20.911%] (p = 0.00 < 0.05) Performance has regressed. ecdsa/verify_prehashed time: [55.777 µs 56.194 µs 56.685 µs] change: [+9.2821% +13.258% +17.003%] (p = 0.00 < 0.05) Performance has regressed. The performance regression is fairly significant though, so we should investigate bringing it back when stabilization blockers are resolved.
Summary
Replaces the placeholder
MulVartime/MulByGeneratorVartimeimplsfor
k256::ProjectivePoint(which fell back to the constant-time pathand had TODOs to match) with a real endomorphism-aware width-5 wNAF,
then folds the combined
mul_by_generator_and_mul_add_vartimeinto asingle shared-doublings ladder over all 4 GLV sub-scalars.
Closes #1725.
What changes
halves, compute a width-5 signed-digit NAF of each magnitude, and run
a standard left-to-right double-and-add with precomputed odd
multiples
[P, 3P, ..., 15P]. Sign is folded into the precomputedpoints at setup.
WnafSlot(odd-multiples table + digits) and a
wnaf_ladderhelper. Thecombined
a*G + b*Pvariant runs one ladder over all 4 GLV slots(G + βG + P + βP), doing one
double()per step instead of twoindependent ladders.
buffer; the bound is currently implicit in
WNAF_WIDTH = 5, thismakes it explicit at test time.
Perf (Schnorr verify, default features, x86_64)
~19% faster end-to-end. Also speeds up any other user of
MulVartime/MulByGeneratorVartimeon the k256 curve.Test plan
cargo test -p k256 --lib --features getrandom— 89 passedmul_vartimeandmul_and_mul_add_vartimevs. the constant-time reference(32 iterations each with
ProjectivePoint::generate()andScalar::generate()).cargo bench -p k256 --bench schnorr -- verifyon an idle hostconfirms the numbers above (criterion-reported change is stable
across runs).
Notes
SECURITY:comments are on the two vartime impls.Only reachable via the
MulVartime/MulByGeneratorVartimetraitsthat callers opt into for non-secret scalars.
odd-multiples via Montgomery's trick; static precomputed G tables
with mixed-add; wider window for the G side). The first two
regressed perf at this width/scalar size; the last gave ~4% more but
added a ~6 KB static, const-generics, and a new
LazyLockpath —not worth the complexity for a single-curve specialization. This PR
sticks to the change that's pure upside.