feat(forknet): combined early chunk producer kickout #15870
Draft
stedfn wants to merge 49 commits into
Activate V2 partial witness emission and processing, gated behind ProtocolFeature::EarlyKickout. Constructor: `VersionedPartialEncodedStateWitness::new()` now emits V2 when the epoch protocol version >= EarlyKickout, V1 otherwise. Consumer: replaces the unconditional V2 drop with EarlyKickout gating. V2 witnesses use `prev_block_hash`-based chunk producer lookup against `DBCol::ChunkProducers`. When the prev block isn't yet available, witnesses are deferred into `PendingV2WitnessCache` (64-entry LRU) and replayed on `BlockNotificationMessage` via scan-on-notification with three-outcome classification (Ready/Requeue/Retire). Includes: V2 validation with cross-check (epoch_id, height_created vs prev_block_hash-implied values), DB lookup metrics, pending cache size/eviction metrics, emitted/received version-labeled counters, `validate_chunk_relevant_as_validator` made public for replay pre-check, `is_early_kickout_enabled` helper, `BlockNotificationMessage` plumbing from client to PartialWitnessActor, and comprehensive cache tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
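The defer-and-replay flow described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the branch: `PendingWitness`, `PendingCache`, `defer`, and `on_block` are hypothetical names, hashes are plain `u64`, and eviction is simple insertion order rather than the real LRU policy.

```rust
use std::collections::VecDeque;

/// Outcome of re-checking a deferred witness when a block notification arrives.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplayDisposition {
    Ready,   // prev block is now available: dispatch the witness
    Requeue, // still waiting: keep it in the cache
    Retire,  // no longer relevant: drop it
}

/// Hypothetical stand-in for a deferred V2 partial witness.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PendingWitness {
    prev_block_hash: u64,
    height_created: u64,
}

/// Bounded cache of deferred witnesses (insertion-order eviction, an assumption).
struct PendingCache {
    capacity: usize,
    entries: VecDeque<PendingWitness>,
}

impl PendingCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { capacity, entries: VecDeque::new() }
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    /// Defers a witness; returns the evicted entry when the cache was full,
    /// so the caller can count capacity evictions.
    fn defer(&mut self, witness: PendingWitness) -> Option<PendingWitness> {
        let evicted = if self.entries.len() == self.capacity {
            self.entries.pop_front()
        } else {
            None
        };
        self.entries.push_back(witness);
        evicted
    }

    /// Scan-on-notification: classify every entry, return the ready ones,
    /// keep the requeued ones, and drop the retired ones.
    fn on_block(
        &mut self,
        classify: impl Fn(&PendingWitness) -> ReplayDisposition,
    ) -> Vec<PendingWitness> {
        let mut ready = Vec::new();
        let mut kept = VecDeque::new();
        for witness in std::mem::take(&mut self.entries) {
            match classify(&witness) {
                ReplayDisposition::Ready => ready.push(witness),
                ReplayDisposition::Requeue => kept.push_back(witness),
                ReplayDisposition::Retire => {}
            }
        }
        self.entries = kept;
        ready
    }
}
```

Returning the evicted entry from `defer` is one way to make evictions observable to a metric; the branch may surface them differently.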
After EarlyKickout activation, only V2 is accepted. V1 arriving over the wire is now dropped in both handle_partial_encoded_state_witness and handle_partial_encoded_state_witness_forward, symmetric with the existing V2 drop when EarlyKickout is not yet enabled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR A squash-merged with reviewer changes (single make_witness(version) helper, full V2-emitting constructor). Update rebased PR B accordingly: collapse stale 'Current scope/Eventual' docstring into the active rollout policy, and switch test_v1_signature_differentiator test to the unified helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #15640 review (#15640):
P1: three-valued kickout gate. is_early_kickout_enabled returned bool with Err mapped to false, which dropped legitimate post-kickout V2 during header-sync lag at epoch boundary. Replace with early_kickout_status returning Option<bool>; extract the gate decision into a pure kickout_gate(status, witness) function so unknown-epoch falls through to the producer-lookup defer path. With no part retransmission and one unique RS part per validator, this removes a correlated-loss failure mode at epoch transitions.
P3: replay_forwarded_partial_witness re-defers on DBNotFoundErr. pre_check_replay classifies on the actor thread, then the spawner re-runs the lookup; a race could surface MissingBlock/ChunkProducerNotInDB after Ready, permanently losing the forwarded part (the producer doesn't retransmit forwards). Mirror the init-emit arm's try_defer.
P3: stop double-counting received parts on replay. Split handle_partial_encoded_state_witness into a thin public entry that increments PARTIAL_WITNESS_PART_MESSAGES_RECEIVED_TOTAL once, and a shared dispatch_partial_encoded_state_witness body that handle_block_notification calls directly so replay does not re-increment.
P3: V1 + DBNotFoundErr is not 'unexpected': V1's epoch-based sampler can legitimately return it across epoch transitions. Downgrade warning to debug with corrected wording.
Tests: kickout_gate symmetry for all six (status, variant) cases, including the unknown-epoch fall-through.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds unit test asserting validate_partial_encoded_state_witness rejects a V2 witness whose chunk_header.height_created does not match the prev_block.height + 1 implied by its prev_block_hash. Guards the authentication boundary added in validate.rs: without the cross-check a producer authorized for (prev_block, shard) could sign a witness under any (epoch_id, height_created) and have it stored/forwarded under the forged key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Route all V2 defer call sites through PendingV2WitnessCache::try_defer and drop defer_pending_v2_witness. Success log moves into try_defer so spawner-side defers also log. Replace KickoutGate enum + kickout_gate fn + early_kickout_status method with witness_kicked_out free fn taking Option<ProtocolVersion>. Preserves three-state contract — None (epoch unknown) proceeds for both variants so header-sync lag does not poison legitimate V2 traffic. Trim PENDING_V2_WITNESS_CACHE_SIZE and PendingV2WitnessCache docstrings. Fix doc contradiction (producer does not retransmit witness parts). TODO(stedfn) markers for DoS caps, capacity vs block_fetch_horizon, capacity tuning from PARTIAL_WITNESS_PENDING_CACHE_EVICTIONS_TOTAL. Net: −76 lines. Zero behavior change. Tests rewritten against witness_kicked_out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
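A minimal sketch of the three-state gate described above, assuming the semantics stated in these commits: V1 is dropped once EarlyKickout is active, V2 is dropped before it is active, and an unknown epoch drops neither. The function name `should_drop_witness`, the `WitnessVariant` enum, and the threshold constant are illustrative assumptions, not the signatures in the branch.

```rust
type ProtocolVersion = u32;

/// Assumed activation version, for illustration only.
const EARLY_KICKOUT_VERSION: ProtocolVersion = 84;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WitnessVariant {
    V1,
    V2,
}

/// Returns true when the witness should be dropped by the rollout gate.
/// `None` means the epoch's protocol version is not known yet (for example
/// during header-sync lag), in which case both variants proceed.
fn should_drop_witness(
    epoch_version: Option<ProtocolVersion>,
    variant: WitnessVariant,
) -> bool {
    match epoch_version {
        None => false,
        Some(version) => {
            let enabled = version >= EARLY_KICKOUT_VERSION;
            match variant {
                WitnessVariant::V1 => enabled,  // V1 is rejected after activation
                WitnessVariant::V2 => !enabled, // V2 is rejected before activation
            }
        }
    }
}
```

Keeping the decision in a pure function makes all six (status, variant) combinations easy to test without an actor.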
Compress V2 partial-witness defer + kickout comments touched by this branch: drop articles/filler/hedging, keep all load-bearing nuance (DoS capacity headroom, "for this chunk" filter, "can exhaust" risk framing). Codex plan-review surfaced the three precision risks; all restored before edit. Net: -30 lines across actor + tests. Comments only — zero behavior change. Build + 8/8 module tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply caveman-full compression to comments added or modified by this PR in the remaining touched files (actor.rs and actor_tests.rs were compressed in the prior commit). Files: client.rs (block-notification comment), metrics.rs (eviction shard-label rationale), validate.rs (4 V2 producer-resolution comments + cross-check rationale), validate_tests.rs (height-mismatch test docstrings), core/primitives partial_witness.rs (rollout-policy line tweak). Prometheus metric Help strings left untouched — externally consumed labels behave like log messages. Net: -7 lines. Comments only — zero behavior change. Build + 19/19 stateless_validation tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace "misroutes" with "routes wrong" in test docstring — cspell does not recognize it. No functional change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract pre_check_replay into a pub(super) free fn so the replay classifier can be unit-tested against a real setup() chain without an actor. Add per-arm tests for each ReplayDisposition outcome (Requeue on signer-unavailable / too-early / missing-block, Retire on too-late / not-a-chunk-validator / invalid-shard, Ready on V2 DB resolve). Add a nightly-gated test-loop smoke test for the end-to-end defer/replay path: a V2 witness whose prev_block is held back defers (pending-cache gauge rises) and dispatches once the block lands (gauge drains). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spice distributes witnesses via its own data-distribution path, not the V2 partial-witness pending cache, so the defer never fires and run_until times out. Gate the test off the spice lane like the other stateless-validation test-loop tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…' into forknet/early-kickout-witness-v2
Remediate Trisfald review comments plus a codex-found liveness gap on the V2 partial witness defer/replay path.
- defer V2 witnesses on transient IOErr in pre_check_replay (C5)
- gate forwarded witness at the replay act-site instead of the classifier, keeping pre_check_replay pure (C6)
- collapse the duplicated DB-lookup metric block (C7)
- switch PendingV2WitnessCache from put to push so capacity evictions are counted (C11), with a delta assertion (C8)
- split the EpochOutOfBounds replay disposition: requeue a V2 witness whose prev block is not yet synced, retire once the prev block is known and the signed epoch still mismatches (C12)
- read the receiver actor's own cache via a test_features pending_cache_bucket_count accessor instead of the process-global gauge (C10)
- also defer EpochOutOfBounds forwarded V2 parts on initial receive: validate resolves validator assignments from the signed epoch before the producer-DB defer path, so an unsynced epoch was dropped instead of deferred (codex)
- doc, import, and dead-code cleanups (C1, C2, C3, C4, C9)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address Jan's J3 review note that the V2 witness comments were too dense. Keep the non-obvious WHY (security cross-check rationale, kickout-gate, defer semantics) and cut enumerations that mirror match arms, mechanism restatements, and obvious inline notes. Comments only, no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-v2-consumer # Conflicts: # chain/client/src/stateless_validation/partial_witness/partial_witness_actor.rs # chain/client/src/stateless_validation/validate.rs
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…' into forknet/early-kickout-witness-v2 # Conflicts: # CHANGELOG.md
…ut logs
Move EarlyKickout from nightly (PV 152) into the stable PV 85 feature group
and remove the 5 nightly cfg gates on DBCol::ChunkProducers so the feature
compiles into a plain (non-nightly) release binary. At PV 85 the shuffle and
contract-loading config gates never fire, so EarlyKickout is the only deviation
from stable mainnet semantics — clean forknet demo isolation.
Add a single greppable `early_kickout` tracing target:
- compute_chunk_producer_blacklist: info per blacklisted producer; warn when the
safety valve drops a shard (would exclude all producers).
- get_chunk_producer_blacklist: info the full {shard_id: [account_id]} map.
- save_chunk_producers_for_header: info on slot reassignment, guarded by
blacklist.contains(default_id) since weighted samplers renormalize on any
exclusion.
Throwaway forknet branch, not for landing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move EarlyKickout from PV 85 to PV 84 (alongside Wasmtime) and cap STABLE_PROTOCOL_VERSION at 84 so the demo binary never votes up to 85. At PV 85 the mainnet epoch config (85.json) is dynamic-resharding, which fork-network's genesis writer rejects (requires a static shard layout). At PV 84 get_config resolves to the static 81.json, so set-validators writes genesis cleanly, and DynamicResharding (PV 85) stays off — leaving EarlyKickout as the only deviation from stable mainnet semantics. Genesis runs with --genesis-protocol-version 84; EarlyKickout is live from genesis. Throwaway forknet branch, not for landing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
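The version pinning described here can be pictured with a small sketch. It assumes a feature-to-version mapping of the shape quoted in the PR description (`Wasmtime | EarlyKickout => 84`); the reduced enum and the `enabled` helper are illustrative, not the definitions in `version.rs`.

```rust
type ProtocolVersion = u32;

/// Demo cap: the binary never votes above this version.
const STABLE_PROTOCOL_VERSION: ProtocolVersion = 84;

/// Reduced, illustrative subset of protocol features.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProtocolFeature {
    Wasmtime,
    EarlyKickout,
    DynamicResharding,
}

impl ProtocolFeature {
    /// Protocol version at which each feature activates in this sketch.
    const fn protocol_version(self) -> ProtocolVersion {
        match self {
            ProtocolFeature::Wasmtime | ProtocolFeature::EarlyKickout => 84,
            ProtocolFeature::DynamicResharding => 85,
        }
    }

    /// A feature is active when the running version has reached its threshold.
    const fn enabled(self, version: ProtocolVersion) -> bool {
        version >= self.protocol_version()
    }
}
```

With the cap at 84, `EarlyKickout` is active from genesis while `DynamicResharding` stays off, which is the isolation the commit message aims for.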
DB_VERSION was gated on ContinuousEpochSync (PV 85). Capping the demo binary at PV 84 disabled that feature and dropped DB_VERSION to 48, so the binary refused to open the forked mainnet snapshot (already at v49) with "Database version 49 is higher than the expected version 48". Pin DB_VERSION to 49 unconditionally. The DB is already at 49 so no migration runs, and the 48->49 migration self-skips on forknet DBs anyway (genesis epoch / no head). ContinuousEpochSync being off at runtime only means the epoch-sync proof is not maintained going forward, which is irrelevant to the early-kickout demo. Throwaway forknet branch, not for landing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The loop persisting chain_store_cache_update.chunk_producers into DBCol::ChunkProducers was still #[cfg(feature = "nightly")]. In the non-nightly demo binary it compiled out, so the rows seeded by save_chunk_producers_for_header / save_genesis_chunk_producers were never written. Every get_chunk_producer_info_db read then missed, returning ChunkProducerNotInDB, which aborted all chunk production from genesis (produced=0 on every shard). EarlyKickout correctly flagged all producers but its safety valve suppressed every blacklist, so the chain produced no chunks and stalled at the second epoch boundary with "not enough approvals". Remove the cfg so the flush runs in the stable build. This was the last nightly gate on the EarlyKickout production path; remaining cfg(nightly) hits are test-only. Throwaway forknet branch, not for landing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-v2-consumer # Conflicts: # chain/client/src/stateless_validation/partial_witness/partial_witness_actor.rs # chain/client/src/stateless_validation/validate.rs
Pre-landing review follow-ups for the grandparent-anchor change:
- send_state_witness_ack resolved the chunk producer via the canonical height-based sampler, which diverges from the anchored resolver under skipped slots and misdirected the ack. Resolve via the grandparent anchor, falling back to the canonical sampler when the parent block is not yet processed locally (the ack is best-effort and must not abort witness processing).
- Add unit coverage: a parent-absent witness with a valid anchor is accepted, a forged anchor is rejected, and the init-emit handler drops an unresolvable anchor without spawning validation.
- Read the parent BlockInfo once in get_chunk_producer_info_from_prev_block instead of twice (height plus a redundant read inside grandparent_anchor).
- Name the +2 grandparent-anchor height offset and use it at the writer and validator; refresh the stale ChunkProducerNotInDB doc and the stale test name; derive Debug for ChunkRelevance.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment-only cleanup pass on the witness V2 grandparent-anchor work: strip step-by-step narration and prose restatement, compress multi-line test docstrings to one-line intent, keep non-obvious WHY comments. Also fix stale/inaccurate doc comments surfaced during review:
- partial_witness_actor: anchor-drop covers MissingBlock and the DB-row miss
- validate: TooEarly references head (not final head); return doc is ChunkRelevance
- shards_manager module doc: validate_chunk_header split into preliminary/full
- test-loop chunk_producers: resharding-boundary resolution is canonical, not a DB read
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-witness-v2 Swap #15640 -> #15908. Re-strip ChunkProducers nightly gates (columns.rs x5, adapter.rs, store/mod.rs flush loop) + the paired not(nightly) fallbacks. Keep PV84/stable cap 84 + all #15908 PV85 features (NIGHTLY=155). Aggregator conflict: kept #15843 forward-recompute attribution and dropped #15908's DB-anchored anchored_chunk_producers_for_aggregator helper (codex-confirmed equivalent in the live demo, self-contained in tests). Witness-v2 unaffected: it uses the separate adapter::get_chunk_producer_info_anchored. Throwaway forknet demo branch, not for landing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts: - chain/epoch-manager/src/lib.rs: union imports (keep CHUNK_GRANDPARENT_ANCHOR_HEIGHT_OFFSET + master dynamic-resharding metrics) - tools/protocol-schema-check/res/protocol_schema.toml: regenerated (witness-v2 type hashes + transitive routing enums) version.rs demo pins preserved (EarlyKickout=84, STABLE_PROTOCOL_VERSION=84). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rev-prev-anchor-v2 # Conflicts: # chain/epoch-manager/src/lib.rs # tools/protocol-schema-check/res/protocol_schema.toml
Address PR #15908 review (Trisfald).
- Aggregator now mirrors consensus on a missing grandparent-anchor row: consensus errors (ChunkProducerNotInDB), so the kickout aggregator omits the shard instead of height-sampling, which could attribute a different producer than consensus resolved. anchored_chunk_producers_for_aggregator drops the shard (and its now-unused height param); update_tail skips an absent shard under Some(map). (C3 / F2)
- chunk_producers: add test_resolution_consistent_across_skipped_slot, a skipped-slot (height_created > anchor.height + 2) case that poisons DB[anchor] and asserts both self-select and witness resolution read it, proving resolution is DB-keyed by hash, not height-resampled. (C4)
- epoch-manager tests: add test_aggregator_skips_shard_on_missing_anchor_row and a shared setup_anchored_aggregator_chain fixture that seeds each block's row in-loop at height + 2, mirroring save_chunk_producers_for_header.
- Doc fixes: spell out ChunkProductionKey (C1), drop the nightly note from a test module doc (C2), reword signature_verification.rs to drop the fn name and the unconditional "errors on a missing DB entry" claim since cross-epoch and low-height fall back to the canonical sampler with no DB read (C6).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…or-v2' into stedfn/witness-v2-prev-prev-anchor-v2
…ampler The C3 change in ae2687e made the kickout aggregator skip a shard when its grandparent-anchor ChunkProducers row is missing, instead of the prior height-sample fallback. That broke 5 nightly tests (test_chunk_producer_kickout, test_chunk_validator_kickout_using_{production,endorsement}_stats, test_reward_multiple_shards, near-chain test_get_validator_info) and, via the `continue`, also dropped the shard's endorsement stats. The missing-row branch is unreachable for any epoch whose kickout a node recomputes: save_chunk_producers_for_header seeds the grandparent's row two blocks before the dependent chunk aggregates, so the aggregator reads DB[anchor] which is exactly the producer consensus resolved. The skip only diverged in the unseeded test harness (a state production never has). Reverting restores the height-sample fallback (exact for consecutive heights) and endorsement tracking in one move; it still mirrors consensus wherever the row exists, i.e. everywhere that feeds a real kickout decision. Reverts the aggregator portion of ae2687e (epoch_info_aggregator.rs update_tail, lib.rs anchored_chunk_producers_for_aggregator + its height param, and the tests/mod.rs skip test + fixture). The C1/C2/C4/C6 doc/test items from that commit are unrelated and stay. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
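The restored behaviour (read the anchor's stored row when it exists, otherwise fall back to sampling by height) might look roughly like the sketch below. Everything here is a simplification under stated assumptions: hashes are `u64`, accounts are string slices, the function name is hypothetical, and a round-robin `height % n` stands in for the real canonical sampler.

```rust
use std::collections::HashMap;

type BlockHash = u64; // simplified stand-in for a block hash
type AccountId = &'static str; // simplified stand-in for an account id

/// Resolves the producer to attribute a chunk to.
/// Uses the row stored for the anchor block when present; otherwise falls
/// back to a height-based sample (round-robin here, purely illustrative).
/// `producers` is assumed to be non-empty.
fn producer_for_aggregator(
    rows: &HashMap<BlockHash, AccountId>,
    anchor: BlockHash,
    height: u64,
    producers: &[AccountId],
) -> AccountId {
    match rows.get(&anchor) {
        Some(producer) => *producer,
        None => producers[(height % producers.len() as u64) as usize],
    }
}
```

Where the row exists, the result is whatever was stored for the anchor, so the fallback only matters for unseeded states such as a bare test harness.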
…row invariant Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
seed_chunk_producers_for_test gated EarlyKickout on the epoch after the anchor; production save_chunk_producers_for_header gates on the anchor's own epoch (get_epoch_protocol_version(header.epoch_id())). The epoch-after gate over-seeds dead rows for last-of-epoch anchors across an EarlyKickout activation edge (reachable via test_protocol_version_switch); those rows are never read by the aggregator/witness cross-epoch arm. Gate on the anchor's own epoch instead; sampling still uses the epoch after the anchor. Full near-epoch-manager nightly suite 106/0; near-chain runtime::tests + garbage_collection 55/0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses Jan's authenticated cache-spam review on the V2 partial witness path. The parent-absent (raced) cross-check accepted any height >= anchor.height + 2, and the same-epoch ChunkProducers DB arm resolves the producer from the anchor row regardless of height_created. A staked producer could therefore sign witnesses for every height in [anchor+2, head+MAX_HEIGHTS_AHEAD] under one anchor, each a distinct ChunkProductionKey, flooding the 5-entry per-shard parts cache and evicting valid witnesses. Pin the parent-absent branch to exact height_created == anchor.height + 2, so a non-default same-epoch anchor authorizes exactly one ChunkProductionKey per shard. This intentionally drops a legitimately skipped-slot witness (height anchor+3) still racing its parent: the skip does not buy time for the parent to arrive first, so it is a conscious interim liveness tradeoff (witnesses are lossy). MAX_HEIGHTS_AHEAD is kept (still bounds V1 and the prev_prev==default fallback). The full fix that closes spam with no skip-race loss is anchor-keyed parent-absent caching, tracked as a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ose-branch comment Add v2_witness_with_parent_known_and_skipped_slot_is_accepted: with the parent at G.height+2 (slot G.height+1 skipped) processed locally, a chunk at G.height+3 validates via the tight (parent-known) branch. Documents that only the parent-absent race drops skipped-slot witnesses; once the parent is known the exact anchor+2 pin no longer applies. Also qualify the parent-absent comment: the exact anchor+2 pin authorizes one ChunkProductionKey per (non-default, same-epoch) anchor per shard; this branch pins height, not epoch, and cross-epoch / prev_prev==default anchors fall back to canonical sampling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
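The two height cross-checks discussed in the last two commits can be summarised in a short sketch. It assumes the rules as stated there: with the parent known, the chunk must sit at `parent.height + 1`; with the parent absent, it must sit at exactly `anchor.height + 2`. The `ParentView` enum and function name are hypothetical; the offset constant name follows the one mentioned in the merge notes.

```rust
/// The +2 grandparent-anchor height offset named in the commits.
const CHUNK_GRANDPARENT_ANCHOR_HEIGHT_OFFSET: u64 = 2;

/// What the validating node knows about the witness's ancestry (illustrative).
#[derive(Debug, Clone, Copy)]
enum ParentView {
    /// Parent block is processed locally.
    Known { parent_height: u64 },
    /// Parent not yet available; only the grandparent anchor is known.
    Absent { anchor_height: u64 },
}

/// Returns true when `height_created` is consistent with the ancestry view.
fn height_cross_check(view: ParentView, height_created: u64) -> bool {
    match view {
        // Tight branch: skipped slots are fine, the chunk follows its parent.
        ParentView::Known { parent_height } => height_created == parent_height + 1,
        // Parent-absent branch: pinned to exactly anchor + 2, so one anchor
        // authorizes a single height per shard.
        ParentView::Absent { anchor_height } => {
            height_created == anchor_height + CHUNK_GRANDPARENT_ANCHOR_HEIGHT_OFFSET
        }
    }
}
```

For an anchor at height 100, a chunk at 103 passes once its parent at 102 is known, but is rejected while the parent is still absent, which is the interim liveness tradeoff the commit describes.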
…ut-witness-v2 # Conflicts: # core/primitives-core/src/version.rs
…or-v2' into forknet/early-kickout-witness-v2
… forknet/early-kickout-witness-v2 # Conflicts: # chain/client/src/stateless_validation/validate.rs # chain/client/src/stateless_validation/validate_tests.rs
…producers-tests' into forknet/early-kickout-witness-v2 # Conflicts: # chain/epoch-manager/src/lib.rs
The forknet demo aggregator forward-recomputes the per-block blacklist instead of reading the producers consensus resolved from DBCol::ChunkProducers. The two agree because the stored row was written as sample_chunk_producer_excluding(blacklist) and the blacklist is deterministic from the same epoch prefix the walk reconstructs; the strict-anchored aggregator is deferred to the dynamic-blacklist PR. Also fix doc drift flagged by codex review: V2 parent-absent cross-check is exact (anchor+2), not loose, in validate_tests.rs comments. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Throwaway branch for a forknet demo. Combines several in-flight PRs on top of latest master, pinned to a stable PV84 binary (`STABLE_PROTOCOL_VERSION = 84`, `Wasmtime | EarlyKickout => 84`, `NIGHTLY = 155`).
Merge set:
Demo-layer specifics: de-nightlied the `DBCol::ChunkProducers` / anchored-resolution / contract-v2 production path so it is active at PV84; aggregator stays forward-recompute (anchored aggregator dropped, strict-anchored deferred to the dynamic-blacklist PR); store seeder keeps the kickout-excluding sampler.
Not intended to land. Build / forknet-config / demo-tuning tracked separately.
CI note: the `Cargo Nextest (Linux)` stable job is expected-red on this branch and is not a regression. It pins `STABLE_PROTOCOL_VERSION = 84` (the Wasmtime + EarlyKickout activation edge), so master's PV86-calibrated test suite hits ~123 Wasmtime "unknown or invalid import" LinkErrors (standard test contracts declare PV85-gated host fns: `chain_id`, `p256_verify`, `yield_with_id`, `gas_key_host_fns`) plus feature-85-off tests. Pre-existing since the 2026-06-16 master merge. Nightly / Spice / MacOS are green.