
feat(forknet): combined early chunk producer kickout#15870

Draft
stedfn wants to merge 49 commits into master from forknet/early-kickout-witness-v2

@stedfn stedfn commented Jun 8, 2026

Throwaway branch for a forknet demo. Combines several in-flight PRs on top of latest master, pinned to a stable PV84 binary (STABLE_PROTOCOL_VERSION = 84, Wasmtime | EarlyKickout => 84, NIGHTLY = 155).

Merge set:

Demo-layer specifics: de-nightlied the DBCol::ChunkProducers / anchored-resolution / contract-v2 production path so it is active at PV84; aggregator stays forward-recompute (anchored aggregator dropped, strict-anchored deferred to the dynamic-blacklist PR); store seeder keeps the kickout-excluding sampler.

Not intended to land. Build / forknet-config / demo-tuning tracked separately.


CI note: the Cargo Nextest (Linux) stable job is expected to be red on this branch and is not a regression. The branch pins STABLE_PROTOCOL_VERSION = 84 (the Wasmtime + EarlyKickout activation edge), so master's PV86-calibrated test suite hits ~123 Wasmtime "unknown or invalid import" LinkErrors (the standard test contracts declare PV85-gated host functions: chain_id, p256_verify, yield_with_id, gas_key_host_fns), plus failures in tests that assume PV85 features are on. These failures have been present since the 2026-06-16 master merge. Nightly / Spice / MacOS are green.

stedfn and others added 30 commits May 21, 2026 11:01
Activate V2 partial witness emission and processing, gated behind
ProtocolFeature::EarlyKickout.

Constructor: `VersionedPartialEncodedStateWitness::new()` now emits V2
when the epoch protocol version >= EarlyKickout, V1 otherwise.

Consumer: replaces the unconditional V2 drop with EarlyKickout gating.
V2 witnesses use `prev_block_hash`-based chunk producer lookup against
`DBCol::ChunkProducers`. When the prev block isn't yet available,
witnesses are deferred into `PendingV2WitnessCache` (64-entry LRU)
and replayed on `BlockNotificationMessage` via scan-on-notification
with three-outcome classification (Ready/Requeue/Retire).
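
A minimal sketch of the defer/replay flow described above, not the actual `PendingV2WitnessCache`. The type `PendingCache`, its methods, and the generic witness type `W` are hypothetical, and plain insertion-order eviction stands in for the LRU policy. It assumes a non-zero capacity.

```rust
use std::collections::VecDeque;

/// Outcome of re-checking a deferred witness when a block notification arrives.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplayDisposition {
    Ready,   // dependencies resolved: dispatch now
    Requeue, // still waiting: keep in the cache
    Retire,  // can never become valid: drop
}

/// Hypothetical bounded cache of deferred witnesses (oldest entry evicted first).
struct PendingCache<W> {
    capacity: usize,
    entries: VecDeque<W>,
}

impl<W> PendingCache<W> {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: VecDeque::new() }
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    /// Defers a witness and returns the evicted entry if the cache was full.
    fn try_defer(&mut self, witness: W) -> Option<W> {
        let evicted = if self.entries.len() == self.capacity {
            self.entries.pop_front()
        } else {
            None
        };
        self.entries.push_back(witness);
        evicted
    }

    /// Scans every entry on a block notification and returns the Ready ones.
    fn on_block_notification(
        &mut self,
        classify: impl Fn(&W) -> ReplayDisposition,
    ) -> Vec<W> {
        let mut ready = Vec::new();
        let mut keep = VecDeque::new();
        for witness in self.entries.drain(..) {
            match classify(&witness) {
                ReplayDisposition::Ready => ready.push(witness),
                ReplayDisposition::Requeue => keep.push_back(witness),
                ReplayDisposition::Retire => {}
            }
        }
        self.entries = keep;
        ready
    }
}
```

Returning the evicted entry from `try_defer` lets the caller count capacity evictions, which is what an eviction metric needs.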

Includes: V2 validation with cross-check (epoch_id, height_created vs
prev_block_hash-implied values), DB lookup metrics, pending cache
size/eviction metrics, emitted/received version-labeled counters,
`validate_chunk_relevant_as_validator` made public for replay pre-check,
`is_early_kickout_enabled` helper, `BlockNotificationMessage` plumbing
from client to PartialWitnessActor, and comprehensive cache tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After EarlyKickout activation, only V2 is accepted. V1 arriving over
the wire is now dropped in both handle_partial_encoded_state_witness
and handle_partial_encoded_state_witness_forward, symmetric with the
existing V2 drop when EarlyKickout is not yet enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR A squash-merged with reviewer changes (single make_witness(version)
helper, full V2-emitting constructor). Update rebased PR B accordingly:
collapse stale 'Current scope/Eventual' docstring into the active
rollout policy, and switch test_v1_signature_differentiator test to
the unified helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #15640 review:

P1: three-valued kickout gate. is_early_kickout_enabled returned bool
with Err mapped to false, which dropped legitimate post-kickout V2
during header-sync lag at epoch boundary. Replace with
early_kickout_status returning Option<bool>; extract the gate
decision into a pure kickout_gate(status, witness) function so
unknown-epoch falls through to the producer-lookup defer path. With
no part retransmission and one unique RS part per validator, this
removes a correlated-loss failure mode at epoch transitions.

P3: replay_forwarded_partial_witness re-defers on DBNotFoundErr.
pre_check_replay classifies on the actor thread, then the spawner
re-runs the lookup; a race could surface MissingBlock/
ChunkProducerNotInDB after Ready, permanently losing the forwarded
part (the producer doesn't retransmit forwards). Mirror the init-emit
arm's try_defer.

P3: stop double-counting received parts on replay. Split
handle_partial_encoded_state_witness into a thin public entry that
increments PARTIAL_WITNESS_PART_MESSAGES_RECEIVED_TOTAL once, and a
shared dispatch_partial_encoded_state_witness body that
handle_block_notification calls directly so replay does not re-increment.
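
The entry/dispatch split can be sketched as follows. This is an illustration only: the function names are shortened, the `u32` part and the `dispatched` log are hypothetical stand-ins, and an atomic counter replaces the Prometheus metric.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for PARTIAL_WITNESS_PART_MESSAGES_RECEIVED_TOTAL.
static RECEIVED_TOTAL: AtomicU64 = AtomicU64::new(0);

/// Thin public entry: counts the network message once, then dispatches.
fn handle_partial_witness(part: u32, dispatched: &mut Vec<u32>) {
    RECEIVED_TOTAL.fetch_add(1, Ordering::Relaxed);
    dispatch_partial_witness(part, dispatched);
}

/// Shared body: touches no counter, so the replay path can call it directly.
fn dispatch_partial_witness(part: u32, dispatched: &mut Vec<u32>) {
    dispatched.push(part);
}

/// Replay path: re-dispatches deferred parts without re-counting them.
fn handle_block_notification(deferred: Vec<u32>, dispatched: &mut Vec<u32>) {
    for part in deferred {
        dispatch_partial_witness(part, dispatched);
    }
}
```

A part that is received once and replayed once is dispatched twice but counted once.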

P3: V1 + DBNotFoundErr is not 'unexpected' — V1's epoch-based sampler
can legitimately return it across epoch transitions. Downgrade
warning to debug with corrected wording.

Tests: kickout_gate symmetry for all six (status, variant) cases,
including the unknown-epoch fall-through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds unit test asserting validate_partial_encoded_state_witness rejects a
V2 witness whose chunk_header.height_created does not match the
prev_block.height + 1 implied by its prev_block_hash. Guards the
authentication boundary added in validate.rs: without the cross-check a
producer authorized for (prev_block, shard) could sign a witness under
any (epoch_id, height_created) and have it stored/forwarded under the
forged key.
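
A rough sketch of the cross-check this test guards, under simplifying assumptions: epoch ids are plain `u64`, and the structs `SignedHeader` and `PrevBlock` and the function `cross_check` are hypothetical, not the types in validate.rs.

```rust
#[derive(Debug, PartialEq, Eq)]
enum CrossCheckError {
    EpochMismatch,
    HeightMismatch,
}

/// Fields the producer signed (hypothetical shape).
struct SignedHeader {
    epoch_id: u64,
    height_created: u64,
}

/// Values the receiver derives locally from prev_block_hash (hypothetical shape).
struct PrevBlock {
    height: u64,
    /// Epoch of the block that follows the prev block.
    child_epoch_id: u64,
}

/// Rejects a witness whose signed (epoch_id, height_created) disagrees with
/// the values implied by its prev_block_hash.
fn cross_check(header: &SignedHeader, prev: &PrevBlock) -> Result<(), CrossCheckError> {
    if header.epoch_id != prev.child_epoch_id {
        return Err(CrossCheckError::EpochMismatch);
    }
    if header.height_created != prev.height + 1 {
        return Err(CrossCheckError::HeightMismatch);
    }
    Ok(())
}
```

Without the second check, a producer authorized for the prev block could pick any `height_created` and still pass producer resolution.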

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Route all V2 defer call sites through PendingV2WitnessCache::try_defer and
drop defer_pending_v2_witness. Success log moves into try_defer so spawner-side
defers also log.

Replace KickoutGate enum + kickout_gate fn + early_kickout_status method with
witness_kicked_out free fn taking Option<ProtocolVersion>. Preserves
three-state contract — None (epoch unknown) proceeds for both variants so
header-sync lag does not poison legitimate V2 traffic.
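
The three-state contract can be illustrated as below. The signature is an assumption based on the description above, `WitnessVersion` is a hypothetical stand-in for the real variant type, and the activation constant uses the demo pin (84) from the PR description.

```rust
type ProtocolVersion = u32;

/// Assumed activation version; the PR description pins EarlyKickout to 84.
const EARLY_KICKOUT_PV: ProtocolVersion = 84;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WitnessVersion {
    V1,
    V2,
}

/// Returns true when the witness variant must be dropped.
/// `None` means the epoch is not known yet, so both variants proceed.
fn witness_kicked_out(epoch_pv: Option<ProtocolVersion>, version: WitnessVersion) -> bool {
    match epoch_pv {
        None => false,
        Some(pv) => {
            let enabled = pv >= EARLY_KICKOUT_PV;
            match version {
                // V1 is dropped once EarlyKickout is active.
                WitnessVersion::V1 => enabled,
                // V2 is dropped until EarlyKickout is active.
                WitnessVersion::V2 => !enabled,
            }
        }
    }
}
```

Mapping an unknown epoch to `false` instead of treating it as "disabled" is what keeps header-sync lag from dropping legitimate V2 parts.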

Trim PENDING_V2_WITNESS_CACHE_SIZE and PendingV2WitnessCache docstrings. Fix
doc contradiction (producer does not retransmit witness parts). TODO(stedfn)
markers for DoS caps, capacity vs block_fetch_horizon, capacity tuning from
PARTIAL_WITNESS_PENDING_CACHE_EVICTIONS_TOTAL.

Net: −76 lines. Zero behavior change. Tests rewritten against witness_kicked_out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compress V2 partial-witness defer + kickout comments touched by this branch:
drop articles/filler/hedging, keep all load-bearing nuance (DoS capacity
headroom, "for this chunk" filter, "can exhaust" risk framing). Codex
plan-review surfaced the three precision risks; all restored before edit.

Net: -30 lines across actor + tests. Comments only — zero behavior change.
Build + 8/8 module tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply caveman-full compression to comments added or modified by this PR in the
remaining touched files (actor.rs and actor_tests.rs were compressed in the
prior commit).

Files: client.rs (block-notification comment), metrics.rs (eviction shard-label
rationale), validate.rs (4 V2 producer-resolution comments + cross-check
rationale), validate_tests.rs (height-mismatch test docstrings),
core/primitives partial_witness.rs (rollout-policy line tweak).

Prometheus metric Help strings left untouched — externally consumed labels
behave like log messages.

Net: -7 lines. Comments only — zero behavior change. Build + 19/19
stateless_validation tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace "misroutes" with "routes wrong" in test docstring: cspell does not
recognize "misroutes". No functional change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract pre_check_replay into a pub(super) free fn so the replay
classifier can be unit-tested against a real setup() chain without an
actor. Add per-arm tests for each ReplayDisposition outcome (Requeue
on signer-unavailable / too-early / missing-block, Retire on too-late /
not-a-chunk-validator / invalid-shard, Ready on V2 DB resolve).
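
The per-arm mapping listed above, written as a pure function. `CheckOutcome` is a hypothetical enum that only names the cases from this commit message; the real classifier works on validation results against a chain.

```rust
/// Hypothetical summary of what the replay pre-check observed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckOutcome {
    SignerUnavailable,
    TooEarly,
    MissingBlock,
    TooLate,
    NotAChunkValidator,
    InvalidShard,
    ProducerResolvedFromDb,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplayDisposition {
    Ready,
    Requeue,
    Retire,
}

/// Pure classifier: transient conditions requeue, permanent ones retire.
fn classify(outcome: CheckOutcome) -> ReplayDisposition {
    match outcome {
        CheckOutcome::SignerUnavailable
        | CheckOutcome::TooEarly
        | CheckOutcome::MissingBlock => ReplayDisposition::Requeue,
        CheckOutcome::TooLate
        | CheckOutcome::NotAChunkValidator
        | CheckOutcome::InvalidShard => ReplayDisposition::Retire,
        CheckOutcome::ProducerResolvedFromDb => ReplayDisposition::Ready,
    }
}
```

Keeping the classifier free of side effects is what allows it to be tested without an actor.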

Add a nightly-gated test-loop smoke test for the end-to-end defer/replay
path: a V2 witness whose prev_block is held back defers (pending-cache
gauge rises) and dispatches once the block lands (gauge drains).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spice distributes witnesses via its own data-distribution path, not the
V2 partial-witness pending cache, so the defer never fires and run_until
times out. Gate the test off the spice lane like the other
stateless-validation test-loop tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remediate Trisfald review comments plus a codex-found liveness gap on the
V2 partial witness defer/replay path.

- defer V2 witnesses on transient IOErr in pre_check_replay (C5)
- gate forwarded witness at the replay act-site instead of the classifier,
  keeping pre_check_replay pure (C6)
- collapse the duplicated DB-lookup metric block (C7)
- switch PendingV2WitnessCache from put to push so capacity evictions are
  counted (C11), with a delta assertion (C8)
- split the EpochOutOfBounds replay disposition: requeue a V2 witness whose
  prev block is not yet synced, retire once the prev block is known and the
  signed epoch still mismatches (C12)
- read the receiver actor's own cache via a test_features
  pending_cache_bucket_count accessor instead of the process-global gauge (C10)
- also defer EpochOutOfBounds forwarded V2 parts on initial receive: validate
  resolves validator assignments from the signed epoch before the producer-DB
  defer path, so an unsynced epoch was dropped instead of deferred (codex)
- doc, import, and dead-code cleanups (C1, C2, C3, C4, C9)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address Jan's J3 review note that the V2 witness comments were too dense.
Keep the non-obvious WHY (security cross-check rationale, kickout-gate, defer
semantics) and cut enumerations that mirror match arms, mechanism restatements,
and obvious inline notes. Comments only, no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-v2-consumer

# Conflicts:
#	chain/client/src/stateless_validation/partial_witness/partial_witness_actor.rs
#	chain/client/src/stateless_validation/validate.rs
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…' into forknet/early-kickout-witness-v2

# Conflicts:
#	CHANGELOG.md
…ut logs

Move EarlyKickout from nightly (PV 152) into the stable PV 85 feature group
and remove the 5 nightly cfg gates on DBCol::ChunkProducers so the feature
compiles into a plain (non-nightly) release binary. At PV 85 the shuffle and
contract-loading config gates never fire, so EarlyKickout is the only deviation
from stable mainnet semantics — clean forknet demo isolation.

Add a single greppable `early_kickout` tracing target:
- compute_chunk_producer_blacklist: info per blacklisted producer; warn when the
  safety valve drops a shard (would exclude all producers).
- get_chunk_producer_blacklist: info the full {shard_id: [account_id]} map.
- save_chunk_producers_for_header: info on slot reassignment, guarded by
  blacklist.contains(default_id) since weighted samplers renormalize on any
  exclusion.
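
A sketch of a per-shard blacklist with the safety valve described above. The kickout criterion (a produced/expected percentage threshold), the `ProducerStats` struct, and the function name are assumptions made for illustration; `eprintln!` stands in for the `early_kickout` tracing target.

```rust
use std::collections::{BTreeMap, BTreeSet};

type ShardId = u64;
type AccountId = String;

/// Hypothetical per-producer production stats for one shard.
struct ProducerStats {
    account_id: AccountId,
    expected: u64,
    produced: u64,
}

/// Builds {shard_id: blacklisted accounts}; a shard is omitted when the
/// blacklist would exclude every one of its producers (safety valve).
fn blacklist_with_safety_valve(
    stats: &BTreeMap<ShardId, Vec<ProducerStats>>,
    min_percent: u64,
) -> BTreeMap<ShardId, BTreeSet<AccountId>> {
    let mut blacklist = BTreeMap::new();
    for (shard_id, producers) in stats {
        let excluded: BTreeSet<AccountId> = producers
            .iter()
            .filter(|p| p.expected > 0 && p.produced * 100 < p.expected * min_percent)
            .map(|p| p.account_id.clone())
            .collect();
        if excluded.is_empty() {
            continue;
        }
        if producers.iter().all(|p| excluded.contains(&p.account_id)) {
            // Safety valve: never leave a shard without a chunk producer.
            eprintln!("early_kickout: safety valve dropped shard {shard_id}");
            continue;
        }
        for account_id in &excluded {
            eprintln!("early_kickout: blacklisted {account_id} on shard {shard_id}");
        }
        blacklist.insert(*shard_id, excluded);
    }
    blacklist
}
```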

Throwaway forknet branch, not for landing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move EarlyKickout from PV 85 to PV 84 (alongside Wasmtime) and cap
STABLE_PROTOCOL_VERSION at 84 so the demo binary never votes up to 85.

At PV 85 the mainnet epoch config (85.json) is dynamic-resharding, which
fork-network's genesis writer rejects (requires a static shard layout). At
PV 84 get_config resolves to the static 81.json, so set-validators writes
genesis cleanly, and DynamicResharding (PV 85) stays off — leaving
EarlyKickout as the only deviation from stable mainnet semantics. Genesis
runs with --genesis-protocol-version 84; EarlyKickout is live from genesis.
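
The config resolution this relies on can be sketched as a "greatest version at or below the requested one" lookup. The map contents and the function name `resolve_config` are illustrative; only the 81.json and 85.json entries come from the message above.

```rust
use std::collections::BTreeMap;

/// Returns the config registered at the highest version <= `pv`, if any.
fn resolve_config(
    configs: &BTreeMap<u32, &'static str>,
    pv: u32,
) -> Option<(u32, &'static str)> {
    configs
        .range(..=pv)
        .next_back()
        .map(|(version, name)| (*version, *name))
}
```

With entries at 81 (static layout) and 85 (dynamic resharding), a request for PV 84 resolves to the 81 entry, while PV 85 picks up the dynamic one.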

Throwaway forknet branch, not for landing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DB_VERSION was gated on ContinuousEpochSync (PV 85). Capping the demo
binary at PV 84 disabled that feature and dropped DB_VERSION to 48, so the
binary refused to open the forked mainnet snapshot (already at v49) with
"Database version 49 is higher than the expected version 48".

Pin DB_VERSION to 49 unconditionally. The DB is already at 49 so no
migration runs, and the 48->49 migration self-skips on forknet DBs anyway
(genesis epoch / no head). ContinuousEpochSync being off at runtime only
means the epoch-sync proof is not maintained going forward, which is
irrelevant to the early-kickout demo.
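
A small sketch of the failure and the fix, assuming the open-time check simply compares the on-disk version with the compiled-in one. The function names and the exact gating expression are hypothetical; the version numbers and the error text come from the message above.

```rust
/// Pre-fix behaviour (assumed shape): DB version follows a PV 85 feature gate.
fn gated_db_version(stable_protocol_version: u32) -> u32 {
    if stable_protocol_version >= 85 { 49 } else { 48 }
}

/// Post-fix behaviour: pinned unconditionally.
const PINNED_DB_VERSION: u32 = 49;

/// Refuses to open a database that is newer than the binary expects.
fn check_db_version(on_disk: u32, expected: u32) -> Result<(), String> {
    if on_disk > expected {
        return Err(format!(
            "Database version {} is higher than the expected version {}",
            on_disk, expected
        ));
    }
    Ok(())
}
```

Capping the stable protocol version at 84 makes the gated value 48, which is why the v49 snapshot was rejected before the pin.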

Throwaway forknet branch, not for landing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The loop persisting chain_store_cache_update.chunk_producers into
DBCol::ChunkProducers was still #[cfg(feature = "nightly")]. In the
non-nightly demo binary it compiled out, so the rows seeded by
save_chunk_producers_for_header / save_genesis_chunk_producers were never
written. Every get_chunk_producer_info_db read then missed, returning
ChunkProducerNotInDB, which aborted all chunk production from genesis
(produced=0 on every shard). EarlyKickout correctly flagged all producers
but its safety valve suppressed every blacklist, so the chain produced no
chunks and stalled at the second epoch boundary with "not enough approvals".

Remove the cfg so the flush runs in the stable build. This was the last
nightly gate on the EarlyKickout production path; remaining cfg(nightly)
hits are test-only.

Throwaway forknet branch, not for landing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-v2-consumer

# Conflicts:
#	chain/client/src/stateless_validation/partial_witness/partial_witness_actor.rs
#	chain/client/src/stateless_validation/validate.rs
Pre-landing review follow-ups for the grandparent-anchor change:

- send_state_witness_ack resolved the chunk producer via the canonical
  height-based sampler, which diverges from the anchored resolver under
  skipped slots and misdirected the ack. Resolve via the grandparent
  anchor, falling back to the canonical sampler when the parent block is
  not yet processed locally (the ack is best-effort and must not abort
  witness processing).
- Add unit coverage: a parent-absent witness with a valid anchor is
  accepted, a forged anchor is rejected, and the init-emit handler drops
  an unresolvable anchor without spawning validation.
- Read the parent BlockInfo once in get_chunk_producer_info_from_prev_block
  instead of twice (height plus a redundant read inside grandparent_anchor).
- Name the +2 grandparent-anchor height offset and use it at the writer
  and validator; refresh the stale ChunkProducerNotInDB doc and the stale
  test name; derive Debug for ChunkRelevance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stedfn and others added 19 commits June 15, 2026 13:28
Comment-only cleanup pass on the witness V2 grandparent-anchor work:
strip step-by-step narration and prose restatement, compress multi-line
test docstrings to one-line intent, keep non-obvious WHY comments.

Also fix stale/inaccurate doc comments surfaced during review:
- partial_witness_actor: anchor-drop covers MissingBlock and the DB-row miss
- validate: TooEarly references head (not final head); return doc is ChunkRelevance
- shards_manager module doc: validate_chunk_header split into preliminary/full
- test-loop chunk_producers: resharding-boundary resolution is canonical, not a DB read

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-witness-v2

Swap #15640 -> #15908. Re-strip ChunkProducers nightly gates (columns.rs x5,
adapter.rs, store/mod.rs flush loop) + the paired not(nightly) fallbacks.
Keep PV84/stable cap 84 + all #15908 PV85 features (NIGHTLY=155).

Aggregator conflict: kept #15843 forward-recompute attribution and dropped
#15908's DB-anchored anchored_chunk_producers_for_aggregator helper (codex-
confirmed equivalent in the live demo, self-contained in tests). Witness-v2
unaffected: it uses the separate adapter::get_chunk_producer_info_anchored.

Throwaway forknet demo branch, not for landing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts:
- chain/epoch-manager/src/lib.rs: union imports (keep
  CHUNK_GRANDPARENT_ANCHOR_HEIGHT_OFFSET + master dynamic-resharding metrics)
- tools/protocol-schema-check/res/protocol_schema.toml: regenerated
  (witness-v2 type hashes + transitive routing enums)

version.rs demo pins preserved (EarlyKickout=84, STABLE_PROTOCOL_VERSION=84).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rev-prev-anchor-v2

# Conflicts:
#	chain/epoch-manager/src/lib.rs
#	tools/protocol-schema-check/res/protocol_schema.toml
Address PR #15908 review (Trisfald).

- Aggregator now mirrors consensus on a missing grandparent-anchor row:
  consensus errors (ChunkProducerNotInDB), so the kickout aggregator omits
  the shard instead of height-sampling, which could attribute a different
  producer than consensus resolved. anchored_chunk_producers_for_aggregator
  drops the shard (and its now-unused height param); update_tail skips an
  absent shard under Some(map). (C3 / F2)
- chunk_producers: add test_resolution_consistent_across_skipped_slot, a
  skipped-slot (height_created > anchor.height + 2) case that poisons
  DB[anchor] and asserts both self-select and witness resolution read it,
  proving resolution is DB-keyed by hash, not height-resampled. (C4)
- epoch-manager tests: add test_aggregator_skips_shard_on_missing_anchor_row
  and a shared setup_anchored_aggregator_chain fixture that seeds each block's
  row in-loop at height + 2, mirroring save_chunk_producers_for_header.
- Doc fixes: spell out ChunkProductionKey (C1), drop the nightly note from a
  test module doc (C2), reword signature_verification.rs to drop the fn name
  and the unconditional "errors on a missing DB entry" claim since cross-epoch
  and low-height fall back to the canonical sampler with no DB read (C6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…or-v2' into stedfn/witness-v2-prev-prev-anchor-v2
…ampler

The C3 change in ae2687e made the kickout aggregator skip a shard when its
grandparent-anchor ChunkProducers row is missing, instead of the prior
height-sample fallback. That broke 5 nightly tests (test_chunk_producer_kickout,
test_chunk_validator_kickout_using_{production,endorsement}_stats,
test_reward_multiple_shards, near-chain test_get_validator_info) and, via the
`continue`, also dropped the shard's endorsement stats.

The missing-row branch is unreachable for any epoch whose kickout a node
recomputes: save_chunk_producers_for_header seeds the grandparent's row two
blocks before the dependent chunk aggregates, so the aggregator reads DB[anchor]
which is exactly the producer consensus resolved. The skip only diverged in the
unseeded test harness (a state production never has). Reverting restores the
height-sample fallback (exact for consecutive heights) and endorsement tracking
in one move; it still mirrors consensus wherever the row exists, i.e. everywhere
that feeds a real kickout decision.

Reverts the aggregator portion of ae2687e (epoch_info_aggregator.rs update_tail,
lib.rs anchored_chunk_producers_for_aggregator + its height param, and the
tests/mod.rs skip test + fixture). The C1/C2/C4/C6 doc/test items from that
commit are unrelated and stay.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…row invariant

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
seed_chunk_producers_for_test gated EarlyKickout on the epoch after the
anchor; production save_chunk_producers_for_header gates on the anchor's
own epoch (get_epoch_protocol_version(header.epoch_id())). The epoch-after
gate over-seeds dead rows for last-of-epoch anchors across an EarlyKickout
activation edge (reachable via test_protocol_version_switch); those rows
are never read by the aggregator/witness cross-epoch arm.

Gate on the anchor's own epoch instead; sampling still uses the epoch
after the anchor. Full near-epoch-manager nightly suite 106/0; near-chain
runtime::tests + garbage_collection 55/0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses Jan's authenticated cache-spam review on the V2 partial witness
path. The parent-absent (raced) cross-check accepted any height >=
anchor.height + 2, and the same-epoch ChunkProducers DB arm resolves the
producer from the anchor row regardless of height_created. A staked
producer could therefore sign witnesses for every height in
[anchor+2, head+MAX_HEIGHTS_AHEAD] under one anchor, each a distinct
ChunkProductionKey, flooding the 5-entry per-shard parts cache and
evicting valid witnesses.

Pin the parent-absent branch to exact height_created == anchor.height + 2,
so a non-default same-epoch anchor authorizes exactly one
ChunkProductionKey per shard. This intentionally drops a legitimately
skipped-slot witness (height anchor+3) still racing its parent: the skip
does not buy time for the parent to arrive first, so it is a conscious
interim liveness tradeoff (witnesses are lossy). MAX_HEIGHTS_AHEAD is kept
(still bounds V1 and the prev_prev==default fallback). The full fix that
closes spam with no skip-race loss is anchor-keyed parent-absent caching,
tracked as a follow-up.
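
To make the spam bound concrete, the sketch below counts how many heights one anchor authorizes under the old loose check versus the exact pin. The helper names and the `max_heights_ahead` parameter value are hypothetical; the constant name and its value 2 follow the grandparent-anchor offset mentioned elsewhere in this PR.

```rust
const CHUNK_GRANDPARENT_ANCHOR_HEIGHT_OFFSET: u64 = 2;

/// Old parent-absent check: any height at or above anchor + 2.
fn loose_parent_absent(height_created: u64, anchor_height: u64) -> bool {
    height_created >= anchor_height + CHUNK_GRANDPARENT_ANCHOR_HEIGHT_OFFSET
}

/// New parent-absent check: exactly anchor + 2.
fn exact_parent_absent(height_created: u64, anchor_height: u64) -> bool {
    height_created == anchor_height + CHUNK_GRANDPARENT_ANCHOR_HEIGHT_OFFSET
}

/// Counts heights up to head + max_heights_ahead that a single anchor
/// authorizes on one shard, i.e. distinct ChunkProductionKeys a producer
/// could sign under that anchor.
fn authorized_heights(
    anchor_height: u64,
    head_height: u64,
    max_heights_ahead: u64,
    check: fn(u64, u64) -> bool,
) -> usize {
    (anchor_height..=head_height + max_heights_ahead)
        .filter(|height| check(*height, anchor_height))
        .count()
}
```

With an anchor at height 100, head at 101, and a look-ahead of 5, the loose check admits heights 102 through 106, while the exact pin admits only 102.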

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ose-branch comment

Add v2_witness_with_parent_known_and_skipped_slot_is_accepted: with the parent
at G.height+2 (slot G.height+1 skipped) processed locally, a chunk at
G.height+3 validates via the tight (parent-known) branch. Documents that only
the parent-absent race drops skipped-slot witnesses; once the parent is known
the exact anchor+2 pin no longer applies.

Also qualify the parent-absent comment: the exact anchor+2 pin authorizes one
ChunkProductionKey per (non-default, same-epoch) anchor per shard; this branch
pins height, not epoch, and cross-epoch / prev_prev==default anchors fall back
to canonical sampling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ut-witness-v2

# Conflicts:
#	core/primitives-core/src/version.rs
…or-v2' into forknet/early-kickout-witness-v2
… forknet/early-kickout-witness-v2

# Conflicts:
#	chain/client/src/stateless_validation/validate.rs
#	chain/client/src/stateless_validation/validate_tests.rs
…producers-tests' into forknet/early-kickout-witness-v2

# Conflicts:
#	chain/epoch-manager/src/lib.rs
The forknet demo aggregator forward-recomputes the per-block blacklist
instead of reading the producers consensus resolved from
DBCol::ChunkProducers. The two agree because the stored row was written
as sample_chunk_producer_excluding(blacklist) and the blacklist is
deterministic from the same epoch prefix the walk reconstructs; the
strict-anchored aggregator is deferred to the dynamic-blacklist PR.

Also fix doc drift flagged by codex review: V2 parent-absent cross-check
is exact (anchor+2), not loose, in validate_tests.rs comments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@stedfn stedfn changed the title feat(forknet): combine early chunk producer kickout and witness v2 feat(forknet): combined early chunk producer kickout Jun 22, 2026