cherry-pick: async-backing backports from v1.13 onto upgrade/1.12.0 (parked) #26
Closed
DrudgeRajen wants to merge 4 commits into
... when backing group is of size 1. Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
…h#4471)

Implements paritytech#4429

Collators only need to maintain the implicit view for the paraid they are collating on. In this case, bypass prospective-parachains entirely. It's still useful to use the GetMinimumRelayParents message from prospective-parachains for validators, because the data is already present there.

This enables us to entirely remove the subsystem from collators, which consumed resources needlessly.

Aims to resolve paritytech#4167

TODO:
- [x] fix unit tests
Implements most of paritytech#1797

Core sharing (two or more parachains scheduled on the same core with the same `PartsOf57600` value) was not working correctly.

The expected behaviour is to have Backed and Included event in each block for the paras sharing the core and the paras should take turns. E.g. for two cores we expect: Backed(a); Included(a)+Backed(b); Included(b)+Backed(a); etc. Instead of this each block contains just one event and there are a lot of gaps (blocks w/o events) during the session.

Core sharing should also work when collators are building collations ahead of time.

TODOs:
- [x] Add a zombienet test verifying that the behaviour mentioned above works.
- [x] prdoc

---------

Co-authored-by: alindima <alin@parity.io>
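The turn-taking described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the scheduler's code: the `Event` enum and `simulate` function are hypothetical names, one candidate is backed per block, and it is included in the following block.

```rust
// Toy model of the expected event sequence for paras sharing one core.
// `Event` and `simulate` are illustrative names, not polkadot-sdk APIs.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Backed(char),
    Included(char),
}

/// Returns the events expected in each of `blocks` consecutive relay blocks.
/// Assumes `paras` is non-empty and that a candidate backed in block N
/// becomes included in block N + 1.
fn simulate(paras: &[char], blocks: usize) -> Vec<Vec<Event>> {
    let mut out = Vec::with_capacity(blocks);
    // Candidate backed in the previous block, waiting for inclusion.
    let mut pending: Option<char> = None;
    for n in 0..blocks {
        let mut events = Vec::new();
        if let Some(para) = pending.take() {
            events.push(Event::Included(para));
        }
        // Paras take turns on the shared core.
        let next = paras[n % paras.len()];
        events.push(Event::Backed(next));
        pending = Some(next);
        out.push(events);
    }
    out
}

fn main() {
    for (n, events) in simulate(&['a', 'b'], 4).iter().enumerate() {
        println!("block {n}: {events:?}");
    }
}
```

With paras `a` and `b`, every block after the first carries both an `Included` and a `Backed` event, which is the gap-free pattern the commit message expects.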
After cherry-picking paritytech#4724 (core sharing + scheduling_lookahead) on top of v1.12.0, the following upstream references do not resolve because the enabling PRs landed later:

1. `polkadot_statement_table::{…}` → use local `statement-table` alias (crate package name is `polkadot-statement-table`, imported via `statement-table = { package = … }` in Cargo.toml).
2. `polkadot_primitives::{…}` in `scheduler.rs` and `runtime_api_impl/vstaging.rs` → use local `primitives` alias.
3. `AvailabilityStoreMessage::StoreAvailableData::{core_index, node_features}` → the two new fields were added by the systematic-chunks PR paritytech#1644, which is NOT cherry-picked (it's 7540 LoC / 84 files and brings an unrelated availability-recovery rewrite). Keep the outer function signature as-is for API parity with upstream and drop the two fields at the message-send boundary via `let _ = …;`. The erasure-root check that protects consensus continues to run.
4. `free_cores_and_fill_claimqueue` → in our tree the method is named `free_cores_and_fill_claim_queue` (the rename was a small cleanup that happened concurrently in upstream). Update the two call sites (paras_inherent and runtime_api v10).

Build clean: `cargo check -p polkadot -p polkadot-service -p polkadot-runtime-parachains -p thxnet-testnet-runtime` all pass with only pre-existing warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
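Item 3 keeps the upstream function signature but discards the two extra arguments before the message is built. A minimal sketch of that pattern follows; every type and function here is a simplified, hypothetical stand-in (the real message carries more fields), so it shows the shape of the adaptation rather than the actual diff.

```rust
// Hypothetical stand-ins for the real subsystem types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct CoreIndex(u32);

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct NodeFeatures(Vec<bool>);

/// Simplified model of the older message shape, which has no slots for
/// `core_index` or `node_features`.
#[derive(Debug, PartialEq)]
struct StoreAvailableData {
    candidate_hash: u64,
    n_validators: u32,
}

/// Keeps the newer (upstream-style) parameter list so call sites stay
/// identical, but drops the two arguments the older message cannot carry.
fn build_store_message(
    candidate_hash: u64,
    n_validators: u32,
    core_index: CoreIndex,
    node_features: NodeFeatures,
) -> StoreAvailableData {
    // Explicitly mark both values as intentionally unused.
    let _ = core_index;
    let _ = node_features;
    StoreAvailableData { candidate_hash, n_validators }
}

fn main() {
    let msg = build_store_message(7, 3, CoreIndex(0), NodeFeatures(vec![true]));
    println!("{msg:?}");
}
```

Keeping the parameters in the signature means a later cherry-pick of the larger upstream change would only need to touch the function body, not its callers.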
DrudgeRajen added a commit that referenced this pull request on Apr 19, 2026
Superseded by #30 (consolidated branch |
DrudgeRajen added a commit that referenced this pull request on Apr 19, 2026
#30)

* statement-distribution: Fix false warning (paritytech#4727)

... when backing group is of size 1.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>

* Remove the prospective-parachains subsystem from collators (paritytech#4471)

Implements paritytech#4429

Collators only need to maintain the implicit view for the paraid they are collating on. In this case, bypass prospective-parachains entirely. It's still useful to use the GetMinimumRelayParents message from prospective-parachains for validators, because the data is already present there.

This enables us to entirely remove the subsystem from collators, which consumed resources needlessly.

Aims to resolve paritytech#4167

TODO:
- [x] fix unit tests

* Fix core sharing and make use of scheduling_lookahead (paritytech#4724)

Implements most of paritytech#1797

Core sharing (two or more parachains scheduled on the same core with the same `PartsOf57600` value) was not working correctly.

The expected behaviour is to have Backed and Included event in each block for the paras sharing the core and the paras should take turns. E.g. for two cores we expect: Backed(a); Included(a)+Backed(b); Included(b)+Backed(a); etc. Instead of this each block contains just one event and there are a lot of gaps (blocks w/o events) during the session.

Core sharing should also work when collators are building collations ahead of time.

TODOs:
- [x] Add a zombienet test verifying that the behaviour mentioned above works.
- [x] prdoc

---------

Co-authored-by: alindima <alin@parity.io>

* fix(1.12.0-backports): adapt paritytech#4724 to v1.12.0 subsystem APIs

After cherry-picking paritytech#4724 (core sharing + scheduling_lookahead) on top of v1.12.0, the following upstream references do not resolve because the enabling PRs landed later:

1. `polkadot_statement_table::{…}` → use local `statement-table` alias (crate package name is `polkadot-statement-table`, imported via `statement-table = { package = … }` in Cargo.toml).
2. `polkadot_primitives::{…}` in `scheduler.rs` and `runtime_api_impl/vstaging.rs` → use local `primitives` alias.
3. `AvailabilityStoreMessage::StoreAvailableData::{core_index, node_features}` → the two new fields were added by the systematic-chunks PR paritytech#1644, which is NOT cherry-picked (it's 7540 LoC / 84 files and brings an unrelated availability-recovery rewrite). Keep the outer function signature as-is for API parity with upstream and drop the two fields at the message-send boundary via `let _ = …;`. The erasure-root check that protects consensus continues to run.
4. `free_cores_and_fill_claimqueue` → in our tree the method is named `free_cores_and_fill_claim_queue` (the rename was a small cleanup that happened concurrently in upstream). Update the two call sites (paras_inherent and runtime_api v10).

Build clean: `cargo check -p polkadot -p polkadot-service -p polkadot-runtime-parachains -p thxnet-testnet-runtime` all pass with only pre-existing warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(leafchain,v1.12.0): UNINCLUDED_SEGMENT_CAPACITY 2 → 1 to sidestep fragment-chain fork deadlock

# Problem

On v1.12.0 rootchain with async backing enabled (depth=1, ancestry=2) and a small backing topology (3 validators, 1 group, 1 core), the parachain stalls permanently after a few blocks. Validator log symptom (repeats every ~18s):
- `Refusing to second candidate at leaf. Is not a potential member.`
- `Rejected v2 advertisement ... error=BlockedByBacking`
- Provisioner: `candidates_count=0` in every inherent

# Root cause

The v1.12.0 relay-side prospective-parachains subsystem is the pre-paritytech#4937 ("prospective-parachains rework: take II") fragment-chain. In `polkadot/node/core/prospective-parachains/src/fragment_chain/mod.rs:797` `is_fork_or_cycle` returns true if ANY other candidate already has the same `parent_head_hash`, rejecting it as a fork.

When inclusion is slow, the collator's cumulus aura-ext consensus hook (FixedVelocityConsensusHook<V=1, C=2>) allows authoring a 2nd block every ~18s with the same parent while the first one is still waiting to be included. Each re-author produces a new block N with different extrinsics → same parent_head_hash → fragment-chain rejects all but the first → backing pipeline never completes a full cycle → permanent stall.

Upstream fix is paritytech#4937, first in stable2409. Not cherry-pickable into v1.12.0 without dragging Constraints types from later releases (~1355 LoC rewrite).

# Workaround

Set UNINCLUDED_SEGMENT_CAPACITY = 1. `FixedVelocityConsensusHook::can_build_upon` checks `size_after_included >= C` and returns false if true. With C=1, once the collator has produced one unincluded block, all further authoring attempts at the same parent are blocked until that block is included on the relay. No forks are ever created at the cumulus side → fragment-chain sees a single candidate per cycle → accepts it normally.

Tradeoff: para block production becomes synchronous-backing pace (one block per ~18s = 3 relay slots) instead of async's potential 1:1 with relay. Acceptable for small networks that need v1.12.0 as a stable endpoint rather than a transient upgrade hop.

# Validation

Rehearsed on forked-testnet 2026-04-18/19:
- v0.9.40 → binary upgrade → leafchain setCode spec 20 → 21 (capacity=1)
- rootchain setCode v1.12.0 (spec 112000003)
- Para advanced from stuck-at-13 → 4556+ overnight, finalization keeping pace with head, validator provisioner consistently reports `candidates_count=1` on backing cycles.

Prior capacity=2 rehearsal on the same topology reproducibly stalled para at 13–30 within minutes of rootchain setCode and never recovered.

Bump leafchain spec_version 20 → 21 so setCode applies after the v1.12.0 rootchain upgrade. The stable2512 hop restores capacity to 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(v1.12.0): free stuck AvailabilityCores + ClaimQueue atomically with setCode

Extends `EnableAsyncBackingAndCoretime` migration to force-free any occupied `AvailabilityCores` and kill `ClaimQueue` right after rewriting `ActiveConfig`. Mirrors what `Scheduler::push_occupied_cores_to_assignment_provider` does at session rotation, so the next `ParaInherent` pass can schedule fresh candidates without waiting ~1 hour for the prod-testnet session boundary.

Background: when the v1.12.0 rootchain runtime enables async backing while a candidate is still occupying a core (common under capacity=2 cumulus + pre-paritytech#4937 relay), the occupying entry never reaches availability and the core stays stuck until the next session.

Empirically measured on forked-testnet:
- plain migration → para stuck 56 min (until session rotation)
- this patch + validator restart → para advancing 16 s after setCode InBlock

The restart step is required because prospective-parachains / SessionInfo caches live in validator-process memory and need flushing to pick up the new scheduler state. Runbook: `kubectl rollout restart deploy/validator-*` post-setCode.

- Bumps spec_version 112_000_003 → 112_000_004
- Idempotent: re-running on a chain where cores are already free is a no-op

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Alexandru Gheorghe <49718502+alexggh@users.noreply.github.com>
Co-authored-by: Alin Dima <alin@parity.io>
Co-authored-by: Tsvetomir Dimitrov <tsvetomir@parity.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
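The idempotency claim for the force-free step in the last commit can be illustrated with a small in-memory model. This is a sketch under assumptions: `SchedulerState`, `CoreOccupied` and `force_free_cores` are hypothetical stand-ins for the runtime storage items and the migration body, and plain collections replace on-chain storage.

```rust
use std::collections::{BTreeMap, VecDeque};

// Hypothetical, simplified stand-ins for the runtime storage involved.
#[derive(Debug, Clone, PartialEq)]
enum CoreOccupied {
    Free,
    Paras(u32), // occupied by a candidate of this para id
}

#[derive(Debug, Clone, PartialEq)]
struct SchedulerState {
    availability_cores: Vec<CoreOccupied>,
    claim_queue: BTreeMap<u32, VecDeque<u32>>, // core index -> queued para ids
}

/// Frees every occupied core and clears the claim queue.
/// Returns how many cores were actually freed, so a second run reports 0.
fn force_free_cores(state: &mut SchedulerState) -> usize {
    let mut freed = 0;
    for core in state.availability_cores.iter_mut() {
        if *core != CoreOccupied::Free {
            *core = CoreOccupied::Free;
            freed += 1;
        }
    }
    state.claim_queue.clear();
    freed
}

fn main() {
    let mut state = SchedulerState {
        availability_cores: vec![CoreOccupied::Paras(1000)],
        claim_queue: BTreeMap::from([(0, VecDeque::from([1000]))]),
    };
    println!("first run freed {}", force_free_cores(&mut state));
    println!("second run freed {}", force_free_cores(&mut state));
}
```

Because the function only writes when a core is occupied and clearing an empty queue changes nothing, running it again on an already-clean state leaves that state untouched.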
Summary
Cherry-picks of v1.13+ upstream fixes for async-backing inclusion stalls, adapted to v1.12.0 subsystem APIs.
Status: PARKED / for reference. Compiles clean, deploys clean, but does NOT fully resolve the v1.12.0 parachain inclusion stall we hit on mini-forknet. The underlying fix is upstream #4937 prospective-parachains rework (first in stable2409), which is a ~3000 LoC subsystem rewrite not cherry-pickable without rebasing `polkadot/node/core/prospective-parachains` onto stable2409 HEAD.
Commits
- `b036e8ea` (PR #4727): statement-distribution: Fix false warning
- `6c05e873` (PR #4471): Remove the prospective-parachains subsystem from collators
- `6714c540` (PR #4724): Fix core sharing and make use of scheduling_lookahead
- `7c807f12`: adapter commit for v1.12.0 API gaps:
  - `polkadot_statement_table` → local `statement-table` alias
  - `polkadot_primitives::{…}` → local `primitives` alias (scheduler.rs, runtime_api_impl/vstaging.rs)
  - `AvailabilityStoreMessage::StoreAvailableData::{core_index, node_features}` dropped at send boundary (systematic-chunks PR #1644 not cherry-picked: too large, 7540 LoC)
  - `free_cores_and_fill_claimqueue` → `free_cores_and_fill_claim_queue` rename reconciled
Why this was tried
Forknet-small topology (3 validators, 1 group, 1 core) deadlocks the v1.12.0 fragment-chain under async-backing. We hoped the v1.13 backports would unblock it without pulling paritytech#4937 wholesale.
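The deadlock mechanism can be sketched with two tiny functions. This is an illustrative model, not the subsystem's code: `FragmentChainModel` is a hypothetical type that mimics only the "same parent head hash means fork" rule attributed to `is_fork_or_cycle` in the linked commit message, and `can_build_upon` mimics only the `size_after_included >= C` capacity check described there.

```rust
use std::collections::HashSet;

type Hash = u64;

/// Hypothetical model of the pre-#4937 rule: a candidate is rejected if
/// another candidate with the same parent head hash is already known.
#[derive(Default)]
struct FragmentChainModel {
    parents_seen: HashSet<Hash>,
}

impl FragmentChainModel {
    /// Returns true if the candidate is accepted, false if treated as a fork.
    fn try_add(&mut self, parent_head_hash: Hash) -> bool {
        // `insert` returns false when the parent hash is already present.
        self.parents_seen.insert(parent_head_hash)
    }
}

/// Simplified capacity gate: authoring is refused once the unincluded
/// segment has reached `capacity` blocks.
fn can_build_upon(size_after_included: u32, capacity: u32) -> bool {
    !(size_after_included >= capacity)
}

fn main() {
    let mut chain = FragmentChainModel::default();
    let parent: Hash = 0xAA;
    println!("first candidate accepted: {}", chain.try_add(parent));
    println!("re-authored candidate accepted: {}", chain.try_add(parent));

    // One unincluded block already exists.
    println!("capacity 2 allows another block: {}", can_build_upon(1, 2));
    println!("capacity 1 allows another block: {}", can_build_upon(1, 1));
}
```

In this model a re-authored block on the same parent is always rejected, while a capacity of 1 stops the collator from producing that second block in the first place.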
Outcome
Empirically: para still stalls at ~15 blocks. Backports help a bit (~6 inclusions happen before stall vs 0-1 on plain voterfix) but don't reach steady state. Only stable2512 rootchain setCode unblocks fully — as documented in CLAUDE.md.
Prod implication
Do NOT ship this branch to prod. Prod upgrade plan stays: v0.9.40 → v1.12.0 → stable2512 back-to-back in one maintenance window, accepting the ~5 min transient inclusion gap between setCodes. Chain self-heals once stable2512 lands (verified on mini-forknet: stuck-at-15 → advancing past 26 within seconds of stable2512 setCode).
Test plan
- `cargo check -p polkadot -p polkadot-service -p polkadot-runtime-parachains -p thxnet-testnet-runtime` clean
- `cargo build --release -p polkadot -p thxnet-testnet-runtime -p general-runtime -p thxnet-leafchain` all succeed

🤖 Generated with Claude Code