fix(thxnet,thxnet-testnet): unify EnableAsyncBackingAndCoretime — close mainnet async-backing gap #37
Merged
kumanoko24 merged 1 commit on May 5, 2026
…pology-aware + atomic with setCode

Brings mainnet (`thxnet`) to async-backing parity with testnet by adding the `EnableAsyncBackingAndCoretime` migration, and converges both runtimes onto a single best-of-both-worlds variant.

## Why

Pre-this-PR, `release/v1.12.0` ships:

- testnet @ spec 112_000_004: has `EnableAsyncBackingAndCoretime` (always-set semantics + AvailabilityCores force-free + ClaimQueue::kill)
- mainnet @ spec 112_000_001: has NO async-backing migration at all, only the runtime API binding

Mainnet rolling out v1.12.0 in this state would leave the relay-side async-backing config at v0.9.40-era defaults (zero `num_cores`, `async_backing_params=(0,0)`, `node_features[3]=false`). v1.12.0+ cumulus collators advertise CandidateReceiptV2; without bit 3 the relay rejects them with `BlockedByBacking`, freezing all 4 mainnet leafchains (avatect-mainnet, lmt-mainnet, ecq-mainnet, thx-mainnet) post-upgrade until a follow-up runtime upgrade lands.

## Unified design (single source of truth)

Both runtimes now share the same migration body. Topology-aware writes (from W1 `fix/async-backing-migration-prod-safety`) preserve operator-set values; atomic-with-setCode defense (from PR #30 inherited into testnet) force-frees stuck cores so the next ParaInherent pass schedules without waiting for session rotation.

Topology-aware:

- `num_cores`: set to `max(para_count,1)` only if currently 0
- `max_vals/core`: `Some(5)` only if `None && active_validators >= 15 && num_cores >= 3`; otherwise leave
- `lookahead`: ensure >= 1
- `async_backing_params`: ensure (depth>=1, ancestry>=2)
- `node_features`: always force bits 0,1,3

Atomic-with-setCode:

- `AvailabilityCores::mutate(|cores| *core = Free for all)`
- `ClaimQueue::kill()`

## Live evidence (try-runtime on-runtime-upgrade against archive RPC)

testnet (wss://node.testnet.thxnet.org):

- num_cores=5, max_vals_per_core=Some(5), lookahead=1, async_backing=(1,2), node_features[0,1,3]=true, AvailabilityCores freed, ClaimQueue cleared, active_validators=19
- → topology rule fires (19 >= 15, 5 >= 3) → Some(5)
- → 60 try-state checks PASS, exit 0, idempotent (4 re-runs identical)

mainnet (wss://node.mainnet.thxnet.org):

- num_cores=4, max_vals_per_core=Some(5), lookahead=1, async_backing=(1,2), node_features[0,1,3]=true, AvailabilityCores freed, ClaimQueue cleared, active_validators=16
- → topology rule fires (16 >= 15, 4 >= 3) → Some(5)
- → 60 try-state checks PASS, exit 0, idempotent (4 re-runs identical)

## spec_version bumps

- `thxnet`: 112_000_001 → 112_000_002 (new migration added)
- `thxnet-testnet`: 112_000_004 → 112_000_005 (migration body replaced)

## Operator runbook (post-setCode)

`kubectl rollout restart deploy/validator-*` is REQUIRED after the runtime upgrade applies. The migration force-frees `AvailabilityCores` in storage but relay-client subsystems (prospective-parachains / fragment-chain / SessionInfo) cache scheduler state per session in validator-process memory. Without restart, those caches pin the stale state until the next real session boundary.

THXLAB AI Team
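The topology-aware rules listed in the commit message can be read as a pure function over the configuration. The following is a minimal sketch under stated assumptions, not the migration's actual code: `SchedulerConfig` and `apply_topology_aware` are hypothetical names, the struct is a simplified stand-in for the real host configuration, and it assumes the `max_validators_per_core` rule reads `num_cores` after the `num_cores` write (which matches the reported evidence of `4 >= 3` on mainnet).

```rust
/// Simplified stand-in for the host-configuration fields named in the commit
/// message. Hypothetical type, not the real runtime struct.
#[derive(Debug, Clone, PartialEq)]
struct SchedulerConfig {
    num_cores: u32,
    max_validators_per_core: Option<u32>,
    lookahead: u32,
    max_candidate_depth: u32,
    allowed_ancestry_len: u32,
    /// One bool per node-feature bit (index = bit number).
    node_features: Vec<bool>,
}

/// Apply the topology-aware rules. Values an operator already set are kept;
/// only unset or too-low values are raised.
fn apply_topology_aware(
    mut cfg: SchedulerConfig,
    para_count: u32,
    active_validators: u32,
) -> SchedulerConfig {
    // num_cores: only filled in when it is still 0.
    if cfg.num_cores == 0 {
        cfg.num_cores = para_count.max(1);
    }
    // Assumption: this rule sees num_cores after the write above.
    if cfg.max_validators_per_core.is_none()
        && active_validators >= 15
        && cfg.num_cores >= 3
    {
        cfg.max_validators_per_core = Some(5);
    }
    // Lower bounds: raise if too small, otherwise leave untouched.
    cfg.lookahead = cfg.lookahead.max(1);
    cfg.max_candidate_depth = cfg.max_candidate_depth.max(1);
    cfg.allowed_ancestry_len = cfg.allowed_ancestry_len.max(2);
    // node_features: bits 0, 1 and 3 are always forced on.
    if cfg.node_features.len() < 4 {
        cfg.node_features.resize(4, false);
    }
    for bit in [0usize, 1, 3] {
        cfg.node_features[bit] = true;
    }
    cfg
}

fn main() {
    // Old-era defaults as described in the "Why" section.
    let defaults = SchedulerConfig {
        num_cores: 0,
        max_validators_per_core: None,
        lookahead: 0,
        max_candidate_depth: 0,
        allowed_ancestry_len: 0,
        node_features: vec![],
    };
    // Mainnet-like topology: 4 paras, 16 active validators.
    let once = apply_topology_aware(defaults, 4, 16);
    // Running the rules again must not change anything (idempotence).
    let twice = apply_topology_aware(once.clone(), 4, 16);
    assert_eq!(once, twice);
    println!("{:?}", once);
}
```

Writing each rule as "raise only if unset or too low" is what makes a re-run a no-op, which is the property the try-runtime evidence reports as idempotent.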
kumanoko24 added a commit that referenced this pull request on May 7, 2026
…script

Path B partial success: patched genesis ParasShared::ActiveValidatorKeys storage to 15 entries (3 real + 13 fake) and registered 3 paraIds. Migration log line confirmed:

EnableAsyncBackingAndCoretime: num_cores=3, max_vals_per_core=Some(5), active_validators=15, node_features[0,1,3]=true, ...

This is the first time the migration's topology rule has fired in the small forknet; it proves PR #37 logic is correct.

However, para 6s/block is still not achievable with fake validators because backing quorum (majority of group_size=5 = 3 online votes) requires actual online validators matching group assignment. Forknet has 3 online distributed across 3 groups → no group reaches quorum → cumulus UnincludedSegment fills → para stuck.

Production reality: 16-19 ALL-online validators per testnet/mainnet will have full quorum in every group → 6s/block engages naturally post-rollout (already validated by PR #37 try-runtime live runs).

Adds polkadot/scripts/forknet/patch-avk-then-setcode.ts as a reusable helper for future rehearsals that need to patch storage at runtime before triggering setCode (sudo.system.setStorage + sudo.setCodeWithoutChecks in sequence).

W1/W2/W4 drift PASS throughout.

THXLAB AI Team
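The quorum argument in the commit message above is simple arithmetic and can be checked with a small sketch. It takes the message's own description of quorum (a majority of the group) at face value; `quorum` and `groups_with_quorum` are hypothetical helper names, and the per-group online counts are illustrative inputs rather than measured data.

```rust
/// Backing quorum as described in the commit message: a simple majority
/// of the group (group_size = 5 gives 3).
fn quorum(group_size: usize) -> usize {
    group_size / 2 + 1
}

/// Count the groups whose online members reach the quorum.
fn groups_with_quorum(online_per_group: &[usize], group_size: usize) -> usize {
    online_per_group
        .iter()
        .filter(|&&online| online >= quorum(group_size))
        .count()
}

fn main() {
    // Forknet case from the message: 3 online validators spread over
    // 3 groups of 5, so each group has a single online member.
    let forknet = [1, 1, 1];
    // Illustrative all-online case: every member of every group is up.
    let all_online = [5, 5, 5];

    println!("quorum for group of 5: {}", quorum(5));
    println!("forknet groups at quorum: {}", groups_with_quorum(&forknet, 5));
    println!("all-online groups at quorum: {}", groups_with_quorum(&all_online, 5));
}
```

With one online validator per group, no group can reach 3 votes, which is the stall the message describes; it says nothing about the migration logic itself.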
kumanoko24 added a commit that referenced this pull request on May 8, 2026
Capture production-faithful upgrade rehearsal evidence for release/v1.12.0 (`6b7ee05aea`) on freshly-rsynced testnet livenet seed.

REPORT-rehearsal-2026-05-06.md (Path A flow):

- P6.4-equivalent cumulus 2-step setCode mechanics PASS
- Setup with v1.12.0 polkadot fork-genesis + leafchain --chain=dev

REPORT-rehearsal-v5-2026-05-07.md (Path E.1 + E.2 flows):

- Path E.1 (dev para): production-faithful flow with OLD polkadot fork-genesis + v1.12.0 binary boot via WASM execution + real setCode → spec 94000004 → 112000005 → EnableAsyncBackingAndCoretime migration log line confirmed
- Path E.2 (livenet sand-testnet para): all 4 upgrade dimensions exercised (rootchain binary swap, rootchain runtime upgrade, leafchain binary swap, leafchain runtime upgrade)
- Storage delta verified via state_getStorage decode: async_backing_params=(depth=1, ancestry=2), scheduler_params.lookahead=1, node_features[0,1,3]=true (CandidateReceiptV2 acceptance, the critical mainnet unfreeze fix), AvailabilityCores cleared, ClaimQueue cleared
- HostConfiguration v0.9.x → v1.12.0 layout migration succeeded (no decode panic post-transition)
- Forknet topology limitations documented (para 6s/block gated on 15+ vals × 3+ cores; v0.3.3 capacity=2 stall blocks Phase 4 v0.3.3→v1.12.0 transition in small forknet; production topology has 16-19 validators, where this is a non-issue)

Combined with PR #37 try-runtime live evidence, release/v1.12.0 is production-rollout-ready for testnet rollout. Mainnet rollout is pending mainnet seed DB acquisition for an analogous rehearsal.

THXLAB AI Team
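The `node_features[0,1,3]=true` part of the storage-delta check above amounts to testing individual bits. A minimal sketch follows, assuming the feature bits have already been decoded into raw bytes with least-significant-bit-first ordering inside each byte; that layout is an assumption of this example, and `bit_is_set` and `required_bits_set` are hypothetical helpers, not part of any report tooling.

```rust
/// Return true if bit `index` is set, assuming least-significant-bit-first
/// ordering within each byte. Out-of-range indices count as unset.
fn bit_is_set(bytes: &[u8], index: usize) -> bool {
    bytes
        .get(index / 8)
        .map_or(false, |byte| (*byte >> (index % 8)) & 1 == 1)
}

/// Check the three bits the migration is reported to force on.
fn required_bits_set(bytes: &[u8]) -> bool {
    [0usize, 1, 3].iter().all(|&bit| bit_is_set(bytes, bit))
}

fn main() {
    // 0b0000_1011 has bits 0, 1 and 3 set under the assumed ordering.
    let after_migration = [0b0000_1011u8];
    // Illustrative empty value: no feature bits present at all.
    let before_migration: [u8; 0] = [];

    println!("after:  {}", required_bits_set(&after_migration));
    println!("before: {}", required_bits_set(&before_migration));
}
```

Treating out-of-range bits as unset means a too-short or empty value fails the check rather than panicking, which suits a verification step run against unknown pre-upgrade state.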
This was referenced May 8, 2026
## Summary

Closes the mainnet async-backing gap in `release/v1.12.0` and converges mainnet (`thxnet`) + testnet (`thxnet-testnet`) onto a single best-of-both-worlds `EnableAsyncBackingAndCoretime` migration.

Why now: before this PR, `release/v1.12.0` ships mainnet @ spec `112_000_001` with no async-backing migration at all, only the runtime API getter binding. Rolling out v1.12.0 in this state would leave the relay-side scheduler config at v0.9.40 defaults: zero `num_cores`, `async_backing_params=(0,0)`, `node_features[3]=false`. v1.12.0+ cumulus collators advertise CandidateReceiptV2; without bit 3 the relay rejects them with `BlockedByBacking`, freezing all 4 mainnet leafchains (avatect-mainnet, lmt-mainnet, ecq-mainnet, thx-mainnet) post-upgrade until a follow-up runtime upgrade lands.

Testnet was already protected via PR #30's `EnableAsyncBackingAndCoretime` (always-set semantics + atomic `AvailabilityCores` force-free + `ClaimQueue::kill`). This PR brings mainnet to parity AND upgrades both runtimes to a topology-aware variant that preserves operator-set values.

## Pre/Post diff

- `thxnet` (mainnet): `112_000_001`, no async-backing migration → `112_000_002`, has unified migration
- `thxnet-testnet`: `112_000_004`, always-set + atomic → `112_000_005`, topology-aware + atomic
- `general-runtime` (×9 leafchains): 21, complete

## Unified migration body (single source of truth, identical in both runtimes)
Topology-aware writes (preserve any values operators have set via governance):

- `num_cores`: `max(para_count, 1)` only if currently `0`
- `max_validators_per_core`: `Some(5)` only if `None` AND `active_validators ≥ 15` AND `num_cores ≥ 3`; otherwise leave unchanged
- `lookahead`: ≥ 1
- `async_backing_params.max_candidate_depth`: ≥ 1
- `async_backing_params.allowed_ancestry_len`: ≥ 2
- `node_features`: bits `[0,1,3]` always forced on

Atomic-with-setCode defense (closes the timing window where stuck cores from the prior session linger past the upgrade):

- `parachains_scheduler::AvailabilityCores`: every core set to `CoreOccupied::Free`
- `parachains_scheduler::ClaimQueue`: `kill()` so the next block's `free_cores_and_fill_claimqueue` rebuilds it

## Live evidence (try-runtime against archive RPC, all PASS)
- testnet (`wss://node.testnet.thxnet.org/archive-001/ws`): `19 ≥ 15` AND `5 ≥ 3` → `Some(5)` ✓
- mainnet (`wss://node.mainnet.thxnet.org/archive-001/ws`): `16 ≥ 15` AND `4 ≥ 3` → `Some(5)` ✓

## Operator runbook (post-setCode)
`kubectl rollout restart deploy/validator-*` is REQUIRED after the runtime upgrade applies. The migration force-frees `AvailabilityCores` in storage, but relay-client subsystems (prospective-parachains / fragment-chain / SessionInfo) cache scheduler state per session in validator-process memory. Without restart, those caches pin the stale state until the next real session boundary.

## Test plan

- `Cargo nextest`, `Build binaries`, `Build try-runtime & fast-runtime`, all 11 `try-runtime (*)` per-chain checks
- `Rustfmt`, `TOML format`, `Feature propagation` (zepter), `Feature alignment`
- `cargo build --release -p polkadot` PASS (17m 09s, this PR)
- `taplo format --check` PASS, `zepter run check` PASS
- `on-runtime-upgrade --checks=all` against live testnet PASS (this PR, evidence above)
- `on-runtime-upgrade --checks=all` against live mainnet PASS (this PR, evidence above)

## Things explicitly NOT in this PR
- `release/v1.12.0` (`ClearStaleHostConfiguration`, `cumulus_pallet_parachain_system::Migration` v2→v3, `UNINCLUDED_SEGMENT_CAPACITY=1`)
- `fix/async-backing-migration-prod-safety`)

THXLAB AI Team