fix(thxnet,thxnet-testnet): unify EnableAsyncBackingAndCoretime — close mainnet async-backing gap by kumanoko24 · Pull Request #37 · thxnet/thxnet-sdk

kumanoko24 · 2026-05-05T13:43:29Z

Summary

Closes the mainnet async-backing gap in release/v1.12.0 and converges mainnet (thxnet) + testnet (thxnet-testnet) onto a single best-of-both-worlds EnableAsyncBackingAndCoretime migration.

Why now: before this PR, release/v1.12.0 ships mainnet @ spec 112_000_001 with no async-backing migration at all — only the runtime API getter binding. Rolling out v1.12.0 in this state would leave the relay-side scheduler config at v0.9.40 defaults: zero num_cores, async_backing_params=(0,0), node_features[3]=false. v1.12.0+ cumulus collators advertise CandidateReceiptV2; without bit 3 the relay rejects them with BlockedByBacking, freezing all 4 mainnet leafchains (avatect-mainnet, lmt-mainnet, ecq-mainnet, thx-mainnet) post-upgrade until a follow-up runtime upgrade lands.

testnet was already protected via PR #30's EnableAsyncBackingAndCoretime (always-set semantics + atomic AvailabilityCores force-free + ClaimQueue::kill). This PR brings mainnet to parity AND upgrades both runtimes to a topology-aware variant that preserves operator-set values.

Pre/Post diff

| Runtime | Before | After |
|---|---|---|
| `thxnet` (mainnet) | spec `112_000_001`, no async-backing migration | spec `112_000_002`, has unified migration |
| `thxnet-testnet` | spec `112_000_004`, always-set + atomic | spec `112_000_005`, topology-aware + atomic |
| `general-runtime` (×9 leafchains) | spec `21`, complete | unchanged |

Unified migration body (single source of truth, identical in both runtimes)

Topology-aware writes, which preserve any values operators have already set via governance:

| Field | Rule |
|---|---|
| `num_cores` | set to `max(para_count, 1)` only if currently `0` |
| `max_validators_per_core` | `Some(5)` only if `None` AND `active_validators ≥ 15` AND `num_cores ≥ 3`; otherwise leave unchanged |
| `lookahead` | ensure `≥ 1` |
| `async_backing_params.max_candidate_depth` | ensure `≥ 1` |
| `async_backing_params.allowed_ancestry_len` | ensure `≥ 2` |
| `node_features[0,1,3]` | always force-on (bit 3, CandidateReceiptV2, is the critical one) |
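
The rules in the table can be sketched as a pure function over a simplified configuration struct. This is a minimal illustration, not the migration's actual code: `SchedulerConfig` and `apply_topology_aware` are hypothetical names, the struct is a stand-in for the real host configuration type, and the sketch assumes the `num_cores ≥ 3` check runs after `num_cores` has been defaulted (which is consistent with the mainnet try-runtime output, where a previously zero `num_cores` becomes 4 and the rule still fires).

```rust
/// Hypothetical, simplified stand-in for the host configuration fields the
/// migration touches. Not the real `HostConfiguration` type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SchedulerConfig {
    pub num_cores: u32,
    pub max_validators_per_core: Option<u32>,
    pub lookahead: u32,
    pub max_candidate_depth: u32,
    pub allowed_ancestry_len: u32,
    /// `node_features[i]` as a plain bool vector, index 0 first.
    pub node_features: Vec<bool>,
}

/// Applies the topology-aware rules from the table (illustrative only).
pub fn apply_topology_aware(cfg: &mut SchedulerConfig, para_count: u32, active_validators: u32) {
    // num_cores: fill in only when unset (0), so operator-set values survive.
    if cfg.num_cores == 0 {
        cfg.num_cores = para_count.max(1);
    }
    // max_validators_per_core: only when unset AND the topology is large enough.
    // Assumption: the check sees the num_cores value written just above.
    if cfg.max_validators_per_core.is_none() && active_validators >= 15 && cfg.num_cores >= 3 {
        cfg.max_validators_per_core = Some(5);
    }
    // "ensure >= n" rules: raise to the floor, never lower an existing value.
    cfg.lookahead = cfg.lookahead.max(1);
    cfg.max_candidate_depth = cfg.max_candidate_depth.max(1);
    cfg.allowed_ancestry_len = cfg.allowed_ancestry_len.max(2);
    // node_features bits 0, 1 and 3 are always forced on.
    if cfg.node_features.len() < 4 {
        cfg.node_features.resize(4, false);
    }
    for bit in [0usize, 1, 3] {
        cfg.node_features[bit] = true;
    }
}
```

Because every rule either fills an unset value or raises a value to a floor, applying the function a second time changes nothing, which matches the idempotency the try-runtime runs report.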

Atomic-with-setCode defense, which closes the timing window in which cores stuck in the prior session linger past the upgrade:

| Action | Storage |
|---|---|
| Force every entry to `CoreOccupied::Free` | `parachains_scheduler::AvailabilityCores` |
| `kill()` so next block's `free_cores_and_fill_claimqueue` rebuilds | `parachains_scheduler::ClaimQueue` |
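
The two storage actions can be pictured with plain in-memory types. This is a rough sketch under stated assumptions: `SchedulerStorage`, `force_free_and_clear`, and the shape of the `CoreOccupied` enum here are hypothetical simplifications, not the pallet's real storage items or types.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical simplification of a core's occupancy state.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CoreOccupied {
    Free,
    /// Occupied by the para with this id.
    Paras(u32),
}

/// In-memory stand-in for the two scheduler storage items named in the table.
#[derive(Debug, Default)]
pub struct SchedulerStorage {
    /// One entry per core.
    pub availability_cores: Vec<CoreOccupied>,
    /// Core index -> queued para ids.
    pub claim_queue: BTreeMap<u32, VecDeque<u32>>,
}

/// Frees every core and empties the claim queue in one step, mirroring the
/// "atomic with setCode" intent: both writes happen in the same block as the
/// runtime upgrade, so no stale occupancy survives into the next block.
pub fn force_free_and_clear(storage: &mut SchedulerStorage) {
    // Keep the number of cores, reset only their state.
    for core in storage.availability_cores.iter_mut() {
        *core = CoreOccupied::Free;
    }
    // Equivalent in spirit to `kill()`: the queue is rebuilt from scratch later.
    storage.claim_queue.clear();
}
```

Note that the core count is preserved; only occupancy is reset, leaving the scheduler to repopulate the queue on the next block.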

Live evidence (try-runtime against archive RPC, all PASS)

testnet — wss://node.testnet.thxnet.org/archive-001/ws:

EnableAsyncBackingAndCoretime: num_cores=5, max_vals_per_core=Some(5), lookahead=1,
async_backing=(depth=1, ancestry=2), node_features[0,1,3]=true,
AvailabilityCores freed, ClaimQueue cleared, active_validators=19

topology rule fires: 19 ≥ 15 AND 5 ≥ 3 → Some(5) ✓
60 try-state pallet checks PASS, exit 0
idempotent: 4 re-runs produce identical output

mainnet — wss://node.mainnet.thxnet.org/archive-001/ws:

EnableAsyncBackingAndCoretime: num_cores=4, max_vals_per_core=Some(5), lookahead=1,
async_backing=(depth=1, ancestry=2), node_features[0,1,3]=true,
AvailabilityCores freed, ClaimQueue cleared, active_validators=16

topology rule fires: 16 ≥ 15 AND 4 ≥ 3 → Some(5) ✓
60 try-state pallet checks PASS, exit 0
idempotent: 4 re-runs produce identical output

Operator runbook (post-setCode)

kubectl rollout restart deploy/validator-* is REQUIRED after the runtime upgrade applies. The migration force-frees AvailabilityCores in storage, but relay-client subsystems (prospective-parachains / fragment-chain / SessionInfo) cache scheduler state per session in validator-process memory. Without restart, those caches pin the stale state until the next real session boundary.

Test plan

CI: Cargo nextest, Build binaries, Build try-runtime & fast-runtime, all 11 try-runtime (*) per-chain checks
CI: Rustfmt, TOML format, Feature propagation (zepter), Feature alignment
Local: cargo build --release -p polkadot PASS (17m 09s, this PR)
Local: taplo format --check PASS, zepter run check PASS
Local: try-runtime on-runtime-upgrade --checks=all against live testnet PASS (this PR, evidence above)
Local: try-runtime on-runtime-upgrade --checks=all against live mainnet PASS (this PR, evidence above)

Things explicitly NOT in this PR

general-runtime / leafchain runtime changes — leafchain side is already complete on release/v1.12.0 (ClearStaleHostConfiguration, cumulus_pallet_parachain_system::Migration v2→v3, UNINCLUDED_SEGMENT_CAPACITY=1)
stable2512 substrate adaptations — out of scope for v1.12.0 rollout (parked on fix/async-backing-migration-prod-safety)
Forknet rehearsal of the new mainnet migration — fork-genesis CLI works against testnet runtime, not yet against mainnet runtime; would need a separate harness change

THXLAB AI Team

…pology-aware + atomic with setCode

Brings mainnet (`thxnet`) to async-backing parity with testnet by adding the `EnableAsyncBackingAndCoretime` migration, and converges both runtimes onto a single best-of-both-worlds variant.

## Why

Pre-this-PR, `release/v1.12.0` ships:

- testnet @ spec 112_000_004: has `EnableAsyncBackingAndCoretime` (always-set semantics + AvailabilityCores force-free + ClaimQueue::kill)
- mainnet @ spec 112_000_001: has NO async-backing migration at all, only the runtime API binding

Mainnet rolling out v1.12.0 in this state would leave the relay-side async-backing config at v0.9.40-era defaults (zero `num_cores`, `async_backing_params=(0,0)`, `node_features[3]=false`). v1.12.0+ cumulus collators advertise CandidateReceiptV2; without bit 3 the relay rejects them with `BlockedByBacking`, freezing all 4 mainnet leafchains (avatect-mainnet, lmt-mainnet, ecq-mainnet, thx-mainnet) post-upgrade until a follow-up runtime upgrade lands.

## Unified design (single source of truth)

Both runtimes now share the same migration body. Topology-aware writes (from W1 `fix/async-backing-migration-prod-safety`) preserve operator-set values; atomic-with-setCode defense (from PR #30 inherited into testnet) force-frees stuck cores so the next ParaInherent pass schedules without waiting for session rotation.

Topology-aware:

- `num_cores` : set to `max(para_count,1)` only if currently 0
- `max_vals/core` : `Some(5)` only if `None && active_validators >= 15 && num_cores >= 3`; otherwise leave
- `lookahead` : ensure >= 1
- `async_backing_params` : ensure (depth>=1, ancestry>=2)
- `node_features` : always force bits 0,1,3

Atomic-with-setCode:

- `AvailabilityCores::mutate(|cores| *core = Free for all)`
- `ClaimQueue::kill()`

## Live evidence (try-runtime on-runtime-upgrade against archive RPC)

testnet (wss://node.testnet.thxnet.org):
num_cores=5, max_vals_per_core=Some(5), lookahead=1, async_backing=(1,2), node_features[0,1,3]=true, AvailabilityCores freed, ClaimQueue cleared, active_validators=19
→ topology rule fires (19 >= 15, 5 >= 3) → Some(5)
→ 60 try-state checks PASS, exit 0, idempotent (4 re-runs identical)

mainnet (wss://node.mainnet.thxnet.org):
num_cores=4, max_vals_per_core=Some(5), lookahead=1, async_backing=(1,2), node_features[0,1,3]=true, AvailabilityCores freed, ClaimQueue cleared, active_validators=16
→ topology rule fires (16 >= 15, 4 >= 3) → Some(5)
→ 60 try-state checks PASS, exit 0, idempotent (4 re-runs identical)

## spec_version bumps

- `thxnet` : 112_000_001 → 112_000_002 (new migration added)
- `thxnet-testnet` : 112_000_004 → 112_000_005 (migration body replaced)

## Operator runbook (post-setCode)

`kubectl rollout restart deploy/validator-*` is REQUIRED after the runtime upgrade applies. The migration force-frees `AvailabilityCores` in storage but relay-client subsystems (prospective-parachains / fragment-chain / SessionInfo) cache scheduler state per session in validator-process memory. Without restart, those caches pin the stale state until the next real session boundary.

THXLAB AI Team

…script

Path B partial success: patched genesis ParasShared::ActiveValidatorKeys storage to 15 entries (3 real + 12 fake) and registered 3 paraIds. Migration log line confirmed:

EnableAsyncBackingAndCoretime: num_cores=3, max_vals_per_core=Some(5), active_validators=15, node_features[0,1,3]=true, ...

This is the first time the migration's topology rule has fired in the small forknet, which supports the correctness of the PR #37 logic.

However, para 6s/block is still not achievable with fake validators, because backing quorum (a majority of group_size=5, i.e. 3 online votes) requires actual online validators matching the group assignment. The forknet has 3 online validators distributed across 3 groups → no group reaches quorum → cumulus UnincludedSegment fills → para stuck.

Production reality: with 16-19 ALL-online validators on testnet/mainnet, every group will have full quorum → 6s/block engages naturally post-rollout (already validated by PR #37 try-runtime live runs).

Adds polkadot/scripts/forknet/patch-avk-then-setcode.ts as a reusable helper for future rehearsals that need to patch storage at runtime before triggering setCode (sudo.system.setStorage + sudo.setCodeWithoutChecks in sequence).

W1/W2/W4 drift PASS throughout.

THXLAB AI Team
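
The quorum argument in the commit message above can be checked with a few lines of arithmetic. This sketch takes the commit message's stated rule (a simple majority of the backing group) at face value; the real threshold is governed by relay-chain configuration and may differ. `majority_quorum` and `groups_reaching_quorum` are hypothetical helper names.

```rust
/// Simple-majority threshold for a backing group, as described in the
/// commit message (assumption: strict majority of `group_size`).
pub fn majority_quorum(group_size: usize) -> usize {
    group_size / 2 + 1
}

/// Counts how many groups have enough online validators to reach quorum.
/// `online_per_group[i]` is the number of online validators in group `i`.
pub fn groups_reaching_quorum(online_per_group: &[usize], group_size: usize) -> usize {
    let quorum = majority_quorum(group_size);
    online_per_group.iter().filter(|&&online| online >= quorum).count()
}
```

With `group_size = 5` the threshold is 3. Three online validators spread one per group (`[1, 1, 1]`) leave zero groups at quorum, which is the forknet stall described; groups whose members are all online clear the threshold trivially.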

Capture production-faithful upgrade rehearsal evidence for release/v1.12.0 (`6b7ee05aea`) on freshly-rsynced testnet livenet seed.

REPORT-rehearsal-2026-05-06.md (Path A flow):

- P6.4-equivalent cumulus 2-step setCode mechanics PASS
- Setup with v1.12.0 polkadot fork-genesis + leafchain --chain=dev

REPORT-rehearsal-v5-2026-05-07.md (Path E.1 + E.2 flows):

- Path E.1 (dev para): production-faithful flow with OLD polkadot fork-genesis + v1.12.0 binary boot via WASM execution + real setCode → spec 94000004 → 112000005 → EnableAsyncBackingAndCoretime migration log line confirmed
- Path E.2 (livenet sand-testnet para): all 4 upgrade dimensions exercised (rootchain binary swap, rootchain runtime upgrade, leafchain binary swap, leafchain runtime upgrade)
- Storage delta verified via state_getStorage decode: async_backing_params=(depth=1, ancestry=2), scheduler_params.lookahead=1, node_features[0,1,3]=true (CandidateReceiptV2 acceptance, the critical mainnet unfreeze fix), AvailabilityCores cleared, ClaimQueue cleared
- HostConfiguration v0.9.x → v1.12.0 layout migration succeeded (no decode panic post-transition)
- Forknet topology limitations documented (para 6s/block gated on 15+ vals × 3+ cores; v0.3.3 capacity=2 stall blocks Phase 4 v0.3.3→v1.12.0 transition in small forknet; production topology has 16-19 validators where this is non-issue)

Combined with PR #37 try-runtime live evidence, release/v1.12.0 is production-rollout-ready for testnet rollout. Mainnet rollout pending mainnet seed DB acquisition for analogous rehearsal.

THXLAB AI Team
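
When checking `node_features[0,1,3]=true` against raw storage bytes, a small helper like the following may be useful. It is a sketch under explicit assumptions: the bit vector is taken to be least-significant-bit-first within each byte, and the caller is assumed to have already stripped any SCALE length prefix, so `bytes` holds only the packed bits. `bit_is_set` and `required_features_enabled` are hypothetical names.

```rust
/// Returns true when bit `index` is set.
/// Assumption: bits are packed least-significant-bit-first within each byte.
pub fn bit_is_set(bytes: &[u8], index: usize) -> bool {
    match bytes.get(index / 8) {
        Some(byte) => ((*byte >> (index % 8)) & 1) == 1,
        // Bits beyond the end of the buffer count as unset.
        None => false,
    }
}

/// Checks the three feature bits the migration forces on (0, 1 and 3).
pub fn required_features_enabled(bytes: &[u8]) -> bool {
    [0usize, 1, 3].iter().all(|&i| bit_is_set(bytes, i))
}
```

Under these assumptions, a single payload byte of `0b0000_1011` has bits 0, 1 and 3 set, while `0b0000_0011` is missing bit 3 and would fail the check.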

kumanoko24 merged commit 6b7ee05 into release/v1.12.0 May 5, 2026
32 of 42 checks passed

kumanoko24 mentioned this pull request May 7, 2026

feat(node): backport polkadot-sdk#4937 (prospective-parachains rework) + 6-val dev set #38

Merged

3 tasks

This was referenced May 8, 2026

ops(v1.12.0): operator handoff + hardened forknet/setcode scripts + chopsticks configs #40

Merged

fix(v1.12.0): silence spurious node_features bit-3 warning #41

Merged

kumanoko24 merged 1 commit into release/v1.12.0 from fix/v1.12.0-mainnet-async-backing-migration
