fix(thxnet,thxnet-testnet): unify EnableAsyncBackingAndCoretime — close mainnet async-backing gap#37

Merged: kumanoko24 merged 1 commit into release/v1.12.0 from fix/v1.12.0-mainnet-async-backing-migration on May 5, 2026

Summary

Closes the mainnet async-backing gap in release/v1.12.0 and converges mainnet (thxnet) + testnet (thxnet-testnet) onto a single best-of-both-worlds EnableAsyncBackingAndCoretime migration.

Why now: before this PR, release/v1.12.0 ships mainnet @ spec 112_000_001 with no async-backing migration at all — only the runtime API getter binding. Rolling out v1.12.0 in this state would leave the relay-side scheduler config at v0.9.40 defaults: zero num_cores, async_backing_params=(0,0), node_features[3]=false. v1.12.0+ cumulus collators advertise CandidateReceiptV2; without bit 3 the relay rejects them with BlockedByBacking, freezing all 4 mainnet leafchains (avatect-mainnet, lmt-mainnet, ecq-mainnet, thx-mainnet) post-upgrade until a follow-up runtime upgrade lands.

testnet was already protected via PR #30's EnableAsyncBackingAndCoretime (always-set semantics + atomic AvailabilityCores force-free + ClaimQueue::kill). This PR brings mainnet to parity AND upgrades both runtimes to a topology-aware variant that preserves operator-set values.

Pre/Post diff

| Runtime | Before | After |
|---|---|---|
| thxnet (mainnet) | spec 112_000_001, no async-backing migration | spec 112_000_002, has unified migration |
| thxnet-testnet | spec 112_000_004, always-set + atomic | spec 112_000_005, topology-aware + atomic |
| general-runtime (×9 leafchains) | spec 21, complete | unchanged |

Unified migration body (single source of truth, identical in both runtimes)

Topology-aware writes — preserve any operator-set values via governance:

| Field | Rule |
|---|---|
| num_cores | set to max(para_count, 1) only if currently 0 |
| max_validators_per_core | Some(5) only if None AND active_validators ≥ 15 AND num_cores ≥ 3; otherwise leave unchanged |
| lookahead | ensure ≥ 1 |
| async_backing_params.max_candidate_depth | ensure ≥ 1 |
| async_backing_params.allowed_ancestry_len | ensure ≥ 2 |
| node_features[0,1,3] | always force-on (bit 3 enables CandidateReceiptV2 and is critical) |
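
The rules in the table above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the migration's actual code: `SchedulerConfig` and `apply_topology_aware` are hypothetical names, the struct is a simplified stand-in for the relay-chain host configuration, and the sketch assumes the `max_validators_per_core` check sees `num_cores` after it has been filled in (which matches the mainnet evidence, where a zero `num_cores` ends up as 4 and the rule still fires).

```rust
/// Simplified, hypothetical stand-in for the relay-chain host configuration.
/// Field names follow the PR text; the struct itself is not a real runtime type.
#[derive(Clone, Debug, PartialEq)]
pub struct SchedulerConfig {
    pub num_cores: u32,
    pub max_validators_per_core: Option<u32>,
    pub lookahead: u32,
    pub max_candidate_depth: u32,
    pub allowed_ancestry_len: u32,
    pub node_features: Vec<bool>,
}

/// Apply the topology-aware rules: fill in missing values, keep operator-set ones.
pub fn apply_topology_aware(
    mut cfg: SchedulerConfig,
    para_count: u32,
    active_validators: u32,
) -> SchedulerConfig {
    // num_cores: only written when it is still 0.
    if cfg.num_cores == 0 {
        cfg.num_cores = para_count.max(1);
    }
    // max_validators_per_core: Some(5) only if unset and the topology is large enough.
    // Assumption: the check uses num_cores as updated just above.
    if cfg.max_validators_per_core.is_none() && active_validators >= 15 && cfg.num_cores >= 3 {
        cfg.max_validators_per_core = Some(5);
    }
    // Lower bounds: raise values that are too small, never lower larger ones.
    cfg.lookahead = cfg.lookahead.max(1);
    cfg.max_candidate_depth = cfg.max_candidate_depth.max(1);
    cfg.allowed_ancestry_len = cfg.allowed_ancestry_len.max(2);
    // node_features bits 0, 1 and 3 are always forced on; other bits are untouched.
    if cfg.node_features.len() < 4 {
        cfg.node_features.resize(4, false);
    }
    for bit in [0usize, 1, 3] {
        cfg.node_features[bit] = true;
    }
    cfg
}
```

Because every rule is either "write only if unset" or "raise to a minimum", applying the function to its own output changes nothing, which is the property the idempotency re-runs in the live evidence check for.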

Atomic-with-setCode defense — close the timing window where stuck cores from prior session linger past upgrade:

| Action | Storage |
|---|---|
| Force every entry to CoreOccupied::Free | parachains_scheduler::AvailabilityCores |
| kill() so next block's free_cores_and_fill_claimqueue rebuilds | parachains_scheduler::ClaimQueue |
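
The two storage actions above can be illustrated on an in-memory model. This is a sketch only: `SchedulerStorage` and `force_free_and_clear` are hypothetical names, `CoreOccupied::Paras` carries just a para id here, and the real migration operates on pallet storage items rather than a plain struct.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical, simplified core state: either free or occupied by a para id.
#[derive(Clone, Debug, PartialEq)]
pub enum CoreOccupied {
    Free,
    Paras(u32),
}

/// In-memory stand-in for the two scheduler storage items named in the table.
pub struct SchedulerStorage {
    /// One entry per core.
    pub availability_cores: Vec<CoreOccupied>,
    /// Core index -> queued para ids.
    pub claim_queue: BTreeMap<u32, VecDeque<u32>>,
}

/// Free every core and drop the claim queue in one step, so that the next
/// scheduling pass starts from a clean state.
pub fn force_free_and_clear(storage: &mut SchedulerStorage) {
    // The number of cores is preserved; only their occupancy is reset.
    for core in storage.availability_cores.iter_mut() {
        *core = CoreOccupied::Free;
    }
    // Equivalent in spirit to kill(): the queue is rebuilt later from scratch.
    storage.claim_queue.clear();
}
```

Doing both in the same step as the config write is what closes the timing window described above: no block is produced against the new configuration while cores from the prior session are still marked occupied.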

Live evidence (try-runtime against archive RPC, all PASS)

testnet (wss://node.testnet.thxnet.org/archive-001/ws):

EnableAsyncBackingAndCoretime: num_cores=5, max_vals_per_core=Some(5), lookahead=1,
async_backing=(depth=1, ancestry=2), node_features[0,1,3]=true,
AvailabilityCores freed, ClaimQueue cleared, active_validators=19
  • topology rule fires: 19 ≥ 15 AND 5 ≥ 3 → Some(5)
  • 60 try-state pallet checks PASS, exit 0
  • idempotent: 4 re-runs produce identical output

mainnet (wss://node.mainnet.thxnet.org/archive-001/ws):

EnableAsyncBackingAndCoretime: num_cores=4, max_vals_per_core=Some(5), lookahead=1,
async_backing=(depth=1, ancestry=2), node_features[0,1,3]=true,
AvailabilityCores freed, ClaimQueue cleared, active_validators=16
  • topology rule fires: 16 ≥ 15 AND 4 ≥ 3 → Some(5)
  • 60 try-state pallet checks PASS, exit 0
  • idempotent: 4 re-runs produce identical output

Operator runbook (post-setCode)

kubectl rollout restart deploy/validator-* is REQUIRED after the runtime upgrade applies. The migration force-frees AvailabilityCores in storage, but relay-client subsystems (prospective-parachains / fragment-chain / SessionInfo) cache scheduler state per session in validator-process memory. Without restart, those caches pin the stale state until the next real session boundary.

Test plan

  • CI: Cargo nextest, Build binaries, Build try-runtime & fast-runtime, all 11 try-runtime (*) per-chain checks
  • CI: Rustfmt, TOML format, Feature propagation (zepter), Feature alignment
  • Local: cargo build --release -p polkadot PASS (17m 09s, this PR)
  • Local: taplo format --check PASS, zepter run check PASS
  • Local: try-runtime on-runtime-upgrade --checks=all against live testnet PASS (this PR, evidence above)
  • Local: try-runtime on-runtime-upgrade --checks=all against live mainnet PASS (this PR, evidence above)

Things explicitly NOT in this PR

  • general-runtime / leafchain runtime changes — leafchain side is already complete on release/v1.12.0 (ClearStaleHostConfiguration, cumulus_pallet_parachain_system::Migration v2→v3, UNINCLUDED_SEGMENT_CAPACITY=1)
  • stable2512 substrate adaptations — out of scope for v1.12.0 rollout (parked on fix/async-backing-migration-prod-safety)
  • Forknet rehearsal of the new mainnet migration — fork-genesis CLI works against testnet runtime, not yet against mainnet runtime; would need a separate harness change

THXLAB AI Team

…pology-aware + atomic with setCode

Brings mainnet (`thxnet`) to async-backing parity with testnet by
adding the `EnableAsyncBackingAndCoretime` migration, and converges
both runtimes onto a single best-of-both-worlds variant.

## Why

Pre-this-PR, `release/v1.12.0` ships:
- testnet @ spec 112_000_004 — has `EnableAsyncBackingAndCoretime`
  (always-set semantics + AvailabilityCores force-free + ClaimQueue::kill)
- mainnet @ spec 112_000_001 — has NO async-backing migration at all,
  only the runtime API binding

Mainnet rolling out v1.12.0 in this state would leave the relay-side
async-backing config at v0.9.40-era defaults (zero `num_cores`,
`async_backing_params=(0,0)`, `node_features[3]=false`). v1.12.0+
cumulus collators advertise CandidateReceiptV2; without bit 3 the
relay rejects them with `BlockedByBacking`, freezing all 4 mainnet
leafchains (avatect-mainnet, lmt-mainnet, ecq-mainnet, thx-mainnet)
post-upgrade until a follow-up runtime upgrade lands.

## Unified design (single source of truth)

Both runtimes now share the same migration body. Topology-aware
writes (from W1 `fix/async-backing-migration-prod-safety`) preserve
operator-set values; atomic-with-setCode defense (from PR #30 inherited
into testnet) force-frees stuck cores so the next ParaInherent pass
schedules without waiting for session rotation.

Topology-aware:
- `num_cores`     : set to `max(para_count,1)` only if currently 0
- `max_vals/core` : `Some(5)` only if `None && active_validators >= 15
                    && num_cores >= 3`; otherwise leave
- `lookahead`     : ensure >= 1
- `async_backing_params` : ensure (depth>=1, ancestry>=2)
- `node_features` : always force bits 0,1,3

Atomic-with-setCode:
- `AvailabilityCores::mutate(|cores| cores.iter_mut().for_each(|core| *core = CoreOccupied::Free))`
- `ClaimQueue::kill()`

## Live evidence (try-runtime on-runtime-upgrade against archive RPC)

testnet (wss://node.testnet.thxnet.org):
  num_cores=5, max_vals_per_core=Some(5), lookahead=1,
  async_backing=(1,2), node_features[0,1,3]=true,
  AvailabilityCores freed, ClaimQueue cleared, active_validators=19
  → topology rule fires (19 >= 15, 5 >= 3) → Some(5)
  → 60 try-state checks PASS, exit 0, idempotent (4 re-runs identical)

mainnet (wss://node.mainnet.thxnet.org):
  num_cores=4, max_vals_per_core=Some(5), lookahead=1,
  async_backing=(1,2), node_features[0,1,3]=true,
  AvailabilityCores freed, ClaimQueue cleared, active_validators=16
  → topology rule fires (16 >= 15, 4 >= 3) → Some(5)
  → 60 try-state checks PASS, exit 0, idempotent (4 re-runs identical)

## spec_version bumps

- `thxnet`         : 112_000_001 → 112_000_002 (new migration added)
- `thxnet-testnet` : 112_000_004 → 112_000_005 (migration body replaced)

## Operator runbook (post-setCode)

`kubectl rollout restart deploy/validator-*` is REQUIRED after the
runtime upgrade applies. The migration force-frees `AvailabilityCores`
in storage but relay-client subsystems (prospective-parachains /
fragment-chain / SessionInfo) cache scheduler state per session in
validator-process memory. Without restart, those caches pin the stale
state until the next real session boundary.

THXLAB AI Team
kumanoko24 merged commit 6b7ee05 into release/v1.12.0 on May 5, 2026
32 of 42 checks passed
kumanoko24 added a commit that referenced this pull request May 7, 2026
…script

Path B partial success: patched genesis ParasShared::ActiveValidatorKeys
storage to 15 entries (3 real + 12 fake) and registered 3 paraIds. Migration
log line confirmed:

  EnableAsyncBackingAndCoretime: num_cores=3, max_vals_per_core=Some(5),
  active_validators=15, node_features[0,1,3]=true, ...

This is the first time the migration's topology rule has fired in the
small forknet — proves PR #37 logic is correct.

However, para 6s/block still not achievable with fake validators because
backing quorum (majority of group_size=5 = 3 online votes) requires
actual online validators matching group assignment. Forknet has 3 online
distributed across 3 groups → no group reaches quorum → cumulus
UnincludedSegment fills → para stuck.
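
The quorum argument above can be checked with a small sketch. It assumes the simple-majority model this message describes (group_size=5 needs 3 online votes), which is not necessarily the relay chain's exact threshold logic; `majority_threshold` and `groups_with_quorum` are hypothetical helper names.

```rust
/// Votes needed for a backing group under the simple-majority model described
/// above (group_size = 5 -> 3 votes). This is an assumption of the sketch.
pub fn majority_threshold(group_size: usize) -> usize {
    group_size / 2 + 1
}

/// `groups[i]` lists, per validator in group i, whether it is online.
/// Returns the indices of the groups that can reach quorum.
pub fn groups_with_quorum(groups: &[Vec<bool>]) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, members)| {
            // Count online validators and compare against the majority threshold.
            let online = members.iter().filter(|&&up| up).count();
            online >= majority_threshold(members.len())
        })
        .map(|(index, _)| index)
        .collect()
}
```

With 15 validators split into 3 groups of 5 and only one real validator online per group, no group reaches the threshold of 3, which is the stall described above. With every validator online, all groups qualify.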

Production reality: 16-19 ALL-online validators per testnet/mainnet
will have full quorum in every group → 6s/block engages naturally
post-rollout (already validated by PR #37 try-runtime live runs).

Adds polkadot/scripts/forknet/patch-avk-then-setcode.ts as a reusable
helper for future rehearsals that need to patch storage at runtime
before triggering setCode (sudo.system.setStorage + sudo.setCodeWithoutChecks
in sequence).

W1/W2/W4 drift PASS throughout.

THXLAB AI Team
kumanoko24 added a commit that referenced this pull request May 8, 2026
Capture production-faithful upgrade rehearsal evidence for release/v1.12.0
(`6b7ee05aea`) on freshly-rsynced testnet livenet seed.

REPORT-rehearsal-2026-05-06.md (Path A flow):
- P6.4-equivalent cumulus 2-step setCode mechanics PASS
- Setup with v1.12.0 polkadot fork-genesis + leafchain --chain=dev

REPORT-rehearsal-v5-2026-05-07.md (Path E.1 + E.2 flows):
- Path E.1 (dev para): production-faithful flow with OLD polkadot
  fork-genesis + v1.12.0 binary boot via WASM execution + real
  setCode → spec 94000004 → 112000005 → EnableAsyncBackingAndCoretime
  migration log line confirmed
- Path E.2 (livenet sand-testnet para): all 4 upgrade dimensions
  exercised (rootchain binary swap, rootchain runtime upgrade,
  leafchain binary swap, leafchain runtime upgrade)
- Storage delta verified via state_getStorage decode:
  async_backing_params=(depth=1, ancestry=2),
  scheduler_params.lookahead=1,
  node_features[0,1,3]=true (CandidateReceiptV2 acceptance — the
  critical mainnet unfreeze fix),
  AvailabilityCores cleared, ClaimQueue cleared
- HostConfiguration v0.9.x → v1.12.0 layout migration succeeded
  (no decode panic post-transition)
- Forknet topology limitations documented (para 6s/block gated
  on 15+ vals × 3+ cores; v0.3.3 capacity=2 stall blocks Phase 4
  v0.3.3→v1.12.0 transition in small forknet — production
  topology has 16-19 validators where this is non-issue)

Combined with PR #37 try-runtime live evidence, release/v1.12.0
is ready for testnet production rollout. Mainnet rollout is
pending mainnet seed DB acquisition for an analogous rehearsal.

THXLAB AI Team