
fix(v1.12.0): free stuck AvailabilityCores + ClaimQueue atomically with setCode #29

Closed
DrudgeRajen wants to merge 1 commit into upgrade/1.12.0-capacity1-workaround from fix/v1.12.0-free-stuck-cores-on-setcode

@DrudgeRajen
Collaborator

Summary

  • Extends EnableAsyncBackingAndCoretime migration to force-free occupied AvailabilityCores and kill ClaimQueue atomically with the v1.12.0 setCode
  • Bumps spec_version 112_000_003 → 112_000_004

Why

When v1.12.0 rootchain runtime enables async backing while a para candidate is still occupying a core (common under capacity=2 cumulus + pre-paritytech#4937 relay), the occupying entry never reaches availability. The core stays stuck until the next session rotation (Scheduler::push_occupied_cores_to_assignment_provider is called at pre_new_session).

Our earlier forked-testnet rehearsals confirmed this pattern: the chain sat stuck for ~56 minutes after rootchain setCode, resuming only at the session-0→1 boundary on a 1-hour-epoch testnet. In prod the wait would be the same ~1 hour, and longer on mainnet, which has a longer epoch.

What the patch does

After ActiveConfig::<Runtime>::mutate(...):

parachains_scheduler::AvailabilityCores::<Runtime>::mutate(|cores| {
    for core in cores.iter_mut() {
        *core = parachains_scheduler::CoreOccupied::Free;
    }
});
parachains_scheduler::ClaimQueue::<Runtime>::kill();

Next block's free_cores_and_fill_claimqueue repopulates from AssignmentProvider — same mechanism session rotation uses.
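
The effect of the two storage writes, and the idempotency claimed in the test plan, can be illustrated with a toy model. This is only a sketch: `SchedulerState`, the simplified `CoreOccupied` enum, and `free_stuck_cores` are hypothetical stand-ins that use plain collections, not the real pallet storage or types.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Toy stand-in for per-core occupancy (assumption: the real enum carries more data).
#[derive(Clone, Debug, PartialEq)]
enum CoreOccupied {
    Free,
    Paras(u32), // id of the para occupying the core
}

/// Toy stand-in for the two storage items the migration touches.
#[derive(Clone, Debug, PartialEq)]
struct SchedulerState {
    availability_cores: Vec<CoreOccupied>,
    claim_queue: BTreeMap<u32, VecDeque<u32>>, // core index -> queued para ids
}

/// Same shape as the patch: mark every core free, then clear the claim queue.
fn free_stuck_cores(state: &mut SchedulerState) {
    for core in state.availability_cores.iter_mut() {
        *core = CoreOccupied::Free;
    }
    state.claim_queue.clear();
}

fn main() {
    let mut state = SchedulerState {
        availability_cores: vec![CoreOccupied::Paras(1000), CoreOccupied::Free],
        claim_queue: BTreeMap::from([(0, VecDeque::from([1000]))]),
    };

    free_stuck_cores(&mut state);
    assert!(state.availability_cores.iter().all(|c| *c == CoreOccupied::Free));
    assert!(state.claim_queue.is_empty());

    // Running it again on already-free cores changes nothing.
    let before = state.clone();
    free_stuck_cores(&mut state);
    assert_eq!(state, before);
}
```

In the model, the second call is a no-op because both writes set the state to a fixed value rather than transforming it.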

Measurement (forked-testnet prod-runtime, 1-hour epoch)

Approach                          Stuck duration after setCode
plain migration (pre-patch)       56 min (wait for session rotation)
this patch + validator restart    16 sec

~210× faster.
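
The ratio follows directly from the two measured durations. A minimal check, where `speedup` is a helper introduced only for this illustration:

```rust
/// Ratio of two durations given in seconds.
fn speedup(before_secs: u64, after_secs: u64) -> f64 {
    before_secs as f64 / after_secs as f64
}

fn main() {
    let before = 56 * 60; // 56 min stuck with the plain migration, in seconds
    let after = 16; // 16 s stuck with this patch + validator restart
    println!("{:.0}x", speedup(before, after)); // prints "210x"
}
```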

Operator runbook addition

Validator-process Rust subsystems (prospective-parachains, SessionInfo cache) don't pick up the new scheduler state until their next ActiveLeavesUpdate with a new session OR process restart. After setCode:

kubectl rollout restart -n <ns> deploy/validator-*

Collator restart is NOT needed (cumulus reads per-relay-parent, no stale session cache of this kind).

Relationship to other v1.12.0 patches

Test plan

  • forked-testnet rehearsal 1 (plain migration, no patch): para stuck 56 min ✅
  • forked-testnet rehearsal 2 (this patch + validator restart): para advancing 16 s after setCode InBlock ✅
  • run against full v1.12.0 upgrade path on forked-testnet including leafchain setCode
  • smoke-test migration is a no-op when cores already free (idempotency)

🤖 Generated with Claude Code

…th setCode

Extends `EnableAsyncBackingAndCoretime` migration to force-free any occupied
`AvailabilityCores` and kill `ClaimQueue` right after rewriting `ActiveConfig`.
Mirrors what `Scheduler::push_occupied_cores_to_assignment_provider` does at
session rotation, so the next `ParaInherent` pass can schedule fresh candidates
without waiting ~1 hour for the prod-testnet session boundary.

Background: when the v1.12.0 rootchain runtime enables async backing while a
candidate is still occupying a core (common under capacity=2 cumulus +
pre-paritytech#4937 relay), the occupying entry never reaches availability and the core
stays stuck until the next session. Empirically measured on forked-testnet:

  - plain migration              → para stuck 56 min (until session rotation)
  - this patch + validator restart → para advancing 16 s after setCode InBlock

The restart step is required because prospective-parachains / SessionInfo
caches live in validator-process memory and need flushing to pick up the new
scheduler state. Runbook: `kubectl rollout restart deploy/validator-*` post-setCode.

- Bumps spec_version 112_000_003 → 112_000_004
- Idempotent: re-running on a chain where cores are already free is a no-op

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@DrudgeRajen
Collaborator Author

Superseded by #30 (consolidated branch upgrade/1.12.0-all).

DrudgeRajen added a commit that referenced this pull request Apr 19, 2026
Brings the consolidated v1.12.0 fixes (PR #30) into the stable2512 upgrade path.

Resolution strategy:
- Took stable2512 side for polkadot/node/*, polkadot/runtime/parachains/*,
  zombienet_tests/*: stable2512 has upstream paritytech#4937 + related PRs natively,
  so the v1.12.0 backport cherry-picks (paritytech#4727, paritytech#4471, paritytech#4724) are redundant.
- Deleted prdoc/pr_4471.prdoc + prdoc/pr_4724.prdoc: those PRs are already
  in the stable2512 branch history.
- Kept stable2512 leafchain/runtime/general/src/lib.rs (capacity=2) —
  paritytech#4937 fixes the fragment-chain fork deadlock natively so capacity=1
  is not needed on stable2512.
- Manually merged thxnet/runtime/thxnet-testnet/src/lib.rs:
  - spec_version stays at 125_120_005 (stable2512)
  - EnableAsyncBackingAndCoretime now includes the free-stuck-cores logic
    from PR #29 (also useful on stable2512 for setCode-atomic unstick).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DrudgeRajen added a commit that referenced this pull request Apr 19, 2026
…ble2512 scheduler

After merging upgrade/1.12.0-all, the free-stuck-AvailabilityCores code from
PR #29 doesn't compile on stable2512 — the upstream paritytech#4937 rewrite removed
`AvailabilityCores` + `CoreOccupied` storage from `pallet_scheduler`. Only
`ClaimQueue` survives, now storing `VecDeque<Assignment>` instead of
`VecDeque<ParasEntryType<T>>`.

On stable2512 the stuck-core scenario doesn't manifest either: paritytech#4937 fixes the
fragment-chain fork deadlock natively, so candidates always reach availability
under the normal flow. We still kill `ClaimQueue` in the migration so the
scheduler rebuilds from the just-updated `ActiveConfig`.

Also dedupes `scale-type-resolver` from Cargo.lock (duplicate entry introduced
by the merge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DrudgeRajen added a commit that referenced this pull request Apr 19, 2026
#30)

* statement-distribution: Fix false warning (paritytech#4727)

... when backing group is of size 1.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>

* Remove the prospective-parachains subsystem from collators (paritytech#4471)

Implements paritytech#4429

Collators only need to maintain the implicit view for the paraid they
are collating on.
In this case, bypass prospective-parachains entirely. It's still useful
to use the GetMinimumRelayParents message from prospective-parachains
for validators, because the data is already present there.

This enables us to entirely remove the subsystem from collators, which
consumed resources needlessly.

Aims to resolve paritytech#4167 

TODO:
- [x] fix unit tests

* Fix core sharing and make use of scheduling_lookahead (paritytech#4724)

Implements most of
paritytech#1797

Core sharing (two or more parachains scheduled on the same
core with the same `PartsOf57600` value) was not working correctly. The
expected behaviour is to have Backed and Included event in each block
for the paras sharing the core and the paras should take turns. E.g. for
two cores we expect: Backed(a); Included(a)+Backed(b);
Included(b)+Backed(a); etc. Instead of this each block contains just one
event and there are a lot of gaps (blocks w/o events) during the
session.
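
The expected turn-taking pattern can be written down as a small generator. This is only an illustration of the event sequence described above, for paras taking turns on one shared core; `expected_events` is a hypothetical helper, not part of the scheduler or the zombienet test.

```rust
/// Expected per-block events when `paras` take turns on one shared core:
/// each block backs the next para and includes the one backed in the
/// previous block.
fn expected_events(paras: &[&str], blocks: usize) -> Vec<String> {
    assert!(!paras.is_empty(), "at least one para must share the core");
    (0..blocks)
        .map(|i| {
            let backed = paras[i % paras.len()];
            if i == 0 {
                // Nothing has been backed yet, so nothing can be included.
                format!("Backed({backed})")
            } else {
                let included = paras[(i - 1) % paras.len()];
                format!("Included({included})+Backed({backed})")
            }
        })
        .collect()
}

fn main() {
    for (block, events) in expected_events(&["a", "b"], 4).iter().enumerate() {
        println!("block {block}: {events}");
    }
}
```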

Core sharing should also work when collators are building collations
ahead of time.

TODOs:

- [x] Add a zombienet test verifying that the behaviour mentioned above
works.
- [x] prdoc

---------

Co-authored-by: alindima <alin@parity.io>

* fix(1.12.0-backports): adapt paritytech#4724 to v1.12.0 subsystem APIs

After cherry-picking paritytech#4724 (core sharing + scheduling_lookahead) on top
of v1.12.0, four upstream references do not resolve because the enabling
PRs landed later:

  1. `polkadot_statement_table::{…}` → use local `statement-table` alias
     (crate package name is `polkadot-statement-table`, imported via
     `statement-table = { package = … }` in Cargo.toml).
  2. `polkadot_primitives::{…}` in `scheduler.rs` and
     `runtime_api_impl/vstaging.rs` → use local `primitives` alias.
  3. `AvailabilityStoreMessage::StoreAvailableData::{core_index, node_features}`
     → the two new fields were added by the systematic-chunks PR paritytech#1644,
     which is NOT cherry-picked (it's 7540 LoC / 84 files and brings an
     unrelated availability-recovery rewrite). Keep the outer function
     signature as-is for API parity with upstream and drop the two
     fields at the message-send boundary via `let _ = …;`. The
     erasure-root check that protects consensus continues to run.
  4. `free_cores_and_fill_claimqueue` → in our tree the method is named
     `free_cores_and_fill_claim_queue` (the rename was a small cleanup
     that happened concurrently in upstream). Update the two call sites
     (paras_inherent and runtime_api v10).
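
The pattern from item 3 (keep the upstream-shaped signature, discard the arguments the older message cannot carry) can be sketched as follows. All types here are hypothetical stand-ins with made-up fields; only the `let _ = …;` discard at the boundary reflects the description above.

```rust
/// Hypothetical stand-ins for the real types; only the shape matters.
#[derive(Clone, Copy, Debug, PartialEq)]
struct CoreIndex(u32);

#[derive(Clone, Debug, PartialEq)]
struct NodeFeatures(Vec<bool>);

/// Older message shape: no `core_index` / `node_features` fields.
#[derive(Debug, PartialEq)]
struct StoreAvailableData {
    candidate_hash: [u8; 32],
    n_validators: u32,
}

/// Keeps the newer signature for API parity, but drops the two arguments
/// that the older message has no fields for.
fn make_store_message(
    candidate_hash: [u8; 32],
    n_validators: u32,
    core_index: CoreIndex,
    node_features: NodeFeatures,
) -> StoreAvailableData {
    // Explicitly discarded at the message-send boundary.
    let _ = (core_index, node_features);
    StoreAvailableData { candidate_hash, n_validators }
}

fn main() {
    let msg = make_store_message([0u8; 32], 3, CoreIndex(0), NodeFeatures(vec![true]));
    println!("{msg:?}");
}
```

Callers written against the newer signature keep compiling unchanged, which is the point of leaving the outer function as-is.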

Build clean: `cargo check -p polkadot -p polkadot-service
-p polkadot-runtime-parachains -p thxnet-testnet-runtime` all pass with
only pre-existing warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(leafchain,v1.12.0): UNINCLUDED_SEGMENT_CAPACITY 2 → 1 to sidestep fragment-chain fork deadlock

# Problem

On v1.12.0 rootchain with async backing enabled (depth=1, ancestry=2)
and a small backing topology (3 validators, 1 group, 1 core), the
parachain stalls permanently after a few blocks.

Validator log symptom (repeats every ~18s):
- `Refusing to second candidate at leaf. Is not a potential member.`
- `Rejected v2 advertisement ... error=BlockedByBacking`
- Provisioner: `candidates_count=0` in every inherent

# Root cause

The v1.12.0 relay-side prospective-parachains subsystem is the
pre-paritytech#4937 ("prospective-parachains rework: take II") fragment-chain. In
`polkadot/node/core/prospective-parachains/src/fragment_chain/mod.rs:797`
`is_fork_or_cycle` returns true if ANY other candidate already has the
same `parent_head_hash`, rejecting it as a fork.

When inclusion is slow, the collator's cumulus aura-ext consensus hook
(FixedVelocityConsensusHook<V=1, C=2>) allows authoring a 2nd block
every ~18s with the same parent while the first one is still waiting to
be included. Each re-author produces a new block N with different
extrinsics → same parent_head_hash → fragment-chain rejects all but the
first → backing pipeline never completes a full cycle → permanent stall.
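
A toy model of the rejection rule makes the stall easier to see. This sketch covers only the "same parent means fork" half of the check described above; `ToyFragmentChain`, `Candidate`, and the `u64` hashes are hypothetical simplifications, not the subsystem's real types.

```rust
use std::collections::HashSet;

type Hash = u64; // stand-in for a head-data hash

struct Candidate {
    parent_head_hash: Hash,
    head_hash: Hash,
}

#[derive(Default)]
struct ToyFragmentChain {
    seen_parents: HashSet<Hash>,
    accepted: Vec<Hash>,
}

impl ToyFragmentChain {
    /// True if an accepted candidate already builds on the same parent.
    fn is_fork(&self, c: &Candidate) -> bool {
        self.seen_parents.contains(&c.parent_head_hash)
    }

    /// Accepts the candidate unless it forks; returns whether it was accepted.
    fn try_add(&mut self, c: &Candidate) -> bool {
        if self.is_fork(c) {
            return false;
        }
        self.seen_parents.insert(c.parent_head_hash);
        self.accepted.push(c.head_hash);
        true
    }
}

fn main() {
    let mut chain = ToyFragmentChain::default();
    // Block N, then a re-authored block N with different extrinsics.
    let first = Candidate { parent_head_hash: 12, head_hash: 100 };
    let reauthored = Candidate { parent_head_hash: 12, head_hash: 101 };

    assert!(chain.try_add(&first));
    assert!(!chain.try_add(&reauthored)); // same parent: rejected as a fork
    assert_eq!(chain.accepted, vec![100]);
}
```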

Upstream fix is paritytech#4937, first in stable2409. Not
cherry-pickable into v1.12.0 without dragging Constraints types from
later releases (~1355 LoC rewrite).

# Workaround

Set UNINCLUDED_SEGMENT_CAPACITY = 1.

`FixedVelocityConsensusHook::can_build_upon` returns false whenever
`size_after_included >= C`. With C=1, once
the collator has produced one unincluded block, all further authoring
attempts at the same parent are blocked until that block is included
on the relay. No forks are ever created at the cumulus side →
fragment-chain sees a single candidate per cycle → accepts it normally.
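
The capacity check can be modelled in a few lines. This is a simplified sketch of that one comparison only: the velocity parameter V and slot handling are left out, and the free function below is a hypothetical stand-in for the hook's method.

```rust
/// Capacity check only: authoring is allowed while the unincluded segment
/// (counted after the included block) is still below capacity `C`.
fn can_build_upon<const C: u32>(size_after_included: u32) -> bool {
    size_after_included < C
}

fn main() {
    // C = 2: a second block may be authored while one is still unincluded,
    // which is what allows re-authoring on the same parent.
    assert!(can_build_upon::<2>(1));

    // C = 1: once one block is unincluded, authoring is blocked...
    assert!(!can_build_upon::<1>(1));
    // ...until that block is included and the segment is empty again.
    assert!(can_build_upon::<1>(0));
}
```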

Tradeoff: para block production becomes synchronous-backing pace (one
block per ~18s = 3 relay slots) instead of async's potential 1:1 with
relay. Acceptable for small networks that need v1.12.0 as a stable
endpoint rather than a transient upgrade hop.

# Validation

Rehearsed on forked-testnet 2026-04-18/19:
- v0.9.40 → binary upgrade → leafchain setCode spec 20 → 21 (capacity=1)
- rootchain setCode v1.12.0 (spec 112000003)
- Para advanced from stuck-at-13 → 4556+ overnight, finalization keeping
  pace with head, validator provisioner consistently reports
  `candidates_count=1` on backing cycles.

Prior capacity=2 rehearsal on the same topology reproducibly stalled
para at 13–30 within minutes of rootchain setCode and never recovered.

Bump leafchain spec_version 20 → 21 so setCode applies after the v1.12.0
rootchain upgrade. The stable2512 hop restores capacity to 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(v1.12.0): free stuck AvailabilityCores + ClaimQueue atomically with setCode

Extends `EnableAsyncBackingAndCoretime` migration to force-free any occupied
`AvailabilityCores` and kill `ClaimQueue` right after rewriting `ActiveConfig`.
Mirrors what `Scheduler::push_occupied_cores_to_assignment_provider` does at
session rotation, so the next `ParaInherent` pass can schedule fresh candidates
without waiting ~1 hour for the prod-testnet session boundary.

Background: when the v1.12.0 rootchain runtime enables async backing while a
candidate is still occupying a core (common under capacity=2 cumulus +
pre-paritytech#4937 relay), the occupying entry never reaches availability and the core
stays stuck until the next session. Empirically measured on forked-testnet:

  - plain migration              → para stuck 56 min (until session rotation)
  - this patch + validator restart → para advancing 16 s after setCode InBlock

The restart step is required because prospective-parachains / SessionInfo
caches live in validator-process memory and need flushing to pick up the new
scheduler state. Runbook: `kubectl rollout restart deploy/validator-*` post-setCode.

- Bumps spec_version 112_000_003 → 112_000_004
- Idempotent: re-running on a chain where cores are already free is a no-op

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Alexandru Gheorghe <49718502+alexggh@users.noreply.github.com>
Co-authored-by: Alin Dima <alin@parity.io>
Co-authored-by: Tsvetomir Dimitrov <tsvetomir@parity.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>