Feat/tick fork rollback by hackerby888 · Pull Request #35 · qubic/core-lite

hackerby888 · 2026-06-17T12:26:56Z

No description provided.

…tle-not-cull + recv-length fix
Track 1a: mark changed spectrum leaves at the transfer sites (increase/decreaseEnergy) and re-hash only those in the per-tick salted digest, instead of scanning all SPECTRUM_CAPACITY (16.7M) entries every tick. reorganizeSpectrum and the mismatch-reprocess recompute reset the dirty list. Opt-in VERIFY_SPECTRUM_DIGEST recomputes from scratch each tick and halts on any divergence.
net: push() drops a message on a full send buffer instead of closePeer (slow consumers survive); pushCustom skips peers whose buffer is backed up; receiveProcessor recv()s the token FragmentLength, not BUFFER_SIZE (fixes receiveBuffer overrun).
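The Track 1a idea (mark leaves where balances change, then re-hash only the marked ones) can be sketched roughly as below. This is a minimal illustration, not the PR's code: `DirtySpectrum`, `hashLeaf`, `rehashDirty` and `matchesFullRecompute` are hypothetical names, the hash is a placeholder mixer, the per-tick salt is left out, and only the leaf level is shown (no Merkle parents).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical leaf hash: a simple 64-bit mixer standing in for the real digest.
inline std::uint64_t hashLeaf(std::int64_t balance) {
    std::uint64_t x = static_cast<std::uint64_t>(balance) + 0x9e3779b97f4a7c15ULL;
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

class DirtySpectrum {
public:
    explicit DirtySpectrum(std::size_t capacity)
        : balances_(capacity, 0), leafHashes_(capacity, hashLeaf(0)), isDirty_(capacity, 0) {}

    // Transfer site: change the balance and remember the leaf (once).
    void increaseEnergy(std::size_t index, std::int64_t amount) {
        assert(index < balances_.size());
        balances_[index] += amount;
        markDirty(index);
    }

    bool decreaseEnergy(std::size_t index, std::int64_t amount) {
        assert(index < balances_.size());
        if (balances_[index] < amount) return false;  // insufficient balance, nothing changes
        balances_[index] -= amount;
        markDirty(index);
        return true;
    }

    // Re-hash only the leaves touched since the last call; returns how many were re-hashed.
    std::size_t rehashDirty() {
        const std::size_t count = dirtyList_.size();
        for (std::size_t index : dirtyList_) {
            leafHashes_[index] = hashLeaf(balances_[index]);
            isDirty_[index] = 0;
        }
        dirtyList_.clear();
        return count;
    }

    // Full O(capacity) check, in the spirit of the opt-in VERIFY_SPECTRUM_DIGEST mode.
    bool matchesFullRecompute() const {
        for (std::size_t i = 0; i < balances_.size(); ++i)
            if (leafHashes_[i] != hashLeaf(balances_[i])) return false;
        return true;
    }

private:
    void markDirty(std::size_t index) {
        if (!isDirty_[index]) { isDirty_[index] = 1; dirtyList_.push_back(index); }
    }

    std::vector<std::int64_t> balances_;
    std::vector<std::uint64_t> leafHashes_;
    std::vector<std::uint8_t> isDirty_;
    std::vector<std::size_t> dirtyList_;
};
```

The per-leaf flag keeps the dirty list free of duplicates, so a leaf touched many times in one tick is still re-hashed once.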

A behind node requests a whole tick RANGE; the peer streams every tick's data (tickData + votes + txs) as one chunk of standard-framed sub-messages, which the requester re-feeds to the existing signature-verifying handlers (+ a direct-store for far-future txs). Removes the per-tick request-pacing cap on catch-up. Auto-activates on AUX when far behind; single in-flight chunk per peer. 240/241 are unused upstream so a UEFI node drops a stray 240.

…/241
Requester probes all connected peers with a tiny 240; the first 241 reply marks that peer capable (by address) and the requester pulls chunks only from it. Fixes the mixed-network case where random peer selection mostly hit non-bulk (stock) nodes and fell back to per-tick prefetch.

appendTick appended a whole tick (tickData + up to 676 votes + up to 4096 txs) before the cap was checked, so a tx-heavy tick overflowed gChunkBuf (segfault in the memcpy). Make it cap-aware: check before every sub-frame and roll the whole tick back if it does not fit.
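A cap-aware append with whole-tick rollback might look like the sketch below. `ChunkBuffer`, `Frame` and this `appendTick` signature are hypothetical stand-ins for gChunkBuf and its helper; the real sub-frame layout is not shown in the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Frame = std::vector<std::uint8_t>;  // one sub-frame (tickData, a vote, or a tx)

struct ChunkBuffer {
    std::vector<std::uint8_t> data;
    std::size_t cap;

    explicit ChunkBuffer(std::size_t capBytes) : cap(capBytes) {}

    // Append every sub-frame of one tick, or none of them.
    bool appendTick(const std::vector<Frame>& subFrames) {
        const std::size_t mark = data.size();      // rollback point for this tick
        for (const Frame& frame : subFrames) {
            // Check BEFORE copying; data.size() <= cap always holds, so no underflow.
            if (frame.size() > cap - data.size()) {
                data.resize(mark);                 // drop the partially appended tick
                return false;
            }
            data.insert(data.end(), frame.begin(), frame.end());
        }
        return true;
    }
};
```

Rolling back to the mark means a chunk never ends in the middle of a tick, so the receiver can treat every chunk as a sequence of complete ticks.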

…t walk
- Bulk: onRespondChunk no longer verifies a whole chunk inline on one thread. Sub-frames go to a work queue that the requestProcessor pool drains in parallel; delivery is a continuous gap-free request chain with backpressure.
- Track 1b: the spectrum + universe Merkle walks skip whole clean 64-bit flag words (parent write index computed from i, not a running counter), so the per-tick digest is ~O(dirty) instead of O(CAPACITY).
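The word-skipping walk from Track 1b can be illustrated as follows. This is a sketch under assumptions: one dirty bit per node packed into 64-bit words, a binary tree where node i has parent i / 2 on the next level, and hypothetical helper names `forEachDirty` and `dirtyParents`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Call visit(i) for every set bit i, skipping whole zero words in one compare.
template <typename Visit>
std::size_t forEachDirty(const std::vector<std::uint64_t>& flagWords, Visit visit) {
    std::size_t visited = 0;
    for (std::size_t w = 0; w < flagWords.size(); ++w) {
        const std::uint64_t bits = flagWords[w];
        if (bits == 0) continue;  // clean word: 64 nodes skipped at once
        for (unsigned b = 0; b < 64; ++b) {
            if (bits & (1ULL << b)) {
                visit(w * 64 + b);
                ++visited;
            }
        }
    }
    return visited;
}

// Propagate dirty flags one level up. The parent index is derived from the
// child index i, so skipped words cannot shift where a parent is written.
std::vector<std::uint64_t> dirtyParents(const std::vector<std::uint64_t>& childFlags) {
    std::vector<std::uint64_t> parents((childFlags.size() + 1) / 2, 0);
    forEachDirty(childFlags, [&parents](std::size_t i) {
        const std::size_t p = i / 2;
        parents[p / 64] |= 1ULL << (p % 64);
    });
    return parents;
}
```

A running output counter would only be correct if every child were visited; once clean words are skipped, the index has to come from i itself.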

Requester kept only one chunk request in flight, so a provider served one at a time and its requestProcessor pool could not parallelize -> requester delivery-starved. Now N (8) in-flight slots, round-robined across all peers that answered 241, with a running chunk-size estimate for the frontier and the normal prefetch as a gap backstop. Responder pools now serve in parallel and a behind node pulls from several archives.

Catch-up fetched each tick's votes+txs from 5 peers (CATCHUP_FANOUT 3+2), so ~80% of the multi-Mbit/s prefetch intake was duplicate (measured 84% *duplicate vs +processed, 138 Mbit/s on a behind node). pushPreferringAtOrAbove already targets peers that have the tick, so 1 fullnode + 1 random suffices; a slow peer costs a 500ms re-poll (pipeline-hidden), not bandwidth.

Fan-out 2 starved the current-tick quorum fetch at fresh boot (0 fullnodes classified yet -> ~1 peer, no hedge) -> node stuck at the initial tick. Restore 3+2. The duplicate-prefetch win needs a boot-safe / prefetch-only form, not a blanket cut.

When bulk catch-up is delivering the range from a capable peer, the per-tick +N prefetch both floods (duplicate fan-out) and competes with bulk for the same connection. Drop it to a +1/+2 safety net so bulk pulls exclusively and the bulk peer connection stays hot.

…tride)
The multi-slot striping advanced the request frontier by a running average chunk-tick estimate. Tick size varies ~100x (empty/void vs full-tx ticks), so the guessed stride overshoots dense regions and leaves unrequested gaps the ticker stalls on. Advance the frontier only by each response's confirmed end tick: one request in flight, strictly contiguous. 4MB/chunk per RTT is ample bandwidth.

…nsition
The bulkProcessOne hook existed only inside the epochTransitionState wait block, so outside an epoch transition the bulk sub-frame queue was never drained: received chunks accumulated and were overwritten in the ring, and the ticker advanced only at the gated prefetch rate. Drain one sub-frame per idle request processor in the normal empty-queue branch so the whole pool applies bulk data in parallel.

…onus
Gating prefetch to depth 2 while bulk is 'active' starves the node when the bulk peer underdelivers (serves void/auto-flushed ticks): the clean data path (full-depth prefetch from the network) was being capped while waiting on a bulk peer that had nothing useful. Let prefetch always run at full depth; bulk runs alongside and helps only when a capable peer actually has the range.

Single-outstanding is RTT-bound (~chunkTicks per RTT); at the 112ms test link that caps ~200 t/s well under the ~538 t/s 1Gbit vote-bandwidth ceiling. Pipeline N=8 in-flight requests, each for an EXACT BULK_SPAN(8) tick range (new wire field maxTicks, VERSION=2) so ranges stay contiguous without density-dependent guessing. A partial response (range overflowed the byte cap) re-requests its tail as a priority hole, so nothing is skipped and the contiguous frontier never stalls. Responder honors maxTicks (clamped to REQUEST_SPAN). Keeps the link full across the RTT to approach the bandwidth ceiling.
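As a back-of-the-envelope check of the figures above, an idealized model (ignoring the byte cap, processing time and partial responses) puts the rate at in-flight requests times ticks per request, divided by the RTT. The helper names are hypothetical.

```cpp
// Idealized steady-state rate: every in-flight request completes once per round trip.
constexpr double pipelinedTicksPerSecond(unsigned inFlight, unsigned ticksPerRequest,
                                         double rttSeconds) {
    return static_cast<double>(inFlight) * ticksPerRequest / rttSeconds;
}

// Ticks per chunk implied by an observed single-outstanding rate.
constexpr double impliedTicksPerChunk(double ticksPerSecond, double rttSeconds) {
    return ticksPerSecond * rttSeconds;
}
```

With the numbers quoted in the commit message, ~200 t/s at 112 ms implies roughly 22 ticks per chunk for the single-outstanding case, while 8 requests of 8 ticks each give 64 / 0.112 s, about 571 t/s, which is just above the ~538 t/s ceiling. Under this model the pipeline stops being RTT-bound and becomes bandwidth-bound.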

…ode)
random(numberOfPublicPeers) is a modulo, so when a node has zero public peers (e.g. started with only a self/loopback --peers entry, or before the peer list is populated) random(0) divides by zero -> SIGFPE. Mainnet never hits it (always has peers); a testnet/dev node does. Skip the random peer pick when the pool is empty.
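The guard amounts to checking the pool size before the modulo. A minimal sketch, with `pickRandomPeerIndex` and `NO_PEER` as hypothetical names (the node's own random() is not reproduced here):

```cpp
#include <cstdint>

constexpr std::uint32_t NO_PEER = UINT32_MAX;  // sentinel: no peer available

// randomValue is any raw random number; the modulo maps it into [0, numberOfPublicPeers).
inline std::uint32_t pickRandomPeerIndex(std::uint32_t randomValue,
                                         std::uint32_t numberOfPublicPeers) {
    if (numberOfPublicPeers == 0) return NO_PEER;  // empty pool: skip the pick, no x % 0
    return randomValue % numberOfPublicPeers;
}
```

Callers then treat `NO_PEER` as "skip this round" rather than indexing the peer table.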

Tick storage spans the whole epoch (tickEnd = tickBegin + MAX_NUMBER_OF_TICKS_PER_EPOCH), so an unproduced tick still passes the in-storage guard and appendTick appended nothing yet returned 1 (success) -- the responder served empty frames for ticks past its tip, and the requester (frontier = system.tick + up to BULK_MAX_AHEAD) raced over them whenever the provider wasn't that far ahead. Result: ap=0, frontier pinned at the +8000 cap, bulk stalls, node falls back to slow prefetch. Fix three ways:
- responder: appendTick returns 0 (not 1) when a tick has no content, so the chunk stops at the real tip instead of serving emptiness;
- requester: cap the request range at the capable peers' reported tick (capableTip) so we never ask past what's produced;
- requester: on an empty response, pull the frontier back to re-poll the tip as the provider advances, instead of churning a hole.
This was the reason bulk never delivered (every prior run measured prefetch, not bulk) -- a healthy node DOES retain all votes for ticks below its tip; we were just asking above it.

Replace the marching-watermark + holes + pull-back + provider-tip cap with a fixed sliding window: requester always asks for [system.tick+1, system.tick+BULK_PREFETCH_WINDOW(128)] in BULK_SPAN chunks, N in flight, watermark anchored to system.tick. Responder now COVERS the whole requested span: a tick it doesn't have (void/unproduced) contributes zero bytes and is skipped, it keeps serving the rest (stops only when the chunk is byte-full). So a missing tick is never re-requested by bulk — the node's normal per-tick prefetch covers gaps. No holes, no pull-back, no tip negotiation. Removes the empty-chunk churn (was 30M+ empty round-trips): when the window is full and the ticker is waiting, fillSlots simply stops issuing. Blindly request the window, apply what comes back, waste nothing.
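The fixed sliding window reduces request planning to splitting [system.tick+1, system.tick+window] into span-sized ranges. A small sketch of that split, where `TickRange` and `windowRequests` are hypothetical names and the in-flight limit and buffer pacing are left out:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct TickRange {
    std::uint32_t first;  // inclusive
    std::uint32_t last;   // inclusive
};

// Split the window anchored at systemTick into consecutive ranges of at most `span` ticks.
std::vector<TickRange> windowRequests(std::uint32_t systemTick, std::uint32_t window,
                                      std::uint32_t span) {
    std::vector<TickRange> ranges;
    if (span == 0 || window == 0) return ranges;
    const std::uint32_t begin = systemTick + 1;
    const std::uint32_t end = systemTick + window;
    for (std::uint32_t first = begin; first <= end; first += span) {
        ranges.push_back({first, std::min(first + span - 1, end)});
    }
    return ranges;
}
```

Because the ranges depend only on system.tick, the window and the span, the requester needs no holes list, pull-back or tip negotiation to decide what to ask for next.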

Deeper prefetch window + matching in-flight slots so the whole window can be outstanding (64 x BULK_SPAN(8) = 512). More parallel chunk requests to the provider pool; buffer-paced by the receive buffer. Effective only when the provider holds a real lead.

FAST_TX_WINDOW_TICKS 32->512, CATCHUP_MAX_PREFETCH 20->512. broadcast/eager-fetch txs for ticks beyond +32 were dropped at the fastTxWindow.add gate; prefetch only looked 20 ahead. Both now match the bulk window so the tx supply keeps up. FastTxWindow ~2.6GB at 512.

delete extensions/lite_bulk_catchup.h + all qubic.cpp glue (include, 240/241 switch cases, init, drain hooks, kicker, status diag). prefetch/fastTxWindow/merkle catch-up paths untouched.

Configure(NULL) (called by closePeer to abort a connection) was a no-op in the OS-port emulation, so the in-flight send kept running, isTransmitting never cleared, and the slot was stuck isClosing forever (fd leaked). Now shutdown() the socket so the worker send/recv error out immediately and the slot can free. fd still closed in DestroyChild.

Connect() did a blocking connect() with no timeout; a dead/firewalled peer pinned the slot (isConnectingAccepting) for the OS SYN timeout (~127s), so the 5s zombie reaper could not free it -> zombieConnect pile-up. Now non-blocking connect + 5s poll; Configure(NULL) shutdown wakes the poll early.
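A non-blocking connect bounded by poll() generally follows the POSIX pattern sketched below. This is not the PR's Connect(); `connectWithTimeout` is a hypothetical helper, IPv4 only, and it treats an interrupted poll() as a failure for brevity.

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns a connected (still non-blocking) fd, or -1 on error or timeout.
int connectWithTimeout(const sockaddr_in& addr, int timeoutMs) {
    const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    const int flags = ::fcntl(fd, F_GETFL, 0);
    if (flags < 0 || ::fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        ::close(fd);
        return -1;
    }

    const int rc = ::connect(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr));
    if (rc < 0 && errno != EINPROGRESS) {  // immediate failure (e.g. refused)
        ::close(fd);
        return -1;
    }
    if (rc < 0) {
        // Handshake in progress: wait until the socket is writable or the timeout expires.
        pollfd pfd{};
        pfd.fd = fd;
        pfd.events = POLLOUT;
        if (::poll(&pfd, 1, timeoutMs) <= 0) {  // 0 = timeout, -1 = error
            ::close(fd);
            return -1;
        }
        // Writable does not mean connected: read the final status from SO_ERROR.
        int err = 0;
        socklen_t len = sizeof(err);
        if (::getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0) {
            ::close(fd);
            return -1;
        }
    }
    return fd;
}
```

The timeout is a parameter here so the caller can keep it below whatever reaper deadline applies to the slot.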

…ive pool
#1 connect poll was 5s == reaper CONNECT_TIMEOUT_SECS=5 -> boundary race (reaper could reap a just-connected slot). Drop to 4s so connect self-resolves before the reaper backstop.
#2 forgetPublicPeer now parks peers in an inactive pool instead of dropping them; when active candidates fall to the keep-floor (10) they are recycled back in. Node no longer permanently loses peers.

…efault unlimited)
One peer was flooding many incoming slots (same IP in 15-20 slots) -> wasted slots + redundant sends -> recvErr churn. New --max-inbound-per-ip N rejects incoming beyond N slots from one IP (loopback exempt). 0 = unlimited (default).
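The per-IP limit can be pictured as a counter keyed by remote address. A minimal sketch, assuming string addresses and a hypothetical `InboundLimiter` class; how the node actually stores slots and addresses is not shown in the commit message.

```cpp
#include <string>
#include <unordered_map>

class InboundLimiter {
public:
    // maxPerIp == 0 means unlimited, matching the flag's default.
    explicit InboundLimiter(unsigned maxPerIp) : maxPerIp_(maxPerIp) {}

    // Returns false when the new connection should be rejected.
    bool tryAccept(const std::string& ip) {
        if (maxPerIp_ != 0 && !isLoopback(ip) && counts_[ip] >= maxPerIp_) return false;
        ++counts_[ip];
        return true;
    }

    // Call when a previously accepted slot from this IP closes.
    void release(const std::string& ip) {
        auto it = counts_.find(ip);
        if (it == counts_.end()) return;
        if (--it->second == 0) counts_.erase(it);
    }

private:
    static bool isLoopback(const std::string& ip) {
        return ip.rfind("127.", 0) == 0 || ip == "::1";  // loopback is exempt
    }

    unsigned maxPerIp_;
    std::unordered_map<std::string, unsigned> counts_;
};
```

Releasing on close matters as much as counting on accept; otherwise a peer that reconnects a few times would be locked out permanently.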

The 25% periodic cull randomly dropped productive (handshaked) connections, forcing re-handshake churn. Exclude exchangedPublicPeers peers so the cull only rotates unproductive slots.

virtual_memory.h includes disk_shadow.h on every platform, but its <thread>/<filesystem>/<mutex> pull MSVC <process.h>, which collides with system.h's `system` macro in the test TUs (C2365) and broke the Windows test build. Fork rollback is Linux-only, so gate the heavy includes + DiskShadow + the real VM hooks behind __linux__ and provide inert pass-through hooks (and a no-op request-park) on other platforms. The CLI-settable flags stay cross-platform on <atomic>. Linux build unchanged.

Forces a single-tick fork + rollback (rewind to the tick-1 state, then strict replay) every N ticks on an AUX node (0 = off), to stress the rollback path at a controlled cadence instead of every tick. The forced tick establishes a fresh checkpoint and verdict forces the mismatch branch; MAIN mode and the strict re-run are skipped as usual. Validated live (.49 testnet MAIN + local AUX, N=5): 42 promotes all at tick%5==0, log-state digest matches the canonical MAIN at every checkpoint, no crashes.

Catch-up replays finalized ticks (no optimistic divergence); skip fork if behind the tip.

Markers (quiesce/locks, pre-fork, post-fork) localize the mainnet fork hang.

Replaces ~600 eager sched_yield-spinning threads that starved fork() on mainnet; fork hooks reworked (no respawn/park, lazy on reconnect).

Catch-up speedups + net robustness + incremental spectrum digest; kept lazy-spawn networking + fork-rollback, dropped legacy reprocess + old recv-queue.

Child inherited pinCount from non-surviving parent threads -> cache slots pinned forever -> fatal; clear pins (keep cache) + releaseThreadPins. CATCHUP_MAX_PREFETCH 128->20.

512 was 25x the only constraint (>= CATCHUP_MAX_PREFETCH=20); cuts AUX staging RAM + fork PTEs.

Configure resets a reused TcpData (emplace no-op left stale sendIo bound to the old fd); send/recv workers now bail on stop without writing a possibly re-armed token.

Last tick of epoch skips verdict and the next tick runs beginEpoch; retire any live checkpoint there and don't open one (a carried-over checkpoint would force a promoted child to replay strict across beginEpoch, which blocks on the operator clean-memory flag).

Pipe stdout is block-buffered by default; fork-rollback _exit() dropped buffered lines and the child re-emitted the inherited buffer. Line-buffering flushes per newline (no-op on a tty).
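Switching stdout to line buffering is a single standard-library call. A minimal sketch, with `lineBufferStdout` as a hypothetical wrapper name; it should run at startup, before anything is written to stdout.

```cpp
#include <cstdio>

// Flush stdout at every newline, even when it is a pipe rather than a tty.
// Returns true on success.
inline bool lineBufferStdout() {
    return std::setvbuf(stdout, nullptr, _IOLBF, 0) == 0;
}
```

With line buffering, a completed log line has already left the process before a fork, so `_exit()` in one process cannot lose it and the other process has no inherited copy to print again.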

…ned")
rpcUnixHandleConn is a one-shot detached thread with no PinScope, so tickData/ticks/tx pins taken by handlers leaked forever (thread_local arena dies unreleased) and accumulated to CACHE_PAGE over an epoch -> fatal at epoch end. Wrap dispatch in a PinScope.
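The fix relies on scope-bound release of pins. The sketch below shows the general RAII shape only: this `PinScope`, `CacheSlot` and `dispatchWithPinScope` are hypothetical and much simpler than whatever the codebase defines under that name.

```cpp
#include <vector>

struct CacheSlot {
    int pinCount = 0;  // a slot with pinCount > 0 must not be evicted
};

class PinScope {
public:
    PinScope() = default;
    PinScope(const PinScope&) = delete;
    PinScope& operator=(const PinScope&) = delete;

    // Every pin taken through this scope is dropped when the scope ends.
    ~PinScope() {
        for (CacheSlot* slot : pinned_) --slot->pinCount;
    }

    void pin(CacheSlot& slot) {
        ++slot.pinCount;
        pinned_.push_back(&slot);
    }

private:
    std::vector<CacheSlot*> pinned_;
};

// One-shot dispatch: pins are released on return, including early returns and exceptions.
template <typename Handler>
void dispatchWithPinScope(Handler handler) {
    PinScope scope;
    handler(scope);
}
```

Tying the release to the dispatch call rather than to thread teardown is what makes it safe for a detached thread that exits without running any cleanup of its own.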

Re-run starts from clean checkpoint, so minerSolutionFlags is authoritative; bypass re-applied rewards for solutions counted before the window -> spectrum mismatch (score correct, --fv passes).

…ow on AUX->MAIN, retry-then-fatal commit; add gtests
shadow commit now parks request procs + locks gRpcDispatchLock + honors drain so unlocked miss-IO writebacks can't race the /s->real rename; maybeForkBeforeTick retires a live window on AUX->MAIN (verdict is MAIN-gated); commit() retries then exit(1) instead of the false RAM-authoritative assumption.

… list)
SmartMutex/SmartSharedMutex + ACQUIRE census; gate skips fork->runs tick strict on a foreign-held lock (never crashes); overflow fail-safe; --no-fork-census; check_smart_mutex build gate; /v1/fork-stats + /v1/unforkable-ticks; gtests.

hackerby888 added 30 commits June 12, 2026 01:36

net: show bulk chunk counters (received/served) on the status line

eccf98a

diag: bulk apply/queue/frontier/future-vote counters in status line (temp)

79fd651

tune catch-up windows: REQUEST_SPAN 512->128, CATCHUP_MAX_PREFETCH 512->32

b23a3c0

catch-up: CATCHUP_MAX_PREFETCH 32->128

9c3bd3a

remove lite bulk catch-up (240/241 chunk protocol)

4c649c7

net: don't cull handshaked peers in the 120s peer refresh

3de6619

fine tune params

ce9851e

accept loopback by default

80c9ef2

hackerby888 and others added 30 commits June 16, 2026 23:49

fix comments

376dd13

merge main

114be0c

fork rollback: only fork at the network frontier, not while catching up

d3b5209

fork rollback: revert catch-up gate; log each BSP fork step

2ade182

net: lazy-spawn per-socket tx/rx workers (cv-blocked when idle)

0a6c74e

Merge fast-sync into feat/tick-fork-rollback

0c26405

fork rollback: reset swapVM pins on child promote; cap catch-up prefetch

8ea0191

net: FAST_TX_WINDOW_TICKS 512->64 (~2.4GB -> ~300MB)

9f6bf1b

net: fix per-socket worker reuse on slot reconnect + worker stop-check

e46d01f

rpc_routes.h: drop route-path divider comments that restate the RPC_ROUTE path

08b7f24

hide debug logs

9cd0ab5

log: line-buffer stdout so docker/pipe logs survive fork _exit

fee4f38

pump gForkWindowK

3db83fe

merge main

0e3773f

Merge main: checkin-thread SIGSEGV fix (serialize JSON per-call, fix async fetch captures)

3484ed3

fix fork re-run double-reward: drop isRevalidation dedup bypass

f50d3e7

merge main

3705094

fork-stats: guard gmtime_r for MSVC build

a2591ba

fix more subtle bugs

882736c

Increase gForkWindowK from 32 to 64

0dbe64e

Merge branch 'main' into feat/tick-fork-rollback

28a61b4

merge main

99f6b94

Merge branch 'main' into feat/tick-fork-rollback

b917597

fix duplicated options bug

326e51c

Feat/tick fork rollback #35
hackerby888 wants to merge 79 commits into main from feat/tick-fork-rollback

hackerby888 commented Jun 17, 2026
