ParparVM performance: parity with warmed Java 25 (geomean 1.00x) #5327
Open
shai-almog wants to merge 94 commits into
…ry GC, tagged Integer
A body of AOT performance work, all gated/validated against bit-identical checksums vs Java SE and the clean-C test path. Off by default where flagged.
- Small-value box caches for Integer/Long/Short/Character (valueOf -128..127), eliminating autoboxing allocation in tight loops.
- Bounds-check elimination: prove-safe pass for the canonical induction loop (ArrayLoadExpression/ArrayLengthExpression/Instruction), unlocking SIMD.
- Inlining of trivial monomorphic accessors (Invoke).
- Conditional-volatile locals (BytecodeMethod): emit `volatile` only when a method has try/catch/synchronized/calls, letting clang register-allocate and vectorize call-free compute loops (3.6x on array reduce, no regressions).
- Thread-local non-moving nursery GC behind -DCN1_NURSERY (cn1_globals.*, nativeMethods.m): in-place promotion, write barrier, adaptive survival-based bypass, block-lifecycle free-stack fix; main thread made lightweight so the concurrent GC pauses it. 2x on objectAllocation, off by default.
- Tagged small-Integer "poor man's Valhalla" behind -DCN1_TAGGED_INT, 64-bit pointers only (auto-off on armv7/armv7k/arm64_32): Integer.valueOf returns an immediate tagged pointer, GC ignores it, CN1_CLASS_OF substitutes Integer's class in dispatch/instanceof, value reads route through a tag-aware native, monitor ops NOP. Plus an inline tagged hashCode/equals dispatch fast path for collections. 2x on hashMapChurn (GC eliminated), bit-identical to HotSpot.
- Opt-in LTO flag (ByteCodeTranslator) for release/perf builds.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…owering
makeConcatWithConstants/makeConcat are desugared to a synthetic StringBuilder helper. Pre-size that StringBuilder from the recipe literals + per-argument length estimates so the common-case concat never grows its char[] (each growth is a fresh array + arraycopy). Over-estimates are harmless; under-estimates still grow correctly. Verified bit-identical to HotSpot on a concat microbench.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d-teaming
A comprehensive edge-case test (getClass, isInstance, instanceof, equals across tagged/heap/null/non-Integer, compareTo via TreeMap, all Number methods, HashMap/HashSet/TreeMap/ArrayList, Arrays.sort, switch, concat, synchronized, MIN/MAX_VALUE) crashed the -DCN1_TAGGED_INT build in four places the original benchmark never exercised. All were native/codegen paths dereferencing a tagged pointer's (nonexistent) object header:
- Object.getClassImpl: read this->header -> tag-aware (returns Integer.class).
- Class.isInstance(obj): read obj->header -> CN1_CLASS_OF + null guard.
- String equals-family: read arg->header->classId -> CN1_CLASS_OF(arg).
- Interface dispatch (e.g. Comparable.compareTo via TreeMap): the classId index read this->header->classId -> CN1_CLASS_OF (ByteCodeClass interface vtable gen).
- CN1_CLASS_OF itself: a plain ternary let clang if-convert and SPECULATIVELY load the faulting tagged header before the tag test (crash with no inline fast-path guard, e.g. interface compareTo). Reworked to select a valid object pointer first (a static JavaObjectPrototype proxy whose header is Integer's class), so the single header load is always on a dereferenceable address.
Result: full edge-case test bit-identical across default / tagged / tagged+nursery, and the Bench suite still matches HotSpot with no regression.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
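The CN1_CLASS_OF rework described in this commit can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in: the `Obj`/`Clazz` types, the low-bit tag encoding, `integerProxy`, and the class id 7 are assumptions for illustration, since the commit message does not show the real macro or tag layout. The sketch assumes 64-bit pointers and an arithmetic right shift on signed values, and shows only the idea of choosing a dereferenceable address before the single header load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the runtime's class record and object header. */
typedef struct Clazz { int classId; } Clazz;
typedef struct Obj { Clazz *header; } Obj;

static Clazz integerClass = { 7 };            /* assumed class id for Integer */
static Obj integerProxy = { &integerClass };  /* a real object: always safe to dereference */

#define TAG_BIT ((uintptr_t)1)

/* Encode a small int as an immediate "pointer": value in the upper bits, tag in bit 0.
   Assumes 64-bit pointers, so every int32_t fits. */
static inline Obj *tag_int(int32_t v) {
    return (Obj *)((((uintptr_t)(int64_t)v) << 1) | TAG_BIT);
}

static inline int is_tagged(const Obj *o) {
    return ((uintptr_t)o & TAG_BIT) != 0;
}

/* Recover the int; relies on arithmetic right shift of a negative intptr_t. */
static inline int32_t untag_int(const Obj *o) {
    return (int32_t)((intptr_t)(uintptr_t)o >> 1);
}

/* Select a valid object first, then do exactly one header load. Even if the
   compiler lowers the selection to a branchless select, the load never
   touches the fake address of a tagged value. */
static inline Clazz *class_of(Obj *o) {
    Obj *safe = is_tagged(o) ? &integerProxy : o;
    return safe->header;
}
```

The contrast is with a form like `is_tagged(o) ? &integerClass : o->header`, where the compiler may hoist the `o->header` load above the tag test.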
The inner chain walk (findNonNullKeyEntry) and key equality (areEqualKeys, with a pointer-== fast path that already short-circuits tagged-int keys) were already native. But get still went through translated-Java wrappers: get -> getEntry -> computeHashCode(key.hashCode()) -> findNonNullKeyEntry. Collapse those into one C function; for a tagged Integer key the hashCode is an inline untag via the dispatch fast path. Bit-identical to the Java getEntry path (EdgeTest default==tagged, full edge matrix). ~1.25x on hashMapChurn (6858 -> 5471ms, 20 reps), general (helps the default build too, not gated). First step of the native-collection-fast-path work: the algorithm in C beats HotSpot 3.5x at the ceiling, so collapsing the remaining wrappers (put) and ultimately open-addressing storage is the path to parity/better. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Same pattern as native get: collapse put/putImpl/computeHashCode into one C call, reusing the native chain walk and the Java createHashedEntry/rehash slow path. The only store this owns is entry.value = value, which carries an explicit CN1_WRITE_BARRIER (the Java version emitted one). Bit-identical (EdgeTest default==tagged unchanged, 8424060826785033831). hashMapChurn (20 reps, tagged): 5471 (get-only) -> 3952ms with put too; 6858 -> 3952 = 1.74x from native get+put. Now ~6.6x behind HotSpot (598ms), down from ~26x at session start. Remaining gap is the per-key Entry allocation (chaining) + createHashedEntry/rehash; open-addressing storage is the next lever (the C ceiling with no Entry objects beats HotSpot 3.5x). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
append(int)/append(long) were `append(Integer.toString(i))` -- a temporary String (plus its char[]) allocated on every call. Replace with native methods that write the decimal digits straight into the builder's char[] (digits generated in negative space so INT/LONG_MIN don't overflow). No per-append allocation. General (not gated). Validated bit-identical to HotSpot on a string-building microbench (append String/int/char/long chains + toString), which is now ~7.2x behind HotSpot (the ~13x tier). The char append/String append/charAt/getChars were already native. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
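The negative-space digit loop described above might look like the following sketch. It writes into a plain `char` buffer rather than the builder's Java `char[]`, and the name `append_int` is an assumption for illustration, not the native method's real signature. The point is that the value is normalized to a non-positive number, so `-INT32_MIN` is never computed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Writes the decimal form of v at dst (no terminator) and returns the number
   of characters written. dst needs room for 11 chars ("-2147483648"). */
static int append_int(char *dst, int32_t v) {
    char tmp[11];
    int pos = 11;
    int negative = v < 0;
    int32_t n = negative ? v : -v;   /* n <= 0 always; negating INT32_MIN is avoided */
    do {
        tmp[--pos] = (char)('0' - (n % 10));  /* n % 10 is in -9..0 (C99 truncation) */
        n /= 10;
    } while (n != 0);
    if (negative) tmp[--pos] = '-';
    memcpy(dst, tmp + pos, (size_t)(11 - pos));
    return 11 - pos;
}
```

A `long` variant would follow the same shape with `int64_t` and a 20-character scratch buffer.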
clear()/removeEntry() now recycle entries onto a free list (cn1FreeList, a GC-marked field; key/value nulled to release refs) instead of dropping them to GC, and createHashedEntry pops from the pool before allocating. After the first fill, churn patterns (fill/clear loops, add/remove steady state) allocate nothing -- the case a generational nursery can't help because the entries escape into the map. origKeyHash made non-final so pooled entries can be re-keyed. hashMapChurn (20 reps, tagged): 3952 -> 1782ms (2.2x). Now ~2.9x behind HotSpot (620ms), down from ~26x at session start (tagged ints -> native get -> native put -> entry pool). Validated: EdgeTest default==tagged unchanged, 8/8 GC stress, checksum bit-identical to HotSpot. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
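The entry pool can be sketched as a simple intrusive free list. This is an illustration under stated assumptions: the `Entry`/`Pool` types and function names are hypothetical, and `malloc` stands in for the GC allocation the real `createHashedEntry` performs. It shows the two moves the commit describes: release nulls key/value and pushes onto the free list, and create pops from the pool before allocating.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Entry {
    void *key, *value;
    int hash;             /* re-assignable so a pooled entry can be re-keyed */
    struct Entry *next;   /* chain link in the table, free-list link while pooled */
} Entry;

typedef struct Pool { Entry *freeList; } Pool;

/* Pop a recycled entry if one exists, otherwise allocate a fresh one. */
static Entry *pool_create(Pool *p, void *key, void *value, int hash) {
    Entry *e = p->freeList;
    if (e != NULL) {
        p->freeList = e->next;
    } else {
        e = malloc(sizeof *e);
        if (e == NULL) return NULL;
    }
    e->key = key;
    e->value = value;
    e->hash = hash;
    e->next = NULL;
    return e;
}

/* Recycle an entry: drop its references so the pooled entry pins nothing. */
static void pool_release(Pool *p, Entry *e) {
    e->key = NULL;
    e->value = NULL;
    e->next = p->freeList;
    p->freeList = e;
}
```

After the first fill, a fill/clear loop over such a pool performs no allocation, which is the steady state the commit measures.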
toString() previously always allocated a fresh String + copied the char[]. Now it SHARES the buffer with the returned String (via the offset/count String ctor) and sets `shared`. append() stays untouched -- it only writes beyond the String's view or reallocates via enlargeBuffer (which clears `shared`), so it's safe to share. Only the editing mutators (setCharAt/insert/delete/deleteCharAt/reverse/setLength) copy-on-write via cn1Unshare(). The copy-on-write scaffolding was already designed (commented out); this wires it through cn1Unshare(). Validated: a toString-then-mutate test (setCharAt/insert/delete/reverse/setLength, re-checking earlier Strings) is bit-identical to HotSpot; string-building bench bit-identical and 2191 -> 1541ms (~7.2x -> ~4.4x behind HotSpot); EdgeTest AOT unchanged. General (not gated) -- every toString in the system avoids a copy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
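A minimal sketch of the sharing scheme, assuming hypothetical `Builder`/`Str` types and `sb_*` names (the real code is Java plus the `cn1Unshare()` helper named above). It shows why append can stay untouched: it writes only past the end of any existing String's view, or moves to a fresh buffer, while an editing mutator copies first.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Builder { char *buf; size_t count, cap; int shared; } Builder;
typedef struct Str { const char *chars; size_t count; } Str;  /* immutable view */

static int sb_init(Builder *b, size_t cap) {
    b->buf = malloc(cap);
    b->count = 0; b->cap = cap; b->shared = 0;
    return b->buf != NULL;
}

/* Give the builder a private copy if a String currently views its buffer. */
static int sb_unshare(Builder *b) {
    if (!b->shared) return 1;
    char *copy = malloc(b->cap);
    if (copy == NULL) return 0;
    memcpy(copy, b->buf, b->count);
    b->buf = copy;    /* old buffer stays with its Strings (a GC would own it; leaked here) */
    b->shared = 0;
    return 1;
}

static int sb_append(Builder *b, char c) {
    if (b->count == b->cap) {            /* enlarge: fresh buffer, so sharing ends */
        size_t ncap = b->cap * 2 + 1;
        char *nb = malloc(ncap);
        if (nb == NULL) return 0;
        memcpy(nb, b->buf, b->count);
        if (!b->shared) free(b->buf);    /* a shared buffer must outlive the builder's use */
        b->buf = nb; b->cap = ncap; b->shared = 0;
    }
    b->buf[b->count++] = c;              /* lands past every existing String's view */
    return 1;
}

/* No copy: the String views the builder's buffer and the builder is flagged. */
static Str sb_to_string(Builder *b) {
    Str s = { b->buf, b->count };
    b->shared = 1;
    return s;
}

/* Editing mutator: copy-on-write before touching bytes a String may see. */
static int sb_set_char_at(Builder *b, size_t i, char c) {
    if (i >= b->count || !sb_unshare(b)) return 0;
    b->buf[i] = c;
    return 1;
}
```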
Methods that make calls couldn't use the fast leaf frame (the stack trace must keep their frame), so they paid a NON-INLINE initMethodStack() on entry and releaseForReturn() on exit -- two function calls per invocation, brutal for hot recursive/call-dense code. initMethodStack's only extra work vs the fast path is recording the class/method id (two array writes for the trace). Move both to static-inline (cn1InitMethodStackInline keeps the name recording; releaseForReturn inlined) so the C compiler folds the offset arithmetic and the call overhead is gone. Also adds the threadObjectStack-overflow guard the fast path already had.
recursion 6.66x -> 4.89x, hashMapChurn 4.6x -> 3.95x, quicksort/objectAllocation slightly better; compute unchanged (already inline via the fast frame). Bit-identical to HotSpot, EdgeTest unchanged. Broad: helps every call-dense method.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
POP_INT/POP_LONG/POP_OBJ used a non-inline pop(&SP) -- a function call for a pointer decrement, hit on every pop including hot return paths (return POP_LONG()). Make it static inline. Broad, helps all stack-popping code. Bit-identical (EdgeTest unchanged, fib result matches HotSpot). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…jects
Annotation-driven escape elimination (the AOT-correct replacement for the nursery, which was a synthetic win + ~10% universal write-barrier tax). A class marked @com.codename1.annotations.StackAllocate has each `new` lowered to a method-scoped C struct instead of codenameOneGcMalloc: no malloc, no heap registration, no GC mark/sweep -- the object dies with the frame. Intended for internal short-lived value/temporary types where non-escape is known by construction (the developer asserts it; violating it dangles).
Mechanics:
- StackAllocate: TYPE-target, CLASS-retention marker annotation.
- Parser detects it at class level -> ByteCodeClass.stackAllocatable.
- BytecodeMethod pre-scans each method and declares one frame-scoped `struct obj__T __cn1stk_<site>;` per @StackAllocate NEW site (reused across loop iterations -- only one instance per site is live at a time).
- TypeInstruction NEW replicates exactly what __NEW_T does (run the static initializer, set the same header fields codenameOneGcMalloc sets) but SKIPS heap registration, so the sweep never visits it. Its pointer rides the operand stack, so the GC still reaches it as a root and scans its fields -- any heap objects it references stay live.
Tax-free and opt-in: codegen only diverges when stackAllocId>=0, so non-annotated code is byte-for-byte unchanged.
Validated:
- 60M-iteration non-escaping temporary (Vec2): 4.51x faster than the heap path (45x -> 10x behind HotSpot), bit-identical checksum vs heap build and HotSpot.
- GC red-team: a @StackAllocate Holder owning a heap Payload with System.gc() forced mid-loop -> bit-identical to HotSpot, no premature collection, no crash (proves the GC marks through the stack object).
- Full parparvm-bench suite (zero annotations) still bit-identical to HotSpot.
Residual 10x vs HotSpot is the per-iteration memset + header init + operand-stack traffic that full scalar replacement (object -> field locals) would remove next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…registers
Builds on the @StackAllocate stack-alloc foundation (d4185da). A primitive-only @StackAllocate object used as a simple local temporary is now turned into a pure C local struct whose address is NEVER taken, so clang's SROA promotes its fields to registers and the object vanishes -- matching what HotSpot's escape analysis does, which the prior stack-alloc path could not because the struct's address escaped to the GC-scanned operand stack (measured: that escape alone cost 2.4x).
Transform (a conservative, bail-on-doubt pass in BytecodeMethod.optimize()): recognize NEW X; DUP; <args>; INVOKESPECIAL X.<init>; ASTORE n where
- X is @StackAllocate, a DIRECT Object subclass, primitive-only instance fields, no <clinit> (so dropping super.<init>/static-init is sound, and there are no heap refs the GC must scan -> the object need never be a GC root);
- X.<init> is exactly Object.<init> + a param->field bijection (every field assigned exactly once from a distinct ctor param of matching type) -- analyzed by srAnalyzeCtor, else bail;
- local n is used ONLY as ALOAD n; GETFIELD X.f (srValidateLocalUses: any other use -- pass/return/PUTFIELD/second store/type-confusion -- bails);
- the arg region has no nested NEW/<init>/stack-shuffle/branch, else bail.
Then: NEW emits nothing (no header/memset/PUSH); DUP and ASTORE are dropped; INVOKESPECIAL <init> becomes ScalarAllocInit, which folds the (already reduced) arg expressions straight into __cn1sr_<id>.field = <expr> (or, if an arg isn't a pure expression, falls back to popping the operand stack in order -- both are stack-balanced); GETFIELD on local n becomes direct __cn1sr_<id>.field. Anything not matching keeps today's GC-safe stack-alloc codegen. Off-by-default escape hatch: DISABLE_SCALAR_REPLACE.
Validated (independently rebuilt + re-run, not just the implementing agent):
- SA (60M non-escaping Vec2 long-field temporaries): generated work() has 0 get_field/PUSH_POINTER/__NEW/Vec2___INIT (struct register-promoted), checksum bit-identical to HotSpot, 528ms -> 120ms (4.40x faster than stack-alloc).
- SA2 (Holder with a HEAP Payload field, System.gc() forced mid-loop): primitive-only gate BAILS (0 __cn1sr_), keeps stack-alloc, bit-identical, no crash. The critical GC-safety gate.
- Full parparvm-bench suite (51 checksums, zero annotations): all bit-identical to HotSpot. Scalar replacement is a clean no-op on un-annotated code.
Residual vs HotSpot (2.35x) is ambient ParparVM frame/line scaffolding (__CN1_DEBUG_INFO per-source-line stores), orthogonal to object handling -- the object-elimination win is fully realized (the hand-C floor for this loop is 36ms, below HotSpot's 51ms).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
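The difference between the stack-alloc shape and the scalar-replaced shape can be sketched in plain C. This is an illustration, not the generated code: `Vec2`, `operand_slot`, and both function names are hypothetical, and the volatile pointer merely stands in for a GC-scanned operand-stack slot. In the first function the struct's address escapes, so its fields must live in memory; in the second the address is never taken, which is what makes the struct eligible for register promotion.

```c
#include <assert.h>
#include <stdint.h>

typedef struct Vec2 { int64_t x, y; } Vec2;

/* Stand-in for an operand-stack slot the collector could scan. */
static void *volatile operand_slot;

/* Stack-alloc shape: the object is a frame-scoped struct, but its address
   is published, so every field access goes through memory. */
static int64_t work_stack_alloc(int64_t n) {
    int64_t sum = 0;
    Vec2 v;
    for (int64_t i = 0; i < n; i++) {
        v.x = i;
        v.y = i * 2;
        operand_slot = &v;
        Vec2 *p = (Vec2 *)operand_slot;
        sum += p->x + p->y;
    }
    return sum;
}

/* Scalar-replaced shape: the address is never taken, so the compiler is
   free to keep x and y in registers and drop the struct entirely. */
static int64_t work_scalar_replaced(int64_t n) {
    int64_t sum = 0;
    for (int64_t i = 0; i < n; i++) {
        Vec2 v = { i, i * 2 };
        sum += v.x + v.y;
    }
    return sum;
}
```

Both functions compute the same checksum, which mirrors the bit-identical requirement the commit validates against.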
…torize it)
A Java long comparison `a < b` compiles to LCMP (three-way -1/0/1) + IFxx, which
the translator emitted as `CN1_CMP_EXPR(a,b) <op> 0` -- a `(a==b)?0:(a>b)?1:-1`
chain compared to zero. clang cannot recover the loop trip count through that, so
long-counted loops were neither analyzed nor vectorized. Measured: it was THE
residual on the scalar-replaced @StackAllocate benchmark -- replacing it with a
direct comparison was 2.07x and took that loop from 2.35x HotSpot to parity.
Fix: when an LCMP ArithmeticExpression feeds an IFxx branch-on-zero, emit the
direct `(a <op> b)` instead (ArithmeticExpression.getLongCompareDirect, used in
the IFxx branch-fusion in BytecodeMethod). Long only -- float/double (FCMPx/DCMPx)
keep CN1_CMP_EXPR because their NaN ordering differs from a direct C comparison.
Safe and bit-identical: the folded operands are pure (the reducer only folds
loads/constants/pure expressions), so `(a<op>b)` evaluated once equals
`CN1_CMP_EXPR(a,b)<op>0` for every long value -- and avoids the macro's
double-evaluation of each operand. General: helps every long-counted loop, not
just @StackAllocate.
Validated (bit-identical to HotSpot):
- Long-edge test: all 6 operators (< <= > >= == !=) over {Long.MIN, MAX, -1, 0,
1, MIN+1, MAX-1} (81 pairs) -- checksum identical, fusion fired (0 CN1_CMP_EXPR).
- Full parparvm-bench suite (51 checksums) -- all identical.
- SA (scalar-replaced Vec2 loop) -- identical, 120ms -> 56ms = 1.08x HotSpot
(was 2.35x); SA2 unaffected, identical.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
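The equivalence this commit relies on can be checked with a small sketch. `CMP3` below is a stand-in written from the `(a==b)?0:(a>b)?1:-1` description above, not the real `CN1_CMP_EXPR` definition, and the function names are assumptions. The first loop carries the three-way compare in its exit condition; the second uses the direct comparison the fused codegen emits.

```c
#include <assert.h>
#include <stdint.h>

/* Three-way compare in the style described above. Note that each operand
   may be evaluated more than once. */
#define CMP3(a, b) (((a) == (b)) ? 0 : ((a) > (b)) ? 1 : -1)

/* Loop whose exit test goes through the three-way compare. */
static int64_t count_three_way(int64_t n) {
    int64_t trips = 0;
    for (int64_t i = 0; CMP3(i, n) < 0; i++) trips++;
    return trips;
}

/* Same loop with the direct comparison. */
static int64_t count_direct(int64_t n) {
    int64_t trips = 0;
    for (int64_t i = 0; i < n; i++) trips++;
    return trips;
}

/* Returns 1 if all six operators agree between the two forms on a set of
   edge values (every ordered pair is checked). */
static int fused_matches_three_way(void) {
    static const int64_t edge[] = {
        INT64_MIN, INT64_MIN + 1, -1, 0, 1, INT64_MAX - 1, INT64_MAX
    };
    const int n = (int)(sizeof edge / sizeof edge[0]);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int64_t a = edge[i], b = edge[j];
            int c = CMP3(a, b);
            if ((c < 0) != (a < b)) return 0;
            if ((c <= 0) != (a <= b)) return 0;
            if ((c > 0) != (a > b)) return 0;
            if ((c >= 0) != (a >= b)) return 0;
            if ((c == 0) != (a == b)) return 0;
            if ((c != 0) != (a != b)) return 0;
        }
    }
    return 1;
}
```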
The per-source-line __CN1_DEBUG_INFO store (callStackLine[frame] = line) was the last hot-path cost keeping tight loops out of registers -- it was the entire residual on the scalar-replaced @StackAllocate benchmark (56ms -> 40ms once gone). A frame's reported trace line is only ever read at a capture/throw/call site, and every such site lives on a line that calls, allocates, or does a throwing op (field/array/div/new/athrow). A line whose every instruction is non-throwing and non-calling (primitive arithmetic, local load/store, constants, compares, branches, conversions) can therefore NEVER be the line a trace reports -- so eliding its store is trace-IDENTICAL, not a line-number regression.
Implementation:
- BytecodeMethod.analyzeElidableLineInfo() marks each LineNumber whose source line has no throwing/calling instruction (canThrowOrCall(): conservative -- default keep; only an explicit non-throwing whitelist is elidable; numeric/String LDC and a scalar-replaced NEW are non-throwing; integer div/rem, array/field/static access, invoke, new*, athrow, checkcast, monitor are kept). Runs AFTER scalar replacement so a scalar-replaced object's now-pure NEW/<init>/field access is seen as non-throwing.
- LineNumber emits the elidable store as __CN1_DEBUG_INFO_NT, which is the full store under the on-device debugger (which steps line-by-line and needs every line) and a no-op in release/device builds -- where it removes the only per-line cost. Throwing/calling lines keep __CN1_DEBUG_INFO, so the reported line is always live and exact.
Validated:
- Full parparvm-bench suite (51 checksums) bit-identical to HotSpot -- execution unchanged; the elision applies to every method with no regression.
- SA (scalar-replaced Vec2 loop): all hot lines elide, checksum bit-identical, release 56ms -> 40ms = 0.62x HotSpot (BELOW the JIT, at the hand-C floor). SA2 (object field, gc() forced) bit-identical.
Note: empirical printStackTrace trace validation is blocked in the standalone `clean` target by a PRE-EXISTING trace-builder crash on null constant-pool strings (both elision-on and elision-off segfault identically -- unrelated to this change); trace-identity rests on the construction argument above + bit-identical execution.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
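The two-macro scheme might be sketched as follows. The macro names, the `ON_DEVICE_DEBUG` flag, and the function are hypothetical; the real `__CN1_DEBUG_INFO` / `__CN1_DEBUG_INFO_NT` definitions are not shown in the commit message. Lines that only do local arithmetic use the elidable form, while the line containing an integer division (a throwing op) keeps the full store.

```c
#include <assert.h>

#define MAX_FRAMES 16
static int callStackLine[MAX_FRAMES];

/* Full store: used on lines that can throw or call. */
#define DEBUG_INFO(frame, line) (callStackLine[(frame)] = (line))

/* Elidable store: full store under a debugger build, no-op otherwise.
   ON_DEVICE_DEBUG is an assumed flag name for this sketch. */
#ifdef ON_DEVICE_DEBUG
#define DEBUG_INFO_NT(frame, line) DEBUG_INFO(frame, line)
#else
#define DEBUG_INFO_NT(frame, line) ((void)0)
#endif

static int scaled_sum(int frame, int n, int d) {
    int s = 0;
    DEBUG_INFO_NT(frame, 10);        /* line 10: local init, cannot throw or call */
    for (int i = 0; i < n; i++) {
        DEBUG_INFO_NT(frame, 11);    /* line 11: pure arithmetic in the hot loop */
        s += i * 3;
    }
    DEBUG_INFO(frame, 12);           /* line 12: integer division can throw, keep it */
    return s / d;
}
```

In either build the frame's recorded line is 12 by the time the division executes, which is the only point at which a trace could read it.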
Parallelizes the transitive mark DRAIN across a persistent worker pool while leaving codenameOneGCMark's per-thread park / root-snapshot logic unchanged, so snapshot-at-the-beginning (mark all of a thread's reachable set before releasing it) is preserved. Marking was already type-specialized (per-class markFunction, leaf types skipped); this adds the parallelism.
- gcMarkObject parallel path claims unmarked->marked with an atomic CAS (__sync_bool_compare_and_swap); only the winner pushes. force/recursionKey re-scan stays entirely on the serial path (force is never set in parallel).
- Worklist: shared array under a mutex; each worker pops a 64-entry batch and buffers produced children in a __thread-local buffer, flushing in batches (broadcast wakes idle workers). Termination: a worker idles only when the shared worklist is empty AND its local buffer is flushed; the last worker to idle sets gcMarkDone. Overflow still falls back to the serial heap-rescan fixed point; the nursery promote path and force re-scan stay serial.
- The __thread worklist-buffer pointer doubles as the "am I a parallel worker?" discriminator: when NULL (the GC thread between drains, N=1, overflow rescan) gcMarkObject/push take the ORIGINAL serial code verbatim -- no atomics, no lock.
- CN1_GC_MARK_THREADS overrides the marker count; default min(4, ncpus-1) at runtime; N=1 is byte-for-byte the previous behavior (no pool, no atomics).
Validated: full parparvm-bench suite bit-identical to HotSpot at N=1 AND N=4; serial==parallel checksums identical; ThreadSanitizer clean on all introduced mark-state synchronization (the remaining TSan reports are the collector's pre-existing, inherent non-STW collector-vs-mutator reads -- unmodified HEAD shows the same class of reports); GC stress (millions of objects, ~120 GCs/run) stable and identical across 5 runs.
Measured-as-a-whole impact (vs serial mark, min-of-reps): objectAllocation 306->280ms (1.09x), everything else within noise. Marking is ~19.5% of the GC-bound time and the bench's live-set-per-GC is small, so the whole-suite gain is modest; the bulk of the GC gap (allocation fast path + concurrent-collector throttle) is the next, larger target.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
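The claim step of the parallel path can be sketched with the same GCC/clang builtin the commit names. `GcObj`, `LocalBuf`, and `mark_claim` are hypothetical names, and the local buffer stands in for the `__thread` worklist buffer; flushing to the shared, mutex-protected worklist is left out. The property shown is that exactly one caller wins the unmarked->marked transition and only that caller queues the object.

```c
#include <assert.h>
#include <stddef.h>

typedef struct GcObj { int mark; } GcObj;

#define LOCAL_CAP 64
typedef struct LocalBuf { GcObj *items[LOCAL_CAP]; size_t len; } LocalBuf;

/* Returns 1 if this caller claimed obj and queued it, 0 if it was already
   marked or another worker won the race, -1 if the local buffer is full
   (the caller would flush to the shared worklist and retry). */
static int mark_claim(GcObj *obj, int currentMark, LocalBuf *buf) {
    if (buf->len == LOCAL_CAP) return -1;
    int seen = __atomic_load_n(&obj->mark, __ATOMIC_RELAXED);
    if (seen == currentMark) return 0;
    if (!__sync_bool_compare_and_swap(&obj->mark, seen, currentMark)) return 0;
    buf->items[buf->len++] = obj;    /* only the CAS winner pushes */
    return 1;
}
```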
…ects
Replaces per-object calloc + allObjectsInHeap registration + per-object free() with a non-moving segregated-fits (BiBOP) page heap for small non-array objects. Arrays and objects > MAX size class keep the verbatim legacy aligned calloc + allObjectsInHeap path -- so real array offsets, stable addresses, and SIMD/GPU alignment are untouched. NON-MOVING (objects never move -> no write barrier, no pointer fix-ups / shadowing -- the reasons generational was rejected do not apply) and NON-GENERATIONAL (whole-heap collect).
Design:
- 64KB posix_memalign'd pages, one size class each (15 classes 32..512B; >512 or arrays -> legacy path). Every slot >=16-byte aligned.
- Allocation: per-thread (__thread) current page per class; pop the page free-list else bump the cursor (lock-free, thread-local). Page full -> retired to a global SWEEP stack (atomic CAS push); grab a fresh/partial page from the pool (one bibopMutex acquisition per page, not per object). Slot re-zeroed + header set exactly as codenameOneGcMalloc. Small objects are NOT registered in allObjectsInHeap -- the pages track them.
- Liveness: the existing per-object epoch mark (__codenameOneGcMark) stays the single source of truth, so gcMarkObject + the parallel mark pool + the proven grace semantics (mark==-1 grace, mark<cur-1 dead) are UNCHANGED and work uniformly on page slots and legacy table objects. No per-page bitmap, no address->page table.
- Sweep: rebuild each retired page's free-list from its slot headers (finalizers still run); an all-dead page returns to the pool. Then the existing allObjectsInHeap sweep handles large/array objects.
The three correctness hinges:
1. Allocate-during-GC: a fresh slot is mark==-1 (one-cycle grace) AND lives on the thread's OWNED current page, which the concurrent sweep never touches (only retired pages, owner==0, are swept).
2. Sweep vs alloc: a page has exactly one role at a time -- OWNED (one thread allocates, never swept) -> retired to the SWEEP stack -> swept (owner==0) -> FREE/PARTIAL pool. The sweep snapshots the stack via atomic_exchange. No page is ever allocated-into and swept simultaneously.
3. No page-table race: dissolved -- header marking needs no address->page lookup; the append-only all-pages registry (release/acquire) is read only by the overflow rescan, and only at a slot whose atomically-read mark == current cycle.
Escape hatch: #ifndef CN1_DISABLE_BIBOP (default ON); -DCN1_DISABLE_BIBOP reverts to the verbatim legacy collector. Independent of CN1_NURSERY (kept off).
Validated (macOS arm64): full parparvm-bench suite bit-identical to HotSpot with BiBOP ON, -DCN1_DISABLE_BIBOP, and across 1/4/8 mark workers and forced worklist overflow (-DCN1_GC_MARK_WORKLIST_SIZE=256, exercising the page rescan). TSan: zero races on any BiBOP state (pages/pools/free-lists/registry/cursor) -- 111 reports vs the legacy baseline's 119, all the pre-existing collector-vs-mutator object-header family. GC-stress + 4-thread allocate-during-GC stress: checksums identical across runs and to legacy/HotSpot (a single lost live object would diverge). RSS 24-26% LOWER than legacy and bounded over 2000 rounds (pages recycled, no drift).
Measured as a whole vs warmed Java 25 (+AOT cache), min-of-reps: objectAllocation 278->144ms (1.93x; 15.1x->7.8x vs Java25), stringBuilding 1.14x, hashMapChurn 1.05x, compute/arrays unchanged. Whole-suite geomean vs Java25 2.26x -> 2.08x, zero regression.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
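The per-page allocation step ("pop the page free-list else bump the cursor") can be sketched for a single size class. The `Page` layout and function names are assumptions for illustration; header stamping, the mark field, page retirement, and all atomics are omitted, so this shows only the single-threaded shape of the fast path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 65536u

typedef struct Page {
    unsigned char *base;   /* PAGE_BYTES of slot storage, at least 16-byte aligned */
    size_t slotSize;       /* one size class per page; must be >= sizeof(void *) */
    size_t bumpIndex;      /* index of the next never-used slot */
    void *freeList;        /* recycled slots, linked through their first word */
} Page;

/* Pop a recycled slot if there is one, otherwise bump the cursor.
   Returns NULL when the page is full (the caller would retire it). */
static void *page_alloc(Page *p) {
    void *slot = p->freeList;
    if (slot != NULL) {
        memcpy(&p->freeList, slot, sizeof(void *));   /* next pointer lives in the slot */
    } else {
        if ((p->bumpIndex + 1) * p->slotSize > PAGE_BYTES) return NULL;
        slot = p->base + p->bumpIndex * p->slotSize;
        p->bumpIndex++;
    }
    memset(slot, 0, p->slotSize);   /* re-zero before any header would be stamped */
    return slot;
}

/* What a sweep would do for one dead slot: thread it onto the free list. */
static void page_free_slot(Page *p, void *slot) {
    memcpy(slot, &p->freeList, sizeof(void *));
    p->freeList = slot;
}
```

Because the slot size is fixed per page and the base is aligned, every slot address is `base + k * slotSize`, which is what lets liveness live in the slot headers without a per-page bitmap.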
…x->1.6x HotSpot)
The per-call frame bookkeeping -- DEFINE_METHOD_STACK's per-call memset of the locals+operand-stack elementStruct region, the callStackOffset bump/check, the releaseForReturn offset restore, and the per-line __CN1_DEBUG_INFO stores -- is pure overhead. A method that holds ZERO object references in its frame contributes no GC roots, so the precise collector has nothing to scan there and the frame can be eliminated outright. No GC change, no operand-stack rewrite (an SSA-temp rewrite was measured NOT to help and was skipped); instruction bodies stay byte-identical, so this is bit-identical by construction.
- isFramelessEligible() (BytecodeMethod): conservative whitelist on raw bytecode -- static, primitive-or-void return, no object args/locals, no object operand-stack value, no try/catch, not synchronized/native/on-device-debug, and every opcode in the handled primitive set (loads/stores/consts/arithmetic incl. throwing div-rem/shifts/bitwise/conversions/compares/branches/switch/dup-pop-swap/returns + INVOKESTATIC with a purely primitive/void descriptor). Anything else -> ineligible -> byte-identical legacy codegen.
- DEFINE_METHOD_STACK_FRAMELESS (cn1_globals.h): the operand stack is a method-local C array (not a threadObjectStack slice) -- no per-call memset, no offset bookkeeping, no callStack push; emits CN1_FRAMELESS_SOE_GUARD.
- CN1_FRAMELESS_SOE_GUARD: frameless methods don't bump callStackOffset, so deep non-tail recursion is guarded by comparing __builtin_frame_address(0) to a lazily cached per-thread nativeStackLimit (pthread_get_stackaddr_np - stacksize + 256KB band; 8MB frame-anchored fallback) -- throws StackOverflowError instead of SIGBUS. __builtin_expect hints are load-bearing (177->147ms without/with).
- Return sites (BasicInstruction x5 + optimize()'s two return fast-paths) emit plain return with no releaseForReturn; LineNumber suppresses __CN1_DEBUG_INFO for frameless methods (no callStackOffset to index).
Gate: -Dcn1.frameless (default ON); OFF emits byte-identical-to-HEAD code.
Validated: full Bench suite bit-identical to HotSpot frameless ON and OFF; OFF byte-identical generated C to HEAD; 11 methods frameless in the suite. Deep non-tail recursion throws StackOverflowError, not SIGSEGV.
Measured vs warmed Java 25+AOTcache: recursion 436->150ms = 2.92x faster (ON vs OFF), 4.64x -> 1.59x HotSpot; every other benchmark within noise (no regression).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… methods (opt-in)
Phase 3b of the conservative-collector endgame: extends frameless codegen from primitive-only methods (committed 0260fe8) to OBJECT-BEARING methods, with the conservative native-stack scan as a real GC root source. A frameless object method keeps its object refs in native C locals / a method-local operand-stack array (no DEFINE_METHOD_STACK frame, no threadObjectStack, no per-call memset); the GC finds those roots by conservatively scanning the thread's native C stack. Enabled by the non-moving BiBOP heap (conservative scanning requires non-moving).
Gated: #ifdef CN1_CONSERVATIVE_GC_ROOTS (the runtime) + -Dcn1.frameless.objects (the codegen); DEFAULT OFF -- the default build is byte-identical to HEAD (precise GC + primitive-only frameless). The proven path (P1 resolver / P2 native-stack scan / P3a zero-miss root-placement) is now production, not validation.
- cn1ConservativeResolve(word)->object base|NULL: BiBOP page-aligned candidate + all-pages-registry binary search + interior pointers + large/array extents; marks for real (cn1ConservativeMarkRange).
- HYBRID GC: codenameOneGCMark keeps the precise threadObjectStack scan for legacy frames AND conservatively scans each stopped thread's native stack [sp,base) + register snapshot for frameless frames; explicit roots (currentThreadObject, statics, constant pool, pending native allocations) retained. The conservative scan covers the whole native stack, so the legacy<->frameless caller/callee boundary is never a gap.
- Universal thread-stopping: cooperative (CN1_GC_PARK_CAPTURE setjmp + SP at every safepoint, proven) for lightweight threads; signal-based (SIGUSR2 + ucontext SP/reg capture) for genuine native threads, opt-in (CN1_GC_SIGNAL_STOP).
- Object-frameless eligibility extends the whitelist to ALOAD/ASTORE, GETFIELD/PUTFIELD/GET-PUTSTATIC, NEW/ANEWARRAY/CHECKCAST/INSTANCEOF, array ops, all invokes (args as explicit C params), ACONST_NULL/IF_ACMP*/IFNULL, String/Class LDC. Excluded: try/catch, ATHROW, MONITOR*, MULTIANEWARRAY -> stay legacy. Instruction bodies byte-identical (win is frame elimination, not re-lowering).
Validated (CN1_CONSERVATIVE_GC_ROOTS + -Dcn1.frameless.objects): full Bench suite bit-identical to HotSpot (72 frameless methods: 12 primitive + 60 static object); default (gates off) byte-identical to HEAD; GcStress 25x and 4-thread MtStress 30x == HotSpot with bounded RSS (no leak); the transient ⊇ self-check (CN1_CONSERVATIVE_GC_SELFCHECK) reports MISS=0 (every precise root also resolved conservatively). GcStress 5x re-confirmed == HotSpot here.
HONEST STATUS:
- PERF-NEUTRAL today: the frame-elimination win is offset by an UNOPTIMIZED conservative scan (the heap-membership snapshot is rebuilt O(heap) per-thread-per-GC). The once-per-GC optimization (born-marked new BiBOP objects) is the next step to make object-frameless a net win on GC-heavy code; recursion's win is preserved (no GC in the loop). That's why this ships OPT-IN, default off.
- INSTANCE-method frameless (-Dcn1.frameless.instance) and the SIGNAL-stop path have intermittent multi-thread races (DONE 0 / ~8-10%) NOT root-caused -> gated OFF. The static + cooperative path (what's validated above) is solid (30/30 MT).
- Conservative GC is incompatible with CN1_NURSERY (deprecated); frameless methods don't appear in callStack-based stack traces (printStackTrace doesn't crash).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
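The page-based part of the conservative resolver (page-aligned candidate, registry binary search, interior pointer to slot base) can be sketched as below. `PageInfo`, `resolve`, and the sorted-array registry are assumptions for illustration; the real `cn1ConservativeResolve` also handles large/array extents and marks the result, neither of which is modeled. The sketch works on raw address values and never dereferences them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES ((uintptr_t)65536)

typedef struct PageInfo {
    uintptr_t base;      /* PAGE_BYTES-aligned start address of the page */
    size_t slotSize;     /* size class of every slot on this page */
    size_t usedSlots;    /* slots handed out so far */
} PageInfo;

/* pages[] must be sorted by base. Returns the base address of the slot that
   contains word, or 0 if word does not point into an allocated slot. */
static uintptr_t resolve(uintptr_t word, const PageInfo *pages, size_t n) {
    uintptr_t candidate = word & ~(PAGE_BYTES - 1);   /* round down to a page boundary */
    size_t lo = 0, hi = n;
    while (lo < hi) {                                  /* lower-bound binary search */
        size_t mid = lo + (hi - lo) / 2;
        if (pages[mid].base < candidate) lo = mid + 1;
        else hi = mid;
    }
    if (lo == n || pages[lo].base != candidate) return 0;   /* not a heap page */
    size_t slot = (size_t)(word - candidate) / pages[lo].slotSize;
    if (slot >= pages[lo].usedSlots) return 0;               /* never allocated */
    return candidate + slot * pages[lo].slotSize;            /* interior pointer -> slot base */
}
```

A stack word that happens to look like a heap address keeps the object alive, which is safe only because the heap is non-moving: nothing ever needs to rewrite that word.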
…hread)
java.lang.Thread.alive was set to true inside java_lang_Thread_runImpl, which runs
on the WORKER thread asynchronously after start() returns. java_lang_Thread_start__
only did pthread_create. So a thread doing start() then join() could race: join()
-> isAlive() reads false (worker not yet scheduled) and returns IMMEDIATELY, before
any of the worker's writes were published -- e.g. main summing a worker-filled
results[] array could read it still zero. Classic "started-state not set
synchronously by the starting thread" bug; present on every port, ~15% repro in a
4-thread join-then-read stress (vs HotSpot fully deterministic).
Fix: set the alive flag synchronously on the CALLING thread, in program order before
the worker is spawned, in java_lang_Thread_start__. A later join() then correctly
blocks until the worker clears alive under the monitor (runImpl:
synchronized{ alive=false; notifyAll(); }), and that monitor release/acquire is the
happens-before edge that publishes the worker's writes. Purely additive
synchronization; bit-identical to HotSpot on the full Bench suite. MtStress
3/20-failing -> 50/50 deterministic == HotSpot after the fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
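The ordering fix can be sketched with POSIX threads. This is an illustration under assumptions: `JThread`, `jthread_start`, `jthread_join`, and `fill_results` are hypothetical names, and a mutex plus condition variable stands in for the Java monitor. The key line is the `alive = 1` store made by the calling thread before `pthread_create`, so a later join can never observe a not-yet-alive thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct JThread {
    pthread_t native;
    pthread_mutex_t lock;     /* stands in for the Thread object's monitor */
    pthread_cond_t done;
    int alive;
    void (*run)(void *);
    void *arg;
} JThread;

/* Worker entry point: run the body, then clear alive under the lock. */
static void *run_impl(void *p) {
    JThread *t = p;
    t->run(t->arg);
    pthread_mutex_lock(&t->lock);
    t->alive = 0;
    pthread_cond_broadcast(&t->done);
    pthread_mutex_unlock(&t->lock);   /* release publishes the worker's writes */
    return NULL;
}

static int jthread_start(JThread *t, void (*run)(void *), void *arg) {
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->done, NULL);
    t->run = run;
    t->arg = arg;
    t->alive = 1;                     /* set by the CALLER, before the worker exists */
    if (pthread_create(&t->native, NULL, run_impl, t) != 0) {
        t->alive = 0;
        return -1;
    }
    pthread_detach(t->native);
    return 0;
}

/* Blocks until the worker has cleared alive; the lock acquire pairs with
   the worker's release, so its writes are visible afterwards. */
static void jthread_join(JThread *t) {
    pthread_mutex_lock(&t->lock);
    while (t->alive) pthread_cond_wait(&t->done, &t->lock);
    pthread_mutex_unlock(&t->lock);
}

/* Example worker body: fills a 4-element int array with 1..4. */
static void fill_results(void *arg) {
    int *results = arg;
    for (int i = 0; i < 4; i++) results[i] = i + 1;
}
```

If `alive` were instead set inside `run_impl`, `jthread_join` could see 0 before the worker was scheduled and return immediately, which is the race the commit describes.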
…rameless
Flip the Phase-3b gates to default ON (arm64-validated -- the dev machine is Apple Silicon arm64, same arch as the iOS device target; CI validates the other ABIs):
- cn1_globals.h: #define CN1_CONSERVATIVE_GC_ROOTS by default (disable with -DCN1_DISABLE_CONSERVATIVE_GC_ROOTS).
- BytecodeMethod: cn1.frameless.objects + cn1.frameless.instance default true. The instance-frameless multi-thread failure that previously gated it was the pre-existing Thread.start/join visibility race, fixed in 9933311.
Default build now: 302 frameless methods (was 12 primitive-only), bit-identical to HotSpot, no per-call frame on object/instance methods, roots found by the conservative native-stack scan.
Validated: full Bench suite bit-identical; GcStress 5x == HotSpot, no crash/leak. Cooperative thread-stop covers Java threads (what the bench exercises); native-thread coverage via the signal path (CN1_GC_SIGNAL_STOP) stays the edge for CI/on-device.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…LINE_ALLOC)
The build ships no LTO, so __NEW_<X> and codenameOneGcMalloc live in separate translation units and clang cannot inline them: every escaping new-site pays two real cross-TU calls (confirmed in asm). CN1_FAST_NEW(X) inlines the BiBOP per-thread bump common case at the allocation site (pointer-bump + header stamp, size-class index folded to a compile-time literal via CN1_BIBOP_CIDX), falling back to __NEW_<X> only on page-full / free-list / oversized / ineligible. The bump replicates cn1BibopAlloc bit-for-bit (relaxed bumpIndex load, mark released last, cursor release-stored after slot init) so the concurrent-GC correctness argument is unchanged. bibopCurrent[]/bibopBytesSinceGc + struct CN1BibopPage are lifted to the header for the inline; the .m keeps a _Static_assert that the size-class array still matches.
Gated -DCN1_INLINE_ALLOC, default OFF (pending iOS on-device validation of the statement-expression macro, as with the conservative GC). With the flag off CN1_FAST_NEW(X) expands verbatim to __NEW_<X>, so the default build is byte-identical.
Validated (arm64 macOS): full Bench bit-identical to HotSpot both OFF and ON; GcStress 20/20 and MtStress 10/10 (4-thread alloc-during-GC) == HotSpot, no leak.
Measured ON vs OFF: objectAllocation 107.9->79.0ms (-27%, 5.4x->3.94x vs warmed Java25), stringBuilding 61.2->51.5ms (-16%); compute/arrays within +/-1%.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…alloc fast-path tier 2)
Stacks on the inlined BiBOP bump (CN1_INLINE_ALLOC) to close more of the escaping-allocation gap. Two independently-gated levers:
Lever B (-DCN1_INLINE_CTOR): after CN1_FAST_NEW allocates, the constructor was still a separate out-of-line cross-TU call. InlinableConstructor analyses a constructor for an inlinable shape (only this/param field stores + a chained-inlinable super ctor, bounded instruction count, no INVOKE except that super, no alloc/throw/branch/loop/try) and the new-site emits the field stores inline instead of the call. Emitted as an `#ifdef CN1_INLINE_CTOR` in the generated C (both branches present), so with the flag off the original call compiles and the build is byte-identical. Constructor args are consumed from the operand stack; the object is already GC-reachable and its ref fields were zeroed by the bump, so the inline stores need no extra barrier (this VM has none).
Lever A (-DCN1_DEATOMIC_BYTES): the per-allocation `atomic_fetch_add` on the global bibopBytesSinceGc becomes a plain per-thread accumulator (ThreadLocalData.bibopBytesLocal) flushed in bulk at page-acquire and thread death. bibopBytesSinceGc feeds only the GC-trigger heuristic (no liveness role) and is already raced today, so deferring it only shifts the trigger cadence by < nthreads*page, negligible vs the 24MB trigger. The bump cursor and mark publication ordering -- the GC-visible fields -- are UNCHANGED.
Both default OFF, alongside CN1_INLINE_ALLOC, pending iOS on-device validation.
Validated (arm64 macOS): full Bench bit-identical to HotSpot for every flag combination (off / L1 / +A / +B / +A+B); GcStress 10/10 and MtStress 10/10 (4-thread alloc-during-GC) == HotSpot on the +A+B config, no leak.
Interleaved (thermal-drift-cancelling) objectAllocation: off 171.9 -> L1 126.9 -> +B 80.1 -> +A+B 71.4 ms (2.4x speedup; each lever stacks).
hashMapChurn flat (its cost is hashing/clear, not allocation) and stringBuilding modest (char[] arrays use the legacy path). Net: objectAllocation ~5.7x -> ~2.7x warmed Java25; compute/arrays unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
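Lever A amounts to replacing a per-allocation atomic read-modify-write with a plain per-thread add plus an occasional bulk flush. A minimal sketch of that shape follows; SketchThread, sketchAccount and sketchFlush are hypothetical names, and the real flush points (page acquire, thread death) are only described in comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Global counter: in the commit it feeds only the GC-trigger heuristic. */
static atomic_size_t sketchBytesSinceGc;

typedef struct {
    size_t bytesLocal;  /* plain per-thread accumulator, touched by one thread only */
} SketchThread;

/* Per-allocation path: a plain add, no atomic operation. */
static inline void sketchAccount(SketchThread *t, size_t bytes) {
    assert(t != NULL);
    t->bytesLocal += bytes;
}

/* Bulk flush, intended for page acquire and thread death: one atomic add. */
static inline void sketchFlush(SketchThread *t) {
    assert(t != NULL);
    if (t->bytesLocal != 0) {
        atomic_fetch_add_explicit(&sketchBytesSinceGc, t->bytesLocal,
                                  memory_order_relaxed);
        t->bytesLocal = 0;
    }
}
```

Relaxed ordering is used here on the assumption, taken from the commit text, that the counter has no liveness role; the trigger simply sees each thread's bytes up to one page late.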
…lightweight pending array) cn1GcBuildRootSnapshots() reads every thread's pendingHeapAllocations array to add not-yet-migrated objects to the conservative-resolve extent table. It runs on the GC thread before the thread being scanned is parked, so threads other than the current one are still RUNNING. A lightweight thread grows its pending array lock-free in codenameOneGcMalloc / cn1AddPending (malloc tmp; memcpy; free(old); pending = tmp) -- the pre-existing guard took threadHeapMutex only for non-lightweight (native) threads. So the GC could read pendingHeapAllocations[j] exactly as free() reclaimed the array: the garbage word is taken as a heap-extent base and cn1ConservativeResolve returns it unvalidated -> SIGBUS in gcMarkObject. Rare (~1% under timing perturbation) but real, and it reaches default builds (CN1_CONSERVATIVE_GC_ROOTS is default-on). Fix: serialize the grow-and-free against the snapshot read. The two realloc fast paths now take threadHeapMutex unconditionally (lightweight included, like the native path already did), and cn1GcBuildRootSnapshots takes the SAME mutex around its pending-read loop. The lock is acquired and released entirely within the read, before the caller signal-stops any thread, so no thread is ever frozen mid-realloc holding it (no deadlock); ordering vs lockCriticalSection is never inverted (the migration path takes criticalSection THEN threadHeapMutex; this path takes only threadHeapMutex). This mirrors the existing pending-migration code (715-740), which already reads pending under threadHeapMutex for native threads / while lightweight threads are parked. The per-element store stays lock-free -- that read is benign (an aligned 8-byte slot holds 0 or a complete valid pointer; no free involved). Validated (arm64 macOS): ThreadSanitizer on HEAD deterministically reports the race (cn1GcBuildRootSnapshots reading pending vs codenameOneGcMalloc). 
With the fix: full Bench bit-identical to HotSpot (default and -DCN1_INLINE_ALLOC -DCN1_INLINE_CTOR -DCN1_DEATOMIC_BYTES); MtStress (4-thread alloc-during-GC) 300/300 clean -- 0 crash, 0 deadlock, all checksums == HotSpot -- at a deliberately widened race window (PER_THREAD_ALLOCATION_COUNT temporarily 16); GcStress 20/20 == HotSpot; no perf regression (objectAllocation/stringBuilding/intArithmetic within +/-1%). Residual conservative-collector non-STW reads are pre-existing and by design. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…T iOS VM) The inline BiBOP bump (CN1_INLINE_ALLOC), inline leaf constructors (CN1_INLINE_CTOR) and de-atomic per-thread byte accounting (CN1_DEATOMIC_BYTES) were committed behind opt-in -D flags. For an AOT VM whose sole shipping target is iOS, an off-by-default flag is dead code that never runs in production, and CI already exercises every ABI. Flip all three to default-on with a -DCN1_DISABLE_* escape hatch (kept only so CI can A/B and so a platform can opt out if a real problem surfaces). Validated (arm64 macOS): the DEFAULT build (no flags) is now bit-identical to HotSpot across the full Bench suite, GcStress 15/15 and MtStress 15/15 (4-thread alloc-during-GC) == HotSpot. Perf is the previously-measured strongest config: objectAllocation ~2.7x warmed Java25 (was 5.7x), compute/arrays at parity. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… arena Two GC-memory changes, both bit-identical to HotSpot, found by profiling allocation-churn benchmarks (objectAllocation etc.) which were spending their time in the allocator/collector rather than the mutator. 1. Adaptive allocation pacing. System.gc() used to Thread.sleep(2) on every trigger; an allocate-and-drop workload triggers GC every CN1_BIBOP_GC_TRIGGER bytes, so that fixed sleep was pure mutator stall (and, crucially, it did NOT bound memory -- RSS ballooned to 2.35-7GB run-to-run as the mutator outran the collector). Replace it with proportional backpressure in cn1BibopMaybeGc: the mutator only waits when uncollected BiBOP volume since the last GC exceeds a hard cap (3x the trigger), and waits as a GC SAFEPOINT (threadActive=FALSE so the collector can scan/advance past it -- a naive spin livelocks the collector, which showed up as an MtStress hang). When the collector keeps up the cap is never hit and this never waits. Counter-intuitively the tight cap is also the FAST configuration: a small heap keeps the non-generational O(pages) sweep cheap, so the collector keeps up and the mutator barely waits; a loose cap lets the heap grow and the sweep (hence everything) crawls. Disable: -DCN1_BIBOP_NO_PACING. 2. Batched page arena. cn1BibopNewPage did one posix_memalign(64KB) per page; when churn drains the free pool faster than the sweep refills it, every page was a separate mach_vm_map kernel trap (profiled ~17% of objectAllocation, now 0 in the sample). Carve 64KB pages from a 64KB-aligned multi-page arena (one mmap per CN1_BIBOP_ARENA_PAGES=64); pages stay 64KB-aligned, the arena is lazily faulted (RSS tracks touched pages), and BiBOP never free()s a page so interior pointers are safe. Disable: -DCN1_BIBOP_NO_ARENA. 
Result on objectAllocation churn: peak RSS 2.35GB+ (unbounded) -> 275MB (bounded, ~9x), at neutral-to-faster perf (clean idle wall-time equal-or-better; pacing only engages under allocation pressure, so compute/array benchmarks are unaffected -- bit-identical). This bounds what was effectively an unbounded-RSS OOM risk on device. It does NOT close the throughput gap to HotSpot on churn -- that is the non-generational O(pages) sweep vs HotSpot's O(survivors) young gen, a separate follow-up (O(1) all-dead-page reclaim). Validated (arm64 macOS): full Bench bit-identical to HotSpot; GcStress 20/20; MtStress (4-thread alloc-during-GC) 12/12, no hang; RSS bounded over sustained churn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
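Both changes reduce to small pieces of arithmetic. The sketch below shows a pacing predicate (wait only past 3x the trigger) and page carving from a 64KB-aligned arena of 64 pages. It is an illustration under assumptions: sketchMustPace and sketchNewPage are hypothetical names, and aligned_alloc stands in for the mmap-backed arena of the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { SK_PAGE_BYTES = 64 * 1024, SK_ARENA_PAGES = 64 };

typedef struct {
    unsigned char *base;  /* 64KB-aligned start of the current arena */
    size_t nextPage;      /* pages already handed out from this arena */
} SketchArena;

/* Pacing predicate: the mutator waits only past the hard cap of 3x the trigger. */
static int sketchMustPace(size_t uncollectedBytes, size_t triggerBytes) {
    return uncollectedBytes > 3 * triggerBytes;
}

/* Carve the next page; a new arena is requested only once per SK_ARENA_PAGES pages.
   Exhausted arenas are deliberately never freed, mirroring "BiBOP never free()s a page". */
static void *sketchNewPage(SketchArena *a) {
    assert(a != NULL);
    if (a->base == NULL || a->nextPage == SK_ARENA_PAGES) {
        a->base = aligned_alloc(SK_PAGE_BYTES, (size_t)SK_PAGE_BYTES * SK_ARENA_PAGES);
        if (a->base == NULL) return NULL;
        a->nextPage = 0;
    }
    return a->base + (a->nextPage++) * (size_t)SK_PAGE_BYTES;
}
```

Since the arena base is page-aligned and every offset is a multiple of the page size, each carved page keeps the 64KB alignment that a mask-based interior-pointer lookup would rely on.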
…us pages
The non-generational sweep walked every slot of every retired page (millions of
reads per cycle under allocation churn), so the collector couldn't keep up and
the adaptive pacing throttled the mutator -- objectAllocation was sweep-bound.
Make the sweep skip the per-slot walk for pages whose fate is provable in O(1):
A retired page is "homogeneous" -- safe to decide without walking -- iff all of:
  - !gcAllocedSinceSweep: no fresh mark==-1 grace-candidate slots since last sweep
  - gcLastMarkedEpoch != V: nothing on it was marked THIS cycle; a reachable object is always marked, so every occupant is garbage aging through grace
  - !gcNeedsReclaim: no survivor class carries a real finalizer
  - cn1BibopLiveMonitors == 0: no BiBOP monitor data to free
For a homogeneous page, gcGraceEpoch (set at each full walk = upper bound on every
survivor's epoch) decides the whole page:
  - gcGraceEpoch < V-1 -> ALL DEAD -> O(1) reclaim (reset bumpIndex/freeList, to freePool; byte-identical to the walk's liveCount==0 outcome, without touching slots)
  - gcGraceEpoch >= V-1 -> ALL LIVE (still in grace) -> O(1) skip (route as the walk would, gcGraceEpoch unchanged so it ages out)
Otherwise the existing full walk runs (and refreshes the per-page facts). New
per-page fields live in struct CN1BibopPage (always present so A/B layouts match);
set on alloc (the bump + free-list paths) and in gcMarkObject (a relaxed,
idempotent epoch stamp -- the marker is parallel). Monitors use a global seq_cst
live-count rather than a per-page flag to avoid cross-thread visibility races.
Gate: -DCN1_BIBOP_NO_FASTSWEEP.
Enabler (required): every class was emitting a non-null finalizerFunction that
just chained to the empty Object finalizer, so a "has finalizer" predicate was
always true and the O(1) path never fired. ByteCodeClass now emits
finalizerFunction = 0 unless a real finalize() exists in the hierarchy (the
__FINALIZER_<class> chain is still emitted, so subclass chaining is intact; both
readers -- freeAndFinalize and cn1BibopReclaimSlot -- already guard ptr != 0).
Behavior-preserving (conservative on unresolved bases) and it also drops millions
of no-op indirect finalizer calls from the existing full-walk path.
Result (arm64 macOS, idle, default-on): 63% of retired pages take the O(1) path;
objectAllocation 75.4 -> 46.5ms (1.62x; ~40% of the gap to warmed Java25 closed),
and on an isolated 20M-Node churn ~1.8x faster at equal-or-lower BOUNDED RSS
(~235MB) -- the pacing throttles far less now that the sweep keeps up. No
regression on compute/array benches.
Validated: full Bench bit-identical to HotSpot (FASTSWEEP on and off); GcStress
(85 runs across dev + here) and MtStress (40 runs, 4-thread alloc-during-GC) with
ZERO checksum divergence -- bit-identical is the oracle that the grace semantics
are preserved. (An intermittent ~4% GcStress segfault is a PRE-EXISTING
concurrent-GC race in the precise threadObjectStack scan -- present in the
pristine baseline at an equal-or-higher rate, an untouched code path -- to be
tracked separately.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
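The homogeneity test and the grace-epoch decision from the commit above can be written as one small pure function. This is a sketch only: SketchPageFacts and sketchPageFate are hypothetical names, the field types are assumed, and epoch wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SK_FULL_WALK, SK_ALL_DEAD, SK_ALL_LIVE } SketchFate;

/* Per-page facts named after the fields in the commit message; types are assumed. */
typedef struct {
    bool allocedSinceSweep;   /* gcAllocedSinceSweep */
    bool needsReclaim;        /* gcNeedsReclaim */
    int32_t lastMarkedEpoch;  /* gcLastMarkedEpoch */
    int32_t graceEpoch;       /* gcGraceEpoch */
} SketchPageFacts;

/* V is the current mark epoch; liveMonitors mirrors cn1BibopLiveMonitors. */
static SketchFate sketchPageFate(const SketchPageFacts *p, int32_t V, long liveMonitors) {
    assert(p != NULL);
    bool homogeneous = !p->allocedSinceSweep
                    && p->lastMarkedEpoch != V
                    && !p->needsReclaim
                    && liveMonitors == 0;
    if (!homogeneous) return SK_FULL_WALK;  /* fall back to the per-slot walk */
    return (p->graceEpoch < V - 1) ? SK_ALL_DEAD : SK_ALL_LIVE;
}
```

Any page that fails one of the four conditions still takes the existing full walk, so the fast path only ever replaces a walk whose outcome is already determined.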
…nter coarsening) Profiling the (now sweep-unbound) objectAllocation churn showed the per-object inline path doing avoidable work. Two removals, both bit-identical: - Drop the __ownerThread store. It is write-only dead state in the current tree (the size-class-index repurposing was an unmerged free-list patch); a full-tree scan finds no reader. Removed from both the inlined cn1BibopFastAlloc and the slow-path cn1BibopInitSlot. (Field kept for struct-layout stability.) - Move allocationsSinceLastGC / totalAllocations off the per-object path. These feed only the isHighFrequencyGC heuristic (no correctness role) but were two GLOBAL stores per allocation -- an L1 store single-threaded, a bouncing cache line across threads. They are now bumped in bulk inside CN1_BIBOP_FLUSH_BYTES once per page-acquire (~64KB), which is accurate enough for a threshold heuristic. (Non-DEATOMIC build keeps the per-object update in ACCOUNT_BYTES.) Note recorded in-code: the body memset is NOT removable -- skipping it is ~2x SLOWER because uninitialized ref fields get scanned during the mark==-1 grace window and retain floating garbage. It is load-bearing, not overhead. Result: objectAllocation 46.2 -> 44.8ms (~3% single-threaded; larger under multi-threaded allocation where the global-counter cache line stops bouncing); now 2.29x warmed Java25. Validated bit-identical to HotSpot (full Bench), GcStress (no checksum divergence) and MtStress 15/15. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fields off-object)
Profiling objectAllocation showed the per-allocation cost is store-bound, and
the object is fat: a 6-field, 48-byte header vs HotSpot's ~16 / 2 fields, so a
Node{int,ref} occupied a 64-byte BiBOP slot -- 2x the bytes to allocate, zero,
and stream through cache on every object. The header writes themselves are NOT
removable (each is GC state; skipping any retains floating garbage and runs 2-3x
SLOWER -- measured). So shrink by RELOCATING fields off the object, not skipping:
- DELETE __ownerThread -- write-only dead state (the size-class-index repurposing
was an unmerged patch; no reader exists). 48 -> 40.
- __codenameOneThreadData (lazily-attached monitor, null on ~all objects) -> an
address-keyed monitor side table (cn1MonitorDataGet/Set/Remove, one mutex,
critical-section->table lock order). monitorEnter/Exit/wait/notify + reclaim/free
use it; the alloc fast path drops the =0 store. 40 -> 24.
- __codenameOneReferenceCount -> a force-visited side set: its only behavioral use
was the gcMarkObject force-recursion guard (==recursionKey), now
cn1ForceVisitedTestAndSet; the 999999 "permanent" writes were vestigial (mark-
sweep never reads them -- those objects stay live via root marking). The alloc
fast path drops the =1 store. 24 -> 16.
Header is now {clazz*, gcMark, heapPosition} = 16 bytes (HotSpot-class). Node drops
64->32 byte class (half), HashMap.Entry 80->48.
Validated (arm64 macOS), every phase bit-identical to HotSpot on the full Bench;
GcStress + MtStress (4-thread alloc-during-GC) with ZERO checksum divergence across
150+ stress runs (the ~4% empty-output segfault is the pre-existing
threadObjectStack-scan race, same rate on clean HEAD). Perf (idle, interleaved): objectAllocation
0.80x (3.4x->3.0x warmed Java25), hashMapChurn 0.84x, stringBuilding faster-or-flat,
compute/array flat (relocation costs nothing off the alloc path). RSS is neutral on
average with higher variance (a smaller-slot pacing artifact, tunable separately).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
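The slot-size figures above follow from rounding header plus payload up to a size class. The sketch below reproduces that arithmetic under stated assumptions: size classes are taken to be multiples of 16 bytes, the payload sizes (16 bytes for Node{int,ref}, 32 bytes for HashMap.Entry) are inferred from the quoted totals on a 64-bit target, and SketchHeader with its field types is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape of the slimmed header: {clazz*, gcMark, heapPosition}. */
typedef struct {
    void *clazz;
    int gcMark;
    int heapPosition;
} SketchHeader;

/* Round header + payload up to the next multiple of 16 (assumed size-class step). */
static size_t sketchSlotBytes(size_t headerBytes, size_t payloadBytes) {
    assert(headerBytes > 0);
    return (headerBytes + payloadBytes + 15u) & ~(size_t)15u;
}
```

With a 48-byte header, a 16-byte payload lands in a 64-byte slot; with a 16-byte header, the same payload fits in 32 bytes, which matches the halving reported for Node.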
…rdening MEMSET ELIMINATION (init-before-publish, no gate -- this is the pipeline): For every NEW X; DUP; <args>; INVOKESPECIAL X.<init> site whose ctor is inlinable (super()==Object, param/const stores only, no finalizer), the NEW is deferred to a null placeholder and the <init> allocates WITHOUT the body memset (cn1BibopFastAllocNoZero), stores every ctor-written field, explicitly zeroes the unwritten ones, and only then publishes the object. Ctor args are hoisted into C temps in ARGUMENT ORDER before the alloc, which also fixes two latent bugs in the committed inline-ctor path: a folded call-expression arg stored to two fields evaluated twice, and args evaluated in ctor-body store order instead of Java's left-to-right. objectAllocation 1.70x warmed Java 25 (was 5.7x at branch start); all 10 Bench checksums bit-identical to HotSpot. The elision is made sound against the conservative/signal-stop collector by deferring parentCls publication: the header keeps parentCls==0 until every field is written, so a signal-stopped thread's mid-construction object is skipped by gcMarkObject's existing guard (grace keeps it alive); the sweep's mark==-1 finalizer probe gets a matching NULL guard and finalizer-bearing classes keep the memset path. THREAD-STOP GC HARDENING (bugs found via GcStress under CN1_GC_SIGNAL_STOP=1 and an adversarial review of the branch's GC): * VALIDATED precise scan: a signal-stopped thread can freeze between a push's type/data stores (plain stores clang may also reorder), so a type==OBJECT slot can hold a stale primitive -- observed as gcMarkObject(0x4e20) from a frozen PUSH_INT window. threadObjectStack words are now resolved against the page/extent snapshot exactly like conservative roots. * Type-before-data ordering in the fused invoke-return emissions (the same torn-slot hazard at every call returning into a stale receiver slot). 
* Generation-counted signal handshake: a timed-out stop PRE-RELEASES its generation and releases are monotonic, so an abandoned or descheduled handler can never strand spinning forever. * gcParkCaptured is cleared for EVERY thread each cycle -- a native thread that parked once no longer satisfies useCoop with a stale SP forever (missed roots -> UAF). * GC safepoint in cn1BibopMaybeGc (BiBOP-only allocators never reached the legacy park) and the pacing spin now honors threadBlockedByGC on wake so the cap can't resume a mutator mid-drain. * Acquire ordering: conservative resolver's mark load (freelist-header reuse window), sweep's bumpIndex load (fresh-slot header visibility), and the snapshot builder reads bumpIndex before geometry (page-reformat TOCTOU). * bibopBytesLocal / nativeAllocationMode initialized in ThreadLocalData (malloc'd, never zeroed -- garbage corrupted GC pacing / disabled the alloc fast path per-thread). Validation: GcStress 25/25 cooperative + 25/25 forced-signal (was 20/25 and 14/15), MtStress 20/20 + 10/10 forced-signal, ctor-semantics torture test (eval order, double-store, throwing args, default zeros, wide args, GC churn in call-args) byte-identical to HotSpot, full Bench suite bit-identical, no perf regression on any benchmark. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
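The init-before-publish scheme can be sketched as: hoist the arguments in source order, write every field into a non-zeroed slot, explicitly zero what the constructor does not write, and only then publish the class pointer. Everything named Sketch* below is hypothetical, and the slot is passed in rather than allocated to keep the example self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { const char *name; } SketchClass;

typedef struct {
    _Atomic(SketchClass *) clazz;  /* NULL until every field is written */
    int first;
    int second;
    void *unwritten;               /* never assigned by the constructor */
} SketchPair;

static SketchClass sketchPairClass = { "Pair" };
static int sketchArgCalls = 0;

/* Argument expression with a visible side effect, to make evaluation order observable. */
static int sketchNextArg(void) { return ++sketchArgCalls; }

/* Collector-side guard: an object with a NULL class is still under construction. */
static int sketchIsPublished(SketchPair *p) {
    return atomic_load_explicit(&p->clazz, memory_order_acquire) != NULL;
}

/* Lowering of `new Pair(nextArg(), nextArg())` onto a slot that was NOT memset. */
static SketchPair *sketchNewPair(SketchPair *rawSlot) {
    assert(rawSlot != NULL);
    int arg0 = sketchNextArg();  /* hoisted left to right, each evaluated once */
    int arg1 = sketchNextArg();
    atomic_store_explicit(&rawSlot->clazz, (SketchClass *)NULL, memory_order_relaxed);
    rawSlot->first = arg0;
    rawSlot->second = arg1;
    rawSlot->unwritten = NULL;   /* explicit zero replaces the body memset */
    atomic_store_explicit(&rawSlot->clazz, &sketchPairClass, memory_order_release);
    return rawSlot;
}
```

Hoisting into temporaries matters in C because the evaluation order of call arguments is unspecified, whereas Java requires left to right; the temporaries also ensure an argument stored to two fields is evaluated only once.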
…er-cycle root snapshot The global legacy-heap table was grown by DYING threads (markDeadThread -> collectThreadResources -> placeObjectInHeapCollection) while the GC thread walks it lock-free (sweep, root-snapshot build, overflow rescan). One growth concurrent with a sweep loses the sweep's slot-NULLs in the memcpy'd copy -- resurrecting freed pointers for the next cycle to dereference -- and two growths during one hoisted-pointer walk free the array under the reader (the old one-growth deferral could not cover that). Fix: make the table strictly GC-thread-owned. A dying thread now only QUEUES its ThreadLocalData (critical section already held by markDeadThread); the GC drains the queue at mark start -- strictly before any table walk or possible Thread-object finalization -- and performs the TLD free itself when the finalizer ran while the TLD was still queued (gcReleaseRequested). Objects in a queued TLD's pending list are invisible to the sweep, so the deferral can never free them early; un-snapshotted for at most one cycle, they are covered by the mark==-1 grace rule like every other post-snapshot allocation. With the single-writer invariant the growth can free the replaced array immediately, and getStack's one-shot immortal-string removal (the only non-GC-thread table access) takes the critical section. Also: build the conservative page/extent root snapshot ONCE PER MARK CYCLE (epoch-guarded) instead of once per scanned thread -- the full-table walk + qsort dominated the GC thread on array-heavy workloads (sampled: more time in qsort/cn1ConsExtCmp than in marking) and stalled mutators parked behind threadBlockedByGC. Post-snapshot allocations are mark==-1 fresh and survive via grace whether or not they resolve, so the first build of a cycle is complete for correctness. recursion 146->127ms; GC CPU burn on string/array churn cut sharply. 
Validation: new ThreadChurn stress (8 dying threads x 12 rounds x 3k pending arrays + >30000 live arrays forcing table growth under concurrent GC) 15/15 + 8/8 forced-signal, checksum identical to HotSpot; GcStress 20/20+15/15 coop, 10/10+8/8 forced-signal; MtStress 10/10; full Bench suite bit-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The core shows a parallel mark worker calling through a corrupt markFunction (gcMarkWorkerDrainLoop popped a worklist entry whose object header was destroyed between push and pop). Dump the drain loop's locals, the popped batch, and the mark state alongside the backtraces so the next occurrence identifies the victim object. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The poll settled after 2 visible ticks (~400ms), but ToastBar.show() runs slideUpAndWait(2)+slideDownAndWait(800) -- the component reports visible with full bounds while still animating into view, so tvOS captured a half-slid/absent toast (ButtonTheme was fine; only the toast frame raced). Require 1400ms of continuous visibility past the ~802ms animation before capturing; the 15s cap still bounds a broken toast. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ruption) Root cause of the Linux arm64 suite crash (random SIGSEGV in the theme phase; three CI cores at three different wild PCs -- gcMarkWorkerDrainLoop markFunction, cn1MakeFont, LinuxImplementation_exists -- classic heap-corruption signature; x64 leg never crashed). The allocator (cn1BibopInitSlot) writes parentClsReference/heapPosition and then RELEASE-stores the mark word LAST: the mark word is the object's single publication point. gcMarkObject's parallel-worker path loaded it RELAXED, so on arm64's weak memory model a worker could observe the object without observing the preceding parentClsReference store, then dereferenced a stale/garbage parentClsReference->markFunction. x86 hid it (every x86 load is acquire); it is branch-only (parallel marking, aa2838e, is not on master). Acquire-load the mark word before reading any other header field, pairing with the allocator's release store; reuse that snapshot as the claim's 'old'. Orders every parentClsReference read -- the guard, the CAS-success deref, and (through the worklist mutex's release/acquire) the drain worker's deref. Serial path unchanged. Gauntlet green on Apple-Silicon arm64 (same weak-memory model, parallel path active): all tortures byte-identical, GcStress/MtStress both stop modes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
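The pairing described above, a release store of the mark word in the allocator matched by an acquire load in the marker, can be sketched with C11 atomics. SketchObject, sketchInitSlot and sketchClaim are hypothetical names; the real header has more fields and the real claim path does more work.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { const char *name; } SketchCls;

typedef struct {
    SketchCls *clazz;    /* plain field, written before the mark word */
    _Atomic int gcMark;  /* the single publication point */
} SketchObject;

/* Allocator side: header fields first, mark word release-stored last. */
static void sketchInitSlot(SketchObject *o, SketchCls *cls) {
    assert(o != NULL && cls != NULL);
    o->clazz = cls;
    atomic_store_explicit(&o->gcMark, -1, memory_order_release);
}

/* Marker side: returns the class to trace through if this caller claimed the
   object, or NULL if it was already claimed. */
static SketchCls *sketchClaim(SketchObject *o, int currentMark) {
    /* Acquire pairs with the allocator's release, so clazz is visible below. */
    int old = atomic_load_explicit(&o->gcMark, memory_order_acquire);
    if (old == currentMark) return NULL;
    /* Reuse the snapshot as the CAS's expected value. */
    if (!atomic_compare_exchange_strong_explicit(&o->gcMark, &old, currentMark,
            memory_order_acq_rel, memory_order_acquire)) {
        return NULL;  /* another worker changed the mark first */
    }
    return o->clazz;
}
```

On x86 the relaxed and acquire loads compile to the same instruction, which is consistent with the bug only surfacing on arm64.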
The acquire-load fix removed the parallel mark-WORKER crash (zero gcMarkWorkerDrainLoop frames in the next arm64 core), but arm64 Linux still corrupts the heap -- the crash moved to a frameless method reading a smashed threadStateData -- so a second ordering hole remains in the branch-only parallel-GC work. Force one marker (bypassing the whole parallel path: gcMarkDrainParallel -> serial gcMarkDrain, no atomics, no pool) as a git-A/B isolation step. Green arm64 => parallel marking is the sole remaining corruptor and the audit continues offline behind CN1_GC_MARK_THREADS>1; still-red => the bug is elsewhere in the branch GC changes. The acquire fix stays in for when parallel marking is re-enabled. Gauntlet green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… builds) A conservative scan reads every aligned word of a thread stack incl. inter-variable padding; under -fsanitize=address those reads hit ASan's poisoned stack redzones and raise guaranteed false positives that bury real findings. Exempt cn1ConservativeMarkRange (standard for conservative collectors). No effect on normal builds; makes ASan builds of the VM usable for hunting the intermittent Linux GC heap corruption. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
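A minimal sketch of such an exemption follows, assuming a Clang or recent GCC toolchain where the no_sanitize("address") function attribute is available. sketchConservativeScan is a hypothetical stand-in for cn1ConservativeMarkRange: it only counts candidate words instead of marking them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__clang__) || defined(__GNUC__)
#define SK_NO_ASAN __attribute__((no_sanitize("address")))
#else
#define SK_NO_ASAN
#endif

/* Counts aligned words in [lo, hi) whose value falls inside [heapLo, heapHi).
   The attribute disables ASan instrumentation for this function only. */
SK_NO_ASAN
static size_t sketchConservativeScan(const uintptr_t *lo, const uintptr_t *hi,
                                     uintptr_t heapLo, uintptr_t heapHi) {
    size_t candidates = 0;
    assert(lo <= hi);
    for (const uintptr_t *p = lo; p < hi; p++) {
        uintptr_t word = *p;  /* on a real stack this may read padding or redzones */
        if (word >= heapLo && word < heapHi) candidates++;
    }
    return candidates;
}
```

Without a sanitizer the attribute has no effect on generated code, in line with the commit's note that normal builds are unaffected.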
…erministic ToastBarTopPosition flaked on the Metal and tvOS backends (and intermittently mac-native): showMessage's slide (slideUpAndWait + slideDownAndWait) relies on the animation manager ticking to completion, which those offscreen pipelines don't guarantee -- the ToastBarComponent stayed stuck at height 0, so time-based capture snapshotted an absent toast and the screenshot 'differs'. No amount of waiting helps a slide that never completes, and reruns didn't clear metal/tv. Add ToastBar.setAnimated(boolean): when false the toast is shown/hidden instantly via a synchronous revalidate instead of the slide. The test disables animation, so the toast is deterministically laid out the moment showMessage returns on every backend. The final on-screen state (toast fully visible at TOP, no empty band above) is identical to the animated end state, so existing goldens still match; the test still validates the TOP-position layout it exists for. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The poll+settle rewrite and the instant-show (setAnimated) attempt both made ToastBarTopPosition WORSE, not better: instant-show left the ToastBarComponent at height 47 but visible=false (updateStatus/hidden override), so the toast was absent from the capture and the screenshot differed on metal/mac/tv. The original simple 2s-wait test is exactly what produced the committed goldens (which contain the toast), so restore it and drop the ToastBar.setAnimated experiment. The toast's cross-backend animation timing is a pre-existing, separate concern from this VM/GC PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Manually-triggered job that builds the x64 suite native ELF with -fsanitize=address on a real x86 GitHub runner (not QEMU) and runs it, to pinpoint the x64-only heap-UAF in cn1BibopFastAlloc (a stale/freed BiBOP page) that arm64 ASan cannot surface. Serial marking = the shipped, crashing config. Delete once fixed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… needs default branch)
Served its purpose: x64+serial+ASan ran clean twice (139/139), confirming the x64-serial residual is a conservative-scan root-miss that ASan structurally masks (its layout changes hide the missed-root free). ASan exhausted as a tool for this bug; the parallel mark/sweep UAF it DID pin is fixed by the serial default.
Gated (default-off) assertions that fire AT THE SOURCE of the intermittent x64 cn1BibopFastAlloc crash, in a normal build (ASan masks it -- its layout changes hide the bug -- so ASan can't catch it):
- fast/nozero bump path: bibopCurrent[ci] page must be OWNED, match ci, and the bumped slot must lie inside the page (catches a retired/recycled/reformatted current page).
- sweep: a page reaching cn1BibopSweep must NOT be owned (catches a live current page swept out from under its thread).
Compiles + runs clean under the gauntlet (no false fire). Off by default; enabled via -DCN1_BIBOP_VALIDATE for the CI x64 diagnostic.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The x64 crash reproduced in a normal -DCN1_BIBOP_VALIDATE build lands in gcMarkDrain dereferencing a worklist obj's garbage parentCls (NOT a page invariant -- those didn't fire). Add a source-point check that reports the corrupt obj's parentCls / heapPosition / gcMark so the next run says whether it is a mid-construction (parentCls==0) object wrongly enqueued or a freed/reused slot marked live.
rev2's obj passed the parentCls!=0/heapPosition checks but fp() still jumped into a libc-range address -> parentCls is a non-null garbage pointer. Strengthen the drain-source check: also abort when the worklist obj's gcMark is neither currentGcMarkValue nor -1 (a freed/reused slot), and when its markFunction is >256MB from a known app-text anchor (&gcMarkObject) -- i.e. a garbage fp. Gauntlet clean (Gc+MtStress).
Root cause (pinned by the CN1_BIBOP_VALIDATE assertion firing on real x86): gcMarkObject marked a FREED BiBOP slot. A freed slot sits on its page free-list, whose intrusive next pointer overwrites __codenameOneParentClsReference at slot offset 0, so parentCls becomes a garbage (slot-interior) pointer. The conservative native-stack scan legitimately resolves an interior pointer from any stack word -- and on x86 a leftover stack word pointed at a freed slot (arm64's differing register/stack layout, and ASan's redzone layout, simply never produced that word -> the crash looked x64-only and was invisible under ASan). gcMarkObject's guard only rejected parentCls==0 / ==Class, so it stamped gcMark=current over FREE_MARK, pushed the slot, and the mark drain then called through the clobbered parentCls -> jump to a garbage address (the intermittent cn1BibopFastAlloc/gcMarkDrain SIGSEGV). Fix: gcMarkObject returns early when the acquire-loaded mark is CN1_BIBOP_FREE_MARK -- a free slot has no live fields to trace. FREE_MARK (-7) is never a live mark (live == currentGcMarkValue or -1 grace), so a real object is never skipped. Full gauntlet GREEN (all tortures byte-identical to HotSpot, GcStress/MtStress both stop modes) -> no live object dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
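The failure mode and the guard can be reproduced in miniature: freeing a slot overwrites offset 0 with the free-list link, so a marker that reaches the slot through a stale stack word must check the mark before trusting the class pointer. The names below (SketchSlot, sketchFreeSlot, sketchMark) are hypothetical; only the value -7 for the free mark comes from the commit text.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SK_FREE_MARK (-7)

typedef struct SketchSlot {
    void *classOrNext;   /* offset 0: class pointer when live, free-list link when free */
    _Atomic int gcMark;
} SketchSlot;

/* Push a slot on the intrusive free list; this clobbers the class pointer. */
static void sketchFreeSlot(SketchSlot *s, SketchSlot **freeList) {
    assert(s != NULL && freeList != NULL);
    s->classOrNext = *freeList;
    atomic_store_explicit(&s->gcMark, SK_FREE_MARK, memory_order_release);
    *freeList = s;
}

/* Returns 1 if the slot was newly marked and should be traced, 0 otherwise. */
static int sketchMark(SketchSlot *s, int currentMark) {
    int mark = atomic_load_explicit(&s->gcMark, memory_order_acquire);
    if (mark == SK_FREE_MARK) return 0;  /* free slot: no live fields to trace */
    if (mark == currentMark) return 0;   /* already marked this cycle */
    atomic_store_explicit(&s->gcMark, currentMark, memory_order_relaxed);
    return 1;
}
```

Because the free mark is distinct from every live mark value, the early return cannot drop a real object; it only rejects slots whose offset-0 word is no longer a class pointer.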
…rmed) Confirmed: with the freed-slot mark fix, 3x x64 suite runs on real x86 with the CN1_BIBOP_VALIDATE assertions active -> no assertion fired, no crash, all completed. The gated assertions (default-off) stay as a GC regression aid.
…stsweep link fix Two independent fixes surfaced while driving PR #5327 fully green. 1) Windows-on-ARM screenshot flake (root cause, not a rerun): The cn1ss WebSocket screenshot suite's generated main never enabled an offscreen target, so initDisplay built an HWND render target and captureWindowToPngBytes always returned null -- every screenshot fell back to super.screenshot(), which repaints each form into a fresh mutable image. On the slow windows-11-arm preview runner that per-screenshot repaint stalled the suite mid-run (42/100 PNGs). Add enableOffscreenCapture(): a hidden window is still created (identical pump/DPI/exact 784x561 client size) but windowGraphics is an offscreen WIC bitmap of that size, so captureWindowToPngBytes reads back the real frame via the proven WIC path -- no per-screenshot repaint. Same software WIC rasterizer as the mutable-image path, so pixels match the existing goldens. Richer capture-failure logging + upload of the native cn1windows.log on both legs so any residual miss is diagnosable (the arm64 native log was dropped). 2) BiBOP CN1_BIBOP_NO_FASTSWEEP link fix (review): cn1BibopNoteNativePeer is declared unconditionally and called unconditionally by toNSString(), but its definition sat inside #ifndef CN1_BIBOP_NO_FASTSWEEP, so that config failed to link. Add a #else no-op (correct: with fast sweep off every dead slot is full-walked to cn1BibopReclaimSlot anyway). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Linux screenshot suite intermittently crashes in serial gcMarkObject while marking a child (e.g. Component.BGPainter.this$0): the child is readable at the acquire-load but its header is reclaimed by the time we stamp its mark word (offset-0 clobbered -> parentCls->markFunction garbage). Only reproduces with the live WebKit/GTK/Gallium threads, so GcStress/MtStress can't hit it. Add CN1_BIBOP_VALIDATE forensics (temporary diagnostic build on the Linux CI): - gcMarkDrain records the object it is currently tracing (gcMarkCurrentDrainObj), i.e. the PARENT whose mark function is marking children. - gcMarkObject validates each child BEFORE the faulting mark-write and, on a corrupt/reclaimed child (bad heapPosition or a markFunction outside the app text), aborts with the corrupt child + its drain parent (class, markFn, heapPosition, gcMark). That distinguishes a conservative-scan root-miss (parent live, child wrongly freed) from a worklist/slot reuse (parent itself stale). - linux-build-run.yml compiles the native ELF with -DCN1_BIBOP_VALIDATE so the next reproduction aborts at the source with full context (remove once fixed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Root cause (red-teamed + core-verified): the concurrent collector marks each thread's roots while that thread is paused, then RELEASES it before the others are scanned (cn1_globals.m:963), and never pauses native threads at all. With no write barrier, a released or native mutator that moves/nulls the last snapshot-time reference to a live object -- between its own scan and the end of mark -- makes that object unreachable to the collector, so it is swept while live; a surviving stale reference then faults the next cycle in gcMarkObject. Reproducible only with the live GTK/WebKit/Gallium threads that do such concurrent mutation, which is why GcStress/MtStress never hit it (this is the intermittent Linux mid-suite SIGSEGV).

Fix: a Yuasa snapshot-at-the-beginning (deletion) write barrier. While a mark is in progress (gcSatbActive), every store overwriting an object reference in a heap location first hands the collector the OLD value (cn1SatbEnqueue), so a reference present at the start of the cycle is preserved no matter where a mutator moves it. codenameOneGCMark drains the log to a fixpoint (idempotent-mark detection via gcMarkNewObjectCount) and clears the flag before sweep. Deadlock-free (no thread holding) and covers native threads too -- exactly the mutators thread-pausing cannot stop.

Barrier sites (every object-ref store; fresh-object inits skipped, old value null):
- PUTFIELD/PUTSTATIC setters (ByteCodeClass), optimized AASTORE (cn1_set_array_element_object), and the previously-UNBARRIERED AASTORE fallback (BasicInstruction -- also restores its missing nursery barrier).
- Bulk/replace natives: object System.arraycopy, HashMap put-replace/remove/clear.

Cost: within noise. Off-mark the barrier is one predicted-not-taken flag load; the read+enqueue runs only during the infrequent mark. Same-machine A/B (-DCN1_DISABLE_SATB vs default, best-of-5): geomean 1.00x -> 1.01x, store-heavy shapes flat/non-monotonic.

Gauntlet GREEN (8 tortures bit-identical to HotSpot; GcStress/MtStress x8 across cooperative + forced-signal modes). -DCN1_DISABLE_SATB compiles it out. CN1_BIBOP_VALIDATE stays on the Linux CI to catch any residual.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
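For readers unfamiliar with deletion barriers, the shape of the store path described in this commit can be sketched roughly as follows. This is a single-threaded illustration only: `Obj`, `store_ref`, `satb_enqueue`, `satb_drain` and the fixed-size log are hypothetical names, not the actual `cn1SatbEnqueue` machinery, and a real barrier would need atomics and a per-thread or lock-protected log.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object model: one reference field and a mark bit. */
typedef struct Obj {
    struct Obj *field;
    bool marked;
} Obj;

#define SATB_LOG_CAP 64

static bool satb_active = false;      /* stand-in for gcSatbActive */
static Obj *satb_log[SATB_LOG_CAP];   /* overwritten (old) references */
static size_t satb_len = 0;

/* Hand the collector a reference that is about to be overwritten. */
static void satb_enqueue(Obj *old_value) {
    if (old_value == NULL) return;    /* fresh-object init: nothing to preserve */
    if (satb_len < SATB_LOG_CAP) {
        satb_log[satb_len++] = old_value;
    } else {
        old_value->marked = true;     /* log full: mark directly rather than drop */
    }
}

/* Barriered reference store: off-mark it costs one flag test. */
static void store_ref(Obj **slot, Obj *new_value) {
    if (satb_active) {
        satb_enqueue(*slot);          /* log the OLD value before it is lost */
    }
    *slot = new_value;
}

/* Collector side: mark everything in the log; returns how many were new. */
static size_t satb_drain(void) {
    size_t newly_marked = 0;
    while (satb_len > 0) {
        Obj *o = satb_log[--satb_len];
        if (!o->marked) {
            o->marked = true;         /* a real collector would also trace children */
            newly_marked++;
        }
    }
    return newly_marked;
}
```

In this sketch a drain that returns 0 corresponds to the "marks nothing new" condition the commit uses to detect the fixpoint.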
…l site) The SATB barrier did not close the intermittent Linux crash: the CN1_BIBOP_VALIDATE dump shows a valid, marked drain-parent holding a DANGLING pointer to a freed object (child gcMark=0, garbage class) -- an object wrongly freed while still referenced, i.e. a conservative-scan root-miss, not the SATB deletion race. Extend the MARKOBJ CORRUPT CHILD dump with parentClass (the drain parent is validated live, so its clsName is safe and names WHICH object holds the lost reference) and markCallSite (__builtin_return_address(0) -> the return PC in the parent's generated mark function, which addr2line / the gdb bt maps to the exact field being marked). One diagnostic round then pinpoints the specific lost reference. Diagnostic-only, gated by CN1_BIBOP_VALIDATE (Linux CI); no effect on shipped builds. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The corrupt-child crash surfaces two ways -- a mapped-but-garbage child (caught by the CN1_BIBOP_VALIDATE abort, which names parentClass) and a WILD/unmapped child (faults at gcMarkObject's acquire-load before the forensic check can run). To name the culprit regardless of path, have the gdb post-mortem read gcMarkCurrentDrainObj (the parent, recorded before its mark function runs) and its clsName straight from the core, plus the crashing thread's full bt (the mark-function frame resolves to the exact field being marked). Diagnostic-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The intermittent Linux crash is a mark-traversal completeness bug, not the SATB deletion race: the CN1_BIBOP_VALIDATE dump names a live, marked Component.BGPainter whose this$0 (the owning Component, reachable only through that back-reference) was swept -- a marked object whose mark function's children went untraversed. gcMarkDrain only triggers the BiBOP page rescan (which re-pushes marked-but-undrained slots) on a worklist OVERFLOW. If a marked object's subtree is ever left untraversed, its reachable children are freed while live and a surviving reference faults the next cycle. Add a belt pass at the end of codenameOneGCMark, before sweep: force one full rescan + drain to a fixpoint unconditionally, so EVERY marked object's mark function runs and all reachable children are marked. gcMarkDrain re-pushes each marked slot and loops until a pass marks nothing new -> O(reachable), idempotent, recovers any marked-but-untraversed subtree. Under CN1_BIBOP_VALIDATE it logs how many objects the belt recovered (proves whether the drain was incomplete). Gauntlet GREEN (8 tortures bit-identical; GcStress/MtStress x8 both stop modes). Perf impact measured separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… amplifier Mechanized: a Component reached only through Component.BGPainter.this$0 is freed while live, and the O(1) all-dead page reclaim pools its page WITHOUT stamping the slots FREE_MARK -- so the dangling this$0 evades the resolver/gcMarkObject FREE_MARK guard and faults, instead of being safely rejected. Disable the O(1) shortcut on the Linux diagnostic build: every dead slot is then full-walked and stamped FREE_MARK. If the crash stops, the O(1) reclaim is the amplifier and the fix is to make freed slots detectable (FREE_MARK / bumpIndex) to dangling refs; if it persists, the free itself (a conservative root-miss) is the whole story. Diagnostic-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…drain parent) The belt recovers up to 456 reachable-but-unmarked objects per cycle -- a large, systematic drain gap I can't pin by inspection. Log the class of each object the belt newly marks and its drain parent (throttled), so the pattern names itself. Revert the NO_FASTSWEEP A/B (that leg confirmed the O(1) reclaim is only an amplifier -- with FREE_MARK the BGPainter CORRUPT CHILD abort vanished but a different SIGSEGV plus the same 456-object incompleteness remained), and diagnose the real fastsweep config. Diagnostic-only, gated by CN1_BIBOP_VALIDATE. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tators) The belt runs during the concurrent mark with mutators still allocating, so looping it to a "mark nothing new" fixpoint can never converge -- observed intermittently hanging/breaking FusedTest. Revert to the bounded single-pass belt (gauntlet-green). The belt diagnostic (BELT RECOVERED child<-parent, CN1_BIBOP_VALIDATE-only) stays; it showed the incompleteness is systematic container->content (Object[]/HashMap/ArrayList/ Property/String/GeneralPath, heaviest under VectorMap churn), i.e. concurrent mutation into already-drained black containers -- a Yuasa deletion barrier does not catch adds. The real fix is concurrent-mark completeness, not a bigger belt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…able crash)

ROOT CAUSE (mechanized, forensic-confirmed): cn1BibopSweep promotes a fresh grace object (gcMark==-1) to live WITHOUT draining it (cn1_globals.m ~2090; the O(1) grace-page path pools the whole page the same way). So an OLD object reachable ONLY through a fresh, not-yet-linked object is left unmarked and swept. When a mutator later links that fresh object into the live graph, the next cycle drains it and marks the now dangling child -> the intermittent Linux gcMarkObject fault. The belt diagnostic named the pattern exactly: Double<-Property, Object[]<-HashMap/ArrayList/Vector, char[]<-String, float[]/byte[]<-GeneralPath -- fresh containers whose (old) contents were dropped. The O(1) all-dead reclaim amplifies it (pools the freed slot without FREE_MARK so the dangling ref evades the resolver guard instead of being rejected).

FIX: before sweep, walk the page registry and drain every grace object (gcMark==-1, parentCls!=0), so a surviving grace object's subtree survives WITH it. Single bounded pass (no fixpoint loop -- mutators are active, so looping livelocks; observed breaking FusedTest).

Gauntlet GREEN (8 tortures bit-identical; GcStress/MtStress x8 both stop modes). Perf measured separately. Validated on Linux CI next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
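The grace pass described in this commit can be pictured with a small sketch. Everything here is a simplification under stated assumptions: `Obj` has a single child reference, the "page registry" is a flat array, and `grace_pass`/`drain_children` are hypothetical names. Only the `gcMark == -1` convention comes from the commit message.

```c
#include <stddef.h>

#define GRACE_MARK (-1)   /* mark value the commit message gives for fresh objects */

/* Hypothetical object: a mark word and a single child reference. */
typedef struct Obj {
    int gcMark;
    struct Obj *child;
} Obj;

/* Mark the chain hanging off one object with the current cycle's mark. */
static size_t drain_children(Obj *parent, int current_mark) {
    size_t marked = 0;
    for (Obj *c = parent->child; c != NULL; c = c->child) {
        /* Stop at anything already safe: marked this cycle, or itself a
           grace object that the registry walk visits on its own. */
        if (c->gcMark == current_mark || c->gcMark == GRACE_MARK) break;
        c->gcMark = current_mark;
        marked++;
    }
    return marked;
}

/* One bounded pass over the registry: each grace object is drained once,
   with no fixpoint loop. */
static size_t grace_pass(Obj **registry, size_t count, int current_mark) {
    size_t marked = 0;
    for (size_t i = 0; i < count; i++) {
        Obj *o = registry[i];
        if (o != NULL && o->gcMark == GRACE_MARK) {
            marked += drain_children(o, current_mark);
        }
    }
    return marked;
}
```

With an old two-object chain reachable only through a fresh object, the pass stamps both old objects with the current mark, so a sweep keyed on that mark would keep them.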
The grace-drain pass cut the swept-while-reachable incompleteness 456 -> 1-4/cycle and removed the BGPainter CORRUPT CHILD path, but a residual remained: mutators create fresh grace objects (referencing OLD objects) in the window AFTER the grace pass runs, so an old object linked into such a fresh object is still dropped. Repurpose the (already emitted at every object-reference store) CN1_WRITE_BARRIER no-nursery branch as the SATB INSERTION half: during the mark, enqueue the NEW reference being stored, so an object linked into the live graph mid-mark is kept regardless of whether its container is yet reachable. Combined with CN1_SATB_DELETE (deletion half) this is a complete snapshot+incremental (Yuasa+Dijkstra) barrier; with the grace-subtree pass it guarantees no reachable object is swept. Off-mark cost is one predicted-not-taken flag load; -DCN1_DISABLE_SATB compiles it out. Gauntlet GREEN (8 tortures bit-identical; GcStress/MtStress x8 both stop modes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The SATB log was drained and gcSatbActive cleared BEFORE the grace pass and belt ran, so a mutator that linked an old object into a fresh grace object DURING those phases was not logged -> the residual arm64 slip. Move the SATB drain/clear/final-catch to AFTER grace + belt so the deletion+insertion barriers stay armed through the whole mark and that window is captured. Gauntlet GREEN. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ck-free) The concurrent drain returns before a true fixpoint to avoid livelocking active mutators, so a marked object's subtree can be left untraversed and swept while reachable -> the intermittent Linux crash. Definitive fix: before sweep, STW-pause every running lightweight thread at a safepoint, then loop the grace pass + belt + SATB drain to a TRUE fixpoint (with mutators frozen it converges -- no livelock, unlike the earlier looped belt that hung FusedTest), then resume ONLY the threads we paused (aggressive allocators stay held for sweep). Deadlock-free: a thread that can't reach a cooperative safepoint quickly is either blocked on a monitor held by an already-paused thread (not mutating) or in a rare safepoint-free loop; a bounded ~100ms wait then proceeds, so it never hangs. Native threads aren't paused -- the still-armed SATB barrier captures their stores. Gauntlet GREEN (8 tortures bit-identical; GcStress/MtStress x8 both stop modes, no deadlock). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
This branch takes ParparVM from a ~1.5-36x deficit against warmed Java 25 (HotSpot C2) to geomean 1.00x parity across the ten-benchmark suite, with six benchmarks at or below HotSpot. Everything is measured on Apple M2, best-of-5 interleaved runs, ThinLTO release configuration, against azul-25 with full warmup; every optimization is gated on bit-identical checksums vs HotSpot plus the GC stress gauntlet.
intArithmetic/longArithmetic run at exact pure-C parity (verified against same-flags C controls); the residual is C2-vs-clang scheduling of the dependency chain, not VM overhead. recursion is HotSpot's speculative inlining, accepted.
What the emitted code looks like, before and after
1. Frameless codegen (recursion 4.6x -> 1.6x, feeds everything else)
Every Java method used to push a GC-visible frame of type-tagged slots and route every intermediate value through it:
Methods proven safe (no try/catch, no synchronization; object roots covered by the conservative native-stack scan) now compile to plain C:
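As a rough illustration of the difference between the two forms (not the translator's actual output, which is not reproduced here), the sketch below contrasts a method that routes values through type-tagged slots with the same method as plain C. `Slot`, `thread_stack`, `add_framed` and `add_frameless` are all hypothetical names.

```c
#include <stddef.h>

/* Hypothetical type-tagged slot, loosely modelled on a GC-visible frame entry. */
typedef enum { SLOT_INVALID, SLOT_INT, SLOT_OBJECT } SlotType;

typedef struct {
    SlotType type;
    union { int i; void *o; } data;
} Slot;

static Slot thread_stack[256];
static size_t stack_top = 0;

/* Framed form: arguments and the intermediate result all live in tagged slots. */
static int add_framed(int a, int b) {
    Slot *frame = &thread_stack[stack_top];
    stack_top += 3;                                /* push the frame */
    frame[0].type = SLOT_INT; frame[0].data.i = a;
    frame[1].type = SLOT_INT; frame[1].data.i = b;
    frame[2].type = SLOT_INT;
    frame[2].data.i = frame[0].data.i + frame[1].data.i;
    int result = frame[2].data.i;
    stack_top -= 3;                                /* pop the frame */
    return result;
}

/* Frameless form: plain C locals the compiler is free to keep in registers. */
static int add_frameless(int a, int b) {
    return a + b;
}
```

Both functions compute the same value; the framed one additionally performs several stores to memory that the C compiler generally cannot remove, because the slot array is globally visible.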
2. Diverging array checks (quicksort 1.23x -> 0.92x)
The bounds-check helper used in fused comparisons returns a dummy after throwing, so its cold path rejoins the loop. That put a reachable call inside every loop cycle, and clang must assume a call clobbers memory:
In frameless methods the failure path now throws and returns from the method (the same pattern the stack-overflow guard uses), so no cycle of the loop contains a call and the header loads hoist:
Measured on the sort alone: 216ms -> 164ms, vs HotSpot's 197ms.
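The two control-flow shapes discussed in this section can be sketched in isolation. This is an assumption-laden toy: `throw_out_of_bounds`, `checked_load`, `sum_rejoining` and `sum_diverging` are hypothetical, and a boolean flag stands in for the VM's exception mechanism.

```c
#include <stdbool.h>

static bool exception_pending = false;   /* hypothetical stand-in for a thrown exception */

static void throw_out_of_bounds(void) {
    exception_pending = true;
}

/* Rejoining form: the helper throws, then returns a dummy so the caller
   keeps looping. The call stays reachable from every iteration. */
static int checked_load(const int *data, int length, int index) {
    if ((unsigned)index >= (unsigned)length) {
        throw_out_of_bounds();
        return 0;                         /* dummy value; control rejoins the loop */
    }
    return data[index];
}

static long sum_rejoining(const int *data, int length, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += checked_load(data, length, i);
    }
    return sum;
}

/* Diverging form: the failure path throws and leaves the method, so no path
   through the call ever flows back into the loop. */
static long sum_diverging(const int *data, int length, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if ((unsigned)i >= (unsigned)length) {
            throw_out_of_bounds();
            return 0;
        }
        sum += data[i];
    }
    return sum;
}
```

For in-range input the two functions agree. They differ only when an index is out of range: the rejoining version keeps iterating with dummy values, while the diverging version exits immediately.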
3. Compact HashMap: no entry objects (hashMapChurn 36x -> 0.95x, with the box cache)
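To make "no entry objects" concrete, here is a minimal sketch of an entry-free map: keys and values sit in parallel arrays and lookups probe slot indices directly. It assumes `int` keys, linear probing and a fixed capacity with no resize or remove. `CompactMap`, `map_put` and `map_get` are illustrative names, not the PR's natives, and the real layout may differ.

```c
#include <stdbool.h>
#include <stddef.h>

#define CAPACITY 16u   /* power of two; fixed for the sketch (no resize) */

/* Parallel arrays instead of one heap object per entry. */
typedef struct {
    int    keys[CAPACITY];
    int    values[CAPACITY];
    bool   used[CAPACITY];
    size_t size;
} CompactMap;

/* Multiplicative hash reduced to a slot index. */
static unsigned slot_for(int key) {
    return ((unsigned)key * 2654435761u) & (CAPACITY - 1u);
}

/* Insert or replace; returns false only when the table is full. */
static bool map_put(CompactMap *m, int key, int value) {
    unsigned i = slot_for(key);
    for (unsigned probes = 0; probes < CAPACITY; probes++) {
        if (!m->used[i]) {
            m->used[i] = true;
            m->keys[i] = key;
            m->values[i] = value;
            m->size++;
            return true;
        }
        if (m->keys[i] == key) {
            m->values[i] = value;         /* replace in place, no allocation */
            return true;
        }
        i = (i + 1u) & (CAPACITY - 1u);   /* linear probe to the next slot */
    }
    return false;
}

/* Lookup; writes the value to *out and returns true when the key is present. */
static bool map_get(const CompactMap *m, int key, int *out) {
    unsigned i = slot_for(key);
    for (unsigned probes = 0; probes < CAPACITY; probes++) {
        if (!m->used[i]) return false;    /* empty slot ends the probe chain */
        if (m->keys[i] == key) {
            *out = m->values[i];
            return true;
        }
        i = (i + 1u) & (CAPACITY - 1u);
    }
    return false;
}
```

A put or replace in this layout touches only array elements, which is the property that removes per-entry allocation from a churn workload.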
LinkedHashMap keeps its ordering as two parallel `int` link arrays (prev/next slot indices) over the same storage. The hot five operations (get/put/remove/containsKey/clear) run as C natives probing the raw array data.
4. Fused objects: `@Fused` (String, StringBuilder, annotatable user classes)
5. Allocation fast path + init-before-publish (objectAllocation 20x -> 1.19x)
Dead pages whose every slot is garbage are reclaimed O(1) (the page flips back to bump-from-zero) instead of per-slot sweeping.
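A minimal sketch of that reclaim decision, assuming a page tracks a survivor count and a bump index. `Page`, `sweep_page` and the field names are hypothetical and the real page header is not shown in this PR description.

```c
#include <stdbool.h>
#include <stddef.h>

#define SLOTS_PER_PAGE 8

typedef struct {
    bool   live[SLOTS_PER_PAGE];   /* set by the mark phase in this sketch */
    size_t bump_index;             /* next never-used slot */
    size_t live_count;             /* survivors counted during mark */
    size_t freed;                  /* dead slots found by the slow path */
} Page;

/* Returns the number of slots examined one by one (0 on the fast path). */
static size_t sweep_page(Page *p) {
    if (p->live_count == 0) {
        p->bump_index = 0;         /* whole page is garbage: bump from zero again */
        return 0;
    }
    size_t walked = 0;
    for (size_t i = 0; i < p->bump_index; i++) {
        walked++;
        if (!p->live[i]) {
            p->freed++;            /* a real sweep would recycle the slot here */
        }
    }
    return walked;
}
```

The fast path does constant work regardless of how many slots the page held, which is where the saving on allocation-heavy workloads would come from.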
6. Escape analysis: non-escaping StringBuilders live on the C stack
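The idea in this heading, a builder whose storage starts on the C stack and moves to the heap only if it outgrows its inline buffer, can be sketched as below. `StackBuilder` and the `sb_*` functions are hypothetical, and the sketch uses `malloc` where the VM would use its own allocator.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_CAP 32

/* Hypothetical builder whose initial buffer lives inside the struct, so a
   builder declared as a local occupies C stack memory only. */
typedef struct {
    char  *data;                 /* points at inline_buf or a heap block */
    size_t length;
    size_t capacity;
    char   inline_buf[INLINE_CAP];
} StackBuilder;

static void sb_init(StackBuilder *sb) {
    sb->data = sb->inline_buf;
    sb->length = 0;
    sb->capacity = INLINE_CAP;
}

static bool sb_on_heap(const StackBuilder *sb) {
    return sb->data != sb->inline_buf;
}

/* Append n bytes, spilling to the heap only when the inline buffer is too small. */
static bool sb_append(StackBuilder *sb, const char *text, size_t n) {
    if (sb->length + n > sb->capacity) {
        size_t new_cap = sb->capacity * 2;
        while (new_cap < sb->length + n) new_cap *= 2;
        char *grown = malloc(new_cap);
        if (grown == NULL) return false;
        memcpy(grown, sb->data, sb->length);
        if (sb_on_heap(sb)) free(sb->data);
        sb->data = grown;        /* the replacement pointer is itself a stack-resident field */
        sb->capacity = new_cap;
    }
    memcpy(sb->data + sb->length, text, n);
    sb->length += n;
    return true;
}

/* Free any heap spill and reset to the inline buffer. */
static void sb_release(StackBuilder *sb) {
    if (sb_on_heap(sb)) free(sb->data);
    sb_init(sb);
}
```

Short concatenations never touch the heap in this sketch. After a spill, the only pointer to the heap block is the `data` field, which still sits in the caller's stack frame.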
javac lowers `"item-" + i + '/' + n` to `new StringBuilder().append(...)...toString()`. A CFG walk proves the builder reference is only ever the receiver of StringBuilder calls (append returns `this`, so the alias is tracked through chains, re-stores into the same local, and the ternary-in-argument diamonds javac emits). Proven sites:
GC safety falls out of the conservative native-stack scan: if the buffer grows onto the heap, the replacement pointer sits in scanned stack memory.
7. Devirtualization + call-site intrinsics
The same round removed the `enteringNativeAllocations()` bracket (four flag stores on every native call) under conservative roots, where the native stack is scanned and the bracket protects nothing: string-building floor 27.1ms -> 20.4ms from that alone.
GC
Non-moving BiBOP heap with concurrent mark/sweep; conservative native-stack root scanning (default-on) with generation-counted signal-stop; parallel marking; the snapshot's page-table sort is cached (the page registry is grow-only, so the sorted order only changes on registration).
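One of the trigger bugs in this section is a byte counter held in a 32-bit `int`. The arithmetic is easy to reproduce in isolation. The sketch below is illustrative only: the threshold value and function names are hypothetical, and because plain signed overflow is undefined in C, the wraparound is modelled explicitly through `uint32_t`.

```c
#include <stdbool.h>
#include <stdint.h>

#define HIGH_FREQUENCY_BYTES (64LL * 1024 * 1024)   /* hypothetical threshold */

/* 32-bit accumulation with explicit two's-complement wraparound. The final
   conversion back to int32_t wraps on mainstream compilers. */
static int32_t add_bytes_32(int32_t counter, uint32_t bytes) {
    return (int32_t)((uint32_t)counter + bytes);
}

/* The same accumulation in 64 bits does not wrap at these magnitudes. */
static int64_t add_bytes_64(int64_t counter, uint32_t bytes) {
    return counter + (int64_t)bytes;
}

static bool is_high_frequency_32(int32_t counter) {
    return counter > HIGH_FREQUENCY_BYTES;
}

static bool is_high_frequency_64(int64_t counter) {
    return counter > HIGH_FREQUENCY_BYTES;
}
```

Accumulating three 1 GiB allocations leaves the 32-bit counter negative, so a "greater than threshold" test reports low allocation pressure, while the 64-bit counter holds 3221225472 and passes the same test.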
Two real trigger bugs found and fixed (exposed by churn workloads, affect production): `allocationsSinceLastGC` was an `int` accumulating bytes -- GB-per-cycle workloads wrapped it negative, `isHighFrequencyGC()` returned false, and the GC slept its 30s idle wait while dead pages ballooned into the GB range; and `cn1BibopMaybeGc` skipped its 24MB trigger entirely in `nativeAllocationMode`, so workloads allocating only inside natives never collected.
Correctness fixes found along the way (all real bugs)
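The setjmp/longjmp fix in this section rests on a C rule: an automatic variable modified between `setjmp` and `longjmp` has an indeterminate value after the jump unless it is `volatile`-qualified. A minimal sketch of the safe pattern, with hypothetical names (`handler`, `raise_error`, `restore_offset_after_catch`) standing in for the generated try/catch code:

```c
#include <setjmp.h>

static jmp_buf handler;

/* Stand-in for a callee that throws. */
static void raise_error(void) {
    longjmp(handler, 1);
}

/* Returns the value of a local assigned after setjmp and read after longjmp. */
static int restore_offset_after_catch(void) {
    volatile int restore_to = 0;   /* volatile forces the store out to memory */
    if (setjmp(handler) == 0) {
        restore_to = 42;           /* modified between setjmp and longjmp */
        raise_error();
        return -1;                 /* not reached */
    }
    return restore_to;             /* well-defined only because of volatile */
}
```

Without the `volatile` qualifier, a compiler that keeps `restore_to` in a register may hand the "catch" branch the value the register held at `setjmp` time, which matches the gcc-only symptom reported in this section.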
this.restoreTo<label> is assigned at try-entry -- AFTER the setjmp -- and read in the catch handler AFTER a longjmp; C11 makes it indeterminate there. gcc register-allocates it, so the handler restored `threadObjectStackOffset` from a rolled-back register and every callee frame after a caught exception was allocated ON TOP of the current frame's locals. Every clang build worked by luck (clang spills). Found via the musl CI job (the only gcc-compiled platform in CI) hanging deterministically; reproduced locally with gcc-16 (FusedTest segfault, bit-identical at -O0); fixed with `volatile` on the two try-entry variables. This plausibly affected every gcc-built Codename One Linux app that ever caught an exception.
Benchmark fix
`Bench.stringBuilding` previously built a string, read hash+length, and dropped it -- a shape where HotSpot's escape analysis scalar-replaces a String that real code would keep. Measured head-to-head: consume-and-drop 1.49x vs escaping 1.14x (pre-fix). The benchmark now parks each string in a ring buffer that outlives the iteration (batch-consumed, every string still hashed exactly once), so both VMs materialize every String -- measuring string building rather than EA-vs-no-EA.
Benchmark suite (in this PR)
The complete performance + correctness suite is included under `vm/benchmarks/`:
The harness refuses to print ratios if any checksum differs from the host JVM -- divergence is a VM bug by definition, never a perf trade. The README documents each workload and the torture coverage.
Binary size & memory
Same app (`Bench`), same flags (-O3, ThinLTO), master vs this branch, Apple M2:
The master peak-RSS blowup is the `allocationsSinceLastGC` int-overflow bug this PR fixes (the GC slept its 30s idle wait while dead pages accumulated); with the fixed triggers, RSS under heavy churn is bounded below the reference JVM's. The +17 KB binary cost buys the intrinsics, the compact HashMap and the escape-analysis machinery.
API surface
- `@Fused` is the one new public annotation (applied internally to `String`/`StringBuilder`; usable on developer classes with encapsulated primitive buffers). The developer guide's performance chapter now documents it: contract, example, and the automatic optimizations (stack-allocated string building, tagged integers, devirtualization, compact collections, BCE).
- `@StackAllocate` was removed from the public API before merge: nothing applies it, and its contract (no instance ever escapes its creating frame) depends on every caller, something no reusable class can promise. The machinery remains as the engine behind the automatic, per-call-site-proven StringBuilder stack allocation.
- `-DCN1_DISABLE_TAGGED_INT`; auto-disabled on 32-bit pointers incl. Apple Watch). Writing the benchmark scripts exposed that the old opt-in flag was set by NO shipping config -- deployed apps never had it (hashMapChurn 2.8x untagged vs 0.97x tagged).
- `charAt` intrinsic (and the pre-existing native + JS twin) now bound by the string's logical `count` rather than the backing array's capacity; regression case added to StrCmp.
Validation
Every commit was gated on:
`*Impl` twin.
Escape hatches for bisection: `-DCN1_DISABLE_SB_STACK_ALLOC`, `CN1_DISABLE_SCALAR_REPLACE`, `-Dcn1.frameless*`, `CN1_GC_SIGNAL_STOP` env.
CI portability + JS-port hardening (follow-up commits)
The branch was developed and validated on macOS (Darwin exposes GNU/BSD APIs by default); CI flagged the gaps, fixed in two follow-up commits:
- `_GNU_SOURCE` for `pthread_getattr_np`/`REG_*` ucontext indices (glibc+musl); `-flto=thin` gated on Clang (gcc rejects the thin spelling).
- `_WIN32` (cooperative stop path only); the compat shim gained `pthread_once`, `pthread_detach`, `posix_memalign` (`_aligned_malloc` -- the page arena never frees, so the pairing rule is moot), `PTHREAD_COND_INITIALIZER`, and a processor-count fallback without `sysconf`. Found via a full static POSIX audit rather than iterating on first-error-wins compiles.
- `Integer.cn1Value`/`valueOf(int)` natives got their runtime bindings; and the pure-Java `*Impl` twins that `bindNative` delegates call from `parparvm_runtime.js` are now retention roots in both the unused-method cull and the JS RTA -- no bytecode call site exists, so they were being eliminated and the delegation threw `ReferenceError` (caught by the new core-slice completeness tests). All 233 JS-target tests pass locally.
- `BytecodeInstructionIntegrationTest` assertions were stale against deliberate emission changes (indy concat now stack-allocates its builder; frameless supersedes the fast-stack macro) -- modernized to accept every current form while guarding the same contract.
🤖 Generated with Claude Code