Skip to content

Reapply "Column paged merge batcher"#36842

Merged
DAlperin merged 8 commits into
MaterializeInc:mainfrom
DAlperin:dov/column-paged-merge-batcher
Jun 2, 2026
Merged

Reapply "Column paged merge batcher"#36842
DAlperin merged 8 commits into
MaterializeInc:mainfrom
DAlperin:dov/column-paged-merge-batcher

Conversation

@DAlperin
Copy link
Copy Markdown
Member

@DAlperin DAlperin commented Jun 1, 2026

Reapply note

Reapplies #36627, which was reverted in #36839 after feature-benchmark regressions tracked in CLU-100 (Insert 22% slower, Update 17%, CreateIndex 44%, DifferentialJoinHydrationBaseline 55%, SkewedJoin 72%, and others — 10 scenarios past the 10% threshold).

Root cause was inside the forked merge-batcher framework, not in the merge inner loop itself. The new ColumnMergeBatcher shipped without the recycled-chunk stash that the upstream differential_dataflow::MergeBatcher carries: every shipped result and exhausted head dropped its leaf allocations, and the next merge round restarted from zero-capacity Vecs. The per-leaf geometric regrowth tax (~2× the chunk's bytes in memcpy traffic per chunk) landed on every record that flowed through the two arrange call sites, multiplied by log(N) merge rounds per record — matching the regression shape exactly (ingestion / hydration / maintained joins hit hard, arrange-light scenarios like FinishOrderByLimit hit less hard).

The reapply has three changes on top of the original PR:

  1. Per-batcher stash in ColumnMergeBatcher: a small (STASH_CAP = 2, with a MAX_RECYCLE_BYTES per-buffer guard) free-list of cleared Column::Typed buffers, threaded through merge_chains / extract_chain / drain_side. Drained heads, finished result buffers, and consumed source buffers feed in; the next allocation pulls out instead of starting at default.
  2. Whole-chunk passthrough in merge_chains: heads arrive materialized via FetchIter, so peeking endpoints is free. When a head sorts entirely before the other side's current record, ship it wholesale and skip the per-record merge. Same shape as the legacy ColumnMerger::merge fast path, gated on positions[i] == 0.
  3. Feature flag (enable_column_paged_batcher, default false) over the type swap at the two arrange sites. With the flag off, arranges fall back to the legacy Col2ValBatcher / RowRowBuilder (columnation-merger) path that shipped before Column paged merge batcher #36627. With it on, they use the columnar-native Col2ValPagedBatcher / RowRowColPagedBuilder. Eviction is gated by a separate enable_column_paged_batcher_spill (default false).

Build #16668 confirmed the rebased combination of (1) + (2) closes every CLU-100 wallclock regression and stays inside memory thresholds. Build #16672 will be the first run with the feature flag wired in; with the default false it should match pre-#36627 main exactly.

Motivation

The merge-batcher's transient state (chains of sealed input chunks awaiting merge / extract) sits between input ingestion and the spine. Under memory pressure that transient peak is what trips OOMs, not the spine itself — the spine has already consolidated.

This PR plugs ColumnPager (#36552) into the merge-batcher so those chain entries can live on disk (or compressed in RAM) instead of resident memory, while leaving the hot-path merge / extract logic and on-wire chunk shape unchanged.

Description

A new ColumnMergeBatcher in mz-timely-util forks the DD merge-batcher framework so chain entries are PagedColumn (Resident / Paged / Compressed) rather than resident Columns. Streaming drivers merge_chains / extract_chain walk those chains via a FetchIter that materializes one head per side on demand and hands finished output back to the pager. The per-chunk merge / extract logic itself is the same Column::merge_from / Column::extract already used by the in-memory ColumnMerger. A per-batcher stash recycles drained head buffers and shipped result buffers across calls so the inner loop reuses leaf-Vec capacity instead of regrowing from zero each round. Whole-chunk passthrough on positions[i] == 0 skips the per-record merge for disjoint-key heads.

Compute integration:

  • New dyncfgs in compute-types:
    • enable_column_paged_batcher (default false): kill switch over the batcher type swap. When off, arrange call sites use Col2ValBatcher / RowRowBuilder (legacy columnation path). When on, they use Col2ValPagedBatcher / RowRowColPagedBuilder (new columnar-native path that the pager can spill).
    • enable_column_paged_batcher_spill (default false): controls whether the pager actually evicts under budget pressure. Only meaningful when the path flag is on.
    • column_paged_batcher_budget_fraction (default 5% of replica memory; floored at 128 MiB). Pager backend follows --scratch-directory availability the same way mz_ore::pager already does (file when a scratch directory is configured, swap otherwise).
  • apply_worker_config derives (enabled, total, backend) from those dyncfgs and routes through column_pager::apply_tiered_config, which holds a process-wide TieredPolicy singleton and mutates its atomic budget / backend / codec in place. In-flight ResidentTickets keep crediting the same atomic across reconfigures, so operator-driven tunes (or other workers in the same process reapplying the same config) don't orphan accounting onto a stale policy.
  • Two arrange call sites in render/context.rs (arrange_collection) and render/join/linear_join.rs (JoinStage arrange) branch on the path flag at construction time. Both arms return Arranged<S, TraceAgent<RowRowSpine<_, _>>> so the runtime if/else type-checks cleanly; monomorphization keeps both versions of the inner arrange body in the binary.
  • New Col2ValPagedBatcher type alias and a BuilderInput for Column<((K, V), T, R)> impl so the DD OrdValBuilder can consume the batcher's Column output without a container conversion. RowRowColPagedBuilder + a couple of small PushInto / PartialEq impls in row_spine.rs get the Row-keyed paths type-checking.
  • Materialized and Clusterd mzcompose services gain memory_swap and mem_swappiness so feature-benchmarks can configure container swap behavior independently of the batcher.

Observability: column_pager::metrics registers a PagerMetrics struct with the process metrics registry. Counters cover skip / pageout / pagein decisions and bytes through each path; computed gauges expose mz_column_pager_budget_remaining_bytes and mz_column_pager_budget_configured_bytes against the live TieredPolicy atomics. Compute init wires the registration; the timely-util module is registry-agnostic, so callers without a registry (tests, benches, examples) just see no-op observers.

Verification

  • Unit tests (mz-timely-util, columnar + column_pager modules): chunker correctness, per-chunk merge / extract, drain, ColumnMergeBatcher end-to-end seal under ColumnPager::disabled() and a ForcePagePolicy that forces every chunk through the pager. The recycling stash is exercised implicitly by every end-to-end test; TieredPolicy::reconfigure tests cover in-flight ticket preservation across pool resizes, shrink-saturates-at-zero, and live backend/codec swap. Proptests cover merge / extract invariants.
  • Criterion microbench (benches/columnar_merge_batcher.rs): compares legacy ColumnMerger against the paged path with disabled / swap / lz4 across mixed / collision / disjoint inputs and four cache-tier sizes; prints a throughput summary table.
  • End-to-end example (examples/column_paged_spill.rs): drives arrange_core over a cancellation workload (positives + negatives at the same time so the spine stays empty and all pressure lives in the batcher). Back-to-back baseline + spill modes with an optional RSS sampler thread.
  • Feature-benchmark scenarios:
    • DifferentialJoinColumnPaged measures steady-state overhead vs. DifferentialJoin with the path flag on.
    • DifferentialJoinHydrationBaseline / DifferentialJoinHydrationFile measure re-hydration time after REPLICATION FACTOR 0 → 1 toggling. File variant sets both enable_column_paged_batcher and enable_column_paged_batcher_spill to exercise actual eviction. Gated on MzVersion > 26.28.0 via can_run; dev versions skip on both sides until 26.28.0 ships.
  • Nightly feature-benchmark (build #16668) confirmed the rebased recycling + passthrough combination closes every CLU-100 wallclock regression with the feature flag flipped on; with the flag off the path is the legacy columnation behavior.

Risk

The feature flag defaults off. With enable_column_paged_batcher = false, arrange operators run on the same Col2ValBatcher / RowRowBuilder path that shipped before #36627. The new columnar-native batcher infrastructure is still compiled in, but no production dataflow uses it until the flag is flipped. Reverting in production is a flag flip, not a code revert.

The flag is read at operator construction time. Flipping the dyncfg only affects dataflows created after the change; existing dataflows continue on whichever path they were built with until they're rebuilt. This matches existing dyncfg semantics in compute (most flags don't switch live operators).

Whole-chunk passthrough is implemented for the resident case only. Heads arrive resident from FetchIter (Skip path) or freshly decoded from disk (Paged/Compressed path), so peeking endpoints is free in both cases. The fast path is gated on positions[i] == 0 so only chains that haven't been partially consumed can short-circuit, matching the legacy ColumnMerger::merge shape.

Recycling stash is not visible to the pager budget. The 2-entry per-batcher stash holds cleared-but-capacity-preserving Column::Typed allocations that don't carry ResidentTickets. The pager budget sees fewer bytes than RSS by up to ~4 MiB per batcher (MAX_RECYCLE_BYTES). This matches the behavior of the upstream DD MergeBatcher stash that the legacy path runs on (also untracked), so this is not a regression in accounting fidelity. If tighter accounting is wanted later, the follow-up is to give stashed chunks lightweight tickets so the policy can choose to evict them under pressure.

Logging arranges still use the legacy ColumnMerger path. Bytes paged to swap / file don't enter the mz_arrangement_batcher_*_raw RSS-shaped accounting tables, matching how those tables already treat shipped chunks.

@DAlperin DAlperin requested review from a team as code owners June 1, 2026 06:39
@DAlperin DAlperin force-pushed the dov/column-paged-merge-batcher branch from 2b64b12 to 35e4c9d Compare June 1, 2026 06:43
DAlperin added 5 commits June 1, 2026 10:29
Adds a Materialize-private merge-batcher that routes per-chunk transient state through `ColumnPager`, bounding the resident-bytes peak under memory pressure. Behind `enable_column_paged_batcher` (default off).

Three building blocks in `mz-timely-util`: * `ColumnMergeBatcher` + `merge_chains` + `extract_chain` in `columnar/merge_batcher.rs` — chains hold `PagedColumn` entries that resolve to disk on demand. Reuses the existing `Column::merge_from` / `Column::extract` building blocks. * `BuilderInput for Column<((K, V), T, R)>` so DD `OrdValBuilder` can consume the batcher's output without a container conversion. * `column_pager` gains a process-global pager singleton (matching the lower-level pager's global-atomic design) and a per-decision skip/page counter for diagnostics.

Compute integration: * `RowRowColPagedBuilder` alias + `PartialEq<&RowRef> for DatumSeq` / `PushInto<&RowRef> for DatumContainer` so the Row-keyed arrange path type-checks. * Worker init in `apply_worker_config` reads three new dyncfgs and installs the process-global pager: `enable_column_paged_batcher` (on/off), `column_paged_batcher_backend` (`swap` | `file`), `column_paged_batcher_budget_fraction` (fraction of replica memory, default 5%). Per-worker / shared pool sizes derive from `memory_limiter::get_memory_limit` with sensible floors and caps. * Two arrange call sites switched to the paged path: `render/context.rs::arrange_collection` (central ArrangeBy) and `render/join/linear_join.rs::JoinStage`. Other arrange sites (logging) left on the legacy `ColInternalMerger` path.

Also extends `Materialized` and `Clusterd` mzcompose services to accept `memory_swap` and `mem_swappiness`, so callers can configure container-level swap behavior independent of the batcher.
Adds three pieces of validation tooling for the column-paged merge batcher: a criterion microbench, an end-to-end timely example, and feature-benchmark scenarios.

Criterion bench (`src/timely-util/benches/columnar_merge_batcher.rs`): compares the legacy `ColumnMerger` against the new path with disabled / swap / lz4 pagers across three input regimes (mixed, collisions, disjoint) and four cache-tier sizes. Prints a throughput summary table when the group finishes. Good for per-chunk-merge perf comparisons; doesn't exercise the dataflow operator graph.

End-to-end example (`src/timely-util/examples/column_paged_spill.rs`): drives `arrange_core` over a cancellation workload (positives + negatives at the same time, so the spine stays empty and all pressure lives in the merge-batcher). Configurable workers / records / budget; back-to-back baseline + spill modes; optional RSS sampler thread via `ps`. Modeled on `differential-dataflow/examples/columnar_spill.rs` but uses our `Col2ValPagedBatcher` + `ColumnPager` + `TieredPolicy` directly instead of DD's `SpillBatcher`/`Threshold`/`FileSpill` plumbing. `cargo run --release --example column_paged_spill` for a smoke test; see `--help` for sweep options.

Feature-benchmark scenarios (`misc/python/.../scenarios/benchmark_main.py`): * `DifferentialJoinColumnPaged` — same query shape as `DifferentialJoin`, paged batcher enabled. Measures steady-state overhead vs the legacy path. * `DifferentialJoinHydrationBaseline` / `DifferentialJoinHydrationFile` — sister leaves of a non-runnable `DifferentialJoinHydration` parent. Each measures the time to re-hydrate a linear-join arrangement after `REPLICATION FACTOR 0 -> 1` toggling. Baseline has the paged batcher off; File enables it with the file backend and `budget_fraction = 0.01` so chunks spill rather than competing with the spine for RAM. Compare under `--this-memory` + `--this-memory-swap` to evaluate user-space spill vs OS swap.

Feature-benchmark CLI plumbing (`test/feature-benchmark/mzcompose.py`): adds `--this-memory`, `--this-memory-swap`, `--this-mem-swappiness` (and `--other-*` companions) so memory caps and swap behavior are configurable per side, plus `--skip-other` for iterating on `this` without the comparison round trip. The benchmark-result evaluator tolerates the single-side case by returning `None` ratios instead of indexing past the end of `_points`.
@DAlperin DAlperin force-pushed the dov/column-paged-merge-batcher branch from 35e4c9d to 8246be7 Compare June 1, 2026 14:37
…re harness

Three changes on top of the recycling fix to close the remaining
feature-benchmark regressions and let skipped scenarios coexist with
retry filtering:

* Whole-chunk passthrough in merge_chains. Heads arrive resident via
  FetchIter, so endpoint peeks are free. When a head sorts entirely
  before the other side's current record, ship it wholesale and skip
  the per-record merge. Same shape as the legacy ColumnMerger fast
  path; gated on positions[i] == 0.
* STASH_CAP lowered from 16 to 2 + MAX_RECYCLE_BYTES guard
  (1 << 22, ~4 MiB). The stash isn't a hoard — it's a hot-buffer
  cache for the result/keep/ship churn. Passthrough keeps most chunks
  off the merge inner loop, so 2 buffers covers the steady-state
  ship/refill ping-pong without inflating per-batcher resident
  overhead (invisible to the pager budget).
* Gate DifferentialJoin{ColumnPaged,Hydration*} on MzVersion > 26.28.0
  via can_run. Dev versions don't distinguish the dyncfg's presence,
  so the scenarios skip on both sides during 0.x development and
  start running once 26.28.0 ships.
* feature-benchmark/mzcompose: filter the rerun list by
  has_scenario_result before has_scenario_regression. Skipped
  scenarios (via can_run) leave no entry in the report; the previous
  filter raised KeyError instead of just excluding them from reruns.
@DAlperin DAlperin force-pushed the dov/column-paged-merge-batcher branch 3 times, most recently from 3d9e4c1 to 2770d66 Compare June 2, 2026 07:15
Adds a kill switch over the type swap to Col2ValPagedBatcher /
RowRowColPagedBuilder. The new dyncfgs:

* enable_column_paged_batcher (default false): when true, arrange
  call sites use the columnar-native paged batcher / builder. When
  false, they fall back to the legacy columnation Col2ValBatcher /
  RowRowBuilder path that shipped before MaterializeInc#36627.
* enable_column_paged_batcher_spill (default false): renamed from
  the previous enable_column_paged_batcher (which controlled
  eviction). With the path flag off it has no effect; with the path
  flag on it controls whether the pager actually evicts under
  budget pressure.

Both flags default off; arranges run on the legacy columnation path
unless someone opts in. DifferentialJoinHydrationFile scenario opts
both on (path + spill) to exercise the spill path.

Read at operator construction time, so a flip takes effect on
dataflows created after the change; existing dataflows continue on
whichever path they were built with. The runtime if/else at each
arrange site monomorphizes both branches, but they return the same
Arranged<S, TraceAgent<RowRowSpine<_,_>>> so the type system is
happy and binary bloat is bounded.

Touch sites:
* compute-types/dyncfgs.rs: define both flags, register them.
* compute/src/compute_state.rs: spill toggle reads the renamed flag.
* compute/src/render/context.rs: thread use_paged_path through
  arrange_collection and branch the mz_arrange_core call.
* compute/src/render/join/linear_join.rs: same branch at the
  JoinStage arrange.
* feature_benchmark scenarios: HydrationFile now sets both flags.
* parallel_workload + mzcompose lint allowlists: add the new flag.
@DAlperin DAlperin force-pushed the dov/column-paged-merge-batcher branch from 2770d66 to 5def314 Compare June 2, 2026 07:20
Copy link
Copy Markdown
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

Comment on lines +101 to +108
/// Recycled empty `Column::Typed` chunks. Drained heads and shipped result
/// buffers feed in here; subsequent merge / extract calls pop from here
/// instead of starting from a zero-capacity `Column::default()`. Mirrors
/// the stash carried by the upstream `differential_dataflow` merge-batcher
/// framework, which this type forks. Without it, each shipped chunk
/// triggers a fresh per-leaf grow cycle and per-merge-round allocation
/// dominates the inner loop.
stash: Vec<Column<(D, T, R)>>,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should clear the stash in seal because it's unlikely that we'll immediately need the stash again, and it's a good opportunity to release some memory for whatever comes next.

@DAlperin DAlperin merged commit bfa6499 into MaterializeInc:main Jun 2, 2026
120 of 122 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants