[analytics-datafusion] Fixes for native memory on cancel, a query hang, spill progress, threshold updates, and spill size default #22187
Persistent review updated to latest commit 33323b5 |
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #22187 +/- ##
============================================
+ Coverage 73.32% 73.36% +0.03%
- Complexity 75934 76000 +66
============================================
Files 6075 6075
Lines 345282 345282
Branches 49697 49697
============================================
+ Hits 253177 253315 +138
+ Misses 71786 71758 -28
+ Partials 20319 20209 -110
|
33323b5 to
d61cf19
Compare
|
Persistent review updated to latest commit d61cf19 |
6e4b0f8 to
7d6e1bd
Compare
|
Persistent review updated to latest commit 7d6e1bd |
// `DynamicLimitPool::try_grow`). Limits how much memory spillable consumers can
// be allowed through the 85% check at the same time, so several spills together
// stay below the 95% limit. Default 512MB.
static SPILL_EXEMPT_CAP_BYTES: AtomicU64 = AtomicU64::new(512 * 1024 * 1024);
Will make it around 10% in next revision.
|
Persistent review updated to latest commit 1d88317 |
|
❌ Gradle check result for 1d88317: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Bukhtawar
left a comment
Can we write a test to demonstrate spill kicks in?
/** Fraction of the spill volume's total capacity used as the default cap. */
static final double SPILL_LIMIT_FRACTION = 0.80;

/** Fallback when the spill volume's capacity cannot be probed. 8 GiB. */
static final long SPILL_LIMIT_FALLBACK_BYTES = 8L * 1024 * 1024 * 1024;
Will this lead to wasted disk space? Can we reduce the 8GB buffer based on some benchmarks?
This value is only a fallback, based on the minimum instance RAM size (50% of 16 GB). Normally SPILL_LIMIT_FRACTION applies, which is 80% of the spill disk size. Spills that would go above that threshold are rejected.
|
Persistent review updated to latest commit 6cba71d |
|
❌ Gradle check result for 6cba71d: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Persistent review updated to latest commit 60a286a |
|
Persistent review updated to latest commit 5eef64a |
5eef64a to
69dc974
Compare
|
Persistent review updated to latest commit 69dc974 |
69dc974 to
6271a7e
Compare
…the DataFusion engine

Squash of the datafusion-try-grow-fix work:
- Clamp data-node fragment gate to CPU worker count.
- Fix permanent CrossRtStream wedge by spawning the driver off the consumer.
- Exempt self-liquidating spill reservations from the 85% RSS gate (bounded by SPILL_EXEMPT_CAP_BYTES).
- Fix dynamic memory-guard threshold updates silently using stale values (single grouped settings-update consumer).
- Default spill cap to 80% of spill volume capacity; validate operator overrides.
- Make spill_exempt_cap_bytes a dynamic cluster setting (datafusion.memory_guard.spill_exempt_cap_bytes, raw bytes, default 512MB).
- Add Rust unit tests for the spill_exempt_cap setter and FFI negative-clamp.
- Add an end-to-end disk-spill test (GROUP BY under a small pool spills and returns correct results, asserted via DataFusion SpillCount/SpilledBytes metrics).

The cross_rt_stream.rs and runtime_manager.rs changes were reverted; this squash reflects the net final state with those files unchanged from origin/main.

Signed-off-by: snghsvn <snghsvn@amazon.com>
6271a7e to
35443e9
Compare
|
Persistent review updated to latest commit 6271a7e |
|
Persistent review updated to latest commit 35443e9 |
I have added an integration test for spilling.
|
Persistent review updated to latest commit 0dd4959 |
…NGS count

Adding datafusion.memory_guard.spill_exempt_cap_bytes to ALL_SETTINGS raised the registered-setting count from 28 to 29, but testAllSettingsContainsAllExpectedSettings still asserted 28. Update the count and assert the new setting is registered.

Signed-off-by: snghsvn <snghsvn@amazon.com>
|
Persistent review updated to latest commit 302c6c6 |
|
❕ Gradle check result for 302c6c6: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
|
Persistent review updated to latest commit dd2383d |
|
❌ Gradle check result for dd2383d: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Description
Five fixes in the DataFusion analytics backend. They all affect native (Rust-side)
memory on data nodes when queries run concurrently, get cancelled, or spill to disk.
1. Let a spilling query allocate the buffer it needs to finish spilling (memory.rs, memory_guard.rs)
Background: how spilling is driven here. The memory pool watches the node's actual
resident memory (RSS, read from jemalloc) rather than only its own bookkeeping, because some
operators allocate native memory before they reserve it in the pool, so the pool's own counter
can understate reality. There are two RSS lines:
- 85%: above this line the pool rejects requests for more memory. A rejection
is not an error to the operator: spillable operators (sort, grouped aggregate) treat a
rejected request as the signal "out of room, spill to disk now." So rejecting is how the
pool tells a query to spill.
- 95% (critical): above this line every request is rejected, to keep the
node from running out of memory.
The problem. Spilling is not free — to write its in-memory data out, an operator first frees
that data and then needs a small amount of memory to run the sort/merge and write the spill
file. The catch is timing: right after the operator frees its data, the freed pages have not
been returned to the OS yet (the allocator still holds them), so RSS is still above 85% at the
exact moment the spill needs its working memory. The old code rejected that request too — so
the spill it had just ordered could never finish, and the query got stuck with the
"Failed to reserve memory for sort during spill" error.
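The fix. In the band between 85% and 95%, the pool now distinguishes two cases:
- A request from a consumer that cannot spill is still rejected (
letting it grow would only add pressure).
- A request from a spillable consumer is allowed through: it is in the
middle of doing exactly what we want, spilling, and this request is part of finishing it.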
To keep this safe, the allowance is bounded by a fixed byte budget
(SPILL_EXEMPT_CAP_BYTES, default 512 MB), not by a second RSS line. This matters: an RSS-based limit would not work,
because RSS lags behind the frees (the same staleness that caused the original bug). The byte
budget is logical accounting that updates the instant memory is freed, so it stays accurate
during a spill. A running total (exempt_outstanding) tracks how much has been allowed through
but not yet freed; once it reaches the budget, further spillable requests are rejected again
(which just makes them spill more aggressively). The budget is only charged after an
allocation actually succeeds, so a request that is allowed through but then fails for another
reason does not permanently consume budget; it is returned as memory is freed.
The 95% critical line is checked first and ignores this allowance entirely, so even with the
full budget granted to concurrent spills, the node can never be pushed past the hard limit.
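The decision order described above can be sketched roughly as follows. This is a minimal illustration and not the code in memory.rs: the type `SpillGate`, the enum `Gate`, and the methods `check`, `charge`, and `release` are hypothetical names, and it assumes a request is admitted only if it fits entirely within the remaining budget. Only the 85% and 95% lines, the 512 MB default, and the `exempt_outstanding` counter come from the description.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical outcome of the RSS gate for one grow request.
#[derive(Debug, PartialEq)]
enum Gate {
    /// RSS is below the 85% line: normal admission.
    Admit,
    /// RSS is in the 85%..95% band, the consumer can spill, and budget remains.
    AdmitExempt,
    /// Rejected: a spillable consumer treats this as "spill now".
    Reject,
}

/// Hypothetical model of the gate; field names other than
/// `exempt_outstanding` are assumptions made for this sketch.
struct SpillGate {
    soft: f64,                     // 85% RSS line
    critical: f64,                 // 95% RSS line
    exempt_cap: u64,               // stands in for SPILL_EXEMPT_CAP_BYTES
    exempt_outstanding: AtomicU64, // bytes allowed through and not yet freed
}

impl SpillGate {
    fn new(exempt_cap: u64) -> Self {
        SpillGate {
            soft: 0.85,
            critical: 0.95,
            exempt_cap,
            exempt_outstanding: AtomicU64::new(0),
        }
    }

    /// Decide whether a request may proceed. The critical line is checked
    /// first and never consults the exemption budget.
    fn check(&self, rss_fraction: f64, spillable: bool, bytes: u64) -> Gate {
        if rss_fraction >= self.critical {
            return Gate::Reject;
        }
        if rss_fraction < self.soft {
            return Gate::Admit;
        }
        if !spillable {
            return Gate::Reject;
        }
        // Assumption: the whole request must fit in the remaining budget.
        let outstanding = self.exempt_outstanding.load(Ordering::Acquire);
        if outstanding.saturating_add(bytes) <= self.exempt_cap {
            Gate::AdmitExempt
        } else {
            Gate::Reject
        }
    }

    /// Charge the budget only after the underlying allocation has succeeded.
    /// (Check-then-charge is not atomic here; a real pool would need to
    /// handle concurrent callers.)
    fn charge(&self, bytes: u64) {
        self.exempt_outstanding.fetch_add(bytes, Ordering::AcqRel);
    }

    /// Return budget as soon as memory is logically freed, independent of RSS.
    fn release(&self, bytes: u64) {
        let _ = self.exempt_outstanding.fetch_update(
            Ordering::AcqRel,
            Ordering::Acquire,
            |cur| Some(cur.saturating_sub(bytes)),
        );
    }
}
```

In this sketch, a spillable request at 90% RSS returns `Gate::AdmitExempt` while budget remains, `Gate::Reject` once `charge` has consumed the full cap, and `Gate::AdmitExempt` again right after `release`, even though the RSS reading passed in has not changed.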
Flow (one spillable query as RSS climbs):
… exempt_outstanding.
… shrink, which lowers exempt_outstanding immediately, so the budget reopens even though RSS has not dropped yet.
4. Make runtime updates to the memory-guard thresholds actually take effect (DataFusionPlugin.java)
The four memory-guard thresholds were each registered with a callback that re-read the
settings from the cluster service. During a settings update, that read returns the previous
value (the new values aren't visible to that read until after all callbacks have run), so
changing a threshold at runtime silently kept using the old value. The fix uses a single
callback that reads the four values from the updated settings it is handed, so the new values
take effect.
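The stale-read behaviour can be modelled with a toy example. The actual fix is in Java (DataFusionPlugin.java); the sketch below is written in Rust only to stay consistent with the other examples, and every name in it (`ClusterService`, `Thresholds`, `apply_update`, `reread_from_service`, `read_from_update`) is hypothetical. It assumes, as the description states, that callbacks run before the new values become visible to reads of the service.

```rust
/// Hypothetical bundle of the four memory-guard thresholds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Thresholds([f64; 4]);

/// Toy stand-in for the cluster settings service (not the OpenSearch API).
struct ClusterService {
    committed: Thresholds,
}

impl ClusterService {
    /// What a callback sees if it re-reads settings from the service.
    fn current(&self) -> Thresholds {
        self.committed
    }

    /// Apply an update: the callback runs first, and only afterwards do the
    /// new values become visible through `current()`. Returns the values the
    /// callback decided to use.
    fn apply_update<F>(&mut self, updated: Thresholds, callback: F) -> Thresholds
    where
        F: FnOnce(&ClusterService, &Thresholds) -> Thresholds,
    {
        let used = callback(&*self, &updated);
        self.committed = updated;
        used
    }
}

/// Old pattern: ignore the handed-in values and re-read from the service.
fn reread_from_service(svc: &ClusterService, _updated: &Thresholds) -> Thresholds {
    svc.current()
}

/// New pattern: one grouped callback that uses the values it is handed.
fn read_from_update(_svc: &ClusterService, updated: &Thresholds) -> Thresholds {
    *updated
}
```

With `reread_from_service`, `apply_update` returns the previous thresholds even though the update itself is committed afterwards; with `read_from_update`, it returns the new ones.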
5. Default the spill size limit to 80% of the spill disk, and reject invalid overrides (api.rs, DataFusionPlugin.java, tests)
datafusion.spill_memory_limit_bytes now defaults to 80% of the total capacity of the disk
the spill directory is on (with an 8 GiB fallback if that capacity can't be read). An explicit
value set by an operator is rejected if it is larger than the disk's capacity.
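A minimal sketch of that rule, under stated assumptions: the function names `default_spill_limit` and `validate_override` and the constant `SPILL_LIMIT_PERCENT` are hypothetical, the capacity probe is represented by an `Option<u64>`, and an override is accepted unchecked when the capacity is unknown. Only the 80% fraction, the 8 GiB fallback, and the "larger than capacity is rejected" rule come from the description.

```rust
/// 80% of the spill volume's capacity, expressed as an integer percentage
/// so the arithmetic stays exact (hypothetical constant name).
const SPILL_LIMIT_PERCENT: u128 = 80;

/// Fallback when the spill volume's capacity cannot be probed: 8 GiB.
const SPILL_LIMIT_FALLBACK_BYTES: u64 = 8 * 1024 * 1024 * 1024;

/// Default spill limit: 80% of the disk capacity, or the fallback when the
/// capacity is unknown (`None`).
fn default_spill_limit(capacity_bytes: Option<u64>) -> u64 {
    match capacity_bytes {
        // Widen to u128 so the multiplication cannot overflow.
        Some(cap) => (cap as u128 * SPILL_LIMIT_PERCENT / 100) as u64,
        None => SPILL_LIMIT_FALLBACK_BYTES,
    }
}

/// Validate an operator-supplied limit: reject it when it exceeds the
/// capacity of the spill disk. Assumption: with unknown capacity there is
/// nothing to compare against, so the value is accepted.
fn validate_override(requested: u64, capacity_bytes: Option<u64>) -> Result<u64, String> {
    match capacity_bytes {
        Some(cap) if requested > cap => Err(format!(
            "spill limit {} exceeds spill disk capacity {}",
            requested, cap
        )),
        _ => Ok(requested),
    }
}
```

For a 1000-byte volume this yields a default of 800 bytes, an override of 1001 bytes is rejected, and an override of exactly 1000 bytes is accepted.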
Check List
… DataFusionPluginSettingsTests).
… --signoff.