Skip to content

Gate block-manager historical storage behind PROMPP feature flag#389

Merged
vporoshok merged 15 commits into
ppfrom
fix/block-compactor-observability
Jul 2, 2026
Merged

Gate block-manager historical storage behind PROMPP feature flag#389
vporoshok merged 15 commits into
ppfrom
fix/block-compactor-observability

Conversation

@vporoshok

Copy link
Copy Markdown
Collaborator

Summary

  • add a staged rollout switch for historical block backend selection in server mode while keeping PP head + adapter as the write path in both modes
  • default to the pre-PR-377 behavior (historical reads via TSDB) and enable the new block-manager backend only with PROMPP_FEATURES=enable_block_manager
  • keep the additional compaction/blocks observability from earlier commits to make stage diagnostics explicit during rollout

Why

A stage rollout showed a sharp CPU and mmap spike after switching storage behavior, so we need a safer migration path and better visibility before enabling the new scheme by default.

Test plan

  • devcontainer exec --workspace-folder . --config .devcontainer/arm/devcontainer.json go test -tags stringlabels ./pp/go/storage/block/...
  • devcontainer exec --workspace-folder . --config .devcontainer/arm/devcontainer.json go test -tags stringlabels ./cmd/prometheus/...

Made with Cursor

Base automatically changed from index_writer_benchmark to pp June 26, 2026 10:42
@vporoshok vporoshok requested review from u-veles-a and vslpsl June 26, 2026 10:53
@vporoshok vporoshok self-assigned this Jun 26, 2026
vporoshok and others added 11 commits June 26, 2026 14:58
Add compaction plan/result logs, restore missing block-manager TSDB gauges, and cap compaction ranges by max block duration so 2h block setups follow configured bounds.

Co-authored-by: Cursor <cursoragent@cursor.com>
Default to the pre-PR-377 historical TSDB path and enable block-manager only with PROMPP_FEATURES=enable_block_manager, keeping PP head+adapter as the write path in both modes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Expose prometheus_tsdb_blocks_loaded_by_size to track loaded block size buckets after reload and help diagnose startup spikes caused by unexpected compaction output.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the size-based gauge with prometheus_tsdb_blocks_loaded_by_duration so legacy TSDB exposes the same block-duration view as block-manager during startup diagnostics.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add Grafana panels for loaded block layout and normalize loaded-block duration buckets to 5 minutes in both block-manager and legacy TSDB so duration heatmaps are stable and easy to read.

Co-authored-by: Cursor <cursoragent@cursor.com>
Expose loaded-block duration labels in minutes with 1-minute rounding for both block-manager and legacy TSDB paths, and update Grafana panels to query duration_minutes for clearer heatmap grouping.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mirror legacy tsdb "Found healthy block" output so operators can see the
on-disk block layout when the block manager starts, including each block's
normalized duration in minutes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Run reload (with deletion) and a single compaction pass sequentially in one
goroutine, mirroring tsdb's compact/reload loop. This removes the race where
the compactor's independent loop could plan/compact blocks that the manager's
reload was concurrently deleting (open meta.json: no such file or directory)
and the resulting repeated re-compaction of overlapping blocks (CPU/mmap churn).

The compactor no longer runs its own goroutine, ticker or shared mutex: it
exposes a one-shot Compact() called by Manager after each reload. Also render
the compaction plan as a string in logs (go-kit cannot encode []string).

Co-authored-by: Cursor <cursoragent@cursor.com>
When the C++ scrape parser rejects a buffer with invalid UTF-8 (in a HELP text
or a label value), print the containing line, the buffer size and the line
start offset to stdout. The Go-side buffer is mutated in place during parsing,
so the offending bytes can only be inspected reliably here.

Co-authored-by: Cursor <cursoragent@cursor.com>
After a successful compaction, immediately reload and compact again instead of
waiting for the next ticker interval, so multiple pending compactions converge
in one tick (mirroring tsdb's compact/reload loop). Compact now reports whether
it did any work to drive the loop.

Co-authored-by: Cursor <cursoragent@cursor.com>
Revert the mixin panels and example dashboard JSON added for block duration
observability; keep only the code-level metrics and logging.

Co-authored-by: Cursor <cursoragent@cursor.com>
@vporoshok vporoshok force-pushed the fix/block-compactor-observability branch from e93b6cd to f6bbeb4 Compare June 26, 2026 11:05
Comment thread pp/go/storage/block/manager.go
Mirror tsdb: if the reload that loads a newly compacted block fails, remove
that block so a half-applied compaction does not leave orphaned blocks on
disk. Compact now returns the created block ULIDs and the manager removes
them on reload failure.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread cmd/prometheus/main.go Outdated
Comment thread pp/wal/hashdex/scraper/scraper.h Outdated
vporoshok and others added 3 commits June 29, 2026 17:17
Open the historical backend (tsdb.DB or block.Manager) eagerly during setup
for both schemes. The historical path does no WAL replay (the PP head +
adapter is the only write path), so the open is as cheap as the block
manager's initial reload and there is no need to defer it into the run group.

Remove the now-unnecessary dbOpen channel and the startup gating on it in the
config loader and head manager; the server storage actor only waits for
shutdown and closes the fanout. Simplify tsdbHistoricalStorage now that its db
is set at construction.

Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the scraper invalid-UTF-8 diagnostic logging (log_invalid_utf8 helper
and its call sites) that was added only for stage diagnostics. Update the
storage-scheme comment to reference the PROMPP_FEATURES=enable_block_manager
feature flag instead of the removed CLI flag.

Co-authored-by: Cursor <cursoragent@cursor.com>
@vporoshok vporoshok merged commit 2af58b5 into pp Jul 2, 2026
30 of 33 checks passed
@vporoshok vporoshok deleted the fix/block-compactor-observability branch July 2, 2026 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants