Skip to content

feat(electric-telemetry): add process_subtype attribute for supervisor/erlang/logger_olp granularity#4397

Open
erik-the-implementer wants to merge 7 commits into
mainfrom
erik/process-subtype-attribute
Open

feat(electric-telemetry): add process_subtype attribute for supervisor/erlang/logger_olp granularity#4397
erik-the-implementer wants to merge 7 commits into
mainfrom
erik/process-subtype-attribute

Conversation

@erik-the-implementer
Copy link
Copy Markdown
Contributor

@erik-the-implementer erik-the-implementer commented May 22, 2026

Summary

Adds a new low-cardinality process_subtype attribute alongside the existing process_type on all electric-telemetry events that today carry it: vm.monitor.long_gc, vm.monitor.long_schedule, vm.monitor.long_message_queue, process.memory, process.bin_memory.

For the three coarse process_type buckets that hide the most signal during overload (per recent investigations into long-mailbox spikes), process_subtype is derived from cheap process introspection:

  • process_type = "supervisor" → registered name; else first atom in $ancestors; else nil.
  • process_type = "erlang" → registered name (catches named VM helpers like :erts_dirty_process_signal_handler); else initial_call MFA string (e.g. ":erlang.apply/2").
  • process_type = "logger_olp" → registered name (the handler id — default, otel_log_handler, logger_proxy, …).

For all other process_type values, process_subtype is nil.

The change is purely additive: process_type values are unchanged, so existing Honeycomb boards and alerts that group by process_type continue to work. process_subtype gives a drill-down dimension without exploding cardinality (registered names + MFAs only; no pids, no dynamic registry tuples).

Related issues

  • Source task: electric-sql/alco-agent-tasks#46
  • SRE motivation / investigation: electric-sql/alco-agent-tasks#45 — long-mailbox / overload investigations where the coarse process_type buckets (supervisor, erlang, logger_olp) hid the specific processes responsible. process_subtype adds the drill-down dimension those investigations needed.

Implementation notes

  • New ElectricTelemetry.Processes.proc_type_and_subtype/1 returns {type, subtype} in a single Process.info/2 call; proc_subtype/1 is also exported for callers that only want the subtype.
  • Process.info/2 now also fetches :registered_name (one extra key per call).
  • sorted_groups/2 groups by {type, subtype} so the process.memory / process.bin_memory metrics break down by subtype as well.
  • :process_subtype is added to the tags: lists of the affected last_value, sum, and distribution metric definitions.
  • proc_type/1 is kept unchanged for backward compatibility.

Test plan

  • New unit tests for each of the three subtype-bucket derivations and their fallback paths (packages/electric-telemetry/test/electric/telemetry/processes_test.exs).
  • Existing proc_type/1 tests unchanged and still pass (122/122 tests pass in electric-telemetry).
  • Changeset added: @core/electric-telemetry: minor.

🤖 Generated with Claude Code

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.37%. Comparing base (d921a9f) to head (7026b78).
⚠️ Report is 11 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...telemetry/lib/electric/telemetry/system_monitor.ex 0.00% 16 Missing ⚠️
...ry/lib/electric/telemetry/application_telemetry.ex 0.00% 2 Missing ⚠️
...tric-telemetry/lib/electric/telemetry/processes.ex 95.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4397       +/-   ##
===========================================
+ Coverage   37.04%   69.37%   +32.32%     
===========================================
  Files         217       15      -202     
  Lines       17094      591    -16503     
  Branches     5762        0     -5762     
===========================================
- Hits         6333      410     -5923     
+ Misses      10746      181    -10565     
+ Partials       15        0       -15     
Flag Coverage Δ
electric-telemetry 69.37% <50.00%> (?)
elixir 69.37% <50.00%> (?)
packages/agents ?
packages/agents-mobile ?
packages/agents-server ?
packages/agents-server-ui ?
packages/electric-ax ?
typescript ?
unit-tests 69.37% <50.00%> (+32.32%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Claude Code Review

Summary

Iteration 4. Retracting the critical issue raised in iteration 3 — it was based on a commit (3c9935c3d) that doesn’t exist anywhere in the repository. The actual sorted_groups/2 uses unconditional Map.put(:subtype, subtype) (processes.ex:94-95), so application_telemetry.ex:194,210 reading map.subtype is safe. There is no KeyError regression. The PR is in good shape; only minor non-blocking suggestions remain.

What is Working Well

  • Polish since iteration 1 covers nearly every suggestion raised. Four small chore/test commits address what iteration 1 flagged:
    • 5ab7d1cc6 tightens @spec proc_type_and_subtype(pid()) :: {atom() | binary(), atom() | binary() | nil} — was {term(), binary() | nil}.
    • aa3163990 makes registered_name/1 (formerly registered_name_string/1) explicit about nil/empty-list handling using a with [] <- form.
    • fef04de39 pins the {:dead, nil} contract for proc_type_and_subtype/1 (processes_test.exs:226-241) — closes the test-gap suggestion from iteration 1.
    • d78050545 removes the redundant single-shot proc_subtype/1 entry point and unifies the fallback chain (registered_name -> ancestor -> initial_call_mfa_string) across all three buckets.
  • Single-pass info fetch is preserved. proc_type_and_subtype/1 still uses one Process.info/2 call to drive both the type and subtype derivation, and :registered_name is the only added key — cheap addition to the per-process hot path.
  • No atom-table risks. All atoms emitted as subtypes come from process introspection (:registered_name, $ancestors head, initial_call module) — never created from external input.
  • Tests cover the dead-process contract and exercise the supervisor / erlang / logger_olp buckets, including the deepest initial_call fallback path.

Issues Found

Critical (Must Fix): None. Retracting the iteration-3 KeyError claim — see Summary above. The code path under review (sorted_groups/2) writes :subtype unconditionally; non-bucketed groups carry subtype: nil end-to-end, which is well-defined for map.subtype reads.

Important (Should Fix): None.

Suggestions (Nice to Have):

Mixed atom() | binary() subtype values at the metric-tag boundary (packages/electric-telemetry/lib/electric/telemetry/processes.ex:35, 171-187). The fallback chain produces an atom in two of three cases (registered_name, ancestor_name) and a binary in the third (initial_call_mfa_string). The reporters all stringify tag values eventually, so emission will not break, but the rendered values are visibly inconsistent at the dashboard layer: Atom.to_string(:my_worker) -> "my_worker" while Exception.format_mfa(:erlang, :apply, 2) -> ":erlang.apply/2" (leading colon kept). Worth a quick eyeball at one of the existing Honeycomb boards to confirm filter predicates like process_subtype = "my_worker" work after this lands. If you want to normalize, a to_string/1 on the atom branches in proc_subtype/2 would make every emitted subtype a binary and side-step the inconsistency. Not a blocker.

info/1 made public without an external caller (packages/electric-telemetry/lib/electric/telemetry/processes.ex:160-163). d78050545 flipped defp info(pid) to @doc false def info(pid). @doc false keeps it out of generated docs, but def still makes it part of the public interface for dialyzer, hot-code-loading boundaries, and accidental external use. A repository-wide grep confirms no callers outside processes.exsystem_monitor.ex only imports proc_type_and_subtype/1, and processes_test.exs does not reference info/1. Recommend reverting to defp unless there is a planned future caller.

nil-tag rendering on non-bucketed processes (carry-over) (packages/electric-telemetry/lib/electric/telemetry/system_monitor.ex:58-61 for port long_schedule, plus implicit process_subtype: nil for every non-bucketed process emitted via process.memory / process.bin_memory). process_subtype: nil flows through to telemetry_metrics_prometheus_core, telemetry_metrics_statsd, and otel_metric_exporter, each of which has its own convention for nil label values (typically empty string or literal "nil"). Worth a single sanity check against the live Honeycomb / Prometheus / StatsD pipeline before relying on "purely additive, existing boards keep working," especially for any alert that filters on process_type=<bucket>. No code change required if the rendering is acceptable; otherwise consider mapping nil -> "_none" (or similar) at the :telemetry.execute boundary so the label is always a string.

Stale subtype on :recheck_message_queues (carry-over) (packages/electric-telemetry/lib/electric/telemetry/system_monitor.ex:128-134). The captured {type, subtype} is replayed every 200ms for as long as the queue stays over threshold. Subtypes can drift slightly more than coarse types — e.g. a registered name being set after spawn would update the subtype but not be reflected in the cached tuple. Re-resolving via proc_type_and_subtype(pid) inside log_long_message_queue_event/3 would be more accurate. Consistent with the prior type-only behaviour, so not a regression; flagging as a known minor inaccuracy.

Issue Conformance

The PR description accurately reflects the implementation: additive process_subtype on the five named events, three coarse buckets subtyped via the documented registered_name -> $ancestors -> initial_call chain, all other types nil. Linked tasks (electric-sql/alco-agent-tasks#46, #45) are present. The polish commits since iteration 1 are silent in the PR description, but they are pure follow-ups and do not change the behavioural contract — not worth a description rewrite.

Previous Review Status

Iteration 3 critical issue was incorrect. It claimed a commit 3c9935c3d introduced a put_if_not_nil(:subtype, ...) change in sorted_groups/2 that would make application_telemetry.ex crash with KeyError. That commit does not exist in the repository (or anywhere in git log --all), and the cited code (put_if_not_nil) does not appear in the diff. The actual sorted_groups/2 writes :subtype unconditionally and has done so since the initial commit on this branch. Apologies for the false alarm.

Addressed since iteration 1:

  • Linked issues — added (#46, #45).
  • Dead-process contract test for proc_type_and_subtype/1 — added in fef04de39.
  • registered_name_string/1 clause-ordering safety — addressed in aa3163990 (rewritten as registered_name/1 using with [] <-).
  • proc_type_and_subtype/1 @spec tightened — 5ab7d1cc6.
  • proc_subtype/1 docstring nit — moot; the entire proc_subtype/1 entry point was removed in d78050545.

Still open (all non-blocking):

  • Mixed atom/binary subtype values (suggestion above).
  • info/1 visibility flip (suggestion above).
  • nil-tag rendering sanity check (carry-over).
  • Stale subtype on :recheck_message_queues (carry-over).

Review iteration: 4 | 2026-06-01

@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Claude Code Review

Summary

Iteration 5. One new commit since iteration 4 (7026b78a9) re-resolves {type, subtype} per tick in :recheck_message_queues, closing the carry-over staleness item from prior reviews. No new issues; remaining open suggestions are minor and non-blocking.

What's Working Well

  • Carry-over staleness fixed. 7026b78a9 switches long_message_queue_pids from a Map<pid, {type, subtype}> to a MapSet<pid> and re-runs proc_type_and_subtype(pid) inside the recheck loop (system_monitor.ex:128-131). Subtypes that change after the initial detection (late Process.register/2, ancestor renaming) are now reflected on every 200 ms emission instead of being frozen at first-detect time.
  • Empty-set early-return preserved. The new :recheck_message_queues clause keeps the previous behaviour of cancelling the timer and clearing long_message_queue_timer when the set is empty (system_monitor.ex:123-126) — just expressed as an if over MapSet.size/1 instead of a separate guarded clause. Functionally equivalent.
  • Dead-pid path is still safe. A pid that dies while still in the set will get {:dead, nil} from proc_type_and_subtype/1, then log_long_message_queue_event/3 is a no-op because Process.info(pid, :message_queue_len) returns nil for a dead pid and the with {:message_queue_len, _} guard fails. No telemetry will be emitted with process_type: :dead.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

Two Process.info/2 calls per pid per recheck tick (minor)

File: packages/electric-telemetry/lib/electric/telemetry/system_monitor.ex:128-131, 137-143

With the requery, each pid in long_message_queue_pids now triggers two Process.info/2 calls every 200 ms: one inside proc_type_and_subtype/1 (six keys), one inside log_long_message_queue_event/3 for :message_queue_len. For the expected size of this set (a handful of overloaded pids), this is fine. If you ever care, info/1 in processes.ex:160-163 could grow :message_queue_len and the helper could thread it through — but unless overload scenarios push the set into the hundreds, the current shape is simpler and not worth optimising.

Mixed atom() | binary() subtype values at the metric-tag boundary (carry-over)

File: packages/electric-telemetry/lib/electric/telemetry/processes.ex:35, 171-187

Still applies after the iteration-4 polish. proc_subtype/2 returns an atom in the registered_name/ancestor_name branches and a binary in the initial_call_mfa_string branch. The exporters all stringify eventually, but the rendered values differ: :my_worker -> "my_worker" vs Exception.format_mfa(:erlang, :apply, 2) -> ":erlang.apply/2" (leading colon kept). Worth eyeballing one of the Honeycomb boards once this lands — a filter process_subtype = "erlang.apply/2" will silently miss MFA rows because of the leading colon. If you want to normalise, wrap the atom branches in Atom.to_string/1 inside proc_subtype/2 so every emitted subtype is a binary. Not a blocker.

info/1 is public without an external caller (carry-over)

File: packages/electric-telemetry/lib/electric/telemetry/processes.ex:160-163

Still @doc false def info(pid). A repo-wide grep still confirms no external callers (system_monitor.ex only imports proc_type_and_subtype/1; processes_test.exs does not call info/1). @doc false hides it from generated docs but doesn't restore private semantics — dialyzer, hot-code-loading boundaries, and accidental cross-module use still see it. Recommend flipping back to defp unless a future caller is already planned.

process_subtype: nil rendering across exporters (carry-over)

Files: packages/electric-telemetry/lib/electric/telemetry/system_monitor.ex:58-61, plus implicit nil from proc_type_and_subtype/1 for every non-bucketed process emitted via process.memory / process.bin_memory.

telemetry_metrics_prometheus_core, telemetry_metrics_statsd, and otel_metric_exporter each have their own convention for nil label values (typically empty string or literal "nil"). Worth a single sanity check against the live Honeycomb / Prometheus / StatsD pipeline before relying on "purely additive, existing boards keep working", especially for any alert that filters on process_type=<bucket>. No code change required if rendering is acceptable; otherwise consider mapping nil -> "_none" at the :telemetry.execute boundary so the label is always a string.

Issue Conformance

PR description still accurately reflects the implementation. The new commit (7026b78a9) tightens the freshness contract on :recheck_message_queues but doesn't change the public attribute set — the description's "purely additive" claim still holds. Linked tasks (electric-sql/alco-agent-tasks#46, #45) remain accurate. Codecov's two reported failures (Electric.ShapeCacheTest, Electric.Replication.PublicationManagerTest) are unrelated to electric-telemetry — pre-existing flakes / unrelated areas.

Previous Review Status

Addressed since iteration 4:

  • Stale subtype on :recheck_message_queues — fixed in 7026b78a9 by switching to a MapSet and re-resolving {type, subtype} per pid on each tick.

Still open (all non-blocking, unchanged):

  • Mixed atom/binary subtype values (suggestion above).
  • info/1 visibility flip (suggestion above).
  • nil-tag rendering sanity check (carry-over).

Review iteration: 5 | 2026-06-02

Adds a new low-cardinality `process_subtype` attribute alongside the
existing `process_type` on all telemetry events that today carry it
(`vm.monitor.long_{gc,schedule,message_queue}`, `process.memory`,
`process.bin_memory`).

For the three coarse `process_type` buckets that previously hid most
of the signal during overload, `process_subtype` is derived as:

  * `:supervisor`  -> registered name, else first atom in $ancestors
  * `:erlang`      -> registered name, else initial_call MFA string
  * `:logger_olp`  -> registered name (handler id)

For every other `process_type` value, `process_subtype` is `nil`.
The existing `process_type` taxonomy is unchanged, so Honeycomb boards
and alerts that group by it continue to work; `process_subtype` adds
a finer-grained drill-down without exploding cardinality.

Refs electric-sql/alco-agent-tasks#46.
term() is correct but uninformative — in practice the type is always
atom() | binary() (atoms cover :dead, :unknown, module atoms, and atom
labels; binaries cover the string-label case). Helps dialyzer downstream.
…ut nil

is_atom(nil) is true, so the previous clause order (`nil -> nil` before
the atom guard) was load-bearing — dropping it would silently turn into
`"nil"`. Rewrite the guard to match on `is_atom(name) and not is_nil(name)`
so the clause stands on its own.
…_subtype/1

Existing tests cover proc_type/1 returning :dead for an exited process,
but proc_type_and_subtype/1 and proc_subtype/1 weren't exercised for
that case. Implementation relies on Process.info/2 returning nil and
Access on nil cascading nils through every helper; lock that contract
down so a future refactor of info/1 doesn't silently change the answer.
@erik-the-implementer erik-the-implementer force-pushed the erik/process-subtype-attribute branch from 1bd5c76 to fef04de Compare June 1, 2026 13:34
@erik-the-implementer erik-the-implementer force-pushed the erik/process-subtype-attribute branch from 3c9935c to d780505 Compare June 1, 2026 15:03
@alco alco self-assigned this Jun 2, 2026
Copy link
Copy Markdown
Contributor

@robacourt robacourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proc_type already comes from a variety of places (process label or if not initial_module) so it seems natural to extend that to check other places if the result we get back is not helpful. for example, I don't think it's every useful to have erlang as the proc_type and we could fall back to what you have as the subtype in that case. same for supervisor.

The strong argument for a proc_sub_type would come from {"logger_olp", handler id} if there's an infinite amount of handler_ids, and if it's useful to group by logger_olp.

So I can potentially see the need for proc_subtype, but personally I wouldn't use it for erlang and supervisor because it's not particularly useful to group by erlang or supervisor and there should only be a limited number of the names we'd replace it with.

Or maybe we shouldn't have subtype at all. What are the handler ids? Are they limited and readable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants