Add docs recommending autoscaling setup#324

Open
carlydf wants to merge 9 commits into main from demo-ga-no-recording-rule

@carlydf carlydf commented May 14, 2026

Collaborator

Adds documentation outlining the tradeoffs between two autoscaling solutions:

  1. HPA+prometheus adapter
  2. KEDA Temporal Scaler

Documentation focuses on straightforward descriptions of the pros and cons of each solution.

@carlydf carlydf requested review from a team and jlegrone as code owners May 14, 2026 01:52
@carlydf carlydf marked this pull request as draft May 14, 2026 02:03

@jaypipes jaypipes left a comment

Contributor

Thanks @carlydf, I've done a first pass reviewing this documentation and added (quite a few) suggested changes and removals to "de-Claude" some of it and (hopefully) make it a bit more readable for a general audience.

Comment thread docs/README.md
Comment thread docs/scaling-recommendations.md Outdated
@jaypipes jaypipes marked this pull request as ready for review June 1, 2026 16:35
@jaypipes jaypipes changed the title from "Drop backlog recording rule; consume raw temporal_cloud_v1_approximate_backlog_count" to "Add docs recommending autoscaling setup" Jun 3, 2026

@Shivs11 Shivs11 left a comment

Member

a couple of nits -- looks good to me otherwise

Comment thread docs/scaling-recommendations.md
@jaypipes jaypipes force-pushed the demo-ga-no-recording-rule branch from 718471b to f8335de Compare June 9, 2026 12:49
@jaypipes jaypipes self-requested a review June 9, 2026 12:53
@jaypipes jaypipes force-pushed the demo-ga-no-recording-rule branch from d452af5 to d1e2d02 Compare June 11, 2026 16:53

@carlydf carlydf left a comment

Collaborator Author

So close! Thank you for all your hard work on this @jaypipes. Let me know what you think of my suggestions :)

Comment thread internal/demo/k8s/prometheus-stack-values.yaml Outdated
Comment thread internal/demo/README.md Outdated
carlydf and others added 8 commits June 22, 2026 13:03
The backlog metric pipeline goes from prometheus-adapter directly to the
raw temporal_cloud_v1_approximate_backlog_count series, eliminating the
temporal_approximate_backlog_count recording rule. Adapter rule:

- seriesQuery filters out temporal_worker_build_id="__unversioned__" so
  discovery doesn't choke on the 5000+ unversioned series in typical
  accounts.
- metricsQuery sum(...) collapses labels the HPA doesn't select on at
  query time (instance/job/region/task_priority/temporal_account).
- metricsRelistInterval is bumped to 5m to accommodate the ~3-minute
  embedded-timestamp lag in Temporal Cloud's OpenMetrics emission.
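The filtering and label-collapsing described in the first two bullets can be illustrated with a small Python sketch. This is not the adapter's implementation or its configuration syntax; it only mimics the effect on a handful of made-up series. The `temporal_task_queue` label, the label values, and the backlog numbers are assumptions for illustration; `temporal_worker_build_id`, `region`, and `task_priority` are the label names mentioned in the commit message.

```python
from collections import defaultdict

UNVERSIONED = "__unversioned__"

# Made-up raw series: (labels, backlog value). Values are illustrative only.
SERIES = [
    ({"temporal_task_queue": "orders", "temporal_worker_build_id": "v1",
      "region": "us-east-1", "task_priority": "1"}, 40.0),
    ({"temporal_task_queue": "orders", "temporal_worker_build_id": "v1",
      "region": "us-east-1", "task_priority": "3"}, 25.0),
    ({"temporal_task_queue": "orders", "temporal_worker_build_id": "v2",
      "region": "us-east-1", "task_priority": "3"}, 5.0),
    ({"temporal_task_queue": "orders", "temporal_worker_build_id": UNVERSIONED,
      "region": "us-east-1", "task_priority": "3"}, 900.0),
]


def discover(series):
    """Mimic the seriesQuery filter: skip unversioned series entirely."""
    return [
        (labels, value)
        for labels, value in series
        if labels.get("temporal_worker_build_id") != UNVERSIONED
    ]


def sum_by(series, keep):
    """Mimic sum(...) by (keep): labels not listed in `keep` are collapsed."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(labels.get(name, "") for name in keep)
        totals[key] += value
    return dict(totals)


# Labels the HPA is assumed to select on (hypothetical choice).
KEEP = ("temporal_task_queue", "temporal_worker_build_id")
backlog = sum_by(discover(SERIES), KEEP)
print(backlog)
```

In this sketch the two `v1` series that differ only in `task_priority` collapse into a single value per task queue and build ID, and the unversioned series never reaches the sum.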

WRT example, prometheus-stack-values, and demo README are updated to
match. Add docs/scaling-recommendations.md covering the empirically
measured reactivity model (steady-state ~3:15 dominated by Cloud
aggregation lag), task-queue-unload behavior, scale-from-zero limits,
and when to pick KEDA over the metric path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Initial scaling-recommendations.md framed steady-state HPA reactivity as
~3:15, citing a "Temporal Cloud aggregation lag." That was wrong. The
actual sample-age distribution on the OpenMetrics endpoint is:

  p50  30s  (matches ~1/min emission cadence, age oscillates 0-60s)
  p95  50s
  p99  ~tail of occasional gateway-wide stalls

So typical end-to-end reactivity is ~85s (emission + scrape + HPA poll),
not ~3:15. The 3-minute figures came from observations made during the
occasional periods when the OpenMetrics gateway returns frozen
timestamps across every series in the account simultaneously - those
stalls are real but not steady-state.

Doc now:
- Replaces the 3:15 figure with empirically-derived ~85s typical.
- Adds a "Gateway-wide stalls" caveat describing the frozen-timestamp
  behavior observationally (no speculation about cause).
- Keeps the metricsRelistInterval: 5m recommendation, now justified by
  the need to exceed stall duration rather than the misattributed
  "aggregation lag."
- Demo README updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Earlier wording implied multiple stall events ("occasional periods")
when we have only directly characterized one such event during this
investigation. Reword to describe exactly what was seen, note that
frequency is not yet known, and that the behavior is open with the
Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verified directly: across a 3-hour window including one of the observed
"stall" events, every gap between consecutive sample timestamps in
Prometheus's storage is exactly 60 seconds. So the OpenMetrics endpoint
isn't dropping or freezing emissions - it's delivering them late, in
bursts after a delay, with their original minute-aligned timestamps.

The retrospective record looks complete (good for dashboards), but live
HPA consumers see the delay as real staleness because they query the
latest available timestamp at decision time. Reframe the caveat in the
scaling doc and demo README accordingly.
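The distinction between a complete retrospective record and live staleness can be sketched in Python. The arrival times below are made up for illustration and are not measurements from the investigation; the helper names are hypothetical. Each sample carries its original minute-aligned timestamp, but three of them are delivered together in a late burst.

```python
def timestamp_gaps(timestamps):
    """Gaps between consecutive sample timestamps, as seen in storage."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def staleness_at(decision_time, arrivals):
    """Age of the newest sample that has actually arrived by decision_time.

    `arrivals` is a list of (sample_timestamp, arrival_time) pairs in seconds.
    Returns None if nothing has arrived yet.
    """
    visible = [ts for ts, arrived in arrivals if arrived <= decision_time]
    if not visible:
        return None
    return decision_time - max(visible)


# Made-up data: samples stamped every 60s; those stamped 180, 240, and 300
# all arrive together at t=330 instead of shortly after their timestamps.
ARRIVALS = [(0, 5), (60, 65), (120, 125), (180, 330), (240, 330), (300, 330), (360, 365)]

gaps = timestamp_gaps([ts for ts, _ in ARRIVALS])
print(gaps)                          # every stored gap is 60s
print(staleness_at(320, ARRIVALS))   # during the delay, before the burst lands
print(staleness_at(335, ARRIVALS))   # just after the burst lands
```

Looking back at stored timestamps, nothing appears to be missing. A consumer making a decision during the delay, however, can only see the last sample that had arrived at that moment, so the metric it acts on is several minutes old.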

Also note we observed two such delay events in ~2 hours of close
observation - frequency in normal operation is still open with the
Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-authored-by: Jay Pipes <jaypipes@gmail.com>
Co-authored-by: Stefan Richter <stefan@02strich.de>

Removes a bunch of overly verbose Claude-generated stuff that will
likely confuse readers. Reworded a few places where Claude was using
some odd terminology -- e.g. "typical end-to-end reactivity" -- to use
more straightforward verbiage. Added a brief WRT example HPA template
that shows the stabilization window that is referred to in multiple
sections of the doc.

Signed-off-by: Jay Pipes <jay.pipes@temporal.io>
Signed-off-by: Jay Pipes <jay.pipes@temporal.io>
Signed-off-by: Jay Pipes <jay.pipes@temporal.io>
@jaypipes jaypipes force-pushed the demo-ga-no-recording-rule branch from d1e2d02 to a80374f Compare June 22, 2026 17:04

@carlydf carlydf left a comment

Collaborator Author

approved! (I can't actually approve because this is technically my PR)

@jaypipes jaypipes enabled auto-merge (squash) June 22, 2026 19:04
6 participants