feat: L2 Cache with Valkey Client Side Hash Ring by larry-dalmeida · Pull Request #4033 · zalando/skipper

larry-dalmeida · 2026-05-26T10:40:59Z

Related Issue

Description

Extends the cache() filter with an optional Valkey-backed L2 cache using a client-side consistent hash ring.

When --swarm-valkey-urls is configured, responses are stored in Valkey (L2) with automatic fallback to the in-process LRU (L1) on any Valkey error.

Storage architecture

Write path: successful Valkey Set does not warm L1 (write-around). L1 is only populated on Valkey errors.

Read path: Valkey miss → nil (clean miss, no L1 consulted). Valkey error → L1 consulted as fallback.

Valkey ring topology

All pods share the same ring, so a cold-miss coalesced by pod A lands in the same shard that pod B would read from — no thundering herd across pods.

Observability

Three counters track Valkey operation outcomes:

Counter	Meaning
`valkey_miss`	Key absent in Valkey (clean miss)
`valkey_get_fallback`	Valkey error on Get; L1 consulted
`valkey_set_fallback`	Valkey error on Set; L1 written

lru_bytes gauge is now updated by a background scraper every 10 s instead of only on eviction, so it stays accurate when capacity is not exceeded.

Storage Set and Delete errors are now logged at Warn instead of being silently discarded.

Additional changes

Trace spans: cache_status, cache_key, and cache_ttl_remaining_ms are tagged on the active OpenTracing span for HIT, MISS, and STALE paths.
Injectable metrics: lru_eviction and lru_oversized counters are now injectable via constructor parameters, removing the hidden metrics.Default dependency and enabling test-scoped counter assertions without global state mutation. See filters/cache/observability-gaps.md for why full per-route namespacing requires a Skipper core change.
NewCacheFilter signature change: a fourth parameter valkeyRing *skpnet.ValkeyRingClient is added. Pass nil to preserve the existing LRU-only behaviour. Call sites embedding Skipper as a library must be updated.

Bug fixes

Stale-if-error via coalesce: the SIE block in Response() was dead code — coalesce() always calls ctx.Serve(), causing Response() to return early. SIE logic moved inside coalesce() with the pre-fetch snapshot captured before f.fetch runs.
only-if-cached + SWR: entries in the stale-while-revalidate window are still being served to other clients and must not return 504 to only-if-cached requests. Fixed via IsUsable() replacing IsStale().
Response() nil-safe key assertion: bare type assertion on stateBagKey panicked on route misconfiguration; replaced with comma-ok form.
Expire(-1) truncation: time.Duration(-1) (-1 ns) truncated to EXPIRE key 0 in Valkey. Changed to -1*time.Second which correctly sends EXPIRE key -1.

Regression tests

Test	Covers
`TestCacheFilter_MustRevalidate_ForcesCoalesceWhenStale`	RFC 9111 §5.2.2.2: `must-revalidate` forces origin fetch even within SWR window
`TestCacheFilter_UnsafeMethod_4xx_DoesNotInvalidate`	RFC 9111 §4.4: unsafe method + 4xx must not invalidate cached entry
`TestLRUStorage_OversizedEntry`	Entry exceeding shard capacity increments `lru_oversized` and is not stored
`TestCacheFilter_RevalDropped_WhenQueueFull`	`reval_dropped` counter fires when `revalJobs` channel is at capacity
`TestValkeyStorage_FallsBackToL1OnValkeyUnavailable` (strengthened)	L1 is physically written on Valkey fallback, not just inferred from `Get`

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com> feat: wire ValkeyStorage into NewCacheFilter (nil = in-memory only) Signed-off-by: Larry D Almeida <hello@larrydalmeida.com> feat: wire Valkey ring into cache() filter when swarm Valkey is configured Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

Replace the single valkey_fallback counter with three distinct counters for better observability: - valkey_miss — clean cache miss (key absent in Valkey) - valkey_get_fallback — Valkey error on Get; L1 consulted instead - valkey_set_fallback — Valkey error on Set; L1 written instead Inject metrics.Metrics into ValkeyStorage via NewValkeyStorage so tests can assert counter values without relying on the global metrics.Default singleton. Introduce a valkeyClient interface (Get/SetWithExpire/Expire) so unit tests can use an in-memory stub instead of a live Valkey/Docker connection. Two new tests — RecordsValkeyMiss and SplitFallbackCounters — exercise the counter logic with stubs. Signed-off-by: Larry D Almeida <hello@larrydalmeida.com> fix(cache): add compile-time interface guards; assert valkey_get_fallback fires on fallback Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

szuecs · 2026-05-29T19:56:42Z

+			// Shallow copy so NewValkeyRingClient can mutate opt.Addrs without
+			// racing against the ratelimit registry's copy of the same pointer.
+			cacheValkeyOpts := *valkeyOptions
+			cacheValkeyRing, err = skpnet.NewValkeyRingClient(&cacheValkeyOpts)


Why not passing the same client?
I know that you can not know this, but in the end we can refactor code that I want to move around a bit and some I want to delete later, so we can make this nice. :)

Please merge larry-dalmeida#2 , if you agree that this makes sense

…we can reuse the valkey ring client in the cache filter creation. Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>

…ients refactor: ratelimit registry creation to pass ring clients

szuecs · 2026-06-01T15:14:49Z

👍

larry-dalmeida · 2026-06-02T07:47:16Z

👍

a4180p · 2026-06-02T08:34:46Z

Write path: successful Valkey Set does not warm L1 (write-around). L1 is only populated on Valkey errors.
Read path: Valkey miss → nil (clean miss, no L1 consulted). Valkey error → L1 consulted as fallback.

That actually implies that l1 cache will be empty in case if Set was successfull and only Get fails.
Also, this model means that cache will be slower - there is no fast l1 layer, positive path always reads and writes from/to network storage.
Is that desired behaviour?
I would expect a small write/read-through L1 cache to serve as a fast local layer for the common case, but I may be missing some context around the intended use case.

szuecs · 2026-06-02T08:54:58Z

Write path: successful Valkey Set does not warm L1 (write-around). L1 is only populated on Valkey errors.
Read path: Valkey miss → nil (clean miss, no L1 consulted). Valkey error → L1 consulted as fallback.

That actually implies that l1 cache will be empty in case if Set was successful and only Get fails. Also, this model means that cache will be slower - there is no fast l1 layer, positive path always reads and writes from/to network storage. Is that desired behaviour? I would expect a small write/read-through L1 cache to serve as a fast local layer for the common case, but I may be missing some context around the intended use case.

Good point, I wondered about this and missed to ask the question.
Normally (for example CPU) you would check L1 cache and if you get a cache miss, you check L2 cache and if you have a hit you will save in L1 and respond.

larry-dalmeida · 2026-06-02T12:36:51Z

@a4180p @szuecs @MustafaSaber

Intended behavior: L1 is a degraded-mode fallback only, not hot cache layer path. Happy path always reads from and writes to L2. L1 is only populated when a valkey operation fails.

CPI L1/L2 cache hierarchy

works because 1) L2 is on-chip with 5-30ns latency. L1 saves nanoseconds, not milliseconds. 2) Cache coherence is handled at hardware level (MESI protocol) - when core writes to L1, hardware broadcasts invalidation to other core’s L1 automatically. This does not transfer here:

L1 is an LRU per Skipper pod - there can be n pods, each with own private LRU. No hardware coherence protocol

Valkey round trip is ~0.5-1ms over local cluster network - latency is higher than L1 cache but problem being solved is not sub-ms latency but rather cross-pod cache sharing.

Primary goal: cross-pod consistency

Without L2, every pod maintains an independent LRU. A cold miss on pod A fetches from origin, pod B's LRU is empty and fetches independently. Under load (e.g. a popular campaign going live), this produces an N-way thundering herd - one upstream fetch per pod, not one per cluster.

Valkey solves this because the consistent-hash ring maps a given cache key to the same shard regardless of which pod is making the request. A cold-miss coalesced by pod A writes to Valkey shard S. Pod B's next request for the same key hits shard S directly - no upstream fetch.

Why warming L1 on write undermines this goal

If L1 is warmed on a successful Valkey Set (write-through):

A pod serves future requests from its local LRU, bypassing Valkey.
Valkey Delete or TTL-based expiry does not reach L1 - the pod continues serving stale content until L1's own TTL expires.
There is no invalidation channel. Adding one (e.g. Valkey pub/sub) would require every pod to subscribe, handle reconnects, and accept a bounded staleness window on subscriber lag.

The tradeoff is: faster reads on the hot path vs. stale content served after invalidation, with non-trivial invalidation infrastructure.

Why Valkey-miss does not consult L1

A Valkey miss (nil, no error) means the key is genuinely absent - no pod has fetched and stored it yet, or the TTL expired. Consulting L1 on a clean miss would serve stale content beyond the intended TTL. The filter must go to origin.

L1 is only consulted when Valkey returns an error, because in that case we have no authoritative answer. Serving a potentially-stale L1 entry is preferable to a 5xx.

Considered alternatives

Write-through with TTL-bounded staleness

Warm L1 on every successful Set, accept that L1 entries can be served up to their TTL after a Valkey Delete. For content that is never explicitly invalidated (only TTL-expired), this is semantically equivalent to the current design - and would reduce Valkey read load.

Not chosen because explicit Delete is on the roadmap (cache invalidation on content publish events). Once that lands, write-through without an invalidation channel produces observable stale responses.

Write-through + Valkey pub/sub invalidation

Warm L1, subscribe each pod to a Valkey pub/sub channel for invalidation events. This matches the CPU L1/L2 mental model most closely.

Not chosen for this PR. The complexity cost is high (subscribe lifecycle, reconnect handling, message delivery guarantees, lag-bounded staleness), and the latency benefit does not yet justify it. This is the natural next step if Valkey read latency becomes a bottleneck.

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

szuecs · 2026-06-02T13:45:03Z

Considered alternatives

Write-through with TTL-bounded staleness

Warm L1 on every successful Set, accept that L1 entries can be served up to their TTL after a Valkey Delete. For content that is never explicitly invalidated (only TTL-expired), this is semantically equivalent to the current design - and would reduce Valkey read load.

Not chosen because explicit Delete is on the roadmap (cache invalidation on content publish events). Once that lands, write-through without an invalidation channel produces observable stale responses.

We could have L1 entry TTL of a fixed acceptable amount of time.
We have for example token result caching in tokeninfo filters set to 60s and it offloaded 50% of total CPU time and latency dropped from 10-20ms to <1ms.
Given our experience I would be in favour of having a fixed TTL of 60s for L1 cache in case there is an L2 cache setup configured.

larry-dalmeida · 2026-06-02T14:12:22Z

Given our experience I would be in favour of having a fixed TTL of 60s for L1 cache in case there is an L2 cache setup configured.

Good point. tokeninfo data is a strong precedent.

The write-around choice was conservative: L1 and Valkey TTLs are independent, and if a Valkey entry expires or gets evicted, an L1 entry with a longer TTL would silently serve stale content with no signal. The intent was to keep Valkey authoritative for the lifetime of every entry.

That said, your proposal sidesteps the problem cleanly.
A fixed short TTL (e.g. 60 s) on L1 should result in:

staleness is bounded and operator-visible - it's a deliberate configuration choice, not a silent race
the Delete path already calls l1.Delete unconditionally, so explicit invalidations would propagate correctly
no invalidation channel needed

Trade-off is: cache filter TTL must be meaningfully longer than the L1 TTL for the L1 layer to be useful.
For short-lived entries (< 60 s) the L1 TTL would need to be proportionally smaller, so it is likely to be configurable rather than hard-coded.

For now I will proceed with setting 60s as fixed TTL as a start.

a4180p · 2026-06-02T15:37:08Z

@larry-dalmeida

The intent was to keep Valkey authoritative for the lifetime of every entry.

After every successfull read from L1 you can call EXPIRE to valkey comand to:

prolongate TTL of the record if it exists
get a signal if it not exists and cleanup the record in L1

You can also consider using GETEX instead of GET if you wan TTLs to be updated on read ops.

szuecs · 2026-06-02T17:13:52Z

@larry-dalmeida

The intent was to keep Valkey authoritative for the lifetime of every entry.

After every successfull read from L1 you can call EXPIRE to valkey comand to:

This you can't really do because l2 cache is shared by all skipper instances
In the rate limit cases we set Expiry time on create entry. I think it's the way to go for expiration by time.

larry-dalmeida · 2026-06-05T07:52:06Z

@szuecs @a4180p @MustafaSaber thanks a lot for the feedback and patience 🧡 I will review it thoroughly and get back to you with concrete proposal. Currently I am busy with a business critical reliability related topic but I will resume this on Monday 8th June.

larry-dalmeida force-pushed the feat/cache-client-side-valkey-hash-ring branch from a47f35e to 730af75 Compare May 26, 2026 10:53

larry-dalmeida force-pushed the feat/cache-client-side-valkey-hash-ring branch from 7047126 to 573ebdf Compare May 27, 2026 04:40

larry-dalmeida changed the title ~~Add L2 Cache with Valkey Client Side Hash Ring~~ feat: Add L2 Cache with Valkey Client Side Hash Ring May 27, 2026

larry-dalmeida changed the title ~~feat: Add L2 Cache with Valkey Client Side Hash Ring~~ feat: L2 Cache with Valkey Client Side Hash Ring May 27, 2026

larry-dalmeida added 10 commits May 27, 2026 11:55

fix: address WP6 code review feedback

b739f9e

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

feat: add periodic lru_bytes gauge scrape every 10s

b76aaec

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

feat: log storage errors with per-site messages

c4ecb10

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

test: must-revalidate forces coalesce when stale

9d82e0f

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

test: unsafe method + 4xx does not invalidate cached entry

f957133

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

test: oversized LRU entry increments lru_oversized and is not stored

7cb9e18

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

test: reval_dropped and L1 fallback write verification

f48389d

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

feat: tag cache_status, cache_key, cache_ttl_remaining_ms on trace span

59c1ab3

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

cache: promote storage Set/Delete error logs from Debug to Warn

c73eb72

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

larry-dalmeida force-pushed the feat/cache-client-side-valkey-hash-ring branch from 55ba444 to c73eb72 Compare May 27, 2026 09:55

larry-dalmeida added 2 commits May 28, 2026 08:15

Promote log to warn

bca05f9

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

cache: injectable metrics, trace spans, lru_bytes scraper

fe2cf14

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

larry-dalmeida marked this pull request as ready for review May 28, 2026 07:56

szuecs added the architectural all changes in the hot path, big changes in the control plane, control flow changes in filters label May 29, 2026

szuecs reviewed May 29, 2026

View reviewed changes

szuecs and others added 2 commits May 29, 2026 23:23

refactor: ratelimit registry creation to pass ring clients such that …

8426f73

…we can reuse the valkey ring client in the cache filter creation. Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>

Merge pull request #2 from zalando/refactor/ratelimit-to-pass-ring-cl…

cca25b7

…ients refactor: ratelimit registry creation to pass ring clients

szuecs requested review from MustafaSaber and a4180p June 1, 2026 20:02

MustafaSaber reviewed Jun 2, 2026

View reviewed changes

Comment thread skipper.go Outdated

Fix registration of cache filter

1bbbd70

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

larry-dalmeida marked this pull request as draft June 2, 2026 14:21

Conversation

larry-dalmeida commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Related Issue

Description

Storage architecture

Valkey ring topology

Observability

Additional changes

Bug fixes

Regression tests

Uh oh!

szuecs May 29, 2026

Choose a reason for hiding this comment

Uh oh!

szuecs May 29, 2026

Choose a reason for hiding this comment

Uh oh!

szuecs commented Jun 1, 2026

Uh oh!

larry-dalmeida commented Jun 2, 2026

Uh oh!

a4180p commented Jun 2, 2026

Uh oh!

szuecs commented Jun 2, 2026

Uh oh!

Uh oh!

larry-dalmeida commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CPI L1/L2 cache hierarchy

Primary goal: cross-pod consistency

Why warming L1 on write undermines this goal

Why Valkey-miss does not consult L1

Considered alternatives

Write-through with TTL-bounded staleness

Write-through + Valkey pub/sub invalidation

Uh oh!

szuecs commented Jun 2, 2026

Considered alternatives

Write-through with TTL-bounded staleness

Uh oh!

larry-dalmeida commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

a4180p commented Jun 2, 2026

Uh oh!

szuecs commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

larry-dalmeida commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

larry-dalmeida commented May 26, 2026 •

edited

Loading

larry-dalmeida commented Jun 2, 2026 •

edited

Loading

larry-dalmeida commented Jun 2, 2026 •

edited

Loading

szuecs commented Jun 2, 2026 •

edited

Loading