Skip to content

feat: L2 Cache with Valkey Client Side Hash Ring#4033

Draft
larry-dalmeida wants to merge 16 commits into
zalando:masterfrom
larry-dalmeida:feat/cache-client-side-valkey-hash-ring
Draft

feat: L2 Cache with Valkey Client Side Hash Ring#4033
larry-dalmeida wants to merge 16 commits into
zalando:masterfrom
larry-dalmeida:feat/cache-client-side-valkey-hash-ring

Conversation

@larry-dalmeida

@larry-dalmeida larry-dalmeida commented May 26, 2026

Copy link
Copy Markdown
Contributor

Related Issue

#3991

Description

Extends the cache() filter with an optional Valkey-backed L2 cache using a client-side consistent hash ring.

When --swarm-valkey-urls is configured, responses are stored in Valkey (L2) with automatic fallback to the in-process LRU (L1) on any Valkey error.

Storage architecture

image

Write path: successful Valkey Set does not warm L1 (write-around). L1 is only populated on Valkey errors.

Read path: Valkey miss → nil (clean miss, no L1 consulted). Valkey error → L1 consulted as fallback.

Valkey ring topology

image

All pods share the same ring, so a cold-miss coalesced by pod A lands in the same shard that pod B would read from — no thundering herd across pods.

Observability

Three counters track Valkey operation outcomes:

Counter Meaning
valkey_miss Key absent in Valkey (clean miss)
valkey_get_fallback Valkey error on Get; L1 consulted
valkey_set_fallback Valkey error on Set; L1 written

lru_bytes gauge is now updated by a background scraper every 10 s instead of only on eviction, so it stays accurate when capacity is not exceeded.

Storage Set and Delete errors are now logged at Warn instead of being silently discarded.

Additional changes

  • Trace spans: cache_status, cache_key, and cache_ttl_remaining_ms are tagged on the active OpenTracing span for HIT, MISS, and STALE paths.
  • Injectable metrics: lru_eviction and lru_oversized counters are now injectable via constructor parameters, removing the hidden metrics.Default dependency and enabling test-scoped counter assertions without global state mutation. See filters/cache/observability-gaps.md for why full per-route namespacing requires a Skipper core change.
  • NewCacheFilter signature change: a fourth parameter valkeyRing *skpnet.ValkeyRingClient is added. Pass nil to preserve the existing LRU-only behaviour. Call sites embedding Skipper as a library must be updated.

Bug fixes

  • Stale-if-error via coalesce: the SIE block in Response() was dead code — coalesce() always calls ctx.Serve(), causing Response() to return early. SIE logic moved inside coalesce() with the pre-fetch snapshot captured before f.fetch runs.
  • only-if-cached + SWR: entries in the stale-while-revalidate window are still being served to other clients and must not return 504 to only-if-cached requests. Fixed via IsUsable() replacing IsStale().
  • Response() nil-safe key assertion: bare type assertion on stateBagKey panicked on route misconfiguration; replaced with comma-ok form.
  • Expire(-1) truncation: time.Duration(-1) (-1 ns) truncated to EXPIRE key 0 in Valkey. Changed to -1*time.Second which correctly sends EXPIRE key -1.

Regression tests

Test Covers
TestCacheFilter_MustRevalidate_ForcesCoalesceWhenStale RFC 9111 §5.2.2.2: must-revalidate forces origin fetch even within SWR window
TestCacheFilter_UnsafeMethod_4xx_DoesNotInvalidate RFC 9111 §4.4: unsafe method + 4xx must not invalidate cached entry
TestLRUStorage_OversizedEntry Entry exceeding shard capacity increments lru_oversized and is not stored
TestCacheFilter_RevalDropped_WhenQueueFull reval_dropped counter fires when revalJobs channel is at capacity
TestValkeyStorage_FallsBackToL1OnValkeyUnavailable (strengthened) L1 is physically written on Valkey fallback, not just inferred from Get

@larry-dalmeida larry-dalmeida force-pushed the feat/cache-client-side-valkey-hash-ring branch from a47f35e to 730af75 Compare May 26, 2026 10:53
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

feat: wire ValkeyStorage into NewCacheFilter (nil = in-memory only)

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

feat: wire Valkey ring into cache() filter when swarm Valkey is configured

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
@larry-dalmeida larry-dalmeida force-pushed the feat/cache-client-side-valkey-hash-ring branch from 7047126 to 573ebdf Compare May 27, 2026 04:40
@larry-dalmeida larry-dalmeida changed the title Add L2 Cache with Valkey Client Side Hash Ring feat: Add L2 Cache with Valkey Client Side Hash Ring May 27, 2026
@larry-dalmeida larry-dalmeida changed the title feat: Add L2 Cache with Valkey Client Side Hash Ring feat: L2 Cache with Valkey Client Side Hash Ring May 27, 2026
Replace the single valkey_fallback counter with three distinct counters
for better observability:
  - valkey_miss         — clean cache miss (key absent in Valkey)
  - valkey_get_fallback — Valkey error on Get; L1 consulted instead
  - valkey_set_fallback — Valkey error on Set; L1 written instead

Inject metrics.Metrics into ValkeyStorage via NewValkeyStorage so
tests can assert counter values without relying on the global
metrics.Default singleton.

Introduce a valkeyClient interface (Get/SetWithExpire/Expire) so unit
tests can use an in-memory stub instead of a live Valkey/Docker
connection. Two new tests — RecordsValkeyMiss and
SplitFallbackCounters — exercise the counter logic with stubs.

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>

fix(cache): add compile-time interface guards; assert valkey_get_fallback fires on fallback

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
@larry-dalmeida larry-dalmeida force-pushed the feat/cache-client-side-valkey-hash-ring branch from 55ba444 to c73eb72 Compare May 27, 2026 09:55
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
@larry-dalmeida larry-dalmeida marked this pull request as ready for review May 28, 2026 07:56
@szuecs szuecs added the architectural all changes in the hot path, big changes in the control plane, control flow changes in filters label May 29, 2026
Comment thread skipper.go Outdated
// Shallow copy so NewValkeyRingClient can mutate opt.Addrs without
// racing against the ratelimit registry's copy of the same pointer.
cacheValkeyOpts := *valkeyOptions
cacheValkeyRing, err = skpnet.NewValkeyRingClient(&cacheValkeyOpts)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not passing the same client?
I know that you can not know this, but in the end we can refactor code that I want to move around a bit and some I want to delete later, so we can make this nice. :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge larry-dalmeida#2 , if you agree that this makes sense

szuecs and others added 2 commits May 29, 2026 23:23
…we can reuse the valkey ring client in the cache filter creation.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
…ients

refactor: ratelimit registry creation to pass ring clients
@szuecs

szuecs commented Jun 1, 2026

Copy link
Copy Markdown
Member

👍

@szuecs szuecs requested review from MustafaSaber and a4180p June 1, 2026 20:02
@larry-dalmeida

Copy link
Copy Markdown
Contributor Author

👍

@a4180p

a4180p commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Write path: successful Valkey Set does not warm L1 (write-around). L1 is only populated on Valkey errors.
Read path: Valkey miss → nil (clean miss, no L1 consulted). Valkey error → L1 consulted as fallback.

That actually implies that l1 cache will be empty in case if Set was successfull and only Get fails.
Also, this model means that cache will be slower - there is no fast l1 layer, positive path always reads and writes from/to network storage.
Is that desired behaviour?
I would expect a small write/read-through L1 cache to serve as a fast local layer for the common case, but I may be missing some context around the intended use case.

@szuecs

szuecs commented Jun 2, 2026

Copy link
Copy Markdown
Member

Write path: successful Valkey Set does not warm L1 (write-around). L1 is only populated on Valkey errors.
Read path: Valkey miss → nil (clean miss, no L1 consulted). Valkey error → L1 consulted as fallback.

That actually implies that l1 cache will be empty in case if Set was successful and only Get fails. Also, this model means that cache will be slower - there is no fast l1 layer, positive path always reads and writes from/to network storage. Is that desired behaviour? I would expect a small write/read-through L1 cache to serve as a fast local layer for the common case, but I may be missing some context around the intended use case.

Good point, I wondered about this and missed to ask the question.
Normally (for example CPU) you would check L1 cache and if you get a cache miss, you check L2 cache and if you have a hit you will save in L1 and respond.

Comment thread skipper.go Outdated
@larry-dalmeida

larry-dalmeida commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@a4180p @szuecs @MustafaSaber

Intended behavior: L1 is a degraded-mode fallback only, not hot cache layer path. Happy path always reads from and writes to L2. L1 is only populated when a valkey operation fails.

CPI L1/L2 cache hierarchy

works because 1) L2 is on-chip with 5-30ns latency. L1 saves nanoseconds, not milliseconds. 2) Cache coherence is handled at hardware level (MESI protocol) - when core writes to L1, hardware broadcasts invalidation to other core’s L1 automatically. This does not transfer here:

L1 is an LRU per Skipper pod - there can be n pods, each with own private LRU. No hardware coherence protocol

Valkey round trip is ~0.5-1ms over local cluster network - latency is higher than L1 cache but problem being solved is not sub-ms latency but rather cross-pod cache sharing.

Primary goal: cross-pod consistency

Without L2, every pod maintains an independent LRU. A cold miss on pod A fetches from origin, pod B's LRU is empty and fetches independently. Under load (e.g. a popular campaign going live), this produces an N-way thundering herd - one upstream fetch per pod, not one per cluster.

Valkey solves this because the consistent-hash ring maps a given cache key to the same shard regardless of which pod is making the request. A cold-miss coalesced by pod A writes to Valkey shard S. Pod B's next request for the same key hits shard S directly - no upstream fetch.

Why warming L1 on write undermines this goal

If L1 is warmed on a successful Valkey Set (write-through):

  1. A pod serves future requests from its local LRU, bypassing Valkey.
  2. Valkey Delete or TTL-based expiry does not reach L1 - the pod continues serving stale content until L1's own TTL expires.
  3. There is no invalidation channel. Adding one (e.g. Valkey pub/sub) would require every pod to subscribe, handle reconnects, and accept a bounded staleness window on subscriber lag.

The tradeoff is: faster reads on the hot path vs. stale content served after invalidation, with non-trivial invalidation infrastructure.

Why Valkey-miss does not consult L1

A Valkey miss (nil, no error) means the key is genuinely absent - no pod has fetched and stored it yet, or the TTL expired. Consulting L1 on a clean miss would serve stale content beyond the intended TTL. The filter must go to origin.

L1 is only consulted when Valkey returns an error, because in that case we have no authoritative answer. Serving a potentially-stale L1 entry is preferable to a 5xx.

Considered alternatives

Write-through with TTL-bounded staleness

Warm L1 on every successful Set, accept that L1 entries can be served up to their TTL after a Valkey Delete. For content that is never explicitly invalidated (only TTL-expired), this is semantically equivalent to the current design - and would reduce Valkey read load.

Not chosen because explicit Delete is on the roadmap (cache invalidation on content publish events). Once that lands, write-through without an invalidation channel produces observable stale responses.

Write-through + Valkey pub/sub invalidation

Warm L1, subscribe each pod to a Valkey pub/sub channel for invalidation events. This matches the CPU L1/L2 mental model most closely.

Not chosen for this PR. The complexity cost is high (subscribe lifecycle, reconnect handling, message delivery guarantees, lag-bounded staleness), and the latency benefit does not yet justify it. This is the natural next step if Valkey read latency becomes a bottleneck.

Signed-off-by: Larry D Almeida <hello@larrydalmeida.com>
@szuecs

szuecs commented Jun 2, 2026

Copy link
Copy Markdown
Member

Considered alternatives

Write-through with TTL-bounded staleness

Warm L1 on every successful Set, accept that L1 entries can be served up to their TTL after a Valkey Delete. For content that is never explicitly invalidated (only TTL-expired), this is semantically equivalent to the current design - and would reduce Valkey read load.

Not chosen because explicit Delete is on the roadmap (cache invalidation on content publish events). Once that lands, write-through without an invalidation channel produces observable stale responses.

We could have L1 entry TTL of a fixed acceptable amount of time.
We have for example token result caching in tokeninfo filters set to 60s and it offloaded 50% of total CPU time and latency dropped from 10-20ms to <1ms.
Given our experience I would be in favour of having a fixed TTL of 60s for L1 cache in case there is an L2 cache setup configured.

@larry-dalmeida

larry-dalmeida commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Given our experience I would be in favour of having a fixed TTL of 60s for L1 cache in case there is an L2 cache setup configured.

Good point. tokeninfo data is a strong precedent.

The write-around choice was conservative: L1 and Valkey TTLs are independent, and if a Valkey entry expires or gets evicted, an L1 entry with a longer TTL would silently serve stale content with no signal. The intent was to keep Valkey authoritative for the lifetime of every entry.

That said, your proposal sidesteps the problem cleanly.
A fixed short TTL (e.g. 60 s) on L1 should result in:

  • staleness is bounded and operator-visible - it's a deliberate configuration choice, not a silent race
  • the Delete path already calls l1.Delete unconditionally, so explicit invalidations would propagate correctly
  • no invalidation channel needed

Trade-off is: cache filter TTL must be meaningfully longer than the L1 TTL for the L1 layer to be useful.
For short-lived entries (< 60 s) the L1 TTL would need to be proportionally smaller, so it is likely to be configurable rather than hard-coded.

For now I will proceed with setting 60s as fixed TTL as a start.

@larry-dalmeida larry-dalmeida marked this pull request as draft June 2, 2026 14:21
@a4180p

a4180p commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

@larry-dalmeida

The intent was to keep Valkey authoritative for the lifetime of every entry.

After every successfull read from L1 you can call EXPIRE to valkey comand to:

  • prolongate TTL of the record if it exists
  • get a signal if it not exists and cleanup the record in L1

You can also consider using GETEX instead of GET if you wan TTLs to be updated on read ops.

@szuecs

szuecs commented Jun 2, 2026

Copy link
Copy Markdown
Member

@larry-dalmeida

The intent was to keep Valkey authoritative for the lifetime of every entry.

After every successfull read from L1 you can call EXPIRE to valkey comand to:

This you can't really do because l2 cache is shared by all skipper instances
In the rate limit cases we set Expiry time on create entry. I think it's the way to go for expiration by time.

@larry-dalmeida

Copy link
Copy Markdown
Contributor Author

@szuecs @a4180p @MustafaSaber thanks a lot for the feedback and patience 🧡 I will review it thoroughly and get back to you with concrete proposal. Currently I am busy with a business critical reliability related topic but I will resume this on Monday 8th June.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

architectural all changes in the hot path, big changes in the control plane, control flow changes in filters

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants