feat: Valkey cluster support with room key ensure RPC#199
Conversation
📝 WalkthroughWalkthroughThis PR migrates the entire system from single-node Valkey to per-site independent Valkey cluster deployments. The change includes hash-tagged room key naming for cluster slot consistency, cluster-backed storage and connection adapters, a new room-key ensure NATS RPC handler, and coordinated configuration/docker-compose/integration test updates across all services. ChangesValkey Cluster Migration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Trivy (0.69.3)Trivy execution failed: 2026-05-19T11:23:50Z FATAL Fatal error run error: fs scan error: scan error: scan failed: failed analysis: post analysis error: post analysis error: kubernetes scan error: fs filter error: fs filter error: walk error open gitleaks-report-27.json: no such file or directory: open gitleaks-report-27.json: no such file or directory Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (3)
pkg/model/model_test.go (1)
822-855: ⚡ Quick winUse
roundTripfor these new model JSON round-trip tests.These tests duplicate helper logic already centralized in
roundTrip, which increases drift risk inpkg/model/model_test.go.♻️ Proposed refactor
func TestRoomKeyEnsureRequestJSON(t *testing.T) { src := model.RoomKeyEnsureRequest{RoomID: "room-abc"} - data, err := json.Marshal(src) - if err != nil { - t.Fatalf("marshal: %v", err) - } - var dst model.RoomKeyEnsureRequest - if err := json.Unmarshal(data, &dst); err != nil { - t.Fatalf("unmarshal: %v", err) - } - if !reflect.DeepEqual(src, dst) { - t.Errorf("round-trip mismatch:\n got %+v\n want %+v", dst, src) - } + roundTrip(t, &src, &model.RoomKeyEnsureRequest{}) } func TestRoomKeyEnsureResponseJSON(t *testing.T) { src := model.RoomKeyEnsureResponse{ RoomID: "room-xyz", Version: 3, PublicKey: []byte{0x04, 0xAB, 0xCD}, PrivateKey: []byte{0x7F, 0x01}, } - data, err := json.Marshal(src) - if err != nil { - t.Fatalf("marshal: %v", err) - } - var dst model.RoomKeyEnsureResponse - if err := json.Unmarshal(data, &dst); err != nil { - t.Fatalf("unmarshal: %v", err) - } - if !reflect.DeepEqual(src, dst) { - t.Errorf("round-trip mismatch:\n got %+v\n want %+v", dst, src) - } + roundTrip(t, &src, &model.RoomKeyEnsureResponse{}) }As per coding guidelines:
pkg/model/model_test.gomust verify model marshal/unmarshal via the genericroundTriphelper.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/model/model_test.go` around lines 822 - 855, Replace the manual marshal/unmarshal checks in TestRoomKeyEnsureRequestJSON and TestRoomKeyEnsureResponseJSON with calls to the existing roundTrip test helper: for each test, construct the src value (RoomKeyEnsureRequest and RoomKeyEnsureResponse respectively) and pass it to roundTrip(t, src) so the centralized helper performs JSON marshal/unmarshal and equality checks; update or remove the duplicated marshal/unmarshal/assert logic in those test functions accordingly.pkg/roomkeystore/integration_test.go (2)
341-348: ⚡ Quick winDon't swallow
CLUSTER INFOprobe failures.This loop drops both the
Execexit status and theio.ReadAllerror, so a bad probe turns into a genericEventuallytimeout with no clue what actually failed. Treat either condition as probe failure and keep the last output/error for the assertion message.As per coding guidelines, "Never ignore errors silently — comment if intentionally discarded."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/roomkeystore/integration_test.go` around lines 341 - 348, The probe loop in the require.Eventually closure swallows container.Exec and io.ReadAll errors which yields an opaque timeout; modify the closure to record the last execution error and last read error/output (e.g., lastExecErr, lastReadErr, lastOut) declared outside the closure, return false on any execErr or read error, and after Eventually completes assert with those captured values so the failure message includes the real exec exit/error and the probe stdout (referencing require.Eventually, container.Exec, io.ReadAll, and strings.Contains in the change).
306-371: 🏗️ Heavy liftExercise
NewValkeyClusterStorein this harness too.
setupValkeyClusterwiresvalkeyStoretogether directly, so these integration tests never touch the new public constructor. That leaves the constructor-specific behavior added in this PR—client setup, ping validation, and closer wiring—outside the cluster test path. Please add one constructor-level integration case, or refactor the helper so tests and production go through the same entrypoint.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/roomkeystore/integration_test.go` around lines 306 - 371, The cluster test helper setupValkeyCluster currently constructs valkeyStore directly, bypassing NewValkeyClusterStore and leaving constructor-specific behavior untested; change the helper or add an integration test that calls NewValkeyClusterStore (or refactor setupValkeyCluster to call NewValkeyClusterStore internally) so the returned store and closer come from the public constructor, ensuring client setup, Ping validation and closer wiring exercised for cluster mode (update references to valkeyStore, clusterAdapter, and the returned closer/ping assertions accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docker-local/compose.deps.yaml`:
- Around line 158-161: The entrypoint currently runs an unconditional
"valkey-cli CLUSTER ADDSLOTSRANGE 0 16383" which fails on restarts and causes
the container to exit; modify the startup sequence (the entrypoint/sh -c
command) to make slot assignment idempotent by first checking whether slots are
already assigned (e.g., call "valkey-cli CLUSTER SLOTS" or a similar inspection
command) and only run "valkey-cli CLUSTER ADDSLOTSRANGE 0 16383" if no slots are
present, or else allow failures to be ignored (e.g., conditional execution or
"|| true") so the container continues to become healthy; update the shell
pipeline surrounding the valkey-server launch, the until loop, and the CLUSTER
ADDSLOTSRANGE invocation to incorporate that conditional check.
In `@docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md`:
- Around line 29-40: The markdown has unlabeled fenced code blocks (e.g., the
blocks showing "site: ftest ...", "room:abc123:key", and the "external
connector" example) which trigger MD040; fix by adding a language identifier
(use "text") to each opening triple-backtick fence for those blocks (and the
other occurrences noted around lines 52-55 and 314-325) so they become ```text
instead of ``` ensuring all fenced code blocks are labeled.
- Around line 341-345: The example struct RoomKeyEnsureRequest is missing bson
struct tags; update the RoomKeyEnsureRequest model to include both json and bson
tags for each field (e.g., RoomID should have `json:"roomId"` and
`bson:"roomId"`), ensuring the spec example matches repository coding guidelines
and prevent implementers from copying a non-compliant shape.
In `@pkg/roomkeystore/adapter.go`:
- Around line 199-201: When the cluster client `c` fails its connectivity check
(`c.Ping(ctx).Err()`), close the created client before returning the error to
avoid resource leaks; modify the error path in the same block so that you call
the client's close method (e.g., `c.Close()` or the appropriate close/shutdown
method on the cluster client) and handle/ignore its error, then return the
formatted error message as before.
- Around line 175-176: Replace fragile string-matching of the Lua error in both
redisAdapter.rotatePipeline and clusterAdapter.rotatePipeline with a
deterministic sentinel error and check using errors.Is; define a package-level
sentinel (e.g., var ErrNoCurrentKey = errors.New("roomkeystore: no current
key")) or wrap the Redis/Lua error into that sentinel when receiving the redis
reply, then change the conditional from strings.Contains(err.Error(), "no
current key") to errors.Is(err, ErrNoCurrentKey) in both rotatePipeline
implementations so callers can rely on typed error comparison.
In `@room-service/handler.go`:
- Around line 1195-1215: The current check-then-set using h.keyStore.Get,
roomkeystore.GenerateKeyPair, and h.keyStore.Set is racy: two requests can both
see no key, generate different pairs, and both call Set. Make the operation
atomic by moving key-creation into an atomic create-if-absent on the keystore
(e.g. implement and call a CreateIfAbsent / GetOrCreate / SetIfAbsent method on
h.keyStore that either returns the existing entry or stores and returns the
newly generated pair), and ensure you only call roomkeystore.GenerateKeyPair
inside the create callback so you generate a key only when the keystore actually
performs the insert; if the keystore API cannot be changed, implement a retry:
attempt Set with a non-overwrite flag and if it fails due to existing entry,
return the existing entry from Get.
In `@search-service/main.go`:
- Around line 37-38: The Password field in the config struct (symbol Password;
nearby Addrs) currently defaults to an empty string which weakens startup
guarantees; change its env tag to mark the secret as required and use the
VALKEY_PASSWORD variable name (e.g. env:"VALKEY_PASSWORD,required") instead of
envDefault:"", so the application fails fast when the secret is missing and you
no longer allow an empty password at startup.
---
Nitpick comments:
In `@pkg/model/model_test.go`:
- Around line 822-855: Replace the manual marshal/unmarshal checks in
TestRoomKeyEnsureRequestJSON and TestRoomKeyEnsureResponseJSON with calls to the
existing roundTrip test helper: for each test, construct the src value
(RoomKeyEnsureRequest and RoomKeyEnsureResponse respectively) and pass it to
roundTrip(t, src) so the centralized helper performs JSON marshal/unmarshal and
equality checks; update or remove the duplicated marshal/unmarshal/assert logic
in those test functions accordingly.
In `@pkg/roomkeystore/integration_test.go`:
- Around line 341-348: The probe loop in the require.Eventually closure swallows
container.Exec and io.ReadAll errors which yields an opaque timeout; modify the
closure to record the last execution error and last read error/output (e.g.,
lastExecErr, lastReadErr, lastOut) declared outside the closure, return false on
any execErr or read error, and after Eventually completes assert with those
captured values so the failure message includes the real exec exit/error and the
probe stdout (referencing require.Eventually, container.Exec, io.ReadAll, and
strings.Contains in the change).
- Around line 306-371: The cluster test helper setupValkeyCluster currently
constructs valkeyStore directly, bypassing NewValkeyClusterStore and leaving
constructor-specific behavior untested; change the helper or add an integration
test that calls NewValkeyClusterStore (or refactor setupValkeyCluster to call
NewValkeyClusterStore internally) so the returned store and closer come from the
public constructor, ensuring client setup, Ping validation and closer wiring
exercised for cluster mode (update references to valkeyStore, clusterAdapter,
and the returned closer/ping assertions accordingly).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0eb6287e-8a76-4914-8886-4c39720c9539
📒 Files selected for processing (27)
broadcast-worker/deploy/docker-compose.ymlbroadcast-worker/main.godocker-local/compose.deps.yamldocs/superpowers/plans/2026-05-19-valkey-cluster-support.mddocs/superpowers/specs/2026-05-19-valkey-cluster-support-design.mdhistory-service/cmd/main.gohistory-service/deploy/docker-compose.ymlhistory-service/internal/config/config.gopkg/model/event.gopkg/model/model_test.gopkg/roomkeystore/adapter.gopkg/roomkeystore/integration_test.gopkg/roomkeystore/roomkeystore.gopkg/subject/subject.gopkg/testutil/testimages/testimages.gopkg/valkeyutil/integration_test.gopkg/valkeyutil/valkey.gopkg/valkeyutil/valkey_test.goroom-service/deploy/docker-compose.ymlroom-service/handler.goroom-service/handler_test.goroom-service/main.goroom-worker/deploy/docker-compose.ymlroom-worker/main.goroom-worker/mock_publisher_test.gosearch-service/deploy/docker-compose.ymlsearch-service/main.go
| if err != nil && strings.Contains(err.Error(), "no current key") { | ||
| return 0, ErrNoCurrentKey |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify all string-based matching points tied to "no current key" so both adapters can be updated consistently.
rg -n -C2 'redis\.error_reply\(.+no current key|strings\.Contains\(err\.Error\(\),\s*"no current key"\)' pkg/roomkeystore/adapter.goRepository: hmchangw/chat
Length of output: 635
🏁 Script executed:
# Read the relevant sections of the adapter to understand the error context
head -180 pkg/roomkeystore/adapter.go | tail -70Repository: hmchangw/chat
Length of output: 2521
🏁 Script executed:
# Also check if there are error constant definitions in this file
rg -n 'ErrNoCurrentKey|error_reply' pkg/roomkeystore/adapter.go | head -20Repository: hmchangw/chat
Length of output: 170
🏁 Script executed:
# Check the rotateScript definition to understand the error wrapping
rg -n 'rotateScript|redis.Script' pkg/roomkeystore/adapter.go -A 5Repository: hmchangw/chat
Length of output: 1035
🏁 Script executed:
# Find where ErrNoCurrentKey is defined
rg -n 'ErrNoCurrentKey' pkg/roomkeystore/ -B 2 -A 2Repository: hmchangw/chat
Length of output: 3329
🏁 Script executed:
# Check the imports and error handling patterns in the file
head -30 pkg/roomkeystore/adapter.goRepository: hmchangw/chat
Length of output: 856
🏁 Script executed:
# Look for any error.Is checks or error type checking in the codebase
rg -n 'errors\.Is|errors\.As' pkg/roomkeystore/ | head -10Repository: hmchangw/chat
Length of output: 313
Use sentinel error or custom error type instead of string matching in both adapter rotate methods.
Both redisAdapter.rotatePipeline() (line 63) and clusterAdapter.rotatePipeline() (line 175) match the Redis error by message text:
if err != nil && strings.Contains(err.Error(), "no current key") {
return 0, ErrNoCurrentKey
}
This violates the coding guideline: "Never compare errors by string — use errors.Is() and errors.As() for error type checking." The Lua script's redis.error_reply('no current key') gets wrapped as an untyped error, making the string match brittle—any change to the script message breaks sentinel mapping.
Refactor to define a deterministic contract (e.g., return a specific error type or wrapper from Lua, or create a custom sentinel within the adapter package) that can be checked without string parsing. Update both adapters consistently.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/roomkeystore/adapter.go` around lines 175 - 176, Replace fragile
string-matching of the Lua error in both redisAdapter.rotatePipeline and
clusterAdapter.rotatePipeline with a deterministic sentinel error and check
using errors.Is; define a package-level sentinel (e.g., var ErrNoCurrentKey =
errors.New("roomkeystore: no current key")) or wrap the Redis/Lua error into
that sentinel when receiving the redis reply, then change the conditional from
strings.Contains(err.Error(), "no current key") to errors.Is(err,
ErrNoCurrentKey) in both rotatePipeline implementations so callers can rely on
typed error comparison.
| Addrs []string `env:"ADDRS,required" envSeparator:","` | ||
| Password string `env:"PASSWORD" envDefault:""` |
There was a problem hiding this comment.
Require VALKEY_PASSWORD instead of defaulting it to empty.
Line 38 defaults a secret to "", which weakens startup config guarantees in main.go.
Suggested fix
type ValkeyConfig struct {
Addrs []string `env:"ADDRS,required" envSeparator:","`
- Password string `env:"PASSWORD" envDefault:""`
+ Password string `env:"PASSWORD,required"`
}As per coding guidelines: "never default secrets or connection strings — mark them required."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Addrs []string `env:"ADDRS,required" envSeparator:","` | |
| Password string `env:"PASSWORD" envDefault:""` | |
| Addrs []string `env:"ADDRS,required" envSeparator:","` | |
| Password string `env:"PASSWORD,required"` |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@search-service/main.go` around lines 37 - 38, The Password field in the
config struct (symbol Password; nearby Addrs) currently defaults to an empty
string which weakens startup guarantees; change its env tag to mark the secret
as required and use the VALKEY_PASSWORD variable name (e.g.
env:"VALKEY_PASSWORD,required") instead of envDefault:"", so the application
fails fast when the secret is missing and you no longer allow an empty password
at startup.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
room-service/handler.go (1)
1196-1218:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftMake the room-key ensure path atomic.
This is still a racy
Get→ generate →Setsequence. Two concurrent requests can both see "missing", generate different pairs, and both write, so the second request can overwrite the first key immediately. Please move this to a store-level create-if-absent/ensure primitive and generate the key only inside that atomic path.Suggested direction
- existing, err := h.keyStore.Get(ctx, req.RoomID) - if err != nil { - return nil, fmt.Errorf("ensure room key: get: %w", err) - } - if existing != nil { - return json.Marshal(model.RoomKeyEnsureResponse{ - RoomID: req.RoomID, - Version: existing.Version, - }) - } - - newPair, err := roomkeystore.GenerateKeyPair() - if err != nil { - return nil, fmt.Errorf("ensure room key: generate key pair: %w", err) - } - ver, err := h.keyStore.Set(ctx, req.RoomID, newPair) - if err != nil { - return nil, fmt.Errorf("ensure room key: set: %w", err) - } + ensured, err := h.keyStore.Ensure(ctx, req.RoomID) + if err != nil { + return nil, fmt.Errorf("ensure room key: ensure: %w", err) + } return json.Marshal(model.RoomKeyEnsureResponse{ RoomID: req.RoomID, - Version: ver, + Version: ensured.Version, })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@room-service/handler.go` around lines 1196 - 1218, The current Get → GenerateKeyPair → Set flow is racy; change the keystore API to provide an atomic ensure/create-if-absent primitive (e.g., KeyStore.Ensure/CreateIfAbsent/EnsureKey) that accepts the room ID and a creator callback so the store will call the callback only when it needs to create and will return the existing version if present. Replace the h.keyStore.Get + roomkeystore.GenerateKeyPair + h.keyStore.Set sequence with a single call to that new method (pass a closure that calls roomkeystore.GenerateKeyPair) and return model.RoomKeyEnsureResponse using the version returned by the atomic ensure call; update implementations of KeyStore to perform the create-if-missing atomically.pkg/roomkeystore/adapter.go (1)
63-65: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy liftReplace the Lua error text match with a stable sentinel path.
rotatePipelinestill depends onstrings.Contains(err.Error(), "no current key")to translate the script failure intoErrNoCurrentKey. That makes caller behavior depend on go-redis's formatted error text instead of a deterministic contract. Please switch this to a sentinel/custom error flow that can be checked witherrors.Is.As per coding guidelines, "Never compare errors by string — use
errors.Is()anderrors.As()for error type checking."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/roomkeystore/adapter.go` around lines 63 - 65, The code currently matches the Lua error text via strings.Contains on rotateScript.Run's error; change the Lua script to return a stable sentinel token (e.g., return redis.error_reply("NO_CURRENT_KEY")) and update the Go side (where rotateScript.Run is called in the rotatePipeline/adapter code) to detect that exact token and wrap it into the package sentinel ErrNoCurrentKey (e.g., if err != nil && strings.Contains(err.Error(), "NO_CURRENT_KEY") { err = fmt.Errorf("%w: redis script returned NO_CURRENT_KEY", ErrNoCurrentKey); } return 0, ErrNoCurrentKey) so callers can use errors.Is(err, ErrNoCurrentKey); reference rotateScript.Run and ErrNoCurrentKey when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@pkg/roomkeystore/adapter.go`:
- Around line 63-65: The code currently matches the Lua error text via
strings.Contains on rotateScript.Run's error; change the Lua script to return a
stable sentinel token (e.g., return redis.error_reply("NO_CURRENT_KEY")) and
update the Go side (where rotateScript.Run is called in the
rotatePipeline/adapter code) to detect that exact token and wrap it into the
package sentinel ErrNoCurrentKey (e.g., if err != nil &&
strings.Contains(err.Error(), "NO_CURRENT_KEY") { err = fmt.Errorf("%w: redis
script returned NO_CURRENT_KEY", ErrNoCurrentKey); } return 0, ErrNoCurrentKey)
so callers can use errors.Is(err, ErrNoCurrentKey); reference rotateScript.Run
and ErrNoCurrentKey when making these changes.
In `@room-service/handler.go`:
- Around line 1196-1218: The current Get → GenerateKeyPair → Set flow is racy;
change the keystore API to provide an atomic ensure/create-if-absent primitive
(e.g., KeyStore.Ensure/CreateIfAbsent/EnsureKey) that accepts the room ID and a
creator callback so the store will call the callback only when it needs to
create and will return the existing version if present. Replace the
h.keyStore.Get + roomkeystore.GenerateKeyPair + h.keyStore.Set sequence with a
single call to that new method (pass a closure that calls
roomkeystore.GenerateKeyPair) and return model.RoomKeyEnsureResponse using the
version returned by the atomic ensure call; update implementations of KeyStore
to perform the create-if-missing atomically.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ee6f4b1d-e38d-4abe-ac01-fefefaa9baae
📒 Files selected for processing (8)
history-service/internal/config/config.gopkg/model/event.gopkg/model/model_test.gopkg/roomkeystore/adapter.gopkg/subject/subject.gopkg/valkeyutil/valkey.goroom-service/handler.goroom-service/handler_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/subject/subject.go
- pkg/model/model_test.go
- pkg/model/event.go
- room-service/handler_test.go
- history-service/internal/config/config.go
- pkg/valkeyutil/valkey.go
Covers hash-tagged key names, ClusterConfig/NewValkeyClusterStore, valkeyutil.ConnectCluster, per-service config migration from VALKEY_ADDR to VALKEY_ADDRS, and per-site docker-compose changes. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
- Cluster mode fully replaces standalone; VALKEY_ADDR is retired - Remove draft "Wait —" note from valkeyutil section; replace with clean clusterRedisClient design - Backward compat section rewritten to state VALKEY_ADDRS is the only connection path going forward https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Documents the new NatsHandleEnsureRoomKey handler in room-service that external connectors can call to get or generate a room key without touching Valkey directly. Covers subject, model type, idempotency contract, PublicKey inclusion rationale, no-fan-out design decision, and TDD test scenarios. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
11 tasks covering: image constant, hash-tag key names, cluster adapter, cluster integration tests, valkeyutil ConnectCluster, service config migration (VALKEY_ADDR → VALKEY_ADDRS), docker-compose updates, and the room key ensure RPC (model + subject + TDD red/green). https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Pins bitnami/valkey-cluster:8 for cluster-mode integration tests. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
…ency
room:{roomID}:key and room:{roomID}:key:prev now share the same hash tag
so both keys always land on the same cluster slot. Required for the Lua
rotate script and DEL pipeline to work without CROSSSLOT errors.
https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
…erStore Parallel cluster path alongside standalone. clusterAdapter wraps *redis.ClusterClient with the same hashCommander interface — valkeyStore and all its methods are unchanged. rotateScript works unchanged because hash-tagged keys guarantee same-slot execution. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Three new integration tests exercise the clusterAdapter code path: - RoundTrip: Set → Get → Delete - RotateRoundTrip: Set → Rotate → GetByVersion (current + prev slots) - HashTagSlotConsistency: CLUSTER KEYSLOT asserts both key names share a slot; rotate Lua script confirms no CROSSSLOT error at runtime Helper uses valkey/valkey:8 in --cluster-enabled mode (single node with all 16384 slots via ADDSLOTSRANGE) plus a ClusterSlots override so go-redis resolves the externally-mapped address rather than the internal 127.0.0.1:6379 the node announces. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Adds clusterRedisClient (wrapping *redis.ClusterClient) and ConnectCluster constructor, mirroring the existing Connect/redisClient pair but targeting a cluster seed-address list instead of a single addr. Both satisfy the same Client interface, so callers switch by swapping the constructor. Unit test covers the error-wrapping path (bad address → "valkey cluster connect: …"). Integration tests exercise clusterRedisClient.Get/Set/Del against a single-node cluster-mode Valkey via the ClusterSlots override (avoids the testcontainers port-translation problem with cluster topology discovery that prevents calling ConnectCluster directly in tests). https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
…mode) Replace the single-address NewValkeyStore with NewValkeyClusterStore across all five services that connect to Valkey: - room-service: ValkeyAddr → ValkeyAddrs, NewValkeyStore → NewValkeyClusterStore - room-worker: same - broadcast-worker: same, plus update the empty-address guard message - history-service/config: ValkeyConfig.Addr → Addrs (envSeparator:",") - history-service/main: same guard + NewValkeyClusterStore - search-service: ValkeyConfig.Addr → Addrs, Connect → ConnectCluster All services now accept VALKEY_ADDRS=<seed1>,<seed2>,... instead of the single VALKEY_ADDR, enabling Valkey cluster-mode deployments per site. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
All service docker-compose.yml files: - VALKEY_ADDR → VALKEY_ADDRS (matches migrated service configs) docker-local/compose.deps.yaml and history-service/deploy/docker-compose.yml: - Replace single-node valkey/valkey:8-alpine with a single-node cluster-mode instance: entrypoint starts valkey-server --cluster-enabled yes, waits for PING, then runs CLUSTER ADDSLOTSRANGE 0 16383 to form a valid single-master cluster. Healthcheck verifies cluster_state:ok before dependents start. This ensures local dev and CI compose stacks match the cluster-mode client that all services now use (NewValkeyClusterStore / valkeyutil.ConnectCluster). https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
pkg/model:
- RoomKeyEnsureRequest{RoomID} — payload for the room key ensure RPC
- RoomKeyEnsureResponse{RoomID, Version, PublicKey, PrivateKey} — reply
(both keys returned; callers are trusted server-side components)
pkg/subject:
- RoomKeyEnsure(siteID) → "chat.server.request.room.{siteID}.key.ensure"
JSON round-trip tests added for both request and response types.
https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
New server-to-server RPC on chat.server.request.room.{siteID}.key.ensure:
- If a key already exists for the room, returns it immediately
- If no key exists, generates a fresh P-256 key pair, stores it via Set
(version 0), and returns it
- Returns RoomKeyEnsureResponse{roomId, version, publicKey, privateKey}
— both key bytes are returned because callers are trusted server-side
components (connectors, not end-clients)
Registered in RegisterCRUD under the "room-service" queue group.
Tests cover: key exists, key not found (set path), malformed request,
missing roomId, Get error, Set error, nil key store.
https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
…ailure leaks - Replace redisAdapter+clusterAdapter with a single universalAdapter in pkg/roomkeystore — both *redis.Client and *redis.ClusterClient implement redis.UniversalClient, eliminating ~70 lines of duplicated method bodies - Replace redisClient+clusterRedisClient with universalClient in pkg/valkeyutil for the same reason - NewValkeyStore now closes the client on ping failure (matches the existing behaviour in Connect/ConnectCluster and prevents pool leaks) - Remove dead env struct tags from ClusterConfig (fields are populated directly by callers, never via caarlos0/env) - Drop redundant comment on roomprevkey (function name is self-explanatory) https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Replace the dual single-node/cluster code paths with a single *redis.ClusterClient path everywhere. Removes universalClient, redisAdapter, NewValkeyStore, and the old Connect function; adds a shared StartValkeyCluster testutil helper so container setup is centralised in one place rather than duplicated across six packages. - pkg/testutil/valkey.go (new): StartValkeyCluster starts a single-node cluster-mode Valkey container with ClusterSlots override for testcontainers address mapping - pkg/valkeyutil: remove Connect/redisClient/universalClient; keep only clusterClient; add WrapClusterClient for tests - pkg/roomkeystore: remove Config, NewValkeyStore, redisAdapter, universalAdapter; keep only clusterAdapter/NewValkeyClusterStore; add NewValkeyClusterStoreFromClient for tests; fix client leak on ping failure - Integration tests (roomkeystore, roomsubcache, valkeyutil, room-service, room-worker, search-service): replace per-package container boilerplate with testutil.StartValkeyCluster https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
… reply RoomKeyEnsureResponse was structurally identical to RoomKeyEvent minus Timestamp. Reuse RoomKeyEvent directly (with Timestamp set) and delete the near-duplicate type. Also drop the unused subject parameter from handleEnsureRoomKey and trim the over-explained doc comment on NatsHandleEnsureRoomKey. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
…prove log field - subject.go: update RoomKeyEnsure comment to reference RoomKeyEvent (not the deleted RoomKeyEnsureResponse type) - testimages.go: remove ValkeyCluster constant — StartValkeyCluster uses the plain Valkey image with manual slot assignment, not bitnami/valkey-cluster - broadcast-worker/main.go: replace misleading valkey_addrs_set boolean field with valkey_addrs slice so the actual configured value is visible in the log https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
- history-service/internal/config/config.go: remove "Addrs is validated only when encryption is enabled (see main.go)" — cross-file reference in a struct comment is fragile and the information belongs at the validation site - room-service/handler_test.go: remove // --- TestHandler_EnsureRoomKey --- section divider; test names already provide sufficient navigation https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
- adapter.go: drop trailing sentence from NewValkeyClusterStore doc that restated what ClusterConfig's field comment already explains - valkey.go: drop clusterClient type comment that restated the type name; the interface-level Client doc is the right place for consumer guidance https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
The connector only needs to ensure a room has an encryption key pair stored in
Valkey — it does not consume the key material itself. broadcast-worker reads the
public key from Valkey to encrypt outgoing messages, and clients receive the
private key via the room-worker fan-out path. Returning both keys to a caller
that doesn't need them violates least-privilege.
Replace the RoomKeyEvent response (which carries PublicKey + PrivateKey) with a
new RoomKeyEnsureResponse { roomId, version } that confirms the key exists in
Valkey without exposing key bytes. Behaviour is otherwise unchanged: existing
keys are returned as-is, missing keys are generated and stored.
https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Both TestRoomKeyEnsureRequestJSON and TestRoomKeyEnsureResponseJSON were hand-rolling the marshal/unmarshal/DeepEqual cycle. The generic roundTrip helper at the bottom of model_test.go is the established convention used 20+ times in this file for the same purpose. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
loadgen was left using the removed NewValkeyStore/Config single-node API; update to NewValkeyClusterStore/ClusterConfig to match all other services. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
470b3a1 to
c1abab9
Compare
- history-service/cmd/main.go: fix goimports formatting (blank line) - tools/loadgen/main.go: fix goimports formatting (struct field alignment) - pkg/valkeyutil/valkey.go: restore WHY comment on ping-failure close path (closes half-constructed ClusterClient to prevent go-redis pool leaks) https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
- pkg/roomkeystore/adapter.go: replace strings.Contains error match with isLuaNoCurrentKeyErr helper using exact match; add WHY comment explaining go-redis surfaces Lua error_reply as an untyped string error; drop unused "strings" import - docker-local/compose.deps.yaml: make CLUSTER ADDSLOTSRANGE idempotent — guard with cluster_slots_assigned:16384 check so container restarts don't exit due to already-assigned slots - docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md: add "text" language identifier to three unlabeled fenced code blocks (MD040); add bson tag to RoomKeyEnsureRequest spec example to match coding guidelines https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Combined `len(cfg.ValkeyAddrs) == 0 || cfg.ValkeyKeyGracePeriod <= 0` with a single ambiguous error message. Splitting into two checks with specific messages matches room-worker's pattern and makes misconfigured startup failures immediately actionable. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md`:
- Around line 309-313: Update the documented RPC response and examples to use
the new RoomKeyEnsureResponse contract instead of model.RoomKeyEvent (i.e.,
remove raw key bytes from replies); replace occurrences of `model.RoomKeyEvent{
RoomID, Version, PublicKey, PrivateKey, Timestamp }` with
`RoomKeyEnsureResponse` and ensure the sequence example that shows the RPC reply
(previously emitting PrivateKey/PublicKey) now only includes confirmation fields
present on RoomKeyEnsureResponse; also adjust any text referencing
`keyStore.Set`/idempotency to state that keys are stored but not returned in the
RPC response.
In `@tools/loadgen/main.go`:
- Around line 43-44: The VALKEY_ADDRS env tag is marked required on the config
struct (fields ValkeyAddrs / ValkeyPassword) which forces config parsing to fail
before subcommand dispatch; remove the `required` constraint from the
ValkeyAddrs (and related ValkeyPassword) struct tags so parsing succeeds for
commands that don't need the keystore, and add/ensure a runtime check inside
connectKeyStore (the connectKeyStore function used by the seed and teardown
subcommands) to validate that ValkeyAddrs is present and return a clear error if
missing; also apply the same tag removal to the duplicate fields referenced
around the later block (the other Valkey* fields noted in the comment) so only
connectKeyStore enforces presence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ed7e147e-6175-4449-a207-e9b8fff447d3
📒 Files selected for processing (29)
broadcast-worker/deploy/docker-compose.ymlbroadcast-worker/main.godocker-local/compose.deps.yamldocs/superpowers/plans/2026-05-19-valkey-cluster-support.mddocs/superpowers/specs/2026-05-19-valkey-cluster-support-design.mdpkg/model/event.gopkg/model/model_test.gopkg/roomkeystore/adapter.gopkg/roomkeystore/integration_test.gopkg/roomkeystore/roomkeystore.gopkg/roomsubcache/integration_test.gopkg/subject/subject.gopkg/testutil/valkey.gopkg/valkeyutil/integration_test.gopkg/valkeyutil/valkey.gopkg/valkeyutil/valkey_test.goroom-service/deploy/docker-compose.ymlroom-service/handler.goroom-service/handler_test.goroom-service/integration_test.goroom-service/main.goroom-worker/deploy/docker-compose.ymlroom-worker/integration_test.goroom-worker/main.goroom-worker/mock_publisher_test.gosearch-service/deploy/docker-compose.ymlsearch-service/integration_test.gosearch-service/main.gotools/loadgen/main.go
✅ Files skipped from review due to trivial changes (2)
- room-worker/mock_publisher_test.go
- search-service/deploy/docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (22)
- room-service/deploy/docker-compose.yml
- pkg/subject/subject.go
- pkg/valkeyutil/integration_test.go
- pkg/roomsubcache/integration_test.go
- pkg/valkeyutil/valkey_test.go
- pkg/model/event.go
- pkg/roomkeystore/roomkeystore.go
- broadcast-worker/deploy/docker-compose.yml
- pkg/testutil/valkey.go
- pkg/roomkeystore/adapter.go
- docker-local/compose.deps.yaml
- room-worker/deploy/docker-compose.yml
- room-worker/integration_test.go
- room-worker/main.go
- search-service/integration_test.go
- search-service/main.go
- broadcast-worker/main.go
- room-service/handler_test.go
- pkg/valkeyutil/valkey.go
- room-service/handler.go
- docs/superpowers/plans/2026-05-19-valkey-cluster-support.md
- room-service/integration_test.go
| - **Reply payload (success):** `model.RoomKeyEvent{ RoomID, Version, PublicKey, PrivateKey, Timestamp }` | ||
| - **Reply payload (error):** `model.ErrorResponse` via `natsutil.ReplyError` | ||
|
|
||
| **Idempotent by design:** if a key already exists in Valkey for the room, it is returned immediately without generating a new one. If no key exists (backfill case), a new key pair is generated, stored in Valkey via `keyStore.Set`, and then returned. | ||
|
|
There was a problem hiding this comment.
Room key ensure RPC response contract is outdated in this spec.
Line 309 and Line 348 still specify a model.RoomKeyEvent reply (including key material), but this PR’s current contract is RoomKeyEnsureResponse with confirmation fields only. Please update this section and the sequence example (Line 324) to avoid documenting raw key bytes in the response.
Also applies to: 333-334, 348-349
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md` around
lines 309 - 313, Update the documented RPC response and examples to use the new
RoomKeyEnsureResponse contract instead of model.RoomKeyEvent (i.e., remove raw
key bytes from replies); replace occurrences of `model.RoomKeyEvent{ RoomID,
Version, PublicKey, PrivateKey, Timestamp }` with `RoomKeyEnsureResponse` and
ensure the sequence example that shows the RPC reply (previously emitting
PrivateKey/PublicKey) now only includes confirmation fields present on
RoomKeyEnsureResponse; also adjust any text referencing
`keyStore.Set`/idempotency to state that keys are stored but not returned in the
RPC response.
| ValkeyAddrs []string `env:"VALKEY_ADDRS,required" envSeparator:","` | ||
| ValkeyPassword string `env:"VALKEY_PASSWORD" envDefault:""` |
There was a problem hiding this comment.
Don't require VALKEY_ADDRS for loadgen run.
Because config parsing happens before subcommand dispatch, this makes loadgen run fail at startup when VALKEY_ADDRS is unset, even though only seed and teardown call connectKeyStore.
Suggested fix
- ValkeyAddrs []string `env:"VALKEY_ADDRS,required" envSeparator:","`
+ ValkeyAddrs []string `env:"VALKEY_ADDRS" envSeparator:","`
ValkeyPassword string `env:"VALKEY_PASSWORD" envDefault:""`
@@
func connectKeyStore(cfg *config) (roomkeystore.RoomKeyStore, error) {
+ if len(cfg.ValkeyAddrs) == 0 {
+ return nil, fmt.Errorf("VALKEY_ADDRS is required for seed and teardown")
+ }
return roomkeystore.NewValkeyClusterStore(roomkeystore.ClusterConfig{
Addrs: cfg.ValkeyAddrs,
Password: cfg.ValkeyPassword,
GracePeriod: time.Hour,
})
}Also applies to: 182-186
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/loadgen/main.go` around lines 43 - 44, The VALKEY_ADDRS env tag is
marked required on the config struct (fields ValkeyAddrs / ValkeyPassword) which
forces config parsing to fail before subcommand dispatch; remove the `required`
constraint from the ValkeyAddrs (and related ValkeyPassword) struct tags so
parsing succeeds for commands that don't need the keystore, and add/ensure a
runtime check inside connectKeyStore (the connectKeyStore function used by the
seed and teardown subcommands) to validate that ValkeyAddrs is present and
return a clear error if missing; also apply the same tag removal to the
duplicate fields referenced around the later block (the other Valkey* fields
noted in the comment) so only connectKeyStore enforces presence.
PR #199 (Valkey cluster support) renamed the loadgen config field from ValkeyAddr (single string) to ValkeyAddrs ([]string) and updated main.go but missed three usages in main_test.go, breaking both the `test` and `lint` CI jobs on every PR rebased onto current main. Mirror the rename in the test cfg literals so the loadgen test binary builds again.
Main introduced Valkey cluster-mode (PR #199 - 'feat: Valkey cluster support with room key ensure RPC'). Reconciliation: - Adopt main's per-test testutil.StartValkeyCluster instead of my shared Valkey helpers. valkey is now per-test (each test gets its own cluster-mode container), not process-shared. - Drop my testutil.Valkey / EnsureValkey / TerminateValkey / FlushValkey — replaced by StartValkeyCluster. - Remove TerminateValkey from TerminateAll (nothing process-shared to clean up). - Update search-service per-endpoint files (integration_ccs_test.go, integration_rooms_test.go) to use valkeyutil.WrapClusterClient(testutil.StartValkeyCluster(t)). - Drop search-service's local valkeyClient/flushValkey helpers. - Keep main's roomkeystore/roomsubcache integration tests but switch roomsubcache back to internal package (consistency). - Add main_test.go for pkg/valkeyutil (newly has integration tests on main). - Re-migrate room-service setupNATS to testutil.NATS (main reverted this when it touched the file). - Update tools/loadgen unit tests for the new ValkeyAddrs []string config field. - Update CLAUDE.md: Valkey is now per-test, not shared.
main_test.go was missed when PR #199 renamed the config field from ValkeyAddr string to ValkeyAddrs []string, causing a typecheck lint failure on main. https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp Co-authored-by: Claude <noreply@anthropic.com>
End-to-end runnability reviewer found a ship-blocking regression: `tools/loadgen/deploy/docker-compose.yml:22` still set `VALKEY_ADDR` while loadgen now reads `VALKEY_ADDRS` (PR #199 cluster rename). First `make seed` on a fresh clone would exit 2 with "VALKEY_ADDRS is not set". Fixed the env var name + added a comment explaining the rename. Other reviewer-flagged items in the same pass: - CHANGES.md was internally inconsistent — the Valkey section's narrative still said "VALKEY_ADDR" in two places after PR #199 renamed it. Rewritten to say VALKEY_ADDRS with a parenthetical about the rename. - `scenarios.go` advertised `large-room-broadcast` and `message-mutate` without SKELETON tags despite both having `SKELETON` markers in code. Added "(skeleton — ...)" to the one-line descriptions so `loadgen scenarios` honestly flags them. - USAGE.md now has explicit "Status: SKELETON" disclosures for those two scenarios with a one-line note about what's stubbed, mirroring the existing disclosures for auth-load/first-dm/notification-fanout. - Stale "four-stage" wording in `scenario_firstdm.go:82,622` and `scenario_firstdm_test.go:106,394,490` (residue from the persist-stage removal in f599cba). Now consistently says "three stages" / `len == 3`. - Bare `return err` in the `subscribersAdapter.SubscribeData` shim at `scenario_firstdm.go:619` violated CLAUDE.md §3 (no bare error returns). Wrapped with `fmt.Errorf("subscribe %s: %w", subj, err)`. Verified all gates remain green: make lint 0 issues, go test -race -count=1 green (22.2s), go vet -tags integration clean, gosec 0 issues, compose YAML parses.
… rename PR #199 (Valkey cluster support, merged to main) renamed two roomkeystore APIs that the suite-v2 runner depends on: roomkeystore.NewValkeyStore(Config{Addr, ...}) → roomkeystore.NewValkeyClusterStore(ClusterConfig{Addrs, ...}) And RoomKeyStore.Set's pair parameter is now by value, not pointer (documented at pkg/roomkeystore/roomkeystore.go:84). These breakages were latent on the branch from the moment we rebased onto main; they only surfaced now because Task 1 of the Part-2 mishap plan tries to commit, triggering the pre-commit make lint hook for the first time. Fix is mechanical, scope-minimal: - runner.go: NewValkeyStore → NewValkeyClusterStore; Config → ClusterConfig; Addr (string) → Addrs ([]string{cfg.ValkeyAddr}). Single-seed-node pattern is fine for docker-local; multi-node deployments will need a comma-separated env var, which is out of scope here. - seed/loader.go: *pair deref at the Set call site. make lint passes (0 issues). No behavior change. https://claude.ai/code/session_0139upFqMPspygX8XqTjpRN1
Summary
pkg/roomkeystore(room:{roomID}:key) for Valkey cluster slot consistencyclusterAdapter+NewValkeyClusterStoreinpkg/roomkeystorewrapping*redis.ClusterClientConnectClusterinpkg/valkeyutilfor cluster-mode client creationVALKEY_ADDRtoVALKEY_ADDRS(comma-separated)NatsHandleEnsureRoomKeyRPC (chat.server.request.room.{siteID}.key.ensure) inroom-serviceRoomKeyEnsureRequest/RoomKeyEnsureResponsetypes andsubject.RoomKeyEnsurebuilderTest plan
make lint— 0 issuesmake test— all unit tests passpkg/roomkeystorecluster integration tests — all 3 passpkg/valkeyutilcluster integration tests — all 3 passroom-servicehandler unit tests — 7 tests for EnsureRoomKey all passGenerated by Claude Code
Summary by CodeRabbit
Release Notes
New Features
Chores
VALKEY_ADDRS(comma-separated list) instead of single address.