feat: Valkey cluster support with room key ensure RPC#199

Merged
ngangwar962 merged 25 commits into main from claude/general-session-8c4ER
May 21, 2026

Conversation

@ngangwar962
Collaborator

@ngangwar962 ngangwar962 commented May 19, 2026

Summary

  • Hash-tagged keys in pkg/roomkeystore (room:{roomID}:key) for Valkey cluster slot consistency
  • clusterAdapter + NewValkeyClusterStore in pkg/roomkeystore wrapping *redis.ClusterClient
  • ConnectCluster in pkg/valkeyutil for cluster-mode client creation
  • All 5 services migrated from VALKEY_ADDR to VALKEY_ADDRS (comma-separated)
  • All local-dev compose files updated to single-node cluster-mode Valkey
  • NatsHandleEnsureRoomKey RPC (chat.server.request.room.{siteID}.key.ensure) in room-service
  • RoomKeyEnsureRequest/RoomKeyEnsureResponse types and subject.RoomKeyEnsure builder
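
The hash-tag bullet above can be illustrated with a small sketch. Valkey cluster assigns a key to one of 16384 slots by taking CRC-16/XMODEM of the key, and when the key contains a non-empty `{...}` section only the text inside the first such section is hashed. The `room:{roomID}:key` format comes from the summary; `prevRoomKey` is a hypothetical sibling key added only to show that two keys sharing a tag land in the same slot.

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 computes CRC-16/XMODEM (polynomial 0x1021, initial value 0),
// the checksum used for cluster key slots.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// keySlot returns the cluster slot for key. If the key contains a non-empty
// {...} section, only the text inside the first such section is hashed.
func keySlot(key string) uint16 {
	if open := strings.IndexByte(key, '{'); open >= 0 {
		if end := strings.IndexByte(key[open+1:], '}'); end > 0 {
			key = key[open+1 : open+1+end]
		}
	}
	return crc16([]byte(key)) % 16384
}

// roomKey follows the room:{roomID}:key format from the summary.
func roomKey(roomID string) string {
	return "room:{" + roomID + "}:key"
}

// prevRoomKey is a hypothetical sibling key, used only for illustration.
func prevRoomKey(roomID string) string {
	return "room:{" + roomID + "}:key:prev"
}

func main() {
	a, b := roomKey("abc123"), prevRoomKey("abc123")
	// Both keys hash only the tag "abc123", so the slots match.
	fmt.Println(a, b, keySlot(a) == keySlot(b))
}
```

Because both keys resolve to one slot, a multi-key script touching them should not trigger a CROSSSLOT error.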

Test plan

  • make lint — 0 issues
  • make test — all unit tests pass
  • pkg/roomkeystore cluster integration tests — all 3 pass
  • pkg/valkeyutil cluster integration tests — all 3 pass
  • room-service handler unit tests — 7 tests for EnsureRoomKey all pass

Generated by Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added server-to-server room key ensure RPC for resilient key management across distributed deployments.
  • Chores

    • Migrated Valkey infrastructure from single-node to cluster mode across services for improved scalability and fault tolerance.
    • Updated service configuration: Valkey endpoint now specified via VALKEY_ADDRS (comma-separated list) instead of single address.
    • Enhanced local development with cluster-enabled Valkey containers.
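
As a rough illustration of the `VALKEY_ADDRS` change, the sketch below splits a comma-separated value into individual addresses using only the standard library. The function name `parseAddrs` and the example host names are assumptions; the services in this PR appear to rely on struct tags (`envSeparator:","`) rather than hand-written parsing.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseAddrs splits a comma-separated VALKEY_ADDRS value into host:port
// entries, trimming whitespace and rejecting empty entries.
// Hypothetical helper, not part of the PR.
func parseAddrs(raw string) ([]string, error) {
	if strings.TrimSpace(raw) == "" {
		return nil, errors.New("VALKEY_ADDRS is required")
	}
	parts := strings.Split(raw, ",")
	addrs := make([]string, 0, len(parts))
	for _, p := range parts {
		p = strings.TrimSpace(p)
		if p == "" {
			return nil, fmt.Errorf("VALKEY_ADDRS contains an empty entry: %q", raw)
		}
		addrs = append(addrs, p)
	}
	return addrs, nil
}

func main() {
	// Example host names are placeholders.
	addrs, err := parseAddrs("valkey-0:6379, valkey-1:6379")
	fmt.Println(addrs, err)
}
```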

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 19, 2026

📝 Walkthrough

Walkthrough

This PR migrates the entire system from single-node Valkey to per-site independent Valkey cluster deployments. The change includes hash-tagged room key naming for cluster slot consistency, cluster-backed storage and connection adapters, a new room-key ensure NATS RPC handler, and coordinated configuration/docker-compose/integration test updates across all services.

Changes

Valkey Cluster Migration

Layer / File(s) Summary
Design Specification & Implementation Plan
docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md, docs/superpowers/plans/2026-05-19-valkey-cluster-support.md
Comprehensive design and implementation plan documents specifying per-site cluster topology, hash-tagged keys, cluster-aware storage/utilities, new room-service ensure RPC, service configuration changes, and an 11-task checklist.
Hash-Tagged Keys & Storage Types
pkg/roomkeystore/roomkeystore.go, pkg/model/event.go, pkg/model/model_test.go
Room key names now include {roomID} hash-tags to ensure same-cluster-slot mapping; exported Config removed and replaced by ClusterConfig; new RoomKeyEnsureRequest and RoomKeyEnsureResponse RPC models with JSON round-trip tests.
Cluster Storage Adapter & Constructors
pkg/roomkeystore/adapter.go
clusterAdapter wrapping *redis.ClusterClient replaces single-node redisAdapter; new ClusterConfig and constructors NewValkeyClusterStore (with connectivity validation) and NewValkeyClusterStoreFromClient (for test injection).
Cluster Startup Test Utility
pkg/testutil/valkey.go
Centralized StartValkeyCluster helper provisions single-node Valkey cluster-mode container, assigns full slot range, waits for readiness, and returns configured *redis.ClusterClient with cleanup.
Storage Integration Tests with Cluster
pkg/roomkeystore/integration_test.go
Tests migrated to cluster mode; round-trip, rotate, expiry, delete, and GetMany behavior preserved; new hash-tag slot consistency test validates no CROSSSLOT errors on rotate.
Cluster Connection Client & Constructors
pkg/valkeyutil/valkey.go
Cluster-backed Client implementation; new ConnectCluster (build, ping, validate *redis.ClusterClient) and WrapClusterClient (test injection) constructors; single-node Connect removed. JSON helpers remain compatible via Client interface.
Connection Integration & Unit Tests
pkg/valkeyutil/integration_test.go, pkg/valkeyutil/valkey_test.go
Cluster client integration tests (Set/Get/Del round-trip, cache-miss, empty Del); unit test for ConnectCluster error handling.
Room Key Ensure RPC Handler
room-service/handler.go, room-service/handler_test.go, pkg/subject/subject.go
New NATS request/reply handler NatsHandleEnsureRoomKey registered in RegisterCRUD; validates keyStore, decodes RoomKeyEnsureRequest, returns existing key version or generates+stores new keypair; comprehensive unit tests for success and error paths; new RoomKeyEnsure(siteID) subject builder.
Broadcast-Worker Cluster Configuration
broadcast-worker/main.go, broadcast-worker/deploy/docker-compose.yml
Config switched from VALKEY_ADDR to VALKEY_ADDRS (comma-separated); startup validation and NewValkeyClusterStore initialization when encryption enabled.
Room-Service Cluster Configuration
room-service/main.go, room-service/deploy/docker-compose.yml
Config now accepts VALKEY_ADDRS list; key store initialized via NewValkeyClusterStore; handler documentation and docker-compose env updated.
Room-Worker Cluster Configuration
room-worker/main.go, room-worker/deploy/docker-compose.yml, room-worker/mock_publisher_test.go
Config accepts VALKEY_ADDRS and constructs cluster store with ClusterConfig; docker-compose and test comments reflect cluster-mode requirement.
Search-Service Cluster Configuration
search-service/main.go, search-service/deploy/docker-compose.yml
ValkeyConfig uses Addrs []string from comma-separated VALKEY_ADDRS; connection switched from Connect to ConnectCluster; startup logging updated.
Load Generator Cluster Configuration
tools/loadgen/main.go
Config accepts required VALKEY_ADDRS comma-separated list; connectKeyStore uses NewValkeyClusterStore with clustered addresses.
Local Development Cluster Setup
docker-local/compose.deps.yaml
valkey service runs in single-node cluster mode with entrypoint shell script that enables clustering, assigns all slots, and waits for readiness; healthcheck validates CLUSTER INFO cluster state.
Integration Test Migration to Cluster Utilities
pkg/roomsubcache/integration_test.go, room-service/integration_test.go, room-worker/integration_test.go, search-service/integration_test.go
Tests refactored to use testutil.StartValkeyCluster and valkeyutil.WrapClusterClient instead of manual testcontainers; imports trimmed; helper signatures updated to return ready-to-use stores/clients.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hmchangw/chat#184: Introduces Valkey client wrappers that the main PR now switches to cluster-backed implementations in pkg/valkeyutil.
  • hmchangw/chat#171: Implements room encryption key storage in pkg/roomkeystore that the main PR refactors to cluster-aware adapter and constructors.
  • hmchangw/chat#106: Adds batch room info RPC with Valkey GetMany lookups that the main PR updates for cluster client compatibility.

Suggested reviewers

  • Joey0538
  • mliu33

🐰 Hop skip and a hop, the cluster's ready to hop!
Keys now tagged, no CROSSSLOT slots to dread,
Storage adapters spin up fast, tests all turn green instead,
Services dance in harmony, from broadcast to search,
A migration complete—now scalability's in reach! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.99% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the two main changes: Valkey cluster support and the addition of a room key ensure RPC, which are the primary objectives of this PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/general-session-8c4ER

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Trivy (0.69.3)

Trivy execution failed: 2026-05-19T11:23:50Z FATAL Fatal error run error: fs scan error: scan error: scan failed: failed analysis: post analysis error: post analysis error: kubernetes scan error: fs filter error: fs filter error: walk error open gitleaks-report-27.json: no such file or directory: open gitleaks-report-27.json: no such file or directory


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (3)
pkg/model/model_test.go (1)

822-855: ⚡ Quick win

Use roundTrip for these new model JSON round-trip tests.

These tests duplicate helper logic already centralized in roundTrip, which increases drift risk in pkg/model/model_test.go.

♻️ Proposed refactor
 func TestRoomKeyEnsureRequestJSON(t *testing.T) {
 	src := model.RoomKeyEnsureRequest{RoomID: "room-abc"}
-	data, err := json.Marshal(src)
-	if err != nil {
-		t.Fatalf("marshal: %v", err)
-	}
-	var dst model.RoomKeyEnsureRequest
-	if err := json.Unmarshal(data, &dst); err != nil {
-		t.Fatalf("unmarshal: %v", err)
-	}
-	if !reflect.DeepEqual(src, dst) {
-		t.Errorf("round-trip mismatch:\n  got  %+v\n  want %+v", dst, src)
-	}
+	roundTrip(t, &src, &model.RoomKeyEnsureRequest{})
 }

 func TestRoomKeyEnsureResponseJSON(t *testing.T) {
 	src := model.RoomKeyEnsureResponse{
 		RoomID:     "room-xyz",
 		Version:    3,
 		PublicKey:  []byte{0x04, 0xAB, 0xCD},
 		PrivateKey: []byte{0x7F, 0x01},
 	}
-	data, err := json.Marshal(src)
-	if err != nil {
-		t.Fatalf("marshal: %v", err)
-	}
-	var dst model.RoomKeyEnsureResponse
-	if err := json.Unmarshal(data, &dst); err != nil {
-		t.Fatalf("unmarshal: %v", err)
-	}
-	if !reflect.DeepEqual(src, dst) {
-		t.Errorf("round-trip mismatch:\n  got  %+v\n  want %+v", dst, src)
-	}
+	roundTrip(t, &src, &model.RoomKeyEnsureResponse{})
 }

As per coding guidelines: pkg/model/model_test.go must verify model marshal/unmarshal via the generic roundTrip helper.
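
For readers without the repository at hand, a centralized round-trip check might look roughly like the sketch below. `roundTripJSON` is a hypothetical, error-returning stand-in; the repository's actual `roundTrip` helper takes a `*testing.T` and its implementation is not shown in this PR. The struct field names come from the test above, while the JSON tags other than `roomId` are assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// ensureResponse uses the field names from the test in this PR;
// the JSON tags other than roomId are assumptions.
type ensureResponse struct {
	RoomID     string `json:"roomId"`
	Version    int    `json:"version"`
	PublicKey  []byte `json:"publicKey"`
	PrivateKey []byte `json:"privateKey"`
}

// roundTripJSON marshals src, unmarshals into dst, and reports a mismatch.
// Both arguments are expected to be pointers to the same type.
func roundTripJSON(src, dst any) error {
	data, err := json.Marshal(src)
	if err != nil {
		return fmt.Errorf("marshal: %w", err)
	}
	if err := json.Unmarshal(data, dst); err != nil {
		return fmt.Errorf("unmarshal: %w", err)
	}
	if !reflect.DeepEqual(src, dst) {
		return fmt.Errorf("round-trip mismatch: got %+v, want %+v", dst, src)
	}
	return nil
}

func main() {
	src := ensureResponse{
		RoomID:     "room-xyz",
		Version:    3,
		PublicKey:  []byte{0x04, 0xAB, 0xCD},
		PrivateKey: []byte{0x7F, 0x01},
	}
	fmt.Println(roundTripJSON(&src, &ensureResponse{}))
}
```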

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/model/model_test.go` around lines 822 - 855, Replace the manual
marshal/unmarshal checks in TestRoomKeyEnsureRequestJSON and
TestRoomKeyEnsureResponseJSON with calls to the existing roundTrip test helper:
for each test, construct the src value (RoomKeyEnsureRequest and
RoomKeyEnsureResponse respectively) and pass it to roundTrip together with an
empty destination of the same type, as in the proposed refactor (e.g.
roundTrip(t, &src, &model.RoomKeyEnsureRequest{})), so the centralized helper
performs JSON marshal/unmarshal and equality checks; remove the duplicated
marshal/unmarshal/assert logic in those test functions accordingly.
pkg/roomkeystore/integration_test.go (2)

341-348: ⚡ Quick win

Don't swallow CLUSTER INFO probe failures.

This loop drops both the Exec exit status and the io.ReadAll error, so a bad probe turns into a generic Eventually timeout with no clue what actually failed. Treat either condition as probe failure and keep the last output/error for the assertion message.

As per coding guidelines, "Never ignore errors silently — comment if intentionally discarded."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/roomkeystore/integration_test.go` around lines 341 - 348, The probe loop
in the require.Eventually closure swallows container.Exec and io.ReadAll errors
which yields an opaque timeout; modify the closure to record the last execution
error and last read error/output (e.g., lastExecErr, lastReadErr, lastOut)
declared outside the closure, return false on any execErr or read error, and
after Eventually completes assert with those captured values so the failure
message includes the real exec exit/error and the probe stdout (referencing
require.Eventually, container.Exec, io.ReadAll, and strings.Contains in the
change).

306-371: 🏗️ Heavy lift

Exercise NewValkeyClusterStore in this harness too.

setupValkeyCluster wires valkeyStore together directly, so these integration tests never touch the new public constructor. That leaves the constructor-specific behavior added in this PR—client setup, ping validation, and closer wiring—outside the cluster test path. Please add one constructor-level integration case, or refactor the helper so tests and production go through the same entrypoint.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/roomkeystore/integration_test.go` around lines 306 - 371, The cluster
test helper setupValkeyCluster currently constructs valkeyStore directly,
bypassing NewValkeyClusterStore and leaving constructor-specific behavior
untested; change the helper or add an integration test that calls
NewValkeyClusterStore (or refactor setupValkeyCluster to call
NewValkeyClusterStore internally) so the returned store and closer come from the
public constructor, ensuring client setup, Ping validation and closer wiring
exercised for cluster mode (update references to valkeyStore, clusterAdapter,
and the returned closer/ping assertions accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docker-local/compose.deps.yaml`:
- Around line 158-161: The entrypoint currently runs an unconditional
"valkey-cli CLUSTER ADDSLOTSRANGE 0 16383" which fails on restarts and causes
the container to exit; modify the startup sequence (the entrypoint/sh -c
command) to make slot assignment idempotent by first checking whether slots are
already assigned (e.g., call "valkey-cli CLUSTER SLOTS" or a similar inspection
command) and only run "valkey-cli CLUSTER ADDSLOTSRANGE 0 16383" if no slots are
present, or else allow failures to be ignored (e.g., conditional execution or
"|| true") so the container continues to become healthy; update the shell
pipeline surrounding the valkey-server launch, the until loop, and the CLUSTER
ADDSLOTSRANGE invocation to incorporate that conditional check.

In `@docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md`:
- Around line 29-40: The markdown has unlabeled fenced code blocks (e.g., the
blocks showing "site: ftest ...", "room:abc123:key", and the "external
connector" example) which trigger MD040; fix by adding a language identifier
(use "text") to each opening triple-backtick fence for those blocks (and the
other occurrences noted around lines 52-55 and 314-325) so they become ```text
instead of ``` ensuring all fenced code blocks are labeled.
- Around line 341-345: The example struct RoomKeyEnsureRequest is missing bson
struct tags; update the RoomKeyEnsureRequest model to include both json and bson
tags for each field (e.g., RoomID should have `json:"roomId"` and
`bson:"roomId"`), ensuring the spec example matches repository coding guidelines
and prevent implementers from copying a non-compliant shape.

In `@pkg/roomkeystore/adapter.go`:
- Around line 199-201: When the cluster client `c` fails its connectivity check
(`c.Ping(ctx).Err()`), close the created client before returning the error to
avoid resource leaks; modify the error path in the same block so that you call
the client's close method (e.g., `c.Close()` or the appropriate close/shutdown
method on the cluster client) and handle/ignore its error, then return the
formatted error message as before.
- Around line 175-176: Replace fragile string-matching of the Lua error in both
redisAdapter.rotatePipeline and clusterAdapter.rotatePipeline with a
deterministic sentinel error and check using errors.Is; define a package-level
sentinel (e.g., var ErrNoCurrentKey = errors.New("roomkeystore: no current
key")) or wrap the Redis/Lua error into that sentinel when receiving the redis
reply, then change the conditional from strings.Contains(err.Error(), "no
current key") to errors.Is(err, ErrNoCurrentKey) in both rotatePipeline
implementations so callers can rely on typed error comparison.

In `@room-service/handler.go`:
- Around line 1195-1215: The current check-then-set using h.keyStore.Get,
roomkeystore.GenerateKeyPair, and h.keyStore.Set is racy: two requests can both
see no key, generate different pairs, and both call Set. Make the operation
atomic by moving key-creation into an atomic create-if-absent on the keystore
(e.g. implement and call a CreateIfAbsent / GetOrCreate / SetIfAbsent method on
h.keyStore that either returns the existing entry or stores and returns the
newly generated pair), and ensure you only call roomkeystore.GenerateKeyPair
inside the create callback so you generate a key only when the keystore actually
performs the insert; if the keystore API cannot be changed, implement a retry:
attempt Set with a non-overwrite flag and if it fails due to existing entry,
return the existing entry from Get.

In `@search-service/main.go`:
- Around line 37-38: The Password field in the config struct (symbol Password;
nearby Addrs) currently defaults to an empty string which weakens startup
guarantees; change its env tag to mark the secret as required (e.g.
env:"PASSWORD,required", matching the sibling ADDRS tag that is read as
VALKEY_ADDRS, so the variable resolves to VALKEY_PASSWORD) instead of
envDefault:"", so the application fails fast when the secret is missing and an
empty default is no longer applied at startup.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0eb6287e-8a76-4914-8886-4c39720c9539

📥 Commits

Reviewing files that changed from the base of the PR and between a55a505 and 3d2d18b.

📒 Files selected for processing (27)
  • broadcast-worker/deploy/docker-compose.yml
  • broadcast-worker/main.go
  • docker-local/compose.deps.yaml
  • docs/superpowers/plans/2026-05-19-valkey-cluster-support.md
  • docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md
  • history-service/cmd/main.go
  • history-service/deploy/docker-compose.yml
  • history-service/internal/config/config.go
  • pkg/model/event.go
  • pkg/model/model_test.go
  • pkg/roomkeystore/adapter.go
  • pkg/roomkeystore/integration_test.go
  • pkg/roomkeystore/roomkeystore.go
  • pkg/subject/subject.go
  • pkg/testutil/testimages/testimages.go
  • pkg/valkeyutil/integration_test.go
  • pkg/valkeyutil/valkey.go
  • pkg/valkeyutil/valkey_test.go
  • room-service/deploy/docker-compose.yml
  • room-service/handler.go
  • room-service/handler_test.go
  • room-service/main.go
  • room-worker/deploy/docker-compose.yml
  • room-worker/main.go
  • room-worker/mock_publisher_test.go
  • search-service/deploy/docker-compose.yml
  • search-service/main.go

Comment thread docker-local/compose.deps.yaml Outdated
Comment thread docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md Outdated
Comment thread docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md
Comment thread pkg/roomkeystore/adapter.go Outdated
Comment on lines +175 to +176
if err != nil && strings.Contains(err.Error(), "no current key") {
return 0, ErrNoCurrentKey
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify all string-based matching points tied to "no current key" so both adapters can be updated consistently.
rg -n -C2 'redis\.error_reply\(.+no current key|strings\.Contains\(err\.Error\(\),\s*"no current key"\)' pkg/roomkeystore/adapter.go

Repository: hmchangw/chat

Length of output: 635


🏁 Script executed:

# Read the relevant sections of the adapter to understand the error context
head -180 pkg/roomkeystore/adapter.go | tail -70

Repository: hmchangw/chat

Length of output: 2521


🏁 Script executed:

# Also check if there are error constant definitions in this file
rg -n 'ErrNoCurrentKey|error_reply' pkg/roomkeystore/adapter.go | head -20

Repository: hmchangw/chat

Length of output: 170


🏁 Script executed:

# Check the rotateScript definition to understand the error wrapping
rg -n 'rotateScript|redis.Script' pkg/roomkeystore/adapter.go -A 5

Repository: hmchangw/chat

Length of output: 1035


🏁 Script executed:

# Find where ErrNoCurrentKey is defined
rg -n 'ErrNoCurrentKey' pkg/roomkeystore/ -B 2 -A 2

Repository: hmchangw/chat

Length of output: 3329


🏁 Script executed:

# Check the imports and error handling patterns in the file
head -30 pkg/roomkeystore/adapter.go

Repository: hmchangw/chat

Length of output: 856


🏁 Script executed:

# Look for any error.Is checks or error type checking in the codebase
rg -n 'errors\.Is|errors\.As' pkg/roomkeystore/ | head -10

Repository: hmchangw/chat

Length of output: 313


Use sentinel error or custom error type instead of string matching in both adapter rotate methods.

Both redisAdapter.rotatePipeline() (line 63) and clusterAdapter.rotatePipeline() (line 175) match the Redis error by message text:

if err != nil && strings.Contains(err.Error(), "no current key") {
    return 0, ErrNoCurrentKey
}

This violates the coding guideline: "Never compare errors by string — use errors.Is() and errors.As() for error type checking." The Lua script's redis.error_reply('no current key') gets wrapped as an untyped error, making the string match brittle—any change to the script message breaks sentinel mapping.

Refactor to define a deterministic contract (e.g., return a specific error type or wrapper from Lua, or create a custom sentinel within the adapter package) that can be checked without string parsing. Update both adapters consistently.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/roomkeystore/adapter.go` around lines 175 - 176, Replace fragile
string-matching of the Lua error in both redisAdapter.rotatePipeline and
clusterAdapter.rotatePipeline with a deterministic sentinel error and check
using errors.Is; define a package-level sentinel (e.g., var ErrNoCurrentKey =
errors.New("roomkeystore: no current key")) or wrap the Redis/Lua error into
that sentinel when receiving the redis reply, then change the conditional from
strings.Contains(err.Error(), "no current key") to errors.Is(err,
ErrNoCurrentKey) in both rotatePipeline implementations so callers can rely on
typed error comparison.

Comment thread pkg/roomkeystore/adapter.go
Comment thread room-service/handler.go
Comment thread search-service/main.go
Comment on lines +37 to +38
Addrs []string `env:"ADDRS,required" envSeparator:","`
Password string `env:"PASSWORD" envDefault:""`
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require VALKEY_PASSWORD instead of defaulting it to empty.

Line 38 defaults a secret to "", which weakens startup config guarantees in main.go.

Suggested fix
 type ValkeyConfig struct {
 	Addrs    []string `env:"ADDRS,required" envSeparator:","`
-	Password string   `env:"PASSWORD"        envDefault:""`
+	Password string   `env:"PASSWORD,required"`
 }

As per coding guidelines: "never default secrets or connection strings — mark them required."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@search-service/main.go` around lines 37 - 38, The Password field in the
config struct (symbol Password; nearby Addrs) currently defaults to an empty
string which weakens startup guarantees; change its env tag to mark the secret
as required, as in the suggested fix (env:"PASSWORD,required", matching the
sibling ADDRS tag that is read as VALKEY_ADDRS, so the variable resolves to
VALKEY_PASSWORD) instead of envDefault:"", so the application fails fast when
the secret is missing and an empty default is no longer applied at startup.

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (2)
room-service/handler.go (1)

1196-1218: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make the room-key ensure path atomic.

This is still a racy Get → generate → Set sequence. Two concurrent requests can both see "missing", generate different pairs, and both write, so the second request can overwrite the first key immediately. Please move this to a store-level create-if-absent/ensure primitive and generate the key only inside that atomic path.

Suggested direction
-	existing, err := h.keyStore.Get(ctx, req.RoomID)
-	if err != nil {
-		return nil, fmt.Errorf("ensure room key: get: %w", err)
-	}
-	if existing != nil {
-		return json.Marshal(model.RoomKeyEnsureResponse{
-			RoomID:  req.RoomID,
-			Version: existing.Version,
-		})
-	}
-
-	newPair, err := roomkeystore.GenerateKeyPair()
-	if err != nil {
-		return nil, fmt.Errorf("ensure room key: generate key pair: %w", err)
-	}
-	ver, err := h.keyStore.Set(ctx, req.RoomID, newPair)
-	if err != nil {
-		return nil, fmt.Errorf("ensure room key: set: %w", err)
-	}
+	ensured, err := h.keyStore.Ensure(ctx, req.RoomID)
+	if err != nil {
+		return nil, fmt.Errorf("ensure room key: ensure: %w", err)
+	}
 	return json.Marshal(model.RoomKeyEnsureResponse{
 		RoomID:  req.RoomID,
-		Version: ver,
+		Version: ensured.Version,
 	})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@room-service/handler.go` around lines 1196 - 1218, The current Get →
GenerateKeyPair → Set flow is racy; change the keystore API to provide an atomic
ensure/create-if-absent primitive (e.g.,
KeyStore.Ensure/CreateIfAbsent/EnsureKey) that accepts the room ID and a creator
callback so the store will call the callback only when it needs to create and
will return the existing version if present. Replace the h.keyStore.Get +
roomkeystore.GenerateKeyPair + h.keyStore.Set sequence with a single call to
that new method (pass a closure that calls roomkeystore.GenerateKeyPair) and
return model.RoomKeyEnsureResponse using the version returned by the atomic
ensure call; update implementations of KeyStore to perform the create-if-missing
atomically.
pkg/roomkeystore/adapter.go (1)

63-65: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Replace the Lua error text match with a stable sentinel path.

rotatePipeline still depends on strings.Contains(err.Error(), "no current key") to translate the script failure into ErrNoCurrentKey. That makes caller behavior depend on go-redis's formatted error text instead of a deterministic contract. Please switch this to a sentinel/custom error flow that can be checked with errors.Is.

As per coding guidelines, "Never compare errors by string — use errors.Is() and errors.As() for error type checking."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/roomkeystore/adapter.go` around lines 63 - 65, The code currently matches
the Lua error text via strings.Contains on rotateScript.Run's error; change the
Lua script to return a stable sentinel token (e.g., return
redis.error_reply("NO_CURRENT_KEY")) and update the Go side (where
rotateScript.Run is called in the rotatePipeline/adapter code) to detect that
exact token and wrap it into the package sentinel ErrNoCurrentKey (e.g., if err
!= nil && strings.Contains(err.Error(), "NO_CURRENT_KEY") { return 0,
fmt.Errorf("%w: redis script returned NO_CURRENT_KEY", ErrNoCurrentKey) }) so
callers can use errors.Is(err, ErrNoCurrentKey); reference rotateScript.Run and
ErrNoCurrentKey when making these changes.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee6f4b1d-e38d-4abe-ac01-fefefaa9baae

📥 Commits

Reviewing files that changed from the base of the PR and between 31ac860 and 08be38e.

📒 Files selected for processing (8)
  • history-service/internal/config/config.go
  • pkg/model/event.go
  • pkg/model/model_test.go
  • pkg/roomkeystore/adapter.go
  • pkg/subject/subject.go
  • pkg/valkeyutil/valkey.go
  • room-service/handler.go
  • room-service/handler_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/subject/subject.go
  • pkg/model/model_test.go
  • pkg/model/event.go
  • room-service/handler_test.go
  • history-service/internal/config/config.go
  • pkg/valkeyutil/valkey.go

claude added 22 commits May 19, 2026 10:48
Covers hash-tagged key names, ClusterConfig/NewValkeyClusterStore,
valkeyutil.ConnectCluster, per-service config migration from
VALKEY_ADDR to VALKEY_ADDRS, and per-site docker-compose changes.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
- Cluster mode fully replaces standalone; VALKEY_ADDR is retired
- Remove draft "Wait —" note from valkeyutil section; replace with
  clean clusterRedisClient design
- Backward compat section rewritten to state VALKEY_ADDRS is the
  only connection path going forward

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Documents the new NatsHandleEnsureRoomKey handler in room-service
that external connectors can call to get or generate a room key
without touching Valkey directly. Covers subject, model type,
idempotency contract, PublicKey inclusion rationale, no-fan-out
design decision, and TDD test scenarios.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
11 tasks covering: image constant, hash-tag key names, cluster adapter,
cluster integration tests, valkeyutil ConnectCluster, service config
migration (VALKEY_ADDR → VALKEY_ADDRS), docker-compose updates, and
the room key ensure RPC (model + subject + TDD red/green).

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
…ency

room:{roomID}:key and room:{roomID}:key:prev now share the same hash tag
so both keys always land on the same cluster slot. Required for the Lua
rotate script and DEL pipeline to work without CROSSSLOT errors.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
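
Why the shared hash tag is enough can be checked with a small sketch of the cluster key-slot rule: only the text between the first `{` and the next `}` is hashed (CRC16, XMODEM variant, modulo 16384). The function names below are illustrative and `r1` is a made-up room ID; the code re-implements the rule locally instead of calling a client library.

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 computes CRC-16/XMODEM (polynomial 0x1021, initial value 0),
// the checksum used for cluster key slots.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashTag returns the part of the key that gets hashed: the text between
// the first '{' and the next '}' when it is non-empty, else the whole key.
func hashTag(key string) string {
	start := strings.IndexByte(key, '{')
	if start < 0 {
		return key
	}
	end := strings.IndexByte(key[start+1:], '}')
	if end <= 0 { // no closing brace, or an empty "{}" tag
		return key
	}
	return key[start+1 : start+1+end]
}

// keySlot maps a key to one of the 16384 cluster slots.
func keySlot(key string) uint16 {
	return crc16([]byte(hashTag(key))) % 16384
}

func main() {
	cur, prev := "room:{r1}:key", "room:{r1}:key:prev"
	fmt.Println(keySlot(cur) == keySlot(prev)) // true
}
```

Both names reduce to the tag `r1` before hashing, so they land on the same slot regardless of the suffix.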
…erStore

Parallel cluster path alongside standalone. clusterAdapter wraps
*redis.ClusterClient with the same hashCommander interface — valkeyStore
and all its methods are unchanged. rotateScript works unchanged because
hash-tagged keys guarantee same-slot execution.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Three new integration tests exercise the clusterAdapter code path:
- RoundTrip: Set → Get → Delete
- RotateRoundTrip: Set → Rotate → GetByVersion (current + prev slots)
- HashTagSlotConsistency: CLUSTER KEYSLOT asserts both key names share a
  slot; rotate Lua script confirms no CROSSSLOT error at runtime

Helper uses valkey/valkey:8 in --cluster-enabled mode (single node with
all 16384 slots via ADDSLOTSRANGE) plus a ClusterSlots override so
go-redis resolves the externally-mapped address rather than the internal
127.0.0.1:6379 the node announces.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Adds clusterRedisClient (wrapping *redis.ClusterClient) and ConnectCluster
constructor, mirroring the existing Connect/redisClient pair but targeting
a cluster seed-address list instead of a single addr. Both satisfy the
same Client interface, so callers switch by swapping the constructor.

Unit test covers the error-wrapping path (bad address → "valkey cluster
connect: …"). Integration tests exercise clusterRedisClient.Get/Set/Del
against a single-node cluster-mode Valkey via the ClusterSlots override
(avoids the testcontainers port-translation problem with cluster topology
discovery that prevents calling ConnectCluster directly in tests).

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
…mode)

Replace the single-address NewValkeyStore with NewValkeyClusterStore across
all five services that connect to Valkey:

- room-service:   ValkeyAddr → ValkeyAddrs, NewValkeyStore → NewValkeyClusterStore
- room-worker:    same
- broadcast-worker: same, plus update the empty-address guard message
- history-service/config: ValkeyConfig.Addr → Addrs (envSeparator:",")
- history-service/main: same guard + NewValkeyClusterStore
- search-service: ValkeyConfig.Addr → Addrs, Connect → ConnectCluster

All services now accept VALKEY_ADDRS=<seed1>,<seed2>,... instead of the
single VALKEY_ADDR, enabling Valkey cluster-mode deployments per site.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
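
For illustration, a standard-library sketch of turning a comma-separated seed list into a slice. The services themselves populate the field through struct tags (`envSeparator:","`); whether that path trims whitespace or drops empty entries is not stated here, so the trimming below is an assumption of this sketch, and `parseAddrs` and the host names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAddrs splits a comma-separated seed list, trimming whitespace
// and dropping empty entries such as the one left by a trailing comma.
func parseAddrs(raw string) []string {
	var addrs []string
	for _, part := range strings.Split(raw, ",") {
		if addr := strings.TrimSpace(part); addr != "" {
			addrs = append(addrs, addr)
		}
	}
	return addrs
}

func main() {
	// Example value for VALKEY_ADDRS; the host names are made up.
	fmt.Println(parseAddrs("valkey-0:6379, valkey-1:6379,"))
}
```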
All service docker-compose.yml files:
- VALKEY_ADDR → VALKEY_ADDRS (matches migrated service configs)

docker-local/compose.deps.yaml and history-service/deploy/docker-compose.yml:
- Replace single-node valkey/valkey:8-alpine with a single-node cluster-mode
  instance: entrypoint starts valkey-server --cluster-enabled yes, waits for
  PING, then runs CLUSTER ADDSLOTSRANGE 0 16383 to form a valid single-master
  cluster. Healthcheck verifies cluster_state:ok before dependents start.

This ensures local dev and CI compose stacks match the cluster-mode client
that all services now use (NewValkeyClusterStore / valkeyutil.ConnectCluster).

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
pkg/model:
- RoomKeyEnsureRequest{RoomID} — payload for the room key ensure RPC
- RoomKeyEnsureResponse{RoomID, Version, PublicKey, PrivateKey} — reply
  (both keys returned; callers are trusted server-side components)

pkg/subject:
- RoomKeyEnsure(siteID) → "chat.server.request.room.{siteID}.key.ensure"

JSON round-trip tests added for both request and response types.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
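
A sketch of what such a subject builder might look like. The subject pattern is taken from the commit message; the function body is an assumption, and the real `pkg/subject` implementation may be written differently.

```go
package main

import "fmt"

// RoomKeyEnsure builds the request subject for the room key ensure RPC
// of one site.
func RoomKeyEnsure(siteID string) string {
	return fmt.Sprintf("chat.server.request.room.%s.key.ensure", siteID)
}

func main() {
	// "site-a" is an illustrative site ID.
	fmt.Println(RoomKeyEnsure("site-a")) // chat.server.request.room.site-a.key.ensure
}
```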
New server-to-server RPC on chat.server.request.room.{siteID}.key.ensure:
- If a key already exists for the room, returns it immediately
- If no key exists, generates a fresh P-256 key pair, stores it via Set
  (version 0), and returns it
- Returns RoomKeyEnsureResponse{roomId, version, publicKey, privateKey}
  — both key bytes are returned because callers are trusted server-side
  components (connectors, not end-clients)

Registered in RegisterCRUD under the "room-service" queue group.

Tests cover: key exists, key not found (set path), malformed request,
missing roomId, Get error, Set error, nil key store.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
…ailure leaks

- Replace redisAdapter+clusterAdapter with a single universalAdapter in
  pkg/roomkeystore — both *redis.Client and *redis.ClusterClient implement
  redis.UniversalClient, eliminating ~70 lines of duplicated method bodies
- Replace redisClient+clusterRedisClient with universalClient in
  pkg/valkeyutil for the same reason
- NewValkeyStore now closes the client on ping failure (matches the
  existing behaviour in Connect/ConnectCluster and prevents pool leaks)
- Remove dead env struct tags from ClusterConfig (fields are populated
  directly by callers, never via caarlos0/env)
- Drop redundant comment on roomprevkey (function name is self-explanatory)

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Replace the dual single-node/cluster code paths with a single *redis.ClusterClient
path everywhere. Removes universalClient, redisAdapter, NewValkeyStore, and the old
Connect function; adds a shared StartValkeyCluster testutil helper so container setup
is centralised in one place rather than duplicated across six packages.

- pkg/testutil/valkey.go (new): StartValkeyCluster starts a single-node cluster-mode
  Valkey container with ClusterSlots override for testcontainers address mapping
- pkg/valkeyutil: remove Connect/redisClient/universalClient; keep only clusterClient;
  add WrapClusterClient for tests
- pkg/roomkeystore: remove Config, NewValkeyStore, redisAdapter, universalAdapter;
  keep only clusterAdapter/NewValkeyClusterStore; add NewValkeyClusterStoreFromClient
  for tests; fix client leak on ping failure
- Integration tests (roomkeystore, roomsubcache, valkeyutil, room-service, room-worker,
  search-service): replace per-package container boilerplate with testutil.StartValkeyCluster

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
… reply

RoomKeyEnsureResponse was structurally identical to RoomKeyEvent minus
Timestamp. Reuse RoomKeyEvent directly (with Timestamp set) and delete the
near-duplicate type. Also drop the unused subject parameter from
handleEnsureRoomKey and trim the over-explained doc comment on
NatsHandleEnsureRoomKey.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
…prove log field

- subject.go: update RoomKeyEnsure comment to reference RoomKeyEvent (not the
  deleted RoomKeyEnsureResponse type)
- testimages.go: remove ValkeyCluster constant — StartValkeyCluster uses the
  plain Valkey image with manual slot assignment, not bitnami/valkey-cluster
- broadcast-worker/main.go: replace misleading valkey_addrs_set boolean field
  with valkey_addrs slice so the actual configured value is visible in the log

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
- history-service/internal/config/config.go: remove "Addrs is validated only
  when encryption is enabled (see main.go)" — cross-file reference in a struct
  comment is fragile and the information belongs at the validation site
- room-service/handler_test.go: remove // --- TestHandler_EnsureRoomKey ---
  section divider; test names already provide sufficient navigation

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
- adapter.go: drop trailing sentence from NewValkeyClusterStore doc that
  restated what ClusterConfig's field comment already explains
- valkey.go: drop clusterClient type comment that restated the type name;
  the interface-level Client doc is the right place for consumer guidance

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
The connector only needs to ensure a room has an encryption key pair stored in
Valkey — it does not consume the key material itself. broadcast-worker reads the
public key from Valkey to encrypt outgoing messages, and clients receive the
private key via the room-worker fan-out path. Returning both keys to a caller
that doesn't need them violates least-privilege.

Replace the RoomKeyEvent response (which carries PublicKey + PrivateKey) with a
new RoomKeyEnsureResponse { roomId, version } that confirms the key exists in
Valkey without exposing key bytes. Behaviour is otherwise unchanged: existing
keys are returned as-is, missing keys are generated and stored.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
Both TestRoomKeyEnsureRequestJSON and TestRoomKeyEnsureResponseJSON were
hand-rolling the marshal/unmarshal/DeepEqual cycle. The generic roundTrip
helper at the bottom of model_test.go is the established convention used
20+ times in this file for the same purpose.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
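
For readers unfamiliar with the pattern, a generic round-trip helper might look roughly like this. The signature is a guess, since the helper in `model_test.go` is not shown here, and the `RoomKeyEnsureRequest` definition is a simplified stand-in.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// roundTrip marshals a value to JSON and unmarshals it into a fresh
// value of the same type, so callers can compare input and output.
func roundTrip[T any](in T) (T, error) {
	var out T
	data, err := json.Marshal(in)
	if err != nil {
		return out, fmt.Errorf("marshal: %w", err)
	}
	if err := json.Unmarshal(data, &out); err != nil {
		return out, fmt.Errorf("unmarshal: %w", err)
	}
	return out, nil
}

// RoomKeyEnsureRequest is a simplified stand-in for the model type.
type RoomKeyEnsureRequest struct {
	RoomID string `json:"roomId"`
}

func main() {
	in := RoomKeyEnsureRequest{RoomID: "room-1"}
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(reflect.DeepEqual(in, out)) // true
}
```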
loadgen was left using the removed NewValkeyStore/Config single-node API;
update to NewValkeyClusterStore/ClusterConfig to match all other services.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
@ngangwar962 ngangwar962 force-pushed the claude/general-session-8c4ER branch from 470b3a1 to c1abab9 Compare May 19, 2026 11:01
claude added 3 commits May 19, 2026 11:09
- history-service/cmd/main.go: fix goimports formatting (blank line)
- tools/loadgen/main.go: fix goimports formatting (struct field alignment)
- pkg/valkeyutil/valkey.go: restore WHY comment on ping-failure close path
  (closes half-constructed ClusterClient to prevent go-redis pool leaks)

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
- pkg/roomkeystore/adapter.go: replace strings.Contains error match with
  isLuaNoCurrentKeyErr helper using exact match; add WHY comment explaining
  go-redis surfaces Lua error_reply as an untyped string error; drop unused
  "strings" import
- docker-local/compose.deps.yaml: make CLUSTER ADDSLOTSRANGE idempotent —
  guard with cluster_slots_assigned:16384 check so container restarts don't
  exit due to already-assigned slots
- docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md: add
  "text" language identifier to three unlabeled fenced code blocks (MD040);
  add bson tag to RoomKeyEnsureRequest spec example to match coding guidelines

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp
The startup guard combined `len(cfg.ValkeyAddrs) == 0 || cfg.ValkeyKeyGracePeriod <= 0`
under a single ambiguous error message. Splitting it into two checks with specific
messages matches room-worker's pattern and makes misconfigured startup
failures immediately actionable.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md`:
- Around line 309-313: Update the documented RPC response and examples to use
the new RoomKeyEnsureResponse contract instead of model.RoomKeyEvent (i.e.,
remove raw key bytes from replies); replace occurrences of `model.RoomKeyEvent{
RoomID, Version, PublicKey, PrivateKey, Timestamp }` with
`RoomKeyEnsureResponse` and ensure the sequence example that shows the RPC reply
(previously emitting PrivateKey/PublicKey) now only includes confirmation fields
present on RoomKeyEnsureResponse; also adjust any text referencing
`keyStore.Set`/idempotency to state that keys are stored but not returned in the
RPC response.

In `@tools/loadgen/main.go`:
- Around line 43-44: The VALKEY_ADDRS env tag is marked required on the config
struct (fields ValkeyAddrs / ValkeyPassword) which forces config parsing to fail
before subcommand dispatch; remove the `required` constraint from the
ValkeyAddrs (and related ValkeyPassword) struct tags so parsing succeeds for
commands that don't need the keystore, and add/ensure a runtime check inside
connectKeyStore (the connectKeyStore function used by the seed and teardown
subcommands) to validate that ValkeyAddrs is present and return a clear error if
missing; also apply the same tag removal to the duplicate fields referenced
around the later block (the other Valkey* fields noted in the comment) so only
connectKeyStore enforces presence.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed7e147e-6175-4449-a207-e9b8fff447d3

📥 Commits

Reviewing files that changed from the base of the PR and between 08be38e and e6777e9.

📒 Files selected for processing (29)
  • broadcast-worker/deploy/docker-compose.yml
  • broadcast-worker/main.go
  • docker-local/compose.deps.yaml
  • docs/superpowers/plans/2026-05-19-valkey-cluster-support.md
  • docs/superpowers/specs/2026-05-19-valkey-cluster-support-design.md
  • pkg/model/event.go
  • pkg/model/model_test.go
  • pkg/roomkeystore/adapter.go
  • pkg/roomkeystore/integration_test.go
  • pkg/roomkeystore/roomkeystore.go
  • pkg/roomsubcache/integration_test.go
  • pkg/subject/subject.go
  • pkg/testutil/valkey.go
  • pkg/valkeyutil/integration_test.go
  • pkg/valkeyutil/valkey.go
  • pkg/valkeyutil/valkey_test.go
  • room-service/deploy/docker-compose.yml
  • room-service/handler.go
  • room-service/handler_test.go
  • room-service/integration_test.go
  • room-service/main.go
  • room-worker/deploy/docker-compose.yml
  • room-worker/integration_test.go
  • room-worker/main.go
  • room-worker/mock_publisher_test.go
  • search-service/deploy/docker-compose.yml
  • search-service/integration_test.go
  • search-service/main.go
  • tools/loadgen/main.go
✅ Files skipped from review due to trivial changes (2)
  • room-worker/mock_publisher_test.go
  • search-service/deploy/docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (22)
  • room-service/deploy/docker-compose.yml
  • pkg/subject/subject.go
  • pkg/valkeyutil/integration_test.go
  • pkg/roomsubcache/integration_test.go
  • pkg/valkeyutil/valkey_test.go
  • pkg/model/event.go
  • pkg/roomkeystore/roomkeystore.go
  • broadcast-worker/deploy/docker-compose.yml
  • pkg/testutil/valkey.go
  • pkg/roomkeystore/adapter.go
  • docker-local/compose.deps.yaml
  • room-worker/deploy/docker-compose.yml
  • room-worker/integration_test.go
  • room-worker/main.go
  • search-service/integration_test.go
  • search-service/main.go
  • broadcast-worker/main.go
  • room-service/handler_test.go
  • pkg/valkeyutil/valkey.go
  • room-service/handler.go
  • docs/superpowers/plans/2026-05-19-valkey-cluster-support.md
  • room-service/integration_test.go

Comment on lines +309 to +313
- **Reply payload (success):** `model.RoomKeyEvent{ RoomID, Version, PublicKey, PrivateKey, Timestamp }`
- **Reply payload (error):** `model.ErrorResponse` via `natsutil.ReplyError`

**Idempotent by design:** if a key already exists in Valkey for the room, it is returned immediately without generating a new one. If no key exists (backfill case), a new key pair is generated, stored in Valkey via `keyStore.Set`, and then returned.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Room key ensure RPC response contract is outdated in this spec.

Line 309 and Line 348 still specify a model.RoomKeyEvent reply (including key material), but this PR’s current contract is RoomKeyEnsureResponse with confirmation fields only. Please update this section and the sequence example (Line 324) to avoid documenting raw key bytes in the response.

Also applies to: 333-334, 348-349


Comment thread tools/loadgen/main.go
Comment on lines +43 to +44
ValkeyAddrs []string `env:"VALKEY_ADDRS,required" envSeparator:","`
ValkeyPassword string `env:"VALKEY_PASSWORD" envDefault:""`

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't require VALKEY_ADDRS for loadgen run.

Because config parsing happens before subcommand dispatch, this makes loadgen run fail at startup when VALKEY_ADDRS is unset, even though only seed and teardown call connectKeyStore.

Suggested fix
-	ValkeyAddrs    []string `env:"VALKEY_ADDRS,required" envSeparator:","`
+	ValkeyAddrs    []string `env:"VALKEY_ADDRS" envSeparator:","`
 	ValkeyPassword string   `env:"VALKEY_PASSWORD"       envDefault:""`
@@
 func connectKeyStore(cfg *config) (roomkeystore.RoomKeyStore, error) {
+	if len(cfg.ValkeyAddrs) == 0 {
+		return nil, fmt.Errorf("VALKEY_ADDRS is required for seed and teardown")
+	}
 	return roomkeystore.NewValkeyClusterStore(roomkeystore.ClusterConfig{
 		Addrs:       cfg.ValkeyAddrs,
 		Password:    cfg.ValkeyPassword,
 		GracePeriod: time.Hour,
 	})
 }

Also applies to: 182-186


Collaborator

@mliu33 mliu33 left a comment

Excellent, thanks!

@ngangwar962 ngangwar962 merged commit aa3f342 into main May 21, 2026
13 checks passed
vjauhari-work pushed a commit that referenced this pull request May 21, 2026
PR #199 (Valkey cluster support) renamed the loadgen config field from
ValkeyAddr (single string) to ValkeyAddrs ([]string) and updated main.go
but missed three usages in main_test.go, breaking both the `test` and
`lint` CI jobs on every PR rebased onto current main. Mirror the rename
in the test cfg literals so the loadgen test binary builds again.
Joey0538 pushed a commit that referenced this pull request May 21, 2026
Main introduced Valkey cluster-mode (PR #199 - 'feat: Valkey cluster
support with room key ensure RPC'). Reconciliation:

- Adopt main's per-test testutil.StartValkeyCluster instead of my
  shared Valkey helpers. Valkey is now per-test (each test gets its
  own cluster-mode container), not process-shared.
- Drop my testutil.Valkey / EnsureValkey / TerminateValkey /
  FlushValkey — replaced by StartValkeyCluster.
- Remove TerminateValkey from TerminateAll (nothing process-shared
  to clean up).
- Update search-service per-endpoint files (integration_ccs_test.go,
  integration_rooms_test.go) to use
  valkeyutil.WrapClusterClient(testutil.StartValkeyCluster(t)).
- Drop search-service's local valkeyClient/flushValkey helpers.
- Keep main's roomkeystore/roomsubcache integration tests but switch
  roomsubcache back to internal package (consistency).
- Add main_test.go for pkg/valkeyutil (newly has integration tests on
  main).
- Re-migrate room-service setupNATS to testutil.NATS (main reverted
  this when it touched the file).
- Update tools/loadgen unit tests for the new ValkeyAddrs []string
  config field.
- Update CLAUDE.md: Valkey is now per-test, not shared.
Joey0538 pushed a commit that referenced this pull request May 21, 2026
main_test.go was missed when PR #199 renamed the config field from
ValkeyAddr string to ValkeyAddrs []string, causing a typecheck lint failure
on main.

https://claude.ai/code/session_01RVazYxcu73oBNFePtSiTMp

Co-authored-by: Claude <noreply@anthropic.com>
GITMateuszCharczuk pushed a commit that referenced this pull request May 21, 2026
End-to-end runnability reviewer found a ship-blocking regression:
`tools/loadgen/deploy/docker-compose.yml:22` still set `VALKEY_ADDR` while
loadgen now reads `VALKEY_ADDRS` (PR #199 cluster rename). First
`make seed` on a fresh clone would exit 2 with "VALKEY_ADDRS is not set".
Fixed the env var name + added a comment explaining the rename.

Other reviewer-flagged items in the same pass:

- CHANGES.md was internally inconsistent — the Valkey section's narrative
  still said "VALKEY_ADDR" in two places after PR #199 renamed it.
  Rewritten to say VALKEY_ADDRS with a parenthetical about the rename.

- `scenarios.go` advertised `large-room-broadcast` and `message-mutate`
  without SKELETON tags despite both having `SKELETON` markers in code.
  Added "(skeleton — ...)" to the one-line descriptions so
  `loadgen scenarios` honestly flags them.

- USAGE.md now has explicit "Status: SKELETON" disclosures for those two
  scenarios with a one-line note about what's stubbed, mirroring the
  existing disclosures for auth-load/first-dm/notification-fanout.

- Stale "four-stage" wording in `scenario_firstdm.go:82,622` and
  `scenario_firstdm_test.go:106,394,490` (residue from the persist-stage
  removal in f599cba). Now consistently says "three stages" / `len == 3`.

- Bare `return err` in the `subscribersAdapter.SubscribeData` shim at
  `scenario_firstdm.go:619` violated CLAUDE.md §3 (no bare error returns).
  Wrapped with `fmt.Errorf("subscribe %s: %w", subj, err)`.

Verified all gates remain green: make lint 0 issues, go test -race -count=1
green (22.2s), go vet -tags integration clean, gosec 0 issues, compose YAML
parses.
general-lex pushed a commit that referenced this pull request May 26, 2026
… rename

PR #199 (Valkey cluster support, merged to main) renamed two
roomkeystore APIs that the suite-v2 runner depends on:

  roomkeystore.NewValkeyStore(Config{Addr, ...})
    → roomkeystore.NewValkeyClusterStore(ClusterConfig{Addrs, ...})

And RoomKeyStore.Set's pair parameter is now by value, not pointer
(documented at pkg/roomkeystore/roomkeystore.go:84).

These breakages were latent on the branch from the moment we rebased
onto main; they only surfaced now because Task 1 of the Part-2
mishap plan tries to commit, triggering the pre-commit make lint
hook for the first time.

Fix is mechanical, scope-minimal:
- runner.go: NewValkeyStore → NewValkeyClusterStore; Config → ClusterConfig;
  Addr (string) → Addrs ([]string{cfg.ValkeyAddr}). Single-seed-node
  pattern is fine for docker-local; multi-node deployments will need
  a comma-separated env var, which is out of scope here.
- seed/loader.go: *pair deref at the Set call site.

make lint passes (0 issues). No behavior change.

https://claude.ai/code/session_0139upFqMPspygX8XqTjpRN1
3 participants