feat: room encryption keys end-to-end by vjauhari-work · Pull Request #171 · hmchangw/chat

vjauhari-work · 2026-05-11T09:47:19Z

Summary

Wires the existing pkg/roomkeystore (Valkey-backed key storage) and pkg/roomkeysender (NATS key delivery) libraries into the room lifecycle. Every room now gets a P-256 keypair generated at create time, replicated to every participating site at origin's exact version, and pushed to every member's NATS subject so clients can decrypt messages encrypted by broadcast-worker. Channel member removal rotates the key so the removed user can no longer decrypt messages sent after their removal.

Spec: docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md
Plan: docs/superpowers/plans/2026-05-08-room-encryption-keys.md

Architecture

Three-stage handoff per the established codebase pattern:

Stage	Service	Responsibility
1	`room-service`	Generate keypair on create (`Set`). Rotate on channel member-remove (`Rotate`, with `Set` fallback for un-keyed rooms). Skip rotation when no subscriptions are actually deleted (dual-membership individual remove; org-remove with all org members also individually subscribed).
2	`room-worker` (origin)	Consume canonical events; gate Mongo writes on key presence; fan out `RoomKeyEvent` to every room member — local and remote — via `roomkeysender.Send`. NATS supercluster routes `chat.user.{account}.event.room.key` to home sites. Serve `chat.server.request.roomkey.{siteID}.get` RPC for remote inbox-workers.
3	`inbox-worker` (remote)	Pull keypair from origin via NATS RPC; write into local Valkey at origin's exact version via `SetWithVersion`. No `Rotate`, no user-side `Send` — origin already published, supercluster handles routing.

Key material never enters JetStream. Cross-site replication is request/reply over core NATS.

Security properties

Rotate-before-publish ordering: from the moment of rotation, broadcast-worker encrypts under the new key. The about-to-be-removed user holds only the old private key and cannot decrypt anything published after rotation. Test invariant enforced via flag-in-DoAndReturn (rotate must complete before publish callback fires).
Cross-site version sync: each remote site mirrors origin's exact version via SetWithVersion. Without this, remote sites would track per-site versions that diverge from origin, and broadcast-worker envelopes on the remote would carry a version no client (which received origin's RoomKeyEvent) holds. Redelivered events no-op when local version is at or beyond origin's.
No rotation churn on no-op removes: dual-membership individual removes and org-removes where every targeted account is also individually subscribed both skip rotation entirely. Flag renamed to skipKeyRotation to make intent explicit.
Single-rotator invariant: only room-service originates rotations, so a scalar int version is sufficient for ordering. Documented at the version gate.
Single-master Valkey per site is required (the Rotate Lua script does not work across Redis Cluster slots). Documented in pkg/roomkeystore/doc.go.
Sentinel preservation across RPC boundary: inbox-worker's cross-site client maps the wire-format "room key not found" reply back to an errRoomKeyAbsent sentinel so downstream callers can errors.Is against the permanent-miss case (as distinguished from a transient RPC error).

Observability

Seven new OTel meter instruments in pkg/roomkeymetrics:

Metric	Type	Where
`room_key_generated_total`	counter	room-service create + Set-fallback path
`room_key_rotated_total`	counter	room-service true rotations only
`room_key_fanout_errors_total`	counter	every `Send` failure (worker + inbox)
`room_key_rpc_duration_seconds`	histogram	wraps `natsInterSiteKeyClient.GetRoomKey`
`room_key_valkey_errors_total{op}`	counter	tagged by `Get` / `Set` / `Rotate` / `GetMany` / `SetWithVersion` at call sites
`room_key_absent_total`	counter	distinguishes "Valkey healthy, key gone" from transient Valkey errors
`room_key_replication_terminated_total`	counter	inbox-worker bounded-retry threshold hits

Configuration

VALKEY_ADDR is a hard startup requirement on every worker that touches keys. The earlier "optional with slog.Warn" mode produced a silent split-brain (rooms created in Mongo with no key flowing to clients) and — once cross-site versions had to match origin's exactly — made cross-site messages undecryptable. Failing loudly at startup is honest about what the system requires.

Env var	Service	Default	Notes
`VALKEY_ADDR`	`room-service`, `room-worker`, `inbox-worker`	—	Required. Worker refuses to start without it.
`VALKEY_PASSWORD`	all three	`""`
`VALKEY_KEY_GRACE_PERIOD`	all three	`24h`	Validated `> 0` at startup. TTL on the `:prev` slot after rotation.
`ROOM_KEY_RPC_TIMEOUT`	`inbox-worker`	`5s`	Validated `> 0` at startup. Per-call timeout on the cross-site key-fetch RPC.
`ROOM_KEY_MAX_REDELIVER`	`inbox-worker`	`10`	Validated `> 0` at startup. JetStream redelivery cap before terminating (Ack + log) to free the consumer when origin is unreachable.

Failure modes

Stale version at worker gate (pair.Version < req.NewKeyVersion) → transient error → NAK + retry. Valkey propagation race, not a malformed event.
Key absent on origin (room-worker's errRoomKeyAbsent) → permanent error → AsyncJobResult error; clients retry the create. Preserved across the inter-site RPC for inbox-worker callers.
Cross-site RPC origin error → NAK; bounded by ROOM_KEY_MAX_REDELIVER.
ListByRoom survivor lookup failure (room-worker remove path) → NAK + retry. Previously logged-and-acked, which could leave a room on a key no survivor received.
buildAndFanOutRoomKey failure (room-worker create path) → NAK + retry. Previously logged-and-acked, which could persist a room without delivering the initial key event.
Sender.Send per-account failure → logged, fan-out continues (best-effort to remaining members).
Rotate returns ErrNoCurrentKey → fall back to Set (un-keyed room being remove-keyed for the first time).
member_removed with empty Accounts on a remote site → no Mongo delete, but still fetches the rotated key from origin (otherwise local broadcast-worker would keep encrypting under an older version).
Subscription redelivery race → idempotent. mongoInboxStore.BulkCreateSubscriptions uses bulkWrite with UpdateOne + $setOnInsert keyed on (roomId, u.account), so existing read-state (LastSeenAt, Alert) is preserved and replicateLocalKey is always re-attempted on retry.

Test plan

Unit tests green with -race (make test)
Lint clean (make lint, 0 issues)
Compiles with -tags=integration (go build -tags integration ./...)
Integration tests pass against dockerd --storage-driver=vfs (local sandbox; CI runs overlay2):
- inbox-worker — including cross-site key replication, idempotent upsert, FilterScoping
- room-worker — create-room key fan-out, remove-member survivor fan-out
- room-service — Valkey persistence on create
- pkg/roomkeystore — Lua atomicity + grace period + new SetWithVersion round-trip
- pkg/roomkeysender — TypeScript client tests correctly skip on vfs via skipOnVFS
Client API documented in docs/client-api.md §5 (Server-Pushed Events)
Deploy compose files (room-worker/deploy/, inbox-worker/deploy/) carry the new env vars; VALKEY_ADDR documented as required

Changes since initial review

Three rounds of CodeRabbit review feedback addressed across the PR's lifetime:

Round 1 (22 findings) — initial review. Key bugs fixed: replicateLocalKey / fetchAndStoreKey errors are now propagated (not log-and-Ack); room-worker survivor-fanout ListByRoom failure now NAKs; create-room buildAndFanOutRoomKey failure now NAKs; room-worker/store_mongo.ListByRoom wraps errors with operation context; room-service.handleRemoveMember's overloaded targetIsDualMembership flag renamed to skipKeyRotation; new compound index (rid, member.type, member.account) added to room_members to back CountOrgOnlySubs. Test bugs: rotate-before-publish ordering enforced; inbox-worker cross-site test asserts preserved origin version. Plus docs + nitpicks (spec ownership table, fenced-code language tags, full-slice assertions, etc.).

Architectural change — VALKEY_ADDR promoted from "optional with slog.Warn" to a hard startup requirement on both workers. Eliminated dead nil-guard code paths in inbox-worker and room-worker (production deps are guaranteed non-nil). Spec, plan, and deploy/docker-compose.yml comments updated to match.

Round 2 (6 findings) — idempotent subscription upserts: mongoInboxStore.BulkCreateSubscriptions switched from InsertMany-with-dup-filter to bulkWrite with UpdateOne + $setOnInsert, preserving existing read-state on redelivery. Plus assertion polish (deep-copy non-mutation snapshot, full-slice key comparisons, require.NoError on test-setup marshals) and stale plan paragraphs flagged.

Round 3 (3 findings + repo hygiene) — inbox-worker.handleMemberRemoved no longer short-circuits on empty Accounts (still fetches the rotated key from origin); shutdown hook order corrected so OTel exporters flush AFTER nc.Drain / mongo / keyStore.Close; ROOM_KEY_RPC_TIMEOUT validated at startup; integration test's embedded NATS server replaced with the testcontainers-go NATS module to match repo convention. Plan doc snippets rewritten to reflect the shipped fail-fast wiring. Three pre-existing repo mock-file inconsistencies (broadcast-worker, message-worker, search-service) regenerated so make generate is idempotent on the merged tree.

New roomkeystore.SetWithVersion API — added because Set always writes version=0 and Rotate increments by 1, so neither preserves origin's exact version. The new method writes the current slot at a caller-supplied version, used exclusively by inbox-worker for cross-site replication. Other consumers (room-service, broadcast-worker, history-service) declare their own narrower interfaces and are unaffected.

Known follow-ups (intentionally deferred)

Plan doc — Part 2 / Part 3 task bodies still describe earlier iterations of the design. A top-of-file drift notice points readers to the spec as authoritative. Full rewrite of the per-task snippets is deferred; the spec is current.
Survivor lookup performance — ListByRoom after a delete scans all subs in the room for fan-out. Fine at current room sizes; pre-computing survivor lists from the delete result is an optimization for very large channels.

Generated by Claude Code

Summary by CodeRabbit

New Features
- Per-room encryption keys: automatic generation, rotation on member removal, and server-pushed key events to clients; keys distributed to new members for realtime sync across sites.
Documentation
- Added client API section and comprehensive design/plan docs for room-encryption key lifecycle and federation behavior.
Bug Fixes
- Idempotent subscription upserts to avoid duplicate subscriptions on redeliveries.
Tests
- Broad unit and integration coverage for key generation, replication, RPCs, rotation, and end-to-end scenarios.

Summary by CodeRabbit

New Features
- End-to-end per-room encryption keys: generated at room creation, rotated on member removals, and pushed to members.
- Cross-site key replication so remote sites receive origin keys and users get server-pushed key events to keep clients synchronized.
- Idempotent subscription upserts to avoid duplicate subscriptions on redelivery.
Documentation
- New API docs for server-pushed key events, key payloads, and inter-site key exchange.

coderabbitai · 2026-05-11T09:47:36Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR implements end-to-end room encryption key management across room-service, room-worker, and inbox-worker: P-256 keypair generation, Valkey-backed storage and versioning (Set/Rotate/SetWithVersion), NATS fan-out and inter-site RPC (ServerRoomKeyGet), OpenTelemetry metrics, gating rules, and tests/docs/deploy updates.

Changes

Room Encryption Keys Implementation

Layer / File(s)	Summary
Plans & Specs / Client Docs `docs/superpowers/plans/2026-05-08-room-encryption-keys.md`, `docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md`, `docs/client-api.md`	Add implementation plan and design spec; add client API §5 “Server-Pushed Events” (RoomKeyEvent); renumber Error envelope to §6 and update anchors.
Models & Subjects `pkg/model/event.go`, `pkg/model/member.go`, `pkg/model/model_test.go`, `pkg/subject/subject.go`, `pkg/subject/subject_test.go`	Add `NewKeyVersion` to member-remove types, add `RoomType` to `RemoveMemberRequest`, add `RoomKeyGetRequest` RPC type, and add `ServerRoomKeyGet(siteID)` subject builder with tests.
Observability & Docs `pkg/roomkeymetrics/metrics.go`, `pkg/roomkeymetrics/metrics_test.go`, `pkg/roomkeystore/doc.go`	Introduce OpenTelemetry metrics for room-key fan-out, RPC latency, key lifecycle and error counters; add package documentation for roomkeystore semantics.
RoomKeySender API `pkg/roomkeysender/roomkeysender.go`, tests, integration	Change `Sender.Send` to accept `RoomKeyEvent` by value and stamp Timestamp internally; update tests and integration usage (skip-on-VFS).
RoomKeystore Versioning `pkg/roomkeystore/*`	Add `SetWithVersion` API, adapter `hsetWithVersion`, unit and integration tests verifying explicit-version writes and behavior.
Room-Service Store & Mocks `room-service/store.go`, `room-service/store_mongo.go`, `room-service/mock_store_test.go`	Add `CountOrgOnlySubs` to RoomStore; extend RoomKeyStore with `Set` and `Rotate`; implement Mongo aggregation; update generated mocks.
Room-Service Keygen & Handlers `room-service/keygen.go`, `room-service/handler.go`, tests, integration	Add P-256 key generation utility; persist key in keyStore before publishing on create; perform rotate-first or Set fallback on remove-member and stamp `NewKeyVersion`; add tests and integration coverage.
Room-Worker Store & Mocks `room-worker/store.go`, `room-worker/store_mongo.go`, `room-worker/mock_store_test.go`	Add `siteID` parameter to `ListByRoom`, introduce read-only `RoomKeyStore` interface with `Get`, and update mocks and Mongo query handling.
Room-Worker Handler: Gates, Fan-out & RPC `room-worker/handler.go`, `room-worker/main.go`	Gate create/add/remove on key presence and channel-only constraints; add `buildAndFanOutRoomKey`, `fanOutRoomKeyToSurvivors`, `handleGetRoomKey` and `NatsHandleGetRoomKey`; propagate `NewKeyVersion` into outbox payloads; register ServerRoomKeyGet subject; initialize metrics and meter shutdown.
Room-Worker Tests & Integration `room-worker/handler_test.go`, `room-worker/integration_test.go`	Add unit tests for gating, fan-out, version checks, RPC handling, and metrics; add crypto-focused integration that uses Valkey and embedded NATS to assert RoomKeyEvent publications.
Inbox-Worker Interfaces & Handlers `inbox-worker/handler.go`, `inbox-worker/intersite_key.go`, `inbox-worker/main.go`	Add `RoomKeyStore` and `InterSiteKeyClient` interfaces; replicate/fetch/store keys on `room_created`, `member_added`, `member_removed`; implement `natsInterSiteKeyClient.GetRoomKey` with timeout/metrics/error translation; require VALKEY_ADDR and configure redelivery cap.
Inbox-Worker Tests & Integration `inbox-worker/handler_test.go`, `inbox-worker/intersite_key_test.go`, `inbox-worker/integration_test.go`, `inbox-worker/consumer_config_test.go`	Add unit and integration tests for replication/rotation behavior, inter-site RPC, request-ID forwarding, idempotent BulkCreateSubscriptions upsert semantics, redelivery termination, and metric initialization.
Mocks, Test Helpers & Deployment `room-worker/mock_publisher_test.go`, `mock_`, `deploy/docker-compose.yml` files	Add mockPublisher, stub RoomKeyStore, updated gomock outputs, skip-on-VFS helper, and docker-compose env vars for Valkey and room-key settings.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hmchangw/chat#40: The main PR’s room-key “fan-out” implementation reuses and directly modifies the same pkg/roomkeysender library introduced in PR #40.
hmchangw/chat#118: Overlapping changes around remove-member and RoomType gating.
hmchangw/chat#145: Related inbox-worker event handling and cross-site plumbing.

Suggested reviewers

mliu33
hmchangw

"I hopped through the code to stitch the keys,
Small bytes tucked safe beneath Valkey trees,
From create to rotate across NATS' song,
Members keep hush — the keys move along."

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/room-encryption-keys-5vlQ2

coderabbitai

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

room-worker/store_mongo.go (1)

38-49: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Wrap the new ListByRoom driver errors with operation context.

Both failure paths return bare Mongo errors, which makes room/site-specific failures much harder to diagnose and breaks the repo’s error-wrapping rule.

🧩 Suggested fix

 func (s *MongoStore) ListByRoom(ctx context.Context, roomID, siteID string) ([]model.Subscription, error) {
 	filter := bson.M{"roomId": roomID}
 	if siteID != "" {
 		filter["siteId"] = siteID
 	}
 	cursor, err := s.subscriptions.Find(ctx, filter)
 	if err != nil {
-		return nil, err
+		return nil, fmt.Errorf("list subscriptions for room %q site %q: find: %w", roomID, siteID, err)
 	}
 	var subs []model.Subscription
 	if err := cursor.All(ctx, &subs); err != nil {
-		return nil, err
+		return nil, fmt.Errorf("list subscriptions for room %q site %q: decode: %w", roomID, siteID, err)
 	}
 	return subs, nil
 }

As per coding guidelines, "Always wrap errors with context using `fmt.Errorf(\"short description: %w\", err)` describing what the current function was doing" and "Never return bare `err`".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@room-worker/store_mongo.go` around lines 38 - 49, The ListByRoom function
returns raw Mongo errors; wrap both failure returns with contextual messages
using fmt.Errorf so callers get operation and identifier info. Specifically
update the error returned from s.subscriptions.Find(...) to something like
fmt.Errorf("ListByRoom: find failed for room %s site %s: %w", roomID, siteID,
err) and likewise wrap the cursor.All(...) error with fmt.Errorf("ListByRoom:
decode failed for room %s site %s: %w", roomID, siteID, err) so failures mention
ListByRoom, the roomID/siteID and the underlying err instead of returning bare
err.

room-service/handler.go (1)

489-551: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale comment and overloaded targetIsDualMembership semantics in org-remove path.

Two related clarity issues around the new rotation-skip logic:

Line 520 comment is now stale. It states "Org removes rotate unconditionally; dual-membership users are filtered in room-worker after the rotation lands." but lines 537–547 explicitly skip rotation when CountOrgOnlySubs == 0. The org-remove path is no longer unconditional.
targetIsDualMembership is reused with different semantics. On line 518 it captures a per-user state (target has both individual and org membership). On line 545 it is reused as a "no subscriptions will be deleted → skip rotation" sentinel for an org-remove. For an org-remove, no single "target" user exists, and the flag is set based on the collective condition that every org member is also individually subscribed — that is not the same property. The current name makes the gate at line 551 (!targetIsDualMembership) ambiguous when read in the org-remove context.

Rename to something intent-revealing like skipKeyRotation (or fold the org-only-subs check into the existing else branch so the flag is assigned in one place per remove-mode), and update/remove the stale comment.

♻️ Suggested restructure

-	var targetIsDualMembership bool
+	// skipKeyRotation = true means no subscriptions will actually be deleted
+	// (individual: target keeps org membership; org: every org member is also
+	// individually subscribed), so we don't need to rotate the room key.
+	var skipKeyRotation bool
 	if req.Account != "" {
 		// ... individual-remove validation ...
-		targetIsDualMembership = target.HasIndividualMembership && target.HasOrgMembership
+		skipKeyRotation = target.HasIndividualMembership && target.HasOrgMembership
 	} else {
-		// Org removes rotate unconditionally; dual-membership users are filtered in room-worker after the rotation lands.
-		// Owner-removes-org: only the requester's owner role matters here; org members resolved downstream.
+		// Owner-removes-org: only the requester's owner role matters here; org members resolved downstream.
+		// Rotation is skipped below when every org member is also individually subscribed —
+		// the org-remove then deletes only room_members org entries, not subscriptions.
 		sub, err := h.store.GetSubscription(ctx, requesterAccount, roomID)
 		if err != nil {
 			return nil, fmt.Errorf("get requester subscription: %w", err)
 		}
 		if !hasRole(sub.Roles, model.RoleOwner) {
 			return nil, fmt.Errorf("only owners can remove members")
 		}
+		if h.keyStore != nil {
+			count, err := h.store.CountOrgOnlySubs(ctx, req.RoomID, req.OrgID)
+			if err != nil {
+				return nil, fmt.Errorf("count org-only subs: %w", err)
+			}
+			skipKeyRotation = count == 0
+		}
 	}
 
 	// Stable seed for room-worker's deterministic system-message IDs across JetStream redeliveries.
 	req.Timestamp = time.Now().UTC().UnixMilli()
 
-	// For org-removes, skip rotation when no subscriptions would actually be deleted.
-	// This happens when every org member is also individually subscribed (dual-membership),
-	// so the org-remove only removes room_members org entries — no subscription changes.
-	if h.keyStore != nil && req.OrgID != "" {
-		count, err := h.store.CountOrgOnlySubs(ctx, req.RoomID, req.OrgID)
-		if err != nil {
-			return nil, fmt.Errorf("count org-only subs: %w", err)
-		}
-		if count == 0 {
-			// No subscriptions will be deleted; skip rotation (member list changes, key does not).
-			// Fall through to publish the canonical event with NewKeyVersion=0.
-			targetIsDualMembership = true
-		}
-	}
-
 	// Rotate before publish so broadcast-worker encrypts under the new key immediately.
-	// Skip rotation when target is dual-membership: no actual removal happens in that case.
-	if h.keyStore != nil && !targetIsDualMembership {
+	// Skip rotation when no actual removal happens (see skipKeyRotation assignment above).
+	if h.keyStore != nil && !skipKeyRotation {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@room-service/handler.go` around lines 489 - 551, The flag
targetIsDualMembership is overloaded and the comment about "Org removes rotate
unconditionally" is stale; rename and clarify the logic so the rotation gate is
unambiguous: introduce a boolean skipKeyRotation (or set skipKeyRotation instead
of targetIsDualMembership) and assign it only to mean "skip key rotation for
this request", set it in the individual-remove branch to
target.HasIndividualMembership && target.HasOrgMembership (using
GetSubscriptionWithMembership) and set it in the org-remove branch based on
CountOrgOnlySubs == 0 (using CountOrgOnlySubs with req.RoomID/req.OrgID), then
update/remove the stale comment and change the rotation check from if h.keyStore
!= nil && !targetIsDualMembership to if h.keyStore != nil && !skipKeyRotation so
intent is clear.

🧹 Nitpick comments (4)

pkg/roomkeysender/roomkeysender_test.go (1)
83-97: ⚡ Quick win

Assert the new non-mutation contract here.

The signature change is meant to keep Send from mutating the caller’s event, but this test only checks the published payload. Please also assert that tt.evt is unchanged after Send so the by-value behavior doesn’t regress.
🧪 Suggested assertion
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
+			wantTimestamp := tt.evt.Timestamp
 			pub := &mockPublisher{err: tt.publishErr}
 			sender := roomkeysender.NewSender(pub)
 
 			err := sender.Send(tt.account, tt.evt)
+			assert.Equal(t, wantTimestamp, tt.evt.Timestamp)
 
 			if tt.wantErr != "" {
 				require.Error(t, err)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/roomkeysender/roomkeysender_test.go` around lines 83 - 97, The test must
also verify Send no longer mutates the caller's event: before calling
roomkeysender.NewSender(...).Send, make a deep copy of tt.evt (e.g., orig :=
deep copy of tt.evt) and after Send returns assert that tt.evt equals orig (use
require.Equal or require.True with reflect.DeepEqual) for both success and error
paths; update the subtest that constructs mockPublisher and calls sender.Send to
create the copy and add the post-call equality assertion so the non-mutation
contract for Send is enforced.
room-service/keygen_test.go (1)
29-36: ⚡ Quick win

This “round trip” test never validates the private key.

It only proves that roomcrypto.Encode accepts the generated public key. A broken generateRoomKeyPair that returned mismatched public/private keys would still pass here. Please add a decrypt/assertion step, or rename the test if encode-shape is all you want to verify.

As per coding guidelines, "Tests must cover: happy path, error paths, edge cases (empty collections, boundary conditions), and invalid input".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@room-service/keygen_test.go` around lines 29 - 36, The test
TestGenerateRoomKeyPair_RoundTripWithRoomcrypto only verifies the encoded blob
shape but never validates the private key; update the test to perform a full
round-trip by calling the corresponding roomcrypto.Decode (or the library's
decrypt function) with pair.PrivateKey and the encrypted result from
roomcrypto.Encode and assert the decrypted plaintext equals "hello" (or rename
the test to reflect that it only checks public-key encoding if you intentionally
don't want decryption). Ensure you reference generateRoomKeyPair to get
pair.PublicKey/pair.PrivateKey and the roomcrypto.Encode/Decode functions so a
mismatched keypair would fail the assertion.
docs/superpowers/plans/2026-05-08-room-encryption-keys.md (1)
746-752: ⚡ Quick win

Update this plan snippet to the repo’s otelnats publish pattern.

This adapter is documented as a plain *nats.Conn with Publish(subj, data), but this repo’s standard connection path uses the OpenTelemetry wrapper and a context-aware publish so request IDs/traces flow via NATS headers. As written, the plan nudges follow-up work toward code that won’t match the existing connection type or propagation behavior.

Based on learnings: pkg/natsutil/connect.go returns an OpenTelemetry-wrapped NATS connection (*otelnats.Conn), and Publish(ctx, ...) is the standard/correct usage across the codebase.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md` around lines 746 -
752, The snippet's natsPublisherAdapter uses a plain *nats.Conn and a
Publish(subj, data) signature, but this repo uses the OpenTelemetry-wrapped
connection from pkg/natsutil/connect.go (type *otelnats.Conn) and the
context-aware Publish(ctx, ...) to propagate request IDs/traces; update
natsPublisherAdapter to hold an *otelnats.Conn (or the concrete type returned by
connect.go) and expose a Publish method that accepts a context.Context and
delegates to the connection's Publish(ctx, subject, data) so headers/OTel
context flow correctly.
room-worker/handler_test.go (1)
3287-3290: ⚡ Quick win

Make the survivor fan-out assertion order-independent.

This test is checking that every survivor receives a key, not that publishes happen in a specific order. The indexed assertions will get brittle if the implementation later dedups through a map or publishes concurrently.
Suggested assertion shape
-	subjects := pub.subjects
-	assert.Contains(t, subjects[0], "chat.user.alice.event.room.key")
-	assert.Contains(t, subjects[1], "chat.user.bob.event.room.key")
-	assert.Contains(t, subjects[2], "chat.user.remote-carol.event.room.key")
+	assert.ElementsMatch(t, []string{
+		"chat.user.alice.event.room.key",
+		"chat.user.bob.event.room.key",
+		"chat.user.remote-carol.event.room.key",
+	}, pub.subjects)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@room-worker/handler_test.go` around lines 3287 - 3290, The test currently
asserts ordering by indexing into pub.subjects (subjects[0], subjects[1],
subjects[2]) which makes the spec brittle; update the assertion to be
order-independent by comparing the set of actual subjects to the expected set
(e.g. use assert.ElementsMatch or assert.Equal with a sorted slice) or iterate
expected values and assert each is contained in pub.subjects; refer to the
pub.subjects / subjects variable in handler_test.go and replace the indexed
asserts with an order-independent check for "chat.user.alice.event.room.key",
"chat.user.bob.event.room.key", and "chat.user.remote-carol.event.room.key".

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/client-api.md`:
- Around line 1752-1795: The overview's scope statement still marks "room-key
rotation" as out of scope but you added Section 5 (RoomKeyEvent) documenting
room-key events; update the overview bullet that mentions "room-key rotation" to
reflect that room key events (RoomKeyEvent, room key generation/rotation and
client behaviors) are now documented in Section 5, or remove the out-of-scope
note entirely so the overview and the new Section 5 are consistent with each
other.

In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md`:
- Around line 57-58: Several fenced code blocks currently lack language
identifiers (they contain just the block starting with ``` and content like
"Client"); update those fenced blocks (the examples labeled "Client" and the
other similar example blocks) to include an explicit language tag such as
```text (or ```go where appropriate) so the markdown linter (MD040) is satisfied
and rendering is predictable—search for the fenced blocks containing "Client"
and replace the opening ``` with ```text or the correct language identifier.
- Around line 358-361: The ownership table is inverted: update the Service/Role
entries so that `room-service` is credited with generating/rotating keys
(`Set`/`Rotate`), while `room-worker` is documented as the reader/fan‑out that
consumes those changes and emits `RoomKeyEvent` to every room member via
`roomkeysender.Send`; keep `inbox-worker` described as only replicating key
bytes into local Valkey (`Set`/`Rotate`) and not performing any user-event
fan‑out.
- Around line 296-299: Update the spec to stop claiming `inbox-worker` fan-outs
keys: change the descriptions of `handleRoomCreated`, `handleMemberAdded`, and
`handleMemberRemoved` so they call `replicateOrSendLocalKey` /
`rotateAndFanOutLocalKey` only to replicate into local Valkey and perform RPCs
to origin, and remove any mentions of local `Send`/fan-out from these handlers;
instead document that the origin `room-worker` is responsible for publishing
`RoomKeyEvent`s to user subjects and that `inbox-worker` only stores/replicates
key material locally (with Valkey `Get` failures returning errors and RPC
fallbacks/`Set`/`Rotate` semantics unchanged).

In `@inbox-worker/handler_test.go`:
- Around line 1394-1402: The test currently expects replicateLocalKey to
silently succeed when dependencies are nil, but missing key-replication deps
should fail fast; update the implementation of Handler.replicateLocalKey to
check that h.keyStore and h.interSiteClient are non-nil and return a descriptive
error immediately when either is nil, and then update
TestReplicateLocalKey_NoOpsWhenDepsNil (and the similar test at 1484-1489) to
assert an error is returned instead of require.NoError; reference NewHandler to
locate how the handler is constructed and replicateLocalKey for where to add the
nil checks and error return.
- Around line 1365-1390: The test currently asserts the local stored key version
becomes 1 after fetchAndStoreKey, which is wrong for replicated keys — the
origin's version must be preserved; update the assertions in
TestFetchAndStoreKey_RotatesWhenLocalKeyExists (and the similar test around
lines 1514-1539) to expect the fetched origin version (e.g., 5 or 3 as set on
stubInterSiteClient.getResp) instead of 1 by checking the returned pair.Version
from keyStore.Get after calling h.fetchAndStoreKey; leave the rest of the setup
and call to h.fetchAndStoreKey and keyStore unchanged.

In `@inbox-worker/handler.go`:
- Around line 147-153: The replicateLocalKey call currently swallows errors
(only logs) which causes the handler to Ack the event even when Valkey/RPC
replication failed; change the handler to return the error from
h.replicateLocalKey so the create/add event is retried instead of being
acknowledged. Locate the replicateLocalKey invocation in the handler (the block
that logs "replicate local key") and replace the log-only branch with
propagation of the error (wrap/contextualize if needed) so the caller sees a
non-nil error; apply the same change to the other replicateLocalKey occurrence
in this file to ensure both paths retry on failure.
- Around line 377-396: fetchAndStoreKey currently ignores the fetched key's
version and always calls keyStore.Rotate, causing duplicate deliveries to bump
the published version; fix by reading the fetched version from
interSiteClient.GetRoomKey and comparing it to the local stored version before
deciding to rotate: call keyStore.Get (or equivalent) to obtain the current
local version for roomID, if no current key (roomkeystore.ErrNoCurrentKey) call
keyStore.Set with the fetched RoomKeyPair, if the local version is >=
fetched.Version do nothing (avoid re-rotating), and only call keyStore.Rotate
when fetched.Version is strictly greater than the local version; update metric
labels/op names appropriately around the Set/Rotate paths (references:
fetchAndStoreKey, interSiteClient.GetRoomKey, roomkeystore.RoomKeyPair,
keyStore.Get/Set/Rotate, roomkeystore.ErrNoCurrentKey).

In `@inbox-worker/intersite_key_test.go`:
- Around line 37-45: The test races because NATS subscriptions register
asynchronously; after each nc.Subscribe call (the ones registering handlers for
subject.ServerRoomKeyGet and similar) add a call to nc.Flush() to ensure the
subscription is propagated before calling
newNatsInterSiteKeyClient(...).GetRoomKey(...) (and other immediate requests) so
the request won’t timeout; locate the nc.Subscribe invocations in this file (and
the other two blocks noted) and insert nc.Flush() right after each Subscribe
returns and before creating/using the client.

In `@inbox-worker/intersite_key.go`:
- Around line 45-47: The code in intersite_key.go currently converts parsed
origin errors into a generic string via fmt.Errorf, losing sentinel identities;
change the branch that handles natsutil.TryParseError(resp.Data) so it returns
the parsed error object directly (e.g., return nil, errResp) instead of wrapping
it in fmt.Errorf, preserving sentinel comparisons like errors.Is(...,
errRoomKeyAbsent) in downstream code.

In `@inbox-worker/main.go`:
- Around line 285-299: The check that drops messages uses
cfg.RoomKeyMaxRedeliver and will immediately terminate messages if that value is
zero or negative; validate RoomKeyMaxRedeliver at startup (when building/loading
cfg) and reject non-positive values (return an error/exit the process) so the
consumes never run with an invalid limit. Update the configuration
parsing/validation path that constructs cfg (where cfg.RoomKeyMaxRedeliver is
set) to ensure it's > 0, provide a clear startup error message if not, and keep
the existing runtime check in exceedsMaxRedeliver and the inbox handler
unchanged.

In `@pkg/roomkeysender/integration_test.go`:
- Around line 174-183: The skipOnVFS helper currently skips tests when
DOCKER_STORAGE_DRIVER is empty or "vfs", which hides tests when the env var is
unset; update skipOnVFS so it only calls t.Skip when DOCKER_STORAGE_DRIVER ==
"vfs". Locate skipOnVFS and the os.Getenv checks and remove the emptiness branch
(or replace with a runtime check that runs `docker info`/inspects the Docker
storage driver and only skips when that result equals "vfs"), and ensure the
t.Skip message remains clear and is only invoked when the driver is explicitly
vfs.

In `@room-service/handler_test.go`:
- Around line 883-903: The test currently doesn't enforce that keyStore.Rotate
is called before publishToStream; update the test around
keyStore.EXPECT().Rotate and the publish callback used in
Handler.handleRemoveMember to enforce ordering by either wrapping the
expectations in gomock.InOrder(keyStore.EXPECT()...) with a subsequent EXPECT
for Rotate then any publish call, or by using a boolean/flag set inside the
Rotate DoAndReturn (e.g., set rotated = true) and make the publish callback
assert that rotated is true before accepting the publish; reference the existing
keyStore.EXPECT().Rotate DoAndReturn, the publish variable passed as
Handler.publishToStream, and the h.handleRemoveMember invocation when making the
change.

In `@room-service/store_mongo.go`:
- Around line 763-776: The $lookup in store_mongo.go currently matches
room_members on (rid, member.account, member.type) using let acc "$u.account",
but EnsureIndexes only creates an index on (rid, member.type, member.id),
causing full scans; fix by either changing the lookup to join on the indexed
member.id (compare "$member.id" to "$u.account" in the "$match" inside the
"$lookup" pipeline used by the aggregation in question) or add a new compound
index in EnsureIndexes for room_members on (rid, member.type, member.account) so
the existing lookup uses an index; adjust whichever you choose consistently
(modify the "$lookup" match or update EnsureIndexes) and run tests/perf check.

In `@room-worker/handler.go`:
- Around line 352-361: The removal currently proceeds to Ack even when
h.store.ListByRoom(ctx, req.RoomID, "") fails, which can leave survivors without
the rotated key; change the flow in the handler (the block using ListByRoom and
calling h.fanOutRoomKeyToSurvivors) so that on listErr you do not acknowledge
the removal but instead return or propagate an error (so the caller can retry),
and apply the same change to the analogous block around fanOutRoomKeyToSurvivors
at the later section (lines referenced in the review). Ensure ListByRoom
failures are logged and then cause the handler to abort the removal/ack path
rather than continuing.
- Around line 1263-1266: The call to buildAndFanOutRoomKey currently swallows
errors (logs and continues) which lets room/subscriptions persist without
members receiving keys; change the create-room flow so that when
buildAndFanOutRoomKey returns an error you do not treat it as success: either
(a) propagate the error back from the create-room handler (return err) so the
higher-level transaction/operation fails and can be retried, or (b) implement a
short retry loop with backoff around buildAndFanOutRoomKey (e.g., 3 attempts,
small sleep between attempts) and only proceed when it succeeds; include attempt
count and the original error in logs. Ensure you update the call site that
currently does "if err := h.buildAndFanOutRoomKey(...); err != nil {
slog.Error(... ) }" to either return the error or retry and only continue when
successful.

---

Outside diff comments:
In `@room-service/handler.go`:
- Around line 489-551: The flag targetIsDualMembership is overloaded and the
comment about "Org removes rotate unconditionally" is stale; rename and clarify
the logic so the rotation gate is unambiguous: introduce a boolean
skipKeyRotation (or set skipKeyRotation instead of targetIsDualMembership) and
assign it only to mean "skip key rotation for this request", set it in the
individual-remove branch to target.HasIndividualMembership &&
target.HasOrgMembership (using GetSubscriptionWithMembership) and set it in the
org-remove branch based on CountOrgOnlySubs == 0 (using CountOrgOnlySubs with
req.RoomID/req.OrgID), then update/remove the stale comment and change the
rotation check from if h.keyStore != nil && !targetIsDualMembership to if
h.keyStore != nil && !skipKeyRotation so intent is clear.

In `@room-worker/store_mongo.go`:
- Around line 38-49: The ListByRoom function returns raw Mongo errors; wrap both
failure returns with contextual messages using fmt.Errorf so callers get
operation and identifier info. Specifically update the error returned from
s.subscriptions.Find(...) to something like fmt.Errorf("ListByRoom: find failed
for room %s site %s: %w", roomID, siteID, err) and likewise wrap the
cursor.All(...) error with fmt.Errorf("ListByRoom: decode failed for room %s
site %s: %w", roomID, siteID, err) so failures mention ListByRoom, the
roomID/siteID and the underlying err instead of returning bare err.

---

Nitpick comments:
In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md`:
- Around line 746-752: The snippet's natsPublisherAdapter uses a plain
*nats.Conn and a Publish(subj, data) signature, but this repo uses the
OpenTelemetry-wrapped connection from pkg/natsutil/connect.go (type
*otelnats.Conn) and the context-aware Publish(ctx, ...) to propagate request
IDs/traces; update natsPublisherAdapter to hold an *otelnats.Conn (or the
concrete type returned by connect.go) and expose a Publish method that accepts a
context.Context and delegates to the connection's Publish(ctx, subject, data) so
headers/OTel context flow correctly.

In `@pkg/roomkeysender/roomkeysender_test.go`:
- Around line 83-97: The test must also verify Send no longer mutates the
caller's event: before calling roomkeysender.NewSender(...).Send, make a deep
copy of tt.evt (e.g., orig := deep copy of tt.evt) and after Send returns assert
that tt.evt equals orig (use require.Equal or require.True with
reflect.DeepEqual) for both success and error paths; update the subtest that
constructs mockPublisher and calls sender.Send to create the copy and add the
post-call equality assertion so the non-mutation contract for Send is enforced.

In `@room-service/keygen_test.go`:
- Around line 29-36: The test TestGenerateRoomKeyPair_RoundTripWithRoomcrypto
only verifies the encoded blob shape but never validates the private key; update
the test to perform a full round-trip by calling the corresponding
roomcrypto.Decode (or the library's decrypt function) with pair.PrivateKey and
the encrypted result from roomcrypto.Encode and assert the decrypted plaintext
equals "hello" (or rename the test to reflect that it only checks public-key
encoding if you intentionally don't want decryption). Ensure you reference
generateRoomKeyPair to get pair.PublicKey/pair.PrivateKey and the
roomcrypto.Encode/Decode functions so a mismatched keypair would fail the
assertion.

In `@room-worker/handler_test.go`:
- Around line 3287-3290: The test currently asserts ordering by indexing into
pub.subjects (subjects[0], subjects[1], subjects[2]) which makes the spec
brittle; update the assertion to be order-independent by comparing the set of
actual subjects to the expected set (e.g. use assert.ElementsMatch or
assert.Equal with a sorted slice) or iterate expected values and assert each is
contained in pub.subjects; refer to the pub.subjects / subjects variable in
handler_test.go and replace the indexed asserts with an order-independent check
for "chat.user.alice.event.room.key", "chat.user.bob.event.room.key", and
"chat.user.remote-carol.event.room.key".

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59fa3ada-6b44-42e6-962b-c48663f05f9f

📥 Commits

Reviewing files that changed from the base of the PR and between 7817eb6 and a780ff8.

📒 Files selected for processing (40)

docs/client-api.md
docs/superpowers/plans/2026-05-08-room-encryption-keys.md
docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md
inbox-worker/consumer_config_test.go
inbox-worker/deploy/docker-compose.yml
inbox-worker/handler.go
inbox-worker/handler_test.go
inbox-worker/integration_test.go
inbox-worker/intersite_key.go
inbox-worker/intersite_key_test.go
inbox-worker/intersite_stubs_test.go
inbox-worker/main.go
pkg/model/event.go
pkg/model/member.go
pkg/model/model_test.go
pkg/roomkeymetrics/metrics.go
pkg/roomkeymetrics/metrics_test.go
pkg/roomkeysender/integration_test.go
pkg/roomkeysender/roomkeysender.go
pkg/roomkeysender/roomkeysender_test.go
pkg/roomkeystore/doc.go
pkg/subject/subject.go
pkg/subject/subject_test.go
room-service/handler.go
room-service/handler_test.go
room-service/integration_test.go
room-service/keygen.go
room-service/keygen_test.go
room-service/mock_store_test.go
room-service/store.go
room-service/store_mongo.go
room-worker/deploy/docker-compose.yml
room-worker/handler.go
room-worker/handler_test.go
room-worker/integration_test.go
room-worker/main.go
room-worker/mock_publisher_test.go
room-worker/mock_store_test.go
room-worker/store.go
room-worker/store_mongo.go

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

inbox-worker/handler.go (1)
145-155: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make add-member retries re-enter the key replication path.

If replicateLocalKey fails after BulkCreateSubscriptions succeeds, the redelivery will usually hit duplicate keys at Line 145 and return before key replication is retried. That leaves local Valkey stale even though the message keeps retrying. At minimum, this path needs the same idempotent duplicate-key escape hatch as handleRoomCreated, or an upsert-based subscription write.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inbox-worker/handler.go` around lines 145 - 155, BulkCreateSubscriptions may
return a duplicate-key error which currently causes the handler to return before
replicateLocalKey is retried, leaving local Valkey stale; update the add-member
flow so a duplicate-key error from h.store.BulkCreateSubscriptions is treated as
success (or convert the store call to an upsert) and the code always proceeds to
call h.replicateLocalKey(ctx, evt.SiteID, event.RoomID); alternatively implement
the same idempotent duplicate-key escape hatch used in handleRoomCreated so that
replicateLocalKey is retried even when subscription writes collide.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md`:
- Around line 7-8: Update the docs to reflect that key ownership and rotation
live with the origin site: indicate that room-worker is the sole publisher of
RoomKeyEvent and the authoritative source of key versions, and that inbox-worker
must NOT implement its own Set/Rotate or a separate key fan-out path; instead
inbox-worker should only fetch keys via the
chat.server.request.roomkey.{siteID}.get RPC (or mirror the origin RoomKeyEvent)
and write the exact version into local Valkey, then use that mirrored value for
local delivery without changing versioning semantics; revise all referenced
sections (those around the current paragraphs and the ranges called out) to
remove instructions that give inbox-worker Set/Rotate responsibilities or
independent fan-out behavior and explicitly state the mirroring-only behavior
and exact-version requirement.

In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md`:
- Line 455: The spec line for remove-member incorrectly marks a `Get` returning
version < `NewKeyVersion` as a permanent error; update the text for the
`remove-member` case to document stale local key versions as a transient `NAK +
retry` condition (not permanent), keeping the same retryable semantics used
elsewhere (mention `Get`, `NewKeyVersion`, `NAK + retry` and JetStream behavior)
so tests and JetStream handling continue to treat it as retryable.

In `@inbox-worker/handler_test.go`:
- Around line 1250-1252: Several test setup blocks in handler_test.go silently
ignore json.Marshal errors (e.g., lines creating pData, data used in OutboxEvent
payloads); replace occurrences like "pData, _ := json.Marshal(...)" and "data, _
:= json.Marshal(...)" with code that checks the marshal succeeded using
require.NoError(t, err) (or require.NoError where t is the test variable)
immediately after the marshal and before using pData/data, updating all
instances referenced in this file (examples around creation of model.OutboxEvent
in the member_added test and the other ranges noted) so test setup failures fail
fast instead of being silently discarded.
- Around line 1447-1448: The current test only compares the first byte of
pair.KeyPair.PublicKey which can miss overwrites; update the assertion to
compare the entire public key slice (pair.KeyPair.PublicKey) against the
expected full byte slice so the test fails if any byte changes, ensuring the "no
overwrite" property is strongly enforced for the KeyPair.PublicKey.

In `@inbox-worker/handler.go`:
- Around line 342-347: The duplicate-key handler for
h.store.BulkCreateSubscriptions currently treats any mongo.IsDuplicateKeyError
as full success; update the logic in the BulkCreateSubscriptions error branch to
verify that all intended subscriptions in subs were actually created (e.g.,
query h.store for the list of subscriptions for data.RoomID/data.HomeSiteID and
compare against subs) and only call h.fetchAndStoreKey(ctx, data.HomeSiteID,
data.RoomID) and return nil when the full target set exists; if any are missing,
instead perform an idempotent per-subscription upsert loop (or retry creating
only the missing entries) to reconcile the partial success before acknowledging
the event.

In `@pkg/roomkeysender/roomkeysender_test.go`:
- Around line 85-94: The test currently takes a shallow snapshot with before :=
tt.evt so slice headers (PublicKey, PrivateKey) still share backing arrays and
mutations could be missed; update the test to take a deep copy of the
RoomKeyEvent before calling roomkeysender.NewSender(...).Send — e.g., create a
deepCopy of tt.evt that copies the struct and makes new slices for PublicKey and
PrivateKey (copying their contents) and assign that to before, then call
sender.Send(tt.account, tt.evt) and assert.Equal(t, before, tt.evt) to ensure no
in-place slice mutations are missed.

---

Outside diff comments:
In `@inbox-worker/handler.go`:
- Around line 145-155: BulkCreateSubscriptions may return a duplicate-key error
which currently causes the handler to return before replicateLocalKey is
retried, leaving local Valkey stale; update the add-member flow so a
duplicate-key error from h.store.BulkCreateSubscriptions is treated as success
(or convert the store call to an upsert) and the code always proceeds to call
h.replicateLocalKey(ctx, evt.SiteID, event.RoomID); alternatively implement the
same idempotent duplicate-key escape hatch used in handleRoomCreated so that
replicateLocalKey is retried even when subscription writes collide.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4db44448-7ccd-4053-9dc0-f2be5ba22da3

📥 Commits

Reviewing files that changed from the base of the PR and between a780ff8 and 9910d21.

📒 Files selected for processing (22)

docs/client-api.md
docs/superpowers/plans/2026-05-08-room-encryption-keys.md
docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md
inbox-worker/handler.go
inbox-worker/handler_test.go
inbox-worker/intersite_key.go
inbox-worker/intersite_key_test.go
inbox-worker/intersite_stubs_test.go
inbox-worker/main.go
pkg/roomkeysender/integration_test.go
pkg/roomkeysender/roomkeysender_test.go
pkg/roomkeystore/adapter.go
pkg/roomkeystore/integration_test.go
pkg/roomkeystore/roomkeystore.go
pkg/roomkeystore/roomkeystore_test.go
room-service/handler.go
room-service/handler_test.go
room-service/keygen_test.go
room-service/store_mongo.go
room-worker/handler.go
room-worker/handler_test.go
room-worker/store_mongo.go

🚧 Files skipped from review as they are similar to previous changes (10)

room-worker/store_mongo.go
inbox-worker/intersite_key_test.go
inbox-worker/main.go
inbox-worker/intersite_stubs_test.go
room-service/store_mongo.go
inbox-worker/intersite_key.go
room-service/handler.go
room-service/handler_test.go
room-worker/handler.go
room-worker/handler_test.go

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

room-worker/main.go (1)
173-195: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Verify shutdown hook order: tracer/meter before NATS drain may lose telemetry.

The current order shuts down tracer (line 188) and meter (line 189) before draining NATS (line 190). This may prevent traces and metrics generated during the NATS drain, Mongo disconnect, or Valkey close from being captured and exported.

Standard observability shutdown happens after all client operations to ensure telemetry is flushed.
🔄 Suggested hook order
 	hooks := []func(ctx context.Context) error{
 		func(ctx context.Context) error {
 			iter.Stop()
 			return nil
 		},
 		func(ctx context.Context) error {
 			done := make(chan struct{})
 			go func() { wg.Wait(); close(done) }()
 			select {
 			case <-done:
 				return nil
 			case <-ctx.Done():
 				return fmt.Errorf("worker drain timed out: %w", ctx.Err())
 			}
 		},
-		func(ctx context.Context) error { return tracerShutdown(ctx) },
-		func(ctx context.Context) error { return meterShutdown(ctx) },
 		func(ctx context.Context) error { return nc.Drain() },
 		func(ctx context.Context) error { mongoutil.Disconnect(ctx, mongoClient); return nil },
 		func(ctx context.Context) error { return keyStore.Close() },
+		func(ctx context.Context) error { return tracerShutdown(ctx) },
+		func(ctx context.Context) error { return meterShutdown(ctx) },
 	}
This ensures all client operations complete before flushing observability exporters.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@room-worker/main.go` around lines 173 - 195, The tracerShutdown and
meterShutdown hooks are ordered before nc.Drain/mongo disconnect/keyStore.Close,
which can drop telemetry produced during those operations; move the calls to
tracerShutdown and meterShutdown to run after nc.Drain(),
mongoutil.Disconnect(ctx, mongoClient) and keyStore.Close() in the hooks slice
so telemetry is flushed only after NATS drain, Mongo disconnect and key store
close complete; keep the iter.Stop and wg.Wait hooks earlier so work finishes
before the client drains, and ensure shutdown.Wait still receives the reordered
hooks.
inbox-worker/handler.go (1)
168-184: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Empty member_removed.Accounts should not short-circuit key replication.

A remote site can still need the rotated key even when no local subscriptions are deleted. Returning before fetchAndStoreKey leaves local Valkey stale, so this site's broadcast-worker can keep encrypting with an older version than local survivors hold.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inbox-worker/handler.go` around lines 168 - 184, The handler currently
returns early when memberEvt.Accounts is empty which prevents key rotation;
change Handler.handleMemberRemoved so that h.fetchAndStoreKey(ctx, evt.SiteID,
memberEvt.RoomID) is always invoked even if memberEvt.Accounts is empty, and
only call h.store.DeleteSubscriptionsByAccounts when len(memberEvt.Accounts) > 0
(preserve existing error handling for both calls). Ensure the final error
messages remain descriptive (retain the existing fmt.Errorf wrapping) and that
key fetch/store runs regardless of subscription deletion.

♻️ Duplicate comments (3)

docs/superpowers/plans/2026-05-08-room-encryption-keys.md (2)
1476-1935: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Part 3 still documents the old inbox-worker ownership model.

These tasks tell implementers to wire a sender into inbox-worker, call local Set/Rotate, and fan out keys to users. The shipped design is RPC fetch + SetWithVersion only, with origin room-worker remaining the sole RoomKeyEvent publisher. This section should be rewritten before someone follows it literally.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md` around lines 1476
- 1935, The document incorrectly instructs wiring a sender and performing local
Set/Rotate + fan-out of room keys; instead rewrite the section to reflect the
current design where inbox-worker must RPC-fetch keys from origin (using
InterSiteKeyClient / newNatsInterSiteKeyClient), call the Valkey-backed
SetWithVersion (roomkeystore.SetWithVersion or equivalent) to store the fetched
versioned key locally, and NOT act as an origin publisher or fan-out sender;
update references to Handler, replicateRoomKey, handleMemberAdded, and
handleMemberRemoved to describe fetching via GetRoomKey and using
SetWithVersion/Rotate semantics only, remove guidance to create or wire
roomkeysender for outbound key publishing, and adjust the tests/instructions to
expect RPC fetch + local SetWithVersion behavior (and note that origin
room-worker remains the sole RoomKeyEvent publisher).
1172-1241: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Task 12 still treats stale key versions as permanent errors.

Both the test and the implementation snippet turn pair.Version < req.NewKeyVersion into errPermanent. That case needs NAK + retry; otherwise remove-member can Ack away the normal Valkey propagation race instead of waiting for the rotated version to appear.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md` around lines 1172
- 1241, The test and handler currently treat a stale key version (in
processRemoveMember when h.keyStore.Get(ctx, req.RoomID) returns pair.Version <
req.NewKeyVersion) as a permanent error via newPermanent/errPermanent; change
that branch in processRemoveMember to return a non-permanent error (e.g., a
plain fmt.Errorf or a dedicated transient error type) instead of newPermanent so
the caller will NAK and retry the remove-member operation; leave the other
room-type and key-get error handling unchanged and only replace the
newPermanent(...) used when pair == nil || pair.Version < req.NewKeyVersion.
docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md (1)
38-39: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This scope bullet still gives inbox-worker its own fan-out/versioning path.

It says the remote worker uses local Set/Rotate semantics and fans out RoomKeyEvents to local users, but the shipped flow is RPC fetch + SetWithVersion only; origin room-worker remains the sole user-event publisher. As written, the top-level scope contradicts the corrected flow documented later in the spec.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md` around
lines 38 - 39, The bullet claiming inbox-worker performs local Set/Rotate and
independently fans out RoomKeyEvent is incorrect; update the scope to match the
implemented flow where inbox-worker performs an RPC to room-worker
(`chat.server.request.roomkey.{originSiteID}.get`) and uses SetWithVersion (not
Set/Rotate) so the origin `room-worker` remains the sole publisher of
user-facing RoomKeyEvent; change wording to reflect that inbox-worker only
fetches keypair bytes via RPC and applies them with SetWithVersion while leaving
event fan-out/versioning to the origin `room-worker` (keep references to
`inbox-worker`, `room-worker`, `RoomKeyEvent`, `Set`, `Rotate`, and
`SetWithVersion` to make the intended correction clear).

🧹 Nitpick comments (3)

room-worker/mock_publisher_test.go (1)
41-48: ⚡ Quick win

Clarify that the returned keys are dummy/non-cryptographic stubs.

The comment states "returns a valid version-0 key," but the keys are dummy byte sequences (all 0x04 and 0x05) that are not cryptographically valid P-256 keys. While the broader comment (line 34) mentions "don't exercise key behavior," the word "valid" here is misleading and could cause future developers to mistakenly use this stub in tests that do perform cryptographic operations.
📝 Suggested comment clarification
-func (stubRoomKeyStore) Get(_ context.Context, _ string) (*roomkeystore.VersionedKeyPair, error) {
+// Get returns a dummy (non-cryptographic) VersionedKeyPair. The key bytes are
+// not valid P-256 keys and must not be used for actual crypto operations.
+func (stubRoomKeyStore) Get(_ context.Context, _ string) (*roomkeystore.VersionedKeyPair, error) {
 	return &roomkeystore.VersionedKeyPair{
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@room-worker/mock_publisher_test.go` around lines 41 - 48, Update the
misleading comment in stubRoomKeyStore.Get that says it "returns a valid
version-0 key" to explicitly state these are synthetic/dummy byte sequences (all
0x04 and 0x05) and are NOT cryptographically valid P-256 keys and must not be
used for any crypto behavior; mention the returned types
(roomkeystore.VersionedKeyPair and roomkeystore.RoomKeyPair) so readers know
which stubbed values are non-cryptographic.
room-worker/main.go (1)
92-96: ⚡ Quick win

Move config validation before external connections.

Validating VALKEY_KEY_GRACE_PERIOD > 0 after Mongo connect (line 81) and stream bootstrap (line 87) wastes resources on external services before checking a simple config invariant.
♻️ Suggested placement

Move the validation immediately after config parsing:
 	cfg, err := env.ParseAs[config]()
 	if err != nil {
 		slog.Error("parse config", "error", err)
 		os.Exit(1)
 	}
+
+	if cfg.ValkeyKeyGracePeriod <= 0 {
+		slog.Error("VALKEY_KEY_GRACE_PERIOD must be a positive duration",
+			"valkey_key_grace_period", cfg.ValkeyKeyGracePeriod)
+		os.Exit(1)
+	}
 
 	ctx := context.Background()
Then remove lines 92–96.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@room-worker/main.go` around lines 92 - 96, Move the VALKEY_KEY_GRACE_PERIOD
validation so it runs immediately after config parsing instead of after external
setup: check cfg.ValkeyKeyGracePeriod > 0 right after the config is built/parsed
(where the environment vars are loaded) and exit with an error if invalid, and
then remove the later block that checks cfg.ValkeyKeyGracePeriod (the existing
if that logs and calls os.Exit at the end of main). This ensures the validation
occurs before creating Mongo connections or bootstrapping streams.
inbox-worker/integration_test.go (1)
678-693: ⚡ Quick win

Use the standard NATS testcontainer path in this integration test.

This new helper spins up an embedded server instead of the official testcontainers-go NATS module already used elsewhere in the file. Reusing the container-based setup keeps dependency orchestration consistent with the repo's integration-test rules.

As per coding guidelines, “Use testcontainers-go with official modules (mongodb, cassandra, nats) for real dependencies in integration tests”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inbox-worker/integration_test.go` around lines 678 - 693, The helper
startEmbeddedNATS spins up an in-process server; replace it with the
repository's standard testcontainers-go NATS container usage: remove the
natsserver.Options/NewServer/ns.Start flow in startEmbeddedNATS and instead
create and start an official testcontainers nats container (using the same
testcontainers NATS module pattern used elsewhere in this file), wait for
readiness, connect using nats.Connect(ns.ClientURL()) (or the container's mapped
host/port), and register t.Cleanup to terminate the container and close the
client; update any imports and ensure the function still returns *nats.Conn and
uses require.NoError/require.True as before.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md`:
- Around line 715-745: The docs currently instruct adding an optional Valkey
startup path checking cfg.ValkeyAddr and only wiring keyStore/keySender when it
is non-empty; update the instructions in room-worker/main.go to reflect the new
fail-fast behavior (Valkey is required at startup) instead of the old optional
branch: remove or rewrite the `if cfg.ValkeyAddr != ""` flow and document that
roomkeystore.NewValkeyStore must be initialized and that startup should exit if
it fails, and that roomkeysender.NewSender (using nc.NatsConn()) must always be
created and assigned to keySender; reference cfg.ValkeyAddr,
roomkeystore.NewValkeyStore, roomkeysender.NewSender and nc.NatsConn() so the
maintainer locates and replaces the optional wiring with the required, fail-fast
wiring.

In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md`:
- Around line 429-434: Update the inbox-worker row in the config summary table
to include the ROOM_KEY_MAX_REDELIVER environment variable: add
`ROOM_KEY_MAX_REDELIVER` (describe default if applicable) alongside
`VALKEY_ADDR`, `VALKEY_PASSWORD`, `VALKEY_KEY_GRACE_PERIOD`, and
`ROOM_KEY_RPC_TIMEOUT` in the `inbox-worker` new env vars column so the table
matches the bounded-redelivery behavior described earlier; edit the table row
for `inbox-worker` in specs/2026-05-08-room-encryption-keys-design.md to list
`ROOM_KEY_MAX_REDELIVER` explicitly.

In `@inbox-worker/main.go`:
- Around line 266-285: Summary: startup lacks validation for
ROOM_KEY_RPC_TIMEOUT so a zero/negative cfg.RoomKeyRPCTimeout will make
inter-site key RPCs fail immediately. Fix: add a fail-fast check after the
existing validations to ensure cfg.RoomKeyRPCTimeout > 0; if not, call
slog.Error with a clear message (e.g., "ROOM_KEY_RPC_TIMEOUT must be a positive
duration" and include "room_key_rpc_timeout", cfg.RoomKeyRPCTimeout) and
os.Exit(1) before creating interSiteClient with newNatsInterSiteKeyClient,
mirroring the pattern used for ValkeyKeyGracePeriod and RoomKeyMaxRedeliver.

---

Outside diff comments:
In `@inbox-worker/handler.go`:
- Around line 168-184: The handler currently returns early when
memberEvt.Accounts is empty which prevents key rotation; change
Handler.handleMemberRemoved so that h.fetchAndStoreKey(ctx, evt.SiteID,
memberEvt.RoomID) is always invoked even if memberEvt.Accounts is empty, and
only call h.store.DeleteSubscriptionsByAccounts when len(memberEvt.Accounts) > 0
(preserve existing error handling for both calls). Ensure the final error
messages remain descriptive (retain the existing fmt.Errorf wrapping) and that
key fetch/store runs regardless of subscription deletion.

In `@room-worker/main.go`:
- Around line 173-195: The tracerShutdown and meterShutdown hooks are ordered
before nc.Drain/mongo disconnect/keyStore.Close, which can drop telemetry
produced during those operations; move the calls to tracerShutdown and
meterShutdown to run after nc.Drain(), mongoutil.Disconnect(ctx, mongoClient)
and keyStore.Close() in the hooks slice so telemetry is flushed only after NATS
drain, Mongo disconnect and key store close complete; keep the iter.Stop and
wg.Wait hooks earlier so work finishes before the client drains, and ensure
shutdown.Wait still receives the reordered hooks.

---

Duplicate comments:
In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md`:
- Around line 1476-1935: The document incorrectly instructs wiring a sender and
performing local Set/Rotate + fan-out of room keys; instead rewrite the section
to reflect the current design where inbox-worker must RPC-fetch keys from origin
(using InterSiteKeyClient / newNatsInterSiteKeyClient), call the Valkey-backed
SetWithVersion (roomkeystore.SetWithVersion or equivalent) to store the fetched
versioned key locally, and NOT act as an origin publisher or fan-out sender;
update references to Handler, replicateRoomKey, handleMemberAdded, and
handleMemberRemoved to describe fetching via GetRoomKey and using
SetWithVersion/Rotate semantics only, remove guidance to create or wire
roomkeysender for outbound key publishing, and adjust the tests/instructions to
expect RPC fetch + local SetWithVersion behavior (and note that origin
room-worker remains the sole RoomKeyEvent publisher).
- Around line 1172-1241: The test and handler currently treat a stale key
version (in processRemoveMember when h.keyStore.Get(ctx, req.RoomID) returns
pair.Version < req.NewKeyVersion) as a permanent error via
newPermanent/errPermanent; change that branch in processRemoveMember to return a
non-permanent error (e.g., a plain fmt.Errorf or a dedicated transient error
type) instead of newPermanent so the caller will NAK and retry the remove-member
operation; leave the other room-type and key-get error handling unchanged and
only replace the newPermanent(...) used when pair == nil || pair.Version <
req.NewKeyVersion.

In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md`:
- Around line 38-39: The bullet claiming inbox-worker performs local Set/Rotate
and independently fans out RoomKeyEvent is incorrect; update the scope to match
the implemented flow where inbox-worker performs an RPC to room-worker
(`chat.server.request.roomkey.{originSiteID}.get`) and uses SetWithVersion (not
Set/Rotate) so the origin `room-worker` remains the sole publisher of
user-facing RoomKeyEvent; change wording to reflect that inbox-worker only
fetches keypair bytes via RPC and applies them with SetWithVersion while leaving
event fan-out/versioning to the origin `room-worker` (keep references to
`inbox-worker`, `room-worker`, `RoomKeyEvent`, `Set`, `Rotate`, and
`SetWithVersion` to make the intended correction clear).

---

Nitpick comments:
In `@inbox-worker/integration_test.go`:
- Around line 678-693: The helper startEmbeddedNATS spins up an in-process
server; replace it with the repository's standard testcontainers-go NATS
container usage: remove the natsserver.Options/NewServer/ns.Start flow in
startEmbeddedNATS and instead create and start an official testcontainers nats
container (using the same testcontainers NATS module pattern used elsewhere in
this file), wait for readiness, connect using nats.Connect(ns.ClientURL()) (or
the container's mapped host/port), and register t.Cleanup to terminate the
container and close the client; update any imports and ensure the function still
returns *nats.Conn and uses require.NoError/require.True as before.

In `@room-worker/main.go`:
- Around line 92-96: Move the VALKEY_KEY_GRACE_PERIOD validation so it runs
immediately after config parsing instead of after external setup: check
cfg.ValkeyKeyGracePeriod > 0 right after the config is built/parsed (where the
environment vars are loaded) and exit with an error if invalid, and then remove
the later block that checks cfg.ValkeyKeyGracePeriod (the existing if that logs
and calls os.Exit at the end of main). This ensures the validation occurs before
creating Mongo connections or bootstrapping streams.

In `@room-worker/mock_publisher_test.go`:
- Around line 41-48: Update the misleading comment in stubRoomKeyStore.Get that
says it "returns a valid version-0 key" to explicitly state these are
synthetic/dummy byte sequences (all 0x04 and 0x05) and are NOT cryptographically
valid P-256 keys and must not be used for any crypto behavior; mention the
returned types (roomkeystore.VersionedKeyPair and roomkeystore.RoomKeyPair) so
readers know which stubbed values are non-cryptographic.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f514e45a-046a-4bfa-9b7c-c1c9dfb4d6b0

📥 Commits

Reviewing files that changed from the base of the PR and between 9910d21 and 54b938a.

📒 Files selected for processing (14)

docs/superpowers/plans/2026-05-08-room-encryption-keys.md
docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md
inbox-worker/deploy/docker-compose.yml
inbox-worker/handler.go
inbox-worker/handler_test.go
inbox-worker/integration_test.go
inbox-worker/main.go
pkg/roomkeysender/roomkeysender_test.go
room-worker/deploy/docker-compose.yml
room-worker/handler.go
room-worker/handler_test.go
room-worker/integration_test.go
room-worker/main.go
room-worker/mock_publisher_test.go

🚧 Files skipped from review as they are similar to previous changes (5)

room-worker/deploy/docker-compose.yml
room-worker/handler.go
room-worker/handler_test.go
inbox-worker/deploy/docker-compose.yml
room-worker/integration_test.go

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/client-api.md (1)
830-830: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix inconsistent anchor reference.

This line still references #5-error-envelope-reference, but the Error envelope section was renumbered to section 6. All other references in the document (lines 162, 250, 306, etc.) were correctly updated to #6-error-envelope-reference.
🔗 Proposed fix
-See [Error envelope](`#5-error-envelope-reference`). Common errors:
+See [Error envelope](`#6-error-envelope-reference`). Common errors:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/client-api.md` at line 830, Update the incorrect anchor reference
"#5-error-envelope-reference" to "#6-error-envelope-reference" in the sentence
"See [Error envelope](`#5-error-envelope-reference`). Common errors:" so it
matches the renumbered Error envelope section and the other references in the
document.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/client-api.md`:
- Around line 1840-1842: The fenced code block containing the NATS subject
pattern "chat.user.{account}.event.room.key" is missing a language identifier;
update that fenced block to include a language token (e.g., use ```text) so the
block becomes ```text ... ``` to satisfy markdownlint MD040 and improve
rendering consistency.

---

Outside diff comments:
In `@docs/client-api.md`:
- Line 830: Update the incorrect anchor reference "#5-error-envelope-reference"
to "#6-error-envelope-reference" in the sentence "See [Error
envelope](`#5-error-envelope-reference`). Common errors:" so it matches the
renumbered Error envelope section and the other references in the document.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 75a42379-f748-4132-8168-6d6004b4dbb1

📥 Commits

Reviewing files that changed from the base of the PR and between 54b938a and c7bff9d.

📒 Files selected for processing (47)

broadcast-worker/mock_store_test.go
docs/client-api.md
docs/superpowers/plans/2026-05-08-room-encryption-keys.md
docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md
inbox-worker/consumer_config_test.go
inbox-worker/deploy/docker-compose.yml
inbox-worker/handler.go
inbox-worker/handler_test.go
inbox-worker/integration_test.go
inbox-worker/intersite_key.go
inbox-worker/intersite_key_test.go
inbox-worker/intersite_stubs_test.go
inbox-worker/main.go
message-worker/mock_store_test.go
pkg/model/event.go
pkg/model/member.go
pkg/model/model_test.go
pkg/roomkeymetrics/metrics.go
pkg/roomkeymetrics/metrics_test.go
pkg/roomkeysender/integration_test.go
pkg/roomkeysender/roomkeysender.go
pkg/roomkeysender/roomkeysender_test.go
pkg/roomkeystore/adapter.go
pkg/roomkeystore/doc.go
pkg/roomkeystore/integration_test.go
pkg/roomkeystore/roomkeystore.go
pkg/roomkeystore/roomkeystore_test.go
pkg/subject/subject.go
pkg/subject/subject_test.go
room-service/handler.go
room-service/handler_test.go
room-service/integration_test.go
room-service/keygen.go
room-service/keygen_test.go
room-service/mock_store_test.go
room-service/store.go
room-service/store_mongo.go
room-worker/deploy/docker-compose.yml
room-worker/handler.go
room-worker/handler_test.go
room-worker/integration_test.go
room-worker/main.go
room-worker/mock_publisher_test.go
room-worker/mock_store_test.go
room-worker/store.go
room-worker/store_mongo.go
search-service/mock_store_test.go

✅ Files skipped from review due to trivial changes (7)

pkg/roomkeystore/doc.go
room-worker/deploy/docker-compose.yml
message-worker/mock_store_test.go
pkg/roomkeymetrics/metrics_test.go
search-service/mock_store_test.go
room-service/mock_store_test.go
room-worker/mock_store_test.go

🚧 Files skipped from review as they are similar to previous changes (30)

pkg/subject/subject_test.go
room-service/keygen.go
room-service/store.go
pkg/subject/subject.go
pkg/roomkeystore/roomkeystore.go
room-service/integration_test.go
pkg/roomkeystore/adapter.go
pkg/model/event.go
pkg/roomkeysender/roomkeysender.go
pkg/roomkeystore/roomkeystore_test.go
inbox-worker/deploy/docker-compose.yml
room-worker/store_mongo.go
pkg/roomkeysender/integration_test.go
inbox-worker/intersite_key.go
pkg/roomkeystore/integration_test.go
pkg/model/member.go
room-service/keygen_test.go
inbox-worker/main.go
pkg/model/model_test.go
pkg/roomkeymetrics/metrics.go
room-worker/mock_publisher_test.go
inbox-worker/intersite_key_test.go
inbox-worker/intersite_stubs_test.go
room-worker/integration_test.go
room-worker/handler.go
room-service/store_mongo.go
inbox-worker/integration_test.go
room-worker/main.go
room-worker/handler_test.go
inbox-worker/handler_test.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/client-api.md`:
- Line 1836: Update the ambiguous statement "Each room has a P-256 keypair"
(line with that exact phrase) and the nearby lines about keys sent to initial
members and "(channels only)" key rotations (the sentences mentioning keys to
initial members and member add/remove rotation) to explicitly state the intended
behavior: either change the phrase to "Each channel room has a P-256 keypair" if
DM rooms do not receive keys, or add a clarifying note that all rooms currently
receive keys at creation for consistency but DM rooms use plaintext `message`
fields (not `encryptedMessage`) so clients may opt not to store DM keys, and
indicate if DM encryption is planned for a future release; ensure you update the
sections that reference `message` vs `encryptedMessage` in §4 to match this
decision so the documentation is consistent.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 735e9700-0e9e-441f-9d38-3391de6d2c69

📥 Commits

Reviewing files that changed from the base of the PR and between c7bff9d and 70fab83.

📒 Files selected for processing (1)

docs/client-api.md

End-to-end design for distributing P-256 key pairs per room. room-service generates keys on create and rotates on channel member removal. Origin room-worker fans out RoomKeyEvent to all members (cross-site routed by NATS supercluster). inbox-worker replicates keypairs into local Valkey so each site's broadcast-worker can encrypt locally. Key material never enters JetStream — cross-site replication uses a server-to-server NATS RPC. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

Three-part plan: room-service foundation (key gen + rotate), room-worker (consume canonical events + fan out RoomKeyEvent), inbox-worker (cross-site Valkey replication via RPC). TDD discipline, frequent commits, integration tests for end-to-end persistence and cross-site replication. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

Adds chat.server.request.roomkey.{siteID}.get subject used by inbox-worker to pull a room's keypair from origin during cross-site replication. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

Adds NewKeyVersion to RemoveMemberRequest + MemberRemoveEvent for stamping the post-rotation version through the canonical event and outbox payload. Adds RoomType to RemoveMemberRequest so room-worker can assert channel-only without an extra GetRoom call. Adds RoomKeyGetRequest as the shared schema for the inter-site RPC request body. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…vent by value Send now accepts RoomKeyEvent by value, eliminating the silent shared-pointer mutation that would trample Timestamp across a fan-out loop (and prevent any data race if fan-out is ever parallelised). NatsPublisher exports the *nats.Conn → Publisher adapter that was previously duplicated in both worker mains. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

New package exporting five OTel instruments — FanoutErrors, RPCDuration, KeyGenerated, KeyRotated, ValkeyErrors (op-tagged). Consumers in room-service, room-worker, and inbox-worker emit metrics on success and failure paths so operators can dashboard key generation rate, RPC latency, fan-out failures, and per-op Valkey errors. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…topology Documents the public contract: scalar int versioning with current/prev slot, Lua atomicity for Rotate, single-master Valkey topology requirement, and federation responsibility (cross-site replication is inbox-worker's job). https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…air helper Extends the consumer-side RoomKeyStore interface with Set and Rotate so handlers can write/rotate keys. Adds the generateRoomKeyPair helper that emits a fresh P-256 keypair via crypto/ecdh. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…remove On create-room (all types), generate a P-256 keypair and write to Valkey before publishing the canonical event so the worker's Get gate sees the key. On channel member-remove, rotate the key (with Set fallback on ErrNoCurrentKey for pre-existing un-keyed rooms) and stamp NewKeyVersion on the canonical request payload. Dual-membership removals (target has both individual and org membership) skip rotation since the user remains via their org. Block non-channel rooms from member.remove. KeyGenerated/KeyRotated/ValkeyErrors counters emitted on respective paths. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

Adds the consumer-side RoomKeyStore interface (Get-only + Mongo ListByRoom now takes siteID for push-down filtering) and wires roomkeystore.NewValkeyStore + roomkeysender.NewSender in main.go, gated on VALKEY_ADDR being set. Emits a startup slog.Warn when key handling is disabled so partial deployments are visible. Wires otelutil.InitMeter for the metrics package. New mock_publisher_test.go supports capturing fan-out payloads in tests. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…all members processCreateRoom gates Mongo writes on keyStore.Get returning the pre-Set keypair, then fans out to every initial member via buildAndFanOutRoomKey. processAddMembers fans the current key out to newly-added accounts on channel rooms. processRemoveMember version-gates on req.NewKeyVersion (NAK on stale read so jetstream retries while Valkey propagation catches up), fetches survivors via ListByRoom, and calls fanOutRoomKeyToSurvivors. The outbox MemberRemoveEvent now carries NewKeyVersion. Fan-out targets every member account (local + remote) — NATS supercluster routes user-event subjects to the user's home site. Sentinel errors ErrRoomKeyNotFound and ErrRoomKeyStoreInternal exported for the RPC handler path. Defensive roomType guard uses req.RoomType with a backward-compat gate for empty field from pre-this-change senders. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

Drives processCreateRoom against real Valkey + Mongo + NATS containers and asserts RoomKeyEvent is published on chat.user.{account}.event.room.key for every member account. Exercises the version-gate, fan-out, and the cross-site key-fetch RPC handler. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

natsInterSiteKeyClient issues chat.server.request.roomkey.{originSiteID}.get via nats.Conn with X-Request-ID propagated through NATS headers, so trace correlation survives the cross-site boundary. Timeout configurable via ROOM_KEY_RPC_TIMEOUT (default 5s). RPCDuration histogram wraps every call. main.go wires keystore + sender gated on VALKEY_ADDR; startup slog.Warn when disabled. otelutil.InitMeter registered for metrics. Test stubs (stubKeyStore, stubInterSiteClient, stubRoomKeyPublisher) reusable across handler tests. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…s-site Three handler extensions: handleRoomCreated calls replicateRoomKey (RPC origin + Set local Valkey), handleMemberAdded calls replicateLocalKey (cache hit short-circuits, miss falls back to RPC), and handleMemberRemoved calls rotateLocalKey (RPC + Rotate with Set fallback for ErrNoCurrentKey). No user-side fan-out — origin room-worker already publishes chat.user.{account}.event.room.key for every member and NATS supercluster routes to home sites. inbox-worker is purely state replication into local Valkey so broadcast-worker on each site can encrypt locally. RPC failure in the remove path NAKs so jetstream retries; reads vs writes are correctly classified. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…e and cross-site replication room-service: drives handleCreateRoom against real Valkey and asserts the keypair is persisted at the expected hash key with non-empty bytes. inbox-worker: two-site test simulates an origin via an in-memory NATS responder, drives handleRoomCreated on the destination, and asserts the destination's Valkey now holds the replicated keypair. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

… deploys Adds a new client-facing section documenting chat.user.{account}.event.room.key, the RoomKeyEvent payload, client-side (roomId, version) → privateKey map semantics, and the trigger scenarios (create, add-member, remove-member with rotation). Deploy compose files set VALKEY_ADDR and VALKEY_KEY_GRACE_PERIOD for both workers; inbox-worker also gets ROOM_KEY_RPC_TIMEOUT for operator visibility. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…tors inbox-worker/integration_test.go called NewHandler with 5 args after the fan-out removal trimmed it to 4. pkg/roomkeysender/integration_test.go passed *RoomKeyEvent after Send switched to by-value. Both compiled only under -tags=integration so the unit test gate missed them. Fixed 8 NewHandler call sites and the one Send call. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

Cleanup pass surfaced by the final review: - Drop NatsPublisher adapter (*nats.Conn already satisfies Publisher) - Drop TestSender_DoesNotMutateInputTimestamp (tests a language guarantee) - Drop mongoInboxStore.ListByRoom (orphaned after fan-out removal) - Drop 5 vacuous pub.count() == 0 assertions in inbox-worker tests - Unexport room-key sentinel errors (no external consumers) - Drop dead error return from fanOutRoomKeyToSurvivors (always nil) - Drop unnecessary []model.User -> []*model.User conversion - Drop vanity TestMetrics_AreNonNil - Reword version-gate error message and "Batch-3" internal naming - Single-line replicateLocalKey doc; document VALKEY_KEY_GRACE_PERIOD; document the compose-file Valkey dependency - Fix KeyRotated/KeyGenerated metric semantics on Set-fallback path https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…mKey into fetchAndStoreKey https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

…are actually deleted https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

inbox-worker no longer calls chat.server.request.roomkey.{siteID}.get, so the server-side handler and its supporting types are unreachable. Removed: - room-worker.NatsHandleGetRoomKey + handleGetRoomKey - room-worker QueueSubscribe registration for the RPC subject - model.RoomKeyGetRequest - subject.ServerRoomKeyGet + its unit test - errRoomKeyNotFound + errRoomKeyStoreInternal sentinels - TestHandler_handleGetRoomKey

Clients only use PrivateKey to decrypt; PublicKey is server-side only (broadcast-worker reads it from Valkey). Stop populating PublicKey at the two fan-out call sites and mark the JSON tag omitempty so the field disappears from on-wire RoomKeyEvent payloads. Struct field is retained for any future server-side producer. Addresses review comments #10 and #12.

Closes the survivor decrypt-failure window: room-worker now does delete → fan-out new key → Valkey Rotate → publish system message, so survivors hold v+1 before broadcast-worker starts encrypting under it. - room-service stops calling keyStore.Rotate on member-remove; instead reads the current Valkey version and stamps it on the canonical event as RemoveMemberRequest.BaseKeyVersion. Dual-membership / no-deletion detection moves into room-worker (it owns the deletion decision). - room-worker reads currentPair on entry and skips the rotation block when Valkey is already ahead of req.BaseKeyVersion (redelivery guard for the crash-after-Rotate-ack-lost case). - room-worker generates the new keypair (via roomkeystore.GenerateKeyPair), fans it out to ListByRoom-post-delete survivors at predicted version v+1, then commits via keyStore.Rotate (Set fallback on ErrNoCurrentKey). - Moves generateRoomKeyPair from room-service to pkg/roomkeystore. - Removes MemberRemoveEvent.NewKeyVersion (no consumer left after the inter-site path was deleted). Renames RemoveMemberRequest.NewKeyVersion → BaseKeyVersion to match the new semantic. - Removes the now-unused room-service CountOrgOnlySubs + its index. Residual risks (accepted, documented in spec amendment): - Removed-user-read window (~10-100ms) between canonical publish and Mongo delete in room-worker. - Key-gen non-idempotence between fan-out and Rotate on JetStream redelivery (skip-version guard catches the after-Rotate case only).

Integration test still asserted PublicKey was populated in the fan-out RoomKeyEvent, but the client wire payload no longer carries it (per review comments #10 and #12, omitted via json:",omitempty").

…leness markers - inbox-worker: remove unused Handler.siteID field and constructor param (12 call sites updated) - inbox-worker: revert otelutil.InitMeter init and shutdown reorder; inbox-worker emits no metrics, so both were dead weight - inbox-worker: condense Handler doc comment to a single line - room-worker/main.go: collapse stale 3-line comment that still referenced the removed inter-site `chat.server.request.roomkey.{siteID}.get` RPC - spec: add per-section "Stale — superseded by ... amendment" markers to the Remove-member scope bullet, the Remove-member data-flow section, the "Why rotate-first" rationale, and the cross-site key RPC handler subsection so individual sections aren't misleading on their own https://claude.ai/code/session_013zd5nk3mukiVitkvvhniFi

The canonical remove-member event flows through `chat.room.canonical.{site}.member.remove`, which is owned by the ROOMS stream (per pkg/stream/stream.go), not MESSAGES_CANONICAL. MESSAGES_CANONICAL carries chat.msg.canonical.* for the message pipeline. Updated both sequence diagrams. Also spelled out the two concrete skip cases hidden behind `actually_deleted`: - individual remove with dual org-membership (user stays via org sub) - org remove where every target is also individually subscribed (accounts slice empty) https://claude.ai/code/session_013zd5nk3mukiVitkvvhniFi

…members Clients store room keys under a (roomID → key) map keyed by their existing subscriptions. If room.key arrives before subscription.update, the client has nowhere to store it and either drops it (security regression: subsequent messages in that room can't be decrypted) or has to buffer keys for unknown rooms (complexity the spec doesn't require). processAddMembers had the wrong order: buildAndFanOutRoomKey at line 823, SubscriptionUpdateEvent loop at line 827. Swapped. processCreateRoom's finishCreateRoom path (line 1198) already had the correct order. Added regression test TestHandler_ProcessAddMembers_PublishesSubscriptionUpdateBeforeRoomKey that wires both the regular publish callback and the keySender to a shared mockPublisher so the timeline can be asserted across event kinds. https://claude.ai/code/session_013zd5nk3mukiVitkvvhniFi

mliu33

Excellent work! Thanks! Just couple minor suggestions for optimizations

mliu33 · 2026-05-15T12:36:59Z

+		roomkeymetrics.ValkeyErrors.Add(ctx, 1, metric.WithAttributes(attribute.String("op", "Get")))
+		return fmt.Errorf("get room key: %w", err)
+	}
+	// Skip-rotation guard: a prior redelivery of this canonical event already rotated Valkey past req.BaseKeyVersion.


I actually can't think of the case where this might happen. Because Jetstream will not reliever unless you nak, and once the message is delivered, only one pod will receive the message.

Therefore, current version will always be less than or equal to req.BaseKeyVersion because if a prior redelivery already rotates the key, there won't be another redelivery happening.

If there is no such case, we should remove this check and also the one in room-service to reduce redundant valkey db call.

mliu33 · 2026-05-16T06:26:31Z

 	roomID = req.RoomID

+	// Gate: key MUST exist before any Mongo write.
+	pair, err := h.keyStore.Get(ctx, req.RoomID)


One optimization is to pass room key down so that buildAndFanout does not need to make db call to fetch again

mliu33 · 2026-05-16T07:15:46Z

+		roomkeymetrics.ValkeyErrors.Add(ctx, 1, metric.WithAttributes(attribute.String("op", "Rotate")))
+		return fmt.Errorf("rotate room key: %w", err)
+	}
+	roomkeymetrics.KeyRotated.Add(ctx, 1)


In my opinion, I think there is not much benefit to track success rotate/generate key metrics. It's good to track error count, but not so much for success count as adding metric also costs some cpu time.

After rebasing onto main (which brought in #171 room encryption keys), two classes of test adaptations were needed: 1. #171 tests of processAddMembers / processRemoveIndividual now hit the PR's stricter name-field validation. Added EngName/ChineseName to user fixtures and added the GetUser expectation for the requester account. 2. PR tests of processAddMembers that build the Handler via struct literal now reach the unconditional buildAndFanOutRoomKey call added in #171. Added keyStore: testKeyStore, keySender: testKeySender to all such Handler inits so the no-op stubs satisfy the fan-out call. Also retired one non-UUID request ID literal ("req-add-members-ordering") that the PR's UUID-format validation now rejects. No production-code changes — purely test-side reconciliation between the two feature branches.

…ntent (#185) * docs(spec): add room-worker membership fixes design spec Spec for the room-worker membership filtering, system-message sender/content population, request-ID validation, fail-closed HasOrgRoomMembers handling, and DM participant fields work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(plan): add room-worker membership fixes implementation plan Task-by-task plan covering the membership-fix work: filter individual room docs, populate sender/content on system messages, fail-closed HasOrgRoomMembers, X-Request-ID UUID validation, and DM participant fields (Tasks 11-17). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(client-api): align sys-msg row and document Room.uids/accounts Align the system-message type row with the new msg/sender population and document the new Room.uids and Room.accounts DM-participant fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(model): add Room.UIDs/Accounts fields and BuildDMParticipants helper DM rooms now carry both the deterministic UID list and the matching account list as siblings, so clients can render DM members without an extra round-trip. BuildDMParticipants returns the canonically sorted pair for any two users. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(room-worker): add UpdateDMParticipants store method \$set's uids/accounts on a dm/botDM room after the counterpart account has been resolved. Idempotent under JetStream redelivery; returns an error when the target room is missing (the caller treats this as a permanent failure rather than silently skipping). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(room-worker): filter individual room docs to actual members on create/add The MongoDB org-membership backfill was over-broad: it created individual sub-room docs for every member of the source org rooms, not just the users being added. processAddMembers now filters indiv docs to req.Users only, createRoom filters to ResolvedUsers ∪ {requester}, and the backfill gate fires only on the first-org transition so subsequent orgs don't replay it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(room-worker): populate sender and content on member system messages Previously member_added / member_removed / member_left system messages went out with bare member lists and no sender, leaving clients nothing to render the actor or the affected users with. The room-worker now fetches the requester (validating EngName / SectName / SectID along the way), harvests SectName from the unfiltered org members so sect details survive the indiv-doc filter, and renders Content via new sysmsg formatters. Single- and multi-form Content variants cover the individual path and the org-bearing path; a 1-member org expansion correctly stays on multi form so future org members are accounted for. Adds pkg/model.MessageTypeMemberRemoved and MessageTypeMemberLeft to name the two new sys-msg variants the room-worker now emits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(room-worker): validation hardening — UUID request IDs and fail-closed org backfill Two defensive fixes in the same path: - processAddMembers and processCreateRoom now validate that the X-Request-ID header carries a well-formed UUID (v4 or v7) via idgen.IsValidUUID; non-UUIDs return a permanent error so log correlation stays usable. - HasOrgRoomMembers failures no longer fall through silently with the zero-value bool — defaulting hadOrgsBefore=false on error would spuriously trigger first-org backfill on a room that already has org docs, and the duplicate inserts would only be masked by the Mongo unique index. Surface the error so JetStream redelivery retries with fresh state. Comments on these branches trimmed to single-line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(room-worker): set DM participant fields on DM/botDM rooms DM and botDM rooms now persist matching uids/accounts pairs at room creation across all three create paths: - Async DM create (processCreateRoom, dm type) — calls UpdateDMParticipants after counterpart resolution. - Async botDM create (processCreateRoom, botDM type) — same path, bot account resolved like a regular user. - Sync DM create (handleSyncCreateDM) — sets UIDs/Accounts on the CreateRoom literal so the participants land in the same write as the room itself; on a dup-key (the DM existed already), the path is forward-only and does not backfill — the participants are immutable for a DM room, so no recovery work is needed. UpdateDMParticipants returns an error when no room matches the ID (room-not-found is permanent, not silently skipped). The channel create path is left untouched; a guard test pins that channels must omit UIDs/Accounts so omitempty drops them from BSON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(room-worker): adapt tests for room-encryption-keys merge After rebasing onto main (which brought in #171 room encryption keys), two classes of test adaptations were needed: 1. #171 tests of processAddMembers / processRemoveIndividual now hit the PR's stricter name-field validation. Added EngName/ChineseName to user fixtures and added the GetUser expectation for the requester account. 2. PR tests of processAddMembers that build the Handler via struct literal now reach the unconditional buildAndFanOutRoomKey call added in #171. Added keyStore: testKeyStore, keySender: testKeySender to all such Handler inits so the no-op stubs satisfy the fan-out call. Also retired one non-UUID request ID literal ("req-add-members-ordering") that the PR's UUID-format validation now rejects. No production-code changes — purely test-side reconciliation between the two feature branches. * docs(plan,spec) + test(room-worker): align scope statements and extend integration tests Documentation: the plan's architecture summary and file-map preamble, and the spec's §2.6 closing line, all claimed there were no model or store interface changes — contradicting Tasks 11-12 (Room.UIDs/ Accounts + BuildDMParticipants, UpdateDMParticipants on SubscriptionStore). Updated the three lines so the scope statements match what the plan and spec actually deliver. Tests: the existing room-worker feature integration tests exercised the code paths for the four bug fixes but did not assert the user-facing artifacts those fixes produce. Extended: - TestSyncCreateDM_DM_PersistsRoomAndSubs + TestProcessCreateRoomDM- PersistsTwoSubsAndZeroMembers: assert room.UIDs and room.Accounts are persisted (sorted by uid, paired by index) — covers DM participant fields persistence. - TestProcessAddMembers_PublishesLocalInbox_Integration: assert the canonical members_added sys-message carries UserAccount=requester and formatter-rendered Content — covers empty-Content + missing- sender bugs on add. - TestProcessRemoveIndividual_PublishesLocalInbox_Integration: same assertion for the member_removed sys-message — covers empty-Content + missing-sender bugs on forced removal. Org-member duplication remains covered by handler_test.go unit tests; the existing AddMembers integration test uses only direct users so it doesn't exercise the duplication scenario. * feat(broadcast-worker): fan out DM/botDM mutation events via room.Accounts; skip bots fanOutMutationEvent previously called ListSubscriptions on every DM edit/delete to recover recipient accounts — but Room.Accounts (denormalized onto the room doc by room-worker for DM/botDM rooms) already carries the same data. Drop the Mongo round-trip and iterate room.Accounts directly. Merge the DM and BotDM cases (BotDM previously fell into the unknown-room-type default branch and logged a misleading warning) and skip bot accounts so the bot in a botDM does not receive its own edit/delete echo on a subject nothing listens on. Add a local isBot helper mirroring the predicate already present in message-gatekeeper and room-service; promotion to a shared pkg/botid is the right cleanup once a fourth duplicate is no longer acceptable. Tests: - existing DM edit/delete fan-out tests switched fixtures from a ListSubscriptions mock to Accounts on the Room struct (same coverage) - new TestHandleUpdated_BotDMRoom_SkipsBotAccount asserts the bot is filtered in a botDM room Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(room-worker): single-write DM/botDM create; dedupe displayName Addresses two PR #185 review comments from mliu: 1. processCreateRoom previously inserted the room and then ran a follow-up UpdateDMParticipants UpdateOne to backfill UIDs/Accounts after the counterpart user was resolved. That second Mongo round-trip is avoidable for new rooms: fetch the counterpart upfront, populate Room.UIDs/Accounts on the literal, and let CreateRoom persist the complete document in one write — matching the sync DM path's pattern. UpdateDMParticipants is removed from SubscriptionStore (interface + Mongo impl + mock); no callers remain. The duplicate-key replay branch keeps the existing identity-equality semantics (Type/SiteID/Name/CreatedBy must match); it no longer rewrites UIDs/Accounts onto the existing doc, since rooms created by this code path already have those fields and legacy rooms are out of scope for the hot path. The newly-similar processCreateRoomDM/BotDM helpers collapsed into the switch, mirroring the inline if-RoomTypeBotDM pattern handleSyncCreateDM already uses. 2. displayName(u) used to return TrimSpace(EngName + " " + ChineseName). For account-only users where EngName == ChineseName, that rendered as "Bob Bob". Short-circuit to a single rendering when the two fields are equal. Channels remain the only consumer of system-message content; DM/botDM paths don't render via this helper. Tests: - F1/F2 (DM/BotDM SetsParticipantFields) now capture the Room passed to CreateRoom and assert UIDs/Accounts on it, replacing the previous UpdateDMParticipants arg-pin - CollisionMismatchType test threads the upfront counterpart GetUser - TestFormatLeft_EngEqualsChineseRendersOnce pins the new dedupe contract - mocks regenerated; integration test compile-clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(room-worker): sys-message format follow-ups for PR #185 Three follow-ups from PR #185 review: 1. Drop validateUserNames + its five call sites (mliu PR comment). EngName and ChineseName are now treated as optional. displayName falls through to a single non-empty side, and finally to the Account string when both are blank — the rendered body never collapses to a bare quote pair. 2. Quote display names in every channel sys-message: `"Alice Anderson 王愛麗" added "Frank Fischer 費法蘭" to the channel` (and analogous shapes for added-multi / removed-user / removed-org / left). formatRemovedOrg quotes the SectName too. 3. Populate the sender envelope (Message.UserID + UserAccount) on the member_left / member_removed sys-messages emitted by processRemoveIndividual and processRemoveOrg. message-worker looks up the user by UserID to fill the Cassandra `sender` Participant column; previously these rows landed with sender = null. Self-leave reuses the already-fetched leaving user; forced-removal and org-removal fetch the requester via the existing store.GetUser (no new store method needed). Older Cassandra rows are intentionally not backfilled — only newly emitted messages get the populated sender. Also capitalise "A new room has been created" on the room_created sys-message body. Tests: - sysmsg_test: drop TestValidateUserNames; add TestDisplayName covering every fallback branch and TestFormatLeft_FallsBackToAccount / TestFormatAddedSingle_SingleNameSide; update existing formatter assertions for the new quoted shape. - handler_test: drop the four empty-name permanent-error tests (TestHandler_ProcessAddMembers_RequesterEmptyName, _AddedUserEmptyName, TestHandler_ProcessRemoveIndividual_EmptyName, TestProcessCreateRoom_RequesterEmptyName_ReturnsPermanent); update existing content assertions to the quoted format; mock GetUser on the requester for forced-remove / org-remove paths and assert sysMsg.UserID is set to the requester ID. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(room-worker): update integration test assertions for quoted sys-msg shape Two integration assertions still expected the unquoted sys-message bodies; mirror the change from the prior commit so the live MongoDB+Cassandra container test agrees with the in-memory handler tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mliu33

Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mliu33

Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mliu33

Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mliu33

Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mliu33

Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #171 follow-up Finding 1: room-service stamped the current Valkey key version into RemoveMemberRequest.BaseKeyVersion and room-worker skipped rotation when Valkey was already past that baseline. The reviewer pointed out that JetStream only redelivers on NAK and a single delivery only ever reaches one pod, so the post-rotate redelivery scenario the guard was shielding against can't actually arise — and the extra Valkey Get on the hot remove path is unnecessary work. Drop the whole machinery: remove BaseKeyVersion from RemoveMemberRequest, delete the keyStore.Get-and-stamp block in room-service/handleRemoveMember, drop the shouldRotate computation and parameter threading through processRemoveOrg / processRemoveIndividual in room-worker, and unconditionally rotate-and-fan-out (the existing `len(accounts) > 0` guard on the org path stays since rotation is genuinely a no-op when nothing was deleted). Spec + plan updated: Finding 1 is removed from Part 7 in both the design doc and the plan, the remaining two findings are renumbered, and the "Skip-rotation guard" test in room-worker plus the "StampsBaseKeyVersion" test in room-service are deleted. The two ProcessRemoveIndividual content tests pick up a ListByRoom expectation because the rotate path now always runs.

@mliu33

Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #171 follow-up Finding 1: room-service stamped the current Valkey key version into RemoveMemberRequest.BaseKeyVersion and room-worker skipped rotation when Valkey was already past that baseline. The reviewer pointed out that JetStream only redelivers on NAK and a single delivery only ever reaches one pod, so the post-rotate redelivery scenario the guard was shielding against can't actually arise — and the extra Valkey Get on the hot remove path is unnecessary work. Drop the whole machinery: remove BaseKeyVersion from RemoveMemberRequest, delete the keyStore.Get-and-stamp block in room-service/handleRemoveMember, drop the shouldRotate computation and parameter threading through processRemoveOrg / processRemoveIndividual in room-worker, and unconditionally rotate-and-fan-out (the existing `len(accounts) > 0` guard on the org path stays since rotation is genuinely a no-op when nothing was deleted). Spec + plan updated: Finding 1 is removed from Part 7 in both the design doc and the plan, the remaining two findings are renumbered, and the "Skip-rotation guard" test in room-worker plus the "StampsBaseKeyVersion" test in room-service are deleted. The two ProcessRemoveIndividual content tests pick up a ListByRoom expectation because the rotate path now always runs.

@mliu33

Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #171 follow-up Finding 1: room-service stamped the current Valkey key version into RemoveMemberRequest.BaseKeyVersion and room-worker skipped rotation when Valkey was already past that baseline. The reviewer pointed out that JetStream only redelivers on NAK and a single delivery only ever reaches one pod, so the post-rotate redelivery scenario the guard was shielding against can't actually arise — and the extra Valkey Get on the hot remove path is unnecessary work. Drop the whole machinery: remove BaseKeyVersion from RemoveMemberRequest, delete the keyStore.Get-and-stamp block in room-service/handleRemoveMember, drop the shouldRotate computation and parameter threading through processRemoveOrg / processRemoveIndividual in room-worker, and unconditionally rotate-and-fan-out (the existing `len(accounts) > 0` guard on the org path stays since rotation is genuinely a no-op when nothing was deleted). Spec + plan updated: Finding 1 is removed from Part 7 in both the design doc and the plan, the remaining two findings are renumbered, and the "Skip-rotation guard" test in room-worker plus the "StampsBaseKeyVersion" test in room-service are deleted. The two ProcessRemoveIndividual content tests pick up a ListByRoom expectation because the rotate path now always runs.

@mliu33

* docs(spec): member-add improvements + DM dedup + CreatedBy/TargetUser cleanups Five-part spec, perf-hardened per second /simplify pass. Parts 1-3: org→individual silent no-op fix, deptId support with prefer-dept-on-overlap, and CreateRoomDialog dedup-reply short-circuit. New GetCapacityCheckPipeline + GetAddMemberCandidatesPipeline split capacity-counting (3 stages, no room_members lookup) from full-candidate resolution (4 stages, room_members lookup via member.id covered by the existing unique index). New (deptId, account) index keeps both branches of the extended $or index-covered. Enrichment _orgMatch pipeline stays at 3 inner stages by emitting raw {name, tcName} and combining in Go; combineWithFallback moves to pkg/displayfmt as the single source of truth for both sys-msg and member-list display. Part 4: drops Room.CreatedBy and rewrites the JetStream replay-equivalence check to use a reusable reconcileRoomOnDuplicateKey helper. The new check verifies (Type, SiteID) structural match and requester-sub-exists; this fixes DM concurrent-create (today incorrectly rejected as "room ID collision") while still catching real ID collisions (requester has no sub in the unrelated existing room). Part 5: drops never-written target_user column from all four Cassandra message tables, the Go struct, history-service read-side baseColumns, and the docs. All new queries verified index-covered; performance/indexes table in Part 2 enumerates every read path. * docs(plan): implementation plan for member-add improvements + cleanups 14 tasks across five parts mirroring the spec, hardened per the second /simplify pass. Notable updates over the first draft: - pkg/displayfmt shared package introduced in Task 7 — single source of truth for CombineWithFallback, consumed by both room-worker (sys-msgs) and room-service (member-list enrichment). - Task 11 enrichment pipeline kept at 3 inner stages (vs. the earlier 6-stage draft) by emitting raw {name, tcName} and applying the combine in Go after cursor.All. - Task 13 (Room.CreatedBy removal) extracts the duplicate-key recovery into a single Handler.reconcileRoomOnDuplicateKey helper used by both processCreateRoom and the sync-DM path. Adds a "no extra writes" pre/post snapshot assertion to guard against silent-dedup false passes. Frontend and docs/client-api.md grep uses word-boundary regex to avoid stripping the unrelated createdByAccount field. - Task 14 (target_user removal) adds explicit test-fixture cleanup steps covering pkg/model/cassandra/message_test.go and the two history-service integration tests that today assert on .TargetUser. TDD per task: failing test → minimal implementation → green → commit. Each task is independently committable and reviewable. * feat(pipelines): add GetCapacityCheckPipeline + GetAddMemberCandidatesPipeline https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * feat(model): add SectTCName, DeptID, DeptName, DeptTCName fields to User https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * feat(room-service): add (deptId, account) index for deptId-matching pipelines https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * feat(room-worker): add ListAddMemberCandidates with per-candidate flags Adds AddMemberCandidate type, ListAddMemberCandidates to SubscriptionStore interface and MongoStore implementation backed by GetAddMemberCandidatesPipeline, returning per-user {hasSubscription, hasIndividualRoomMember} flags so the handler can split candidates into needSub vs needIRM upgrade paths. Regenerates mock and adds integration test covering the bug scenario (sub-without-IRM) plus new-member and full-member states. Also includes room-service CountNewMembers integration test from the preceding task that was not yet committed. https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * fix(chat-frontend): dedup reply navigates directly without summaries-wait When the server replies with "dm already exists", the handleSubmit now calls onCreated+onClose immediately and returns, bypassing setPendingRoom entirely. This prevents the 3-second "taking longer than expected" banner from firing on a BUCKETS_LOADED race where summaries is still empty at the moment the dedup response lands. https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * refactor(cassandra): drop unused target_user column from message schema The target_user column was never written anywhere; all reads returned NULL. Removes the field from the Go struct, all four init CQL files, the history-service baseColumns select lists, both docker-compose inline schemas, the canonical docs, and all test fixtures/assertions. https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * feat(pipelines): match deptId alongside sectId in candidate $or Extends matchCandidates (used by GetCapacityCheckPipeline and GetAddMemberCandidatesPipeline) and GetNewMembersPipeline to include a deptId branch in the $or filter so users whose org affiliation is stored in deptId are picked up the same way as those in sectId. Also updates the existing pipeline unit tests that asserted the old $or clause count, and adds an integration test (TestMongoStore_ListAddMemberCandidates_DeptMatching_Integration) that proves both fields are matched correctly. https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * feat(room-service): enrichment prefers dept on overlap with Go-side combine, falls back to orgID https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * fix(room-worker): create individual room_members row on org→individual member.add Rewrite processAddMembers around the per-candidate ListAddMemberCandidates flags: split candidates into needSub (no subscription yet) and needIRM (no individual room_members row yet, gated by writeIndividuals). This unblocks the org→individual upgrade path — a user previously added via org expansion (subscription + org room_members row, no individual row) can now be re-added via req.Users and get the missing individual row WITHOUT a duplicate subscription. Remove ListNewMembers from the store interface and its MongoStore impl; the new ListAddMemberCandidates fully supersedes it. * refactor(room-worker): OrgMemberStatus carries (Name,TCName,IsDept); lookup via member.id * feat(room-worker): processRemoveOrg uses dept-first tiebreak with displayOrg + orgID fallback * refactor(room-worker): drop Room.CreatedBy, extract reconcileRoomOnDuplicateKey, use requester-sub-exists for replay equivalence The JetStream duplicate-key recovery compared Type+SiteID+Name+CreatedBy on the existing room, which was over-determined and incorrectly rejected DM concurrent-create by counterpart: the loser's worker saw existing CreatedBy != room.CreatedBy and returned "room ID collision" even though the room IS the intended DM and both subscriptions were already written by the winner. Switch to Type+SiteID+requester-sub-exists. If the requester already has a subscription in the existing room, the collision is either a legitimate JetStream redelivery or the concurrent-create-equivalent case. Extract a reconcileRoomOnDuplicateKey helper so both processCreateRoom and handleSyncCreateDM share one source of truth. CreatedBy was also exposed on the wire (json:"createdBy") and in the frontend Room type but never read. Remove the field entirely from pkg/model.Room and update the client-api docs (createdByAccount on the create-request schema is a separate field and stays intact). The two non-listed test files (room-service/integration_test.go, inbox-worker/handler_test.go) had to drop CreatedBy fixture lines to keep compilation green after the model-field removal. * test: add pipeline unit tests + strengthen DM concurrent test; polish slog + pre-alloc + godoc - pkg/pipelines/member_test.go: add TestGetCapacityCheckPipeline and TestGetAddMemberCandidatesPipeline covering stage counts, $match shapes for orgIDs-only / directAccounts-only / both, excludeAccount $ne, and panic on empty roomID - pkg/pipelines/member.go: add "Panics if roomID is empty" to godocs for GetCapacityCheckPipeline and GetAddMemberCandidatesPipeline - room-worker/handler.go processRemoveOrg: add requestID field to slog.Warn for log correlation; pre-allocate needSub/needIRM slices with make(…, 0, cap) - room-worker/store_mongo.go: add godoc to BulkCreateSubscriptions explaining silent dup-key swallow, idempotency contract, and why it is required https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * fix: 4 review findings — reconcileRoomOnDuplicateKey idempotency, spurious sys-msg on upgrade, dept-blind membership lookups, room-service/worker display parity - room-worker/handler.go reconcileRoomOnDuplicateKey: drop requester-must-be- member check; structural (Type+SiteID) match is sufficient. The earlier guard turned mid-write crashes (CreateRoom succeeded, worker died before BulkCreateSubscriptions) into permanent failures that orphaned rooms in Mongo. With the requester check removed, JetStream redelivery finds the existing room, finishes the unfinished subscription writes, and the room is no longer stuck. Updated both call sites (processCreateRoom and handleSyncCreateDM) and the existing collision-mismatch tests; added a channel mid-write-crash integration test. - room-worker/handler.go processAddMembers: gate MemberAddEvent publish and members_added sys-msg construction on len(actualAccounts)>0 || len(req.Orgs)>0. The org→individual upgrade path (only needIRM non-empty) writes the missing room_members row silently — no membership state changed for the room, so emitting an empty member-add event and an "added members to the channel" sys-msg with no actual members would mislead end users. Extended the existing OrgToIndividualUpgrade unit test to assert neither event fires. - room-service/store_mongo.go GetSubscriptionWithMembership and room-worker/store_mongo.go GetUserWithMembership: extend the org-lookup $let from {sectId} to {sectId, deptId} and the inner $match from a single $eq on member.id to a $or over both. A user added via Orgs:["X"] whose deptId is "X" (no sectId match) previously reported HasOrgMembership=false — the symmetric bug to the GetOrgMembersWithIndividualStatus fix from Part 1. Added a dept-only-match regression test in each integration suite. - room-service/store_mongo.go getRoomMembers Go-side combine: switch from unconditional dept-on-overlap to a two-pass dept-first-with-fallback that mirrors room-worker's processRemoveOrg. A user with deptId="X" + empty deptName plus a sibling with sectName="Engineering" now renders "Engineering" instead of the orgID, restoring byte-identical output between member-list and sys-msg paths. Added a regression test for this exact case to TestMongoStore_ListRoomMembers_Enrich_Integration. https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA * feat(room-service): ListOrgMembers matches sectId OR deptId Symmetric to the dept-aware membership lookups landed in e016623 (GetSubscriptionWithMembership, GetUserWithMembership) and to GetOrgMembersWithIndividualStatus in room-worker. The org-row "expand" RPC (chat.user.{a}.request.orgs.{orgId}.members) previously returned errInvalidOrg for any org added by deptId, so the member dialog rendered "Failed to load members" on dept-scoped org rows. Both (sectId, account) and (deptId, account) indexes back the $or so the query stays index-covered. Added two integration sub-tests under TestMongoStore_ListOrgMembers_Integration covering the distinct-sectId/deptId case and the sectId==deptId invariant case. Updated docs/client-api.md and the branch spec (2026-05-19-org-to-individual-membership-upgrade-design.md) to reflect the broadened semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(sast): pin runtime toolchain on sast-vuln target govulncheck shells out to `go list` at runtime; when the system Go diverges from TOOLS_GO_TOOLCHAIN, the binary (built with 1.25) sees a 1.26 source tree and panics with "file requires newer Go version". Setting GOTOOLCHAIN at runtime makes the dispatcher pick the same toolchain the binary was built against. Belt-and-suspenders even when the system Go matches the pin — keeps the target robust to future toolchain skews. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(room-service): reject phantom org IDs and accounts at request time Both member.add and channel-create silently accepted orgs/users with no backing user document. The candidates aggregation dropped them, so the async-job reply reported success while the worker either wrote a room_members row for a zero-user org (firing a spurious members_added sys-msg) or skipped the targeted user entirely. Gate at the room-service request boundary with two new store methods (FindExistingOrgIDs, FindExistingAccounts) and two handler helpers (validateOrgIDs, validateAccountsExist). errInvalidOrg / errUserNotFound sentinels were already in sanitizeError's allow-list; clients now see a clean synchronous error envelope and nothing reaches the canonical stream. Also folds in two follow-ups from the prior review pass: - Mongo testimage comment now records that the retired 4.4.15 guard against partialFilterExpression operator-allow-list regressions has been intentionally given up under the mongo:8 bump. - ListOrgMembers integration test gains a "neither sectId nor deptId matches" sub-test so dedupe across the broadened \$or is exercised. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 6 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 15 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(room-worker): address PR #171 follow-up review findings Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: apply branch-review findings — panic→error, backfill fail-hard, phantom-validation tests - pkg/pipelines: convert GetCapacityCheckPipeline / GetAddMemberCandidatesPipeline from panic-on-empty-roomID to (bson.A, error); library code should not crash the worker on bad caller input. Extract bot/pseudo regex into package-level const so GetNewMembersPipeline and matchCandidates can't drift. - room-worker/processAddMembers: propagate backfill query errors instead of log-and-continue. Log-and-continue would silently corrupt room_members (existing subs would never get IRM rows). Retry is safe — subs are already written so needSub is empty, hadOrgsBefore stays false until BulkCreateRoomMembers commits, and the backfill re-runs cleanly. - room-worker: requestID propagation on the two member-add publish-error logs now inside the gated block; wrap bare returns after reconcileRoomOnDuplicateKey and rotateAndFanOut; refine error-wrap messages on the dup-key fetch path. Trim BulkCreateSubscriptions doc to one why-line. Project u.account on GetSubscriptionAccounts. - room-service: rename _orgRaw BSON tag to orgRaw, add omitempty across orgRawDisplay siblings. Clarify validateOrgIDs / validateAccountsExist godoc to record the fmt.Errorf wrap-with-input shape that callers see via errors.Is. - room-service tests: consolidate 5 sibling AddMembers phantom tests into a single table-driven case; add parallel CreateRoomChannel phantom table so the same gate is exercised on both paths; add store-error branch cases to cover the FindExistingOrgIDs / FindExistingAccounts error propagation path (was untested per CLAUDE.md §4). - room-worker tests: add backfill-error fail-hard tests for both GetSubscriptionAccounts and FindUsersByAccounts failure paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(make): drop COMPOSE_BAKE/COMPOSE_PARALLEL_LIMIT from `make up` Removes the inline `COMPOSE_BAKE=false COMPOSE_PARALLEL_LIMIT=1` env-vars from both branches of the `up` target. The Docker Compose CLI applies its own defaults; pinning them here predates the local toolchain and is no longer needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: resolve coderabbit findings on PR #212 - pkg/displayfmt/combine_test.go: switch to in-package tests (`package displayfmt`) per CLAUDE.md §4 rule; drop the now-unused local import. - room-worker/handler.go: wrap bare `return nil, reconcileErr` in processSyncCreateDM's dup-key reconcile branch with sync-DM context (CLAUDE.md §3 "Never return bare err"). - room-service/integration_test.go: make two dept-only test cases actually dept-only by setting SectID to a non-matching value (was `dept-x`, so the tests would have passed via sectId match alone — defeating the intent of "dept-only branch is doing its job"). - room-worker/store.go: refresh stale doc on ListNewMembersForNewRoom — refer to ListAddMemberCandidates (the current contract) instead of the removed ListNewMembers. - docs/superpowers: rename stale `_orgRaw` → `orgRaw` in both the plan snippet and the spec narrative to match the renamed BSON tag; add language identifiers (```sh) to the four shell-command fenced blocks the reviewer called out in the plan. Skipped intentionally: - Makefile COMPOSE_BAKE finding: the user explicitly asked to REMOVE COMPOSE_BAKE/COMPOSE_PARALLEL_LIMIT (commit 3df2b19); bot's suggestion to set COMPOSE_BAKE=true contradicts that directive. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(client-api): document users/orgs/channels on Create Room Close the consistency gap CodeRabbit flagged on PR #212: the Create Room request schema was missing users/orgs/channels even though the handler accepts them and the same section's error envelope mentions "invalid org" / "user not found" rejections for those exact fields. Add them to the request-body table (marked channel-only and optional, matching the gate the worker enforces) and update the example JSON so the schema, example, and error envelope agree. * refactor(room-service,room-worker): drop unused single-row CRUD store methods All production write paths now use Bulk* variants (BulkCreateSubscriptions, BulkCreateRoomMembers); the single-row CreateSubscription, CreateRoom (room- service only), and CreateRoomMember linger on the store interfaces with zero live callers and are only kept alive by integration-test seeding. Remove them from the interface + Mongo impl, point the test seed sites at the existing mustInsertRoom / mustInsertSub helpers (direct collection inserts), and drop the now-misleading "existing methods (invite flow)" section comment in room-worker/store.go. Mocks regenerated; lint, unit tests, and integration tests all green locally. * refactor: drop shouldRotate guard and BaseKeyVersion per PR #171 review PR #171 follow-up Finding 1: room-service stamped the current Valkey key version into RemoveMemberRequest.BaseKeyVersion and room-worker skipped rotation when Valkey was already past that baseline. The reviewer pointed out that JetStream only redelivers on NAK and a single delivery only ever reaches one pod, so the post-rotate redelivery scenario the guard was shielding against can't actually arise — and the extra Valkey Get on the hot remove path is unnecessary work. Drop the whole machinery: remove BaseKeyVersion from RemoveMemberRequest, delete the keyStore.Get-and-stamp block in room-service/handleRemoveMember, drop the shouldRotate computation and parameter threading through processRemoveOrg / processRemoveIndividual in room-worker, and unconditionally rotate-and-fan-out (the existing `len(accounts) > 0` guard on the org path stays since rotation is genuinely a no-op when nothing was deleted). Spec + plan updated: Finding 1 is removed from Part 7 in both the design doc and the plan, the remaining two findings are renumbered, and the "Skip-rotation guard" test in room-worker plus the "StampsBaseKeyVersion" test in room-service are deleted. The two ProcessRemoveIndividual content tests pick up a ListByRoom expectation because the rotate path now always runs. * docs: drop stale createdBy / createdByAccount from Room and Create Room Neither field exists on the wire. Room.CreatedBy was removed from the model earlier in this PR (Task 13), and createdByAccount was never a field on CreateRoomRequest — the creator's account is taken from the {account} segment of the create-room subject by ParseRoomCreateSubject and never read from the request body. - docs/client-api.md: remove the createdByAccount row from the Create Room request-body table and from the example JSON; add a one-liner pointing readers at the subject-derived creator. - docs/superpowers/spec.md: remove the stale CreatedBy row from the Room schema table in section 4.1. - docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md: rewrite Step 11/12 — the original plan instructed the grep to spare createdByAccount on the assumption it was a real field; both names should be stripped. Historical design/plan docs (2026-03..2026-05-07) that record the field as it existed at those points in time are left intact. * fix: address remaining CodeRabbit findings on PR #212 room-worker/handler.go (Major): Backfill in processAddMembers fed FindUsersByAccounts the existing subscriber accounts and assumed the result covered every requested one. When a stale subscription points at a deleted user document the call returns a subset; the buggy code committed room_members for the rows it got, BulkCreateRoomMembers flipped hadOrgsBefore=true, and no future redelivery could ever repair the missing IRM rows (backfill only fires on the first-org transition). Add a fail-hard guard that compares returned accounts against the requested set and returns newPermanent("backfill user %s not found ...") before any write. New regression test in handler_test.go locks in the contract. docs/client-api.md: Replace the stale "creating a room only enrolls the owner" line — it contradicts the request schema introduced earlier this PR, which now documents users/orgs/channels enrollment on create. docs/superpowers/specs/2026-05-19-...-upgrade-design.md: Reconcile contradictory stage-count claims for the _orgMatch $lookup inner pipeline. The later "grows to 6 stages" sentence was wrong; verified against room-service/store_mongo.go:520-552 — the pipeline is 3 stages ($match → $addFields → $group), same as before. The $group merely widens its accumulator set. pkg/testutil/testimages/testimages.go + docker-local/compose.deps.yaml: Replace the floating "mongo:8" / "valkey/valkey:8-alpine" tags with patch-pinned mongo:8.2.9 / valkey/valkey:8.1.7-alpine so testcontainers can't drift across patch releases between CI runs. * docs(client-api): widen userCount note for create-with-initial-members The Room schema table still claimed userCount is "1 immediately after creation (the owner)", but the same section now documents users / orgs / channels enrollment at create time, so the count can land >1. Rewrite the note to match. * refactor(room-service): fold FindExistingOrgIDs dedupe into one pass Per @mliu33's review on room-service/store_mongo.go:770. The previous implementation iterated sectIDs, deptIDs, then a separate map → slice loop (three passes). Build the result slice directly inside the sectIDs / deptIDs loops, using the map only as a dedupe set, so we make two passes instead of three and skip the map → slice copy. Behavior unchanged; existing integration test covers it. * perf(room-service): collapse FindExistingOrgIDs into one Mongo round-trip Per @mliu33's review: the previous implementation issued two Distinct calls (one on sectId, one on deptId) and then merged + deduped in Go. That's two network round-trips to Mongo per room.member.add and channel-create request, which run on every gate check. Rewrite as a single aggregate using $unionWith. The main pipeline does the sectId distinct, $unionWith folds in the deptId distinct, and a terminal $group dedupes the union server-side. Each $match sits at the top of its own pipeline so the (sectId, account) and (deptId, account) compound indexes both still cover their respective branches — explain() against the integration container confirms DISTINCT_SCAN on both indexes (Mongo even applies $groupByDistinctScan), no COLLSCAN. Net: 1 round-trip instead of 2, same index coverage, less Go-side allocation (no more dedup map + intermediate slices). Added a regression test for the overlap case (same orgID matched via sectId on one user + deptId on another) that the existing suite didn't pin down. * revert: drop $unionWith aggregation in FindExistingOrgIDs Restore the two parallel Distinct calls from before 95f7bd7. The $unionWith version was benchmarked at ~8.5% wall-clock improvement (885µs → 810µs per call against a local Mongo) for one fewer round-trip, but at the cost of ~55% more Go-side allocations per call and a deeper dependency on Mongo's aggregation optimizer. With the gain that small and the call rate at this gate not on the critical hot path, the simpler two-Distinct form is the better trade: less code, less BSON marshaling, version-agnostic from Mongo 4.4 onward, and explicit debuggability (each Distinct's error names which field failed). Verified TestMongoStore_FindExistingOrgIDs_Integration passes on both mongo:8.2.9 (current default) and mongo:4.4.30, including the new overlap-dedup case added in the previous commit. Comment on the function records why we landed back here so a future reader doesn't re-do the round-trip migration without fresh perf numbers. * perf(pipelines): project $lookup sub-results to {_id: 1} Per @mliu33's review on pkg/pipelines/member.go (lines 112 and 135). All four $lookup sub-pipelines in this file end with $match + $limit:1, then the outer pipeline only reads $size of the resulting array to check emptiness — it never touches any field of the joined doc. Mongo was still fetching and shipping the full subscription / room_members document into _sub / _irm / existingSub on every match. Append $project {_id: 1} after each $limit so each entry in the join array drops to ~30 bytes instead of the full doc. Same emptiness check ($eq: []) and same $size semantics work unchanged. Affects: - GetNewMembersPipeline (existingSub lookup) - GetCapacityCheckPipeline (_sub lookup) - GetAddMemberCandidatesPipeline (_sub and _irm lookups) No behavior change; existing unit + integration tests cover all three pipelines (TestMongoStore_FindExistingOrgIDs, ListAddMemberCandidates, CountNewMembers) and remain green. * refactor(pipelines): drop GetCapacityCheckPipeline — duplicate of GetNewMembersPipeline Per @mliu33's review on room-service/store_mongo.go:300. The two functions produced byte-identical pipelines for roomID != "" (modulo cosmetic variable names: existingSub/userAccount vs _sub/acct). GetNewMembersPipeline already handles both cases natively — it returns just the $match stage when roomID is empty, and adds the $lookup + not-already-subscribed $match when roomID is non-empty. - pkg/pipelines/member.go: delete GetCapacityCheckPipeline. Refactor GetNewMembersPipeline to use the shared matchCandidates() helper for the $match stage (DRY with GetAddMemberCandidatesPipeline). - pkg/pipelines/member_test.go: delete TestGetCapacityCheckPipeline. Behavior is fully covered by the existing TestGetNewMembersPipeline* cases (both branches). - room-service/store_mongo.go: CountNewMembers no longer needs the roomID-empty bifurcation — it now calls GetNewMembersPipeline unconditionally with the supplied roomID. - docs/superpowers/specs/2026-05-19-...-upgrade-design.md: spec rewritten to reflect the consolidation. The plan doc (a historical record of Task 1 as originally executed) is left intact. * refactor(room-worker): single-pass dept-first tiebreak in processRemoveOrg Per @mliu33's review on room-worker/handler.go:562. The previous two-pass logic scanned `members` once for a dept row and then (only if none was found) scanned it again for the first sect row. OrgMemberStatus already carries the `IsDept` flag, so we can do this in a single pass: stash the first sect candidate as we scan, then let any encountered dept row override it (and short-circuit further iteration). After the loop, fall back to the stashed sect candidate only if no dept row appeared. No behavior change — every TestHandler_ProcessRemoveOrg_DeptFirstTiebreak subcase (all sect, all dept, mixed dept-wins, all-names-empty → orgID fallback) still passes. Saves one full pass over the members slice on the org-remove hot path. * fix(room-worker): preserve sub when removed org has a sibling covering the same user Multi-org overlap bug introduced by Part 2 of this PR (dept-aware org matching). Before: orgs matched only on sectId, so a user could be reachable in a room via at most one org. After Part 2: orgs match sectId OR deptId, so a user with both fields set can be covered by two org rows simultaneously (one matching their sectId, one matching their deptId). processRemoveOrg's old toRemove rule only checked HasIndividualMembership: any org-only user was scheduled for sub deletion. With multi-org overlap, this silently orphans the user's sub even though the sibling org row still claims them as a member of the room. They lose notifications, can't see new messages, and effectively get booted — visible only on the next list-members call where the sibling org still expands to them. Fix is two-layer: - room-worker/store.go: add HasOtherOrgMembership bool to OrgMemberStatus. - room-worker/store_mongo.go: GetOrgMembersWithIndividualStatus gains a second $lookup against room_members. For each candidate, search for another org row whose member.id matches their sectId or deptId (excluding the orgID being removed). Same $project: {_id: 1} trick as the other lookups to keep the join entry small. - room-worker/handler.go: processRemoveOrg skips members where HasOtherOrgMembership=true from toRemove. Org row deletion + count reconcile + sys-msg still fire (the org WAS removed); no rotate, no per-account events when toRemove is empty. Tests: - Handler unit test (TestHandler_ProcessRemoveOrg_OtherOrgCovers_PreservesSub): mock returns alice with HasOtherOrgMembership=true; asserts DeleteSubscriptionsByAccounts and ListByRoom are never called. - Pipeline integration test (TestMongoStore_GetOrgMembersWithIndividualStatus_OtherOrgCovers_Integration): alice has sectId=X + deptId=Y, both X and Y as org rows; queries org X and asserts alice.HasOtherOrgMembership=true (Y still covers her) while carol — sectId=X only — has HasOtherOrgMembership=false. Both tests were red before the corresponding fix and green after; existing TestHandler_ProcessRemoveOrg* and TestMongoStore_GetOrgMembersWithIndividualStatus* cases still pass. --------- Co-authored-by: Claude <noreply@anthropic.com>

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

Comment thread docs/superpowers/plans/2026-05-08-room-encryption-keys.md

Comment thread docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md

Comment thread inbox-worker/main.go Outdated

vjauhari-work force-pushed the claude/room-encryption-keys-5vlQ2 branch from 44f115c to c7bff9d Compare May 13, 2026 11:33

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread docs/client-api.md Outdated

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread docs/client-api.md Outdated

vjauhari-work force-pushed the claude/room-encryption-keys-5vlQ2 branch from cdabf9d to 9c42a1d Compare May 14, 2026 02:00

mliu33 reviewed May 14, 2026

View reviewed changes

vjauhari-work force-pushed the claude/room-encryption-keys-5vlQ2 branch from 9c42a1d to 9f1bdc1 Compare May 15, 2026 08:16

claude added 20 commits May 15, 2026 09:24

feat(pkg/subject): ServerRoomKeyGet builder for inter-site key RPC

2e78b60

Adds chat.server.request.roomkey.{siteID}.get subject used by inbox-worker to pull a room's keypair from origin during cross-site replication. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

refactor(inbox-worker): merge duplicate rotateLocalKey + replicateRoo…

89f86eb

…mKey into fetchAndStoreKey https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

fix(room-service): skip rotation on org-remove when no subscriptions …

441cd53

…are actually deleted https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m

claude added 6 commits May 15, 2026 09:24

test(room-worker): flip PublicKey assertion to expect empty on wire

b9a627f

Integration test still asserted PublicKey was populated in the fan-out RoomKeyEvent, but the client wire payload no longer carries it (per review comments #10 and #12, omitted via json:",omitempty").

vjauhari-work force-pushed the claude/room-encryption-keys-5vlQ2 branch from 0cb8286 to 860dad0 Compare May 15, 2026 09:25

mliu33 approved these changes May 16, 2026

View reviewed changes

mliu33 merged commit fa2e0ad into main May 16, 2026
11 checks passed

vjauhari-work deleted the claude/room-encryption-keys-5vlQ2 branch May 17, 2026 07:54

coderabbitai Bot mentioned this pull request May 19, 2026

feat: Valkey cluster support with room key ensure RPC #199

Merged

5 tasks

vjauhari-work mentioned this pull request May 20, 2026

Fix member-add bug + dept-aware org matching + Cassandra cleanups #212

Merged

5 tasks

coderabbitai Bot mentioned this pull request May 21, 2026

feat(roomcrypto): direct AES-256-GCM room encryption; add client decoder #207

Merged

9 tasks

coderabbitai Bot mentioned this pull request May 29, 2026

perf(room-worker): marshal room-key once per fan-out + projected survivor list #242

Merged

Conversation

vjauhari-work commented May 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Security properties

Observability

Configuration

Failure modes

Test plan

Changes since initial review

Known follow-ups (intentionally deferred)

Summary by CodeRabbit

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mliu33 left a comment

Choose a reason for hiding this comment

Uh oh!

mliu33 May 15, 2026

Choose a reason for hiding this comment

Uh oh!

mliu33 May 16, 2026

Choose a reason for hiding this comment

Uh oh!

mliu33 May 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

vjauhari-work commented May 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 11, 2026 •

edited

Loading