feat: room encryption keys end-to-end#171
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR implements end-to-end room encryption key management across room-service, room-worker, and inbox-worker: P-256 keypair generation, Valkey-backed storage and versioning (Set/Rotate/SetWithVersion), NATS fan-out and inter-site RPC (ServerRoomKeyGet), OpenTelemetry metrics, gating rules, and tests/docs/deploy updates. ChangesRoom Encryption Keys Implementation
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
room-worker/store_mongo.go (1)
38-49: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winWrap the new
ListByRoomdriver errors with operation context.Both failure paths return bare Mongo errors, which makes room/site-specific failures much harder to diagnose and breaks the repo’s error-wrapping rule.
As per coding guidelines, "Always wrap errors with context using `fmt.Errorf(\"short description: %w\", err)` describing what the current function was doing" and "Never return bare `err`".🧩 Suggested fix
func (s *MongoStore) ListByRoom(ctx context.Context, roomID, siteID string) ([]model.Subscription, error) { filter := bson.M{"roomId": roomID} if siteID != "" { filter["siteId"] = siteID } cursor, err := s.subscriptions.Find(ctx, filter) if err != nil { - return nil, err + return nil, fmt.Errorf("list subscriptions for room %q site %q: find: %w", roomID, siteID, err) } var subs []model.Subscription if err := cursor.All(ctx, &subs); err != nil { - return nil, err + return nil, fmt.Errorf("list subscriptions for room %q site %q: decode: %w", roomID, siteID, err) } return subs, nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@room-worker/store_mongo.go` around lines 38 - 49, The ListByRoom function returns raw Mongo errors; wrap both failure returns with contextual messages using fmt.Errorf so callers get operation and identifier info. Specifically update the error returned from s.subscriptions.Find(...) to something like fmt.Errorf("ListByRoom: find failed for room %s site %s: %w", roomID, siteID, err) and likewise wrap the cursor.All(...) error with fmt.Errorf("ListByRoom: decode failed for room %s site %s: %w", roomID, siteID, err) so failures mention ListByRoom, the roomID/siteID and the underlying err instead of returning bare err.room-service/handler.go (1)
489-551:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStale comment and overloaded
targetIsDualMembershipsemantics in org-remove path.Two related clarity issues around the new rotation-skip logic:
Line 520 comment is now stale. It states "Org removes rotate unconditionally; dual-membership users are filtered in room-worker after the rotation lands." but lines 537–547 explicitly skip rotation when
CountOrgOnlySubs == 0. The org-remove path is no longer unconditional.
targetIsDualMembershipis reused with different semantics. On line 518 it captures a per-user state (target has both individual and org membership). On line 545 it is reused as a "no subscriptions will be deleted → skip rotation" sentinel for an org-remove. For an org-remove, no single "target" user exists, and the flag is set based on the collective condition that every org member is also individually subscribed — that is not the same property. The current name makes the gate at line 551 (!targetIsDualMembership) ambiguous when read in the org-remove context.Rename to something intent-revealing like
skipKeyRotation(or fold the org-only-subs check into the existingelsebranch so the flag is assigned in one place per remove-mode), and update/remove the stale comment.♻️ Suggested restructure
- var targetIsDualMembership bool + // skipKeyRotation = true means no subscriptions will actually be deleted + // (individual: target keeps org membership; org: every org member is also + // individually subscribed), so we don't need to rotate the room key. + var skipKeyRotation bool if req.Account != "" { // ... individual-remove validation ... - targetIsDualMembership = target.HasIndividualMembership && target.HasOrgMembership + skipKeyRotation = target.HasIndividualMembership && target.HasOrgMembership } else { - // Org removes rotate unconditionally; dual-membership users are filtered in room-worker after the rotation lands. - // Owner-removes-org: only the requester's owner role matters here; org members resolved downstream. + // Owner-removes-org: only the requester's owner role matters here; org members resolved downstream. + // Rotation is skipped below when every org member is also individually subscribed — + // the org-remove then deletes only room_members org entries, not subscriptions. sub, err := h.store.GetSubscription(ctx, requesterAccount, roomID) if err != nil { return nil, fmt.Errorf("get requester subscription: %w", err) } if !hasRole(sub.Roles, model.RoleOwner) { return nil, fmt.Errorf("only owners can remove members") } + if h.keyStore != nil { + count, err := h.store.CountOrgOnlySubs(ctx, req.RoomID, req.OrgID) + if err != nil { + return nil, fmt.Errorf("count org-only subs: %w", err) + } + skipKeyRotation = count == 0 + } } // Stable seed for room-worker's deterministic system-message IDs across JetStream redeliveries. req.Timestamp = time.Now().UTC().UnixMilli() - // For org-removes, skip rotation when no subscriptions would actually be deleted. - // This happens when every org member is also individually subscribed (dual-membership), - // so the org-remove only removes room_members org entries — no subscription changes. - if h.keyStore != nil && req.OrgID != "" { - count, err := h.store.CountOrgOnlySubs(ctx, req.RoomID, req.OrgID) - if err != nil { - return nil, fmt.Errorf("count org-only subs: %w", err) - } - if count == 0 { - // No subscriptions will be deleted; skip rotation (member list changes, key does not). - // Fall through to publish the canonical event with NewKeyVersion=0. - targetIsDualMembership = true - } - } - // Rotate before publish so broadcast-worker encrypts under the new key immediately. - // Skip rotation when target is dual-membership: no actual removal happens in that case. - if h.keyStore != nil && !targetIsDualMembership { + // Skip rotation when no actual removal happens (see skipKeyRotation assignment above). + if h.keyStore != nil && !skipKeyRotation {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@room-service/handler.go` around lines 489 - 551, The flag targetIsDualMembership is overloaded and the comment about "Org removes rotate unconditionally" is stale; rename and clarify the logic so the rotation gate is unambiguous: introduce a boolean skipKeyRotation (or set skipKeyRotation instead of targetIsDualMembership) and assign it only to mean "skip key rotation for this request", set it in the individual-remove branch to target.HasIndividualMembership && target.HasOrgMembership (using GetSubscriptionWithMembership) and set it in the org-remove branch based on CountOrgOnlySubs == 0 (using CountOrgOnlySubs with req.RoomID/req.OrgID), then update/remove the stale comment and change the rotation check from if h.keyStore != nil && !targetIsDualMembership to if h.keyStore != nil && !skipKeyRotation so intent is clear.
🧹 Nitpick comments (4)
pkg/roomkeysender/roomkeysender_test.go (1)
83-97: ⚡ Quick winAssert the new non-mutation contract here.
The signature change is meant to keep
Sendfrom mutating the caller’s event, but this test only checks the published payload. Please also assert thattt.evtis unchanged afterSendso the by-value behavior doesn’t regress.🧪 Suggested assertion
for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { + wantTimestamp := tt.evt.Timestamp pub := &mockPublisher{err: tt.publishErr} sender := roomkeysender.NewSender(pub) err := sender.Send(tt.account, tt.evt) + assert.Equal(t, wantTimestamp, tt.evt.Timestamp) if tt.wantErr != "" { require.Error(t, err)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/roomkeysender/roomkeysender_test.go` around lines 83 - 97, The test must also verify Send no longer mutates the caller's event: before calling roomkeysender.NewSender(...).Send, make a deep copy of tt.evt (e.g., orig := deep copy of tt.evt) and after Send returns assert that tt.evt equals orig (use require.Equal or require.True with reflect.DeepEqual) for both success and error paths; update the subtest that constructs mockPublisher and calls sender.Send to create the copy and add the post-call equality assertion so the non-mutation contract for Send is enforced.room-service/keygen_test.go (1)
29-36: ⚡ Quick winThis “round trip” test never validates the private key.
It only proves that
roomcrypto.Encodeaccepts the generated public key. A brokengenerateRoomKeyPairthat returned mismatched public/private keys would still pass here. Please add a decrypt/assertion step, or rename the test if encode-shape is all you want to verify.As per coding guidelines, "Tests must cover: happy path, error paths, edge cases (empty collections, boundary conditions), and invalid input".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@room-service/keygen_test.go` around lines 29 - 36, The test TestGenerateRoomKeyPair_RoundTripWithRoomcrypto only verifies the encoded blob shape but never validates the private key; update the test to perform a full round-trip by calling the corresponding roomcrypto.Decode (or the library's decrypt function) with pair.PrivateKey and the encrypted result from roomcrypto.Encode and assert the decrypted plaintext equals "hello" (or rename the test to reflect that it only checks public-key encoding if you intentionally don't want decryption). Ensure you reference generateRoomKeyPair to get pair.PublicKey/pair.PrivateKey and the roomcrypto.Encode/Decode functions so a mismatched keypair would fail the assertion.docs/superpowers/plans/2026-05-08-room-encryption-keys.md (1)
746-752: ⚡ Quick winUpdate this plan snippet to the repo’s otelnats publish pattern.
This adapter is documented as a plain
*nats.ConnwithPublish(subj, data), but this repo’s standard connection path uses the OpenTelemetry wrapper and a context-aware publish so request IDs/traces flow via NATS headers. As written, the plan nudges follow-up work toward code that won’t match the existing connection type or propagation behavior.Based on learnings:
pkg/natsutil/connect.goreturns an OpenTelemetry-wrapped NATS connection (*otelnats.Conn), andPublish(ctx, ...)is the standard/correct usage across the codebase.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md` around lines 746 - 752, The snippet's natsPublisherAdapter uses a plain *nats.Conn and a Publish(subj, data) signature, but this repo uses the OpenTelemetry-wrapped connection from pkg/natsutil/connect.go (type *otelnats.Conn) and the context-aware Publish(ctx, ...) to propagate request IDs/traces; update natsPublisherAdapter to hold an *otelnats.Conn (or the concrete type returned by connect.go) and expose a Publish method that accepts a context.Context and delegates to the connection's Publish(ctx, subject, data) so headers/OTel context flow correctly.room-worker/handler_test.go (1)
3287-3290: ⚡ Quick winMake the survivor fan-out assertion order-independent.
This test is checking that every survivor receives a key, not that publishes happen in a specific order. The indexed assertions will get brittle if the implementation later dedups through a map or publishes concurrently.
Suggested assertion shape
- subjects := pub.subjects - assert.Contains(t, subjects[0], "chat.user.alice.event.room.key") - assert.Contains(t, subjects[1], "chat.user.bob.event.room.key") - assert.Contains(t, subjects[2], "chat.user.remote-carol.event.room.key") + assert.ElementsMatch(t, []string{ + "chat.user.alice.event.room.key", + "chat.user.bob.event.room.key", + "chat.user.remote-carol.event.room.key", + }, pub.subjects)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@room-worker/handler_test.go` around lines 3287 - 3290, The test currently asserts ordering by indexing into pub.subjects (subjects[0], subjects[1], subjects[2]) which makes the spec brittle; update the assertion to be order-independent by comparing the set of actual subjects to the expected set (e.g. use assert.ElementsMatch or assert.Equal with a sorted slice) or iterate expected values and assert each is contained in pub.subjects; refer to the pub.subjects / subjects variable in handler_test.go and replace the indexed asserts with an order-independent check for "chat.user.alice.event.room.key", "chat.user.bob.event.room.key", and "chat.user.remote-carol.event.room.key".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/client-api.md`:
- Around line 1752-1795: The overview's scope statement still marks "room-key
rotation" as out of scope but you added Section 5 (RoomKeyEvent) documenting
room-key events; update the overview bullet that mentions "room-key rotation" to
reflect that room key events (RoomKeyEvent, room key generation/rotation and
client behaviors) are now documented in Section 5, or remove the out-of-scope
note entirely so the overview and the new Section 5 are consistent with each
other.
In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md`:
- Around line 57-58: Several fenced code blocks currently lack language
identifiers (they contain just the block starting with ``` and content like
"Client"); update those fenced blocks (the examples labeled "Client" and the
other similar example blocks) to include an explicit language tag such as
```text (or ```go where appropriate) so the markdown linter (MD040) is satisfied
and rendering is predictable—search for the fenced blocks containing "Client"
and replace the opening ``` with ```text or the correct language identifier.
- Around line 358-361: The ownership table is inverted: update the Service/Role
entries so that `room-service` is credited with generating/rotating keys
(`Set`/`Rotate`), while `room-worker` is documented as the reader/fan‑out that
consumes those changes and emits `RoomKeyEvent` to every room member via
`roomkeysender.Send`; keep `inbox-worker` described as only replicating key
bytes into local Valkey (`Set`/`Rotate`) and not performing any user-event
fan‑out.
- Around line 296-299: Update the spec to stop claiming `inbox-worker` fan-outs
keys: change the descriptions of `handleRoomCreated`, `handleMemberAdded`, and
`handleMemberRemoved` so they call `replicateOrSendLocalKey` /
`rotateAndFanOutLocalKey` only to replicate into local Valkey and perform RPCs
to origin, and remove any mentions of local `Send`/fan-out from these handlers;
instead document that the origin `room-worker` is responsible for publishing
`RoomKeyEvent`s to user subjects and that `inbox-worker` only stores/replicates
key material locally (with Valkey `Get` failures returning errors and RPC
fallbacks/`Set`/`Rotate` semantics unchanged).
In `@inbox-worker/handler_test.go`:
- Around line 1394-1402: The test currently expects replicateLocalKey to
silently succeed when dependencies are nil, but missing key-replication deps
should fail fast; update the implementation of Handler.replicateLocalKey to
check that h.keyStore and h.interSiteClient are non-nil and return a descriptive
error immediately when either is nil, and then update
TestReplicateLocalKey_NoOpsWhenDepsNil (and the similar test at 1484-1489) to
assert an error is returned instead of require.NoError; reference NewHandler to
locate how the handler is constructed and replicateLocalKey for where to add the
nil checks and error return.
- Around line 1365-1390: The test currently asserts the local stored key version
becomes 1 after fetchAndStoreKey, which is wrong for replicated keys — the
origin's version must be preserved; update the assertions in
TestFetchAndStoreKey_RotatesWhenLocalKeyExists (and the similar test around
lines 1514-1539) to expect the fetched origin version (e.g., 5 or 3 as set on
stubInterSiteClient.getResp) instead of 1 by checking the returned pair.Version
from keyStore.Get after calling h.fetchAndStoreKey; leave the rest of the setup
and call to h.fetchAndStoreKey and keyStore unchanged.
In `@inbox-worker/handler.go`:
- Around line 147-153: The replicateLocalKey call currently swallows errors
(only logs) which causes the handler to Ack the event even when Valkey/RPC
replication failed; change the handler to return the error from
h.replicateLocalKey so the create/add event is retried instead of being
acknowledged. Locate the replicateLocalKey invocation in the handler (the block
that logs "replicate local key") and replace the log-only branch with
propagation of the error (wrap/contextualize if needed) so the caller sees a
non-nil error; apply the same change to the other replicateLocalKey occurrence
in this file to ensure both paths retry on failure.
- Around line 377-396: fetchAndStoreKey currently ignores the fetched key's
version and always calls keyStore.Rotate, causing duplicate deliveries to bump
the published version; fix by reading the fetched version from
interSiteClient.GetRoomKey and comparing it to the local stored version before
deciding to rotate: call keyStore.Get (or equivalent) to obtain the current
local version for roomID, if no current key (roomkeystore.ErrNoCurrentKey) call
keyStore.Set with the fetched RoomKeyPair, if the local version is >=
fetched.Version do nothing (avoid re-rotating), and only call keyStore.Rotate
when fetched.Version is strictly greater than the local version; update metric
labels/op names appropriately around the Set/Rotate paths (references:
fetchAndStoreKey, interSiteClient.GetRoomKey, roomkeystore.RoomKeyPair,
keyStore.Get/Set/Rotate, roomkeystore.ErrNoCurrentKey).
In `@inbox-worker/intersite_key_test.go`:
- Around line 37-45: The test races because NATS subscriptions register
asynchronously; after each nc.Subscribe call (the ones registering handlers for
subject.ServerRoomKeyGet and similar) add a call to nc.Flush() to ensure the
subscription is propagated before calling
newNatsInterSiteKeyClient(...).GetRoomKey(...) (and other immediate requests) so
the request won’t timeout; locate the nc.Subscribe invocations in this file (and
the other two blocks noted) and insert nc.Flush() right after each Subscribe
returns and before creating/using the client.
In `@inbox-worker/intersite_key.go`:
- Around line 45-47: The code in intersite_key.go currently converts parsed
origin errors into a generic string via fmt.Errorf, losing sentinel identities;
change the branch that handles natsutil.TryParseError(resp.Data) so it returns
the parsed error object directly (e.g., return nil, errResp) instead of wrapping
it in fmt.Errorf, preserving sentinel comparisons like errors.Is(...,
errRoomKeyAbsent) in downstream code.
In `@inbox-worker/main.go`:
- Around line 285-299: The check that drops messages uses
cfg.RoomKeyMaxRedeliver and will immediately terminate messages if that value is
zero or negative; validate RoomKeyMaxRedeliver at startup (when building/loading
cfg) and reject non-positive values (return an error/exit the process) so the
consumes never run with an invalid limit. Update the configuration
parsing/validation path that constructs cfg (where cfg.RoomKeyMaxRedeliver is
set) to ensure it's > 0, provide a clear startup error message if not, and keep
the existing runtime check in exceedsMaxRedeliver and the inbox handler
unchanged.
In `@pkg/roomkeysender/integration_test.go`:
- Around line 174-183: The skipOnVFS helper currently skips tests when
DOCKER_STORAGE_DRIVER is empty or "vfs", which hides tests when the env var is
unset; update skipOnVFS so it only calls t.Skip when DOCKER_STORAGE_DRIVER ==
"vfs". Locate skipOnVFS and the os.Getenv checks and remove the emptiness branch
(or replace with a runtime check that runs `docker info`/inspects the Docker
storage driver and only skips when that result equals "vfs"), and ensure the
t.Skip message remains clear and is only invoked when the driver is explicitly
vfs.
In `@room-service/handler_test.go`:
- Around line 883-903: The test currently doesn't enforce that keyStore.Rotate
is called before publishToStream; update the test around
keyStore.EXPECT().Rotate and the publish callback used in
Handler.handleRemoveMember to enforce ordering by either wrapping the
expectations in gomock.InOrder(keyStore.EXPECT()...) with a subsequent EXPECT
for Rotate then any publish call, or by using a boolean/flag set inside the
Rotate DoAndReturn (e.g., set rotated = true) and make the publish callback
assert that rotated is true before accepting the publish; reference the existing
keyStore.EXPECT().Rotate DoAndReturn, the publish variable passed as
Handler.publishToStream, and the h.handleRemoveMember invocation when making the
change.
In `@room-service/store_mongo.go`:
- Around line 763-776: The $lookup in store_mongo.go currently matches
room_members on (rid, member.account, member.type) using let acc "$u.account",
but EnsureIndexes only creates an index on (rid, member.type, member.id),
causing full scans; fix by either changing the lookup to join on the indexed
member.id (compare "$member.id" to "$u.account" in the "$match" inside the
"$lookup" pipeline used by the aggregation in question) or add a new compound
index in EnsureIndexes for room_members on (rid, member.type, member.account) so
the existing lookup uses an index; adjust whichever you choose consistently
(modify the "$lookup" match or update EnsureIndexes) and run tests/perf check.
In `@room-worker/handler.go`:
- Around line 352-361: The removal currently proceeds to Ack even when
h.store.ListByRoom(ctx, req.RoomID, "") fails, which can leave survivors without
the rotated key; change the flow in the handler (the block using ListByRoom and
calling h.fanOutRoomKeyToSurvivors) so that on listErr you do not acknowledge
the removal but instead return or propagate an error (so the caller can retry),
and apply the same change to the analogous block around fanOutRoomKeyToSurvivors
at the later section (lines referenced in the review). Ensure ListByRoom
failures are logged and then cause the handler to abort the removal/ack path
rather than continuing.
- Around line 1263-1266: The call to buildAndFanOutRoomKey currently swallows
errors (logs and continues) which lets room/subscriptions persist without
members receiving keys; change the create-room flow so that when
buildAndFanOutRoomKey returns an error you do not treat it as success: either
(a) propagate the error back from the create-room handler (return err) so the
higher-level transaction/operation fails and can be retried, or (b) implement a
short retry loop with backoff around buildAndFanOutRoomKey (e.g., 3 attempts,
small sleep between attempts) and only proceed when it succeeds; include attempt
count and the original error in logs. Ensure you update the call site that
currently does "if err := h.buildAndFanOutRoomKey(...); err != nil {
slog.Error(... ) }" to either return the error or retry and only continue when
successful.
---
Outside diff comments:
In `@room-service/handler.go`:
- Around line 489-551: The flag targetIsDualMembership is overloaded and the
comment about "Org removes rotate unconditionally" is stale; rename and clarify
the logic so the rotation gate is unambiguous: introduce a boolean
skipKeyRotation (or set skipKeyRotation instead of targetIsDualMembership) and
assign it only to mean "skip key rotation for this request", set it in the
individual-remove branch to target.HasIndividualMembership &&
target.HasOrgMembership (using GetSubscriptionWithMembership) and set it in the
org-remove branch based on CountOrgOnlySubs == 0 (using CountOrgOnlySubs with
req.RoomID/req.OrgID), then update/remove the stale comment and change the
rotation check from if h.keyStore != nil && !targetIsDualMembership to if
h.keyStore != nil && !skipKeyRotation so intent is clear.
In `@room-worker/store_mongo.go`:
- Around line 38-49: The ListByRoom function returns raw Mongo errors; wrap both
failure returns with contextual messages using fmt.Errorf so callers get
operation and identifier info. Specifically update the error returned from
s.subscriptions.Find(...) to something like fmt.Errorf("ListByRoom: find failed
for room %s site %s: %w", roomID, siteID, err) and likewise wrap the
cursor.All(...) error with fmt.Errorf("ListByRoom: decode failed for room %s
site %s: %w", roomID, siteID, err) so failures mention ListByRoom, the
roomID/siteID and the underlying err instead of returning bare err.
---
Nitpick comments:
In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md`:
- Around line 746-752: The snippet's natsPublisherAdapter uses a plain
*nats.Conn and a Publish(subj, data) signature, but this repo uses the
OpenTelemetry-wrapped connection from pkg/natsutil/connect.go (type
*otelnats.Conn) and the context-aware Publish(ctx, ...) to propagate request
IDs/traces; update natsPublisherAdapter to hold an *otelnats.Conn (or the
concrete type returned by connect.go) and expose a Publish method that accepts a
context.Context and delegates to the connection's Publish(ctx, subject, data) so
headers/OTel context flow correctly.
In `@pkg/roomkeysender/roomkeysender_test.go`:
- Around line 83-97: The test must also verify Send no longer mutates the
caller's event: before calling roomkeysender.NewSender(...).Send, make a deep
copy of tt.evt (e.g., orig := deep copy of tt.evt) and after Send returns assert
that tt.evt equals orig (use require.Equal or require.True with
reflect.DeepEqual) for both success and error paths; update the subtest that
constructs mockPublisher and calls sender.Send to create the copy and add the
post-call equality assertion so the non-mutation contract for Send is enforced.
In `@room-service/keygen_test.go`:
- Around line 29-36: The test TestGenerateRoomKeyPair_RoundTripWithRoomcrypto
only verifies the encoded blob shape but never validates the private key; update
the test to perform a full round-trip by calling the corresponding
roomcrypto.Decode (or the library's decrypt function) with pair.PrivateKey and
the encrypted result from roomcrypto.Encode and assert the decrypted plaintext
equals "hello" (or rename the test to reflect that it only checks public-key
encoding if you intentionally don't want decryption). Ensure you reference
generateRoomKeyPair to get pair.PublicKey/pair.PrivateKey and the
roomcrypto.Encode/Decode functions so a mismatched keypair would fail the
assertion.
In `@room-worker/handler_test.go`:
- Around line 3287-3290: The test currently asserts ordering by indexing into
pub.subjects (subjects[0], subjects[1], subjects[2]) which makes the spec
brittle; update the assertion to be order-independent by comparing the set of
actual subjects to the expected set (e.g. use assert.ElementsMatch or
assert.Equal with a sorted slice) or iterate expected values and assert each is
contained in pub.subjects; refer to the pub.subjects / subjects variable in
handler_test.go and replace the indexed asserts with an order-independent check
for "chat.user.alice.event.room.key", "chat.user.bob.event.room.key", and
"chat.user.remote-carol.event.room.key".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 59fa3ada-6b44-42e6-962b-c48663f05f9f
📒 Files selected for processing (40)
docs/client-api.mddocs/superpowers/plans/2026-05-08-room-encryption-keys.mddocs/superpowers/specs/2026-05-08-room-encryption-keys-design.mdinbox-worker/consumer_config_test.goinbox-worker/deploy/docker-compose.ymlinbox-worker/handler.goinbox-worker/handler_test.goinbox-worker/integration_test.goinbox-worker/intersite_key.goinbox-worker/intersite_key_test.goinbox-worker/intersite_stubs_test.goinbox-worker/main.gopkg/model/event.gopkg/model/member.gopkg/model/model_test.gopkg/roomkeymetrics/metrics.gopkg/roomkeymetrics/metrics_test.gopkg/roomkeysender/integration_test.gopkg/roomkeysender/roomkeysender.gopkg/roomkeysender/roomkeysender_test.gopkg/roomkeystore/doc.gopkg/subject/subject.gopkg/subject/subject_test.goroom-service/handler.goroom-service/handler_test.goroom-service/integration_test.goroom-service/keygen.goroom-service/keygen_test.goroom-service/mock_store_test.goroom-service/store.goroom-service/store_mongo.goroom-worker/deploy/docker-compose.ymlroom-worker/handler.goroom-worker/handler_test.goroom-worker/integration_test.goroom-worker/main.goroom-worker/mock_publisher_test.goroom-worker/mock_store_test.goroom-worker/store.goroom-worker/store_mongo.go
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
inbox-worker/handler.go (1)
145-155:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake add-member retries re-enter the key replication path.
If
replicateLocalKeyfails afterBulkCreateSubscriptionssucceeds, the redelivery will usually hit duplicate keys at Line 145 and return before key replication is retried. That leaves local Valkey stale even though the message keeps retrying. At minimum, this path needs the same idempotent duplicate-key escape hatch ashandleRoomCreated, or an upsert-based subscription write.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@inbox-worker/handler.go` around lines 145 - 155, BulkCreateSubscriptions may return a duplicate-key error which currently causes the handler to return before replicateLocalKey is retried, leaving local Valkey stale; update the add-member flow so a duplicate-key error from h.store.BulkCreateSubscriptions is treated as success (or convert the store call to an upsert) and the code always proceeds to call h.replicateLocalKey(ctx, evt.SiteID, event.RoomID); alternatively implement the same idempotent duplicate-key escape hatch used in handleRoomCreated so that replicateLocalKey is retried even when subscription writes collide.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md`:
- Around line 7-8: Update the docs to reflect that key ownership and rotation
live with the origin site: indicate that room-worker is the sole publisher of
RoomKeyEvent and the authoritative source of key versions, and that inbox-worker
must NOT implement its own Set/Rotate or a separate key fan-out path; instead
inbox-worker should only fetch keys via the
chat.server.request.roomkey.{siteID}.get RPC (or mirror the origin RoomKeyEvent)
and write the exact version into local Valkey, then use that mirrored value for
local delivery without changing versioning semantics; revise all referenced
sections (those around the current paragraphs and the ranges called out) to
remove instructions that give inbox-worker Set/Rotate responsibilities or
independent fan-out behavior and explicitly state the mirroring-only behavior
and exact-version requirement.
In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md`:
- Line 455: The spec line for remove-member incorrectly marks a `Get` returning
version < `NewKeyVersion` as a permanent error; update the text for the
`remove-member` case to document stale local key versions as a transient `NAK +
retry` condition (not permanent), keeping the same retryable semantics used
elsewhere (mention `Get`, `NewKeyVersion`, `NAK + retry` and JetStream behavior)
so tests and JetStream handling continue to treat it as retryable.
In `@inbox-worker/handler_test.go`:
- Around line 1250-1252: Several test setup blocks in handler_test.go silently
ignore json.Marshal errors (e.g., lines creating pData, data used in OutboxEvent
payloads); replace occurrences like "pData, _ := json.Marshal(...)" and "data, _
:= json.Marshal(...)" with code that checks the marshal succeeded using
require.NoError(t, err) (or require.NoError where t is the test variable)
immediately after the marshal and before using pData/data, updating all
instances referenced in this file (examples around creation of model.OutboxEvent
in the member_added test and the other ranges noted) so test setup failures fail
fast instead of being silently discarded.
- Around line 1447-1448: The current test only compares the first byte of
pair.KeyPair.PublicKey which can miss overwrites; update the assertion to
compare the entire public key slice (pair.KeyPair.PublicKey) against the
expected full byte slice so the test fails if any byte changes, ensuring the "no
overwrite" property is strongly enforced for the KeyPair.PublicKey.
In `@inbox-worker/handler.go`:
- Around line 342-347: The duplicate-key handler for
h.store.BulkCreateSubscriptions currently treats any mongo.IsDuplicateKeyError
as full success; update the logic in the BulkCreateSubscriptions error branch to
verify that all intended subscriptions in subs were actually created (e.g.,
query h.store for the list of subscriptions for data.RoomID/data.HomeSiteID and
compare against subs) and only call h.fetchAndStoreKey(ctx, data.HomeSiteID,
data.RoomID) and return nil when the full target set exists; if any are missing,
instead perform an idempotent per-subscription upsert loop (or retry creating
only the missing entries) to reconcile the partial success before acknowledging
the event.
In `@pkg/roomkeysender/roomkeysender_test.go`:
- Around line 85-94: The test currently takes a shallow snapshot with before :=
tt.evt so slice headers (PublicKey, PrivateKey) still share backing arrays and
mutations could be missed; update the test to take a deep copy of the
RoomKeyEvent before calling roomkeysender.NewSender(...).Send — e.g., create a
deepCopy of tt.evt that copies the struct and makes new slices for PublicKey and
PrivateKey (copying their contents) and assign that to before, then call
sender.Send(tt.account, tt.evt) and assert.Equal(t, before, tt.evt) to ensure no
in-place slice mutations are missed.
---
Outside diff comments:
In `@inbox-worker/handler.go`:
- Around line 145-155: BulkCreateSubscriptions may return a duplicate-key error
which currently causes the handler to return before replicateLocalKey is
retried, leaving local Valkey stale; update the add-member flow so a
duplicate-key error from h.store.BulkCreateSubscriptions is treated as success
(or convert the store call to an upsert) and the code always proceeds to call
h.replicateLocalKey(ctx, evt.SiteID, event.RoomID); alternatively implement the
same idempotent duplicate-key escape hatch used in handleRoomCreated so that
replicateLocalKey is retried even when subscription writes collide.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4db44448-7ccd-4053-9dc0-f2be5ba22da3
📒 Files selected for processing (22)
docs/client-api.mddocs/superpowers/plans/2026-05-08-room-encryption-keys.mddocs/superpowers/specs/2026-05-08-room-encryption-keys-design.mdinbox-worker/handler.goinbox-worker/handler_test.goinbox-worker/intersite_key.goinbox-worker/intersite_key_test.goinbox-worker/intersite_stubs_test.goinbox-worker/main.gopkg/roomkeysender/integration_test.gopkg/roomkeysender/roomkeysender_test.gopkg/roomkeystore/adapter.gopkg/roomkeystore/integration_test.gopkg/roomkeystore/roomkeystore.gopkg/roomkeystore/roomkeystore_test.goroom-service/handler.goroom-service/handler_test.goroom-service/keygen_test.goroom-service/store_mongo.goroom-worker/handler.goroom-worker/handler_test.goroom-worker/store_mongo.go
🚧 Files skipped from review as they are similar to previous changes (10)
- room-worker/store_mongo.go
- inbox-worker/intersite_key_test.go
- inbox-worker/main.go
- inbox-worker/intersite_stubs_test.go
- room-service/store_mongo.go
- inbox-worker/intersite_key.go
- room-service/handler.go
- room-service/handler_test.go
- room-worker/handler.go
- room-worker/handler_test.go
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
room-worker/main.go (1)
173-195:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winVerify shutdown hook order: tracer/meter before NATS drain may lose telemetry.
The current order shuts down tracer (line 188) and meter (line 189) before draining NATS (line 190). This may prevent traces and metrics generated during the NATS drain, Mongo disconnect, or Valkey close from being captured and exported.
Standard observability shutdown happens after all client operations to ensure telemetry is flushed.
🔄 Suggested hook order
hooks := []func(ctx context.Context) error{ func(ctx context.Context) error { iter.Stop() return nil }, func(ctx context.Context) error { done := make(chan struct{}) go func() { wg.Wait(); close(done) }() select { case <-done: return nil case <-ctx.Done(): return fmt.Errorf("worker drain timed out: %w", ctx.Err()) } }, - func(ctx context.Context) error { return tracerShutdown(ctx) }, - func(ctx context.Context) error { return meterShutdown(ctx) }, func(ctx context.Context) error { return nc.Drain() }, func(ctx context.Context) error { mongoutil.Disconnect(ctx, mongoClient); return nil }, func(ctx context.Context) error { return keyStore.Close() }, + func(ctx context.Context) error { return tracerShutdown(ctx) }, + func(ctx context.Context) error { return meterShutdown(ctx) }, }This ensures all client operations complete before flushing observability exporters.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@room-worker/main.go` around lines 173 - 195, The tracerShutdown and meterShutdown hooks are ordered before nc.Drain/mongo disconnect/keyStore.Close, which can drop telemetry produced during those operations; move the calls to tracerShutdown and meterShutdown to run after nc.Drain(), mongoutil.Disconnect(ctx, mongoClient) and keyStore.Close() in the hooks slice so telemetry is flushed only after NATS drain, Mongo disconnect and key store close complete; keep the iter.Stop and wg.Wait hooks earlier so work finishes before the client drains, and ensure shutdown.Wait still receives the reordered hooks.inbox-worker/handler.go (1)
168-184:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEmpty
member_removed.Accountsshould not short-circuit key replication.A remote site can still need the rotated key even when no local subscriptions are deleted. Returning before
fetchAndStoreKeyleaves local Valkey stale, so this site'sbroadcast-workercan keep encrypting with an older version than local survivors hold.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@inbox-worker/handler.go` around lines 168 - 184, The handler currently returns early when memberEvt.Accounts is empty which prevents key rotation; change Handler.handleMemberRemoved so that h.fetchAndStoreKey(ctx, evt.SiteID, memberEvt.RoomID) is always invoked even if memberEvt.Accounts is empty, and only call h.store.DeleteSubscriptionsByAccounts when len(memberEvt.Accounts) > 0 (preserve existing error handling for both calls). Ensure the final error messages remain descriptive (retain the existing fmt.Errorf wrapping) and that key fetch/store runs regardless of subscription deletion.
♻️ Duplicate comments (3)
docs/superpowers/plans/2026-05-08-room-encryption-keys.md (2)
1476-1935:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPart 3 still documents the old inbox-worker ownership model.
These tasks tell implementers to wire a sender into
inbox-worker, call localSet/Rotate, and fan out keys to users. The shipped design is RPC fetch +SetWithVersiononly, with originroom-workerremaining the soleRoomKeyEventpublisher. This section should be rewritten before someone follows it literally.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md` around lines 1476 - 1935, The document incorrectly instructs wiring a sender and performing local Set/Rotate + fan-out of room keys; instead rewrite the section to reflect the current design where inbox-worker must RPC-fetch keys from origin (using InterSiteKeyClient / newNatsInterSiteKeyClient), call the Valkey-backed SetWithVersion (roomkeystore.SetWithVersion or equivalent) to store the fetched versioned key locally, and NOT act as an origin publisher or fan-out sender; update references to Handler, replicateRoomKey, handleMemberAdded, and handleMemberRemoved to describe fetching via GetRoomKey and using SetWithVersion/Rotate semantics only, remove guidance to create or wire roomkeysender for outbound key publishing, and adjust the tests/instructions to expect RPC fetch + local SetWithVersion behavior (and note that origin room-worker remains the sole RoomKeyEvent publisher).
1172-1241:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTask 12 still treats stale key versions as permanent errors.
Both the test and the implementation snippet turn
pair.Version < req.NewKeyVersionintoerrPermanent. That case needsNAK + retry; otherwise remove-member can Ack away the normal Valkey propagation race instead of waiting for the rotated version to appear.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md` around lines 1172 - 1241, The test and handler currently treat a stale key version (in processRemoveMember when h.keyStore.Get(ctx, req.RoomID) returns pair.Version < req.NewKeyVersion) as a permanent error via newPermanent/errPermanent; change that branch in processRemoveMember to return a non-permanent error (e.g., a plain fmt.Errorf or a dedicated transient error type) instead of newPermanent so the caller will NAK and retry the remove-member operation; leave the other room-type and key-get error handling unchanged and only replace the newPermanent(...) used when pair == nil || pair.Version < req.NewKeyVersion.docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md (1)
38-39:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winThis scope bullet still gives
inbox-workerits own fan-out/versioning path.It says the remote worker uses local
Set/Rotatesemantics and fans outRoomKeyEvents to local users, but the shipped flow is RPC fetch +SetWithVersiononly; originroom-workerremains the sole user-event publisher. As written, the top-level scope contradicts the corrected flow documented later in the spec.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md` around lines 38 - 39, The bullet claiming inbox-worker performs local Set/Rotate and independently fans out RoomKeyEvent is incorrect; update the scope to match the implemented flow where inbox-worker performs an RPC to room-worker (`chat.server.request.roomkey.{originSiteID}.get`) and uses SetWithVersion (not Set/Rotate) so the origin `room-worker` remains the sole publisher of user-facing RoomKeyEvent; change wording to reflect that inbox-worker only fetches keypair bytes via RPC and applies them with SetWithVersion while leaving event fan-out/versioning to the origin `room-worker` (keep references to `inbox-worker`, `room-worker`, `RoomKeyEvent`, `Set`, `Rotate`, and `SetWithVersion` to make the intended correction clear).
🧹 Nitpick comments (3)
room-worker/mock_publisher_test.go (1)
41-48: ⚡ Quick winClarify that the returned keys are dummy/non-cryptographic stubs.
The comment states "returns a valid version-0 key," but the keys are dummy byte sequences (all
0x04and0x05) that are not cryptographically valid P-256 keys. While the broader comment (line 34) mentions "don't exercise key behavior," the word "valid" here is misleading and could cause future developers to mistakenly use this stub in tests that do perform cryptographic operations.📝 Suggested comment clarification
-func (stubRoomKeyStore) Get(_ context.Context, _ string) (*roomkeystore.VersionedKeyPair, error) { +// Get returns a dummy (non-cryptographic) VersionedKeyPair. The key bytes are +// not valid P-256 keys and must not be used for actual crypto operations. +func (stubRoomKeyStore) Get(_ context.Context, _ string) (*roomkeystore.VersionedKeyPair, error) { return &roomkeystore.VersionedKeyPair{🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@room-worker/mock_publisher_test.go` around lines 41 - 48, Update the misleading comment in stubRoomKeyStore.Get that says it "returns a valid version-0 key" to explicitly state these are synthetic/dummy byte sequences (all 0x04 and 0x05) and are NOT cryptographically valid P-256 keys and must not be used for any crypto behavior; mention the returned types (roomkeystore.VersionedKeyPair and roomkeystore.RoomKeyPair) so readers know which stubbed values are non-cryptographic.room-worker/main.go (1)
92-96: ⚡ Quick winMove config validation before external connections.
Validating
VALKEY_KEY_GRACE_PERIOD > 0after Mongo connect (line 81) and stream bootstrap (line 87) wastes resources on external services before checking a simple config invariant.♻️ Suggested placement
Move the validation immediately after config parsing:
cfg, err := env.ParseAs[config]() if err != nil { slog.Error("parse config", "error", err) os.Exit(1) } + + if cfg.ValkeyKeyGracePeriod <= 0 { + slog.Error("VALKEY_KEY_GRACE_PERIOD must be a positive duration", + "valkey_key_grace_period", cfg.ValkeyKeyGracePeriod) + os.Exit(1) + } ctx := context.Background()Then remove lines 92–96.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@room-worker/main.go` around lines 92 - 96, Move the VALKEY_KEY_GRACE_PERIOD validation so it runs immediately after config parsing instead of after external setup: check cfg.ValkeyKeyGracePeriod > 0 right after the config is built/parsed (where the environment vars are loaded) and exit with an error if invalid, and then remove the later block that checks cfg.ValkeyKeyGracePeriod (the existing if that logs and calls os.Exit at the end of main). This ensures the validation occurs before creating Mongo connections or bootstrapping streams.inbox-worker/integration_test.go (1)
678-693: ⚡ Quick winUse the standard NATS testcontainer path in this integration test.
This new helper spins up an embedded server instead of the official
testcontainers-goNATS module already used elsewhere in the file. Reusing the container-based setup keeps dependency orchestration consistent with the repo's integration-test rules.As per coding guidelines, “Use
testcontainers-gowith official modules (mongodb,cassandra,nats) for real dependencies in integration tests”.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@inbox-worker/integration_test.go` around lines 678 - 693, The helper startEmbeddedNATS spins up an in-process server; replace it with the repository's standard testcontainers-go NATS container usage: remove the natsserver.Options/NewServer/ns.Start flow in startEmbeddedNATS and instead create and start an official testcontainers nats container (using the same testcontainers NATS module pattern used elsewhere in this file), wait for readiness, connect using nats.Connect(ns.ClientURL()) (or the container's mapped host/port), and register t.Cleanup to terminate the container and close the client; update any imports and ensure the function still returns *nats.Conn and uses require.NoError/require.True as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md`:
- Around line 715-745: The docs currently instruct adding an optional Valkey
startup path checking cfg.ValkeyAddr and only wiring keyStore/keySender when it
is non-empty; update the instructions in room-worker/main.go to reflect the new
fail-fast behavior (Valkey is required at startup) instead of the old optional
branch: remove or rewrite the `if cfg.ValkeyAddr != ""` flow and document that
roomkeystore.NewValkeyStore must be initialized and that startup should exit if
it fails, and that roomkeysender.NewSender (using nc.NatsConn()) must always be
created and assigned to keySender; reference cfg.ValkeyAddr,
roomkeystore.NewValkeyStore, roomkeysender.NewSender and nc.NatsConn() so the
maintainer locates and replaces the optional wiring with the required, fail-fast
wiring.
In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md`:
- Around line 429-434: Update the inbox-worker row in the config summary table
to include the ROOM_KEY_MAX_REDELIVER environment variable: add
`ROOM_KEY_MAX_REDELIVER` (describe default if applicable) alongside
`VALKEY_ADDR`, `VALKEY_PASSWORD`, `VALKEY_KEY_GRACE_PERIOD`, and
`ROOM_KEY_RPC_TIMEOUT` in the `inbox-worker` new env vars column so the table
matches the bounded-redelivery behavior described earlier; edit the table row
for `inbox-worker` in specs/2026-05-08-room-encryption-keys-design.md to list
`ROOM_KEY_MAX_REDELIVER` explicitly.
In `@inbox-worker/main.go`:
- Around line 266-285: Summary: startup lacks validation for
ROOM_KEY_RPC_TIMEOUT so a zero/negative cfg.RoomKeyRPCTimeout will make
inter-site key RPCs fail immediately. Fix: add a fail-fast check after the
existing validations to ensure cfg.RoomKeyRPCTimeout > 0; if not, call
slog.Error with a clear message (e.g., "ROOM_KEY_RPC_TIMEOUT must be a positive
duration" and include "room_key_rpc_timeout", cfg.RoomKeyRPCTimeout) and
os.Exit(1) before creating interSiteClient with newNatsInterSiteKeyClient,
mirroring the pattern used for ValkeyKeyGracePeriod and RoomKeyMaxRedeliver.
---
Outside diff comments:
In `@inbox-worker/handler.go`:
- Around line 168-184: The handler currently returns early when
memberEvt.Accounts is empty which prevents key rotation; change
Handler.handleMemberRemoved so that h.fetchAndStoreKey(ctx, evt.SiteID,
memberEvt.RoomID) is always invoked even if memberEvt.Accounts is empty, and
only call h.store.DeleteSubscriptionsByAccounts when len(memberEvt.Accounts) > 0
(preserve existing error handling for both calls). Ensure the final error
messages remain descriptive (retain the existing fmt.Errorf wrapping) and that
key fetch/store runs regardless of subscription deletion.
In `@room-worker/main.go`:
- Around line 173-195: The tracerShutdown and meterShutdown hooks are ordered
before nc.Drain/mongo disconnect/keyStore.Close, which can drop telemetry
produced during those operations; move the calls to tracerShutdown and
meterShutdown to run after nc.Drain(), mongoutil.Disconnect(ctx, mongoClient)
and keyStore.Close() in the hooks slice so telemetry is flushed only after NATS
drain, Mongo disconnect and key store close complete; keep the iter.Stop and
wg.Wait hooks earlier so work finishes before the client drains, and ensure
shutdown.Wait still receives the reordered hooks.
---
Duplicate comments:
In `@docs/superpowers/plans/2026-05-08-room-encryption-keys.md`:
- Around line 1476-1935: The document incorrectly instructs wiring a sender and
performing local Set/Rotate + fan-out of room keys; instead rewrite the section
to reflect the current design where inbox-worker must RPC-fetch keys from origin
(using InterSiteKeyClient / newNatsInterSiteKeyClient), call the Valkey-backed
SetWithVersion (roomkeystore.SetWithVersion or equivalent) to store the fetched
versioned key locally, and NOT act as an origin publisher or fan-out sender;
update references to Handler, replicateRoomKey, handleMemberAdded, and
handleMemberRemoved to describe fetching via GetRoomKey and using
SetWithVersion/Rotate semantics only, remove guidance to create or wire
roomkeysender for outbound key publishing, and adjust the tests/instructions to
expect RPC fetch + local SetWithVersion behavior (and note that origin
room-worker remains the sole RoomKeyEvent publisher).
- Around line 1172-1241: The test and handler currently treat a stale key
version (in processRemoveMember when h.keyStore.Get(ctx, req.RoomID) returns
pair.Version < req.NewKeyVersion) as a permanent error via
newPermanent/errPermanent; change that branch in processRemoveMember to return a
non-permanent error (e.g., a plain fmt.Errorf or a dedicated transient error
type) instead of newPermanent so the caller will NAK and retry the remove-member
operation; leave the other room-type and key-get error handling unchanged and
only replace the newPermanent(...) used when pair == nil || pair.Version <
req.NewKeyVersion.
In `@docs/superpowers/specs/2026-05-08-room-encryption-keys-design.md`:
- Around line 38-39: The bullet claiming inbox-worker performs local Set/Rotate
and independently fans out RoomKeyEvent is incorrect; update the scope to match
the implemented flow where inbox-worker performs an RPC to room-worker
(`chat.server.request.roomkey.{originSiteID}.get`) and uses SetWithVersion (not
Set/Rotate) so the origin `room-worker` remains the sole publisher of
user-facing RoomKeyEvent; change wording to reflect that inbox-worker only
fetches keypair bytes via RPC and applies them with SetWithVersion while leaving
event fan-out/versioning to the origin `room-worker` (keep references to
`inbox-worker`, `room-worker`, `RoomKeyEvent`, `Set`, `Rotate`, and
`SetWithVersion` to make the intended correction clear).
---
Nitpick comments:
In `@inbox-worker/integration_test.go`:
- Around line 678-693: The helper startEmbeddedNATS spins up an in-process
server; replace it with the repository's standard testcontainers-go NATS
container usage: remove the natsserver.Options/NewServer/ns.Start flow in
startEmbeddedNATS and instead create and start an official testcontainers nats
container (using the same testcontainers NATS module pattern used elsewhere in
this file), wait for readiness, connect using nats.Connect(ns.ClientURL()) (or
the container's mapped host/port), and register t.Cleanup to terminate the
container and close the client; update any imports and ensure the function still
returns *nats.Conn and uses require.NoError/require.True as before.
In `@room-worker/main.go`:
- Around line 92-96: Move the VALKEY_KEY_GRACE_PERIOD validation so it runs
immediately after config parsing instead of after external setup: check
cfg.ValkeyKeyGracePeriod > 0 right after the config is built/parsed (where the
environment vars are loaded) and exit with an error if invalid, and then remove
the later block that checks cfg.ValkeyKeyGracePeriod (the existing if that logs
and calls os.Exit at the end of main). This ensures the validation occurs before
creating Mongo connections or bootstrapping streams.
In `@room-worker/mock_publisher_test.go`:
- Around line 41-48: Update the misleading comment in stubRoomKeyStore.Get that
says it "returns a valid version-0 key" to explicitly state these are
synthetic/dummy byte sequences (all 0x04 and 0x05) and are NOT cryptographically
valid P-256 keys and must not be used for any crypto behavior; mention the
returned types (roomkeystore.VersionedKeyPair and roomkeystore.RoomKeyPair) so
readers know which stubbed values are non-cryptographic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f514e45a-046a-4bfa-9b7c-c1c9dfb4d6b0
📒 Files selected for processing (14)
docs/superpowers/plans/2026-05-08-room-encryption-keys.mddocs/superpowers/specs/2026-05-08-room-encryption-keys-design.mdinbox-worker/deploy/docker-compose.ymlinbox-worker/handler.goinbox-worker/handler_test.goinbox-worker/integration_test.goinbox-worker/main.gopkg/roomkeysender/roomkeysender_test.goroom-worker/deploy/docker-compose.ymlroom-worker/handler.goroom-worker/handler_test.goroom-worker/integration_test.goroom-worker/main.goroom-worker/mock_publisher_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
- room-worker/deploy/docker-compose.yml
- room-worker/handler.go
- room-worker/handler_test.go
- inbox-worker/deploy/docker-compose.yml
- room-worker/integration_test.go
44f115c to
c7bff9d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/client-api.md (1)
830-830:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix inconsistent anchor reference.
This line still references
#5-error-envelope-reference, but the Error envelope section was renumbered to section 6. All other references in the document (lines 162, 250, 306, etc.) were correctly updated to#6-error-envelope-reference.🔗 Proposed fix
-See [Error envelope](`#5-error-envelope-reference`). Common errors: +See [Error envelope](`#6-error-envelope-reference`). Common errors:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/client-api.md` at line 830, Update the incorrect anchor reference "#5-error-envelope-reference" to "#6-error-envelope-reference" in the sentence "See [Error envelope](`#5-error-envelope-reference`). Common errors:" so it matches the renumbered Error envelope section and the other references in the document.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/client-api.md`:
- Around line 1840-1842: The fenced code block containing the NATS subject
pattern "chat.user.{account}.event.room.key" is missing a language identifier;
update that fenced block to include a language token (e.g., use ```text) so the
block becomes ```text ... ``` to satisfy markdownlint MD040 and improve
rendering consistency.
---
Outside diff comments:
In `@docs/client-api.md`:
- Line 830: Update the incorrect anchor reference "#5-error-envelope-reference"
to "#6-error-envelope-reference" in the sentence "See [Error
envelope](`#5-error-envelope-reference`). Common errors:" so it matches the
renumbered Error envelope section and the other references in the document.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 75a42379-f748-4132-8168-6d6004b4dbb1
📒 Files selected for processing (47)
broadcast-worker/mock_store_test.godocs/client-api.mddocs/superpowers/plans/2026-05-08-room-encryption-keys.mddocs/superpowers/specs/2026-05-08-room-encryption-keys-design.mdinbox-worker/consumer_config_test.goinbox-worker/deploy/docker-compose.ymlinbox-worker/handler.goinbox-worker/handler_test.goinbox-worker/integration_test.goinbox-worker/intersite_key.goinbox-worker/intersite_key_test.goinbox-worker/intersite_stubs_test.goinbox-worker/main.gomessage-worker/mock_store_test.gopkg/model/event.gopkg/model/member.gopkg/model/model_test.gopkg/roomkeymetrics/metrics.gopkg/roomkeymetrics/metrics_test.gopkg/roomkeysender/integration_test.gopkg/roomkeysender/roomkeysender.gopkg/roomkeysender/roomkeysender_test.gopkg/roomkeystore/adapter.gopkg/roomkeystore/doc.gopkg/roomkeystore/integration_test.gopkg/roomkeystore/roomkeystore.gopkg/roomkeystore/roomkeystore_test.gopkg/subject/subject.gopkg/subject/subject_test.goroom-service/handler.goroom-service/handler_test.goroom-service/integration_test.goroom-service/keygen.goroom-service/keygen_test.goroom-service/mock_store_test.goroom-service/store.goroom-service/store_mongo.goroom-worker/deploy/docker-compose.ymlroom-worker/handler.goroom-worker/handler_test.goroom-worker/integration_test.goroom-worker/main.goroom-worker/mock_publisher_test.goroom-worker/mock_store_test.goroom-worker/store.goroom-worker/store_mongo.gosearch-service/mock_store_test.go
✅ Files skipped from review due to trivial changes (7)
- pkg/roomkeystore/doc.go
- room-worker/deploy/docker-compose.yml
- message-worker/mock_store_test.go
- pkg/roomkeymetrics/metrics_test.go
- search-service/mock_store_test.go
- room-service/mock_store_test.go
- room-worker/mock_store_test.go
🚧 Files skipped from review as they are similar to previous changes (30)
- pkg/subject/subject_test.go
- room-service/keygen.go
- room-service/store.go
- pkg/subject/subject.go
- pkg/roomkeystore/roomkeystore.go
- room-service/integration_test.go
- pkg/roomkeystore/adapter.go
- pkg/model/event.go
- pkg/roomkeysender/roomkeysender.go
- pkg/roomkeystore/roomkeystore_test.go
- inbox-worker/deploy/docker-compose.yml
- room-worker/store_mongo.go
- pkg/roomkeysender/integration_test.go
- inbox-worker/intersite_key.go
- pkg/roomkeystore/integration_test.go
- pkg/model/member.go
- room-service/keygen_test.go
- inbox-worker/main.go
- pkg/model/model_test.go
- pkg/roomkeymetrics/metrics.go
- room-worker/mock_publisher_test.go
- inbox-worker/intersite_key_test.go
- inbox-worker/intersite_stubs_test.go
- room-worker/integration_test.go
- room-worker/handler.go
- room-service/store_mongo.go
- inbox-worker/integration_test.go
- room-worker/main.go
- room-worker/handler_test.go
- inbox-worker/handler_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/client-api.md`:
- Line 1836: Update the ambiguous statement "Each room has a P-256 keypair"
(line with that exact phrase) and the nearby lines about keys sent to initial
members and "(channels only)" key rotations (the sentences mentioning keys to
initial members and member add/remove rotation) to explicitly state the intended
behavior: either change the phrase to "Each channel room has a P-256 keypair" if
DM rooms do not receive keys, or add a clarifying note that all rooms currently
receive keys at creation for consistency but DM rooms use plaintext `message`
fields (not `encryptedMessage`) so clients may opt not to store DM keys, and
indicate if DM encryption is planned for a future release; ensure you update the
sections that reference `message` vs `encryptedMessage` in §4 to match this
decision so the documentation is consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
cdabf9d to
9c42a1d
Compare
9c42a1d to
9f1bdc1
Compare
End-to-end design for distributing P-256 key pairs per room. room-service generates keys on create and rotates on channel member removal. Origin room-worker fans out RoomKeyEvent to all members (cross-site routed by NATS supercluster). inbox-worker replicates keypairs into local Valkey so each site's broadcast-worker can encrypt locally. Key material never enters JetStream — cross-site replication uses a server-to-server NATS RPC. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
Three-part plan: room-service foundation (key gen + rotate), room-worker (consume canonical events + fan out RoomKeyEvent), inbox-worker (cross-site Valkey replication via RPC). TDD discipline, frequent commits, integration tests for end-to-end persistence and cross-site replication. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
Adds chat.server.request.roomkey.{siteID}.get subject used by inbox-worker
to pull a room's keypair from origin during cross-site replication.
https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
Adds NewKeyVersion to RemoveMemberRequest + MemberRemoveEvent for stamping the post-rotation version through the canonical event and outbox payload. Adds RoomType to RemoveMemberRequest so room-worker can assert channel-only without an extra GetRoom call. Adds RoomKeyGetRequest as the shared schema for the inter-site RPC request body. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
…vent by value Send now accepts RoomKeyEvent by value, eliminating the silent shared-pointer mutation that would trample Timestamp across a fan-out loop (and prevent any data race if fan-out is ever parallelised). NatsPublisher exports the *nats.Conn → Publisher adapter that was previously duplicated in both worker mains. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
New package exporting five OTel instruments — FanoutErrors, RPCDuration, KeyGenerated, KeyRotated, ValkeyErrors (op-tagged). Consumers in room-service, room-worker, and inbox-worker emit metrics on success and failure paths so operators can dashboard key generation rate, RPC latency, fan-out failures, and per-op Valkey errors. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
…topology Documents the public contract: scalar int versioning with current/prev slot, Lua atomicity for Rotate, single-master Valkey topology requirement, and federation responsibility (cross-site replication is inbox-worker's job). https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
…air helper Extends the consumer-side RoomKeyStore interface with Set and Rotate so handlers can write/rotate keys. Adds the generateRoomKeyPair helper that emits a fresh P-256 keypair via crypto/ecdh. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
…remove On create-room (all types), generate a P-256 keypair and write to Valkey before publishing the canonical event so the worker's Get gate sees the key. On channel member-remove, rotate the key (with Set fallback on ErrNoCurrentKey for pre-existing un-keyed rooms) and stamp NewKeyVersion on the canonical request payload. Dual-membership removals (target has both individual and org membership) skip rotation since the user remains via their org. Block non-channel rooms from member.remove. KeyGenerated/KeyRotated/ValkeyErrors counters emitted on respective paths. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
Adds the consumer-side RoomKeyStore interface (Get-only + Mongo ListByRoom now takes siteID for push-down filtering) and wires roomkeystore.NewValkeyStore + roomkeysender.NewSender in main.go, gated on VALKEY_ADDR being set. Emits a startup slog.Warn when key handling is disabled so partial deployments are visible. Wires otelutil.InitMeter for the metrics package. New mock_publisher_test.go supports capturing fan-out payloads in tests. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
…all members processCreateRoom gates Mongo writes on keyStore.Get returning the pre-Set keypair, then fans out to every initial member via buildAndFanOutRoomKey. processAddMembers fans the current key out to newly-added accounts on channel rooms. processRemoveMember version-gates on req.NewKeyVersion (NAK on stale read so jetstream retries while Valkey propagation catches up), fetches survivors via ListByRoom, and calls fanOutRoomKeyToSurvivors. The outbox MemberRemoveEvent now carries NewKeyVersion. Fan-out targets every member account (local + remote) — NATS supercluster routes user-event subjects to the user's home site. Sentinel errors ErrRoomKeyNotFound and ErrRoomKeyStoreInternal exported for the RPC handler path. Defensive roomType guard uses req.RoomType with a backward-compat gate for empty field from pre-this-change senders. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
Drives processCreateRoom against real Valkey + Mongo + NATS containers and
asserts RoomKeyEvent is published on chat.user.{account}.event.room.key for
every member account. Exercises the version-gate, fan-out, and the
cross-site key-fetch RPC handler.
https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
natsInterSiteKeyClient issues chat.server.request.roomkey.{originSiteID}.get
via nats.Conn with X-Request-ID propagated through NATS headers, so trace
correlation survives the cross-site boundary. Timeout configurable via
ROOM_KEY_RPC_TIMEOUT (default 5s). RPCDuration histogram wraps every call.
main.go wires keystore + sender gated on VALKEY_ADDR; startup slog.Warn
when disabled. otelutil.InitMeter registered for metrics. Test stubs
(stubKeyStore, stubInterSiteClient, stubRoomKeyPublisher) reusable across
handler tests.
https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
…s-site
Three handler extensions: handleRoomCreated calls replicateRoomKey (RPC
origin + Set local Valkey), handleMemberAdded calls replicateLocalKey
(cache hit short-circuits, miss falls back to RPC), and handleMemberRemoved
calls rotateLocalKey (RPC + Rotate with Set fallback for ErrNoCurrentKey).
No user-side fan-out — origin room-worker already publishes
chat.user.{account}.event.room.key for every member and NATS supercluster
routes to home sites. inbox-worker is purely state replication into local
Valkey so broadcast-worker on each site can encrypt locally. RPC failure
in the remove path NAKs so jetstream retries; reads vs writes are correctly
classified.
https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
…e and cross-site replication room-service: drives handleCreateRoom against real Valkey and asserts the keypair is persisted at the expected hash key with non-empty bytes. inbox-worker: two-site test simulates an origin via an in-memory NATS responder, drives handleRoomCreated on the destination, and asserts the destination's Valkey now holds the replicated keypair. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
… deploys
Adds a new client-facing section documenting chat.user.{account}.event.room.key,
the RoomKeyEvent payload, client-side (roomId, version) → privateKey map
semantics, and the trigger scenarios (create, add-member, remove-member with
rotation). Deploy compose files set VALKEY_ADDR and VALKEY_KEY_GRACE_PERIOD
for both workers; inbox-worker also gets ROOM_KEY_RPC_TIMEOUT for operator
visibility.
https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
…tors inbox-worker/integration_test.go called NewHandler with 5 args after the fan-out removal trimmed it to 4. pkg/roomkeysender/integration_test.go passed *RoomKeyEvent after Send switched to by-value. Both compiled only under -tags=integration so the unit test gate missed them. Fixed 8 NewHandler call sites and the one Send call. https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
Cleanup pass surfaced by the final review: - Drop NatsPublisher adapter (*nats.Conn already satisfies Publisher) - Drop TestSender_DoesNotMutateInputTimestamp (tests a language guarantee) - Drop mongoInboxStore.ListByRoom (orphaned after fan-out removal) - Drop 5 vacuous pub.count() == 0 assertions in inbox-worker tests - Unexport room-key sentinel errors (no external consumers) - Drop dead error return from fanOutRoomKeyToSurvivors (always nil) - Drop unnecessary []model.User -> []*model.User conversion - Drop vanity TestMetrics_AreNonNil - Reword version-gate error message and "Batch-3" internal naming - Single-line replicateLocalKey doc; document VALKEY_KEY_GRACE_PERIOD; document the compose-file Valkey dependency - Fix KeyRotated/KeyGenerated metric semantics on Set-fallback path https://claude.ai/code/session_01MdBhQNQhw59g5PAmvstj7m
inbox-worker no longer calls chat.server.request.roomkey.{siteID}.get,
so the server-side handler and its supporting types are unreachable.
Removed:
- room-worker.NatsHandleGetRoomKey + handleGetRoomKey
- room-worker QueueSubscribe registration for the RPC subject
- model.RoomKeyGetRequest
- subject.ServerRoomKeyGet + its unit test
- errRoomKeyNotFound + errRoomKeyStoreInternal sentinels
- TestHandler_handleGetRoomKey
Clients only use PrivateKey to decrypt; PublicKey is server-side only (broadcast-worker reads it from Valkey). Stop populating PublicKey at the two fan-out call sites and mark the JSON tag omitempty so the field disappears from on-wire RoomKeyEvent payloads. Struct field is retained for any future server-side producer. Addresses review comments #10 and #12.
Closes the survivor decrypt-failure window: room-worker now does delete → fan-out new key → Valkey Rotate → publish system message, so survivors hold v+1 before broadcast-worker starts encrypting under it. - room-service stops calling keyStore.Rotate on member-remove; instead reads the current Valkey version and stamps it on the canonical event as RemoveMemberRequest.BaseKeyVersion. Dual-membership / no-deletion detection moves into room-worker (it owns the deletion decision). - room-worker reads currentPair on entry and skips the rotation block when Valkey is already ahead of req.BaseKeyVersion (redelivery guard for the crash-after-Rotate-ack-lost case). - room-worker generates the new keypair (via roomkeystore.GenerateKeyPair), fans it out to ListByRoom-post-delete survivors at predicted version v+1, then commits via keyStore.Rotate (Set fallback on ErrNoCurrentKey). - Moves generateRoomKeyPair from room-service to pkg/roomkeystore. - Removes MemberRemoveEvent.NewKeyVersion (no consumer left after the inter-site path was deleted). Renames RemoveMemberRequest.NewKeyVersion → BaseKeyVersion to match the new semantic. - Removes the now-unused room-service CountOrgOnlySubs + its index. Residual risks (accepted, documented in spec amendment): - Removed-user-read window (~10-100ms) between canonical publish and Mongo delete in room-worker. - Key-gen non-idempotence between fan-out and Rotate on JetStream redelivery (skip-version guard catches the after-Rotate case only).
…leness markers
- inbox-worker: remove unused Handler.siteID field and constructor param (12 call sites updated)
- inbox-worker: revert otelutil.InitMeter init and shutdown reorder; inbox-worker emits no metrics, so both were dead weight
- inbox-worker: condense Handler doc comment to a single line
- room-worker/main.go: collapse stale 3-line comment that still referenced the removed inter-site `chat.server.request.roomkey.{siteID}.get` RPC
- spec: add per-section "Stale — superseded by ... amendment" markers to the Remove-member scope bullet, the Remove-member data-flow section, the "Why rotate-first" rationale, and the cross-site key RPC handler subsection so individual sections aren't misleading on their own
https://claude.ai/code/session_013zd5nk3mukiVitkvvhniFi
The canonical remove-member event flows through `chat.room.canonical.{site}.member.remove`,
which is owned by the ROOMS stream (per pkg/stream/stream.go), not MESSAGES_CANONICAL.
MESSAGES_CANONICAL carries chat.msg.canonical.* for the message pipeline. Updated both
sequence diagrams.
Also spelled out the two concrete skip cases hidden behind `actually_deleted`:
- individual remove with dual org-membership (user stays via org sub)
- org remove where every target is also individually subscribed (accounts slice empty)
https://claude.ai/code/session_013zd5nk3mukiVitkvvhniFi
0cb8286 to
860dad0
Compare
…members Clients store room keys under a (roomID → key) map keyed by their existing subscriptions. If room.key arrives before subscription.update, the client has nowhere to store it and either drops it (security regression: subsequent messages in that room can't be decrypted) or has to buffer keys for unknown rooms (complexity the spec doesn't require). processAddMembers had the wrong order: buildAndFanOutRoomKey at line 823, SubscriptionUpdateEvent loop at line 827. Swapped. processCreateRoom's finishCreateRoom path (line 1198) already had the correct order. Added regression test TestHandler_ProcessAddMembers_PublishesSubscriptionUpdateBeforeRoomKey that wires both the regular publish callback and the keySender to a shared mockPublisher so the timeline can be asserted across event kinds. https://claude.ai/code/session_013zd5nk3mukiVitkvvhniFi
mliu33
left a comment
There was a problem hiding this comment.
Excellent work! Thanks! Just couple minor suggestions for optimizations
| roomkeymetrics.ValkeyErrors.Add(ctx, 1, metric.WithAttributes(attribute.String("op", "Get"))) | ||
| return fmt.Errorf("get room key: %w", err) | ||
| } | ||
| // Skip-rotation guard: a prior redelivery of this canonical event already rotated Valkey past req.BaseKeyVersion. |
There was a problem hiding this comment.
I actually can't think of the case where this might happen. Because Jetstream will not reliever unless you nak, and once the message is delivered, only one pod will receive the message.
Therefore, current version will always be less than or equal to req.BaseKeyVersion because if a prior redelivery already rotates the key, there won't be another redelivery happening.
If there is no such case, we should remove this check and also the one in room-service to reduce redundant valkey db call.
| roomID = req.RoomID | ||
|
|
||
| // Gate: key MUST exist before any Mongo write. | ||
| pair, err := h.keyStore.Get(ctx, req.RoomID) |
There was a problem hiding this comment.
One optimization is to pass room key down so that buildAndFanout does not need to make db call to fetch again
| roomkeymetrics.ValkeyErrors.Add(ctx, 1, metric.WithAttributes(attribute.String("op", "Rotate"))) | ||
| return fmt.Errorf("rotate room key: %w", err) | ||
| } | ||
| roomkeymetrics.KeyRotated.Add(ctx, 1) |
There was a problem hiding this comment.
In my opinion, I think there is not much benefit to track success rotate/generate key metrics. It's good to track error count, but not so much for success count as adding metric also costs some cpu time.
After rebasing onto main (which brought in #171 room encryption keys), two classes of test adaptations were needed: 1. #171 tests of processAddMembers / processRemoveIndividual now hit the PR's stricter name-field validation. Added EngName/ChineseName to user fixtures and added the GetUser expectation for the requester account. 2. PR tests of processAddMembers that build the Handler via struct literal now reach the unconditional buildAndFanOutRoomKey call added in #171. Added keyStore: testKeyStore, keySender: testKeySender to all such Handler inits so the no-op stubs satisfy the fan-out call. Also retired one non-UUID request ID literal ("req-add-members-ordering") that the PR's UUID-format validation now rejects. No production-code changes — purely test-side reconciliation between the two feature branches.
After rebasing onto main (which brought in #171 room encryption keys), two classes of test adaptations were needed: 1. #171 tests of processAddMembers / processRemoveIndividual now hit the PR's stricter name-field validation. Added EngName/ChineseName to user fixtures and added the GetUser expectation for the requester account. 2. PR tests of processAddMembers that build the Handler via struct literal now reach the unconditional buildAndFanOutRoomKey call added in #171. Added keyStore: testKeyStore, keySender: testKeySender to all such Handler inits so the no-op stubs satisfy the fan-out call. Also retired one non-UUID request ID literal ("req-add-members-ordering") that the PR's UUID-format validation now rejects. No production-code changes — purely test-side reconciliation between the two feature branches.
After rebasing onto main (which brought in #171 room encryption keys), two classes of test adaptations were needed: 1. #171 tests of processAddMembers / processRemoveIndividual now hit the PR's stricter name-field validation. Added EngName/ChineseName to user fixtures and added the GetUser expectation for the requester account. 2. PR tests of processAddMembers that build the Handler via struct literal now reach the unconditional buildAndFanOutRoomKey call added in #171. Added keyStore: testKeyStore, keySender: testKeySender to all such Handler inits so the no-op stubs satisfy the fan-out call. Also retired one non-UUID request ID literal ("req-add-members-ordering") that the PR's UUID-format validation now rejects. No production-code changes — purely test-side reconciliation between the two feature branches.
…ntent (#185) * docs(spec): add room-worker membership fixes design spec Spec for the room-worker membership filtering, system-message sender/content population, request-ID validation, fail-closed HasOrgRoomMembers handling, and DM participant fields work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(plan): add room-worker membership fixes implementation plan Task-by-task plan covering the membership-fix work: filter individual room docs, populate sender/content on system messages, fail-closed HasOrgRoomMembers, X-Request-ID UUID validation, and DM participant fields (Tasks 11-17). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(client-api): align sys-msg row and document Room.uids/accounts Align the system-message type row with the new msg/sender population and document the new Room.uids and Room.accounts DM-participant fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(model): add Room.UIDs/Accounts fields and BuildDMParticipants helper DM rooms now carry both the deterministic UID list and the matching account list as siblings, so clients can render DM members without an extra round-trip. BuildDMParticipants returns the canonically sorted pair for any two users. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(room-worker): add UpdateDMParticipants store method \$set's uids/accounts on a dm/botDM room after the counterpart account has been resolved. Idempotent under JetStream redelivery; returns an error when the target room is missing (the caller treats this as a permanent failure rather than silently skipping). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(room-worker): filter individual room docs to actual members on create/add The MongoDB org-membership backfill was over-broad: it created individual sub-room docs for every member of the source org rooms, not just the users being added. processAddMembers now filters indiv docs to req.Users only, createRoom filters to ResolvedUsers ∪ {requester}, and the backfill gate fires only on the first-org transition so subsequent orgs don't replay it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(room-worker): populate sender and content on member system messages Previously member_added / member_removed / member_left system messages went out with bare member lists and no sender, leaving clients nothing to render the actor or the affected users with. The room-worker now fetches the requester (validating EngName / SectName / SectID along the way), harvests SectName from the unfiltered org members so sect details survive the indiv-doc filter, and renders Content via new sysmsg formatters. Single- and multi-form Content variants cover the individual path and the org-bearing path; a 1-member org expansion correctly stays on multi form so future org members are accounted for. Adds pkg/model.MessageTypeMemberRemoved and MessageTypeMemberLeft to name the two new sys-msg variants the room-worker now emits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(room-worker): validation hardening — UUID request IDs and fail-closed org backfill Two defensive fixes in the same path: - processAddMembers and processCreateRoom now validate that the X-Request-ID header carries a well-formed UUID (v4 or v7) via idgen.IsValidUUID; non-UUIDs return a permanent error so log correlation stays usable. - HasOrgRoomMembers failures no longer fall through silently with the zero-value bool — defaulting hadOrgsBefore=false on error would spuriously trigger first-org backfill on a room that already has org docs, and the duplicate inserts would only be masked by the Mongo unique index. Surface the error so JetStream redelivery retries with fresh state. Comments on these branches trimmed to single-line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(room-worker): set DM participant fields on DM/botDM rooms DM and botDM rooms now persist matching uids/accounts pairs at room creation across all three create paths: - Async DM create (processCreateRoom, dm type) — calls UpdateDMParticipants after counterpart resolution. - Async botDM create (processCreateRoom, botDM type) — same path, bot account resolved like a regular user. - Sync DM create (handleSyncCreateDM) — sets UIDs/Accounts on the CreateRoom literal so the participants land in the same write as the room itself; on a dup-key (the DM existed already), the path is forward-only and does not backfill — the participants are immutable for a DM room, so no recovery work is needed. UpdateDMParticipants returns an error when no room matches the ID (room-not-found is permanent, not silently skipped). The channel create path is left untouched; a guard test pins that channels must omit UIDs/Accounts so omitempty drops them from BSON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(room-worker): adapt tests for room-encryption-keys merge After rebasing onto main (which brought in #171 room encryption keys), two classes of test adaptations were needed: 1. #171 tests of processAddMembers / processRemoveIndividual now hit the PR's stricter name-field validation. Added EngName/ChineseName to user fixtures and added the GetUser expectation for the requester account. 2. PR tests of processAddMembers that build the Handler via struct literal now reach the unconditional buildAndFanOutRoomKey call added in #171. Added keyStore: testKeyStore, keySender: testKeySender to all such Handler inits so the no-op stubs satisfy the fan-out call. Also retired one non-UUID request ID literal ("req-add-members-ordering") that the PR's UUID-format validation now rejects. No production-code changes — purely test-side reconciliation between the two feature branches. * docs(plan,spec) + test(room-worker): align scope statements and extend integration tests Documentation: the plan's architecture summary and file-map preamble, and the spec's §2.6 closing line, all claimed there were no model or store interface changes — contradicting Tasks 11-12 (Room.UIDs/ Accounts + BuildDMParticipants, UpdateDMParticipants on SubscriptionStore). Updated the three lines so the scope statements match what the plan and spec actually deliver. Tests: the existing room-worker feature integration tests exercised the code paths for the four bug fixes but did not assert the user-facing artifacts those fixes produce. Extended: - TestSyncCreateDM_DM_PersistsRoomAndSubs + TestProcessCreateRoomDM- PersistsTwoSubsAndZeroMembers: assert room.UIDs and room.Accounts are persisted (sorted by uid, paired by index) — covers DM participant fields persistence. - TestProcessAddMembers_PublishesLocalInbox_Integration: assert the canonical members_added sys-message carries UserAccount=requester and formatter-rendered Content — covers empty-Content + missing- sender bugs on add. - TestProcessRemoveIndividual_PublishesLocalInbox_Integration: same assertion for the member_removed sys-message — covers empty-Content + missing-sender bugs on forced removal. Org-member duplication remains covered by handler_test.go unit tests; the existing AddMembers integration test uses only direct users so it doesn't exercise the duplication scenario. * feat(broadcast-worker): fan out DM/botDM mutation events via room.Accounts; skip bots fanOutMutationEvent previously called ListSubscriptions on every DM edit/delete to recover recipient accounts — but Room.Accounts (denormalized onto the room doc by room-worker for DM/botDM rooms) already carries the same data. Drop the Mongo round-trip and iterate room.Accounts directly. Merge the DM and BotDM cases (BotDM previously fell into the unknown-room-type default branch and logged a misleading warning) and skip bot accounts so the bot in a botDM does not receive its own edit/delete echo on a subject nothing listens on. Add a local isBot helper mirroring the predicate already present in message-gatekeeper and room-service; promotion to a shared pkg/botid is the right cleanup once a fourth duplicate is no longer acceptable. Tests: - existing DM edit/delete fan-out tests switched fixtures from a ListSubscriptions mock to Accounts on the Room struct (same coverage) - new TestHandleUpdated_BotDMRoom_SkipsBotAccount asserts the bot is filtered in a botDM room Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(room-worker): single-write DM/botDM create; dedupe displayName Addresses two PR #185 review comments from mliu: 1. processCreateRoom previously inserted the room and then ran a follow-up UpdateDMParticipants UpdateOne to backfill UIDs/Accounts after the counterpart user was resolved. That second Mongo round-trip is avoidable for new rooms: fetch the counterpart upfront, populate Room.UIDs/Accounts on the literal, and let CreateRoom persist the complete document in one write — matching the sync DM path's pattern. UpdateDMParticipants is removed from SubscriptionStore (interface + Mongo impl + mock); no callers remain. The duplicate-key replay branch keeps the existing identity-equality semantics (Type/SiteID/Name/CreatedBy must match); it no longer rewrites UIDs/Accounts onto the existing doc, since rooms created by this code path already have those fields and legacy rooms are out of scope for the hot path. The newly-similar processCreateRoomDM/BotDM helpers collapsed into the switch, mirroring the inline if-RoomTypeBotDM pattern handleSyncCreateDM already uses. 2. displayName(u) used to return TrimSpace(EngName + " " + ChineseName). For account-only users where EngName == ChineseName, that rendered as "Bob Bob". Short-circuit to a single rendering when the two fields are equal. Channels remain the only consumer of system-message content; DM/botDM paths don't render via this helper. Tests: - F1/F2 (DM/BotDM SetsParticipantFields) now capture the Room passed to CreateRoom and assert UIDs/Accounts on it, replacing the previous UpdateDMParticipants arg-pin - CollisionMismatchType test threads the upfront counterpart GetUser - TestFormatLeft_EngEqualsChineseRendersOnce pins the new dedupe contract - mocks regenerated; integration test compile-clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(room-worker): sys-message format follow-ups for PR #185 Three follow-ups from PR #185 review: 1. Drop validateUserNames + its five call sites (mliu PR comment). EngName and ChineseName are now treated as optional. displayName falls through to a single non-empty side, and finally to the Account string when both are blank — the rendered body never collapses to a bare quote pair. 2. Quote display names in every channel sys-message: `"Alice Anderson 王愛麗" added "Frank Fischer 費法蘭" to the channel` (and analogous shapes for added-multi / removed-user / removed-org / left). formatRemovedOrg quotes the SectName too. 3. Populate the sender envelope (Message.UserID + UserAccount) on the member_left / member_removed sys-messages emitted by processRemoveIndividual and processRemoveOrg. message-worker looks up the user by UserID to fill the Cassandra `sender` Participant column; previously these rows landed with sender = null. Self-leave reuses the already-fetched leaving user; forced-removal and org-removal fetch the requester via the existing store.GetUser (no new store method needed). Older Cassandra rows are intentionally not backfilled — only newly emitted messages get the populated sender. Also capitalise "A new room has been created" on the room_created sys-message body. Tests: - sysmsg_test: drop TestValidateUserNames; add TestDisplayName covering every fallback branch and TestFormatLeft_FallsBackToAccount / TestFormatAddedSingle_SingleNameSide; update existing formatter assertions for the new quoted shape. - handler_test: drop the four empty-name permanent-error tests (TestHandler_ProcessAddMembers_RequesterEmptyName, _AddedUserEmptyName, TestHandler_ProcessRemoveIndividual_EmptyName, TestProcessCreateRoom_RequesterEmptyName_ReturnsPermanent); update existing content assertions to the quoted format; mock GetUser on the requester for forced-remove / org-remove paths and assert sysMsg.UserID is set to the requester ID. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(room-worker): update integration test assertions for quoted sys-msg shape Two integration assertions still expected the unquoted sys-message bodies; mirror the change from the prior commit so the live MongoDB+Cassandra container test agrees with the in-memory handler tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #171 follow-up Finding 1: room-service stamped the current Valkey key version into RemoveMemberRequest.BaseKeyVersion and room-worker skipped rotation when Valkey was already past that baseline. The reviewer pointed out that JetStream only redelivers on NAK and a single delivery only ever reaches one pod, so the post-rotate redelivery scenario the guard was shielding against can't actually arise — and the extra Valkey Get on the hot remove path is unnecessary work. Drop the whole machinery: remove BaseKeyVersion from RemoveMemberRequest, delete the keyStore.Get-and-stamp block in room-service/handleRemoveMember, drop the shouldRotate computation and parameter threading through processRemoveOrg / processRemoveIndividual in room-worker, and unconditionally rotate-and-fan-out (the existing `len(accounts) > 0` guard on the org path stays since rotation is genuinely a no-op when nothing was deleted). Spec + plan updated: Finding 1 is removed from Part 7 in both the design doc and the plan, the remaining two findings are renumbered, and the "Skip-rotation guard" test in room-worker plus the "StampsBaseKeyVersion" test in room-service are deleted. The two ProcessRemoveIndividual content tests pick up a ListByRoom expectation because the rotate path now always runs.
Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #171 follow-up Finding 1: room-service stamped the current Valkey key version into RemoveMemberRequest.BaseKeyVersion and room-worker skipped rotation when Valkey was already past that baseline. The reviewer pointed out that JetStream only redelivers on NAK and a single delivery only ever reaches one pod, so the post-rotate redelivery scenario the guard was shielding against can't actually arise — and the extra Valkey Get on the hot remove path is unnecessary work. Drop the whole machinery: remove BaseKeyVersion from RemoveMemberRequest, delete the keyStore.Get-and-stamp block in room-service/handleRemoveMember, drop the shouldRotate computation and parameter threading through processRemoveOrg / processRemoveIndividual in room-worker, and unconditionally rotate-and-fan-out (the existing `len(accounts) > 0` guard on the org path stays since rotation is genuinely a no-op when nothing was deleted). Spec + plan updated: Finding 1 is removed from Part 7 in both the design doc and the plan, the remaining two findings are renumbered, and the "Skip-rotation guard" test in room-worker plus the "StampsBaseKeyVersion" test in room-service are deleted. The two ProcessRemoveIndividual content tests pick up a ListByRoom expectation because the rotate path now always runs.
Three unresolved threads from @mliu33 on the merged room-encryption-keys PR. 1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the premise that "JetStream won't redeliver after rotate succeeds" misses three NAK sites in processRemoveIndividual (GetUser, sys-msg publish, cross-site outbox) and the symmetric ones in processRemoveOrg. Without the guard, redelivery re-runs rotateAndFanOut which calls GenerateKeyPair() — fresh random bytes — and survivors briefly offline during the second fan-out are stranded on the prior key. Tightened the inline comment to record the rationale; no behavior change. 2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789): accepted. Refactored to take *VersionedKeyPair as a parameter and threaded the gate-Get pair from processCreateRoom through processCreateRoomChannel + finishCreateRoom. Saves one Valkey round trip per channel/DM create. processAddMembers fetches the pair locally before fan-out so the contract stays uniform (caller owns the fetch). Kept the nil-check inside as a defensive guard. 3. KeyGenerated / KeyRotated success counters: dropped at every emit site (room-worker x3, room-service x1) and removed the declarations from pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors, KeyAbsentErrors) stay. Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7 Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #171 follow-up Finding 1: room-service stamped the current Valkey key version into RemoveMemberRequest.BaseKeyVersion and room-worker skipped rotation when Valkey was already past that baseline. The reviewer pointed out that JetStream only redelivers on NAK and a single delivery only ever reaches one pod, so the post-rotate redelivery scenario the guard was shielding against can't actually arise — and the extra Valkey Get on the hot remove path is unnecessary work. Drop the whole machinery: remove BaseKeyVersion from RemoveMemberRequest, delete the keyStore.Get-and-stamp block in room-service/handleRemoveMember, drop the shouldRotate computation and parameter threading through processRemoveOrg / processRemoveIndividual in room-worker, and unconditionally rotate-and-fan-out (the existing `len(accounts) > 0` guard on the org path stays since rotation is genuinely a no-op when nothing was deleted). Spec + plan updated: Finding 1 is removed from Part 7 in both the design doc and the plan, the remaining two findings are renumbered, and the "Skip-rotation guard" test in room-worker plus the "StampsBaseKeyVersion" test in room-service are deleted. The two ProcessRemoveIndividual content tests pick up a ListByRoom expectation because the rotate path now always runs.
* docs(spec): member-add improvements + DM dedup + CreatedBy/TargetUser cleanups
Five-part spec, perf-hardened per second /simplify pass.
Parts 1-3: org→individual silent no-op fix, deptId support with
prefer-dept-on-overlap, and CreateRoomDialog dedup-reply short-circuit.
New GetCapacityCheckPipeline + GetAddMemberCandidatesPipeline split
capacity-counting (3 stages, no room_members lookup) from full-candidate
resolution (4 stages, room_members lookup via member.id covered by the
existing unique index). New (deptId, account) index keeps both branches
of the extended $or index-covered. Enrichment _orgMatch pipeline stays
at 3 inner stages by emitting raw {name, tcName} and combining in Go;
combineWithFallback moves to pkg/displayfmt as the single source of
truth for both sys-msg and member-list display.
Part 4: drops Room.CreatedBy and rewrites the JetStream replay-equivalence
check to use a reusable reconcileRoomOnDuplicateKey helper. The new check
verifies (Type, SiteID) structural match and requester-sub-exists; this
fixes DM concurrent-create (today incorrectly rejected as "room ID
collision") while still catching real ID collisions (requester has no sub
in the unrelated existing room).
Part 5: drops never-written target_user column from all four Cassandra
message tables, the Go struct, history-service read-side baseColumns,
and the docs.
All new queries verified index-covered; performance/indexes table in
Part 2 enumerates every read path.
* docs(plan): implementation plan for member-add improvements + cleanups
14 tasks across five parts mirroring the spec, hardened per the second
/simplify pass.
Notable updates over the first draft:
- pkg/displayfmt shared package introduced in Task 7 — single source of
truth for CombineWithFallback, consumed by both room-worker (sys-msgs)
and room-service (member-list enrichment).
- Task 11 enrichment pipeline kept at 3 inner stages (vs. the earlier
6-stage draft) by emitting raw {name, tcName} and applying the
combine in Go after cursor.All.
- Task 13 (Room.CreatedBy removal) extracts the duplicate-key recovery
into a single Handler.reconcileRoomOnDuplicateKey helper used by both
processCreateRoom and the sync-DM path. Adds a "no extra writes"
pre/post snapshot assertion to guard against silent-dedup false
passes. Frontend and docs/client-api.md grep uses word-boundary regex
to avoid stripping the unrelated createdByAccount field.
- Task 14 (target_user removal) adds explicit test-fixture cleanup
steps covering pkg/model/cassandra/message_test.go and the two
history-service integration tests that today assert on .TargetUser.
TDD per task: failing test → minimal implementation → green → commit.
Each task is independently committable and reviewable.
* feat(pipelines): add GetCapacityCheckPipeline + GetAddMemberCandidatesPipeline
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* feat(model): add SectTCName, DeptID, DeptName, DeptTCName fields to User
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* feat(room-service): add (deptId, account) index for deptId-matching pipelines
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* feat(room-worker): add ListAddMemberCandidates with per-candidate flags
Adds AddMemberCandidate type, ListAddMemberCandidates to SubscriptionStore
interface and MongoStore implementation backed by GetAddMemberCandidatesPipeline,
returning per-user {hasSubscription, hasIndividualRoomMember} flags so the
handler can split candidates into needSub vs needIRM upgrade paths. Regenerates
mock and adds integration test covering the bug scenario (sub-without-IRM) plus
new-member and full-member states. Also includes room-service CountNewMembers
integration test from the preceding task that was not yet committed.
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* fix(chat-frontend): dedup reply navigates directly without summaries-wait
When the server replies with "dm already exists", the handleSubmit now
calls onCreated+onClose immediately and returns, bypassing setPendingRoom
entirely. This prevents the 3-second "taking longer than expected" banner
from firing on a BUCKETS_LOADED race where summaries is still empty at
the moment the dedup response lands.
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* refactor(cassandra): drop unused target_user column from message schema
The target_user column was never written anywhere; all reads returned NULL.
Removes the field from the Go struct, all four init CQL files, the
history-service baseColumns select lists, both docker-compose inline
schemas, the canonical docs, and all test fixtures/assertions.
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* feat(pipelines): match deptId alongside sectId in candidate $or
Extends matchCandidates (used by GetCapacityCheckPipeline and
GetAddMemberCandidatesPipeline) and GetNewMembersPipeline to include a
deptId branch in the $or filter so users whose org affiliation is stored
in deptId are picked up the same way as those in sectId.
Also updates the existing pipeline unit tests that asserted the old
$or clause count, and adds an integration test
(TestMongoStore_ListAddMemberCandidates_DeptMatching_Integration) that
proves both fields are matched correctly.
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* feat(room-service): enrichment prefers dept on overlap with Go-side combine, falls back to orgID
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* fix(room-worker): create individual room_members row on org→individual member.add
Rewrite processAddMembers around the per-candidate ListAddMemberCandidates
flags: split candidates into needSub (no subscription yet) and needIRM
(no individual room_members row yet, gated by writeIndividuals). This
unblocks the org→individual upgrade path — a user previously added via
org expansion (subscription + org room_members row, no individual row)
can now be re-added via req.Users and get the missing individual row
WITHOUT a duplicate subscription.
Remove ListNewMembers from the store interface and its MongoStore impl;
the new ListAddMemberCandidates fully supersedes it.
* refactor(room-worker): OrgMemberStatus carries (Name,TCName,IsDept); lookup via member.id
* feat(room-worker): processRemoveOrg uses dept-first tiebreak with displayOrg + orgID fallback
* refactor(room-worker): drop Room.CreatedBy, extract reconcileRoomOnDuplicateKey, use requester-sub-exists for replay equivalence
The JetStream duplicate-key recovery compared Type+SiteID+Name+CreatedBy
on the existing room, which was over-determined and incorrectly rejected
DM concurrent-create by counterpart: the loser's worker saw existing
CreatedBy != room.CreatedBy and returned "room ID collision" even though
the room IS the intended DM and both subscriptions were already written
by the winner.
Switch to Type+SiteID+requester-sub-exists. If the requester already has
a subscription in the existing room, the collision is either a legitimate
JetStream redelivery or the concurrent-create-equivalent case. Extract a
reconcileRoomOnDuplicateKey helper so both processCreateRoom and
handleSyncCreateDM share one source of truth.
CreatedBy was also exposed on the wire (json:"createdBy") and in the
frontend Room type but never read. Remove the field entirely from
pkg/model.Room and update the client-api docs (createdByAccount on
the create-request schema is a separate field and stays intact).
The two non-listed test files (room-service/integration_test.go,
inbox-worker/handler_test.go) had to drop CreatedBy fixture lines to
keep compilation green after the model-field removal.
* test: add pipeline unit tests + strengthen DM concurrent test; polish slog + pre-alloc + godoc
- pkg/pipelines/member_test.go: add TestGetCapacityCheckPipeline and
TestGetAddMemberCandidatesPipeline covering stage counts, $match shapes
for orgIDs-only / directAccounts-only / both, excludeAccount $ne, and
panic on empty roomID
- pkg/pipelines/member.go: add "Panics if roomID is empty" to godocs for
GetCapacityCheckPipeline and GetAddMemberCandidatesPipeline
- room-worker/handler.go processRemoveOrg: add requestID field to slog.Warn
for log correlation; pre-allocate needSub/needIRM slices with make(…, 0, cap)
- room-worker/store_mongo.go: add godoc to BulkCreateSubscriptions explaining
silent dup-key swallow, idempotency contract, and why it is required
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* fix: 4 review findings — reconcileRoomOnDuplicateKey idempotency, spurious sys-msg on upgrade, dept-blind membership lookups, room-service/worker display parity
- room-worker/handler.go reconcileRoomOnDuplicateKey: drop requester-must-be-
member check; structural (Type+SiteID) match is sufficient. The earlier
guard turned mid-write crashes (CreateRoom succeeded, worker died before
BulkCreateSubscriptions) into permanent failures that orphaned rooms in
Mongo. With the requester check removed, JetStream redelivery finds the
existing room, finishes the unfinished subscription writes, and the room
is no longer stuck. Updated both call sites (processCreateRoom and
handleSyncCreateDM) and the existing collision-mismatch tests; added a
channel mid-write-crash integration test.
- room-worker/handler.go processAddMembers: gate MemberAddEvent publish and
members_added sys-msg construction on len(actualAccounts)>0 || len(req.Orgs)>0.
The org→individual upgrade path (only needIRM non-empty) writes the missing
room_members row silently — no membership state changed for the room, so
emitting an empty member-add event and an "added members to the channel"
sys-msg with no actual members would mislead end users. Extended the
existing OrgToIndividualUpgrade unit test to assert neither event fires.
- room-service/store_mongo.go GetSubscriptionWithMembership and
room-worker/store_mongo.go GetUserWithMembership: extend the org-lookup
$let from {sectId} to {sectId, deptId} and the inner $match from a single
$eq on member.id to a $or over both. A user added via Orgs:["X"] whose
deptId is "X" (no sectId match) previously reported HasOrgMembership=false
— the symmetric bug to the GetOrgMembersWithIndividualStatus fix from
Part 1. Added a dept-only-match regression test in each integration suite.
- room-service/store_mongo.go getRoomMembers Go-side combine: switch from
unconditional dept-on-overlap to a two-pass dept-first-with-fallback that
mirrors room-worker's processRemoveOrg. A user with deptId="X" + empty
deptName plus a sibling with sectName="Engineering" now renders
"Engineering" instead of the orgID, restoring byte-identical output
between member-list and sys-msg paths. Added a regression test for this
exact case to TestMongoStore_ListRoomMembers_Enrich_Integration.
https://claude.ai/code/session_01Wn4D4EcfZ5nKAuF6vbSBsA
* feat(room-service): ListOrgMembers matches sectId OR deptId
Symmetric to the dept-aware membership lookups landed in e016623
(GetSubscriptionWithMembership, GetUserWithMembership) and to
GetOrgMembersWithIndividualStatus in room-worker. The org-row
"expand" RPC (chat.user.{a}.request.orgs.{orgId}.members) previously
returned errInvalidOrg for any org added by deptId, so the member
dialog rendered "Failed to load members" on dept-scoped org rows.
Both (sectId, account) and (deptId, account) indexes back the $or
so the query stays index-covered. Added two integration sub-tests
under TestMongoStore_ListOrgMembers_Integration covering the
distinct-sectId/deptId case and the sectId==deptId invariant case.
Updated docs/client-api.md and the branch spec
(2026-05-19-org-to-individual-membership-upgrade-design.md) to
reflect the broadened semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(sast): pin runtime toolchain on sast-vuln target
govulncheck shells out to `go list` at runtime; when the system Go
diverges from TOOLS_GO_TOOLCHAIN, the binary (built with 1.25) sees
a 1.26 source tree and panics with "file requires newer Go version".
Setting GOTOOLCHAIN at runtime makes the dispatcher pick the same
toolchain the binary was built against. Belt-and-suspenders even
when the system Go matches the pin — keeps the target robust to
future toolchain skews.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(room-service): reject phantom org IDs and accounts at request time
Both member.add and channel-create silently accepted orgs/users with no
backing user document. The candidates aggregation dropped them, so the
async-job reply reported success while the worker either wrote a
room_members row for a zero-user org (firing a spurious members_added
sys-msg) or skipped the targeted user entirely.
Gate at the room-service request boundary with two new store methods
(FindExistingOrgIDs, FindExistingAccounts) and two handler helpers
(validateOrgIDs, validateAccountsExist). errInvalidOrg / errUserNotFound
sentinels were already in sanitizeError's allow-list; clients now see a
clean synchronous error envelope and nothing reaches the canonical stream.
Also folds in two follow-ups from the prior review pass:
- Mongo testimage comment now records that the retired 4.4.15 guard
against partialFilterExpression operator-allow-list regressions has
been intentionally given up under the mongo:8 bump.
- ListOrgMembers integration test gains a "neither sectId nor deptId
matches" sub-test so dedupe across the broadened \$or is exercised.
Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 6
Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 15
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(room-worker): address PR #171 follow-up review findings
Three unresolved threads from @mliu33 on the merged room-encryption-keys PR.
1. shouldRotate guard (room-worker/handler.go:319): kept the condition; the
premise that "JetStream won't redeliver after rotate succeeds" misses
three NAK sites in processRemoveIndividual (GetUser, sys-msg publish,
cross-site outbox) and the symmetric ones in processRemoveOrg. Without
the guard, redelivery re-runs rotateAndFanOut which calls
GenerateKeyPair() — fresh random bytes — and survivors briefly offline
during the second fan-out are stranded on the prior key. Tightened the
inline comment to record the rationale; no behavior change.
2. buildAndFanOutRoomKey redundant Get (room-worker/handler.go:1789):
accepted. Refactored to take *VersionedKeyPair as a parameter and
threaded the gate-Get pair from processCreateRoom through
processCreateRoomChannel + finishCreateRoom. Saves one Valkey round
trip per channel/DM create. processAddMembers fetches the pair locally
before fan-out so the contract stays uniform (caller owns the fetch).
Kept the nil-check inside as a defensive guard.
3. KeyGenerated / KeyRotated success counters: dropped at every emit site
(room-worker x3, room-service x1) and removed the declarations from
pkg/roomkeymetrics. Error counters (FanoutErrors, ValkeyErrors,
KeyAbsentErrors) stay.
Spec: docs/superpowers/specs/2026-05-19-org-to-individual-membership-upgrade-design.md Part 7
Plan: docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md Tasks 16-19
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: apply branch-review findings — panic→error, backfill fail-hard, phantom-validation tests
- pkg/pipelines: convert GetCapacityCheckPipeline / GetAddMemberCandidatesPipeline
from panic-on-empty-roomID to (bson.A, error); library code should not crash
the worker on bad caller input. Extract bot/pseudo regex into package-level
const so GetNewMembersPipeline and matchCandidates can't drift.
- room-worker/processAddMembers: propagate backfill query errors instead of
log-and-continue. Log-and-continue would silently corrupt room_members
(existing subs would never get IRM rows). Retry is safe — subs are
already written so needSub is empty, hadOrgsBefore stays false until
BulkCreateRoomMembers commits, and the backfill re-runs cleanly.
- room-worker: requestID propagation on the two member-add publish-error logs
now inside the gated block; wrap bare returns after reconcileRoomOnDuplicateKey
and rotateAndFanOut; refine error-wrap messages on the dup-key fetch path.
Trim BulkCreateSubscriptions doc to one why-line. Project u.account on
GetSubscriptionAccounts.
- room-service: rename _orgRaw BSON tag to orgRaw, add omitempty across
orgRawDisplay siblings. Clarify validateOrgIDs / validateAccountsExist
godoc to record the fmt.Errorf wrap-with-input shape that callers see
via errors.Is.
- room-service tests: consolidate 5 sibling AddMembers phantom tests into a
single table-driven case; add parallel CreateRoomChannel phantom table so
the same gate is exercised on both paths; add store-error branch cases to
cover the FindExistingOrgIDs / FindExistingAccounts error propagation
path (was untested per CLAUDE.md §4).
- room-worker tests: add backfill-error fail-hard tests for both
GetSubscriptionAccounts and FindUsersByAccounts failure paths.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(make): drop COMPOSE_BAKE/COMPOSE_PARALLEL_LIMIT from `make up`
Removes the inline `COMPOSE_BAKE=false COMPOSE_PARALLEL_LIMIT=1` env-vars
from both branches of the `up` target. The Docker Compose CLI applies its
own defaults; pinning them here predates the local toolchain and is no
longer needed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix: resolve coderabbit findings on PR #212
- pkg/displayfmt/combine_test.go: switch to in-package tests (`package
displayfmt`) per CLAUDE.md §4 rule; drop the now-unused local import.
- room-worker/handler.go: wrap bare `return nil, reconcileErr` in
processSyncCreateDM's dup-key reconcile branch with sync-DM context
(CLAUDE.md §3 "Never return bare err").
- room-service/integration_test.go: make two dept-only test cases actually
dept-only by setting SectID to a non-matching value (was `dept-x`, so
the tests would have passed via sectId match alone — defeating the
intent of "dept-only branch is doing its job").
- room-worker/store.go: refresh stale doc on ListNewMembersForNewRoom —
refer to ListAddMemberCandidates (the current contract) instead of the
removed ListNewMembers.
- docs/superpowers: rename stale `_orgRaw` → `orgRaw` in both the plan
snippet and the spec narrative to match the renamed BSON tag; add
language identifiers (```sh) to the four shell-command fenced blocks
the reviewer called out in the plan.
Skipped intentionally:
- Makefile COMPOSE_BAKE finding: the user explicitly asked to REMOVE
COMPOSE_BAKE/COMPOSE_PARALLEL_LIMIT (commit 3df2b19); bot's suggestion
to set COMPOSE_BAKE=true contradicts that directive.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(client-api): document users/orgs/channels on Create Room
Close the consistency gap CodeRabbit flagged on PR #212: the Create
Room request schema was missing users/orgs/channels even though the
handler accepts them and the same section's error envelope mentions
"invalid org" / "user not found" rejections for those exact fields.
Add them to the request-body table (marked channel-only and optional,
matching the gate the worker enforces) and update the example JSON so
the schema, example, and error envelope agree.
* refactor(room-service,room-worker): drop unused single-row CRUD store methods
All production write paths now use Bulk* variants (BulkCreateSubscriptions,
BulkCreateRoomMembers); the single-row CreateSubscription, CreateRoom (room-
service only), and CreateRoomMember linger on the store interfaces with zero
live callers and are only kept alive by integration-test seeding. Remove them
from the interface + Mongo impl, point the test seed sites at the existing
mustInsertRoom / mustInsertSub helpers (direct collection inserts), and drop
the now-misleading "existing methods (invite flow)" section comment in
room-worker/store.go. Mocks regenerated; lint, unit tests, and integration
tests all green locally.
* refactor: drop shouldRotate guard and BaseKeyVersion per PR #171 review
PR #171 follow-up Finding 1: room-service stamped the current Valkey key
version into RemoveMemberRequest.BaseKeyVersion and room-worker skipped
rotation when Valkey was already past that baseline. The reviewer pointed
out that JetStream only redelivers on NAK and a single delivery only ever
reaches one pod, so the post-rotate redelivery scenario the guard was
shielding against can't actually arise — and the extra Valkey Get on the
hot remove path is unnecessary work.
Drop the whole machinery: remove BaseKeyVersion from RemoveMemberRequest,
delete the keyStore.Get-and-stamp block in room-service/handleRemoveMember,
drop the shouldRotate computation and parameter threading through
processRemoveOrg / processRemoveIndividual in room-worker, and unconditionally
rotate-and-fan-out (the existing `len(accounts) > 0` guard on the org path
stays since rotation is genuinely a no-op when nothing was deleted).
Spec + plan updated: Finding 1 is removed from Part 7 in both the design
doc and the plan, the remaining two findings are renumbered, and the
"Skip-rotation guard" test in room-worker plus the "StampsBaseKeyVersion"
test in room-service are deleted. The two ProcessRemoveIndividual content
tests pick up a ListByRoom expectation because the rotate path now always
runs.
* docs: drop stale createdBy / createdByAccount from Room and Create Room
Neither field exists on the wire. Room.CreatedBy was removed from the
model earlier in this PR (Task 13), and createdByAccount was never a
field on CreateRoomRequest — the creator's account is taken from the
{account} segment of the create-room subject by ParseRoomCreateSubject
and never read from the request body.
- docs/client-api.md: remove the createdByAccount row from the Create
Room request-body table and from the example JSON; add a one-liner
pointing readers at the subject-derived creator.
- docs/superpowers/spec.md: remove the stale CreatedBy row from the
Room schema table in section 4.1.
- docs/superpowers/plans/2026-05-19-member-add-improvements-plan.md:
rewrite Step 11/12 — the original plan instructed the grep to spare
createdByAccount on the assumption it was a real field; both names
should be stripped.
Historical design/plan docs (2026-03..2026-05-07) that record the
field as it existed at those points in time are left intact.
* fix: address remaining CodeRabbit findings on PR #212
room-worker/handler.go (Major):
Backfill in processAddMembers fed FindUsersByAccounts the existing
subscriber accounts and assumed the result covered every requested
one. When a stale subscription points at a deleted user document the
call returns a subset; the buggy code committed room_members for the
rows it got, BulkCreateRoomMembers flipped hadOrgsBefore=true, and
no future redelivery could ever repair the missing IRM rows (backfill
only fires on the first-org transition). Add a fail-hard guard that
compares returned accounts against the requested set and returns
newPermanent("backfill user %s not found ...") before any write. New
regression test in handler_test.go locks in the contract.
docs/client-api.md:
Replace the stale "creating a room only enrolls the owner" line — it
contradicts the request schema introduced earlier this PR, which now
documents users/orgs/channels enrollment on create.
docs/superpowers/specs/2026-05-19-...-upgrade-design.md:
Reconcile contradictory stage-count claims for the _orgMatch $lookup
inner pipeline. The later "grows to 6 stages" sentence was wrong;
verified against room-service/store_mongo.go:520-552 — the pipeline
is 3 stages ($match → $addFields → $group), same as before. The
$group merely widens its accumulator set.
pkg/testutil/testimages/testimages.go + docker-local/compose.deps.yaml:
Replace the floating "mongo:8" / "valkey/valkey:8-alpine" tags with
patch-pinned mongo:8.2.9 / valkey/valkey:8.1.7-alpine so testcontainers
can't drift across patch releases between CI runs.
* docs(client-api): widen userCount note for create-with-initial-members
The Room schema table still claimed userCount is "1 immediately after
creation (the owner)", but the same section now documents users / orgs /
channels enrollment at create time, so the count can land >1. Rewrite
the note to match.
* refactor(room-service): fold FindExistingOrgIDs dedupe into one pass
Per @mliu33's review on room-service/store_mongo.go:770. The previous
implementation iterated sectIDs, deptIDs, then a separate map → slice
loop (three passes). Build the result slice directly inside the
sectIDs / deptIDs loops, using the map only as a dedupe set, so we
make two passes instead of three and skip the map → slice copy.
Behavior unchanged; existing integration test covers it.
* perf(room-service): collapse FindExistingOrgIDs into one Mongo round-trip
Per @mliu33's review: the previous implementation issued two Distinct
calls (one on sectId, one on deptId) and then merged + deduped in Go.
That's two network round-trips to Mongo per room.member.add and
channel-create request, which run on every gate check.
Rewrite as a single aggregate using $unionWith. The main pipeline does
the sectId distinct, $unionWith folds in the deptId distinct, and a
terminal $group dedupes the union server-side. Each $match sits at the
top of its own pipeline so the (sectId, account) and (deptId, account)
compound indexes both still cover their respective branches —
explain() against the integration container confirms DISTINCT_SCAN on
both indexes (Mongo even applies $groupByDistinctScan), no COLLSCAN.
Net: 1 round-trip instead of 2, same index coverage, less Go-side
allocation (no more dedup map + intermediate slices). Added a regression
test for the overlap case (same orgID matched via sectId on one user +
deptId on another) that the existing suite didn't pin down.
* revert: drop $unionWith aggregation in FindExistingOrgIDs
Restore the two parallel Distinct calls from before 95f7bd7. The
$unionWith version was benchmarked at ~8.5% wall-clock improvement
(885µs → 810µs per call against a local Mongo) for one fewer round-trip,
but at the cost of ~55% more Go-side allocations per call and a deeper
dependency on Mongo's aggregation optimizer. With the gain that small
and the call rate at this gate not on the critical hot path, the
simpler two-Distinct form is the better trade: less code, less BSON
marshaling, version-agnostic from Mongo 4.4 onward, and explicit
debuggability (each Distinct's error names which field failed).
Verified TestMongoStore_FindExistingOrgIDs_Integration passes on both
mongo:8.2.9 (current default) and mongo:4.4.30, including the new
overlap-dedup case added in the previous commit. Comment on the
function records why we landed back here so a future reader doesn't
re-do the round-trip migration without fresh perf numbers.
* perf(pipelines): project $lookup sub-results to {_id: 1}
Per @mliu33's review on pkg/pipelines/member.go (lines 112 and 135).
All four $lookup sub-pipelines in this file end with $match + $limit:1,
then the outer pipeline only reads $size of the resulting array to
check emptiness — it never touches any field of the joined doc. Mongo
was still fetching and shipping the full subscription / room_members
document into _sub / _irm / existingSub on every match.
Append $project {_id: 1} after each $limit so each entry in the join
array drops to ~30 bytes instead of the full doc. Same emptiness check
($eq: []) and same $size semantics work unchanged. Affects:
- GetNewMembersPipeline (existingSub lookup)
- GetCapacityCheckPipeline (_sub lookup)
- GetAddMemberCandidatesPipeline (_sub and _irm lookups)
No behavior change; existing unit + integration tests cover all three
pipelines (TestMongoStore_FindExistingOrgIDs, ListAddMemberCandidates,
CountNewMembers) and remain green.
* refactor(pipelines): drop GetCapacityCheckPipeline — duplicate of GetNewMembersPipeline
Per @mliu33's review on room-service/store_mongo.go:300. The two
functions produced byte-identical pipelines for roomID != "" (modulo
cosmetic variable names: existingSub/userAccount vs _sub/acct).
GetNewMembersPipeline already handles both cases natively — it returns
just the $match stage when roomID is empty, and adds the $lookup +
not-already-subscribed $match when roomID is non-empty.
- pkg/pipelines/member.go: delete GetCapacityCheckPipeline. Refactor
GetNewMembersPipeline to use the shared matchCandidates() helper for
the $match stage (DRY with GetAddMemberCandidatesPipeline).
- pkg/pipelines/member_test.go: delete TestGetCapacityCheckPipeline.
Behavior is fully covered by the existing TestGetNewMembersPipeline*
cases (both branches).
- room-service/store_mongo.go: CountNewMembers no longer needs the
roomID-empty bifurcation — it now calls GetNewMembersPipeline
unconditionally with the supplied roomID.
- docs/superpowers/specs/2026-05-19-...-upgrade-design.md: spec rewritten
to reflect the consolidation. The plan doc (a historical record of
Task 1 as originally executed) is left intact.
* refactor(room-worker): single-pass dept-first tiebreak in processRemoveOrg
Per @mliu33's review on room-worker/handler.go:562. The previous two-pass
logic scanned `members` once for a dept row and then (only if none was
found) scanned it again for the first sect row. OrgMemberStatus already
carries the `IsDept` flag, so we can do this in a single pass: stash the
first sect candidate as we scan, then let any encountered dept row
override it (and short-circuit further iteration). After the loop, fall
back to the stashed sect candidate only if no dept row appeared.
No behavior change — every TestHandler_ProcessRemoveOrg_DeptFirstTiebreak
subcase (all sect, all dept, mixed dept-wins, all-names-empty → orgID
fallback) still passes. Saves one full pass over the members slice on
the org-remove hot path.
* fix(room-worker): preserve sub when removed org has a sibling covering the same user
Multi-org overlap bug introduced by Part 2 of this PR (dept-aware org
matching). Before: orgs matched only on sectId, so a user could be
reachable in a room via at most one org. After Part 2: orgs match
sectId OR deptId, so a user with both fields set can be covered by two
org rows simultaneously (one matching their sectId, one matching their
deptId).
processRemoveOrg's old toRemove rule only checked HasIndividualMembership:
any org-only user was scheduled for sub deletion. With multi-org overlap,
this silently orphans the user's sub even though the sibling org row
still claims them as a member of the room. They lose notifications,
can't see new messages, and effectively get booted — visible only on
the next list-members call where the sibling org still expands to them.
Fix is two-layer:
- room-worker/store.go: add HasOtherOrgMembership bool to OrgMemberStatus.
- room-worker/store_mongo.go: GetOrgMembersWithIndividualStatus gains a
second $lookup against room_members. For each candidate, search for
another org row whose member.id matches their sectId or deptId
(excluding the orgID being removed). Same $project: {_id: 1} trick
as the other lookups to keep the join entry small.
- room-worker/handler.go: processRemoveOrg skips members where
HasOtherOrgMembership=true from toRemove. Org row deletion + count
reconcile + sys-msg still fire (the org WAS removed); no rotate, no
per-account events when toRemove is empty.
Tests:
- Handler unit test (TestHandler_ProcessRemoveOrg_OtherOrgCovers_PreservesSub):
mock returns alice with HasOtherOrgMembership=true; asserts
DeleteSubscriptionsByAccounts and ListByRoom are never called.
- Pipeline integration test
(TestMongoStore_GetOrgMembersWithIndividualStatus_OtherOrgCovers_Integration):
alice has sectId=X + deptId=Y, both X and Y as org rows; queries org
X and asserts alice.HasOtherOrgMembership=true (Y still covers her)
while carol — sectId=X only — has HasOtherOrgMembership=false.
Both tests were red before the corresponding fix and green after; existing
TestHandler_ProcessRemoveOrg* and TestMongoStore_GetOrgMembersWithIndividualStatus*
cases still pass.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Summary
Wires the existing
pkg/roomkeystore(Valkey-backed key storage) andpkg/roomkeysender(NATS key delivery) libraries into the room lifecycle. Every room now gets a P-256 keypair generated at create time, replicated to every participating site at origin's exact version, and pushed to every member's NATS subject so clients can decrypt messages encrypted bybroadcast-worker. Channel member removal rotates the key so the removed user can no longer decrypt messages sent after their removal.Spec:
docs/superpowers/specs/2026-05-08-room-encryption-keys-design.mdPlan:
docs/superpowers/plans/2026-05-08-room-encryption-keys.mdArchitecture
Three-stage handoff per the established codebase pattern:
room-serviceSet). Rotate on channel member-remove (Rotate, withSetfallback for un-keyed rooms). Skip rotation when no subscriptions are actually deleted (dual-membership individual remove; org-remove with all org members also individually subscribed).room-worker(origin)RoomKeyEventto every room member — local and remote — viaroomkeysender.Send. NATS supercluster routeschat.user.{account}.event.room.keyto home sites. Servechat.server.request.roomkey.{siteID}.getRPC for remote inbox-workers.inbox-worker(remote)SetWithVersion. NoRotate, no user-sideSend— origin already published, supercluster handles routing.Key material never enters JetStream. Cross-site replication is request/reply over core NATS.
Security properties
broadcast-workerencrypts under the new key. The about-to-be-removed user holds only the old private key and cannot decrypt anything published after rotation. Test invariant enforced via flag-in-DoAndReturn(rotate must complete before publish callback fires).SetWithVersion. Without this, remote sites would track per-site versions that diverge from origin, andbroadcast-workerenvelopes on the remote would carry a version no client (which received origin'sRoomKeyEvent) holds. Redelivered events no-op when local version is at or beyond origin's.skipKeyRotationto make intent explicit.room-serviceoriginates rotations, so a scalarintversion is sufficient for ordering. Documented at the version gate.RotateLua script does not work across Redis Cluster slots). Documented inpkg/roomkeystore/doc.go.inbox-worker's cross-site client maps the wire-format "room key not found" reply back to anerrRoomKeyAbsentsentinel so downstream callers canerrors.Isagainst the permanent-miss case (as distinguished from a transient RPC error).Observability
Seven new OTel meter instruments in
pkg/roomkeymetrics:room_key_generated_totalroom_key_rotated_totalroom_key_fanout_errors_totalSendfailure (worker + inbox)room_key_rpc_duration_secondsnatsInterSiteKeyClient.GetRoomKeyroom_key_valkey_errors_total{op}Get/Set/Rotate/GetMany/SetWithVersionat call sitesroom_key_absent_totalroom_key_replication_terminated_totalConfiguration
VALKEY_ADDRis a hard startup requirement on every worker that touches keys. The earlier "optional withslog.Warn" mode produced a silent split-brain (rooms created in Mongo with no key flowing to clients) and — once cross-site versions had to match origin's exactly — made cross-site messages undecryptable. Failing loudly at startup is honest about what the system requires.VALKEY_ADDRroom-service,room-worker,inbox-workerVALKEY_PASSWORD""VALKEY_KEY_GRACE_PERIOD24h> 0at startup. TTL on the:prevslot after rotation.ROOM_KEY_RPC_TIMEOUTinbox-worker5s> 0at startup. Per-call timeout on the cross-site key-fetch RPC.ROOM_KEY_MAX_REDELIVERinbox-worker10> 0at startup. JetStream redelivery cap before terminating (Ack + log) to free the consumer when origin is unreachable.Failure modes
pair.Version < req.NewKeyVersion) → transient error → NAK + retry. Valkey propagation race, not a malformed event.errRoomKeyAbsent) → permanent error →AsyncJobResulterror; clients retry the create. Preserved across the inter-site RPC for inbox-worker callers.ROOM_KEY_MAX_REDELIVER.ListByRoomsurvivor lookup failure (room-worker remove path) → NAK + retry. Previously logged-and-acked, which could leave a room on a key no survivor received.buildAndFanOutRoomKeyfailure (room-worker create path) → NAK + retry. Previously logged-and-acked, which could persist a room without delivering the initial key event.Sender.Sendper-account failure → logged, fan-out continues (best-effort to remaining members).RotatereturnsErrNoCurrentKey→ fall back toSet(un-keyed room being remove-keyed for the first time).member_removedwith emptyAccountson a remote site → no Mongo delete, but still fetches the rotated key from origin (otherwise localbroadcast-workerwould keep encrypting under an older version).mongoInboxStore.BulkCreateSubscriptionsusesbulkWritewithUpdateOne + $setOnInsertkeyed on(roomId, u.account), so existing read-state (LastSeenAt,Alert) is preserved and replicateLocalKey is always re-attempted on retry.Test plan
-race(make test)make lint, 0 issues)-tags=integration(go build -tags integration ./...)dockerd --storage-driver=vfs(local sandbox; CI runs overlay2):inbox-worker— including cross-site key replication, idempotent upsert, FilterScopingroom-worker— create-room key fan-out, remove-member survivor fan-outroom-service— Valkey persistence on createpkg/roomkeystore— Lua atomicity + grace period + newSetWithVersionround-trippkg/roomkeysender— TypeScript client tests correctly skip on vfs viaskipOnVFSdocs/client-api.md§5 (Server-Pushed Events)room-worker/deploy/,inbox-worker/deploy/) carry the new env vars;VALKEY_ADDRdocumented as requiredChanges since initial review
Three rounds of CodeRabbit review feedback addressed across the PR's lifetime:
Round 1 (22 findings) — initial review. Key bugs fixed:
replicateLocalKey/fetchAndStoreKeyerrors are now propagated (not log-and-Ack);room-workersurvivor-fanoutListByRoomfailure now NAKs; create-roombuildAndFanOutRoomKeyfailure now NAKs;room-worker/store_mongo.ListByRoomwraps errors with operation context;room-service.handleRemoveMember's overloadedtargetIsDualMembershipflag renamed toskipKeyRotation; new compound index(rid, member.type, member.account)added toroom_membersto backCountOrgOnlySubs. Test bugs: rotate-before-publish ordering enforced;inbox-workercross-site test asserts preserved origin version. Plus docs + nitpicks (spec ownership table, fenced-code language tags, full-slice assertions, etc.).Architectural change —
VALKEY_ADDRpromoted from "optional withslog.Warn" to a hard startup requirement on both workers. Eliminated dead nil-guard code paths ininbox-workerandroom-worker(production deps are guaranteed non-nil). Spec, plan, anddeploy/docker-compose.ymlcomments updated to match.Round 2 (6 findings) — idempotent subscription upserts:
mongoInboxStore.BulkCreateSubscriptionsswitched fromInsertMany-with-dup-filter tobulkWritewithUpdateOne + $setOnInsert, preserving existing read-state on redelivery. Plus assertion polish (deep-copy non-mutation snapshot, full-slice key comparisons,require.NoErroron test-setup marshals) and stale plan paragraphs flagged.Round 3 (3 findings + repo hygiene) —
inbox-worker.handleMemberRemovedno longer short-circuits on emptyAccounts(still fetches the rotated key from origin); shutdown hook order corrected so OTel exporters flush AFTERnc.Drain/ mongo /keyStore.Close;ROOM_KEY_RPC_TIMEOUTvalidated at startup; integration test's embedded NATS server replaced with the testcontainers-go NATS module to match repo convention. Plan doc snippets rewritten to reflect the shipped fail-fast wiring. Three pre-existing repo mock-file inconsistencies (broadcast-worker, message-worker, search-service) regenerated somake generateis idempotent on the merged tree.New
roomkeystore.SetWithVersionAPI — added becauseSetalways writes version=0 andRotateincrements by 1, so neither preserves origin's exact version. The new method writes the current slot at a caller-supplied version, used exclusively byinbox-workerfor cross-site replication. Other consumers (room-service,broadcast-worker,history-service) declare their own narrower interfaces and are unaffected.Known follow-ups (intentionally deferred)
ListByRoomafter a delete scans all subs in the room for fan-out. Fine at current room sizes; pre-computing survivor lists from the delete result is an optimization for very large channels.Generated by Claude Code
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Documentation