Add Crr Cascade capabilities to backbeat crr replication #2747

Open
SylvainSenechal wants to merge 3 commits into development/9.5 from improvement/BB-767

Conversation

@SylvainSenechal
Contributor

@SylvainSenechal SylvainSenechal commented Jun 3, 2026

Issue: BB-767

Related PRs:
Arsenal: scality/Arsenal#2628
Cloudserver: scality/cloudserver#6179
CloudserverClient: scality/cloudserverclient#24
S3utils: scality/s3utils#395

@bert-e
Contributor

bert-e commented Jun 3, 2026

Hello sylvainsenechal,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Jun 3, 2026

Incorrect fix version

The Fix Version/s in issue BB-767 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.5.0

Please check the Fix Version/s of BB-767, or the target
branch of this pull request.

Comment thread package.json Outdated
"@aws-sdk/client-sts": "^3.921.0",
"@aws-sdk/credential-providers": "^3.921.0",
"@scality/cloudserverclient": "^1.0.8",
"@scality/cloudserverclient": "file:../cloudserverclient/scality-cloudserverclient-v1.0.9.tgz",

@scality/cloudserverclient points to a local file path (file:../cloudserverclient/scality-cloudserverclient-v1.0.9.tgz). This will break CI and other developers' builds. Must be changed to a proper registry version or git-pinned tag before merge.

Suggested change
"@scality/cloudserverclient": "file:../cloudserverclient/scality-cloudserverclient-v1.0.9.tgz",
"@scality/cloudserverclient": "^1.0.9",

— Claude Code
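
A check like this could be scripted so that a local `file:` reference never reaches the target branch. The following is a minimal sketch, not part of this PR: `findLocalFileDeps` is a hypothetical helper name, and the sample manifest only mirrors the two entries quoted in this thread.

```javascript
// Hypothetical helper (illustrative name, not part of backbeat):
// list every dependency whose spec points at a local path or tarball.
function findLocalFileDeps(pkg) {
    const sections = ['dependencies', 'devDependencies', 'optionalDependencies'];
    const offenders = [];
    for (const section of sections) {
        for (const [name, spec] of Object.entries(pkg[section] || {})) {
            // npm resolves "file:" specs from the local filesystem.
            if (typeof spec === 'string' && spec.startsWith('file:')) {
                offenders.push({ section, name, spec });
            }
        }
    }
    return offenders;
}

// Sample manifest built from the lines quoted in the review comment.
const examplePkg = {
    dependencies: {
        '@aws-sdk/client-sts': '^3.921.0',
        '@scality/cloudserverclient':
            'file:../cloudserverclient/scality-cloudserverclient-v1.0.9.tgz',
    },
};

const localDeps = findLocalFileDeps(examplePkg);
```

Run against the real `package.json` in CI, a non-empty result could fail the build before the merge.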

const ObjectMDLocation = require('arsenal').models.ObjectMDLocation;
const { errors, jsutil, models, versioning } = require('arsenal');
const ObjectMDLocation = models.ObjectMDLocation;
const { decode, checkCrrCascadeEvent } = versioning.VersionID;

Arsenal is pinned to 8.3.9, but checkCrrCascadeEvent and decode are imported from versioning.VersionID. Neither checkCrrCascadeEvent nor the getMicroVersionId() method (called on source/dest entries throughout this PR) appear to exist in arsenal 8.3.9 or in backbeat's own models. The arsenal dependency likely needs a version bump for this PR to work.

— Claude Code

if (err.ObjNotFound || err.name === 'ObjNotFound') {
return cbOnce(err);
}
if (err.$metadata?.httpStatusCode === 409) {

Any 409 from the destination is assumed to be a cascade-stale scenario and the replication is silently marked COMPLETED. If cloudserver ever returns 409 for a different reason, the object would never be replicated. Consider checking for a more specific signal (e.g. a response body field or custom error code) rather than relying solely on the HTTP status code.

— Claude Code
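
One way to narrow the check is sketched below. It assumes the destination attaches a dedicated error name to the cascade-stale conflict. `MicroVersionIdExists` is used for illustration only, and `isCascadeStaleConflict` is a hypothetical helper; the actual signal depends on what the cloudserver PR returns.

```javascript
// Assumption: the cascade-stale conflict carries a dedicated error name.
// The value below is illustrative and must match what cloudserver sends.
const CASCADE_CONFLICT_NAME = 'MicroVersionIdExists';

// Hypothetical helper: true only for a 409 that is explicitly tagged
// as a cascade conflict, false for any other error (including other 409s).
function isCascadeStaleConflict(err) {
    if (!err) {
        return false;
    }
    const status = err.$metadata && err.$metadata.httpStatusCode;
    return status === 409 && err.name === CASCADE_CONFLICT_NAME;
}
```

With a predicate of this shape, an unrelated 409 would fall through to the regular error path instead of being marked COMPLETED.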

@claude

claude Bot commented Jun 3, 2026

  • package.json:54: @scality/cloudserverclient uses a local file path (file:../cloudserverclient/...). Must be changed to a proper registry or git-pinned reference before merge.
  • ReplicateObject.js:6: checkCrrCascadeEvent and getMicroVersionId() do not appear to exist in arsenal 8.3.9. Arsenal version bump likely needed.
  • ReplicateObject.js:743: Any 409 from destination putMetadata is assumed to be cascade-stale and marked COMPLETED. Consider using a more specific signal to avoid silently skipping replication if 409 is returned for other reasons.

Review by Claude Code

@SylvainSenechal SylvainSenechal marked this pull request as ready for review June 3, 2026 16:15
Contributor Author

This PR can already be checked, but it should really be reviewed after all the other cascade PRs, as changes in those PRs would also mean changes here.

Contributor Author

I think we can add functional tests instead of just these, but I am waiting for the Arsenal/cloudserver PRs to be merged, as that will make these tests easier to write (functional tests in backbeat rely on a cloudserver image).

Comment thread package.json Outdated
"@aws-sdk/client-sts": "^3.921.0",
"@aws-sdk/credential-providers": "^3.921.0",
"@scality/cloudserverclient": "^1.0.8",
"@scality/cloudserverclient": "file:../cloudserverclient/scality-cloudserverclient-v1.0.9.tgz",

@scality/cloudserverclient is pinned to a local file path (file:../cloudserverclient/scality-cloudserverclient-v1.0.9.tgz). This will break CI and other developers' builds. Before merging, this needs to be changed to a proper npm version (e.g., ^1.0.9) or a git tag reference consistent with the other git-based deps in this project.

Suggested change
"@scality/cloudserverclient": "file:../cloudserverclient/scality-cloudserverclient-v1.0.9.tgz",
"@scality/cloudserverclient": "^1.0.9",

— Claude Code

@codecov

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 82.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.58%. Comparing base (c52fbcc) to head (3237f9e).
⚠️ Report is 2 commits behind head on development/9.5.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| extensions/replication/tasks/ReplicateObject.js | 82.00% | 9 Missing ⚠️ |
Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| extensions/replication/tasks/ReplicateObject.js | 89.87% <82.00%> (-1.34%) ⬇️ |

... and 7 files with indirect coverage changes

| Components | Coverage Δ |
|---|---|
| Bucket Notification | 80.22% <ø> (ø) |
| Core Library | 80.71% <ø> (-0.27%) ⬇️ |
| Ingestion | 70.63% <ø> (-0.61%) ⬇️ |
| Lifecycle | 79.09% <ø> (+0.03%) ⬆️ |
| Oplog Populator | 85.83% <ø> (ø) |
| Replication | 60.02% <82.00%> (+0.24%) ⬆️ |
| Bucket Scanner | 85.76% <ø> (ø) |

@@                 Coverage Diff                 @@
##           development/9.5    #2747      +/-   ##
===================================================
- Coverage            74.73%   74.58%   -0.16%
===================================================
  Files                  199      199
  Lines                13650    13691      +41
===================================================
+ Hits                 10201    10211      +10
- Misses                3439     3470      +31
  Partials                10       10

| Flag | Coverage Δ |
|---|---|
| api:retry | 9.10% <0.00%> (-0.03%) ⬇️ |
| api:routes | 8.92% <0.00%> (-0.03%) ⬇️ |
| bucket-scanner | 85.76% <ø> (ø) |
| ft_test:queuepopulator | 9.05% <0.00%> (-1.07%) ⬇️ |
| ingestion | 12.47% <0.00%> (-0.10%) ⬇️ |
| lib | 7.71% <0.00%> (-0.08%) ⬇️ |
| lifecycle | 18.84% <0.00%> (-0.16%) ⬇️ |
| notification | 1.02% <0.00%> (-0.01%) ⬇️ |
| oplogPopulator | 0.14% <0.00%> (-0.01%) ⬇️ |
| unit | 51.63% <82.00%> (+0.40%) ⬆️ |

Flags with carried forward coverage won't be shown.

entry: destEntry.getLogInfo(),
});
return doneOnce(null, partAlreadyAtDest);
default: {

When cascadeLoopDetected or cascadeDataComplete is returned as an error from a part, this error flows through the retry wrapper in _getAndPutPart. The BackbeatTask.retry() method checks err.retryable === undefined and then inspects err.code, err.name, and err.message to detect network errors. Since these cascade signals are plain objects ({}), none of those properties exist, so no mutation occurs today. But if a future change accidentally adds such a property to these sentinels, retry() would mutate the shared singleton (setting retryable = true) and corrupt all subsequent cascade checks. Consider using Object.freeze() on the sentinel objects at line 33-35 to prevent accidental mutation.

— Claude Code
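
The effect of freezing can be shown in isolation. This is a sketch under stated assumptions: the three sentinel names come from the review thread, while `markRetryable` is a hypothetical stand-in for any helper that tags an error object and is not the actual `BackbeatTask.retry()` code.

```javascript
// Frozen sentinels: identity comparison still works, mutation does not.
const cascadeLoopDetected = Object.freeze({});
const cascadeDataComplete = Object.freeze({});
const partAlreadyAtDest = Object.freeze({});

// Hypothetical stand-in for a retry helper that tags the error it receives.
// Returns true if the tag was written, false if the object rejected it.
function markRetryable(err) {
    'use strict';
    try {
        // In strict mode, writing to a frozen object throws a TypeError.
        err.retryable = true;
    } catch (e) {
        return false;
    }
    return err.retryable === true;
}

const frozenWasMutated = markRetryable(cascadeLoopDetected);
const plainWasMutated = markRetryable({});
```

Because each sentinel is a distinct frozen object, `err === cascadeLoopDetected` checks keep working while an accidental write is either rejected or ignored.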

@claude

claude Bot commented Jun 3, 2026

  • package.json:54: @scality/cloudserverclient uses a local file: path that will break CI and other developers. Must be changed to a proper npm or git-tag reference before merge.
  • extensions/replication/tasks/ReplicateObject.js:33-35: Cascade sentinel objects (cascadeLoopDetected, cascadeDataComplete, partAlreadyAtDest) are plain {} singletons that pass through BackbeatTask.retry(), which mutates errors. Consider Object.freeze() to prevent accidental mutation.

The cascade logic itself (putData/putMetadata detection, allPartsAlreadyAtDest propagation through the waterfall, _handleReplicationOutcome treating cascade signals as COMPLETED) is sound. The retry wrapper correctly passes cascade signals through without retrying (no retryable property). Tests cover the key scenarios well.

Review by Claude Code

partResults.length > 0 &&
partResults.every(result => result === partAlreadyAtDest);
const destLocations = allPartsAlreadyAtDest ? [] :
(partResults || []).filter(result => result && result !== partAlreadyAtDest);

Bug: when err is truthy (line 434 returns early), destLocations is used in the _deleteOrphans call, but destLocations was computed on line 432-433 with the !err guard already making allPartsAlreadyAtDest false. The filter on line 433 still runs (partResults || []).filter(...) which accesses partResults — but after an error, partResults from mapLimitWaitPendingIfError may contain partial results mixed with partAlreadyAtDest sentinel values. These sentinel objects (empty {}) will pass the result && result !== partAlreadyAtDest filter and be passed to _deleteOrphans, which tries to read .key and .dataStoreName from them. Since they're empty objects, they'll produce undefined keys — likely harmless because _deleteOrphans filters out falsy .key, but worth a defensive check or comment explaining why this is safe.

— Claude Code
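
The defensive check could look like the sketch below. It is an illustration, not the PR's code: `locationsToCleanUp` is a hypothetical helper, and the location shape (`key`, `dataStoreName`) is taken from the comment above. The identity comparison already drops the `partAlreadyAtDest` sentinel itself; the extra `key` test additionally drops any other key-less object that might appear in partial results.

```javascript
// Sentinel as described in the review thread.
const partAlreadyAtDest = Object.freeze({});

// Hypothetical helper: keep only results that describe data actually
// written to the destination, so orphan cleanup never sees a key-less entry.
function locationsToCleanUp(partResults) {
    return (partResults || []).filter(result =>
        result &&
        result !== partAlreadyAtDest &&
        typeof result.key === 'string' &&
        result.key.length > 0);
}

// Partial results after a failure: a sentinel, a real location,
// a part that never completed, and a stray empty object.
const cleaned = locationsToCleanUp([
    partAlreadyAtDest,
    { key: 'part-1', dataStoreName: 'dest-location' },
    undefined,
    {},
]);
```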

Comment thread package.json
"@smithy/node-http-handler": "^3.3.3",
"JSONStream": "^1.3.5",
"arsenal": "git+https://github.com/scality/arsenal#8.3.9",
"arsenal": "git+https://github.com/scality/Arsenal#2c429ab35a5ac82c3dafa5a0296a49a23a9c8a4a",

Arsenal dependency is pinned to a raw commit hash (2c429ab) instead of a release tag. All other git-based deps in this file use tags (e.g. breakbeat#v1.0.3, bucketclient#8.2.4). Pin to a tagged release once the arsenal PR is merged to keep the dependency auditable and consistent with the project convention.

— Claude Code
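
A convention like this can be checked mechanically. The sketch below classifies a git dependency spec by the ref after `#`; `gitRefKind` is a hypothetical helper, and it assumes a full 40-character hex ref means a commit hash, which cannot distinguish a tag from a branch name.

```javascript
// Hypothetical helper: classify the ref of a "git+..." dependency spec.
// Assumption: a 40-character hex ref is a full commit hash.
function gitRefKind(spec) {
    const hashIndex = spec.indexOf('#');
    if (!spec.startsWith('git+') || hashIndex === -1) {
        return 'not-git-ref';
    }
    const ref = spec.slice(hashIndex + 1);
    return /^[0-9a-f]{40}$/i.test(ref) ? 'commit' : 'tag-or-branch';
}

// The two arsenal specs quoted in this thread.
const before = gitRefKind('git+https://github.com/scality/arsenal#8.3.9');
const after = gitRefKind(
    'git+https://github.com/scality/Arsenal#2c429ab35a5ac82c3dafa5a0296a49a23a9c8a4a');
```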

Comment thread package.json
"@aws-sdk/client-sts": "^3.921.0",
"@aws-sdk/credential-providers": "^3.921.0",
"@scality/cloudserverclient": "^1.0.8",
"@scality/cloudserverclient": "file:./scality-cloudserverclient-v1.0.9.tgz",

Vendoring a .tgz into the repo adds a binary blob to git history that can never be garbage-collected. Prefer publishing @scality/cloudserverclient@1.0.9 to the registry (or using a git tag reference like the other deps) and pin to it here. If vendoring is intentional for CI reasons, it should be called out in the PR description.

— Claude Code

entry: destEntry.getLogInfo(),
});
return doneOnce(null, partAlreadyAtDest);
default: {

Trailing whitespace on this line (after the opening brace {). Minor, but some CI lint checks catch this.

— Claude Code

// update location, replication status and put metadata in
// target bucket
(location, next) => {
(location, allPartsAlreadyAtDest, next) => {

In _processQueueEntryRetryFull, the second waterfall stage now passes allPartsAlreadyAtDest as the mdOnly argument to _putMetadata. This means a full retry where all parts happened to already exist at the destination will send ReplicationContent: 'METADATA' in the putMetadata call. Verify this is the intended behavior — it changes the semantics from "always full replication on retry" to "maybe metadata-only on retry if cascade detected all parts present." If a retry was triggered because metadata was stale but data was fine, this seems correct; if the retry was triggered for a data integrity reason, skipping the data write could mask the issue.

— Claude Code

{
method: 'ReplicateObject._getAndPutPartOnce',
entry: destEntry.getLogInfo(),
});

cascadeLoopDetected is passed as the err argument via doneOnce(cascadeLoopDetected). This sentinel is an empty object {}, so upstream retry logic in BackbeatTask._retry will see it as a truthy error and may retry the operation before _handleReplicationOutcome gets to check for it. Confirm that _getAndPutPart (the retry wrapper around _getAndPutPartOnce) won't retry on this sentinel — if it does, the loop detection is bypassed and the part is retried unnecessarily.

— Claude Code
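
The property to confirm can be expressed as a small test against a retry wrapper. The sketch below is not `BackbeatTask` code: `retryUnlessSignal` is a hypothetical wrapper with synchronous callbacks, written only to show the expected behavior, namely that a sentinel passed as `err` ends the loop after one attempt while an ordinary error is retried.

```javascript
const cascadeLoopDetected = Object.freeze({});
const cascadeDataComplete = Object.freeze({});
const NON_RETRYABLE_SIGNALS = new Set([cascadeLoopDetected, cascadeDataComplete]);

// Hypothetical retry wrapper: retries ordinary errors up to maxAttempts,
// but returns immediately when the "error" is a cascade signal.
function retryUnlessSignal(attemptFn, maxAttempts, done) {
    let attempts = 0;
    const run = () => {
        attempts += 1;
        attemptFn((err, result) => {
            if (!err) {
                return done(null, result, attempts);
            }
            if (NON_RETRYABLE_SIGNALS.has(err) || attempts >= maxAttempts) {
                return done(err, undefined, attempts);
            }
            return run();
        });
    };
    run();
}

// A part that reports a cascade loop: expected to stop after one attempt.
let signalOutcome;
retryUnlessSignal(cb => cb(cascadeLoopDetected), 3, (err, result, attempts) => {
    signalOutcome = { err, attempts };
});

// A part that fails twice with a regular error, then succeeds.
let calls = 0;
let retryOutcome;
retryUnlessSignal(cb => {
    calls += 1;
    return calls < 3 ? cb(new Error('transient')) : cb(null, 'ok');
}, 5, (err, result, attempts) => {
    retryOutcome = { err, result, attempts };
});
```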

@claude

claude Bot commented Jun 5, 2026

  • ReplicateObject.js:602: Cascade sentinel objects (cascadeLoopDetected, cascadeDataComplete) are passed as err to callbacks. Verify that BackbeatTask._retry won't retry on these truthy-but-non-error sentinels, which would defeat the loop/stale detection.
  • ReplicateObject.js:433: On error path in _getAndPutData, destLocations may contain partial results. The sentinel objects ({}) pass the filter and could reach _deleteOrphans with undefined keys. Likely harmless due to downstream filtering, but worth a defensive check.
  • ReplicateObject.js:1014: _processQueueEntryRetryFull now passes allPartsAlreadyAtDest as mdOnly to _putMetadata, changing retry-full semantics from always-full to conditionally-metadata-only. Confirm this is intentional.
  • package.json:57: Arsenal pinned to a raw commit hash instead of a tag. Should be updated to a tag once the Arsenal PR merges.
  • package.json:54: scality-cloudserverclient-v1.0.9.tgz vendored as a binary blob in git. Prefer a registry or git-tag reference.
  • ReplicateObject.js:573: Trailing whitespace.

Review by Claude Code

Comment thread package.json
"@smithy/node-http-handler": "^3.3.3",
"JSONStream": "^1.3.5",
"arsenal": "git+https://github.com/scality/arsenal#8.3.9",
"arsenal": "git+https://github.com/scality/Arsenal#2c429ab35a5ac82c3dafa5a0296a49a23a9c8a4a",

Arsenal is pinned to a raw commit hash (2c429ab...) instead of a semantic version tag. Per project conventions, git-based deps (arsenal, vaultclient, etc.) should pin to tags (e.g. #8.x.y). Commit hashes are opaque — it's unclear which features/fixes are included, and there's no semver contract. This also makes it harder for reviewers and operators to reason about what changed.

— Claude Code

Comment thread package.json
"@aws-sdk/client-sts": "^3.921.0",
"@aws-sdk/credential-providers": "^3.921.0",
"@scality/cloudserverclient": "^1.0.8",
"@scality/cloudserverclient": "file:./scality-cloudserverclient-v1.0.9.tgz",

A binary .tgz is committed to the repo and referenced via file: in package.json. This permanently inflates the git history (binary blobs can't be diffed or garbage-collected), prevents code review of the dependency contents, and bypasses the normal package registry workflow. Consider publishing @scality/cloudserverclient@1.0.9 to the npm registry (or a private registry) and referencing it as a versioned dependency instead.

— Claude Code

@claude

claude Bot commented Jun 5, 2026

  • package.json:57 — Arsenal pinned to a raw commit hash instead of a semantic version tag. Git-based deps should use tags per project conventions.
  • package.json:54 — @scality/cloudserverclient vendored as a binary .tgz committed to the repo. Should be published to a registry.

The cascade replication logic in ReplicateObject.js (the ExistingMicroVersionId switch, MicroVersionIdExists / 409 handling in putMetadata, and the _handleReplicationOutcome cascade signal handling) is well-structured: sentinel objects flow correctly through the retry wrapper without being retried, the orphan cleanup path correctly filters out partAlreadyAtDest sentinels, and the test coverage is thorough across all cascade branches.

Review by Claude Code

2 participants