Add data integrity tools (checksum manifest builder and comparator), update README, and modernize package #49

Draft
jordanpadams wants to merge 3 commits into main from pdc162

@jordanpadams jordanpadams commented Mar 24, 2026

Member

🗒️ Summary

This PR adds a full data integrity verification workflow for S3 bucket migrations and improves the existing CLI tooling:

New tools:

  • pdc-build-checksum-manifest — generates a CSV manifest of S3 objects with checksums (prefers CRC64NVME, falls back to others); supports --resume-from for incremental runs
  • pdc-compare-manifests — compares two manifests by matching (size, checksum_type, checksum_value) tuples; identifies objects in OLD missing from NEW, handles multipart ETag detection, outputs missing/unverifiable/weak CSVs
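
For reviewers, the tuple matching described above can be sketched roughly as follows. This is a minimal illustration, not the code in compare_s3_manifests.py: the column names (key, size, checksum_type, checksum_value), the "ETAG" type label, the order of the checks, and the helper names are all assumptions. The one fact it relies on is that S3 ETags for multipart uploads have the form of 32 hex digits followed by -N (the part count), so they depend on how the upload was chunked, and a mismatch does not prove the content differs.

```python
import csv
import io
import re

# Multipart-upload ETags look like "<32 hex digits>-<part count>".
MULTIPART_ETAG = re.compile(r"^[0-9a-f]{32}-\d+$")


def load_manifest(csv_text):
    """Parse manifest CSV text into a list of row dicts (assumed column names)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def fingerprint(row):
    """Identity used for matching: the object key is deliberately ignored."""
    return (row["size"], row["checksum_type"], row["checksum_value"])


def compare_manifests(old_rows, new_rows):
    """Return (missing, unverifiable) lists of OLD keys."""
    new_index = {fingerprint(row) for row in new_rows}
    missing, unverifiable = [], []
    for row in old_rows:
        if fingerprint(row) in new_index:
            continue  # same size and checksum exists somewhere in NEW
        is_multipart_etag = (
            row["checksum_type"].upper() == "ETAG"
            and MULTIPART_ETAG.match(row["checksum_value"].lower()) is not None
        )
        if is_multipart_etag:
            unverifiable.append(row["key"])  # part layout may differ; cannot decide
        else:
            missing.append(row["key"])
    return missing, unverifiable


OLD_CSV = """\
key,size,checksum_type,checksum_value
a.dat,10,CRC64NVME,AAAA
b.dat,20,CRC64NVME,BBBB
c.dat,30,ETAG,0123456789abcdef0123456789abcdef-4
"""

NEW_CSV = """\
key,size,checksum_type,checksum_value
moved/a.dat,10,CRC64NVME,AAAA
c.dat,30,ETAG,fedcba9876543210fedcba9876543210-2
"""

missing, unverifiable = compare_manifests(load_manifest(OLD_CSV), load_manifest(NEW_CSV))
print(missing)        # ['b.dat']
print(unverifiable)   # ['c.dat']
```

Because the key is not part of the fingerprint, a.dat still counts as present in NEW even though it moved to a different prefix.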

Improvements:

  • Expanded README with full usage docs for all CLI tools including pdc-s3-download, pdc-build-checksum-manifest, and pdc-compare-manifests
  • Improved compare_s3_manifests CLI help strings (explicit output file paths, default values) and added progress status output throughout main()
  • Added scripts/csv-converter.py utility script
  • Added CLAUDE.md project guidance file
  • Replaced deprecated pkg_resources with importlib.resources in __init__.py
  • Registered new entry points in setup.cfg
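
A pkg_resources to importlib.resources migration usually looks something like the sketch below. The PR description does not say which resource __init__.py loads, so demo_pkg, VERSION.txt, and the helper names are hypothetical; the example builds a throwaway package on disk only so that it can run on its own. importlib.resources.files() is available from Python 3.9 onward.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path


def read_package_text(package: str, resource: str) -> str:
    """Read a text file bundled inside an importable package."""
    # Deprecated equivalent:
    #   pkg_resources.resource_string(package, resource).decode("utf-8")
    return (
        importlib.resources.files(package)
        .joinpath(resource)
        .read_text(encoding="utf-8")
        .strip()
    )


def demo() -> str:
    """Create a temporary package with a VERSION.txt and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
        (pkg_dir / "VERSION.txt").write_text("1.2.3\n", encoding="utf-8")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()  # make the new directory importable
        try:
            return read_package_text("demo_pkg", "VERSION.txt")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg", None)


print(demo())  # 1.2.3
```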

🤖 AI Assistance Disclosure

  • No AI assistance used
  • AI used for light assistance (e.g., suggestions, refactoring, documentation help, minor edits)
  • AI used for moderate content generation (AI generated some code or logic, but the developer authored or heavily revised the majority)
  • AI generated substantial portions of this code

Estimated % of code influenced by AI: 80%

⚙️ Test Data and/or Report

Manual verification:

  • Ran pdc-build-checksum-manifest --help and against a sample S3 bucket
  • Ran pdc-compare-manifests against real migration manifests; confirmed correct identification of missing, unverifiable, and weak-checksum objects
  • Ran pdc-compare-manifests --help to confirm updated help text and progress output at each stage

♻️ Related Issues

🤓 Reviewer Checklist

Reviewers: Please verify the following before approving this pull request.

Documentation and PR Content

  • Documentation: README, Wiki, or inline documentation (Sphinx, Javadoc, Docstrings) have been updated to reflect these changes.
  • Issue Traceability: The PR is linked to a valid GitHub Issue
  • PR Title: The PR title is "user-friendly": it clearly identifies what is being fixed or which feature is being added, so that if you saw it in a tool's Release Notes you would get the gist of what was done.

Security & Quality

  • SonarCloud: Confirmed no new High or Critical security findings.
  • Secrets Detection: Verified that the Secrets Detection scan passed and no sensitive information (keys, tokens, PII) is exposed.
  • Code Quality: Code follows organization style guidelines and best practices for the specific language (e.g., PEP 8, Google Java Style).

Testing & Validation

  • Test Accuracy: Verified that test data is accurate, representative of real-world PDS4 scenarios, and sufficient for the logic being tested.
  • Coverage: Automated tests cover new logic and edge cases.
  • Local Verification: (If applicable) Successfully built and ran the changes in a local or staging environment.

Maintenance

  • Backward Compatibility: Confirmed that these changes do not break existing downstream dependencies or API contracts (or that breaking changes are clearly documented).

- Clarify --missing-output, --unverifiable-output, --weak-output help
  strings to indicate they are output file paths (with defaults shown)
- Add status prints throughout main() for loading, indexing, comparing
  (with 10% interval progress), and writing each output file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
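
One way to get status output at 10% intervals is sketched below. It is an illustration under assumptions, not the code added to main(): the function name, message format, and the report callback are hypothetical.

```python
def iter_with_progress(items, label, report=print):
    """Yield each item, reporting progress each time another 10% is crossed."""
    total = len(items)
    next_mark = 10  # next percentage threshold that triggers a report
    for done, item in enumerate(items, start=1):
        yield item
        percent = done * 100 // total
        if percent >= next_mark:
            report(f"{label}: {percent}% ({done}/{total})")
            # Skip ahead so short inputs do not report the same band twice.
            next_mark = (percent // 10 + 1) * 10


messages = []
for _ in iter_with_progress(range(20), "comparing", report=messages.append):
    pass  # the comparison work would happen here

print(messages[0])    # comparing: 10% (2/20)
print(messages[-1])   # comparing: 100% (20/20)
```

Passing the reporter as a callback keeps the loop testable without capturing stdout.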
@jordanpadams jordanpadams added the enhancement New feature or request label Mar 24, 2026
@jordanpadams jordanpadams self-assigned this Mar 24, 2026
- Add pdc-build-checksum-manifest and pdc-compare-manifests CLI entry points
- Add build_s3_checksum_manifest.py for generating S3 checksum manifests
- Add compare_s3_manifests.py improvements (multipart ETag handling, better docs)
- Add scripts/csv-converter.py utility
- Add CLAUDE.md project guidance file
- Expand README with full usage docs for all tools
- Replace deprecated pkg_resources with importlib.resources in __init__.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jordanpadams jordanpadams changed the title from "Improve compare_s3_manifests CLI help text and add progress status output" to "Add data integrity tools (checksum manifest builder and comparator), update README, and modernize package" Apr 1, 2026
- Add missing docstrings across data_integrity modules and s3_download
- Fix mypy: explicit boto3.Session kwargs instead of **dict unpack
- Fix mypy: add unreachable raise after retry loops for missing return
- Fix D212/D205 docstring formatting in build_s3_checksum_manifest
- Fix D301: use r-string for module docstring with backslashes in s3_download
- Fix B950: wrap long help string in s3_download
- Drop Python 3.9 support; require >= 3.12; update tox envlist and classifiers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
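
The "unreachable raise after retry loops" fix addresses a common mypy complaint: when every path inside a for loop returns or raises, mypy still cannot prove the loop body runs at least once, so it reports a missing return statement. A minimal sketch of the pattern follows; call_with_retries, its parameters, and the choice of OSError as the retried exception are hypothetical and not taken from this PR.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Call operation, retrying on OSError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay_seconds)
    # Not reached when attempts >= 1; present so every code path either
    # returns T or raises, which satisfies the type checker.
    raise AssertionError("retry loop exited without returning or raising")


calls = {"n": 0}


def flaky() -> str:
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient")
    return "ok"


print(call_with_retries(flaky))  # ok
```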
Labels

enhancement New feature or request
