Add data integrity tools (checksum manifest builder and comparator), update README, and modernize package #49
Draft
jordanpadams wants to merge 3 commits into
- Clarify --missing-output, --unverifiable-output, --weak-output help strings to indicate they are output file paths (with defaults shown)
- Add status prints throughout main() for loading, indexing, comparing (with 10% interval progress), and writing each output file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add pdc-build-checksum-manifest and pdc-compare-manifests CLI entry points
- Add build_s3_checksum_manifest.py for generating S3 checksum manifests
- Add compare_s3_manifests.py improvements (multipart ETag handling, better docs)
- Add scripts/csv-converter.py utility
- Add CLAUDE.md project guidance file
- Expand README with full usage docs for all tools
- Replace deprecated pkg_resources with importlib.resources in __init__.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add missing docstrings across data_integrity modules and s3_download
- Fix mypy: explicit boto3.Session kwargs instead of **dict unpack
- Fix mypy: add unreachable raise after retry loops for missing return
- Fix D212/D205 docstring formatting in build_s3_checksum_manifest
- Fix D301: use r-string for module docstring with backslashes in s3_download
- Fix B950: wrap long help string in s3_download
- Drop Python 3.9 support; require >= 3.12; update tox envlist and classifiers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
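For readers unfamiliar with the "unreachable raise after retry loops" fix named in the last commit, here is a minimal sketch of the pattern. It is not the code from this PR: `call_with_retries` and its parameters are hypothetical names chosen for illustration. When a function's only `return` sits inside a `for` loop, mypy cannot prove the loop always returns or raises, so it reports a missing return statement; a trailing `raise` satisfies the checker without changing runtime behavior.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call func, retrying on any exception up to `attempts` times (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                # Out of retries: surface the last error to the caller.
                raise
            time.sleep(delay)
    # Never reached at runtime, because the loop body always returns or raises
    # on its final iteration. It exists so the type checker sees that no code
    # path falls off the end of a function declared to return T.
    raise RuntimeError("retry loop exited without returning or raising")
```

The explicit `attempts < 1` guard is what makes the final `raise` truly unreachable; without it, an empty loop would fall through to that line.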
This was referenced Apr 28, 2026
🗒️ Summary
This PR adds a full data integrity verification workflow for S3 bucket migrations and improves the existing CLI tooling:
New tools:
- `pdc-build-checksum-manifest`: generates a CSV manifest of S3 objects with checksums (prefers CRC64NVME, falls back to others); supports `--resume-from` for incremental runs
- `pdc-compare-manifests`: compares two manifests by matching `(size, checksum_type, checksum_value)` tuples; identifies objects in OLD that are missing from NEW, handles multipart ETag detection, and outputs missing/unverifiable/weak CSVs
Improvements:
- Expanded README usage docs for `pdc-s3-download`, `pdc-build-checksum-manifest`, and `pdc-compare-manifests`
- Clarified `compare_s3_manifests` CLI help strings (explicit output file paths, default values) and added progress status output throughout `main()`
- Added `scripts/csv-converter.py` utility script
- Added `CLAUDE.md` project guidance file
- Replaced deprecated `pkg_resources` with `importlib.resources` in `__init__.py`
- Modernized package configuration in `setup.cfg`
🤖 AI Assistance Disclosure
Estimated % of code influenced by AI: 80%
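As a rough illustration of the tuple matching that the summary describes for `pdc-compare-manifests`, the sketch below classifies OLD rows against NEW. It is not the PR's implementation. The `Entry` field names, the `"ETAG"` type label, and the rule that an unmatched multipart ETag counts as unverifiable rather than missing are all assumptions, and the weak-checksum category is not modeled. The one fact relied on is that S3 reports multipart-upload ETags as 32 hex digits followed by `-<part count>`.

```python
import re
from typing import Iterable, NamedTuple

# S3 multipart-upload ETags look like "<32 hex digits>-<part count>".
MULTIPART_ETAG = re.compile(r"^[0-9a-f]{32}-\d+$")


class Entry(NamedTuple):
    """One manifest row; these field names are hypothetical."""
    key: str
    size: int
    checksum_type: str
    checksum_value: str


def is_multipart_etag(entry: Entry) -> bool:
    """Return True when the row carries a multipart-style ETag."""
    value = entry.checksum_value.strip('"').lower()
    return entry.checksum_type.upper() == "ETAG" and bool(MULTIPART_ETAG.match(value))


def compare(old: Iterable[Entry], new: Iterable[Entry]) -> tuple[list[Entry], list[Entry]]:
    """Split unmatched OLD rows into (missing, unverifiable) relative to NEW."""
    # Index NEW by content identity only, so renamed or moved keys still match.
    seen = {(e.size, e.checksum_type, e.checksum_value) for e in new}
    missing: list[Entry] = []
    unverifiable: list[Entry] = []
    for e in old:
        if (e.size, e.checksum_type, e.checksum_value) in seen:
            continue  # same size and checksum found in NEW
        if is_multipart_etag(e):
            # Assumption: a multipart ETag depends on part boundaries, so a
            # mismatch does not prove the content is absent from NEW.
            unverifiable.append(e)
        else:
            missing.append(e)
    return missing, unverifiable
```

Matching on content identity instead of object key means an object copied to a different prefix in NEW is still counted as present.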
⚙️ Test Data and/or Report
Manual verification:
- Ran `pdc-build-checksum-manifest --help` and ran the tool against a sample S3 bucket
- Ran `pdc-compare-manifests` against real migration manifests; confirmed correct identification of missing, unverifiable, and weak-checksum objects
- Ran `pdc-compare-manifests --help` to confirm updated help text and progress output at each stage
♻️ Related Issues
🤓 Reviewer Checklist
Reviewers: Please verify the following before approving this pull request.
Documentation and PR Content
Security & Quality
Testing & Validation
Maintenance