Skip to content

docs(review-pr): Pass C v1.1 — fenced-body heuristic + doc refinements (4.5.0)#44

Merged
lklimek merged 22 commits into
mainfrom
feat/passc-v1.1
Jun 3, 2026
Merged

docs(review-pr): Pass C v1.1 — fenced-body heuristic + doc refinements (4.5.0)#44
lklimek merged 22 commits into
mainfrom
feat/passc-v1.1

Conversation

@lklimek
Copy link
Copy Markdown
Owner

@lklimek lklimek commented May 29, 2026

Why this PR exists

Problem: review-pr Pass C (promise verification) extracts ## Summary / ## Out of scope with column-0-anchored regexes and leaves several judgment calls undocumented — so it silently misses sections in some PR bodies and different reviewers resolve ambiguities inconsistently.

What breaks without it: Author wraps the whole PR description in a ``` code fence (common when pasting from an editor that auto-wraps) → every section header sits inside the fence → Pass C's regex matches nothing → the promise check silently no-ops, reporting a clean pass it never actually performed. Plus: compound titles, the "undocumented change" threshold, and multi-heading precedence were all left to per-run judgment.

Blocking relationship: Originally the top of the stack (feat/passc-v1.1feat/report-pipeline-severityfeat/reviewer-rigor), depending on the pr_audit report_type and finding_section.verdict enum from PR-A (#42). #41, #42, and #43 are now all merged, so this PR has been retargeted to main (which carries those schema additions) and main is merged in — it stands alone and is MERGEABLE. The body itself follows the "Why this PR exists" template (#43, dogfood).

What was done

All in skills/review-pr/SKILL.md (Pass C), doc-only — no schema changes:

  • Fenced-body heuristic — detect a wholly-fenced body, strip + dedent before the section regexes; if still nothing parses, emit one low-confidence INFO "PR body unparseable" finding instead of silently skipping.
  • F-1 clean-pass shape — emit findings: [] plus one INFO "PR self-description verified" so a clean pass leaves visible evidence it ran.
  • F-2 compound titles — split on commas/em-dashes, verify each topic independently, flag off-target only on majority miss.
  • F-3 "undocumented change" — keep the ≥50-LOC trigger; define "mentioned" = keyword overlap with ≥1 Summary bullet or field-ownership row.
  • F-4 — cross-reference report-format §code_snippets for allowed language values instead of hard-coding diff.
  • F-5 — Summary-heading precedence: ## Summary > ### Summary > ## What changed, first match wins.
  • F-7 wiring — optionally set the section verdict (PASS/FAIL/NEEDS_REVIEW) and metadata.report_type: pr_audit (both from PR-A, now in main).

Resolves TODOs 8c17d019 + the 4463333d doc gaps (F-1…F-5, F-7 wiring).

Testing

  • pytest tests/401 passed (whole suite, post-merge of main)
  • black scripts/ tests/ + ruff check scripts/ tests/ clean
  • claude plugin validate → exit 0
  • New tests/test_review_pr_passc.py (grep-asserts all six SKILL additions + a parser-mirror proving dedent exposes ## Summary); new tests/fixtures/pr-promises/synthetic-fenced.md.

Breaking changes

None. Documentation change only.

🤖 Co-authored by Claudius the Magnificent AI Agent

claude added 10 commits May 28, 2026 12:08
Stream C of reviewer-rigor PR. Bundles three housekeeping items:
- skills/release/SKILL.md: quote frontmatter description to fix PyYAML parse error (claude plugin validate now clean)
- coding-best-practices added to skills frontmatter on 6 agents missing it (architect, claudius, qa, security, writer, ux)
- orchestrator skills (grumpy-review, check-pr-comments) note BP preload requirement for spawned agents
- tests/test_skill_frontmatter.py: regression guard parses every frontmatter via PyYAML
- tests/test_bp_load_audit.py: asserts curated agent set loads BP
- .claude-plugin/plugin.json: 4.1.7 -> 4.2.0
- CHANGELOG.md: [4.2.0] entry covering the full PR (Streams A + B + C)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…all-tree inspection

Pipeline plumbing for the upcoming reviewer call-tree walk (Stream A of reviewer-rigor PR).
- schemas/review-report.schema.json: CALL added to id pattern; call_tree added to category enum and matrix
- scripts/consolidate_reports.py: CATEGORY_PREFIX maps call_tree -> CALL-
- scripts/generate_review_report.py: CATEGORY_LABELS includes call_tree
- skills/report-format/SKILL.md: ID prefix table + category list extended
- tests: schema happy/reject paths, consolidator id assignment + matrix population, renderer column/chip coverage
- fixture: tests/fixtures/reports/v3-call-tree.json

No skill-body changes (Stream B), no agent BP edits (Stream C), no version bump (Stream C does final bump after all streams land).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… lint

Stream B of reviewer-rigor PR. Two reviewer-rigor threads:

Call-tree inspection:
- skills/grumpy-review/references/call-tree-walk.md: NEW authoritative methodology — outcome contract, when-to-run trigger, rank-and-walk-top-10 algorithm, tool-probe order, walk caps (depth 5, callers 200, 60s/function), per-caller judgement rubric, finding emission shape with mandatory `Walked via: <tool>` line.
- skills/grumpy-review/SKILL.md: new "Call-tree inspection" subsection delegating to the reference doc.
- skills/review-pr/SKILL.md: inheritance note (delegated via grumpy-review).
- skills/check-pr-comments/SKILL.md: trigger walk when a resolved comment references a function modified in resolution commits.

Ephemeral-ID enforcement:
- skills/coding-best-practices/SKILL.md: new Cross-Cutting Rule banning transient review-finding IDs in committed artifacts; documented allow-list (ADR/RFC/CWE/CVE/OWASP/GHSA/GH refs/TODO/test-spec IDs); rule-of-thumb (regenerated JSON IDs = forbidden, committed-doc IDs = fine).
- scripts/lint_ephemeral_ids.py: NEW deterministic lint. Imports CATEGORY_PREFIX from consolidate_reports to prevent drift. JSON or text output, --diff mode for unified-diff + lines, always exit 0.
- skills/grumpy-review/SKILL.md + skills/review-pr/SKILL.md: wire the lint step, instruct LLM to dismiss in-skill example matches (no markers per design decision).
- allowed-tools updates for the walk + lint binaries on the three reviewer skills.

Tests:
- tests/test_lint_ephemeral_ids.py: positives across all 13 consolidator prefixes, negatives across allow-list (ADR/RFC/CWE/CVE/OWASP/GHSA/etc.), word-boundary edge cases, --diff mode parsing, prefix-source-of-truth assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- lint_ephemeral_ids: skip '\ No newline at end of file' diff marker
  so it no longer advances new_line (off-by-one on added matches after
  a no-newline final line); add regression test
- grumpy-review SKILL §3: add call_tree category + CALL- prefix to the
  generic finding-format block for consistency with the call-tree section
- call-tree-walk: fix ranking example to emit producer risk/impact/scope
  floats instead of coordinator-derived severity (producer-contract compliance)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR descriptions now open with a "Why this PR exists" section (problem,
reproduction/threat scenario, blocking relationship) ahead of What/Testing/
Breaking/Checklist/Attribution. The skeleton lives in one place —
git-and-github §Creating a PR — and push delegates to it (no duplication).
Pinned by tests/test_pr_body_template.py. Bumps plugin to 4.4.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ive bits

Producers (esp. check-pr-comments) emit per-finding risk/impact/scope floats
but no integer severity and never route through the coordinator, so findings
rendered INFO and severity_counts/severity_category_matrix stayed zero. Fix it
where it bites — at render time — and tidy up the schema while we're in here.

Fix 1 (HIGH): new scripts/severity_util.py is the single source of truth for
the OWASP band table, per-finding severity derivation, and the stats/matrix
builder. consolidate_reports imports from it and keeps the legacy private
names (_derive_overall/_derive_severity_int) as re-export aliases. The renderer
gains a normalization pass (markdown/html/triage/pdf): it derives each finding's
integer severity from the floats when absent and recomputes severity_counts +
severity_category_matrix when the supplied counts are missing or all-zero —
never clobbering non-zero supplied counts. triage inherits the fix via
_build_html_context.

Fix 2 (MED): schema accepts optional finding author_type (bot|human) so
check-pr-comments output validates.

Fix 3 (MED): permalink commit now derives from `git rev-parse @{u}` with a
HEAD fallback (check-pr-comments, report-format, grumpy-review) so links resolve
on GitHub instead of 404-ing on an unpushed HEAD.

Fix 4: schema 3.1.0 additive bits — schema_version accepts 3.0.0 and 3.1.0,
metadata.report_type adds pr_audit, finding_section gains optional verdict
(PASS|FAIL|NEEDS_REVIEW). severity stays optional on findings.

Schema -> 3.1.0, plugin -> 4.3.0. New tests: test_severity_util.py, schema
additive cases, render regression (floats-only report renders HIGH + non-zero
counts), re-export parity, and an end-to-end producer->validate->render check
in the shell pipeline. 357 passed; black/ruff clean; plugin validates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… tests

Completes PR-A. The earlier commit shipped schema 3.1.0, severity_util, and the
consolidate_reports re-export; this lands the renderer-side fix plus the
remaining producer/test/doc changes.

- generate_review_report.py: add _normalize_report() — derive each finding's
  integer severity from risk/impact/scope when absent, and recompute
  severity_counts + severity_category_matrix only when supplied counts are
  missing/all-zero (never clobbering non-zero ones). Invoked from
  render_markdown, _build_html_context (html + triage), and render_pdf.
  triage_server.py inherits via the triage format, no change needed.
- check-pr-comments / report-format / grumpy-review: permalink commit now uses
  git rev-parse @{u} with a git rev-parse HEAD fallback so links resolve on
  GitHub; check-pr-comments schema_version bumped to 3.1.0 (3.0.0 still accepted).
- consolidate_reports.py: drop unused build_severity_stats import.
- tests: test_severity_util.py (band mapping via derive_severity_int for exact
  boundaries — the float mean rounds at edges), test_render_v3.py regression
  (floats-only report renders HIGH + non-zero counts in md/html, supplied counts
  preserved), test_report_pipeline.sh end-to-end producer->validate->render,
  test_consolidate_reports asserts against cr.SCHEMA_VERSION not a frozen literal.
- CHANGELOG: 4.3.0 section.

384 tests pass; black/ruff clean; plugin validates; shell pipeline 26/0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The coordinator has its own compute_statistics; only the renderer uses the
shared build_severity_stats. Removing the dead import clears the last ruff F401.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tighten review-pr Pass C (promise verification) per TODOs 8c17d019 and
4463333d (F-1..F-5, F-7). DOC-ONLY in skills/review-pr/SKILL.md plus tests;
no schema changes (PR-A landed pr_audit + finding_section.verdict already).

- Fenced-body unwrap: strip+dedent a wholly-fenced PR body before the
  column-0 Summary/Out-of-scope regexes; emit one INFO "PR body unparseable"
  finding instead of silently skipping when nothing parses.
- Clean-pass shape: findings:[] plus one INFO "PR self-description verified".
- Compound titles: split on commas/em-dashes, verify per-topic, majority-hits.
- Undocumented-change: keep >=50 LOC trigger, define "mentioned" precisely
  (keyword overlap with a Summary bullet or field-ownership-table row).
- Summary-heading precedence: ## Summary > ### Summary > ## What changed.
- Optional finding_section.verdict (PASS/FAIL/NEEDS_REVIEW) + report_type.
- code_snippets language: cross-ref report-format §code_snippets, not hard-coded.

Tests: new tests/test_review_pr_passc.py (grep-assert the documented rules +
a parser-mirror of the fenced-body dedent that exposes ## Summary) and
tests/fixtures/pr-promises/synthetic-fenced.md (self-describing, carries the
standard <!-- expected --> annotation). Full suite: 358 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Documents Pass C v1.1 of the review-pr skill: closes ambiguities reviewers were resolving inconsistently and adds a fenced-body unwrap heuristic so PR descriptions wrapped in a single code fence are no longer silently skipped. Pure documentation/test/version change — no schema or executable behavior is introduced. Builds on the schema/enum additions landed in PR-A (pr_audit, finding_section.verdict).

Changes:

  • Pass C SKILL doc: fenced-body unwrap + "PR body unparseable" fallback; clean-pass shape with an explicit INFO marker; compound-title majority rule; precise "mentioned" definition for undocumented changes; Summary-heading precedence; optional verdict/report_type: pr_audit wiring; cross-reference for snippet language.
  • New doc-regression test (tests/test_review_pr_passc.py) with grep-asserts and a parser-mirror of the fenced-body unwrap, plus a synthetic-fenced.md fixture.
  • Bumps plugin manifest to 4.5.0 and adds the corresponding CHANGELOG entry.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
skills/review-pr/SKILL.md Adds Pass C v1.1 rules (fenced-body unwrap, clean-pass shape, compound titles, undocumented-change definition, heading precedence, verdict/report_type wiring, language cross-ref).
tests/test_review_pr_passc.py Grep-asserts the documented Pass C v1.1 rules and mirrors the documented fenced-body unwrap against a synthetic fenced body.
tests/fixtures/pr-promises/synthetic-fenced.md New Pass C fixture demonstrating a wholly fenced PR body with the standard <!-- expected --> annotation.
CHANGELOG.md Adds a 4.5.0 — 2026-05-29 entry summarising the Pass C v1.1 changes.
.claude-plugin/plugin.json Bumps the plugin version from 4.3.0 to 4.5.0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread skills/review-pr/SKILL.md Outdated
Comment thread skills/review-pr/SKILL.md Outdated
claude added 2 commits June 3, 2026 13:12
Bring the squash-merged PR #41 (commit 7fcc279, final reviewer-rigor incl.
2nd+3rd Copilot passes) into the severity-pipeline branch.

#42 was cut from #41 at b76bf72 (first pass only), so its branch carried a
stale copy of #41. Resolution rule: pure-#41 content -> take main (final,
thrice-reviewed); #42 severity-pipeline additions -> keep ours; mixed files
-> combine. 8 conflicts resolved:

- .claude-plugin/plugin.json: ours (4.3.0)
- CHANGELOG.md: combined (ours 4.3.0 above main's final 4.2.0; NNN
  placeholders, no LSP, lint-clean)
- schemas/review-report.schema.json: ours 3.1.0 superset + main's CALL-/
  call_tree id-prefix description
- skills/check-pr-comments,grumpy-review,review-pr SKILL.md: combined
  (main's gtags/tree-sitter allowed-tools + $BASE_BRANCH; ours @{u} permalink)
- skills/grumpy-review/references/call-tree-walk.md: main (LSP tier removed)
- tests/test_schema_v3_strict.py: ours (union, #42 TestV31AdditiveFields)

consolidate_reports.py + generate_review_report.py merged byte-identical to
ours (call_tree rendering and on-the-fly severity never overlapped). Also
normalizes black formatting on 6 pre-existing not-black-clean test files
(pure style, no semantic change). 379 tests pass; plugin validate clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three valid catches in the severity-pipeline code:

- generate_review_report: _normalize_summary_statistics rebuilt
  severity_counts + severity_category_matrix but left total_findings
  stale, so the HTML KPI could disagree with the rebuilt counts. Realign
  total_findings from the rebuilt stats (same all-zero/missing branch, so
  a hand-supplied non-zero stats block is still never touched).
- generate_review_report: corrected the _normalize_summary_statistics
  docstring — the real trigger is "counts missing/all-zero AND >=1 finding
  exists" (floatless findings count as INFO), not "a finding carries a
  derivable severity".
- consolidate_reports: ACCEPTED_SCHEMA_VERSIONS now derives from the
  schema's schema_version enum (single read shared with SCHEMA_VERSION)
  instead of a hard-coded set, so it can't drift on the next schema bump.
  _read_schema_version() preserved as a thin newest-entry wrapper.

Regression guards: anti-drift assert (ACCEPTED_SCHEMA_VERSIONS == set of
schema enum) and a total_findings-realignment test that fails without the
fix. 382 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three follow-on findings on the on-the-fly severity normalization:

- generate_review_report: _normalize_finding_severities now computes and
  stores overall_severity (mean of risk/impact/scope) when missing/invalid,
  then derives the integer band from that value — instead of filling only
  the integer severity. Producer-shape reports now populate the Markdown
  (overall=...) suffix and the HTML data-overall "Sort: overall" key. A
  finding with a valid int severity but no overall_severity still gets
  overall_severity filled; both already-valid values are left untouched.
- generate_review_report: import derive_overall/derive_severity_int from
  severity_util (derive_finding_severity stays in severity_util, still used
  by _effective_severity; the renderer no longer imports it).
- consolidate_reports: remove the now-dead _read_schema_version() wrapper —
  SCHEMA_VERSION reads _SCHEMA_VERSIONS[-1] directly since ACCEPTED_SCHEMA_
  VERSIONS became schema-enum-derived.

Two regression tests pin the overall_severity fill (floats-only finding +
int-severity-without-overall). 384 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Base automatically changed from feat/report-pipeline-severity to main June 3, 2026 12:06
claude added 6 commits June 3, 2026 14:13
# Conflicts:
#	.claude-plugin/plugin.json
#	CHANGELOG.md
# Conflicts:
#	.claude-plugin/plugin.json
#	CHANGELOG.md
PR-C's own test file (from f10308d) failed black --check in this
environment — assert line-wrapping only, no logic change. Surfaced
during the reviewer-rigor/severity-pipeline merges; brings the
git diff main...HEAD set fully clean under black.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	.claude-plugin/plugin.json
#	CHANGELOG.md
Rename ambiguous loop var l->ln and apply black to #44's new Pass C
regression test, carried in with the feat/pr-why-template merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	.claude-plugin/plugin.json
#	CHANGELOG.md
@lklimek lklimek marked this pull request as ready for review June 3, 2026 12:45
@lklimek lklimek requested a review from Copilot June 3, 2026 12:45
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread tests/test_review_pr_passc.py Outdated
…+ tighten fence parser

Three valid Copilot findings on PR #44:

- Pass C's "PR self-description verified" (clean pass) and "PR body
  unparseable" markers were labeled INFO but specified scope=1.0. With
  scope pinned at 1.0 the mean (risk+impact+scope)/3 floors at 1/3, so the
  band table can never reach INFO (needs <0.1) — both derived to MEDIUM. Fix
  per the check-pr-comments RESOLVED convention: informational Pass C
  findings use scope=0.0 (no actionable diff work). "verified" -> 0.1/0.1/0.0
  -> INFO (band 1); "unparseable" relabeled LOW -> 0.2/0.2/0.0 -> LOW (band
  2). Documented the Pass C scope=0.0 exception (scope=1.0 applies only to
  axis 1-3 mismatches).
- Tightened the fenced-body unwrap's closing-fence check (SKILL spec + the
  test's parser-mirror): the closer must be the same fence character, length
  >= the opener, and contain only fence chars — previously a 3-char closer
  unwrapped a 4-char opener and "```python" trailing text was accepted.

Guards added: a band-derivation test pinning the documented floats to the
severity_util bands (INFO/LOW), and fence cases for the 4-vs-3 mismatch and
trailing-text-after-fence. 408 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lklimek lklimek requested a review from Copilot June 3, 2026 13:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread CHANGELOG.md Outdated
Copilot follow-on on PR #44: the 4.5.0 CHANGELOG entry still described the
"PR body unparseable" Pass C fallback as INFO, but the SKILL fix relabeled
it LOW (scope=0.0, risk≈impact≈0.2 -> mean ≈0.133 -> band 2). Align the
changelog. The "PR self-description verified" finding correctly stays INFO.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread tests/test_review_pr_passc.py Outdated
Comment thread skills/review-pr/SKILL.md Outdated
…t comment

Copilot's 4th pass on PR #44 flagged that the scope=1.0 rationale overstated
"MEDIUM". Verified ground truth: at scope=1.0, {0.1,0.1,1.0} derives to LOW(2)
(IEEE-754 puts (0.1+0.1+1.0)/3 = 0.39999... just under the 0.4 cutoff), and
{0.2,0.2,1.0} to MEDIUM(3). The explanatory text in both the SKILL exception
note and the test comment wrongly said both inflate to MEDIUM.

- SKILL: reword the exception note to state the real means/bands (LOW + MEDIUM).
- test: rename test_scope_one_would_inflate_to_medium ->
  test_scope_one_lifts_above_documented_bands; correct the comment and document
  the IEEE-754 boundary so the intentional ==2 isn't mistaken for a typo.

Assertions unchanged (both verified correct). The Copilot complaint that the
==2 assertion "will fail" was a false positive — it's the genuine float result.
408 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lklimek lklimek merged commit b2a43b2 into main Jun 3, 2026
2 checks passed
@lklimek lklimek deleted the feat/passc-v1.1 branch June 3, 2026 23:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants