feat(skill): introduce CONCEPTS.md as shared vocabulary substrate #838
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4225fa13d4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Adds a domain-vocabulary artifact maintained as a side effect of compounding. CONCEPTS.md is the substrate that learnings cite: entities, named processes, and status concepts with project-specific precise meaning. Lazy creation, opportunistic AGENTS.md discoverability, no user prompts.
Ownership model:
- ce-compound and ce-compound-refresh create and maintain the file. Both also surface CONCEPTS.md to AGENTS.md/CLAUDE.md on first creation via the existing Discoverability Check, so future agents discover the file.
- ce-brainstorm and ce-plan are contributors only: they add to or refine CONCEPTS.md when terms surface, but skip writes entirely when the file doesn't exist. This avoids speculative bootstrapping from pre-implementation work.
- ce-learnings-researcher reads CONCEPTS.md as grounding before keyword extraction so result distillation uses canonical terminology.
ce-compound and ce-compound-refresh both bundle a concepts-vocabulary.md reference with inclusion criteria, format rules, and an illustrative example. ce-brainstorm and ce-plan intentionally do not; they learn format from the existing file's contents. Plugin AGENTS.md gains a note that the two reference copies must stay in sync.
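The ownership model above can be read as a small permission table. The sketch below is illustrative only: the skills are prose instructions, so `Role`, `ROLES`, and `decideWrite` are hypothetical names and not code from the plugin.

```typescript
// Hypothetical sketch of the CONCEPTS.md ownership model described above.
// None of these names exist in the plugin; the skills express this in prose.
type Role = "creator" | "contributor" | "reader";
type WriteDecision = "create" | "update" | "skip";

const ROLES: Record<string, Role> = {
  "ce-compound": "creator",
  "ce-compound-refresh": "creator",
  "ce-brainstorm": "contributor",
  "ce-plan": "contributor",
  "ce-learnings-researcher": "reader",
};

function decideWrite(skill: string, fileExists: boolean): WriteDecision {
  const role: Role | undefined = ROLES[skill];
  // Unknown skills and readers never write to the glossary.
  if (role === undefined || role === "reader") return "skip";
  // Creators and contributors may both refine an existing file.
  if (fileExists) return "update";
  // Only creators bootstrap; contributors skip when the file is absent.
  return role === "creator" ? "create" : "skip";
}
```

The sketch covers permissions only. Whether a creator actually writes still depends on the qualifying criteria in concepts-vocabulary.md, which are a judgment call rather than a rule that code could enforce.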
Step 6's amend/follow-up commit logic only mentioned step 4 (docs/solutions discoverability edit). When step 4 produces no edit but step 5 (the new CONCEPTS.md discoverability path) does, the new instruction-file change would be left out of the commit sequence and end up as a dirty worktree or an omitted edit. Cover both edit paths in step 6.
External test surfaced two structural failures in ce-compound that an
LLM orchestrator can hit even when following the skill text:
1. ce-sessions return read as a terminus. Phase 1's parallel block ended
on three subagents, then ce-sessions ran synchronously as the final
input. Phase 2 said "WAIT for all Phase 1 subagents" -- which an LLM
could read as not including the skill call. The agent emitted
ce-sessions's output to the user and stopped.
Fix: add a forward-edge sentence at the end of step 4 ("ce-sessions
is the final Phase 1 input, not a workflow stop"), and broaden the
Phase 2 WAIT line to "all Phase 1 inputs" with an explicit note that
ce-sessions counts despite being a skill rather than a subagent.
2. Phase 2.4's "skip entirely if no terms qualify" let agents vibe-judge
"nothing qualifies" from the inline criteria teaser and skip reading
references/concepts-vocabulary.md entirely -- the opposite of the
stated intent.
Fix: invert the phase so "First, read the reference" is the
unconditional opener, drop the inline criteria teaser (per the
no-duplication-with-references principle), and replace the silent-
skip path with a visible "Vocabulary capture: scanned, no qualifying
terms" outcome the agent must record.
Propagated the Phase 2.4 fix to ce-compound-refresh's Phase 4.5 -- same
structural risk, same shared reference, both phases introduced on this
branch. Tightened both success-output templates from the ambiguous
"skipped (no qualifying terms)" to the unambiguous "scanned, no
qualifying terms" so the audit signal cannot be confused with "didn't
bother to check".
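The Phase 2.4 inversion described above (read the reference first, then always record an outcome) can be sketched as follows. This is a minimal illustration, not the skill itself: `runVocabularyCapture`, `CaptureOutcome`, and the two callback parameters are hypothetical names, and only the "scanned, no qualifying terms" message comes from the commit text.

```typescript
// Hypothetical sketch: the reference is read unconditionally, and the
// "nothing qualified" path returns a visible outcome instead of skipping.
type CaptureOutcome =
  | { kind: "captured"; terms: string[] }
  | { kind: "scanned-none"; message: string };

function runVocabularyCapture(
  readReference: () => string,
  scan: (criteria: string) => string[],
): CaptureOutcome {
  // Unconditional opener: criteria come from the reference, never from memory.
  const criteria = readReference();
  const terms = scan(criteria);
  if (terms.length === 0) {
    // Recorded outcome, so an audit can tell "checked" from "didn't check".
    return {
      kind: "scanned-none",
      message: "Vocabulary capture: scanned, no qualifying terms",
    };
  }
  return { kind: "captured", terms };
}
```

The point of the shape is that there is no return path that bypasses `readReference`, which mirrors the removal of the silent-skip branch.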
ce-brainstorm Phase 1.4 and ce-plan §5 gap-fill are contributors to
CONCEPTS.md but neither loads concepts-vocabulary.md, so the criteria
preventing implementation details from creeping in lived only where
the contributors couldn't see them. Add an inline negative-framing
line to both ("domain entities, named processes, and status concepts
with project-specific meaning only — not file paths, class names, or
implementation decisions"). Also drop rationale tails that did not
change agent behavior at runtime.
Users may type "create my CONCEPTS.md" without an existing learning corpus, particularly in cold repos. Previously this had no clean routing path — ce-compound's description didn't match the request, so the main agent ad-hoc'd a response. Update ce-compound's description to declare CONCEPTS.md as a stated responsibility, and add a short intercept block near the top of the skill body. The block redirects without performing a bootstrap: explains the accretion model, notes that cold-start codebase scans are intentionally unsupported (the qualifying bar is judgmental), and offers three real next steps — run ce-compound on a real learning, ce-compound-refresh on an existing corpus, or hand-edit directly.
ce-compound Phase 2.4 and ce-compound-refresh Phase 4.5 establish the glossary-only rule for CONCEPTS.md but only apply it prospectively to new entries. Existing drift (file paths, class names, function signatures, status/owner metadata) survived every run. Add active correction at two scopes matched to each skill's character. ce-compound fixes opportunistically — only entries being touched or adjacent to them — because compound is not an audit. ce-compound-refresh runs a full sweep as Phase 4.5 step 6 because refresh is an audit. Extend the refresh report's CONCEPTS.md line to surface the scrubbed count alongside added and refined.
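The two correction scopes can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the regular expressions are crude heuristics for file paths and function signatures (class names are not detectable this way), "adjacent" is interpreted as neighboring in file order, and the entry data is invented. The real check is an agent judgment against concepts-vocabulary.md, not a regex.

```typescript
// Hypothetical sketch of the opportunistic vs. full-sweep scrub scopes.
interface Entry {
  term: string;
  definition: string;
}

// Assumed heuristics, not the plugin's criteria.
const FILE_PATH = /(?:^|\s)[\w.-]+\/[\w./-]+\.\w+/;
const FUNCTION_SIGNATURE = /\b\w+\([^)]*\)/;

type ScrubMode = "opportunistic" | "full-sweep";

function entriesInScope(entries: Entry[], touched: Set<string>, mode: ScrubMode): Entry[] {
  if (mode === "full-sweep") return entries; // refresh audits everything
  const isTouched = (e: Entry | undefined): boolean =>
    e !== undefined && touched.has(e.term);
  // compound only looks at touched entries and their immediate neighbors.
  return entries.filter(
    (_, i) => isTouched(entries[i]) || isTouched(entries[i - 1]) || isTouched(entries[i + 1]),
  );
}

function findViolations(entries: Entry[]): string[] {
  return entries
    .filter((e) => FILE_PATH.test(e.definition) || FUNCTION_SIGNATURE.test(e.definition))
    .map((e) => e.term);
}

// Invented example data.
const glossary: Entry[] = [
  { term: "Learning", definition: "A documented solution to a solved problem." },
  { term: "Seed", definition: "Created by seedGlossary(scope) in src/seed/run.ts." },
  { term: "Scrub", definition: "Removal of implementation detail from an entry." },
  { term: "Drift", definition: "See lib/drift.ts for details." },
];

const opportunistic = findViolations(
  entriesInScope(glossary, new Set(["Learning"]), "opportunistic"),
);
const fullSweep = findViolations(entriesInScope(glossary, new Set<string>(), "full-sweep"));
```

In this example the opportunistic pass flags only `Seed` (adjacent to the touched `Learning` entry), while the full sweep also flags `Drift`, which is the kind of existing drift a compound run would leave for the next refresh.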
When ce-compound or ce-compound-refresh first creates CONCEPTS.md, write a short preamble at the top explaining what the file is, how it accretes, and what it isn't (glossary only, not a spec or scratchpad). Visible prose under the # Concepts heading so both humans browsing the rendered file and agents reading the raw file see the same framing — an HTML comment would have hidden the model from human readers on GitHub for no real gain.
The "at least one qualifying term" gate in ce-compound Phase 2.4 and ce-compound-refresh Phase 4.5 step 3 could allow a permissive agent to seed CONCEPTS.md from a routine bug fix that only surfaced class or table names dressed up as entities. The criteria in concepts-vocabulary.md are correct but judgmental, and lenience at the creation moment seeds a thin file the team didn't actually need. Add an explicit "hold the qualifying bar conservatively at creation" rule to both skills. Borderline terms defer to a later run with stronger signal. The conservatism is quality, not count — the asymmetric-trap defense against minimum-count gating is preserved. Updates to an existing file continue to follow normal criteria.
After comparing against grill-with-docs (third-party skill for a similar artifact), sharpen how CONCEPTS.md is framed across the plugin and close a terminology-capture gap.
In references/concepts-vocabulary.md (both copies):
- Lead with "Be opinionated" as the file's stance.
- Replace the enumerated "What never appears" list with the principle "The file stands on its own": one mental test that subsumes the existing exclusions and extends to cases we hadn't enumerated.
- Add aliases-per-entry format (*Avoid: X, Y*) so retired synonyms ride alongside their canonical term.
- Tighten "Per entry" to one-sentence base definition; explicit second-paragraph allowance for non-obvious behavioral rules only.
- Add optional Relationships section when structure is load-bearing.
- Rename "Resolved ambiguities" to "Flagged ambiguities."
In ce-brainstorm Phase 1.1: reframe CONCEPTS.md as the project's authoritative vocabulary (was: shared domain vocabulary that anchors terms here). Carries authority across the whole session without needing to restate "use canonical names" at every downstream phase.
In ce-compound Phase 2.4: extend the vocabulary scan to include ce-sessions findings when Full mode runs. Session findings carry terminology resolution context from prior brainstorm, plan, and work dialogues; without this, that context was being pulled in for research but ignored at capture time.
Also replace "scratchpad" with "catch-all" across four locations: clearer naming of the failure mode (dumping ground for things that don't fit elsewhere).
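One way to picture what the aliases-per-entry format enables is a synonym lookup. The sketch below is an assumption-laden illustration: only the `*Avoid: X, Y*` line shape comes from the commit message, while `GlossaryEntry`, `buildAliasMap`, `toCanonical`, and the sample entry are hypothetical. In the plugin, the mapping is done by agents reading the file, not by a parser.

```typescript
// Hypothetical sketch: collect retired synonyms from "*Avoid: X, Y*" lines
// and map them back to the canonical term.
interface GlossaryEntry {
  term: string; // canonical name
  body: string; // entry text, which may contain an "*Avoid: ...*" line
}

const AVOID_LINE = /\*Avoid:\s*([^*]+)\*/;

function buildAliasMap(entries: GlossaryEntry[]): Map<string, string> {
  const aliases = new Map<string, string>();
  for (const entry of entries) {
    const match = AVOID_LINE.exec(entry.body);
    if (match === null) continue; // entry has no retired synonyms
    const list = match[1] ?? "";
    for (const alias of list.split(",")) {
      const key = alias.trim().toLowerCase();
      if (key.length > 0) aliases.set(key, entry.term);
    }
  }
  return aliases;
}

function toCanonical(word: string, aliases: Map<string, string>): string {
  // Unknown words pass through unchanged.
  return aliases.get(word.trim().toLowerCase()) ?? word;
}

// Invented sample entry.
const sampleAliases = buildAliasMap([
  { term: "Learning", body: "A documented solution to a solved problem.\n*Avoid: lesson, write-up*" },
]);
```

With this sample, `toCanonical("Lesson", sampleAliases)` resolves to `Learning`, and a word with no alias is returned as typed.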
Earlier in this branch, Phase 2.4's vocabulary scan was extended to include ce-sessions findings as a third input. Architectural review surfaced two problems with that wiring:
- ce-compound's payload to ce-sessions includes a "directly relevant to this specific problem; ignore unrelated work" filter rule, which actively suppresses the tangential context where vocabulary often lives. The filter is correct for fix-context retrieval but wrong for vocabulary capture -- the two needs pull in opposite directions.
- Wiring named external sources into Phase 2.4 creates maintenance debt: every new research input (future Slack research, Linear context, etc.) requires updating the scan input list.
Revert to scanning only the new doc and the surrounding conversation. Both are always available to the orchestrating agent -- no plumbing, no filter-rule mismatch. Conversation catches mid-dialogue vocabulary resolutions that didn't make the doc; the doc captures terms the writer judged worth recording. Terms that emerged only in non-conversation sources (research subagents, ce-sessions) flow into Phase 2.4 indirectly via the doc-writer's synthesis, which is the right level of curation.
If external-source vocabulary mining ever becomes a real need, design it as a dedicated dispatch with a vocabulary-tuned payload, not as a Phase 2.4 scan input.
Adds an eval suite that tests whether ce-sessions findings preserve terminology resolution context -- specifically, whether distinctive coined terms and their resolution rationale survive the session-historian synthesis step intact.
Four test cases with ground truth from recently merged PRs:
- synthesis-gate-recovery (PR #822): distinctive term recovery
- mode-headless-semantic-alignment (PR #813): multi-piece nuance
- tangential-term-recovery: indexing-gap test
- near-miss-false-positive: discriminating-power test
Two-stage grader: programmatic substring match per criticality tier, plus LLM-graded context preservation. Variance protocol: 3 runs per eval.
This suite was built during PR #838's design exploration to validate a load-bearing assumption (that ce-sessions findings could feed ce-compound Phase 2.4's vocabulary scan). That assumption was ultimately retired in favor of doc-and-conversation-only scanning, so the suite is not load-bearing for PR #838. Kept as future infrastructure for validating ce-sessions's behavior as the skill evolves, e.g., when changing the session-historian synthesis prompt or adjusting scan-window defaults.
Iteration-1 results (executed via skill-creator framework, captured to /tmp/compound-engineering/ce-sessions/evals/iteration-1/) showed ce-sessions preserved terminology strongly across all 4 evals with 100% must-tier recall and 0% stddev, but this is a capability test of the skill in isolation, not a test of any specific integration.
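The programmatic stage of the grader and the 3-run variance protocol can be sketched roughly as below. This is not the suite's grader: the `should` tier, case-insensitive matching, population standard deviation, the function names, and the placeholder terms are all assumptions. Only the "must" tier, substring matching, and three runs per eval come from the commit message. The LLM-graded stage is not modeled.

```typescript
// Hypothetical sketch of per-tier substring recall plus run-to-run variance.
type Tier = "must" | "should"; // "must" is named in the source; "should" is assumed

interface ExpectedTerm {
  text: string;
  tier: Tier;
}

function tierRecall(output: string, expected: ExpectedTerm[], tier: Tier): number {
  const inTier = expected.filter((term) => term.tier === tier);
  if (inTier.length === 0) return 1; // nothing to recall in this tier
  const haystack = output.toLowerCase();
  const hits = inTier.filter((term) => haystack.includes(term.text.toLowerCase()));
  return hits.length / inTier.length;
}

// Population standard deviation; expects a non-empty array.
function meanAndStddev(values: number[]): { mean: number; stddev: number } {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / values.length;
  return { mean, stddev: Math.sqrt(variance) };
}

// Placeholder terms and outputs, invented for illustration.
const expectedTerms: ExpectedTerm[] = [
  { text: "example gate", tier: "must" },
  { text: "example mode", tier: "must" },
  { text: "example window", tier: "should" },
];
const runOutputs = [
  "Findings mention the Example Gate and the example mode.",
  "The example mode interacts with the example gate.",
  "example gate; example mode; nothing else.",
];
const mustRecalls = runOutputs.map((output) => tierRecall(output, expectedTerms, "must"));
const mustSummary = meanAndStddev(mustRecalls);
```

When every run recovers every must-tier term, as in this toy data, the summary is a mean of 1 with zero spread, which is the shape of result the iteration-1 report describes.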
9afd410 to
e5b096e
💡 Codex Review
Reviewed commit: e5b096ecec
Accretion alone captures only the peripheral terms that surface through friction; the stable-central nouns a project is built around never break, so they never appear in a learning and never get defined. The live output (PR #896) proved it: a scoped run produced Beta skill / Confidence anchor / Autofix class with zero core domain nouns, so the captured terms dangled against undefined siblings.
Reverse the no-cold-start-scan stance and seed proactively:
- concepts-vocabulary.md: add accretion-vs-seeding framing and a Seed goal expressed as goal + qualifying bar (codebase sets the count, no fixed number); distinguish in-scope seeding from repo-wide bootstrap; require entries to lean only on defined siblings or general English; sharpen the no-current-config rule (state behavior, not threshold numbers).
- ce-compound: stop refusing CONCEPTS.md bootstrap -- redirect standalone requests to ce-compound-refresh; seed the learning's area at creation; widen the opportunistic-fix pass into a bounded coherence-neighborhood refresh.
- ce-compound-refresh: standalone bootstrap now asks "create the concept map" vs "run a refresh cycle" (create does the repo-wide seed); creation seeds in-scope core nouns; every run reconciles in-scope core nouns as a safety net for stable-central terms.
Both concepts-vocabulary.md copies kept byte-identical.
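The "dangling against undefined siblings" failure can be made concrete with a small check. The sketch assumes each entry's dependencies are already known as a list (`leansOn`), which is a simplification: how an agent recognizes that an entry leans on another term is a reading judgment, and the dependency of "Beta skill" on "Skill" below is invented for illustration. `ConceptEntry` and `danglingReferences` are hypothetical names.

```typescript
// Hypothetical sketch: report entries that lean on terms the file never defines.
interface ConceptEntry {
  term: string;
  leansOn: string[]; // sibling terms the definition relies on (assumed input)
}

function danglingReferences(entries: ConceptEntry[]): Map<string, string[]> {
  const defined = new Set(entries.map((entry) => entry.term.toLowerCase()));
  const dangling = new Map<string, string[]>();
  for (const entry of entries) {
    const missing = entry.leansOn.filter((ref) => !defined.has(ref.toLowerCase()));
    if (missing.length > 0) dangling.set(entry.term, missing);
  }
  return dangling;
}

// Accretion-only file: a peripheral term with no core noun behind it.
const accretedOnly: ConceptEntry[] = [{ term: "Beta skill", leansOn: ["Skill"] }];

// Seeded file: the core noun is defined, so nothing dangles.
const seeded: ConceptEntry[] = [
  { term: "Skill", leansOn: [] },
  { term: "Beta skill", leansOn: ["Skill"] },
];
```

Running the check on `accretedOnly` reports `Beta skill` as leaning on the undefined `Skill`; on `seeded` it reports nothing, which is the effect seeding core nouns is meant to have.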
💡 Codex Review
Reviewed commit: 3980a2b117
Seed CONCEPTS.md from the project's declared domain model across both the compound-engineering plugin and the converter/CLI (coding-tutor excluded): 17 core domain nouns in four clusters -- plugin parts, conversion, compound engineering, and review/workflow vocabulary. Follows the updated vocabulary rules: state behavior not config values (no anchor numbers or enum-string lists), no file/class names, cross-referenced siblings, synonyms folded in. Surface the new file in AGENTS.md so agents discover it.
The preamble claimed the file purely "accretes," but creation now seeds core domain vocabulary first. Reword to "Seeded with core domain vocabulary, then accretes …" so the file describes how it was actually built. Applied to both SKILL bootstrap blocks and the generated glossary.
💡 Codex Review
Reviewed commit: e5ad7f6a53
CONCEPTS.md and its AGENTS.md discoverability line move to PR #838, which introduces the seeding skill change that produces the repo-wide glossary. This branch's copy predates that fix (the lumpy 3-entry version); dropping it here so it doesn't collide with #838 at merge. This PR now scopes to the skill-design learnings refresh.
Inherited from the stacked base; the canonical CONCEPTS.md and its AGENTS.md discoverability line move to PR #838, which introduces the seeding skill change that produces it. Dropping the pre-seed-fix copy here so this filename-normalization PR doesn't reintroduce it at merge.
The Full Mode critical_requirement enumerated the orchestrator's writes as "solution doc + instruction-file edit," omitting the CONCEPTS.md create/update that Phase 2.4 performs. That conflicting signal could lead the model to skip or under-report the vocabulary write. Reframe "primary output is ONE file" as "primary deliverable," and list CONCEPTS.md alongside the instruction-file edit as a maintenance side effect. Fix the same omission in the Common Mistakes table.
Four reconciliations from PR review of the seed redesign:
- ce-compound-refresh: the standalone "Create CONCEPTS.md" path said to seed and "stop," bypassing Phase 5 -- it now enters the commit flow so the new file + discoverability edit aren't left uncommitted.
- ce-compound-refresh: Phase 4.5's "must surface created CONCEPTS.md" note contradicted the headless no-instruction-edit boundary; made it mode-aware (interactive edits with consent; headless reports a recommendation).
- ce-compound: Phase 2.4 claimed it runs "in every mode," but Lightweight never reached it. Added an update-only vocabulary-capture step to the Lightweight flow (refines an existing CONCEPTS.md; defers seeding/bootstrap to a Full run) and reconciled the surrounding "one file" wording.
- ce-brainstorm: moved vocabulary capture from Phase 1.4 (before approaches/synthesis) to Phase 3.5 (after the requirements doc), so it captures the final resolved terminology rather than pre-approach guesses.
💡 Codex Review
Reviewed commit: 1e7b8b2091
…mode The lightweight update-only vocabulary step can refine an existing CONCEPTS.md, but lightweight never runs the full Discoverability Check and its output only tipped about docs/solutions/. A lightweight run could update the glossary while leaving it undiscoverable in AGENTS.md/CLAUDE.md. Add a parallel CONCEPTS.md discoverability tip (lightweight tips, doesn't edit instruction files — a Full run owns that).
…guard The relocated vocabulary-capture heading reused "Phase 3.5", which tests/pipeline-review-contract.test.ts bans — ce-brainstorm deliberately dropped a forced "Phase 3.5" document-review step in favor of the opt-in Phase 4 menu, and the test guards against its return. Demote the section to a "#### Vocabulary Capture — after the requirements doc" subsection of Phase 3; same post-doc timing, no collision with the guard.
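A guard of the kind described above can be pictured as a check over markdown headings. The sketch below is not the contents of tests/pipeline-review-contract.test.ts, which are not shown in this conversation; `findBannedHeadings` and both sample documents are hypothetical, and matching only ATX-style `#` headings is an assumption.

```typescript
// Hypothetical sketch: find headings that contain a banned phrase.
function findBannedHeadings(markdown: string, banned: string[]): string[] {
  const headingLines = markdown
    .split("\n")
    .filter((line) => /^#{1,6}\s/.test(line)); // ATX headings only (assumed)
  return headingLines.filter((line) => banned.some((phrase) => line.includes(phrase)));
}

// Invented sample documents.
const withCollision = [
  "## Phase 3",
  "### Phase 3.5: Vocabulary Capture",
  "Body text that mentions Phase 3.5 is not a heading.",
].join("\n");

const afterDemotion = ["## Phase 3", "#### Vocabulary Capture", "Body text."].join("\n");
```

Under these assumptions the first sample trips the guard on its heading line only, and the demoted subsection passes because the banned phrase no longer appears in any heading.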
What this PR does
Adds CONCEPTS.md, a repo-root glossary of the words this codebase uses in a specific way (domain entities, named processes, status concepts where two engineers might disagree if you didn't pin them down), and wires it into the plugin's skills so it's seeded from the codebase at creation and then accretes from real work, rather than living as a separate documentation project.
Why it matters
CONCEPTS.md is a compounding-knowledge play in three dimensions:
How it works
- Seeding, not just accretion. At creation, ce-compound / ce-compound-refresh seed the area's core domain nouns (sized by the codebase, held to the qualifying bar) so the file names what the project is from the start, instead of filling only with the peripheral terms that happen to surface through friction. A scoped run seeds its area; an explicit "create CONCEPTS.md" request seeds the repo-wide model. (This PR's root CONCEPTS.md is that output: 17 entries across plugin / conversion / compound-engineering / review clusters.)
- Creation is concentrated. Only ce-compound and ce-compound-refresh create the file, each writing a visible preamble that teaches the artifact's role; borderline terms defer to a later run rather than seeding a weak entry.
- Contributors add, don't create. ce-brainstorm and ce-plan refine entries when terms resolve during dialogue or planning, but skip writes entirely when the file doesn't exist.
- Readers ground in it. ce-learnings-researcher reads it before keyword extraction; ce-brainstorm and ce-plan map user-offered synonyms to the canonical names.
- Self-correcting. ce-compound refreshes the coherence neighborhood of any entry it touches (fix glossary violations + drift, on evidence already in hand); ce-compound-refresh runs a full scrub and reconciles in-scope core nouns every run. The refresh summary reports seeded / added / refined / reconciled / scrubbed counts.
- Cold-start is supported. Typing "create my CONCEPTS.md" routes from ce-compound to ce-compound-refresh, which asks whether to build the concept map or run a refresh cycle, so no ad-hoc thin file is produced.
- Quality discipline. concepts-vocabulary.md (duplicated across the two creators, kept in sync) leads with "be opinionated" (pick the canonical term, retire synonyms as aliases) and "the file stands on its own" (each entry teaches its concept without the codebase or external context, and avoids current-config values that drift).
Files changed
- CONCEPTS.md (root): […]
- AGENTS.md (root): […] CONCEPTS.md (discoverability)
- ce-compound/SKILL.md: […] ce-compound-refresh; Phase 2.4 seeds in-scope core nouns at creation + coherence-neighborhood refresh; preamble on bootstrap; CONCEPTS lines in reports
- ce-compound/references/concepts-vocabulary.md: […]
- ce-compound-refresh/SKILL.md: […]
- ce-compound-refresh/references/concepts-vocabulary.md: […] ce-compound's (no cross-skill references per plugin AGENTS.md)
- agents/ce-learnings-researcher.md: […] CONCEPTS.md before keyword extraction
- ce-brainstorm/SKILL.md: […] CONCEPTS.md as authoritative vocabulary; Phase 1.4 contributor-only capture (glossary-only boundary)
- ce-plan/SKILL.md: […] CONCEPTS.md in the planning-context step; plans with canonical terms
- plugins/compound-engineering/AGENTS.md: […] concepts-vocabulary.md copies in sync
- ce-sessions/evals/{README.md, evals.json, grader.md}: […] ce-sessions (auxiliary; built during design exploration, not load-bearing for this PR)
Test plan
- bun run release:validate clean (43 agents, 38 skills); CI test + pr-title checks green
- frontmatter + skill-shell-safety suites pass (the local CLI-install suite failures are pre-existing and network/sandbox-dependent; they pass in CI)
- […] skill-creator skill; feedback applied on over-prescription and drift
- […] CONCEPTS.md here is the seed output (17 coherent entries), critiqued and iterated
- ce-sessions terminology-preservation evals (auxiliary): 100% must-tier recall, 0% stddev across 4 evals × 3 runs
- […] ce-compound on a real learning: confirm scoped seed + discoverability mention
- […] ce-brainstorm / ce-plan ground when the file exists, skip create when it doesn't