Add CSI coarse-sense layer as an add-on over frozen lexen-v1#1
Add CSI coarse-sense layer as an add-on over frozen lexen-v1#1vassiliphilippov wants to merge 4 commits into
Conversation
Ships CSI (Coarse Sense Inventory; Lacerra et al., AAAI 2020, CC BY-NC-SA 4.0)
as a second, public coarse-sense layer alongside the native in-record Glite
layer, so coarse-sense results are reproducible under a third-party inventory
no lexEN author controls.
- coarsenings/csi/: forward map (sense_key -> CSI composite concept), aliases,
coverage summary, details.json, description.md.
- coarsenings/csi/csi_layer.jsonl: per-item sidecar (keyed by item_id), the
`csi` block in the same shape as item["glite"]; join on item_id.
- scripts/apply_csi.py: rebuilds the sidecar (stdlib only) and asserts it
reproduces the existing Glite block byte-for-value first (structural parity).
- LICENSE + CHANGELOG: CC BY-NC-SA 4.0 attribution; additive-only note.
Additive only: data/lexen-v1/{items.jsonl,dataset.json} and
sources/manifest.json are unchanged (byte-identical), the release content hash
is unchanged, and lexen-verify-release --release lexen-v1 still passes. No new
dataset version.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 10 minutes and 54 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis pull request adds a CSI (Coarse Sense Inventory) coarsening layer as an additive sidecar to the frozen Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
CI `verify` runs ruff check, ruff format --check, and mypy src scripts tests:
- E741/SIM115/PTH123/E501: read items via `with ITEMS.open()` (rename l->line),
write the sidecar via `SIDECAR.open("w")`, wrap the final print.
- ruff format: reflow load_mapping/csi calls and the print.
- mypy [type-arg]: annotate the JSON-record helpers (local `type JsonObject =
dict[str, Any]`, matching src/lexen/models.py; precise dict types on label /
cand_mappings). Style only -- generated csi_layer.jsonl unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b45dc79 to
b54d9f7
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
scripts/apply_csi.py (1)
30-41: 💤 Low valueEnsure map file existence is enforced.
The
load_mapping()function silently opensmap_pathwithout existence check or error handling. If the file is missing or inaccessible, it will raise an unhandledFileNotFoundErrorat line 32. While this may be intentional (fail-fast), consider adding an explicit check with a clear error message, especially since the alias path is explicitly checked at line 38.Optional: Add explicit error handling
def load_mapping(map_path: Path, alias_path: Path) -> dict[str, str]: if not map_path.exists(): raise FileNotFoundError(f"Mapping file not found: {map_path}") mapping: dict[str, str] = {} with map_path.open(encoding="utf-8") as fh: for line in fh: line = line.strip() if line: row = json.loads(line) mapping[row["sense_key"]] = row["concept_id"] if alias_path.exists(): for a in json.loads(alias_path.read_text())["aliases"]: mapping[a["source_sense_key"]] = a["concept_id"] return mapping
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 300fca9d-a321-47f9-b971-7a1f0f872d05
📒 Files selected for processing (10)
CHANGELOG.mdLICENSEcoarsenings/README.mdcoarsenings/csi/csi_layer.jsonlcoarsenings/csi/description.mdcoarsenings/csi/details.jsoncoarsenings/csi/files/csi_report_aliases.jsoncoarsenings/csi/files/mapping_coverage.jsoncoarsenings/csi/files/wordnet_sense_key_to_csi_concept.jsonlscripts/apply_csi.py
📜 Review details
🧰 Additional context used
🪛 ast-grep (0.43.0)
scripts/apply_csi.py
[info] 107-107: use jsonify instead of json.dumps for JSON output
Context: json.dumps({"item_id": it["item_id"], "csi": block})
Note: Security best practice.
(use-jsonify)
🔇 Additional comments (2)
scripts/apply_csi.py (1)
85-99: Fidelity regression logic is excellent; assert message is clear.The byte-for-byte reproduction check (lines 92–99) is a robust safeguard that proves the CSI builder uses the same structural logic as the release-time Glite coarsener. The logic is sound:
- Rebuilds glite block from mapping
- Compares with stored glite block
- Fails if any mismatch, with up-to-3 sample IDs for debugging
This is exactly the kind of verification needed for a frozen-release additive layer. No concerns.
coarsenings/csi/files/mapping_coverage.json (1)
12-2067: No verification issue found; the mapping_coverage.json metadata is consistent.The
missing_required_sense_keysarray contains exactly 2,050 items, which correctly matches the calculated difference: 9,833 required keys minus 7,783 mapped keys = 2,050 unmapped keys. The metadata fields are accurate and internally consistent. No discrepancy exists.> Likely an incorrect or invalid review comment.
| This layer is an **add-on resource over the frozen `lexen-v1` release**: it does not modify | ||
| `data/lexen-v1/items.jsonl` or the release content hash. Per-item CSI labels are shipped as the | ||
| sidecar `data/lexen-v1/csi_layer.jsonl`, keyed by `item_id`, mirroring the in-record `glite` block. |
There was a problem hiding this comment.
Fix the CSI sidecar output path in description.md.
Line 17 states the sidecar is shipped at data/lexen-v1/csi_layer.jsonl, but according to the actual builder script (scripts/apply_csi.py line 22) and coarsenings/README.md line 24, the sidecar path is coarsenings/csi/csi_layer.jsonl. This path mismatch will confuse users who consult the documentation.
Update line 17 to specify the correct path: coarsenings/csi/csi_layer.jsonl.
Suggested fix
-Per-item CSI labels are shipped as the
-sidecar `data/lexen-v1/csi_layer.jsonl`, keyed by `item_id`, mirroring the in-record `glite` block.
+Per-item CSI labels are shipped as the
+sidecar `coarsenings/csi/csi_layer.jsonl`, keyed by `item_id`, mirroring the in-record `glite` block.| def record(item: dict, m: dict[str, str], mapping_id: str, schema_version: str) -> dict: | ||
| rows = cand_mappings(item["candidate_sense_keys"], m) | ||
| review = item.get("review") or {} | ||
| ann = review.get("annotators") | ||
| reviewers = None | ||
| if isinstance(ann, dict): | ||
| reviewers = {a: label([str(k) for k in ann[a]["chosen_sense_keys"]], m) for a in ANNOTATORS} | ||
| L = item["labels"] | ||
| return { | ||
| "candidate_concept_ids": sorted({r["concept_id"] for r in rows}), | ||
| "candidate_sense_mappings": rows, | ||
| "labels": { | ||
| "lexen_gold": label(L["lexen_gold"]["sense_keys"], m), | ||
| "maru2022": label(L["maru2022"]["sense_keys"], m), | ||
| "raganato_original": label(L["raganato_original"]["sense_keys"], m), | ||
| }, | ||
| "mapping_id": mapping_id, | ||
| "reviewers": reviewers, | ||
| "schema_version": schema_version, | ||
| "unmapped_prefix": UNMAPPED_PREFIX, | ||
| } |
There was a problem hiding this comment.
Guard against missing annotator keys in review block.
At line 66, the code accesses ann[a] for each annotator a in ANNOTATORS without checking whether the key exists in the ann dict. If a reviewer is missing from item["review"]["annotators"], a KeyError will be raised. While the code at line 65 checks isinstance(ann, dict), it does not validate that all expected annotators are present.
Consider either:
- Using
ann.get(a)and skipping missing annotators, or - Validating that all three annotators are present before processing.
Suggested fix: Use `.get()` with fallback
reviewers = None
if isinstance(ann, dict):
reviewers = {}
for a in ANNOTATORS:
if a in ann:
reviewers[a] = label([str(k) for k in ann[a]["chosen_sense_keys"]], m)From a styleguide review of the changed file (output of csi_layer.jsonl is byte-identical; verified by re-running and by the glite fidelity assert): - keyword-only args on record()/load_mapping() — removes the silent swap hazard on the two adjacent str params (mapping_id, schema_version) and the two Paths. - assert items is non-empty after load; positive assertion message. - encoding="utf-8" on both opens + read_text (was relying on platform default). - explicit types on module constants; rename single-letter locals (m->mapping, L->labels, it->item, ann->review_annotators, a->annotator/alias). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pre-existing flowmark line-wrapping debt (also fails on origin/main); the CSI PR does not touch these files, but the `verify` job's `flowmark --check README.md docs/` step blocks the PR until they are reformatted. Whitespace/wrapping only, no content change. Drop this commit if you prefer a separate cleanup PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Ships CSI (Coarse Sense Inventory; Lacerra et al., AAAI 2020; CC BY-NC-SA 4.0) as a second, public coarse-sense layer alongside the native in-record Glite layer — so coarse-sense results on lexEN-v1 are reproducible under a third-party inventory that no lexEN author controls. This directly answers reviewer concern that the coarse story depends on an author-controlled inventory.
What's added
coarsenings/csi/files/— forward map (wordnet_sense_key_to_csi_concept.jsonl, 7,783 keys), aliases, andmapping_coverage.json.coarsenings/csi/csi_layer.jsonl— per-item sidecar keyed byitem_id; each line is acsiblock in the same shape asitem["glite"]. Consumers join onitem_id.coarsenings/csi/{details.json,description.md}+coarsenings/README.md— provenance, license, composite-label reduction noted as a modification per ShareAlike.scripts/apply_csi.py— stdlib-only rebuilder; asserts structural parity against the existing Glite block before writing.LICENSE+CHANGELOG.md— CC BY-NC-SA 4.0 attribution; additive-only disclosure.Frozen-release guarantee
Additive only.
data/lexen-v1/{items.jsonl,dataset.json}andsources/manifest.jsonare byte-identical, the release content hash is unchanged, andlexen-verify-release --release lexen-v1still passes. No new dataset version — CSI lives outside the verified release surface in top-levelcoarsenings/.Coverage
CSI maps 79% of the 9,788 candidate sense keys but 97% of gold senses; uncovered keys fall back to an
unmapped:singleton, exactly like Glite.Cross-repo consistency
The map content is byte-identical (same SHA-256) to the CSI map vendored in the SenseBench leaderboard PR.
🤖 Generated with Claude Code