Add CSI coarse-sense layer as an add-on over frozen lexen-v1 by vassiliphilippov · Pull Request #1 · GliteTech/lexen

vassiliphilippov · 2026-06-16T21:29:40Z

Summary

Ships CSI (Coarse Sense Inventory; Lacerra et al., AAAI 2020; CC BY-NC-SA 4.0) as a second, public coarse-sense layer alongside the native in-record Glite layer — so coarse-sense results on lexEN-v1 are reproducible under a third-party inventory that no lexEN author controls. This directly answers reviewer concern that the coarse story depends on an author-controlled inventory.

What's added

coarsenings/csi/files/ — forward map (wordnet_sense_key_to_csi_concept.jsonl, 7,783 keys), aliases, and mapping_coverage.json.
coarsenings/csi/csi_layer.jsonl — per-item sidecar keyed by item_id; each line is a csi block in the same shape as item["glite"]. Consumers join on item_id.
coarsenings/csi/{details.json,description.md} + coarsenings/README.md — provenance, license, composite-label reduction noted as a modification per ShareAlike.
scripts/apply_csi.py — stdlib-only rebuilder; asserts structural parity against the existing Glite block before writing.
LICENSE + CHANGELOG.md — CC BY-NC-SA 4.0 attribution; additive-only disclosure.

Frozen-release guarantee

Additive only. data/lexen-v1/{items.jsonl,dataset.json} and sources/manifest.json are byte-identical, the release content hash is unchanged, and lexen-verify-release --release lexen-v1 still passes. No new dataset version — CSI lives outside the verified release surface in top-level coarsenings/.

Coverage

CSI maps 79% of the 9,788 candidate sense keys but 97% of gold senses; uncovered keys fall back to an unmapped: singleton, exactly like Glite.

Cross-repo consistency

The map content is byte-identical (same SHA-256) to the CSI map vendored in the SenseBench leaderboard PR.

🤖 Generated with Claude Code

Ships CSI (Coarse Sense Inventory; Lacerra et al., AAAI 2020, CC BY-NC-SA 4.0) as a second, public coarse-sense layer alongside the native in-record Glite layer, so coarse-sense results are reproducible under a third-party inventory no lexEN author controls. - coarsenings/csi/: forward map (sense_key -> CSI composite concept), aliases, coverage summary, details.json, description.md. - coarsenings/csi/csi_layer.jsonl: per-item sidecar (keyed by item_id), the `csi` block in the same shape as item["glite"]; join on item_id. - scripts/apply_csi.py: rebuilds the sidecar (stdlib only) and asserts it reproduces the existing Glite block byte-for-value first (structural parity). - LICENSE + CHANGELOG: CC BY-NC-SA 4.0 attribution; additive-only note. Additive only: data/lexen-v1/{items.jsonl,dataset.json} and sources/manifest.json are unchanged (byte-identical), the release content hash is unchanged, and lexen-verify-release --release lexen-v1 still passes. No new dataset version. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-16T21:30:13Z

Warning

Review limit reached

@vassiliphilippov, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 10 minutes and 54 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2c870be1-cb57-4606-b825-39398a8412ae

📥 Commits

Reviewing files that changed from the base of the PR and between b45dc79 and 15a39c1.

📒 Files selected for processing (3)

README.md
docs/selection.md
scripts/apply_csi.py

📝 Walkthrough

Walkthrough

This pull request adds a CSI (Coarse Sense Inventory) coarsening layer as an additive sidecar to the frozen lexen-v1 release. New artifacts under coarsenings/csi/ include details.json (dataset manifest with checksummed file references), files/csi_report_aliases.json (sense-key alias mappings), and files/mapping_coverage.json (aggregate coverage statistics plus a full enumeration of unmapped required sense keys). A new scripts/apply_csi.py script builds the coarsenings/csi/csi_layer.jsonl sidecar, first asserting byte-for-byte reproduction of each item's existing Glite block before emitting CSI output. Documentation is added via coarsenings/README.md, coarsenings/csi/description.md, an extended LICENSE section, and a CHANGELOG.md entry. No existing lexen-v1 artifacts are modified.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description clearly explains the CSI layer addition, its purpose, coverage, and frozen-release guarantee, directly addressing the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Destructive Migrations	✅ Passed	The custom check targets Django migrations (destructive schema changes, backward compatibility, locks, etc.), but this PR contains no Django code, migration files, or database schema changes—only d...
Mobile Localization String Quality Review	✅ Passed	Mobile localization string check is not applicable: no mobile/ directory exists in the repository and no mobile files are modified in this PR, which only updates lexen-v1 dataset documentation and...

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

CI `verify` runs ruff check, ruff format --check, and mypy src scripts tests: - E741/SIM115/PTH123/E501: read items via `with ITEMS.open()` (rename l->line), write the sidecar via `SIDECAR.open("w")`, wrap the final print. - ruff format: reflow load_mapping/csi calls and the print. - mypy [type-arg]: annotate the JSON-record helpers (local `type JsonObject = dict[str, Any]`, matching src/lexen/models.py; precise dict types on label / cand_mappings). Style only -- generated csi_layer.jsonl unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

scripts/apply_csi.py (1)
30-41: 💤 Low value

Ensure map file existence is enforced.

The load_mapping() function silently opens map_path without existence check or error handling. If the file is missing or inaccessible, it will raise an unhandled FileNotFoundError at line 32. While this may be intentional (fail-fast), consider adding an explicit check with a clear error message, especially since the alias path is explicitly checked at line 38.
Optional: Add explicit error handling
def load_mapping(map_path: Path, alias_path: Path) -> dict[str, str]:
    if not map_path.exists():
        raise FileNotFoundError(f"Mapping file not found: {map_path}")
    mapping: dict[str, str] = {}
    with map_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                row = json.loads(line)
                mapping[row["sense_key"]] = row["concept_id"]
    if alias_path.exists():
        for a in json.loads(alias_path.read_text())["aliases"]:
            mapping[a["source_sense_key"]] = a["concept_id"]
    return mapping

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 300fca9d-a321-47f9-b971-7a1f0f872d05

📥 Commits

Reviewing files that changed from the base of the PR and between ce8956e and b45dc79.

📒 Files selected for processing (10)

CHANGELOG.md
LICENSE
coarsenings/README.md
coarsenings/csi/csi_layer.jsonl
coarsenings/csi/description.md
coarsenings/csi/details.json
coarsenings/csi/files/csi_report_aliases.json
coarsenings/csi/files/mapping_coverage.json
coarsenings/csi/files/wordnet_sense_key_to_csi_concept.jsonl
scripts/apply_csi.py

📜 Review details

🧰 Additional context used

🪛 ast-grep (0.43.0)

scripts/apply_csi.py

[info] 107-107: use jsonify instead of json.dumps for JSON output
Context: json.dumps({"item_id": it["item_id"], "csi": block})
Note: Security best practice.

(use-jsonify)

🔇 Additional comments (2)

scripts/apply_csi.py (1)

85-99: Fidelity regression logic is excellent; assert message is clear.

The byte-for-byte reproduction check (lines 92–99) is a robust safeguard that proves the CSI builder uses the same structural logic as the release-time Glite coarsener. The logic is sound:

Rebuilds glite block from mapping

Compares with stored glite block

Fails if any mismatch, with up-to-3 sample IDs for debugging

This is exactly the kind of verification needed for a frozen-release additive layer. No concerns.
coarsenings/csi/files/mapping_coverage.json (1)
12-2067: No verification issue found; the mapping_coverage.json metadata is consistent.

The missing_required_sense_keys array contains exactly 2,050 items, which correctly matches the calculated difference: 9,833 required keys minus 7,783 mapped keys = 2,050 unmapped keys. The metadata fields are accurate and internally consistent. No discrepancy exists.
			> Likely an incorrect or invalid review comment.

coderabbitai · 2026-06-17T07:31:05Z

+This layer is an **add-on resource over the frozen `lexen-v1` release**: it does not modify
+`data/lexen-v1/items.jsonl` or the release content hash. Per-item CSI labels are shipped as the
+sidecar `data/lexen-v1/csi_layer.jsonl`, keyed by `item_id`, mirroring the in-record `glite` block.


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix the CSI sidecar output path in description.md.

Line 17 states the sidecar is shipped at data/lexen-v1/csi_layer.jsonl, but according to the actual builder script (scripts/apply_csi.py line 22) and coarsenings/README.md line 24, the sidecar path is coarsenings/csi/csi_layer.jsonl. This path mismatch will confuse users who consult the documentation.

Update line 17 to specify the correct path: coarsenings/csi/csi_layer.jsonl.

Suggested fix

-Per-item CSI labels are shipped as the -sidecar `data/lexen-v1/csi_layer.jsonl`, keyed by `item_id`, mirroring the in-record `glite` block. +Per-item CSI labels are shipped as the +sidecar `coarsenings/csi/csi_layer.jsonl`, keyed by `item_id`, mirroring the in-record `glite` block.

coderabbitai · 2026-06-17T07:31:05Z

+def record(item: dict, m: dict[str, str], mapping_id: str, schema_version: str) -> dict:
+    rows = cand_mappings(item["candidate_sense_keys"], m)
+    review = item.get("review") or {}
+    ann = review.get("annotators")
+    reviewers = None
+    if isinstance(ann, dict):
+        reviewers = {a: label([str(k) for k in ann[a]["chosen_sense_keys"]], m) for a in ANNOTATORS}
+    L = item["labels"]
+    return {
+        "candidate_concept_ids": sorted({r["concept_id"] for r in rows}),
+        "candidate_sense_mappings": rows,
+        "labels": {
+            "lexen_gold": label(L["lexen_gold"]["sense_keys"], m),
+            "maru2022": label(L["maru2022"]["sense_keys"], m),
+            "raganato_original": label(L["raganato_original"]["sense_keys"], m),
+        },
+        "mapping_id": mapping_id,
+        "reviewers": reviewers,
+        "schema_version": schema_version,
+        "unmapped_prefix": UNMAPPED_PREFIX,
+    }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard against missing annotator keys in review block.

At line 66, the code accesses ann[a] for each annotator a in ANNOTATORS without checking whether the key exists in the ann dict. If a reviewer is missing from item["review"]["annotators"], a KeyError will be raised. While the code at line 65 checks isinstance(ann, dict), it does not validate that all expected annotators are present.

Consider either:

Using ann.get(a) and skipping missing annotators, or

Validating that all three annotators are present before processing.

Suggested fix: Use `.get()` with fallback

reviewers = None if isinstance(ann, dict): reviewers = {} for a in ANNOTATORS: if a in ann: reviewers[a] = label([str(k) for k in ann[a]["chosen_sense_keys"]], m)

From a styleguide review of the changed file (output of csi_layer.jsonl is byte-identical; verified by re-running and by the glite fidelity assert): - keyword-only args on record()/load_mapping() — removes the silent swap hazard on the two adjacent str params (mapping_id, schema_version) and the two Paths. - assert items is non-empty after load; positive assertion message. - encoding="utf-8" on both opens + read_text (was relying on platform default). - explicit types on module constants; rename single-letter locals (m->mapping, L->labels, it->item, ann->review_annotators, a->annotator/alias). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Pre-existing flowmark line-wrapping debt (also fails on origin/main); the CSI PR does not touch these files, but the `verify` job's `flowmark --check README.md docs/` step blocks the PR until they are reformatted. Whitespace/wrapping only, no content change. Drop this commit if you prefer a separate cleanup PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vassiliphilippov force-pushed the csi-coarsening-add-on branch from b45dc79 to b54d9f7 Compare June 17, 2026 07:27

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

vassiliphilippov and others added 2 commits June 17, 2026 08:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add CSI coarse-sense layer as an add-on over frozen lexen-v1#1

Add CSI coarse-sense layer as an add-on over frozen lexen-v1#1
vassiliphilippov wants to merge 4 commits into
mainfrom
csi-coarsening-add-on

vassiliphilippov commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Review limit reached

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Uh oh!

coderabbitai Bot Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

vassiliphilippov commented Jun 16, 2026

Summary

What's added

Frozen-release guarantee

Coverage

Cross-repo consistency

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading