Skip to content

Add CSI coarse-sense layer as an add-on over frozen lexen-v1#1

Open
vassiliphilippov wants to merge 4 commits into
mainfrom
csi-coarsening-add-on
Open

Add CSI coarse-sense layer as an add-on over frozen lexen-v1#1
vassiliphilippov wants to merge 4 commits into
mainfrom
csi-coarsening-add-on

Conversation

@vassiliphilippov

Copy link
Copy Markdown
Contributor

Summary

Ships CSI (Coarse Sense Inventory; Lacerra et al., AAAI 2020; CC BY-NC-SA 4.0) as a second, public coarse-sense layer alongside the native in-record Glite layer — so coarse-sense results on lexEN-v1 are reproducible under a third-party inventory that no lexEN author controls. This directly answers reviewer concern that the coarse story depends on an author-controlled inventory.

What's added

  • coarsenings/csi/files/ — forward map (wordnet_sense_key_to_csi_concept.jsonl, 7,783 keys), aliases, and mapping_coverage.json.
  • coarsenings/csi/csi_layer.jsonl — per-item sidecar keyed by item_id; each line is a csi block in the same shape as item["glite"]. Consumers join on item_id.
  • coarsenings/csi/{details.json,description.md} + coarsenings/README.md — provenance, license, composite-label reduction noted as a modification per ShareAlike.
  • scripts/apply_csi.py — stdlib-only rebuilder; asserts structural parity against the existing Glite block before writing.
  • LICENSE + CHANGELOG.md — CC BY-NC-SA 4.0 attribution; additive-only disclosure.

Frozen-release guarantee

Additive only. data/lexen-v1/{items.jsonl,dataset.json} and sources/manifest.json are byte-identical, the release content hash is unchanged, and lexen-verify-release --release lexen-v1 still passes. No new dataset version — CSI lives outside the verified release surface in top-level coarsenings/.

Coverage

CSI maps 79% of the 9,788 candidate sense keys but 97% of gold senses; uncovered keys fall back to an unmapped: singleton, exactly like Glite.

Cross-repo consistency

The map content is byte-identical (same SHA-256) to the CSI map vendored in the SenseBench leaderboard PR.

🤖 Generated with Claude Code

Ships CSI (Coarse Sense Inventory; Lacerra et al., AAAI 2020, CC BY-NC-SA 4.0)
as a second, public coarse-sense layer alongside the native in-record Glite
layer, so coarse-sense results are reproducible under a third-party inventory
no lexEN author controls.

- coarsenings/csi/: forward map (sense_key -> CSI composite concept), aliases,
  coverage summary, details.json, description.md.
- coarsenings/csi/csi_layer.jsonl: per-item sidecar (keyed by item_id), the
  `csi` block in the same shape as item["glite"]; join on item_id.
- scripts/apply_csi.py: rebuilds the sidecar (stdlib only) and asserts it
  reproduces the existing Glite block byte-for-value first (structural parity).
- LICENSE + CHANGELOG: CC BY-NC-SA 4.0 attribution; additive-only note.

Additive only: data/lexen-v1/{items.jsonl,dataset.json} and
sources/manifest.json are unchanged (byte-identical), the release content hash
is unchanged, and lexen-verify-release --release lexen-v1 still passes. No new
dataset version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@vassiliphilippov, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 10 minutes and 54 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2c870be1-cb57-4606-b825-39398a8412ae

📥 Commits

Reviewing files that changed from the base of the PR and between b45dc79 and 15a39c1.

📒 Files selected for processing (3)
  • README.md
  • docs/selection.md
  • scripts/apply_csi.py
📝 Walkthrough

Walkthrough

This pull request adds a CSI (Coarse Sense Inventory) coarsening layer as an additive sidecar to the frozen lexen-v1 release. New artifacts under coarsenings/csi/ include details.json (dataset manifest with checksummed file references), files/csi_report_aliases.json (sense-key alias mappings), and files/mapping_coverage.json (aggregate coverage statistics plus a full enumeration of unmapped required sense keys). A new scripts/apply_csi.py script builds the coarsenings/csi/csi_layer.jsonl sidecar, first asserting byte-for-byte reproduction of each item's existing Glite block before emitting CSI output. Documentation is added via coarsenings/README.md, coarsenings/csi/description.md, an extended LICENSE section, and a CHANGELOG.md entry. No existing lexen-v1 artifacts are modified.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description check ✅ Passed The description clearly explains the CSI layer addition, its purpose, coverage, and frozen-release guarantee, directly addressing the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Destructive Migrations ✅ Passed The custom check targets Django migrations (destructive schema changes, backward compatibility, locks, etc.), but this PR contains no Django code, migration files, or database schema changes—only d...
Mobile Localization String Quality Review ✅ Passed Mobile localization string check is not applicable: no mobile/ directory exists in the repository and no mobile files are modified in this PR, which only updates lexen-v1 dataset documentation and...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

CI `verify` runs ruff check, ruff format --check, and mypy src scripts tests:
- E741/SIM115/PTH123/E501: read items via `with ITEMS.open()` (rename l->line),
  write the sidecar via `SIDECAR.open("w")`, wrap the final print.
- ruff format: reflow load_mapping/csi calls and the print.
- mypy [type-arg]: annotate the JSON-record helpers (local `type JsonObject =
  dict[str, Any]`, matching src/lexen/models.py; precise dict types on label /
  cand_mappings). Style only -- generated csi_layer.jsonl unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
scripts/apply_csi.py (1)

30-41: 💤 Low value

Ensure map file existence is enforced.

The load_mapping() function silently opens map_path without existence check or error handling. If the file is missing or inaccessible, it will raise an unhandled FileNotFoundError at line 32. While this may be intentional (fail-fast), consider adding an explicit check with a clear error message, especially since the alias path is explicitly checked at line 38.

Optional: Add explicit error handling
def load_mapping(map_path: Path, alias_path: Path) -> dict[str, str]:
    if not map_path.exists():
        raise FileNotFoundError(f"Mapping file not found: {map_path}")
    mapping: dict[str, str] = {}
    with map_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                row = json.loads(line)
                mapping[row["sense_key"]] = row["concept_id"]
    if alias_path.exists():
        for a in json.loads(alias_path.read_text())["aliases"]:
            mapping[a["source_sense_key"]] = a["concept_id"]
    return mapping

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 300fca9d-a321-47f9-b971-7a1f0f872d05

📥 Commits

Reviewing files that changed from the base of the PR and between ce8956e and b45dc79.

📒 Files selected for processing (10)
  • CHANGELOG.md
  • LICENSE
  • coarsenings/README.md
  • coarsenings/csi/csi_layer.jsonl
  • coarsenings/csi/description.md
  • coarsenings/csi/details.json
  • coarsenings/csi/files/csi_report_aliases.json
  • coarsenings/csi/files/mapping_coverage.json
  • coarsenings/csi/files/wordnet_sense_key_to_csi_concept.jsonl
  • scripts/apply_csi.py
📜 Review details
🧰 Additional context used
🪛 ast-grep (0.43.0)
scripts/apply_csi.py

[info] 107-107: use jsonify instead of json.dumps for JSON output
Context: json.dumps({"item_id": it["item_id"], "csi": block})
Note: Security best practice.

(use-jsonify)

🔇 Additional comments (2)
scripts/apply_csi.py (1)

85-99: Fidelity regression logic is excellent; assert message is clear.

The byte-for-byte reproduction check (lines 92–99) is a robust safeguard that proves the CSI builder uses the same structural logic as the release-time Glite coarsener. The logic is sound:

  • Rebuilds glite block from mapping
  • Compares with stored glite block
  • Fails if any mismatch, with up-to-3 sample IDs for debugging

This is exactly the kind of verification needed for a frozen-release additive layer. No concerns.

coarsenings/csi/files/mapping_coverage.json (1)

12-2067: No verification issue found; the mapping_coverage.json metadata is consistent.

The missing_required_sense_keys array contains exactly 2,050 items, which correctly matches the calculated difference: 9,833 required keys minus 7,783 mapped keys = 2,050 unmapped keys. The metadata fields are accurate and internally consistent. No discrepancy exists.

			> Likely an incorrect or invalid review comment.

Comment on lines +15 to +17
This layer is an **add-on resource over the frozen `lexen-v1` release**: it does not modify
`data/lexen-v1/items.jsonl` or the release content hash. Per-item CSI labels are shipped as the
sidecar `data/lexen-v1/csi_layer.jsonl`, keyed by `item_id`, mirroring the in-record `glite` block.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix the CSI sidecar output path in description.md.

Line 17 states the sidecar is shipped at data/lexen-v1/csi_layer.jsonl, but according to the actual builder script (scripts/apply_csi.py line 22) and coarsenings/README.md line 24, the sidecar path is coarsenings/csi/csi_layer.jsonl. This path mismatch will confuse users who consult the documentation.

Update line 17 to specify the correct path: coarsenings/csi/csi_layer.jsonl.

Suggested fix
-Per-item CSI labels are shipped as the
-sidecar `data/lexen-v1/csi_layer.jsonl`, keyed by `item_id`, mirroring the in-record `glite` block.
+Per-item CSI labels are shipped as the
+sidecar `coarsenings/csi/csi_layer.jsonl`, keyed by `item_id`, mirroring the in-record `glite` block.

Comment thread scripts/apply_csi.py Outdated
Comment on lines +60 to +80
def record(item: dict, m: dict[str, str], mapping_id: str, schema_version: str) -> dict:
rows = cand_mappings(item["candidate_sense_keys"], m)
review = item.get("review") or {}
ann = review.get("annotators")
reviewers = None
if isinstance(ann, dict):
reviewers = {a: label([str(k) for k in ann[a]["chosen_sense_keys"]], m) for a in ANNOTATORS}
L = item["labels"]
return {
"candidate_concept_ids": sorted({r["concept_id"] for r in rows}),
"candidate_sense_mappings": rows,
"labels": {
"lexen_gold": label(L["lexen_gold"]["sense_keys"], m),
"maru2022": label(L["maru2022"]["sense_keys"], m),
"raganato_original": label(L["raganato_original"]["sense_keys"], m),
},
"mapping_id": mapping_id,
"reviewers": reviewers,
"schema_version": schema_version,
"unmapped_prefix": UNMAPPED_PREFIX,
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard against missing annotator keys in review block.

At line 66, the code accesses ann[a] for each annotator a in ANNOTATORS without checking whether the key exists in the ann dict. If a reviewer is missing from item["review"]["annotators"], a KeyError will be raised. While the code at line 65 checks isinstance(ann, dict), it does not validate that all expected annotators are present.

Consider either:

  1. Using ann.get(a) and skipping missing annotators, or
  2. Validating that all three annotators are present before processing.
Suggested fix: Use `.get()` with fallback
reviewers = None
if isinstance(ann, dict):
    reviewers = {}
    for a in ANNOTATORS:
        if a in ann:
            reviewers[a] = label([str(k) for k in ann[a]["chosen_sense_keys"]], m)

vassiliphilippov and others added 2 commits June 17, 2026 08:34
From a styleguide review of the changed file (output of csi_layer.jsonl is
byte-identical; verified by re-running and by the glite fidelity assert):
- keyword-only args on record()/load_mapping() — removes the silent swap hazard
  on the two adjacent str params (mapping_id, schema_version) and the two Paths.
- assert items is non-empty after load; positive assertion message.
- encoding="utf-8" on both opens + read_text (was relying on platform default).
- explicit types on module constants; rename single-letter locals (m->mapping,
  L->labels, it->item, ann->review_annotators, a->annotator/alias).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pre-existing flowmark line-wrapping debt (also fails on origin/main); the CSI PR
does not touch these files, but the `verify` job's `flowmark --check README.md
docs/` step blocks the PR until they are reformatted. Whitespace/wrapping only,
no content change. Drop this commit if you prefer a separate cleanup PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant