Add Li2026 (Triple-N) macaque NSD neural benchmark #2412

Open
KartikP wants to merge 14 commits into master from kp/li2026-benchmark

KartikP (Collaborator) commented Jun 15, 2026

Summary

Adds the Li2026 (Triple-N) vision benchmark: large-scale macaque Neuropixels recordings (5 monkeys, 90 sessions) of responses to the 1,000 NSD Shared1000 images (the same images as Allen2022), from Li, Bao et al., Nature Neuroscience 2026 (doi:10.1038/s41593-026-02322-z).

What's added

  • Data plugin brainscore_vision/data/li2026/: Li2026 stimulus set (1,000 NSD images, stimulus_id aligned to Allen2022), Li2026 static assembly, and Li2026.temporal assembly (10 ms bins, 0–300 ms). Reproducible raw→assembly build scripts under data_packaging/.
  • Benchmark plugin brainscore_vision/benchmarks/li2026/: 12 identifiers
    • Li2026.{V1,V2,V4,IT}-pls
    • Li2026.{V1,V2,V4,IT}-ridge
    • Li2026.{V1,V2,V4,IT}-temporal-pls
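For reference, the brace patterns above expand to the 12 identifiers as sketched below. This is only an illustration of the naming scheme; the helper name `li2026_identifiers` is hypothetical and not part of the plugin.

```python
from itertools import product

# Regions and metric variants taken from the identifier patterns in this PR.
REGIONS = ("V1", "V2", "V4", "IT")
VARIANTS = ("pls", "ridge", "temporal-pls")


def li2026_identifiers():
    """Expand Li2026.{region}-{variant}, grouped by variant and then by region."""
    return [
        f"Li2026.{region}-{variant}"
        for variant, region in product(VARIANTS, REGIONS)
    ]


identifiers = li2026_identifiers()
for identifier in identifiers:
    print(identifier)
```

Assuming the plugin is installed, each string would presumably be passed to `brainscore_vision.load_benchmark` in the usual way.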

Data

Neuropixels responses to natural NSD scenes from macaque IT (pooled) plus early visual areas V1/V2/V4. The static response is the mean firing rate in a fixed 70–170 ms post-onset window applied uniformly to every unit (the MajajHong2015 convention), so scores are comparable to other Brain-Score primate-IT benchmarks. Units are selected at split-half reliability > 0.4, measured in that same 70–170 ms window (reliability_window), which leaves IT 21,059 / V1 2,305 / V2 2,523 / V4 3,414 units. Regions are assigned from the dataset's anatomical area table; IT is pooled at session level, V1/V2/V4 at recording-position level.
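The selection rule can be sketched as follows. This is a toy illustration, not the plugin code: the record layout (a list of dicts with `region` and `reliability_window` keys) and the function name are assumptions made for the example.

```python
import math
from collections import Counter

RELIABILITY_THRESHOLD = 0.4  # strict ">" threshold stated in this PR


def count_reliable_units(units, threshold=RELIABILITY_THRESHOLD):
    """Count units per region whose window-matched reliability exceeds the threshold.

    Units with undefined (NaN) reliability are dropped explicitly.
    """
    counts = Counter()
    for unit in units:
        reliability = unit["reliability_window"]
        if math.isfinite(reliability) and reliability > threshold:
            counts[unit["region"]] += 1
    return dict(counts)


# Toy records (hypothetical values, not from the dataset).
toy_units = [
    {"region": "IT", "reliability_window": 0.72},
    {"region": "IT", "reliability_window": 0.40},  # not strictly above 0.4
    {"region": "V1", "reliability_window": float("nan")},
    {"region": "V4", "reliability_window": 0.41},
]
reliable_counts = count_reliable_units(toy_units)
print(reliable_counts)
```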

Ceiling

The noise ceiling is the median across neuroids of the split-half Spearman-Brown reliability computed in the 70–170 ms window (reliability_window), i.e. the reliability of the exact response being scored, with a bootstrap standard error. The paper's best-window reliability_best is retained as a provenance coord, but it is ~0.08 higher and would make the ceiling optimistic and the normalized scores systematically low, so it is intentionally not used as the ceiling. explained_variance is normalized as r²/ceiling, per the Brain-Score convention.
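A minimal standard-library sketch of this ceiling computation is given below. It assumes per-unit split-half responses are already available as plain lists; the function names, the number of bootstrap resamples, and the seed are illustrative choices, not values taken from the plugin.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Spearman-Brown correction of a split-half correlation: 2r / (1 + r)."""
    if r_half <= -1:
        return float("nan")  # correction is undefined at r = -1
    return 2 * r_half / (1 + r_half)


def ceiling_with_error(reliabilities, n_boot=1000, seed=0):
    """Median over units (ignoring NaNs) and bootstrap SE of that median."""
    finite = [r for r in reliabilities if math.isfinite(r)]
    rng = random.Random(seed)
    center = statistics.median(finite)
    boot_medians = [
        statistics.median(rng.choices(finite, k=len(finite)))
        for _ in range(n_boot)
    ]
    return center, statistics.stdev(boot_medians)


def ceiled_score(raw_r, ceiling):
    """Normalize a raw correlation as r^2 / ceiling."""
    return raw_r ** 2 / ceiling


# Toy example: one unit's responses to four images in two trial halves.
r_half = pearson([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
unit_reliability = spearman_brown(r_half)
center, error = ceiling_with_error([0.5, 0.7, float("nan"), 0.9])
print(unit_reliability, center, error, ceiled_score(0.6, 0.72))
```

Because the ceiling is the denominator, a ceiling that is too high for the scored response lowers every normalized score, which is why the window-matched reliability is used here.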

Validation

data_packaging/notebooks/li2026_validation.ipynb regenerates the paper's figures from the packaged assemblies (load_dataset/load_stimulus_set, not the raw files), proving the S3 round-trip is lossless and the packaged metadata is sufficient:

  • Fig 1f reliability by region: reliable-unit counts match the paper exactly (V1 2,556 / V2 2,625 / V4 3,559 / IT 26,700 at the paper's best window)
  • Fig 2f cross-area similarity: within-category r 0.38 ≫ cross-category 0.03
  • Fig 3d response-type clusters: early areas fast-transient, IT later/sustained
  • Fig 3e population RSA over time: structured temporal evolution
  • Fig 5 AlexNet vs MPNet encoding: visual features predict IT better than semantic (AlexNet 0.32 > MPNet 0.22; LVR 0.91 < 1), matching the paper

Fig 2g (trial-noise covariance) is intentionally not reproduced — it needs trial-resolved data, which this trial-averaged package does not carry.

Notes

  • Both pls and ridge static variants are registered; happy to converge on a single headline metric per maintainer preference.
  • Stimuli reuse the NSD Shared1000 images, with stimulus_id matching Allen2022 for cross-dataset comparison.

KartikP added 14 commits June 13, 2026 13:15
Stimulus set + NeuronRecordingAssembly for 5-macaque Neuropixels recordings on the 1000 NSD shared images; stimulus_id aligned to Allen2022.
8 neural benchmarks with non-stratified CV and reliability-based ceiling; tests pass.
Per-image PSTHs rebinned to 10ms bins (0-300ms), 85 sessions; spantime-pls metric with reliability-based ceiling. Tests pass.
Re-ran temporal build on all 90 GoodUnit files (was 85; 4 truncated downloads skipped). V4 restored 1119->3559; region labels now match static (IT 26700, V1 2556, V2 2625, V4 3559). New assembly sha1/version. Tests assert restored counts.
Ceiling now carries attrs['error'] (bootstrap SE of the median over
neuroids) so BenchmarkInstance.ceiling_error is populated; center via
nanmedian. Add url to the bibtex so the benchmark reference is created.
test_ceiling asserts the scalar/error/raw DB-write contract.
build_li2026_static.py and build_li2026_temporal.py reconstruct the
assemblies from the raw ScienceDB release (Processed/ + GoodUnit/ +
exclude_area.xls), documenting the region join, the GoodUnit<->Processed
(date, unit-count)+spikepos matching, the best-window/PSTH extraction, and
the Allen2022-aligned stimulus mapping. The package_* scripts upload their
output. The static build reproduces Li2026_stimulus_set.csv exactly.
li2026_validation.ipynb independently reproduces the paper's
dataset-statistics figures (1f/2f/2g/3d/3e) and the AlexNet-vs-MPNet
encoding result from the packaged data (no brainscore), confirming the
packaging and encoding methodology. README embeds the figures with a
paper-vs-reproduction comparison.
Reworked li2026_validation.ipynb to regenerate every figure from
load_dataset('Li2026'/'Li2026.temporal') and load_stimulus_set('Li2026')
-- the S3 assemblies, not the raw .mat files. Reproducing the paper's
figures from the packaged data proves the round-trip is lossless and the
assembly carries sufficient metadata. Results match the raw-data run
exactly (IT counts 26700, 2f 0.38/0.02, 3e 0.89/0.26, AlexNet IT 0.33,
LVR 0.93). Dropped Fig 2g (needs trial-resolved data this package omits)
and the 1f median NaN (now nanmedian).
matplotlib boxplot renders nothing for an array containing any NaN, so
V1/V4/IT (which have a few units with undefined reliability) showed no
box while V2 (no NaNs) did. Filter non-finite values before plotting;
figure regenerated from the packaged static assembly. All four regions
now render (V1 0.70, V2 0.80, V4 0.80, IT 0.60).
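The non-finite filter described in this commit can be sketched as below. The dictionary layout and values are hypothetical; the plotting call is left as a comment so the snippet runs without matplotlib.

```python
import math


def finite_only(values):
    """Drop NaN and +/-inf so each region's values contain only finite numbers."""
    return [v for v in values if math.isfinite(v)]


# Toy per-region reliabilities (hypothetical values, not from the dataset).
reliability_by_region = {
    "V1": [0.70, float("nan"), 0.65],
    "V2": [0.80, 0.82],
}
clean = {
    region: finite_only(values)
    for region, values in reliability_by_region.items()
}
print(clean)

# With matplotlib available, the cleaned lists could then be plotted, e.g.:
#   fig, ax = plt.subplots()
#   ax.boxplot(list(clean.values()))
#   ax.set_xticklabels(list(clean))
```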
…elationship

What's packaged, where the coords come from, how Triple-N's 1-based
tn_index relates to NSD's sharedix and to Allen2022's stimulus_id, and
the v1->v2->v3 sequencing plan. Captures the cross-study comparison
story (Allen2022/Hebart2023 share image identifiers) so future
maintainers don't have to rederive it from the packaging scripts.
…re parity

The original static assembly was the upstream response_best matrix --
mean rate in each unit's individually-best window. That choice is
appropriate for Li et al.'s unit-characterization analyses (reliability,
SNR, selectivity) and is cross-validated for those, but it is not the
convention Brain-Score's other primate-IT static benchmarks use
(MajajHong2015 et al. all apply a single fixed 70-170 ms window
uniformly to every unit). Mixing the two on the same leaderboard makes
Li2026 scores non-comparable with MajajHong/Sanghavi/etc and likely
inflates them via per-unit window selection bias.

The paper itself falls back to fixed/binned windows for its cross-
population analyses (Fig 4 RSA uses 20 ms peak-aligned bins; Fig 5
encoding models use a fixed peak time lag), so this rebuild is
consistent with how the paper handles cross-population questions.

Pipeline:
* build_li2026_static_70_170ms.py averages the temporal assembly over
  the ten 10-ms bins covering 70-170 ms post-onset, preserves all
  neuroid + presentation coords, tags time_bin_start/end on the single
  output bin.
* package_li2026_static_70_170ms.py uploads the new file to S3.
* __init__.py now points Li2026 at the new version_id/sha1; the
  Li2026.temporal entry is unchanged.
* README documents the methodology change and updates the sequencing
  table (v2 is now shipped; v1 best-window kept only as reference).
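The averaging step in the first bullet can be illustrated with a small sketch. It assumes bins are labelled by their start time (0, 10, ..., 290 ms) and are half-open, which is an assumption about the layout rather than something stated here; the real script operates on the packaged assembly, not plain lists.

```python
BIN_MS = 10  # temporal assembly bin width stated in this PR


def window_bins(bin_starts, start_ms=70, end_ms=170, bin_ms=BIN_MS):
    """Indices of bins that lie fully inside [start_ms, end_ms)."""
    return [
        i for i, t in enumerate(bin_starts)
        if t >= start_ms and t + bin_ms <= end_ms
    ]


def mean_rate_in_window(psth, bin_starts, start_ms=70, end_ms=170):
    """Average one unit's binned response over the fixed window."""
    idx = window_bins(bin_starts, start_ms, end_ms)
    return sum(psth[i] for i in idx) / len(idx)


# Toy PSTH for one unit and one image: the rate simply equals the bin start time.
bin_starts = list(range(0, 300, BIN_MS))
psth = [float(t) for t in bin_starts]

selected = window_bins(bin_starts)
static_response = mean_rate_in_window(psth, bin_starts)
print(len(selected), static_response)
```

Applying the same window to every unit is what distinguishes this static response from the per-unit best-window matrix.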

Reliable-IT n=26700 unchanged (unit selection still on reliability_best
> 0.4). Round-trip verified end-to-end from S3.
…tatic

Fixes two regressions from the fixed-window static rebuild:

1. Ceiling now matches the scored response. Recompute split-half SB
   reliability AT 70-170ms from raw rasters (build_li2026_reliability_70_170ms.py);
   the best-window reliability_best was ~0.08 higher, making the ceiling
   optimistic and scores systematically low. The static assembly now carries
   reliability_window (70-170ms) alongside reliability (best-window, kept for
   paper provenance); benchmark.py selects and ceils on reliability_window,
   matching MajajHong/Allen2022. Static benchmark bumped to version=2 so the
   DB recomputes the ceiling and re-scores.

2. Restored arealabel/snr/best_time/tn_index, dropped when the static was
   derived from the temporal assembly (which lacks them); merged back by
   neuroid_id. Fixes the validation notebook's Fig 2f dependency.
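The merge-back in item 2 amounts to a keyed join on neuroid_id. A minimal sketch with plain dictionaries follows; the row layout, the toy values, and the function name are assumptions for illustration, and the real fix operates on assembly coords.

```python
def merge_by_neuroid_id(derived_rows, original_by_id, coords):
    """Copy the named coords from the original metadata onto each derived row.

    Raises KeyError if a derived unit has no entry in the original metadata,
    so a silent mismatch cannot slip through.
    """
    merged = []
    for row in derived_rows:
        source = original_by_id[row["neuroid_id"]]
        merged.append({**row, **{coord: source[coord] for coord in coords}})
    return merged


# Toy data (hypothetical ids and values, not from the dataset).
derived = [
    {"neuroid_id": "u1", "region": "IT"},
    {"neuroid_id": "u0", "region": "V4"},
]
original = {
    "u0": {"arealabel": "areaA", "snr": 1.8, "best_time": 95},
    "u1": {"arealabel": "areaB", "snr": 2.4, "best_time": 120},
}
restored = merge_by_neuroid_id(derived, original, ("arealabel", "snr", "best_time"))
print(restored)
```

Joining on the id rather than on row order keeps the result correct even if the derived assembly lists units in a different order.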

Window-matched reliable counts: IT 21059, V1 2305, V2 2523, V4 3414
(paper best-window IT 26700 preserved in the reliability coord). New static
version_id/sha1 in __init__.py; tests updated.
Re-ran the packaging-validation notebook on the v2 (fixed-window) static.
Fig 2f and Fig 5 now reflect the 70-170ms response: within/cross 0.38/0.03,
AlexNet IT 0.32 > MPNet IT 0.22, LVR 0.91 (visual > semantic holds). Fig 1f
(reliability, paper counts) and Fig 3d/3e (temporal) unchanged. README notes
the benchmark selects/ceils on reliability_window (IT 21059) while Fig 1f
reproduces the paper's best-window counts (26700).
Revert the version bump -- the benchmark has never been in the DB, so
version=1 is correct for the first/only BenchmarkInstance. Fix the README
to drop the version=2 note and clarify that the benchmark selects on the
70-170ms reliability_window (IT ~21.1k) while ~26.7k is the paper's
best-window count.