Add Li2026 (Triple-N) macaque NSD neural benchmark by KartikP · Pull Request #2412 · brain-score/vision

KartikP · 2026-06-15T18:16:30Z

Summary

Adds the Li2026 (Triple-N) vision benchmark: large-scale macaque Neuropixels recordings (5 monkeys, 90 sessions) of responses to the 1,000 NSD Shared1000 images (the same images as Allen2022), from Li, Bao et al., Nature Neuroscience 2026 (doi:10.1038/s41593-026-02322-z).

What's added

Data plugin brainscore_vision/data/li2026/: Li2026 stimulus set (1,000 NSD images, stimulus_id aligned to Allen2022), Li2026 static assembly, and Li2026.temporal assembly (10 ms bins, 0–300 ms). Reproducible raw→assembly build scripts under data_packaging/.
Benchmark plugin brainscore_vision/benchmarks/li2026/: 12 identifiers
- Li2026.{V1,V2,V4,IT}-pls
- Li2026.{V1,V2,V4,IT}-ridge
- Li2026.{V1,V2,V4,IT}-temporal-pls

Data

Macaque IT (pooled) plus early visual V1/V2/V4, Neuropixels responses to natural NSD scenes. The static response is the mean firing rate in a fixed 70–170 ms post-onset window applied uniformly to every unit (MajajHong2015 convention), so scores are comparable to other Brain-Score primate-IT benchmarks. Units are selected at split-half reliability > 0.4 measured at that same 70–170 ms window (reliability_window): IT 21,059 / V1 2,305 / V2 2,523 / V4 3,414. Regions are assigned from the dataset's anatomical area table; IT pooled at session level, V1/V2/V4 at recording-position level.
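As a rough illustration of the fixed-window response and the reliability threshold described above, here is a minimal sketch in plain Python. It assumes a per-image PSTH stored as 30 firing-rate values in 10 ms bins; the function names, unit ids, and toy values are hypothetical and are not taken from the plugin's code.

```python
# Sketch only: names and toy values are hypothetical, not the plugin's code.
BIN_MS = 10            # assumed bin width of the temporal assembly (0-300 ms)
WINDOW_MS = (70, 170)  # fixed post-onset window, treated as [start, end)


def window_mean(psth, window_ms=WINDOW_MS, bin_ms=BIN_MS):
    """Average the bins that fall inside the fixed window.

    `psth` is one unit's response to one image, where psth[k] covers
    [k * bin_ms, (k + 1) * bin_ms) ms after stimulus onset.
    """
    first = window_ms[0] // bin_ms   # bin starting at 70 ms -> index 7
    last = window_ms[1] // bin_ms    # exclusive end -> index 17
    selected = psth[first:last]      # ten 10 ms bins
    return sum(selected) / len(selected)


def select_reliable(reliability_by_unit, threshold=0.4):
    """Keep unit ids whose window-matched reliability exceeds the threshold.

    NaN reliabilities fail the `>` comparison and are therefore dropped.
    """
    return [unit for unit, r in reliability_by_unit.items() if r > threshold]


# Toy PSTH: 5 Hz baseline, 25 Hz inside the 70-170 ms window.
psth = [5.0] * 7 + [25.0] * 10 + [5.0] * 13
static_rate = window_mean(psth)

# Toy reliabilities: only u1 is strictly above 0.4.
kept = select_reliable({"u1": 0.62, "u2": 0.40, "u3": 0.12})
```

Applying the same window to every unit is what keeps the scores comparable across units; a per-unit best window would need an extra selection step that this sketch deliberately omits.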

Ceiling

The noise ceiling is the median across neuroids of the 70–170 ms split-half Spearman-Brown reliability (reliability_window) — the reliability of the exact response being scored — with a bootstrap standard error. The paper's best-window reliability_best is retained as a provenance coord but is ~0.08 higher, so it is intentionally not used as the ceiling. explained_variance normalizes r²/ceiling per the Brain-Score convention.
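The ceiling computation can be sketched as follows, using only the standard library. This is an illustrative reading of the description above, not the benchmark's implementation: the helper names, the number of bootstrap resamples, the seed, and the toy data are all assumptions, and `ceiling_normalized` applies the r²/ceiling formula literally as stated.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if degenerate)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half):
    """Correct a split-half correlation to full-length reliability."""
    return 2 * r_half / (1 + r_half)


def median_ceiling(reliabilities, n_boot=1000, seed=0):
    """Median over neuroids (NaNs ignored) plus a bootstrap SE of that median."""
    finite = [r for r in reliabilities if r == r]  # NaN != NaN, so NaNs drop out
    center = statistics.median(finite)
    rng = random.Random(seed)
    # Resample neuroids with replacement and recompute the median each time.
    boots = [
        statistics.median(rng.choices(finite, k=len(finite)))
        for _ in range(n_boot)
    ]
    return center, statistics.stdev(boots)


def ceiling_normalized(r, ceiling):
    """Normalize a raw correlation as r**2 / ceiling."""
    return r ** 2 / ceiling


# Toy unit: mean responses to four images in each half of the trials.
half_a = [1.0, 2.0, 3.0, 4.0]
half_b = [1.0, 3.0, 2.0, 4.0]
unit_reliability = spearman_brown(pearson(half_a, half_b))

center, error = median_ceiling([0.5, 0.7, float("nan"), 0.9])
```

Computing the reliability on the same 70–170 ms response that is scored is the point of the design: a ceiling taken from a different (higher-reliability) window would make normalized scores look systematically lower.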

Validation

data_packaging/notebooks/li2026_validation.ipynb regenerates the paper's figures from the packaged assemblies (load_dataset/load_stimulus_set, not the raw files), proving the S3 round-trip is lossless and the packaged metadata is sufficient:

- Fig 1f, reliability by region: reliable-unit counts match the paper exactly (V1 2,556 / V2 2,625 / V4 3,559 / IT 26,700 at the paper's best window)
- Fig 2f, cross-area similarity: within-category r = 0.38 ≫ cross-category r = 0.03
- Fig 3d, response-type clusters: early areas are fast-transient, IT is later and sustained
- Fig 3e, population RSA over time: structured temporal evolution
- Fig 5, AlexNet vs MPNet encoding: visual features predict IT better than semantic features (AlexNet 0.32 > MPNet 0.22; LVR 0.91 < 1), matching the paper

Fig 2g (trial-noise covariance) is intentionally not reproduced — it needs trial-resolved data, which this trial-averaged package does not carry.

Notes

Both pls and ridge static variants are registered; happy to converge on a single headline metric per maintainer preference.
Stimuli reuse the NSD Shared1000 images, with stimulus_id matching Allen2022 for cross-dataset comparison.

Stimulus set + NeuronRecordingAssembly for 5-macaque Neuropixels recordings on the 1000 NSD shared images; stimulus_id aligned to Allen2022.

8 neural benchmarks with non-stratified CV and reliability-based ceiling; tests pass.

Per-image PSTHs rebinned to 10ms bins (0-300ms), 85 sessions; spantime-pls metric with reliability-based ceiling. Tests pass.

Re-ran temporal build on all 90 GoodUnit files (was 85; 4 truncated downloads skipped). V4 restored 1119->3559; region labels now match static (IT 26700, V1 2556, V2 2625, V4 3559). New assembly sha1/version. Tests assert restored counts.

Ceiling now carries attrs['error'] (bootstrap SE of the median over neuroids) so BenchmarkInstance.ceiling_error is populated; center via nanmedian. Add url to the bibtex so the benchmark reference is created. test_ceiling asserts the scalar/error/raw DB-write contract.

build_li2026_static.py and build_li2026_temporal.py reconstruct the assemblies from the raw ScienceDB release (Processed/ + GoodUnit/ + exclude_area.xls), documenting the region join, the GoodUnit<->Processed (date, unit-count)+spikepos matching, the best-window/PSTH extraction, and the Allen2022-aligned stimulus mapping. The package_* scripts upload their output. The static build reproduces Li2026_stimulus_set.csv exactly.

li2026_validation.ipynb independently reproduces the paper's dataset-statistics figures (1f/2f/2g/3d/3e) and the AlexNet-vs-MPNet encoding result from the packaged data (no brainscore), confirming the packaging and encoding methodology. README embeds the figures with a paper-vs-reproduction comparison.

Reworked li2026_validation.ipynb to regenerate every figure from load_dataset('Li2026'/'Li2026.temporal') and load_stimulus_set('Li2026') -- the S3 assemblies, not the raw .mat files. Reproducing the paper's figures from the packaged data proves the round-trip is lossless and the assembly carries sufficient metadata. Results match the raw-data run exactly (IT counts 26700, 2f 0.38/0.02, 3e 0.89/0.26, AlexNet IT 0.33, LVR 0.93). Dropped Fig 2g (needs trial-resolved data this package omits) and the 1f median NaN (now nanmedian).

matplotlib boxplot renders nothing for an array containing any NaN, so V1/V4/IT (which have a few units with undefined reliability) showed no box while V2 (no NaNs) did. Filter non-finite values before plotting; figure regenerated from the packaged static assembly. All four regions now render (V1 0.70, V2 0.80, V4 0.80, IT 0.60).
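A minimal sketch of the filtering step, assuming per-region reliabilities held as plain lists; the helper name and toy values are hypothetical. The plotting call itself is left as a comment so the snippet runs without matplotlib installed.

```python
import math


def finite_only(values):
    """Drop NaN and +/-inf so each region's box is computed on real numbers."""
    return [v for v in values if math.isfinite(v)]


# Toy data: V1 has one unit with undefined reliability, V2 has none.
reliability_by_region = {
    "V1": [0.71, float("nan"), 0.66],
    "V2": [0.80, 0.82, 0.78],
}

cleaned = {
    region: finite_only(values)
    for region, values in reliability_by_region.items()
}

# The cleaned lists could then be handed to matplotlib, for example:
#   import matplotlib.pyplot as plt
#   plt.boxplot(list(cleaned.values()))
```

Filtering per region, rather than dropping whole units across all regions, keeps every finite value in the figure.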

…elationship

What's packaged, where the coords come from, how Triple-N's 1-based tn_index relates to NSD's sharedix and to Allen2022's stimulus_id, and the v1->v2->v3 sequencing plan. Captures the cross-study comparison story (Allen2022/Hebart2023 share image identifiers) so future maintainers don't have to rederive it from the packaging scripts.

…re parity

The original static assembly was the upstream response_best matrix -- mean rate in each unit's individually-best window. That choice is appropriate for Li et al.'s unit-characterization analyses (reliability, SNR, selectivity) and is cross-validated for those, but it is not the convention Brain-Score's other primate-IT static benchmarks use (MajajHong2015 et al. all apply a single fixed 70-170 ms window uniformly to every unit). Mixing the two on the same leaderboard makes Li2026 scores non-comparable with MajajHong/Sanghavi/etc and likely inflates them via per-unit window selection bias.

The paper itself falls back to fixed/binned windows for its cross-population analyses (Fig 4 RSA uses 20 ms peak-aligned bins; Fig 5 encoding models use a fixed peak time lag), so this rebuild is consistent with how the paper handles cross-population questions.

Pipeline:
- build_li2026_static_70_170ms.py averages the temporal assembly over the ten 10-ms bins covering 70-170 ms post-onset, preserves all neuroid + presentation coords, tags time_bin_start/end on the single output bin.
- package_li2026_static_70_170ms.py uploads the new file to S3.
- __init__.py now points Li2026 at the new version_id/sha1; the Li2026.temporal entry is unchanged.
- README documents the methodology change and updates the sequencing table (v2 is now shipped; v1 best-window kept only as reference).

Reliable-IT n=26700 unchanged (unit selection still on reliability_best > 0.4). Round-trip verified end-to-end from S3.

…tatic

Fixes two regressions from the fixed-window static rebuild:

1. Ceiling now matches the scored response. Recompute split-half SB reliability AT 70-170ms from raw rasters (build_li2026_reliability_70_170ms.py); the best-window reliability_best was ~0.08 higher, making the ceiling optimistic and scores systematically low. The static assembly now carries reliability_window (70-170ms) alongside reliability (best-window, kept for paper provenance); benchmark.py selects and ceils on reliability_window, matching MajajHong/Allen2022. Static benchmark bumped to version=2 so the DB recomputes the ceiling and re-scores.
2. Restored arealabel/snr/best_time/tn_index, dropped when the static was derived from the temporal assembly (which lacks them); merged back by neuroid_id. Fixes the validation notebook's Fig 2f dependency.

Window-matched reliable counts: IT 21059, V1 2305, V2 2523, V4 3414 (paper best-window IT 26700 preserved in the reliability coord). New static version_id/sha1 in __init__.py; tests updated.
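The merge by neuroid_id mentioned in the second fix could look roughly like the sketch below. It uses plain dictionaries rather than the actual assembly type, and the function name, record layout, and toy values are hypothetical; only the coord names (arealabel, snr, best_time, tn_index) come from the commit message.

```python
RESTORED_FIELDS = ("arealabel", "snr", "best_time", "tn_index")


def merge_back(derived_units, source_metadata, fields=RESTORED_FIELDS):
    """Re-attach per-unit metadata that a derived table lost, keyed by neuroid_id.

    Raises KeyError if a derived unit has no source record, so a silent
    mismatch between the two tables cannot slip through.
    """
    merged = []
    for unit in derived_units:
        extra = source_metadata[unit["neuroid_id"]]
        merged.append({**unit, **{field: extra[field] for field in fields}})
    return merged


# Toy records: the derived unit only carries its id and a response value.
derived = [{"neuroid_id": "n001", "rate": 12.5}]
source = {
    "n001": {"arealabel": "IT", "snr": 3.1, "best_time": 110, "tn_index": 1},
}

restored = merge_back(derived, source)
```

Joining on the identifier rather than on row order avoids misaligned metadata if the two tables were ever sorted differently.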

Re-ran the packaging-validation notebook on the v2 (fixed-window) static. Fig 2f and Fig 5 now reflect the 70-170ms response: within/cross 0.38/0.03, AlexNet IT 0.32 > MPNet IT 0.22, LVR 0.91 (visual > semantic holds). Fig 1f (reliability, paper counts) and Fig 3d/3e (temporal) unchanged. README notes the benchmark selects/ceils on reliability_window (IT 21059) while Fig 1f reproduces the paper's best-window counts (26700).

Revert the version bump -- the benchmark has never been in the DB, so version=1 is correct for the first/only BenchmarkInstance. Fix the README to drop the version=2 note and clarify that the benchmark selects on the 70-170ms reliability_window (IT ~21.1k) while ~26.7k is the paper's best-window count.

KartikP added 14 commits June 13, 2026 13:15

Add Li2026 (Triple-N) macaque NSD data plugin

f98bfd9

Add Li2026 benchmark plugin (V1/V2/V4/IT, pls + ridge)

a67537e

Add Li2026 temporal benchmark (V1/V2/V4/IT, spantime-pls)

aeefa29

KartikP wants to merge 14 commits into master from kp/li2026-benchmark
