Add Li2026 (Triple-N) macaque NSD neural benchmark #2412
Open
KartikP wants to merge 14 commits into
Stimulus set + NeuronRecordingAssembly for 5-macaque Neuropixels recordings on the 1000 NSD shared images; stimulus_id aligned to Allen2022.
8 neural benchmarks with non-stratified CV and reliability-based ceiling; tests pass.
Per-image PSTHs rebinned to 10ms bins (0-300ms), 85 sessions; spantime-pls metric with reliability-based ceiling. Tests pass.
Re-ran temporal build on all 90 GoodUnit files (was 85; 4 truncated downloads skipped). V4 restored 1119->3559; region labels now match static (IT 26700, V1 2556, V2 2625, V4 3559). New assembly sha1/version. Tests assert restored counts.
Ceiling now carries attrs['error'] (bootstrap SE of the median over neuroids) so BenchmarkInstance.ceiling_error is populated; center via nanmedian. Add url to the bibtex so the benchmark reference is created. test_ceiling asserts the scalar/error/raw DB-write contract.
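The ceiling estimate described in this commit (a NaN-aware median over neuroids with a bootstrap standard error) can be sketched as below. This is a stdlib-only illustration, not the benchmark's actual code: the function names, the number of bootstrap resamples, and the toy reliabilities are all assumptions made for the example.

```python
import math
import random
import statistics


def nanmedian(values):
    """Median of the non-NaN entries (a stdlib analogue of numpy.nanmedian)."""
    finite = [v for v in values if not math.isnan(v)]
    return statistics.median(finite) if finite else math.nan


def ceiling_with_error(reliabilities, n_boot=1000, seed=0):
    """Return (center, bootstrap_se) for the median reliability over neuroids."""
    rng = random.Random(seed)
    center = nanmedian(reliabilities)
    n = len(reliabilities)
    boot_medians = []
    for _ in range(n_boot):
        # Resample neuroids with replacement, then take the NaN-aware median.
        sample = [reliabilities[rng.randrange(n)] for _ in range(n)]
        boot_medians.append(nanmedian(sample))
    # Discard the (very unlikely) resamples that contained only NaN.
    boot_medians = [m for m in boot_medians if not math.isnan(m)]
    # The standard error is the spread of the bootstrap distribution.
    error = statistics.stdev(boot_medians)
    return center, error


# Made-up per-neuroid reliabilities, including one undefined unit.
toy = [0.55, 0.62, 0.48, math.nan, 0.71, 0.66, 0.59, 0.52]
center, error = ceiling_with_error(toy)
print(f"ceiling = {center:.3f} +/- {error:.3f}")
```

In the benchmark itself the error would be attached to the ceiling score (the commit mentions attrs['error']); here the two numbers are simply returned as a tuple.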
build_li2026_static.py and build_li2026_temporal.py reconstruct the assemblies from the raw ScienceDB release (Processed/ + GoodUnit/ + exclude_area.xls), documenting the region join, the GoodUnit<->Processed (date, unit-count)+spikepos matching, the best-window/PSTH extraction, and the Allen2022-aligned stimulus mapping. The package_* scripts upload their output. The static build reproduces Li2026_stimulus_set.csv exactly.
li2026_validation.ipynb independently reproduces the paper's dataset-statistics figures (1f/2f/2g/3d/3e) and the AlexNet-vs-MPNet encoding result from the packaged data (no brainscore), confirming the packaging and encoding methodology. README embeds the figures with a paper-vs-reproduction comparison.
Reworked li2026_validation.ipynb to regenerate every figure from
load_dataset('Li2026'/'Li2026.temporal') and load_stimulus_set('Li2026')
-- the S3 assemblies, not the raw .mat files. Reproducing the paper's
figures from the packaged data proves the round-trip is lossless and the
assembly carries sufficient metadata. Results match the raw-data run
exactly (IT counts 26700, 2f 0.38/0.02, 3e 0.89/0.26, AlexNet IT 0.33,
LVR 0.93). Dropped Fig 2g (needs trial-resolved data this package omits)
and the 1f median NaN (now nanmedian).
matplotlib boxplot renders nothing for an array containing any NaN, so V1/V4/IT (which have a few units with undefined reliability) showed no box while V2 (no NaNs) did. Filter non-finite values before plotting; figure regenerated from the packaged static assembly. All four regions now render (V1 0.70, V2 0.80, V4 0.80, IT 0.60).
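The fix described above amounts to dropping non-finite values per region before the data reach the plotting call. A minimal sketch, with made-up reliabilities and a hypothetical helper name (finite_only); the plotting call itself is left out so the snippet has no dependencies:

```python
import math


def finite_only(groups):
    """Drop NaN/inf from each group so that every box has data to draw."""
    return {
        name: [v for v in values if math.isfinite(v)]
        for name, values in groups.items()
    }


# Made-up reliabilities; V1 contains one undefined (NaN) unit.
groups = {
    "V1": [0.70, math.nan, 0.65, 0.80],
    "V2": [0.80, 0.75, 0.85],
}
cleaned = finite_only(groups)
# list(cleaned.values()) can now be handed to matplotlib's boxplot,
# with list(cleaned) supplying the region names for the tick labels.
```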
…elationship

Documents what's packaged, where the coords come from, how Triple-N's 1-based tn_index relates to NSD's sharedix and to Allen2022's stimulus_id, and the v1->v2->v3 sequencing plan. Captures the cross-study comparison story (Allen2022/Hebart2023 share image identifiers) so future maintainers don't have to rederive it from the packaging scripts.
…re parity

The original static assembly was the upstream response_best matrix -- mean rate in each unit's individually-best window. That choice is appropriate for Li et al.'s unit-characterization analyses (reliability, SNR, selectivity) and is cross-validated for those, but it is not the convention Brain-Score's other primate-IT static benchmarks use (MajajHong2015 et al. all apply a single fixed 70-170 ms window uniformly to every unit). Mixing the two on the same leaderboard makes Li2026 scores non-comparable with MajajHong/Sanghavi/etc and likely inflates them via per-unit window selection bias.

The paper itself falls back to fixed/binned windows for its cross-population analyses (Fig 4 RSA uses 20 ms peak-aligned bins; Fig 5 encoding models use a fixed peak time lag), so this rebuild is consistent with how the paper handles cross-population questions.

Pipeline:
* build_li2026_static_70_170ms.py averages the temporal assembly over the ten 10-ms bins covering 70-170 ms post-onset, preserves all neuroid + presentation coords, tags time_bin_start/end on the single output bin.
* package_li2026_static_70_170ms.py uploads the new file to S3.
* __init__.py now points Li2026 at the new version_id/sha1; the Li2026.temporal entry is unchanged.
* README documents the methodology change and updates the sequencing table (v2 is now shipped; v1 best-window kept only as reference).

Reliable-IT n=26700 unchanged (unit selection still on reliability_best > 0.4). Round-trip verified end-to-end from S3.
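The fixed-window averaging step in the pipeline above can be illustrated for a single unit and image. This sketch assumes bins are identified by their start time in ms and that a bin counts only if it lies entirely inside the window; the function name and the toy PSTH are hypothetical and do not come from build_li2026_static_70_170ms.py.

```python
def fixed_window_mean(psth, bin_starts, bin_width=10, window=(70, 170)):
    """Average the bins that lie fully inside [window[0], window[1]) ms.

    Returns the mean rate and the number of bins that were averaged.
    """
    lo, hi = window
    keep = [
        i for i, start in enumerate(bin_starts)
        if start >= lo and start + bin_width <= hi
    ]
    if not keep:
        raise ValueError("no bins fall inside the requested window")
    return sum(psth[i] for i in keep) / len(keep), len(keep)


# 30 bins of 10 ms covering 0-300 ms, matching the temporal assembly's binning.
bin_starts = list(range(0, 300, 10))
# Made-up PSTH: 5 spikes/s baseline, 50 spikes/s between 70 and 170 ms.
psth = [50.0 if 70 <= start < 170 else 5.0 for start in bin_starts]

rate, n_bins = fixed_window_mean(psth, bin_starts)
print(rate, n_bins)
```

Because the same window is applied to every unit, no per-unit window selection enters the response, which is the comparability argument made in the commit message.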
…tatic

Fixes two regressions from the fixed-window static rebuild:

1. Ceiling now matches the scored response. Recompute split-half SB reliability AT 70-170ms from raw rasters (build_li2026_reliability_70_170ms.py); the best-window reliability_best was ~0.08 higher, making the ceiling optimistic and scores systematically low. The static assembly now carries reliability_window (70-170ms) alongside reliability (best-window, kept for paper provenance); benchmark.py selects and ceils on reliability_window, matching MajajHong/Allen2022. Static benchmark bumped to version=2 so the DB recomputes the ceiling and re-scores.
2. Restored arealabel/snr/best_time/tn_index, dropped when the static was derived from the temporal assembly (which lacks them); merged back by neuroid_id. Fixes the validation notebook's Fig 2f dependency.

Window-matched reliable counts: IT 21059, V1 2305, V2 2523, V4 3414 (paper best-window IT 26700 preserved in the reliability coord). New static version_id/sha1 in __init__.py; tests updated.
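For readers unfamiliar with the split-half Spearman-Brown (SB) reliability named in item 1, here is a minimal sketch for one unit. It assumes a single random split of each image's trials, a Pearson correlation between the two half-averages across images, and the standard correction 2r / (1 + r); whether build_li2026_reliability_70_170ms.py uses one split or averages over many is not stated in the commit, and all names and data below are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r):
    """Correct a half-data correlation to the full-data reliability."""
    if math.isnan(r) or r <= -1:
        return math.nan
    return 2 * r / (1 + r)


def split_half_reliability(trials_per_image, seed=0):
    """trials_per_image: one list of trial rates (>= 2 trials) per image, for one unit."""
    rng = random.Random(seed)
    half_a, half_b = [], []
    for trials in trials_per_image:
        shuffled = list(trials)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half_a.append(sum(shuffled[:mid]) / mid)
        half_b.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    return spearman_brown(pearson(half_a, half_b))


# Made-up, noise-free unit: every trial of image i has rate i, so both halves agree.
noise_free = [[float(i)] * 4 for i in range(1, 9)]
reliability = split_half_reliability(noise_free)
selected = reliability > 0.4  # the selection threshold quoted in the commit message
```

Computing this on the 70-170 ms rates rather than on each unit's best window is what keeps the ceiling matched to the response being scored.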
Re-ran the packaging-validation notebook on the v2 (fixed-window) static. Fig 2f and Fig 5 now reflect the 70-170ms response: within/cross 0.38/0.03, AlexNet IT 0.32 > MPNet IT 0.22, LVR 0.91 (visual > semantic holds). Fig 1f (reliability, paper counts) and Fig 3d/3e (temporal) unchanged. README notes the benchmark selects/ceils on reliability_window (IT 21059) while Fig 1f reproduces the paper's best-window counts (26700).
Revert the version bump -- the benchmark has never been in the DB, so version=1 is correct for the first/only BenchmarkInstance. Fix the README to drop the version=2 note and clarify that the benchmark selects on the 70-170ms reliability_window (IT ~21.1k) while ~26.7k is the paper's best-window count.
Summary
Adds the Li2026 (Triple-N) vision benchmark: large-scale macaque Neuropixels recordings (5 monkeys, 90 sessions) of responses to the 1,000 NSD Shared1000 images (the same images as Allen2022), from Li, Bao et al., Nature Neuroscience 2026 (doi:10.1038/s41593-026-02322-z).
What's added
* brainscore_vision/data/li2026/: Li2026 stimulus set (1,000 NSD images, stimulus_id aligned to Allen2022), Li2026 static assembly, and Li2026.temporal assembly (10 ms bins, 0–300 ms). Reproducible raw→assembly build scripts under data_packaging/.
* brainscore_vision/benchmarks/li2026/: 12 identifiers
  * Li2026.{V1,V2,V4,IT}-pls
  * Li2026.{V1,V2,V4,IT}-ridge
  * Li2026.{V1,V2,V4,IT}-temporal-pls
Data
Macaque IT (pooled) plus early visual V1/V2/V4, Neuropixels responses to natural NSD scenes. The static response is the mean firing rate in a fixed 70–170 ms post-onset window applied uniformly to every unit (MajajHong2015 convention), so scores are comparable to other Brain-Score primate-IT benchmarks. Units are selected at split-half reliability > 0.4 measured at that same 70–170 ms window (reliability_window): IT 21,059 / V1 2,305 / V2 2,523 / V4 3,414. Regions are assigned from the dataset's anatomical area table; IT pooled at session level, V1/V2/V4 at recording-position level.
Ceiling
The noise ceiling is the median across neuroids of the 70–170 ms split-half Spearman-Brown reliability (reliability_window), i.e. the reliability of the exact response being scored, with a bootstrap standard error. The paper's best-window reliability_best is retained as a provenance coord but is ~0.08 higher, so it is intentionally not used as the ceiling. explained_variance normalizes r²/ceiling per the Brain-Score convention.
Validation
data_packaging/notebooks/li2026_validation.ipynb regenerates the paper's figures from the packaged assemblies (load_dataset/load_stimulus_set, not the raw files), proving the S3 round-trip is lossless and the packaged metadata is sufficient.
Fig 2g (trial-noise covariance) is intentionally not reproduced: it needs trial-resolved data, which this trial-averaged package does not carry.
Notes
* pls and ridge static variants are both registered; happy to converge on a single headline metric per maintainer preference.
* stimulus_id matches Allen2022, for cross-dataset comparison.