Add contributor + per-role operators to advanced search by bendichter · Pull Request #2822 · dandi/dandi-archive

bendichter · 2026-05-09T15:38:57Z

Stacked on top of #2821. Adds operators for filtering dandisets by their version-metadata contributors:

contributor: — catch-all (any role); matches by name/email/identifier
9 role-specific operators: author, contact_person, data_collector, data_curator, data_manager, maintainer, project_leader, funder, sponsor. Each matches a contributor by name/email/identifier AND requires that contributor to hold the corresponding dandischema.RoleType role.
affiliation: — queries the nested Person.affiliation[] field by organization name or ROR identifier. Structurally distinct from the role operators (see below).

Behavior

Query	Effect
`contributor:"Doe, Jane"`	dandisets where any contributor element matches "Doe, Jane" via name/email/identifier
`author:Doe`	dandisets where some contributor named Doe holds the Author role
`data_curator:Doe`	dandisets where some contributor named Doe holds the DataCurator role
`funder:NIH`	dandisets where some contributor named NIH holds the Funder role
`author:Doe funder:NIH`	dandisets where (some Doe is an Author) AND (some NIH is a Funder) — possibly different contributor entries
`contributor:0000-0002-2990-9889`	matches by ORCID
`contributor:01cwqze88`	matches the ROR URL `https://ror.org/01cwqze88` via substring
`data_curator:0000-0002-2990-9889`	composes role + ORCID — that specific person must hold the DataCurator role
`affiliation:Stanford`	dandisets with any contributor affiliated with Stanford University
`affiliation:00f54p054`	ROR ID substring (matches Stanford's full `https://ror.org/00f54p054` URL)
`author:Doe affiliation:Stanford`	composes — Doe is an Author AND someone is Stanford-affiliated, on the same version
`data_curatr:Doe`	400 with `Did you mean "data_curator"?`

Each role/contributor operator's lookup ORs across name, email, AND identifier (case-insensitive substring), so ORCIDs for Persons and ROR URLs for Organizations both work. Substring matching means bare-ID forms like 01cwqze88 work without typing the full URL prefix. Operator keys themselves are also case-insensitive (AUTHOR:Doe is the same as author:Doe).
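A minimal pure-Python sketch of the matching rule above, for illustration only (the real lookup runs in Postgres via jsonpath). The function name `contributor_matches` and the sample record are hypothetical, and comparing roles as an exact `dcite:`-prefixed string is a simplifying assumption:

```python
def contributor_matches(contributor, value, role=None):
    """Return True if `value` is a case-insensitive substring of the
    contributor's name, email, or identifier and, when `role` is given,
    the contributor also holds that role."""
    needle = value.lower()
    fields = [contributor.get(k) or "" for k in ("name", "email", "identifier")]
    if not any(needle in field.lower() for field in fields):
        return False
    if role is None:  # catch-all `contributor:` has no role constraint
        return True
    # Assumption: stored role names look like "dcite:Funder".
    return f"dcite:{role}" in contributor.get("roleName", [])


# Hypothetical Organization contributor
nih = {
    "name": "NIH",
    "identifier": "https://ror.org/01cwqze88",
    "roleName": ["dcite:Funder"],
}
print(contributor_matches(nih, "01cwqze88"))           # contributor:01cwqze88
print(contributor_matches(nih, "nih", role="Funder"))  # funder:nih
print(contributor_matches(nih, "nih", role="Author"))  # author:nih
```

The name match and the role check are evaluated against the same contributor element, which is what makes `data_curator:<ORCID>` mean "that specific person holds the role".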

The role list is deliberately concise — most RoleType values aren't operators users reach for. The catch-all contributor: still finds anyone in any role; the role-restricting shortcuts cover the common search intents (authors, curators, funders, etc.).

Affiliation is special

Affiliation exists in dandischema.RoleType, but in practice no real DANDI contributor uses dcite:Affiliation as a roleName. Affiliations live in a separate nested field — Person.affiliation[] — populated for ~all real Person contributors (dandiset 000409 alone has 40+ contributors each affiliated with their respective universities).

So affiliation: is not a per-role operator. It uses a dedicated jsonpath ($.contributor[*].affiliation[*]) that ORs across the affiliation's name and identifier. Implementation-wise it shares the same batch-AND-on-same-Version dispatch as the role operators, so it composes with them as expected.
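To show the shape of the resulting predicate, here is a hedged sketch that builds the jsonpath text in Python. This is not how the PR does it (the PR inlines the bound pattern at SQL time); `affiliation_jsonpath` is a hypothetical helper, and using `re.escape` to get a literal-substring pattern is an assumption about regex-flavor compatibility:

```python
import json
import re


def affiliation_jsonpath(value):
    """Build a jsonpath filter that matches `value` case-insensitively
    against an affiliation's name or identifier."""
    # like_regex takes its pattern as a string literal inside the jsonpath
    # text; json.dumps yields a double-quoted, escaped literal.
    pattern = json.dumps(re.escape(value))
    return (
        "$.contributor[*].affiliation[*] ? ("
        f'@.name like_regex {pattern} flag "i" || '
        f'@.identifier like_regex {pattern} flag "i")'
    )


print(affiliation_jsonpath("Stanford"))
```

The same filter covers both the organization name and the ROR URL, so `affiliation:00f54p054` matches through the `identifier` branch.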

Why per-role operators (Option D) over `contributor: + role:`

Considered combining a generic contributor: and a separate role: qualifier with same-element semantics, but per-role operators won on:

Composability — author:Doe funder:NIH cleanly means two independent constraints (no ambiguity about which role: applies to which contributor: when both repeat).
Consistency — mirrors Gmail's from:/to:/cc: and falls out of the same AND-across-keys semantics our other operators already have.
Precedent — independent role operators is a normal shape; arbitrary "qualifier composition" semantics aren't.

Implementation cost is the same either way (one dict + one jsonpath builder).

Implementation notes

A CONTRIBUTOR_ROLE_OPS dict in dandiapi/api/services/search/operators.py drives the parser allowlist + the role-based dispatch. Adding a future role is one new entry. Kept explicit (rather than auto-derived from dandischema.RoleType) so a schema-level rename or addition can't silently change our public search syntax — see discussion thread. A unit test (test_contributor_role_ops_match_actual_dandischema_roletype) pins each value against an actual RoleType.name so drift is caught explicitly.
The contributor + affiliation jsonpath builders use SQL-time concatenation (to_jsonb(%s::text)::text) to inline the bound regex into the jsonpath text, because Postgres' jsonpath like_regex requires its pattern as a string literal (not a $variable). Same shape as the asset operators.
_apply_contributor_filters() accumulates (where, params) pairs and chains them as Version.objects.extra(...) calls, so role-based and affiliation predicates share the same batch logic.
All contributor predicates in one query AND on the same Version.metadata, so a draft + published version with disjoint contributor lists never combine into a spurious match. Operators are ANDed within that one version, but each operator may match a different contributor element in that version's array.
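The same-Version rule in the last note can be modeled in a few lines of plain Python. This is a sketch under stated assumptions, not the Django/SQL implementation: `dandiset_matches` and the two predicates are hypothetical names, and each version is reduced to its contributor list.

```python
def dandiset_matches(versions, predicates):
    """`versions` holds one contributor list per Version of a dandiset.
    The dandiset matches only if a single version satisfies every predicate;
    within that version each predicate may hit a different contributor."""
    return any(
        all(any(pred(c) for c in contributors) for pred in predicates)
        for contributors in versions
    )


def is_author_doe(c):
    return "doe" in c["name"].lower() and "dcite:Author" in c["roleName"]


def is_funder_nih(c):
    return "nih" in c["name"].lower() and "dcite:Funder" in c["roleName"]


doe = {"name": "Doe, Jane", "roleName": ["dcite:Author"]}
nih = {"name": "NIH", "roleName": ["dcite:Funder"]}

same_version = [[doe, nih]]      # both contributors on one version
split_versions = [[doe], [nih]]  # draft and published lists are disjoint

print(dandiset_matches(same_version, [is_author_doe, is_funder_nih]))
print(dandiset_matches(split_versions, [is_author_doe, is_funder_nih]))
```

The second call is the spurious-match case the batch logic is meant to rule out.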

Files

dandiapi/api/services/search/operators.py — new pure-Python module with all operator vocabulary + dispatch tables; OPERATOR_KEYS is now derived as the union (no more duplication between parser and filter)
dandiapi/api/services/search/parser.py — imports OPERATOR_KEYS from operators.py; key matching is now case-insensitive
dandiapi/api/services/search/filters.py — _contributor_where(), _affiliation_where(), _apply_contributor_filters()
dandiapi/api/views/serializers.py — OpenAPI help text appended
web/src/components/DandisetSearchField.vue — popover entries for the catch-all, 6 common roles, and affiliation:
dandiapi/api/tests/test_dandiset.py — one consolidated test for catch-all/role/case-insensitivity/identifier/role-substring/composition (per @yarikoptic's "denser tests" preference); separate tests for affiliation (org name + ROR id + Person/Organization mix + composition) and the did-you-mean-on-typo path. Anonymous Doe placeholders.
dandiapi/api/tests/test_search_parser.py — drift-guard test against dandischema.RoleType.

Test plan

tox -e lint — clean
tox -e test -- dandiapi/api/tests/test_dandiset.py dandiapi/api/tests/test_search_parser.py -k "search_parser or advanced_search or contributor or affiliation" — 53 pass
Manual: curl 'http://localhost:8000/api/dandisets/?search=author:Doe', affiliation:Stanford, data_curator:0000-0002-2990-9889, etc.

Once #2821 merges, this PR's base will collapse to master automatically.

yarikoptic

my initial HI review (edit: actually it is "requesting changes")

Stacked on top of #2821 (which is itself stacked on #2814). Until those merge, ...

need to refresh your AI memory map of things -- #2814 is merged. For a number of "situations", I myself started to read/tune more of the AI output I give to folks for consumption and also "fold" some parts of the descriptions into <details>, like e.g. see freshish datalad/datalad#7852 ... on that end, started

dandi/dandi-infrastructure#278

yarikoptic · 2026-05-09T17:06:23Z

+    'visualization': 'Visualization',
+    'funder': 'Funder',
+    'sponsor': 'Sponsor',
+    'study_participant': 'StudyParticipant',


I itch with duplication allergy so badly: are those all duplicates of what we have in the dandi-schema mapped from snail_case to PythonCase?

❯ git grep -p -e ContactPerson -e Conceptualization -- dandischema/models.py
dandischema/models.py=class RoleType(Enum):
dandischema/models.py: #: Conceptualization
dandischema/models.py: Conceptualization = "dcite:Conceptualization"
dandischema/models.py: ContactPerson = "dcite:ContactPerson"
dandischema/models.py=class Contributor(DandiBaseModel):
dandischema/models.py: if role_names is not None and RoleType.ContactPerson in role_names:
dandischema/models.py=class Dandiset(CommonModel):
dandischema/models.py: if val.roleName and RoleType.ContactPerson in val.roleName:
dandischema/models.py: raise ValueError("At least one contributor must have role ContactPerson")

I feel like it should be programmatically mapped from that dandischema.models.RoleType!

Two duplications conflated here, taking them separately:

Within our code (parser allowlist + filter dispatch tables listing the same operator names): real, will fix. Plan is to move the dispatch tables into a small pure-Python module both parser.py and filters.py import from, then construct OPERATOR_KEYS as the union of those tables. Adding a new operator becomes one entry in the dispatch table, automatically known to the parser.

Between our code and dandischema.RoleType: I considered deriving _CONTRIBUTOR_ROLE_OPS programmatically from RoleType with an exclusion list, but I think the explicit allowlist is the right call here:

The dispatch dict is a translation (snake_case operator name → PascalCase role name). Even with auto-derivation, we'd still have an exclusion list and a snake-case transformer — it's the same lines moved into a for loop, not fewer.

API stability: a future PR that adds a new RoleType to dandi-schema shouldn't silently expand our public search syntax. With derivation, the new operator just appears with no review of the operator name, UX, or whether anyone wants it.

Renames are a footgun: if dandi-schema ever renames DataCurator → DataCuratorPerson, the auto-derivation would silently change data_curator: to data_curator_person: and break every saved/shared user query. The explicit allowlist catches that as a diff to review here.

What I'll add as a guard: a unit test that every value in _CONTRIBUTOR_ROLE_OPS is an actual RoleType.name — one assertion, catches typos against the schema without auto-tracking it.

yarikoptic · 2026-05-09T17:06:59Z

+    # Note: `affiliation` is intentionally NOT here. Despite `dcite:Affiliation`
+    # existing as a RoleType, in real DANDI data affiliations live in a
+    # separate nested field — `Person.affiliation[]` — not as a contributor's
+    # role. The `affiliation:` operator queries that nested path; see


can exclude explicitly selected few if needed

👍 see consolidated reply on the comment above — the exclusion list will live next to the dispatch dict (and a unit test will pin every entry against RoleType.name so drift is caught explicitly).

yarikoptic · 2026-05-09T17:20:32Z

+  { example: 'data_curator:Doe', description: 'Listed as a Data Curator' },
+  { example: 'funder:NIH', description: 'Listed as a Funder' },
+  { example: 'contact_person:Doe', description: 'Listed as the Contact Person' },
+  { example: 'maintainer:Doe', description: 'Listed as a Maintainer (and many more — see API docs)' },


"many more" is unclear here, of what more -- operators? or what it matches, or ...?

I feel that here we are getting into the land of "we need a UI" to support this variability. Here we do not even list all possible values I think from that earlier "duplicated" list, so how users would know? or is that the final list supported?

So I feel like we do need to figure out API to query for "available operators" and then "available values for the operator given current search query". Quick and dirty could be -- we could do smth like

having "?" as an "operator" to trigger error listing all known operators, and we just present to user (with some minimal formatting tune up). ideally should return structured record of e.g. {name:str : { description:str, available values: list[str] }} or alike

having "operator:?" returning similar record but now just for that operator.

In both cases the result would be constrained by the rest of the query, e.g. "neuropixels species:?" would give an error listing something like available 'species': "mus musculus", .... Then at least it becomes usable: a referral to the API docs is really not user-facing.

Then on top of that (in a different PR) we can bolt on a nice frontend UI with a dropdown for selecting the values or operators to use (which would inject "?" into the query and run it, then render the result in the UI)

Agreed the popover isn't the right surface for this many operators. Two-step plan:

This PR: replace the unhelpful "many more — see API docs" hand-wave with a real link to a docs page. The docs themselves will land in dandi/dandi-docs (separate repo); I'll add the page there and link from the popover here once it's published.

Future PR (your "?" idea): a discovery endpoint along the lines you sketched — ?search=? lists known operators; ?search=species:? lists candidate values, constrained by the rest of the query. That's a real improvement, but it's enough scope to deserve its own PR rather than getting bundled here.

On the user-doesn't-need-all-of-them point: agreed. 90%+ of contributor queries will be contributor: or author:. The 25 role-specific operators exist for the case where someone DOES want to find e.g. "dandisets with NIH as a Funder," and the docs page is the right place to enumerate them rather than dumping them in the popover.

yarikoptic · 2026-05-09T17:26:13Z

+        'funder',
+        'sponsor',
+        'study_participant',
+        'affiliation',


oy -- I think I saw those somewhere... duplication again? could be avoided?

Same root cause as the role-list duplication thread above. Plan: move the dispatch tables into a small pure-Python module both parser.py and filters.py import from, and construct OPERATOR_KEYS as the union of those tables — one source of truth in our codebase. Adding a new operator becomes one entry; the parser automatically knows about it.
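A sketch of the shape I have in mind. The table names and contents below are placeholders for illustration, not the final module:

```python
# Hypothetical dispatch tables; the values are illustrative only.
DATE_OPS = {"created_before": ("created", "lt"), "created_after": ("created", "gte")}
ASSET_OPS = {"species": "wasAttributedTo[].species.name", "approach": "approach[].name"}
CONTRIBUTOR_ROLE_OPS = {"contributor": None, "author": "Author", "funder": "Funder"}

# Single source of truth: the parser allowlist is derived, never hand-written.
# Iterating a dict yields its keys, so union() collects every operator name.
OPERATOR_KEYS = frozenset().union(DATE_OPS, ASSET_OPS, CONTRIBUTOR_ROLE_OPS)


def is_known_operator(key):
    """Parser-side check; operator keys are matched case-insensitively."""
    return key.lower() in OPERATOR_KEYS


print(sorted(OPERATOR_KEYS))
```

Adding an operator to any table makes it visible to the parser without touching a second list.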

The unquoted owner:me → current-user shortcut required threading a `quoted` flag through the parser and a `request_user` arg through the filter dispatch: non-trivial machinery to support one alias. Per dandi#2822 review discussion, removing it from this PR keeps the owner operator focused on literal lookup-by-value (username / email / first / last / "first last") and avoids the design debate about the right escape mechanism for "I literally want a user named Me." The alias can come back in a focused follow-up PR if/when there's appetite for it.

Concrete drops:
- owner:me magic + 400-on-anonymous in `_apply_owner_filter`
- `Operator.quoted` field on the parser dataclass
- `quoted` and `request_user` parameters on `_apply_owner_filter`
- `get_owned_dandisets` import (no longer used here)
- `test_advanced_search_owner_me_magic_and_literal_escape` test
- The two `owner-me-quoted` / `owner-me-unquoted` parser test cases
- "owner:me" mentions in OpenAPI help text and the popover entry

bendichter

Refreshed the description: dropped the stale "#2814 stacked" framing (since it merged) and folded the implementation-notes / files / test-plan sections into <details> per your suggestion. Inline replies on each thread are posted.

Two structural improvements + one product trim, in response to the review on dandi#2822:

1. New `dandiapi/api/services/search/operators.py` (pure Python, no Django) holds every operator-vocabulary constant: DATE_OPS, ASSET_OPS, OWNER_OPS, AFFILIATION_OPS, CONTRIBUTOR_ROLE_OPS, FILE_TYPE_ALIASES, ASSET_NAME_PATH_OPS, AFFILIATION_JSONPATH. OPERATOR_KEYS is now the union of those tables: single source of truth, no more duplication between parser.py (allowlist) and filters.py (dispatch). Adding a new operator is one entry; the parser automatically knows about it.

2. Trim the role-restricting shortcuts from 25 to 9. After review discussion: most RoleType values aren't operators users actually reach for (`conceptualization:`, `methodology:`, `validation:`, `visualization:`, etc.). Kept the ones that map to common search intents: contributor (catch-all), author, contact_person, data_collector, data_curator, data_manager, maintainer, project_lead, funder, sponsor. The catch-all `contributor:` still matches anyone in any role; only the role-restricting shortcuts are pruned. `project_lead:` is intentionally shorter than the schema name `ProjectLeader`.

3. Shrank the verbose docstrings on private filter helpers (the rationale stays in commit messages, not as documentation rot on internal API).

4. Added test_contributor_role_ops_match_actual_dandischema_roletype as a drift guard: every non-catch-all CONTRIBUTOR_ROLE_OPS value must be a real RoleType.name. Renames or removals on the schema side trip the test, forcing an explicit decision instead of silently changing public search syntax.

OpenAPI help text and the search popover updated to reflect the trimmed list (`project_lead`, `data_collector`, `data_manager`, `sponsor` now shown; the misleading "many more" tail removed).

Per dandi#2822 review discussion: the old semantics required all asset operators to be satisfied by a SINGLE asset, which meant `species:mouse species:rat` only matched dandisets with a multi-species recording (rare). The natural user reading is "the dandiset has mouse data AND has rat data"; those can be on different assets, and that's the common case for comparative-species dandisets.

Implementation: each asset operator now builds an independent AssetSearch subquery and the dandiset queryset is filtered with `id__in=...` per operator. Django generates one subquery per operator and AND's them at the dandiset level.

Cross-key likewise: `species:mouse approach:electrophysiological` now matches any dandiset that has SOME mouse asset AND SOME ephys asset, not just dandisets with a mouse-ephys asset.

Tests updated:
- `test_advanced_search_repeated_same_key_operator_combines_with_and` is now `..._combines_at_dandiset_level`, with a new fixture that has two separate assets (one mouse, one rat) to actually exercise the cross-asset case the old semantic excluded.
- `test_advanced_search_repeated_asset_operators_intersect` is now `test_advanced_search_asset_operators_combine_at_dandiset_level`, with a similar two-assets-split fixture that demonstrates the new inclusive behavior.

Contributor / affiliation semantics unchanged: those still AND on the same Version's metadata (since contributors live per-version, not per-asset). Within that single version, predicates can match different contributor[] entries.

bendichter · 2026-05-11T16:30:46Z

here is the doc page on this that would be added into dandi-docs

advanced-search.md

pasted (?) content from above: click to expand

Advanced Search

The dandiset list's search box accepts a Gmail/GitHub-style syntax that lets you mix
free-text terms with structured key:value operators. Filter by creation date,
species, file type, contributor, role, owner, and more — all from the same input.

Quick examples

neuropixels species:mouse created_after:2023-01-01
author:"Doe, Jane" funder:NIH
data_curator:"Smith, Alice" published_after:2024-01-01
contributor:0000-0002-2990-9889 standard:nwb
affiliation:Stanford

Operators combine with AND. Quoted phrases ("like this") are treated as a single
value. Anything you type without a key: prefix is full-text matched against the
dandiset metadata, the same way the original search box worked.

How operators combine

Operators describe the dandiset, not individual assets. Each operator is
an independent constraint at the dandiset level. species:mouse species:rat
returns dandisets that have at least one mouse asset AND at least one rat
asset — they can be the same asset (multi-species recording) or two
different assets (a comparative-species dandiset).
Free text + operators: ANDed together. place cells species:mouse
returns dandisets whose metadata contains "place" AND "cells" AND has at
least one mouse asset.
Multiple different operators: ANDed at the dandiset level. author:Doe funder:NIH returns dandisets where someone named Doe is an Author and
someone named NIH is a Funder. They can be different contributor entries.
species:mouse approach:electrophysiological returns dandisets that have
some mouse data AND some electrophysiology data — possibly on different
assets, possibly on the same one.
Quoting: wrap multi-word values in double quotes, e.g.
technique:"spike sorting". A whole token wrapped in quotes opts out of
operator parsing — "author:Doe" searches for the literal text author:Doe
rather than running the operator.
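As an illustration of the dandiset-level AND for asset operators, the sketch below models each operator as a set of matching dandiset ids and intersects those sets. The toy `DANDISETS` data, the flattened `species` field, and the helper names are all hypothetical; the archive itself evaluates this in the database.

```python
# Toy data: dandiset id -> list of assets (species name flattened for brevity).
DANDISETS = {
    "comparative": [{"species": "House mouse"}, {"species": "Brown rat"}],
    "mouse_only": [{"species": "House mouse"}],
}


def dandisets_with(key, value):
    """Ids of dandisets where at least one asset's `key` contains `value`."""
    needle = value.lower()
    return {
        dandiset_id
        for dandiset_id, assets in DANDISETS.items()
        if any(needle in asset.get(key, "").lower() for asset in assets)
    }


def search(operators):
    """AND the per-operator id sets together at the dandiset level."""
    result = set(DANDISETS)
    for key, value in operators:
        result &= dandisets_with(key, value)
    return result


print(search([("species", "mouse"), ("species", "rat")]))
```

Because each operator is resolved independently, the mouse asset and the rat asset do not have to be the same asset.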

Operator reference

Dates

All take an ISO date in the form YYYY-MM-DD. Bounds are exclusive on
_before and inclusive on _after.

Operator	What it filters
`created_before:YYYY-MM-DD`	Dandiset's `created` timestamp before the date
`created_after:YYYY-MM-DD`	Dandiset's `created` timestamp on/after the date
`modified_before:YYYY-MM-DD`	Most recent version's `modified` timestamp before the date
`modified_after:YYYY-MM-DD`	Most recent version's `modified` timestamp on/after the date
`published_before:YYYY-MM-DD`	Most recent published version's `created` timestamp before the date (draft-only dandisets are excluded)
`published_after:YYYY-MM-DD`	Most recent published version's `created` timestamp on/after the date

created_after:2024-01-01                    # everything created since 2024
modified_after:2025-01-01 modified_before:2026-01-01   # changed during 2025
published_after:2023-01-01                  # published since 2023
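The bound semantics can be sketched as follows. This is an illustrative model only: `in_range` is a hypothetical helper, and treating each date as midnight UTC is an assumption, not something this page specifies.

```python
from datetime import date, datetime, time, timezone


def in_range(ts, before=None, after=None):
    """`after` is inclusive (ts >= start of that day); `before` is
    exclusive (ts < start of that day). Days start at midnight UTC here."""
    def start_of_day(text):
        day = date.fromisoformat(text)  # raises ValueError on a malformed date
        return datetime.combine(day, time.min, tzinfo=timezone.utc)

    if after is not None and ts < start_of_day(after):
        return False
    if before is not None and ts >= start_of_day(before):
        return False
    return True


boundary = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(in_range(boundary, after="2025-01-01"))   # boundary is included by _after
print(in_range(boundary, before="2025-01-01"))  # boundary is excluded by _before
```

Combining `modified_after:2025-01-01 modified_before:2026-01-01` therefore selects a half-open interval covering all of 2025.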

Asset content

Substring matches (case-insensitive) against the dandiset's asset metadata.
A dandiset matches if at least one of its assets satisfies the predicate.
Multiple asset operators are AND'd at the dandiset level — each must be
satisfied by some asset, but not necessarily the same one. See
How operators combine above.

Operator	What it matches
`species:VALUE`	Substring against any `wasAttributedTo[].species.name`
`approach:VALUE`	Substring against any `approach[].name`
`technique:VALUE`	Substring against any `measurementTechnique[].name`
`standard:VALUE`	Substring against any `dataStandard[].name`
`file_type:VALUE`	`encodingFormat` startswith. Accepts the aliases `nwb`, `image`, `text`, `video`, or any MIME prefix (`application/x-nwb`, `image/`, ...)

species:mouse                          # any species name containing "mouse", e.g. House mouse
species:"Mus musculus"                 # exact-ish phrase match
approach:electrophysiological          # any asset's approach contains this
technique:"spike sorting"
standard:nwb
file_type:image                        # → image/* mime types
file_type:application/x-nwb            # explicit MIME prefix
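A rough sketch of how `file_type` resolution could work. The alias names come from the table above, but the MIME prefixes they expand to (apart from `image` → `image/`) and the case-insensitive comparison are assumptions, and both helper names are hypothetical:

```python
# Assumed alias table; only the alias names are documented above.
FILE_TYPE_ALIASES = {
    "nwb": "application/x-nwb",
    "image": "image/",
    "text": "text/",
    "video": "video/",
}


def file_type_prefix(value):
    """Resolve an alias to its MIME prefix, or pass a MIME prefix through."""
    lowered = value.lower()
    return FILE_TYPE_ALIASES.get(lowered, lowered)


def asset_matches_file_type(encoding_format, value):
    """True if the asset's encodingFormat starts with the resolved prefix."""
    return encoding_format.lower().startswith(file_type_prefix(value))


print(asset_matches_file_type("image/png", "image"))
print(asset_matches_file_type("video/mp4", "image"))
```

Anything that is not a known alias is used as a prefix as-is, which is why `file_type:application/x-nwb` and `file_type:image/` both work.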

Owner

Operator	What it matches
`owner:VALUE`	Dandisets owned by users matching `VALUE` (case-insensitive) against `username`, `email`, `first_name`, `last_name`, or `"first_name last_name"`

owner:alice
owner:alice@example.com
owner:Smith                            # any user named Smith
owner:"Jane Doe"                       # full display name

If a name matches multiple users (e.g. two Smiths), dandisets owned by any
of them are returned.
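The lookup and the multi-user union can be pictured like this. It is a sketch with hypothetical names and toy data; in particular, comparing whole field values (rather than substrings) is an assumption that this page does not confirm.

```python
def owner_matches(user, value):
    """Case-insensitive comparison of `value` against username, email,
    first name, last name, and "first last"."""
    needle = value.lower()
    full_name = f"{user['first_name']} {user['last_name']}"
    candidates = (
        user["username"], user["email"],
        user["first_name"], user["last_name"], full_name,
    )
    return any(needle == candidate.lower() for candidate in candidates)


def owned_dandisets(users, ownership, value):
    """Union of the dandisets owned by every user that matches `value`."""
    result = set()
    for user in users:
        if owner_matches(user, value):
            result |= ownership.get(user["username"], set())
    return result


# Hypothetical users and ownership table
users = [
    {"username": "asmith", "email": "alice@example.com",
     "first_name": "Alice", "last_name": "Smith"},
    {"username": "bsmith", "email": "bob@example.com",
     "first_name": "Bob", "last_name": "Smith"},
]
ownership = {"asmith": {"000001"}, "bsmith": {"000002"}}

print(owned_dandisets(users, ownership, "Smith"))        # both Smiths
print(owned_dandisets(users, ownership, "Alice Smith"))  # one user
```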

Contributors

The contributor operators search the dandiset's metadata.contributor[] list
(the same data shown in the "Contributors" section on the landing page). Each
operator matches a contributor by name, email, OR identifier —
which means ORCID for Person contributors (0000-0002-2990-9889) and ROR URL
for Organization contributors (https://ror.org/01cwqze88) both work. Bare-ID
substrings (01cwqze88) match the full URL.

Operator	Role constraint
`contributor:VALUE`	Any role (catch-all)
`author:VALUE`	Must hold the `Author` role
`contact_person:VALUE`	Must hold the `ContactPerson` role
`data_collector:VALUE`	Must hold the `DataCollector` role
`data_curator:VALUE`	Must hold the `DataCurator` role
`data_manager:VALUE`	Must hold the `DataManager` role
`maintainer:VALUE`	Must hold the `Maintainer` role
`project_leader:VALUE`	Must hold the `ProjectLeader` role
`funder:VALUE`	Must hold the `Funder` role
`sponsor:VALUE`	Must hold the `Sponsor` role

contributor:"Doe, Jane"                # any role
author:Doe                             # Doe specifically as an Author
data_curator:0000-0002-2990-9889       # this ORCID, must be a DataCurator
funder:NIH                             # NIH (or any string containing NIH) as Funder
funder:01cwqze88                       # by ROR id
author:Doe funder:NIH                  # both must hold (possibly different people)

The role-restricting operators map to the DANDI schema's RoleType
values. The catch-all contributor: covers any other role
(Conceptualization, Researcher, etc.); for those, filter by name and use the
landing page to check the specific role.

Affiliation

affiliation is special — affiliations live in a nested field
(contributor[].affiliation[]), not as a role on the contributor itself. The
operator queries that path:

Operator	What it matches
`affiliation:VALUE`	Substring against any contributor's affiliation `name` OR `identifier` (ROR URL)

affiliation:Stanford                       # any contributor affiliated with Stanford
affiliation:"University College London"
affiliation:00f54p054                      # Stanford's ROR id (substring of the URL)
author:Doe affiliation:Stanford            # Doe as author AND someone Stanford-affiliated

Recipes

Find recent NWB dandisets from a particular lab.

file_type:nwb affiliation:"University College London" published_after:2024-01-01

Find dandisets where I'm the contact person.

contact_person:"My Name"

Find dandisets funded by NIH with mouse data.

funder:NIH species:mouse

Find dandisets that cite a particular ORCID as an author.

author:0000-0002-2990-9889

Find your own dandisets in the listing.

owner:"Your Name"

(Or use the My Dandisets tab if you're signed in — it's the same set.)

Quoting rules

Wrap a multi-word value in double quotes:
technique:"spike sorting", contributor:"Doe, Jane",
affiliation:"Cold Spring Harbor Laboratory".
Wrap a whole token in double quotes to opt out of operator parsing —
useful when the text you're searching for contains a colon:
"foo:bar" searches for the literal text foo:bar.
Unbalanced quotes return a 400 with a friendly error message.
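The quoting rules can be approximated with a small tokenizer. This is a simplified sketch, not the archive's parser: `tokenize` and `TOKEN_RE` are hypothetical names, operator keys are not validated against the allowlist, and the error is a plain `ValueError` rather than an HTTP 400.

```python
import re

# A token is `key:"quoted value"`, a whole quoted phrase, or a run of
# non-space characters (tried in that order).
TOKEN_RE = re.compile(r'(\w+):"([^"]*)"|"([^"]*)"|(\S+)')


def tokenize(query):
    """Split a query into ("op", key, value) and ("text", value) tuples."""
    if query.count('"') % 2:
        raise ValueError("Unbalanced quote in search query")
    tokens = []
    for key, quoted_value, phrase, bare in TOKEN_RE.findall(query):
        if key:                       # key:"multi word value"
            tokens.append(("op", key.lower(), quoted_value))
        elif bare and ":" in bare:    # key:value
            bare_key, _, value = bare.partition(":")
            tokens.append(("op", bare_key.lower(), value))
        elif bare:                    # plain free-text word
            tokens.append(("text", bare))
        else:                         # whole token quoted: never an operator
            tokens.append(("text", phrase))
    return tokens


print(tokenize('place species:mouse technique:"spike sorting" "author:Doe"'))
```

Note how `"author:Doe"` comes back as free text while `technique:"spike sorting"` stays an operator with a two-word value.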

Error messages

Invalid syntax doesn't fail silently. Common cases:

What you type	What you get back
`specie:mouse`	400 — `Unknown search operator "specie". Did you mean "species"?`
`data_curatr:Doe`	400 — `Did you mean "data_curator"?`
`created_after:not-a-date`	400 — `Invalid date for "created_after"; Use YYYY-MM-DD.`
`hello "world`	400 — `Unbalanced quote in search query. Remove the stray quote...`
`owner:` (empty value)	400 — `Operator "owner" requires a value`

Typo suggestions are produced by difflib.get_close_matches;
they're a hint, not authoritative.
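For reference, a small sketch of how such a hint can be produced with `difflib.get_close_matches`. The operator list is an abbreviated sample and `unknown_operator_message` is a hypothetical helper, so the exact wording in the archive may differ:

```python
import difflib

# Abbreviated sample of operator keys, for illustration.
OPERATOR_KEYS = [
    "species", "approach", "technique", "standard", "file_type",
    "owner", "contributor", "author", "data_curator", "funder",
]


def unknown_operator_message(key):
    """Build an error message, adding a hint when a close match exists."""
    message = f'Unknown search operator "{key}".'
    close = difflib.get_close_matches(key.lower(), OPERATOR_KEYS, n=1)
    if close:
        message += f' Did you mean "{close[0]}"?'
    return message


print(unknown_operator_message("specie"))
print(unknown_operator_message("data_curatr"))
```

When nothing is similar enough (the default cutoff is 0.6), the message simply omits the suggestion.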

Using from the API

The same syntax works against the REST API — the search string lives in the
?search= query parameter on /api/dandisets/:

curl 'https://api.dandiarchive.org/api/dandisets/?search=species:mouse+author:Doe'

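import requests

# `search` carries the same syntax as the search box; requests URL-encodes it.
r = requests.get(
    'https://api.dandiarchive.org/api/dandisets/',
    params={'search': 'species:mouse author:Doe', 'draft': 'true', 'empty': 'true'},
)
r.raise_for_status()  # a 400 (e.g. unknown operator) raises instead of passing silently
r.json()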

The OpenAPI description on /swagger/ lists every operator inline.

Limitations and notes

Substring, case-insensitive. species:mouse matches any species name that
contains "mouse" (House mouse, for example); a name recorded only as
Mus musculus would not match. There's no exact-match mode at the moment, so
use a longer substring to narrow.
No OR or NOT. Operators always combine with AND. To express OR, run two
queries (or wait for a future revision; see below).
No nesting. (species:mouse OR species:rat) and similar grammar isn't
supported.
AND combines at the dandiset level for assets and at the version level for contributors. Each
asset operator filters dandisets independently — different operators may
match different assets within the same dandiset. Contributor operators
combine on the same version's contributor list (so a draft + published
version with disjoint contributors don't combine into a spurious match);
within that single version, different contributor operators may match
different entries of contributor[].
?user=me (an existing query parameter) still works for "my dandisets";
there's no owner:me magic alias in the operator syntax.
Free-text and operators combine. The same ?search= parameter accepts
both, so you don't need a different endpoint depending on whether you have
operators.

yarikoptic · 2026-05-13T13:35:32Z

Thank you @bendichter . Great work -- I think we are converging. Could you please refine the PR description to correspond to the current changes, since I think the 25 role-specific operators are no longer there, and potentially other aspects? Note that I also folded that extended paste on advanced search in the most recent comment.

edit: also rebase/merge master to get advantage of #2820 since now renders skinny

yarikoptic

overall looks great to my eye... let's see if more eyes could have a peek

yarikoptic · 2026-05-13T13:38:52Z

+  { example: 'contact_person:Doe', description: 'Listed as the Contact Person' },
+  { example: 'maintainer:Doe', description: 'Listed as a Maintainer' },
+  { example: 'project_leader:Doe', description: 'Listed as the Project Leader (also: data_collector, data_manager, sponsor)' },
+  { example: 'affiliation:Stanford', description: 'Has a contributor affiliated with the named organization (or ROR ID)' },


I feel we would really need a URL to docs there now ... best even not to delay but preempt location?

dandi/dandi-docs#238

candleindark · 2026-05-13T18:08:57Z

The PR description is outdated at this point. An update would help convey the intent of the PR more clearly. (For example, the trimming of the role-specific operators in 5240487 is not reflected in the description.)

Filters dandisets to those owned by a given user. The value is matched case-insensitively against User.username OR User.email. The special form `owner:me` resolves to the requesting user (consistent with the existing ?user=me query parameter) and returns 400 if the request is anonymous. Implementation reuses the existing `get_owned_dandisets()` permission helper. We pass `with_superuser=False` so `owner:admin` returns only what admin explicitly owns — guardian's default would otherwise inflate to the entire archive for any superuser. Unknown users return zero results (not an error): a search for a nonexistent owner is a valid 0-hit query. Tests cover username/email lookup, case-insensitivity, unknown user, `owner:me` for an authenticated user, anonymous `owner:me` → 400, the superuser non-inflation guarantee, and combination with other operators. OpenAPI help text and the frontend operator popover updated.

Real users encounter the dandiset list with owners shown by display name (e.g. "Super User"), not by username. Searching that string was returning 0 because the lookup only matched username/email. Now matches case-insensitively against username, email, first_name, last_name, OR "first_name last_name" — so owner:"Super User" works the same as owner:ben.dichter@gmail.com. Multiple users may match (e.g. shared last name); we union dandisets owned by any of them via a direct DandisetUserObjectPermission query. Updated OpenAPI help text and the frontend popover example to `owner:"Jane Doe"` so users discover the new shape.

@yarikoptic

Round-2 review feedback on dandi#2821:

- @yarikoptic flagged that owner:me silently shadows a real user named "Me". Fix: distinguish quoted vs unquoted at the parser level. Unquoted owner:me → magic alias for the requesting user. Quoted owner:"me" → literal lookup (matches a user whose first/last name is "Me"). Same pattern lets owner:"Me Someoneyou" reach the literal full-name match while keeping the convenient owner:me shortcut. Implementation: ParsedSearch.operators is now a list of `Operator` dataclasses (key, value, quoted) instead of bare tuples. Filters consume the new shape and the owner filter switches on the quoted flag.
- Replaced personal email (ben.dichter@gmail.com) in the full-name test fixture with a generic example user.
- Consolidated 10 small owner-tests into 3 denser ones that share setup per @yarikoptic's "make each test matter more" feedback. Coverage is unchanged (every documented lookup path is asserted; cross-key AND with another operator; multi-user union via shared last name; unknown user → 0; superuser non-inflation; owner:me magic; owner:"me" literal-escape; anonymous owner:me → 400). DB setup runs ~3x instead of ~10x.

Updated OpenAPI help text and the search popover to mention the owner:me alias and the quoted-escape.

The unquoted owner:me → current-user shortcut required threading a `quoted` flag through the parser and a `request_user` arg through the filter dispatch: non-trivial machinery to support one alias. Per dandi#2822 review discussion, removing it from this PR keeps the owner operator focused on literal lookup-by-value (username / email / first / last / "first last") and avoids the design debate about the right escape mechanism for "I literally want a user named Me." The alias can come back in a focused follow-up PR if/when there's appetite for it.

Concrete drops:
- owner:me magic + 400-on-anonymous in `_apply_owner_filter`
- `Operator.quoted` field on the parser dataclass
- `quoted` and `request_user` parameters on `_apply_owner_filter`
- `get_owned_dandisets` import (no longer used here)
- `test_advanced_search_owner_me_magic_and_literal_escape` test
- The two `owner-me-quoted` / `owner-me-unquoted` parser test cases
- "owner:me" mentions in OpenAPI help text and the popover entry

…okup 29 new operators total: catch-all `contributor:` plus one per dandi-schema RoleType (`author`, `data_curator`, `funder`, `contact_person`, etc.). Independent-operator semantics: `author:Doe funder:NIH` returns dandisets where SOME contributor has Doe-as-Author AND SOME contributor (possibly different) has NIH-as-Funder. Each role-specific operator constrains a single contributor[] element to have BOTH the name match AND the role.

Implementation:
- A single `_CONTRIBUTOR_ROLE_OPS` dict drives both the parser allowlist and the filter dispatch; adding a future role is one new entry.
- `_contributor_jsonpath()` builds a Postgres jsonb_path_exists predicate that ORs across `name`, `email`, AND `identifier` (so ORCID for Persons and ROR URL for Organizations both work, including bare-ID substring forms like `01cwqze88` matching the full ROR URL).
- All contributor operators in a single query AND on the same Version's metadata so a draft + published version with disjoint contributor lists never combine into a spurious match.

Why 29 separate operators rather than a `contributor: + role:` pair: independent operators compose cleanly (cross-key AND falls out naturally; no ambiguity about which role applies to which contributor when there are multiple). Same precedent as Gmail's `from:`/`to:`/`cc:`. The 28 role names come straight from `dandischema.RoleType`.

Test: one consolidated test covers catch-all + role-specific lookup, case-insensitivity, identifier (ORCID + ROR + bare-ID substring), role-substring matching `dcite:`-prefixed stored values, role + ORCID composition (positive and negative), and independent cross-role AND. Plus a separate test for the typo → 400-with-suggestion path. Anonymous test fixtures use generic Doe placeholders, no real names. OpenAPI help text and the search popover updated.
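The per-element semantics (name/email/identifier match AND role on the same contributor entry) can be sketched in plain Python. This only mirrors the behavior described in the commit message; the real predicate runs as a Postgres jsonpath, and `version_matches` plus its helpers are hypothetical names.

```python
from typing import Optional


def _text_matches(contributor: dict, value: str) -> bool:
    """Case-insensitive substring match across name, email, and identifier."""
    needle = value.lower()
    return any(
        needle in str(contributor.get(field) or "").lower()
        for field in ("name", "email", "identifier")
    )


def _has_role(contributor: dict, role: str) -> bool:
    """Stored roles look like "dcite:Author", so a substring test suffices here."""
    return any(role.lower() in r.lower() for r in contributor.get("roleName") or [])


def version_matches(metadata: dict, value: str, role: Optional[str] = None) -> bool:
    """True if ONE contributor entry satisfies both the text match and the role.

    `role=None` models the catch-all `contributor:` operator.
    """
    return any(
        _text_matches(c, value) and (role is None or _has_role(c, role))
        for c in metadata.get("contributor") or []
    )
```

A query such as `author:Doe funder:NIH` then amounts to calling `version_matches` once per operator on the same version's metadata and AND-ing the results, so the two operators are free to match different contributor entries.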

The previous commit treated `affiliation:` as a role-name match (looking for `dcite:Affiliation` in `contributor[].roleName`), but real DANDI data never uses that role; affiliations live in a separate nested field `contributor[].affiliation[]`. The operator silently returned 0 hits despite plenty of (e.g.) Stanford-affiliated contributors.

Fix: route `affiliation:` through a dedicated jsonpath that scans `$.contributor[*].affiliation[*]` and matches against the affiliation's `name` OR `identifier` (case-insensitive substring). So:
- affiliation:Stanford → matches Stanford University
- affiliation:"University College London" → quoted multi-word
- affiliation:00f54p054 → matches via ROR ID substring

Composes with role/contributor operators on the same Version, same as the other contributor-style operators (independent-operator AND).

Also refactored `_apply_contributor_filters` to accept a list of (where, params) pairs rather than (value, role), which is cleaner since both the role-based and affiliation operators now share the same dispatch.
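As a rough Python analogue of the `$.contributor[*].affiliation[*]` traversal (the actual filter is a Postgres jsonpath, and `affiliation_matches` is a hypothetical name), the lookup walks each contributor's nested affiliation list and ignores the contributor's own fields:

```python
def affiliation_matches(metadata: dict, value: str) -> bool:
    """Case-insensitive substring match on any affiliation's name or identifier."""
    needle = value.lower()
    for contributor in metadata.get("contributor") or []:
        # Contributors without an `affiliation` key (e.g. Organizations)
        # contribute nothing here; their own identifier is never consulted.
        for affiliation in contributor.get("affiliation") or []:
            for field in ("name", "identifier"):
                if needle in str(affiliation.get(field) or "").lower():
                    return True
    return False
```

Because only the nested list is scanned, a funder Organization's ROR identifier does not satisfy `affiliation:`, while a Person affiliated with that organization would.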

Per review: `other:` would be a thin surface for "uncategorized contributors" — not a useful filter — and `ethics_approval:` isn't a contributor-style role users would search by. Removing them tightens the operator vocabulary to the 25 substantive RoleType values + the contributor catch-all + affiliation.

Two structural improvements + one product trim, in response to the review on dandi#2822:

1. New `dandiapi/api/services/search/operators.py` (pure Python, no Django) holds every operator-vocabulary constant: DATE_OPS, ASSET_OPS, OWNER_OPS, AFFILIATION_OPS, CONTRIBUTOR_ROLE_OPS, FILE_TYPE_ALIASES, ASSET_NAME_PATH_OPS, AFFILIATION_JSONPATH. OPERATOR_KEYS is now the union of those tables: a single source of truth, with no more duplication between parser.py (allowlist) and filters.py (dispatch). Adding a new operator is one entry; the parser automatically knows about it.
2. Trim the role-restricting shortcuts from 25 to 9. After review discussion: most RoleType values aren't operators users actually reach for (`conceptualization:`, `methodology:`, `validation:`, `visualization:`, etc.). Kept the ones that map to common search intents: contributor (catch-all), author, contact_person, data_collector, data_curator, data_manager, maintainer, project_lead, funder, sponsor. The catch-all `contributor:` still matches anyone in any role; only the role-restricting shortcuts are pruned. `project_lead:` is intentionally shorter than the schema name `ProjectLeader`.
3. Shrank the verbose docstrings on private filter helpers (the rationale stays in commit messages, not as documentation rot on internal API).
4. Added test_contributor_role_ops_match_actual_dandischema_roletype as a drift guard: every non-catch-all CONTRIBUTOR_ROLE_OPS value must be a real RoleType.name. Renames or removals on the schema side trip the test, forcing an explicit decision instead of silently changing public search syntax.

OpenAPI help text and the search popover updated to reflect the trimmed list (`project_lead`, `data_collector`, `data_manager`, `sponsor` now shown; the misleading "many more" tail removed).
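A minimal sketch of the "union of tables" idea and the drift guard follows. It is not the contents of `operators.py`: the `RoleType` enum below is a stand-in so the example runs without `dandischema`, the table contents are reduced to operator names that appear in this PR, and `drifted_role_ops` is a hypothetical helper.

```python
from enum import Enum


class RoleType(Enum):
    """Stand-in for dandischema.RoleType, reduced to the members used below."""

    Author = "dcite:Author"
    ContactPerson = "dcite:ContactPerson"
    DataCollector = "dcite:DataCollector"
    DataCurator = "dcite:DataCurator"
    DataManager = "dcite:DataManager"
    Maintainer = "dcite:Maintainer"
    ProjectLeader = "dcite:ProjectLeader"
    Funder = "dcite:Funder"
    Sponsor = "dcite:Sponsor"


# Illustrative subsets; the real tables hold more entries.
ASSET_OPS = {"species", "approach"}
OWNER_OPS = {"owner"}
AFFILIATION_OPS = {"affiliation"}

# Operator name -> RoleType member name; None marks the catch-all.
CONTRIBUTOR_ROLE_OPS = {
    "contributor": None,
    "author": "Author",
    "contact_person": "ContactPerson",
    "data_collector": "DataCollector",
    "data_curator": "DataCurator",
    "data_manager": "DataManager",
    "maintainer": "Maintainer",
    "project_lead": "ProjectLeader",
    "funder": "Funder",
    "sponsor": "Sponsor",
}

# The parser allowlist is derived, never maintained by hand.
OPERATOR_KEYS = frozenset(
    ASSET_OPS | OWNER_OPS | AFFILIATION_OPS | set(CONTRIBUTOR_ROLE_OPS)
)


def drifted_role_ops():
    """Return operators whose target is no longer a RoleType member name."""
    known = {member.name for member in RoleType}
    return {
        op: role
        for op, role in CONTRIBUTOR_ROLE_OPS.items()
        if role is not None and role not in known
    }
```

A drift-guard test in this style would assert that `drifted_role_ops()` is empty, so a rename on the schema side fails loudly instead of quietly breaking an operator.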

@example

- Variable renames: ds_baker_curator → ds_doe_curator, ds_baker_author_only → ds_doe_author_only (the test data was already Doe; only the variable names still carried the old name).
- One stale query string `AUTHOR:baker` updated to `AUTHOR:doe`.
- One fixture email field `'jane.doe.com'` (broken: no @) restored to `'jane.doe@example.com'`, a leftover from the earlier perl rename that stripped @example out.

Per dandi#2822 review discussion: the old semantics required all asset operators to be satisfied by a SINGLE asset, which meant `species:mouse species:rat` only matched dandisets with a multi-species recording (rare). The natural user reading is "the dandiset has mouse data AND has rat data"; those can be on different assets, and that's the common case for comparative-species dandisets.

Implementation: each asset operator now builds an independent AssetSearch subquery and the dandiset queryset is filtered with `id__in=...` per operator. Django generates one subquery per operator and ANDs them at the dandiset level. Cross-key likewise: `species:mouse approach:electrophysiological` now matches any dandiset that has SOME mouse asset AND SOME ephys asset, not just dandisets with a mouse-ephys asset.

Tests updated:
- `test_advanced_search_repeated_same_key_operator_combines_with_and` is now `..._combines_at_dandiset_level`, with a new fixture that has two separate assets (one mouse, one rat) to actually exercise the cross-asset case the old semantic excluded.
- `test_advanced_search_repeated_asset_operators_intersect` is now `test_advanced_search_asset_operators_combine_at_dandiset_level`, with a similar two-assets-split fixture that demonstrates the new inclusive behavior.

Contributor / affiliation semantics unchanged: those still AND on the same Version's metadata (since contributors live per-version, not per-asset). Within that single version, predicates can match different contributor[] entries.
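The dandiset-level AND can be modelled with set intersection. This is a sketch, assuming each asset is reduced to a flat dict keyed by operator name; `dandisets_matching` is a hypothetical function, not the Django queryset code.

```python
def dandisets_matching(assets, operators):
    """Intersect, per operator, the dandisets that have SOME matching asset.

    `assets` is an iterable of (dandiset_id, asset_fields) pairs and
    `operators` is a list of (key, value) pairs.
    """
    assets = list(assets)
    # With no operators, every dandiset that has assets qualifies.
    result = {dandiset_id for dandiset_id, _ in assets}
    for key, value in operators:
        needle = value.lower()
        hits = {
            dandiset_id
            for dandiset_id, fields in assets
            if needle in str(fields.get(key, "")).lower()
        }
        result &= hits  # AND happens between dandisets, not within one asset
    return result
```

Each operator plays the role of one `id__in` subquery: it produces its own set of dandisets, and only the intersection survives. A dandiset with one mouse asset and a separate rat asset therefore matches `species:mouse species:rat`, which the old single-asset semantics excluded.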

Postgres jsonpath quirk: `like_regex` requires its pattern to be a STRING LITERAL inside the jsonpath text — not a `$variable`. The contributor + affiliation builders I wrote tried to use the `vars` argument of `jsonb_path_exists` for the regex pattern, which Postgres rejects with `syntax error at or near "$val" of jsonpath input`. (The asset operators avoid this by concatenating `to_jsonb(?::text)::text` into the jsonpath at SQL execution time — the regex pattern ends up as a properly-quoted JSON string literal in the path. The user value is still bound as a parameter, never inlined into the SQL.) Refactor: applied the same SQL-time concatenation trick to the contributor + affiliation builders. Three new helpers — `_contributor_where`, `_affiliation_where`, and a shared `_LIKE_REGEX_PATTERN` constant — replace the old `_contributor_role_jsonpath` + `_build_jsonpath_where` pair that relied on the broken `vars` mechanism. Removed the unused `AFFILIATION_JSONPATH` constant from operators.py and dropped the `json` import from filters.py since we no longer marshal `vars` objects. Net behavior unchanged; the failing CI tests should pass now.
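The concatenation trick can be illustrated by building the WHERE fragment as a string. This is a sketch under stated assumptions, not the PR's helpers: `sketch_contributor_where` and `PATTERN_SQL` are hypothetical names, the `metadata` column name and the `%s` placeholder style are assumed, and the user value is still interpreted as a regular expression by `like_regex`.

```python
# Rendered by Postgres into a quoted JSON string literal, e.g. "doe".
PATTERN_SQL = "to_jsonb(%s::text)::text"


def sketch_contributor_where(value):
    """Return (sql, params) for a contributor name/email/identifier match.

    The jsonpath text is assembled with SQL `||` at execution time, so the
    regex lands in the path as a string literal while the user value stays
    a bound parameter.
    """
    pieces = []
    params = []
    for index, field in enumerate(("name", "email", "identifier")):
        # This `||` sits inside an SQL string literal: it is the jsonpath OR.
        joiner = "" if index == 0 else " || "
        pieces.append(
            f"'{joiner}@.{field} like_regex ' || {PATTERN_SQL} || ' flag \"i\"'"
        )
        params.append(value)
    path_sql = "'$.contributor[*] ? (' || " + " || ".join(pieces) + " || ')'"
    return f"jsonb_path_exists(metadata, ({path_sql})::jsonpath)", params
```

For the value `Doe`, the path that Postgres ends up evaluating would read `$.contributor[*] ? (@.name like_regex "Doe" flag "i" || @.email like_regex "Doe" flag "i" || @.identifier like_regex "Doe" flag "i")`, while the SQL text itself contains only placeholders.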

CI surfaced an assertion that AUTHOR:doe should match the same set as author:doe. The old _TOKEN_RE / _BARE_OP_RE only accepted lowercase operator keys, so uppercase tokens fell through to free text and returned 0 results. Accept either case in the regex and lowercase the captured key before validation/dispatch. Matches user expectations (GitHub's search operators are case-insensitive on the key side too).
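A compact way to picture the fix is a tokenizer whose key pattern accepts either case and whose output is lowercased before validation. The regex, the reduced `OPERATOR_KEYS` set, the `UnknownOperator` exception, and the use of `difflib` for the "Did you mean" hint are all assumptions made for this sketch; the PR does not show how its suggestion is computed.

```python
import difflib
import re

# Illustrative subset of the allowlist.
OPERATOR_KEYS = {"author", "data_curator", "funder", "contributor", "affiliation", "owner"}

# key:"quoted value" or key:bare_value; the key may be typed in any case.
TOKEN_RE = re.compile(r'(?<!\S)([A-Za-z_]+):(?:"([^"]*)"|(\S+))')


class UnknownOperator(ValueError):
    """Raised for an operator key that is not in the allowlist."""


def parse(query):
    """Split a query into ([(key, value), ...], free_text)."""
    operators = []
    for match in TOKEN_RE.finditer(query):
        key = match.group(1).lower()  # AUTHOR:doe and author:doe are the same
        value = match.group(2) if match.group(2) is not None else match.group(3)
        if key not in OPERATOR_KEYS:
            close = difflib.get_close_matches(key, sorted(OPERATOR_KEYS), n=1)
            hint = f' Did you mean "{close[0]}"?' if close else ""
            raise UnknownOperator(f'Unknown operator "{key}".{hint}')
        operators.append((key, value))
    free_text = " ".join(TOKEN_RE.sub(" ", query).split())
    return operators, free_text
```

Note that this toy version treats any `word:value` token as an operator, so a bare URL in free text would be rejected; a production parser would need a rule for that case.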

Co-authored-by: Isaac To <candleindark@users.noreply.github.com>

@candleindark

Per @candleindark's review: a contributor can be an Organization as well as a Person, and the affiliation jsonpath (which traverses `contributor[*].affiliation[*]`) should walk past Organizations (which have no `affiliation` field of their own) without exploding.

Added Organization contributors to both `ds_stanford` and `ds_ucl`: NIH as a Funder on ds_stanford and Wellcome Trust as a Funder on ds_ucl. The new assertions confirm:
- `affiliation:Stanford` (and the other affiliation queries) keep working with mixed Person/Organization contributors.
- The Organization's own `identifier` is NOT matched by `affiliation:` (it's not an affiliation; the test pins this).
- Cross-key with `funder:NIH affiliation:Stanford` works (different contributor elements on the same Version).

Also: used `National Institutes of Health (NIH)` for the org name so the `funder:NIH` substring test actually matches (the abbreviation isn't part of the spelled-out form alone). This is realistic: DANDI contributors often use this parenthetical form.

bendichter · 2026-05-13T20:01:44Z

@yarikoptic @candleindark — refreshed the PR description: dropped the stale "25 role-specific operators" framing (now 9), called out the case-insensitive operator keys, the explicit-allowlist rationale, and the RoleType drift-guard test. Also rebased on master to pick up #2820 (skinny popover render) — both branches force-pushed. tox -e test is green locally (53 advanced-search + parser tests pass).

@yarikoptic

Per @yarikoptic's review (PR dandi#2822). The deferred imports inside test_contributor_role_ops_match_actual_dandischema_roletype were a holdover from when this file deliberately avoided dandischema imports; that constraint no longer applies, and module-level imports are the project convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

candleindark · 2026-05-14T01:00:51Z

+    Operators thus AND on the same Version (a draft and a published version
+    with disjoint contributor lists never combine into a spurious match).


It may be a good idea to add a test to ensure that there is no spurious match. Such a test can safeguard against a future change that breaks this behavior.

Good idea — added in 0cbdca7. The new test_advanced_search_contributor_operators_and_on_same_version seeds a dandiset whose draft has Author=Doe and whose published version has Funder=NIH (no contributor overlap), then asserts author:Doe funder:NIH rejects it (while a positive-control dandiset with both contributors on the same version still matches). A future change that ANDs per-operator subqueries against unrelated Version rows would let the spurious match through and trip the test.

All concerns I raised have been addressed; deferring to other reviewers for full approval.

@candleindark

Per @candleindark's review (PR dandi#2822). Pins down the "AND on the same Version" semantics with a regression test: a dandiset whose draft has `Author=Doe` and whose published version has `Funder=NIH` (with no overlap) must NOT match `author:Doe funder:NIH`. A future change that chains the predicates against unrelated Version rows would let this spurious match through, and would now trip the test. A positive control (a single version that holds both contributors) confirms the operator composition itself still works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
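The failure mode this regression test guards against can be shown in a few lines of plain Python. All names here (`has_contributor`, `same_version_match`, `naive_match`) are hypothetical, and versions are modelled as plain metadata dicts.

```python
def has_contributor(value, role):
    """Build a predicate: some contributor matches `value` and holds `role`."""
    needle = value.lower()

    def predicate(metadata):
        return any(
            needle in str(c.get("name", "")).lower()
            and any(role.lower() in r.lower() for r in c.get("roleName", []))
            for c in metadata.get("contributor", [])
        )

    return predicate


def same_version_match(versions, predicates):
    """Intended semantics: ONE version must satisfy every predicate."""
    return any(all(p(meta) for p in predicates) for meta in versions)


def naive_match(versions, predicates):
    """Buggy semantics: each predicate may be satisfied by a different version."""
    return all(any(p(meta) for meta in versions) for p in predicates)
```

For a dandiset whose draft lists only Doe as Author and whose published version lists only NIH as Funder, `same_version_match` rejects `author:Doe funder:NIH` while `naive_match` accepts it. That difference is the spurious match the test is meant to catch.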

yarikoptic reviewed May 9, 2026

bendichter force-pushed the advanced-search-contributor branch from 446eb7c to 845bb68 Compare May 11, 2026 15:44

bendichter mentioned this pull request May 11, 2026

Add owner: operator to advanced search #2821

Open

3 tasks

bendichter commented May 11, 2026

bendichter changed the title ~~Add contributor + per-role operators to advanced search (stacked on #2821)~~ Add contributor + per-role operators to advanced search May 11, 2026

bendichter requested a review from yarikoptic May 12, 2026 22:45

yarikoptic reviewed May 13, 2026

yarikoptic requested a review from candleindark May 13, 2026 13:44

candleindark previously requested changes May 13, 2026

Comment thread dandiapi/api/tests/test_dandiset.py

candleindark reviewed May 13, 2026

Comment thread dandiapi/api/services/search/filters.py Outdated

bendichter added 14 commits May 13, 2026 15:52

Apply ruff format to test_dandiset.py

74fcc5b

Use project_leader (full schema name) as the operator name

463c366

Apply ruff format

e4d03d3

bendichter and others added 3 commits May 13, 2026 15:52

Update dandiapi/api/services/search/filters.py

a994d90

Co-authored-by: Isaac To <candleindark@users.noreply.github.com>

bendichter force-pushed the advanced-search-contributor branch from 1889be6 to f9a8155 Compare May 13, 2026 20:00

This was referenced May 13, 2026

docs: add Advanced Search page for key:value search operators dandi/dandi-docs#238

Open

Operator-autocomplete dropdown for advanced search (stacked on #2822) #2826

Draft

Add num_subjects: advanced-search operator #2827

Open

candleindark reviewed May 14, 2026

bendichter requested a review from yarikoptic May 14, 2026 01:21

bendichter requested a review from candleindark June 8, 2026 17:48

Merge remote-tracking branch 'origin/master' into advanced-search-con…

460c61c

…tributor # Conflicts: # dandiapi/api/services/search/filters.py # dandiapi/api/services/search/parser.py

bendichter mentioned this pull request Jun 11, 2026

Blog post: Find Dandisets Faster with Advanced Search dandi/dandi-about#118

Draft

bendichter commented May 9, 2026 • edited

Behavior

Affiliation is special

Why per-role operators (Option D) over contributor: + role:

yarikoptic left a comment • edited

bendichter left a comment

bendichter commented May 11, 2026 • edited by yarikoptic

Advanced Search

Quick examples

How operators combine

Operator reference

Dates

Asset content

Owner

Contributors

Affiliation

Recipes

Quoting rules

Error messages

Using from the API

Limitations and notes

yarikoptic commented May 13, 2026 • edited

yarikoptic left a comment

candleindark commented May 13, 2026

bendichter commented May 13, 2026
