
Show stats panel in occurrence list sidebar#1308

Draft
mihow wants to merge 32 commits into main from feat/occurrence-stats-ui

Conversation

@mihow

@mihow mihow commented May 15, 2026

Collaborator

Summary

Adds a Stats panel at the top of the occurrence list sidebar. It's the frontend for the model-agreement endpoint from #1307 (now merged). The panel shows, for the current filtered result set, how much has been human-verified and how closely the model's predictions agree with those human verifications. Because it reuses the list view's active filters, the numbers always match what's on screen — change a filter and the stats re-query.

The goal is to give reviewers a quick, honest read on model quality for whatever slice they're looking at, with the uncertainty made visible rather than hidden behind a single percentage.

Screenshots

[Four screenshots; images not preserved]

List of Changes

| # | What the user sees | How |
|---|---|---|
| 1 | A Stats panel above the filter sections in the occurrence list sidebar | New OccurrenceStats component (ui/src/pages/occurrences/occurrence-stats.tsx), wired into occurrences.tsx |
| 2 | Numbers always match the visible list (taxon, deployment, date, verification status, default filters) | Threads the same active filters array the list sends to useOccurrences, converted to query params with the same active/error rules as getFetchUrl |
| 3 | Verified occurrences: share of the filtered set that's human-verified (shows <1% instead of 0% when the count rounds down but is non-zero) | verified_pct; exact counts moved into the tooltip |
| 4 | Agreement (exact taxon) shown above the fold; any rank, coarser rank, and Cohen's κ tucked under a More toggle (collapsed by default) to reduce clutter | Collapsible section |
| 5 | An (i) tooltip on every metric with the exact counts and a plain-language explanation, including why the agreement denominator (verified occurrences with a model prediction) can be smaller than the verified count | Dynamic tooltip text built from the response |
| 6 | The 95% confidence interval shown as the agreement headline (e.g. 83–94%) plus a fuzzy diagonal-hatch band on the bar marking the uncertain zone | Solid fill to the lower CI bound, hatch across the CI range over the gray track |
| 7 | Cohen's κ as a signed, zero-centred bar | [-1, 1] bar |
| 8 | Panel shows a loading skeleton, and renders nothing on error so it never blocks the list | |
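
The rounding rule in row 3 can be sketched as follows. This is an illustration in Python, not the component's TypeScript; `format_pct` is a hypothetical name, and the sketch assumes `verified_pct` arrives as a fraction in [0, 1].

```python
def format_pct(fraction: float) -> str:
    """Format a 0..1 fraction as a whole percentage.

    A non-zero value that would round down to 0 is shown as "<1%"
    so a small verified set is not mistaken for an empty one.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    pct = round(fraction * 100)
    if pct == 0 and fraction > 0:
        return "<1%"
    return f"{pct}%"
```

With this rule, 3 verified occurrences out of 1,000 would read as "<1%", while a truly empty verified set still reads as "0%".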

How the confidence interval is drawn

Each agreement bar is one 0–100% track. A solid fill runs up to the lower 95% CI bound (the "confident floor"), and a diagonal hatch covers the CI range (low→high) — the uncertain zone where the true value sits. The hatch is drawn over the gray track rather than over the solid fill, so it stays visible no matter where the point estimate lands (an earlier version layered the hatch on the solid fill, which hid it blue-on-blue when the estimate sat near 100%). A wide hatch reads immediately as "shaky number", a narrow one as "confident".
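
As a rough sketch of the geometry described above (in Python for illustration; the real component is TypeScript, and `BarSegments` and `ci_bar_segments` are hypothetical names), the two layers can be derived from the CI bounds alone, assuming both bounds are fractions in [0, 1]:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BarSegments:
    solid_width_pct: float  # solid fill: from 0 up to the lower CI bound
    hatch_left_pct: float   # hatch starts at the lower CI bound
    hatch_width_pct: float  # hatch spans the CI range (low to high)


def ci_bar_segments(ci_low: float, ci_high: float) -> BarSegments:
    """Split a 0-100% track into a solid 'confident floor' and a hatched CI band."""
    if not 0.0 <= ci_low <= ci_high <= 1.0:
        raise ValueError("expected 0 <= ci_low <= ci_high <= 1")
    low = ci_low * 100
    high = ci_high * 100
    # The hatch begins where the solid fill ends, so the two never overlap.
    return BarSegments(solid_width_pct=low, hatch_left_pct=low, hatch_width_pct=high - low)
```

Because the hatch starts where the solid fill ends, a 21–100% interval comes out as a short solid segment followed by a hatch covering most of the track.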

Design notes (why these particular metrics)

The agreement rate is the share of human-verified occurrences where the human's pick matched the model's pick. Three calibration ideas are baked in:

  • Confidence interval instead of a hard cutoff. Rather than a yes/no "enough data" line, the Wilson 95% CI shows how shaky the number is — wide when few occurrences are verified, tightening as more get verified. A Wilson score interval behaves well at small samples. This is more honest than a magic threshold like "30 verifications", which is a rule of thumb that only holds if verifications are a random sample — they aren't; people verify the unusual or eye-catching ones first.
  • Agreement beyond chance (Cohen's κ). Plain agreement % has a blind spot: if most occurrences in a project are one common species, human and model "agree" most of the time just by both guessing the common one — luck, not skill. κ subtracts the expected-by-chance agreement (1.0 = perfect, 0 = no better than guessing, negative = worse).
  • Counts in the tooltip. The exact K of N lives in the (i) tooltip so the headline stays uncluttered, but the reader can still see how many verifications the rate is built on.

Same caveat applies to all three: they describe only the occurrences people chose to verify, not the whole project.
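
To make the κ blind spot concrete, here is a minimal sketch of Cohen's κ. It is not the repository's `cohens_kappa` helper; it only follows the behaviour described in this PR (undefined for an empty set or when expected agreement is 1.0), and the species labels are made up.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> Optional[float]:
    """Agreement beyond chance between two raters over the same items."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    n = len(rater_a)
    if n == 0:
        return None  # nothing was labelled by both raters
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return None  # single-category set: kappa is undefined
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: the model always predicts the common species.
human = ["common"] * 9 + ["rare"]
model = ["common"] * 10
kappa = cohens_kappa(human, model)  # raw agreement is 90%, kappa is 0.0
```

In this toy case the raw agreement is 90%, yet κ is 0 because always guessing the common species would score the same.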

Known follow-up (backend, not this PR)

While testing against real data, the agreement rates looked high even on lightly-verified projects. On inspection, roughly half the verifications on a sample project are users accepting the model's suggestion (Identification.agreed_with_prediction), which sets the human taxon equal to the model's by construction and counts as an exact match automatically. Measured on a sample project, excluding accept-the-suggestion IDs drops exact agreement from 90% (n=100, CI 83–94%) to 38% (n=16, CI 18–61%). This is a definition issue in the merged endpoint (#1307), not in this UI. Details and a suggested fix are in a comment on this PR; a backend follow-up can exclude those IDs from the agreement denominator.
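
The intervals quoted above can be reproduced with a standard Wilson score interval. The sketch below is an independent illustration, not the code in `ami/utils/stats`; it assumes the 38% figure corresponds to 6 of 16, and raising on `total <= 0` is this sketch's own choice.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if successes < 0 or successes > total:
        raise ValueError("successes must be within [0, total]")
    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low_all, high_all = wilson_interval(90, 100)  # roughly 0.83 to 0.94
low_sub, high_sub = wilson_interval(6, 16)    # roughly 0.18 to 0.61
```

The second interval is far wider than the first, which is the point of the hatch band: at n=16 the rate could plausibly sit anywhere from under 20% to over 60%.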

Test plan

  • tsc --noEmit, eslint, and prettier --check clean on touched files (run in a pinned Node 22.12 container; the worktree's host node_modules is unavailable).
  • Live browser render verified against a well-populated project via the worktree dev server, proxied to the staging backend. All bars render; the More section expands; tooltips carry the live counts.
  • CI hatch verified across point estimates — including near-100% cases (e.g. 21–100% renders as mostly hatch) and mid-range cases.
  • Filter reactivity verified: toggling Default Filters off sets ?apply_defaults=false and the panel re-queries with the same param, so the same filter array drives both list and stats.
  • Staging backend deployed to current main so the Netlify deploy preview serves the live endpoint.

🤖 Generated with Claude Code

@netlify

netlify Bot commented May 15, 2026


Deploy Preview for antenna-preview ready!

Name Link
🔨 Latest commit d3c6acf
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a3359e0c1c3240008ae4821
😎 Deploy Preview https://deploy-preview-1308--antenna-preview.netlify.app
Lighthouse
1 paths audited
Performance: 62 (🔴 down 3 from production)
Accessibility: 81 (🔴 down 8 from production)
Best Practices: 92 (🔴 down 8 from production)
SEO: 92 (no change from production)
PWA: 80 (no change from production)
View the detailed breakdown and full score reports

To edit notification comments on pull requests, go to your Netlify project configuration.

@coderabbitai

coderabbitai Bot commented May 15, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b8b08405-d33c-470e-8a17-f51294883c42

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@mihow mihow force-pushed the feat/occurrence-stats-ui branch from 326cd68 to 4ae69ec Compare May 21, 2026 00:52
mihow pushed a commit that referenced this pull request May 21, 2026
…ry params

- Rename `agreed_under_order_*` → `agreed_any_rank_*` to match the endpoint's
  dropped ORDER threshold (0565f06).
- Add optional `agreement_coarsest_rank` + `agreed_coarser_rank_*` fields to
  the response type (not consumed yet — UI follows in #1308).
- Widen `filters` to accept arrays and append repeated query params so
  multi-value filters (e.g. `algorithm`, `not_algorithm` — backend reads via
  `request.query_params.getlist(...)`) survive. Per CodeRabbit review.

Co-Authored-By: Claude <noreply@anthropic.com>
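
For illustration, the repeated-query-param behaviour described in the last bullet looks roughly like this in Python (the hook itself is TypeScript; `build_query` and the filter values are hypothetical). Each array element becomes its own `key=value` pair, which is the shape a `getlist(...)` call on the backend expects.

```python
from typing import Mapping, Sequence, Union
from urllib.parse import urlencode

FilterValue = Union[str, int, bool, Sequence[Union[str, int]]]


def build_query(filters: Mapping[str, FilterValue]) -> str:
    """Serialize filters, repeating the key once per value for list filters."""
    pairs = []
    for key, value in filters.items():
        if isinstance(value, (list, tuple)):
            # Multi-value filter: emit key=v1&key=v2 rather than one joined value.
            pairs.extend((key, str(item)) for item in value)
        elif isinstance(value, bool):
            pairs.append((key, "true" if value else "false"))
        else:
            pairs.append((key, str(value)))
    return urlencode(pairs)


query = build_query({"project_id": 18, "algorithm": [3, 7], "apply_defaults": False})
# query == "project_id=18&algorithm=3&algorithm=7&apply_defaults=false"
```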
@mihow mihow force-pushed the feat/occurrence-stats-ui branch 3 times, most recently from d621ac3 to 3692eba Compare May 21, 2026 01:13
@mihow mihow force-pushed the feat/occurrence-stats-ui branch from 3692eba to d0669ee Compare May 21, 2026 01:18
mihow added a commit that referenced this pull request May 22, 2026
useModelAgreement.ts belongs with the frontend consumer (#1308), not the
backend endpoint PR. Keeps #1307 backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
mihow added a commit that referenced this pull request May 22, 2026
Typed React Query wrapper for /occurrences/stats/model-agreement/.
Owned by this UI PR (#1308); the backend PR (#1307) is now backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/occurrence-stats-ui branch from 5e5252d to 50c5ff9 Compare May 22, 2026 04:36
mihow pushed a commit that referenced this pull request May 26, 2026
…ry params

- Rename `agreed_under_order_*` → `agreed_any_rank_*` to match the endpoint's
  dropped ORDER threshold (0565f06).
- Add optional `agreement_coarsest_rank` + `agreed_coarser_rank_*` fields to
  the response type (not consumed yet — UI follows in #1308).
- Widen `filters` to accept arrays and append repeated query params so
  multi-value filters (e.g. `algorithm`, `not_algorithm` — backend reads via
  `request.query_params.getlist(...)`) survive. Per CodeRabbit review.

Co-Authored-By: Claude <noreply@anthropic.com>
mihow added a commit that referenced this pull request May 26, 2026
useModelAgreement.ts belongs with the frontend consumer (#1308), not the
backend endpoint PR. Keeps #1307 backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/human-model-agreement-endpoint branch from f958a38 to c4a4171 Compare May 26, 2026 01:10
mihow added a commit that referenced this pull request May 26, 2026
Typed React Query wrapper for /occurrences/stats/model-agreement/.
Owned by this UI PR (#1308); the backend PR (#1307) is now backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/occurrence-stats-ui branch from 50c5ff9 to 1241967 Compare May 26, 2026 01:10
mihow added a commit that referenced this pull request May 26, 2026
Typed React Query wrapper for /occurrences/stats/model-agreement/.
Owned by this UI PR (#1308); the backend PR (#1307) is now backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/occurrence-stats-ui branch from 3a5e022 to ef2cf01 Compare May 26, 2026 19:58
mihow pushed a commit that referenced this pull request May 27, 2026
…ry params

- Rename `agreed_under_order_*` → `agreed_any_rank_*` to match the endpoint's
  dropped ORDER threshold (0565f06).
- Add optional `agreement_coarsest_rank` + `agreed_coarser_rank_*` fields to
  the response type (not consumed yet — UI follows in #1308).
- Widen `filters` to accept arrays and append repeated query params so
  multi-value filters (e.g. `algorithm`, `not_algorithm` — backend reads via
  `request.query_params.getlist(...)`) survive. Per CodeRabbit review.

Co-Authored-By: Claude <noreply@anthropic.com>
mihow added a commit that referenced this pull request May 27, 2026
useModelAgreement.ts belongs with the frontend consumer (#1308), not the
backend endpoint PR. Keeps #1307 backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/human-model-agreement-endpoint branch from 9347277 to e476333 Compare May 27, 2026 01:11
mihow added a commit that referenced this pull request May 27, 2026
Typed React Query wrapper for /occurrences/stats/model-agreement/.
Owned by this UI PR (#1308); the backend PR (#1307) is now backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/occurrence-stats-ui branch from ef2cf01 to 2391505 Compare May 27, 2026 01:12
@mihow mihow changed the title feat(ui): live stats panel in occurrence list sidebar Show stats panel in occurrence list sidebar May 27, 2026
@mihow mihow marked this pull request as draft May 27, 2026 13:20
mihow and others added 4 commits May 27, 2026 06:25
Pure-Python LCA over (taxon_id, rank, parents_json) tuples. Returns
the deepest shared TaxonRank or None. Used by the upcoming
human-model-agreement stat to bucket agreement at-or-finer-than ORDER.

Plan: docs/claude/planning/2026-05-14-human-model-agreement-endpoint.md
Side-research: docs/claude/planning/occurrence-filter-driven-exports.md

Co-Authored-By: Claude <noreply@anthropic.com>
… queryset

Pure aggregation; caller wires apply_default_filters + OccurrenceFilter.
Annotates best machine prediction, prefetches non-withdrawn identifications,
batches Taxon fetch for parents_json, buckets exact / under-order / above-order.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds HumanModelAgreementSerializer and the human_model_agreement action
on OccurrenceStatsViewSet. Extracts OccurrenceViewSet's filter backends +
filterset_fields into a module-level tuple so OccurrenceStatsViewSet can
reuse the same OccurrenceFilter pass-through (deployment, event, taxa lists,
verified, score thresholds, apply_defaults=false, etc).

The top_identifiers action keeps its current behavior — filter_queryset
is only invoked by actions that opt in.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds 6 HTTP-level tests: missing project_id 400, draft 404, empty zeros,
happy-path exact match, deployment filter pass-through, apply_defaults=false
score-threshold bypass.

Also adds DjangoFilterBackend to OccurrenceStatsViewSet.filter_backends so
filterset_fields (event, deployment, determination__rank, ...) actually take
effect. Without DjangoFilterBackend, filterset_fields are silently ignored
and ?deployment=N returns the unfiltered set.

Co-Authored-By: Claude <noreply@anthropic.com>
Michael Bunsen and others added 7 commits May 27, 2026 06:27
…ry params

- Rename `agreed_under_order_*` → `agreed_any_rank_*` to match the endpoint's
  dropped ORDER threshold (0565f06).
- Add optional `agreement_coarsest_rank` + `agreed_coarser_rank_*` fields to
  the response type (not consumed yet — UI follows in #1308).
- Widen `filters` to accept arrays and append repeated query params so
  multi-value filters (e.g. `algorithm`, `not_algorithm` — backend reads via
  `request.query_params.getlist(...)`) survive. Per CodeRabbit review.

Co-Authored-By: Claude <noreply@anthropic.com>
Session-scratchpad doc — belongs in local notes, not the merged branch.

Co-Authored-By: Claude <noreply@anthropic.com>
- 2026-05-14-human-model-agreement-endpoint.md — design narrative; superseded
  by code + PR description.
- occurrence-filter-driven-exports.md — side-research stub Copilot flagged as
  out-of-scope. Promoted to a PR-description follow-up item.

Co-Authored-By: Claude <noreply@anthropic.com>
create_detections assigns the classification taxon via .order_by("?"),
so the previous test picked a random machine taxon and then required a
sister species under the same genus. Random non-species picks (ORDER /
FAMILY / GENUS) have no sister, flaking ~50% of runs.

Pin both the machine prediction and the human ID to two fixed Vanessa
species, so the LCA is always GENUS (any-rank bucket, not exact) and the
test is deterministic.

Co-Authored-By: Claude <noreply@anthropic.com>
useModelAgreement.ts belongs with the frontend consumer (#1308), not the
backend endpoint PR. Keeps #1307 backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
Both derive from the verified_rows already in memory — no extra query.

- wilson_interval(): 95% Wilson score CI on agreed_exact_pct and
  agreed_any_rank_pct (agreed_*_ci_low / _ci_high). Wilson stays inside
  [0,1] and is honest at the small n typical of verified sets, where the
  normal approximation breaks down.
- cohens_kappa(): exact-taxon agreement beyond chance (cohens_kappa
  field, range [-1, 1]). Null when no doubly-classified occurrences or
  expected agreement is 1.0. Discounts the agreement you'd get for free
  in a project dominated by one common species.

Adds 5 nullable response fields. Backwards-compatible (additive only).
9 pure-Python unit tests + 2 HTTP field-presence tests.

Co-Authored-By: Claude <noreply@anthropic.com>
Both are generic statistical helpers — they don't depend on Django or any
domain model. Lifting them out of ami/main/models_future/occurrence.py so
other endpoints/jobs that need binomial CIs or chance-corrected agreement
can import them without dragging in the occurrence module.

Same implementations, just relocated. Renamed parameter names on
cohens_kappa from (human, model) to (rater_a, rater_b) so the helper
reads as generic rather than human-vs-model specific.

Tests already use isolated `from ami.utils.stats import …` imports
(updated all 9 sites in ami/main/tests.py).

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/human-model-agreement-endpoint branch from e476333 to 336c1fe Compare May 27, 2026 13:29
mihow added a commit that referenced this pull request May 27, 2026
Typed React Query wrapper for /occurrences/stats/model-agreement/.
Owned by this UI PR (#1308); the backend PR (#1307) is now backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/occurrence-stats-ui branch from 2391505 to 237a013 Compare May 27, 2026 13:29
mihow and others added 10 commits May 28, 2026 15:35
Adds ResponseSchemaMetadata (ami/base/metadata.py) — a SimpleMetadata
subclass that emits the response serializer's field schema (type, label,
help_text, bounds) under actions.GET. DRF's default SimpleMetadata only
emits field schema for write methods (POST / PUT), so read-only stats
endpoints previously returned only name + description on OPTIONS.

Wires it into OccurrenceStatsViewSet and passes serializer_class= to
each @action decorator so view.get_serializer() resolves to the
per-action response serializer during OPTIONS resolution.

Result: frontends can fetch OPTIONS once per stats endpoint and key
tooltips / labels by field name. Stat copy lives next to the serializer
definition; interpretation copy stays in the FE bundle next to the
visualization.

Documented in docs/claude/reference/api-stats-pattern.md.

Co-Authored-By: Claude <noreply@anthropic.com>
Identification.taxon is nullable — a comment-only verification has a
machine prediction but no human label to compare. Previously such rows
landed in the agreement denominator (verified_with_prediction_count)
but never in any numerator, silently dragging agreed_*_pct down.

Adds a comparable cohort: verified occurrences with BOTH a machine
prediction and a human taxon. All agreed_*_pct and the Wilson CIs now
divide by comparable_count instead of verified_with_prediction_count,
so numerator and denominator describe the same set. Cohen's kappa
already used this cohort (both_present_pairs), so it is unchanged.

Surfaces two new fields so consumers can see why comparable_count
differs from verified_count:
- comparable_count — denominator for agreed_*_pct
- verified_without_taxon_count — verified, has prediction, no human taxon

Co-Authored-By: Claude <noreply@anthropic.com>
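
A small sketch of how the cohorts in the commit above nest, using plain Python rows instead of the ORM. `VerifiedRow`, `agreement_counts`, and `agreed_exact_count` are illustrative names; only the `*_count` cohort fields and `agreed_exact_pct` come from the commit message, and returning 0.0 for an empty cohort is an assumption.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Union


class VerifiedRow(NamedTuple):
    human_taxon_id: Optional[int]    # None for a comment-only verification
    machine_taxon_id: Optional[int]  # None when there is no machine prediction


def agreement_counts(rows: Iterable[VerifiedRow]) -> Dict[str, Union[int, float]]:
    """Count the nested cohorts and divide exact matches by the comparable set."""
    verified = list(rows)
    with_prediction = [r for r in verified if r.machine_taxon_id is not None]
    comparable = [r for r in with_prediction if r.human_taxon_id is not None]
    exact = sum(1 for r in comparable if r.human_taxon_id == r.machine_taxon_id)
    return {
        "verified_count": len(verified),
        "verified_with_prediction_count": len(with_prediction),
        "verified_without_taxon_count": len(with_prediction) - len(comparable),
        "comparable_count": len(comparable),
        "agreed_exact_count": exact,
        # Assumption: an empty comparable cohort reports 0.0 rather than null.
        "agreed_exact_pct": exact / len(comparable) if comparable else 0.0,
    }
```

Dividing by `comparable_count` means a comment-only verification no longer lowers the rate, since it could never have counted as a match.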
Replaces the manual try/except rank parsing with a ChoiceField run
through SingleParamSerializer, matching the project's standard
boundary-validation pattern.

Closes a gap where ?agreement_coarsest_rank= (blank) silently no-opped
instead of returning the documented 400 for an invalid rank. DRF treats
blank fields in QueryDict (HTML) input as absent, so the value is passed
in a plain dict to force "" through validation. Unknown ranks and
UNKNOWN (absent from the choice list) also 400 at the boundary, and the
param stays case-insensitive via an explicit uppercase.

drf-spectacular reads the ChoiceField choices into the OpenAPI schema as
an enum, so /api/v2/docs/ now lists the valid rank values.

Co-Authored-By: Claude <noreply@anthropic.com>
successes > total (or negative) makes the variance term negative and
crashes deeper in math.sqrt with an opaque domain error. Since
wilson_interval is a public helper in ami/utils/stats, guard the inputs
and raise a clear ValueError at the boundary instead. No production
caller can currently hit this — agreed_* counts are always a subset of
the comparable denominator — but the helper shouldn't depend on that.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds an OccurrenceStats panel above the filter sections on the
occurrence list page. Consumes the /occurrences/stats/model-agreement/
endpoint, threading the same active filter array the list view sends so
the numbers always reflect the current result set.

Shows two metrics: verified occurrences % and human-model agreement
rate % (rank-level / under-order agreement).

Co-Authored-By: Claude <noreply@anthropic.com>
One-line field rename in the occurrence stats panel to match the backend's
dropped ORDER threshold. Hook type rename + multi-value filter support
landed on the base branch (4a92c0b on #1307).

Co-Authored-By: Claude <noreply@anthropic.com>
`StatBar` takes an optional `count` rendered as "0% (121)". Wired into the
Verified occurrences bar so a small-but-nonzero verified set that rounds to
0% still surfaces the underlying count.

Co-Authored-By: Claude <noreply@anthropic.com>
Typed React Query wrapper for /occurrences/stats/model-agreement/.
Owned by this UI PR (#1308); the backend PR (#1307) is now backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>
Two new horizontal bars below the existing verified / agreement-rate bars:

- 'Agreement 95% CI (Wilson)' — RangeBar showing the Wilson CI as a
  filled segment between low and high (wide bar = shaky number, narrow
  bar = tight). Value reads '87–97%'. '—' when no verified-with-pred set.
- 'Cohen's κ (beyond chance)' — SignedBar over [-1, 1] with the zero
  midpoint marked. Positive fills right, negative fills left. Value
  reads '0.41'. '—' when undefined (empty or single-category set).

Hook type extended with the five new fields (agreed_*_ci_low/high +
cohens_kappa). Loading skeleton bumped to 4 placeholders.

Co-Authored-By: Claude <noreply@anthropic.com>
…nline

Stats panel now renders three agreement bars side-by-side instead of one
generic agreement row plus a separate CI range bar:

- Agreement (exact taxon) — agreed_exact_*
- Agreement (any rank) — agreed_any_rank_* (LCA at any real rank)
- Agreement (≥ <rank>) — agreed_coarser_rank_* (only when the caller passes
  ?agreement_coarsest_rank=<RANK>; otherwise hidden)

Wilson 95% CI is folded into each agreement bar instead of sitting on its
own row. The bar is a single 0–100% track with:

- a translucent CI band (bg-primary/40) from low to high
- 2px-wide CI bound caps (whiskers) at low/high
- a 3px tall dark vertical marker for the point estimate

This puts the uncertainty visually adjacent to the number it qualifies —
the bar IS the CI, the marker IS the point — so the CI is no longer easy
to overlook. Each agreement row also surfaces raw counts ("90 of 100").

Cohen's κ keeps its existing signed bar.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/occurrence-stats-ui branch from 237a013 to 3ecc891 Compare May 28, 2026 22:44
mihow added a commit that referenced this pull request May 29, 2026
* feat(occurrence-stats): add lca_rank_between helper

Pure-Python LCA over (taxon_id, rank, parents_json) tuples. Returns
the deepest shared TaxonRank or None. Used by the upcoming
human-model-agreement stat to bucket agreement at-or-finer-than ORDER.

Plan: docs/claude/planning/2026-05-14-human-model-agreement-endpoint.md
Side-research: docs/claude/planning/occurrence-filter-driven-exports.md

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(occurrence-stats): aggregate human-model agreement over filtered queryset

Pure aggregation; caller wires apply_default_filters + OccurrenceFilter.
Annotates best machine prediction, prefetches non-withdrawn identifications,
batches Taxon fetch for parents_json, buckets exact / under-order / above-order.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(occurrence-stats): wire human-model-agreement action

Adds HumanModelAgreementSerializer and the human_model_agreement action
on OccurrenceStatsViewSet. Extracts OccurrenceViewSet's filter backends +
filterset_fields into a module-level tuple so OccurrenceStatsViewSet can
reuse the same OccurrenceFilter pass-through (deployment, event, taxa lists,
verified, score thresholds, apply_defaults=false, etc).

The top_identifiers action keeps its current behavior — filter_queryset
is only invoked by actions that opt in.

Co-Authored-By: Claude <noreply@anthropic.com>

* test(occurrence-stats): HTTP coverage for human-model-agreement action

Adds 6 HTTP-level tests: missing project_id 400, draft 404, empty zeros,
happy-path exact match, deployment filter pass-through, apply_defaults=false
score-threshold bypass.

Also adds DjangoFilterBackend to OccurrenceStatsViewSet.filter_backends so
filterset_fields (event, deployment, determination__rank, ...) actually take
effect. Without DjangoFilterBackend, filterset_fields are silently ignored
and ?deployment=N returns the unfiltered set.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(ui): useHumanModelAgreement hook for occurrence stats

Mirrors useTopIdentifiers's useAuthorizedQuery pattern. Accepts an
arbitrary filter map so the occurrence list page can thread its filter
state through unchanged (deployment, event, taxon, score thresholds,
apply_defaults).

Co-Authored-By: Claude <noreply@anthropic.com>

* docs(prompts): handoff for PR #1307 rework — rename + SQL push-down + review fixes

Captures: review findings from Copilot + CodeRabbit, perf bench evidence
(43k rows → 159s timeout on apply_defaults=false), and the planned changes
for the next session (rename to model-agreement, push aggregation into
SQL/ORM, fix UNKNOWN rank LCA + denominator + verified_by_me anon gap +
test gaps).

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor(occurrence-stats): rename to model-agreement + push aggregation to SQL

Addresses review feedback on PR #1307:

Rename (drop "human"):
- URL: /occurrences/stats/human-model-agreement/ -> /model-agreement/
- Function: human_model_agreement_for_project -> model_agreement_for_project
- Serializer: HumanModelAgreementSerializer -> ModelAgreementSerializer
- Viewset action + url_path: human_model_agreement -> model_agreement
- FE hook: useHumanModelAgreement -> useModelAgreement (file + symbol)
- FE type: Response -> ModelAgreementResponse (fixes DOM Response shadow)
- Test class: TestHumanModelAgreementForProject -> TestModelAgreementForProject

SQL push-down (Copilot+CodeRabbit perf flag):
- Replace list(qs) full-row materialization with annotated aggregate().
- Annotate best_user_taxon_id via Subquery over Identification
  (BEST_IDENTIFICATION_ORDER). Drop the prefetch + select_related("taxon")
  on identifications since only taxon_id is read.
- aggregate() Count(filter=Q(...)) for total/verified/exact/no-prediction.
- For under-order disagreement: group disagreement set by distinct
  (user_taxon, machine_taxon) pair before LCA. Each pair's LCA runs once.
- Bench against project 18 (43,149 occurrences): pre-rework apply_defaults=false
  curl timed out at 159s; post-rework 1.96s unfiltered / 3.4s with bypass
  (93,019 occurrences post-filter).

Denominator fix (Copilot):
- agreed_*_pct now divides by verified_with_prediction_count instead of
  verified_count. A verified occurrence with no machine prediction can't
  agree or disagree; including it in the denominator drags the rate down
  without representing actual model disagreement.
- Surface no_prediction_count + verified_with_prediction_count as sibling
  fields so consumers can see how many such occurrences exist.

UNKNOWN rank bug (Copilot):
- TaxonRank.UNKNOWN sorts after SPECIES in OrderedEnum definition order,
  so without explicit exclusion UNKNOWN >= ORDER is True and a shared
  UNKNOWN ancestor would wrongly count as under-order agreement. Filter
  UNKNOWN out of lca_rank_between's candidate ranks. Add regression test.
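
The ordering pitfall can be illustrated with a toy version of the helper. Everything here is a sketch under assumptions: the `TaxonRank` enum is a cut-down stand-in, the lineage is assumed to be a list of `(taxon_id, rank)` ancestors, and the taxon ids are invented. It is not the project's `lca_rank_between`.

```python
from enum import IntEnum
from typing import Dict, List, Optional, Tuple


class TaxonRank(IntEnum):
    # Stand-in enum: finer ranks compare greater, and UNKNOWN sorts last.
    ORDER = 1
    FAMILY = 2
    GENUS = 3
    SPECIES = 4
    UNKNOWN = 5


Ancestor = Tuple[int, TaxonRank]
Taxon = Tuple[int, TaxonRank, List[Ancestor]]  # (taxon_id, rank, parents)


def lca_rank_between(a: Taxon, b: Taxon) -> Optional[TaxonRank]:
    """Return the deepest rank shared by both lineages, ignoring UNKNOWN."""

    def lineage(taxon: Taxon) -> Dict[int, TaxonRank]:
        taxon_id, rank, parents = taxon
        return {
            tid: r
            for tid, r in [*parents, (taxon_id, rank)]
            if r != TaxonRank.UNKNOWN  # the explicit exclusion
        }

    lineage_a, lineage_b = lineage(a), lineage(b)
    shared = [rank for tid, rank in lineage_a.items() if tid in lineage_b]
    return max(shared) if shared else None


# Two sister species under one genus (ids are made up).
parents = [(5, TaxonRank.FAMILY), (10, TaxonRank.GENUS)]
species_a: Taxon = (101, TaxonRank.SPECIES, parents)
species_b: Taxon = (102, TaxonRank.SPECIES, parents)
```

Without the filter, a shared ancestor of rank UNKNOWN would win the `max` and pass any `>= ORDER` check, which is the failure mode the regression test guards against.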

Tests:
- New: test_unknown_rank_excluded_from_lca (LCA regression)
- New: test_agreement_under_order_bucket (HTTP coverage for sister-species
  case, previously only exact-match shortcut was exercised)
- Updated: happy-path asserts verified_with_prediction_count and
  no_prediction_count.

22/22 backend tests green:
  docker compose exec django python manage.py test \
    ami.main.tests.TestLcaRankBetween \
    ami.main.tests.TestModelAgreementForProject \
    ami.main.tests.TestOccurrenceStatsViewSet

Co-Authored-By: Claude <noreply@anthropic.com>

* docs(plan): add text lang to fenced block (markdownlint MD040)

Co-Authored-By: Claude <noreply@anthropic.com>

* perf(occurrence-stats): scope agreement subqueries to verified set

Replace the .aggregate() over the full filtered queryset with a two-step
approach:
  1. SQL Count('pk') for total_occurrences (no joins, no subqueries).
  2. Fetch the verified set (occurrences with at least one non-withdrawn
     ident) with both best_user_taxon_id and best_machine_prediction_taxon_id
     annotated, then bucket counts + LCA in Python.

Why: the previous version evaluated two correlated subqueries (best user
identification + best machine prediction) on every row of the filtered
queryset. For typical projects, >95% of occurrences have no identification
— those rows ran the user-ident subquery only to discover NULL, then ran
the (much more expensive) machine-prediction subquery on detections that
won't contribute to any agreement bucket. Scoping the subqueries to the
verified set avoids that waste.

Bench (cold, cache invalidated):

  Project                          Total    Verified   Pre      Post
  P#85 SEC-SEQ                     36,253   13,140     —        1.18s
  P#20 BCI                         40,958    1,351     —        0.92s
  P#84 Pennsylvania                18,407      251     —        0.56s
  P#24 Atlantic Forestry            2,797      274     —        0.50s
  P#18 Vermont                     43,149       45     ~928ms   0.35s
  P#23 Insectarium Montreal        20,393       74     —        0.43s

Warm via django-cachalot: 122–343ms across all projects.

For P#85 (highest absolute identification count in the system), the cost
is dominated by apply_default_filters' score-threshold join, not the
subqueries. apply_defaults=false actually runs faster (0.69s cold,
179,466 total / 13,140 verified) because the classification join is
skipped.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(occurrence-stats): drop ORDER threshold; add coarsest_rank query param

Replaces hardcoded `lca >= TaxonRank.ORDER` agreement gate with two layers:

- Always returned: `agreed_any_rank_*` — exact matches plus any non-null LCA
  at a real rank (UNKNOWN excluded). The upstream filter (e.g. a Lepidoptera
  include list) is what bounds the meaningful scope, not a hardcoded
  threshold in this function.
- Optional `?agreement_coarsest_rank=FAMILY`: when supplied, the response also
  includes `agreed_coarser_rank_*` (exact matches plus LCAs at the threshold
  rank or finer). The applied rank is echoed in `agreement_coarsest_rank`,
  which is null when the param is absent.

Also addresses CodeRabbit feedback on the existing branch:
- Dedupe base queryset before counting (joins from default-filter chain can
  inflate Occurrence rows).
- Bound `*_pct` FloatFields to [0.0, 1.0] in the serializer.

Param validation: invalid rank → 400; UNKNOWN rejected as not meaningful.
Tests cover any-rank fallback, threshold filtering, invalid + UNKNOWN
rejection, and threshold echo.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(ui): align model-agreement hook with BE rename + multi-value query params

- Rename `agreed_under_order_*` → `agreed_any_rank_*` to match the endpoint's
  dropped ORDER threshold (0565f06).
- Add optional `agreement_coarsest_rank` + `agreed_coarser_rank_*` fields to
  the response type (not consumed yet — UI follows in #1308).
- Widen `filters` to accept arrays and append repeated query params so
  multi-value filters (e.g. `algorithm`, `not_algorithm` — backend reads via
  `request.query_params.getlist(...)`) survive. Per CodeRabbit review.
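
The hook itself is TypeScript; as a rough Python analogue of the same idea, the standard library can repeat a key once per list value, and a parser that keeps lists plays the role of `getlist(...)`. The filter names and values here are illustrative:

```python
from urllib.parse import parse_qs, urlencode


def build_query(filters: dict) -> str:
    """Serialize filters, repeating the key once per value for list entries."""
    return urlencode(filters, doseq=True)


# Hypothetical filter values; "algorithm" carries two values.
query = build_query({"taxon": "42", "algorithm": ["3", "7"]})

# parse_qs keeps every repeated value, similar to query_params.getlist(...).
parsed = parse_qs(query)

# Without doseq the list is stringified into one value, which loses the
# second algorithm on the receiving side.
collapsed = urlencode({"algorithm": ["3", "7"]})
```

Here `query` is `taxon=42&algorithm=3&algorithm=7`, and `parsed["algorithm"]` holds both values.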

Co-Authored-By: Claude <noreply@anthropic.com>

* chore(docs): drop NEXT_SESSION_PROMPT.md from PR

Session-scratchpad doc — belongs in local notes, not the merged branch.

Co-Authored-By: Claude <noreply@anthropic.com>

* chore(docs): drop session-scratchpad planning docs from PR

- 2026-05-14-human-model-agreement-endpoint.md — design narrative; superseded
  by code + PR description.
- occurrence-filter-driven-exports.md — side-research stub Copilot flagged as
  out-of-scope. Promoted to a PR-description follow-up item.

Co-Authored-By: Claude <noreply@anthropic.com>

* test(occurrence-stats): make any-rank bucket test deterministic

create_detections assigns the classification taxon via .order_by("?"),
so the previous test picked a random machine taxon and then required a
sister species under the same genus. Random non-species picks (ORDER /
FAMILY / GENUS) have no sister, flaking ~50% of runs.

Pin both the machine prediction and the human ID to two fixed Vanessa
species, so the LCA is always GENUS (any-rank bucket, not exact) and the
test is deterministic.

Co-Authored-By: Claude <noreply@anthropic.com>

* chore(occurrence-stats): move FE hook to UI PR #1308

useModelAgreement.ts belongs with the frontend consumer (#1308), not the
backend endpoint PR. Keeps #1307 backend-only.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(occurrence-stats): add Wilson CI + Cohen's kappa to model-agreement

Both derive from the verified_rows already in memory — no extra query.

- wilson_interval(): 95% Wilson score CI on agreed_exact_pct and
  agreed_any_rank_pct (agreed_*_ci_low / _ci_high). Wilson stays inside
  [0,1] and is honest at the small n typical of verified sets, where the
  normal approximation breaks down.
- cohens_kappa(): exact-taxon agreement beyond chance (cohens_kappa
  field, range [-1, 1]). Null when no doubly-classified occurrences or
  expected agreement is 1.0. Discounts the agreement you'd get for free
  in a project dominated by one common species.
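
A minimal sketch of the kappa calculation as described, including the two null cases. The function name and signature are assumptions for illustration, not the code in this commit:

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def kappa_sketch(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> Optional[float]:
    """Cohen's kappa for paired labels; None when it is undefined."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    n = len(rater_a)
    if n == 0:
        return None  # no doubly-classified items
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return None  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# One dominant species: 90% raw agreement, but none of it beyond chance.
human = ["A"] * 9 + ["B"]
model = ["A"] * 10
dominated = kappa_sketch(human, model)
```

In the dominated example the raw agreement is 0.9 and the expected agreement is also 0.9, so kappa comes out at 0.0.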

Adds 5 nullable response fields. Backwards-compatible (additive only).
9 pure-Python unit tests + 2 HTTP field-presence tests.

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor(stats): move wilson_interval + cohens_kappa to ami/utils/stats

Both are generic statistical helpers: they don't depend on Django or any
domain model. Lift them out of ami/main/models_future/occurrence.py so
other endpoints/jobs that need binomial CIs or chance-corrected agreement
can import them without dragging in the occurrence module.

Same implementations, just relocated. Renamed the cohens_kappa parameters
from (human, model) to (rater_a, rater_b) so the helper reads as generic
rather than human-vs-model specific.

Tests already use isolated `from ami.utils.stats import …` imports
(updated all 9 sites in ami/main/tests.py).

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(stats): expose response schema via OPTIONS metadata

Adds ResponseSchemaMetadata (ami/base/metadata.py) — a SimpleMetadata
subclass that emits the response serializer's field schema (type, label,
help_text, bounds) under actions.GET. DRF's default SimpleMetadata only
emits field schema for write methods (POST / PUT), so read-only stats
endpoints previously returned only name + description on OPTIONS.

Wires it into OccurrenceStatsViewSet and passes serializer_class= to
each @action decorator so view.get_serializer() resolves to the
per-action response serializer during OPTIONS resolution.

Result: frontends can fetch OPTIONS once per stats endpoint and key
tooltips / labels by field name. Stat copy lives next to the serializer
definition; interpretation copy stays in the FE bundle next to the
visualization.

Documented in docs/claude/reference/api-stats-pattern.md.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(stats): exclude taxon-less verifications from agreement denominator

Identification.taxon is nullable — a comment-only verification has a
machine prediction but no human label to compare. Previously such rows
landed in the agreement denominator (verified_with_prediction_count)
but never in any numerator, silently dragging agreed_*_pct down.

Adds a comparable cohort: verified occurrences with BOTH a machine
prediction and a human taxon. All agreed_*_pct and the Wilson CIs now
divide by comparable_count instead of verified_with_prediction_count,
so numerator and denominator describe the same set. Cohen's kappa
already used this cohort (both_present_pairs), so it is unchanged.

Surfaces two new fields so consumers can see why comparable_count
differs from verified_count:
- comparable_count — denominator for agreed_*_pct
- verified_without_taxon_count — verified, has prediction, no human taxon
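
A small in-memory sketch of how these cohorts nest, assuming each row carries a verified flag plus optional human and model taxa. The `Row` type and helper are hypothetical; the real endpoint computes this in the database:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    verified: bool
    human_taxon: Optional[str]  # None for a comment-only verification
    model_taxon: Optional[str]  # None when there is no machine prediction


def cohort_counts(rows: List[Row]) -> dict:
    """Count the nested cohorts and divide agreement by the comparable set."""
    verified = [r for r in rows if r.verified]
    with_prediction = [r for r in verified if r.model_taxon is not None]
    comparable = [r for r in with_prediction if r.human_taxon is not None]
    agreed_exact = sum(r.human_taxon == r.model_taxon for r in comparable)
    return {
        "verified_count": len(verified),
        "verified_with_prediction_count": len(with_prediction),
        "comparable_count": len(comparable),
        "verified_without_taxon_count": len(with_prediction) - len(comparable),
        "agreed_exact_pct": agreed_exact / len(comparable) if comparable else None,
    }


rows = [
    Row(True, "A", "A"),
    Row(True, "B", "B"),
    Row(True, "A", "B"),
    Row(True, None, "A"),   # comment-only: has a prediction, no human taxon
    Row(True, "A", None),   # verified, but nothing to compare against
    Row(False, None, "A"),  # not verified
]
counts = cohort_counts(rows)
```

With these toy rows the exact agreement is 2 of 3 comparable rows; dividing by the 4 verified rows with a prediction would have reported 0.5 instead.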

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(stats): validate agreement_coarsest_rank via ChoiceField

Replaces the manual try/except rank parsing with a ChoiceField run
through SingleParamSerializer, matching the project's standard
boundary-validation pattern.

Closes a gap where ?agreement_coarsest_rank= (blank) silently no-opped
instead of returning the documented 400 for an invalid rank. DRF treats
blank fields in QueryDict (HTML) input as absent, so the value is passed
in a plain dict to force "" through validation. Unknown ranks and
UNKNOWN (absent from the choice list) also 400 at the boundary, and the
param stays case-insensitive via an explicit uppercase.

drf-spectacular reads the ChoiceField choices into the OpenAPI schema as
an enum, so /api/v2/docs/ now lists the valid rank values.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(stats): wilson_interval rejects successes outside [0, total]

successes > total (or negative) makes the variance term negative and
crashes deeper in math.sqrt with an opaque domain error. Since
wilson_interval is a public helper in ami/utils/stats, guard the inputs
and raise a clear ValueError at the boundary instead. No production
caller can currently hit this — agreed_* counts are always a subset of
the comparable denominator — but the helper shouldn't depend on that.
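
A sketch of what such a guard can look like on a standard Wilson score interval. The function name, the error messages and the choice to reject `total <= 0` are assumptions for illustration, not the helper in ami/utils/stats:

```python
import math
from typing import Tuple


def wilson_interval_sketch(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion, with input guards."""
    if total <= 0:
        # Assumption for this sketch: an empty sample is rejected outright.
        raise ValueError(f"total must be positive, got {total}")
    if not 0 <= successes <= total:
        # Without this check the term under the square root can go negative
        # and math.sqrt fails with an opaque "math domain error".
        raise ValueError(f"successes must be in [0, {total}], got {successes}")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to absorb floating-point drift at the edges.
    return max(0.0, center - half), min(1.0, center + half)
```

Calling `wilson_interval_sketch(5, 3)` or `wilson_interval_sketch(-1, 10)` raises a `ValueError` that names the offending argument.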

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Michael Bunsen <michael@mixedneeds.com>
Base automatically changed from feat/human-model-agreement-endpoint to main May 29, 2026 03:54

netlify Bot commented Jun 18, 2026


Deploy Preview for antenna-ssec ready!

Name Link
🔨 Latest commit d3c6acf
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6a3359e0b9dad100080e1521
😎 Deploy Preview https://deploy-preview-1308--antenna-ssec.netlify.app

mihow and others added 2 commits June 17, 2026 18:40
- Fix missing gray track background on the bars (bg-muted rendered
  near-invisible; switch to bg-border to match the slider component)
- Shrink metric labels to body-overline-small with an InfoTooltip beside
  each, aligning the Stats panel with the filter controls
- Collapse the detailed metrics (exact taxon, coarser rank, Cohen's kappa)
  behind a "More detail" toggle, closed by default; keep Verified
  occurrences and Agreement (any rank) always visible
- Clarify the verified-vs-denominator gap: show "N of M have a model
  prediction to compare against" under Verified occurrences, and explain
  the agreement denominator in each tooltip

Co-Authored-By: Claude <noreply@anthropic.com>

Iterate on the occurrence stats panel per review feedback:

- Use one simple bar shape (gray track + primary fill) for every metric;
  drop the separate CI whisker visualization
- Layer a translucent diagonal-hatch band over the 95% CI range on the
  agreement bars so the uncertainty reads as "fuzzy" without a second chart
- Show the CI range as the agreement headline (e.g. "83-94%"); move the
  point estimate and exact counts into the info tooltip
- Make all tooltips dynamic and route them through the (i) icon, including
  the verified-vs-prediction denominator note
- Reorder: exact-taxon agreement above the fold, any-rank under "More"
- Show "<1%" instead of "0%" when the count is non-zero but rounds down
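
The panel is TypeScript; as a language-neutral illustration of the last bullet, here is a Python sketch of the rounding rule. The function name and the `total <= 0` fallback are assumptions:

```python
def format_share(count: int, total: int) -> str:
    """Format count/total as a whole percent, never hiding a non-zero count."""
    if total <= 0:
        return "0%"  # assumption: an empty result set renders as 0%
    pct = round(100 * count / total)
    if pct == 0 and count > 0:
        return "<1%"  # non-zero, but rounds down to zero
    return f"{pct}%"
```

For example, 45 verified out of 43,149 occurrences renders as `<1%` rather than `0%`, while a true zero still renders as `0%`.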

Co-Authored-By: Claude <noreply@anthropic.com>

mihow commented Jun 18, 2026

Collaborator Author

Claude says: While iterating on the panel UI, Anna noticed the agreement numbers look high and the confidence intervals look narrow even when only a small fraction of occurrences are verified. I looked into why. There are two separate things going on, and one of them is a measurement issue worth fixing in the endpoint (merged in #1307), not in this UI PR.

1. The narrow CI is correct, and it is not about the sampling fraction

CI width is driven by the absolute size of the comparable set — verified occurrences that have both a human taxon and a model prediction — not by what percentage of all occurrences are verified. On a sample project with roughly 24k occurrences and about 100 comparable verifications, n ≈ 100 gives roughly ±6% at 95%, which is genuinely narrow. The fact that only ~0.4% of the population is verified does not widen it: the Wilson interval assumes an effectively infinite population, and a finite-population correction would make it narrower, not wider.

A higher confidence level does widen the band — that is the z constant (WILSON_Z_95 = 1.96 in ami/utils/stats.py). Using 99% (z ≈ 2.576) would turn 83–94% into roughly 80–96%. But that only widens the band; it does not explain why the point estimate is high.
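
To make the effect of z concrete, here is a minimal sketch of the standard Wilson score formula evaluated at both levels. It assumes 90 agreements out of 100 comparable verifications, counts inferred from the rounded figures in this comment rather than read from the endpoint, and the helper name is hypothetical:

```python
import math
from typing import Tuple


def wilson(successes: int, total: int, z: float) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion at a given z."""
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


# Assumed counts: 90 exact matches out of 100 comparable verifications.
low95, high95 = wilson(90, 100, 1.96)   # 95% level
low99, high99 = wilson(90, 100, 2.576)  # 99% level
```

Under those assumed counts the 95% interval rounds to 83–94%, and the 99% interval is wider on both sides, about 80% to 95%, in line with the rough figure quoted above. The center of the band barely moves, which is why changing the confidence level does nothing about a high point estimate.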

2. The high agreement appears inflated by accept-the-suggestion verifications

About half of the verifications in the sample project are users accepting the model's prediction (Identification.agreed_with_prediction is set). For those, the human taxon equals the model's predicted taxon by construction, so they count as an exact match automatically. model_agreement_for_project currently takes the best non-withdrawn identification's taxon without distinguishing independent identifications from accept-the-suggestion ones, so this circularity inflates exact agreement, any-rank agreement, and Cohen's κ.

Measured on the sample project, excluding accept-the-suggestion identifications and keeping only independent human IDs:

  Cohort                              Exact agreement   n     95% CI
  As shipped (all identifications)    90%               100   83–94%
  Independent identifications only    38%               16    18–61%

So genuine independent human-vs-model agreement is substantially lower, with a wide CI — which matches the intuition that a small verified set should be uncertain. The narrow 83–94% came from the inflated n. Selection bias (verifiers tending to confirm easy detections) pushes in the same direction, but the accept-the-suggestion circularity is the larger and more fixable effect. These are measured numbers from staging data, so the interpretation (circularity is the cause) is well supported, though the exact split will vary by project.

Suggested implementation

In model_agreement_for_project (ami/main/models_future/occurrence.py), exclude identifications that merely agreed with the model prediction from the human side of the comparison:

best_user_ident = Identification.objects.filter(
    occurrence=OuterRef("pk"),
    withdrawn=False,
    agreed_with_prediction__isnull=True,   # exclude accept-the-suggestion IDs
).order_by(*BEST_IDENTIFICATION_ORDER)

This makes agreement measure independent confirmation. It affects agreed_exact, agreed_any_rank, the coarser-rank variant, and cohens_kappa, since they all derive from best_user_taxon_id.
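
To illustrate the circularity in isolation, here is a pure-Python sketch on made-up data. It ignores the best-identification ordering and the Django query entirely; the `Ident` type and its values are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Ident:
    human_taxon: str
    model_taxon: str
    # Set when the user accepted the model's suggestion; None if independent.
    agreed_with_prediction: Optional[int] = None


def exact_agreement(idents: List[Ident], independent_only: bool = False) -> Optional[Tuple[int, int]]:
    """Return (matches, n) for the chosen cohort, or None if it is empty."""
    pool = [
        i for i in idents
        if not (independent_only and i.agreed_with_prediction is not None)
    ]
    if not pool:
        return None
    matches = sum(i.human_taxon == i.model_taxon for i in pool)
    return matches, len(pool)


# Made-up data: four accepted suggestions (matches by construction)
# and four independent identifications, one of which matches.
idents = (
    [Ident("A", "A", agreed_with_prediction=n) for n in range(1, 5)]
    + [Ident("A", "A"), Ident("B", "A"), Ident("C", "A"), Ident("B", "C")]
)
inclusive = exact_agreement(idents)
independent = exact_agreement(idents, independent_only=True)
```

On this toy set the inclusive cohort shows 5 of 8 exact matches while the independent cohort shows 1 of 4, and the smaller n is also what widens the interval.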

A few things worth deciding before implementing:

  • Do we want a single independent-only number, or do we report both (independent vs. inclusive) so the accept rate stays visible? Reporting both is more transparent but adds fields to the response.
  • Should "agreed with another human identification" (agreed_with_identification) be treated the same way? In the sample it was 0, but it raises the same independence question.
  • Is it worth exposing the confidence level as a parameter, in case anyone wants 99%?

Happy to open a follow-up PR against the endpoint with the filter plus a test, since #1307 is already merged. Flagging it here for visibility on the UI work.

mihow and others added 2 commits June 17, 2026 19:15
Co-Authored-By: Claude <noreply@anthropic.com>

The agreement bar drew a solid fill to the point estimate and layered the
hatch on top. When the point estimate sat near the upper CI bound (e.g.
21-100%), the solid fill covered the whole CI band and the blue-on-blue
hatch was invisible. Now the solid fill stops at the lower CI bound and the
hatch covers the full CI range over the gray track, so it reads as 'fuzzy'
regardless of where the estimate lands.

Co-Authored-By: Claude <noreply@anthropic.com>