Add evaluation harness and hardness knob for detector benchmarking #16

Open
fabio-rovai wants to merge 1 commit into SantanderAI:main from fabio-rovai:benchmark-eval-and-hardness

@fabio-rovai


Add an evaluation harness and a hardness knob

Why

The README describes this project as a way to "benchmark graph-based fraud
detection models". Today the generator produces data but ships no way to score a
detector against the ground truth, and the default data is easy to separate by
trivial heuristics: every fraud edge carries the exact same sentinel amount
(9999.00), rings are disjoint, and there are no legitimate cycles. A naive
"flag every account on a high-value cycle" baseline scores near-perfectly on the
defaults, so the generator cannot currently tell a good detector from a bad one.
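
To make the naive baseline concrete, here is a minimal sketch of a "flag every account on a high-value cycle" detector. It is not part of this PR: the function name, the in-memory edge tuples and the `min_amount` parameter are assumptions for illustration, and only the 9999.00 sentinel comes from the description above.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical edge shape: (source account, destination account, amount).
Edge = Tuple[str, str, float]


def flag_high_value_cycles(edges: Iterable[Edge], min_amount: float = 9999.00) -> Set[str]:
    """Flag every account that lies on a directed cycle of high-value edges."""
    # Keep only edges at or above the threshold.
    graph: Dict[str, Set[str]] = {}
    for src, dst, amount in edges:
        if amount >= min_amount:
            graph.setdefault(src, set()).add(dst)

    def on_cycle(start: str) -> bool:
        # Depth-first search from the successors of `start`; reaching `start`
        # again means it sits on a cycle.
        stack = list(graph[start])
        seen: Set[str] = set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, ()))
        return False

    return {node for node in graph if on_cycle(node)}
```

With a single sentinel amount, disjoint rings and no legitimate cycles, a heuristic of this shape has nothing to confuse it, which is the gap the hardness knob is meant to close.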

This PR closes both gaps. The changes are additive and backward compatible
(defaults unchanged).

What

1. Evaluation harness (evaluate.py, new).
Given a generated dataset directory and a set of detector-flagged account ids (a
plain list or a CSV), it computes precision, recall and F1 at two granularities:

  • account level (did the detector flag the accounts that take part in a ring), and
  • ring level (a ring counts as detected when at least a fraction
    ring_threshold of its accounts is flagged; the default, 1.0, requires all
    of them).

It reads ground truth from fraud/fraud_cases.csv, reports a confusion summary
(with true negatives when an accounts/ directory is present), and exposes a CLI
via a new gen-fraud-graph-evaluate console script (--data, --flagged,
--ring-threshold, --json).
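
The two granularities can be sketched as follows. This is an illustration under stated assumptions, not the code in evaluate.py: the helper names and the in-memory `rings` mapping (ring id to member account ids) are hypothetical, and only ring-level recall is shown because the description does not say how a false-positive ring is defined.

```python
from typing import Dict, Iterable, Set


def prf(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall and F1 from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def account_level(rings: Dict[str, Set[str]], flagged: Iterable[str]) -> Dict[str, float]:
    """Score flagged accounts against the union of all ring members."""
    flagged_set = set(flagged)
    fraud_accounts: Set[str] = set().union(*rings.values())
    tp = len(flagged_set & fraud_accounts)
    fp = len(flagged_set - fraud_accounts)
    fn = len(fraud_accounts - flagged_set)
    return prf(tp, fp, fn)


def ring_recall(rings: Dict[str, Set[str]], flagged: Iterable[str],
                ring_threshold: float = 1.0) -> float:
    """Share of rings whose flagged fraction reaches ring_threshold."""
    flagged_set = set(flagged)
    detected = sum(
        1
        for members in rings.values()
        if members and len(members & flagged_set) / len(members) >= ring_threshold
    )
    return detected / len(rings) if rings else 0.0
```

For example, with rings `{"r1": {"a", "b"}, "r2": {"c", "d"}}` and flagged accounts `a`, `b`, `c`, `x`, three of four flags are correct and three of four fraud accounts are found, while only `r1` counts as detected at the default threshold.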

2. Hardness knob (--hardness {low,medium,high}).
A difficulty preset, wired through Config, that makes the data harder to
separate by trivial heuristics:

  • amount_jitter: spread fraud amounts around the sentinel so the amount alone
    is no longer a giveaway,
  • ring_overlap: let rings share accounts (overlapping rings, not just disjoint
    ones),
  • decoy_ratio: inject legitimate high-value cycles into the normal transaction
    stream (transactions/transactions_decoy.csv, never written to
    fraud_cases.csv), so pure amount-thresholding and pure cycle-topology each
    fail.

low (the default) reproduces the original behaviour exactly, so existing users
see no change.
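
As a rough illustration of how two of these knobs could act on the data, the sketch below jitters an amount around the sentinel and builds one decoy cycle. It is not the generator's implementation: the function names are hypothetical, and treating amount_jitter as a relative fraction of the sentinel is an assumption, since the description does not define its units.

```python
import random
from typing import List, Sequence, Tuple

SENTINEL = 9999.00  # fraud amount used by the current defaults


def jitter_amount(rng: random.Random, amount_jitter: float) -> float:
    """Scale the sentinel by a uniform factor in [1 - jitter, 1 + jitter].

    Assumption: amount_jitter is a relative fraction (0.2 means +/- 20%).
    """
    factor = 1.0 + rng.uniform(-amount_jitter, amount_jitter)
    return round(SENTINEL * factor, 2)


def decoy_cycle(rng: random.Random, accounts: Sequence[str], length: int,
                amount_jitter: float) -> List[Tuple[str, str, float]]:
    """Build one legitimate high-value cycle over `length` distinct accounts.

    Each edge is (source, destination, amount). The cycle looks like a fraud
    ring in both topology and amount, but would never be written to the
    ground-truth file.
    """
    members = rng.sample(list(accounts), length)
    return [
        (members[i], members[(i + 1) % length], jitter_amount(rng, amount_jitter))
        for i in range(length)
    ]
```

With a jitter of 0.0 the amount stays at the sentinel, which mirrors how low hardness leaves the original output untouched.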

Tests

tests/test_evaluate.py adds 12 tests: exact precision/recall/F1 on a hand-built
fixture, ring-threshold behaviour, the loaders, an end-to-end evaluate_dataset,
the hardness presets, and a generator smoke test asserting that high hardness
writes decoy cycles and jitters fraud amounts while low stays backward
compatible. The full suite (54 tests) passes.

Notes

  • Backward compatible: defaults are unchanged, no existing module or output is
    modified.
  • Contributor License Agreement: I understand this repository requires signing
    the Santander CLA before merge, and I will sign it on this PR.

The README frames this generator as a way to benchmark fraud detectors, but it
ships no scoring and the default data is easy to separate by trivial heuristics
(single sentinel fraud amount, disjoint rings, no legitimate cycles).

This change is additive and backward compatible (low hardness reproduces the
original behaviour exactly):

- evaluate.py + gen-fraud-graph-evaluate CLI: precision/recall/F1 at account and
  ring level against fraud_cases.csv, with a confusion summary.
- --hardness {low,medium,high}: jitters fraud amounts, overlaps rings, and
  injects decoy legitimate high-value cycles so amount-thresholding and
  cycle-topology each fail.
- tests/test_evaluate.py: 12 tests (exact metrics on a fixture, ring threshold,
  loaders, hardness presets, generator smoke test). Full suite 54 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fabio-rovai fabio-rovai requested a review from a team as a code owner June 23, 2026 20:39
@github-actions

github-actions Bot commented Jun 23, 2026


All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@fabio-rovai

Author

recheck
