Add evaluation harness and hardness knob for detector benchmarking by fabio-rovai · Pull Request #16 · SantanderAI/gen-fraud-graph

fabio-rovai · 2026-06-23T20:39:53Z

Add an evaluation harness and a hardness knob

Why

The README describes this project as a way to "benchmark graph-based fraud
detection models". Today the generator produces data but ships no way to score a
detector against the ground truth, and the default data is easy to separate by
trivial heuristics: every fraud edge carries the exact same sentinel amount
(9999.00), rings are disjoint, and there are no legitimate cycles. A naive
"flag every account on a high-value cycle" baseline scores near-perfectly on the
defaults, so the generator cannot currently tell a good detector from a bad one.
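
To make the baseline concrete, here is a minimal sketch of a "flag every account on a high-value cycle" heuristic. It is not code from this repository: the function name, the `(src, dst, amount)` edge tuples, and the `min_amount` parameter are assumptions made for illustration.

```python
from collections import defaultdict


def flag_high_value_cycle_accounts(edges, min_amount=9999.00):
    """Return accounts that sit on a directed cycle of high-value edges.

    `edges` is an iterable of (src, dst, amount) tuples (hypothetical shape).
    Only edges with amount >= min_amount are kept before looking for cycles.
    """
    graph = defaultdict(set)
    for src, dst, amount in edges:
        if amount >= min_amount:
            graph[src].add(dst)

    def on_cycle(start):
        # Iterative DFS: is `start` reachable from its own successors?
        stack = list(graph[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, ()))
        return False

    return {node for node in list(graph) if on_cycle(node)}


edges = [
    ("a", "b", 9999.00), ("b", "c", 9999.00), ("c", "a", 9999.00),  # ring
    ("x", "a", 9999.00),                                            # feeds the ring, not on it
    ("d", "e", 120.00), ("e", "d", 80.00),                          # low-value cycle
]
flagged = flag_high_value_cycle_accounts(edges)
```

On default-style data, where fraud edges share one sentinel amount and legitimate traffic has no cycles, a rule of this shape is enough to recover the rings, which is the problem described above.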

This PR closes both gaps. The changes are additive and backward compatible
(defaults unchanged).

What

1. Evaluation harness (evaluate.py, new).
Given a generated dataset directory and a set of detector-flagged account ids (a
plain list or a CSV), it computes precision, recall and F1 at two granularities:

- account level (did the detector flag the accounts that take part in a ring), and
- ring level (a ring counts as detected when at least a fraction ring_threshold
  of its accounts are flagged; the default 1.0 requires all of them).

It reads ground truth from fraud/fraud_cases.csv, reports a confusion summary
(with true negatives when an accounts/ directory is present), and exposes a CLI
via a new gen-fraud-graph-evaluate console script (--data, --flagged,
--ring-threshold, --json).
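
The two granularities can be sketched roughly as below. This is an illustration and not the contents of evaluate.py: the function names and the `rings` mapping (ring id to set of account ids) are assumptions, and only ring-level recall is shown because the description above does not spell out how ring-level precision is defined.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1, returning 0.0 for empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def account_level_metrics(rings, flagged):
    """Score flagged accounts against the union of all ring members."""
    truth = set().union(*rings.values()) if rings else set()
    tp = len(truth & flagged)
    fp = len(flagged - truth)
    fn = len(truth - flagged)
    return precision_recall_f1(tp, fp, fn)


def detected_rings(rings, flagged, ring_threshold=1.0):
    """A ring is detected when the flagged share of its members >= ring_threshold."""
    return {
        ring_id
        for ring_id, members in rings.items()
        if members and len(members & flagged) / len(members) >= ring_threshold
    }


def ring_level_recall(rings, flagged, ring_threshold=1.0):
    """Fraction of ground-truth rings that count as detected."""
    if not rings:
        return 0.0
    return len(detected_rings(rings, flagged, ring_threshold)) / len(rings)


# Hand-built example: two rings, one fully flagged, one half flagged, one false alarm.
rings = {"r1": {"a", "b", "c"}, "r2": {"d", "e"}}
flagged = {"a", "b", "c", "d", "x"}
acc_p, acc_r, acc_f1 = account_level_metrics(rings, flagged)
strict = ring_level_recall(rings, flagged)                       # r2 is only half flagged
lenient = ring_level_recall(rings, flagged, ring_threshold=0.5)  # half is now enough
```

In this example the account-level view credits the detector for four of five ring accounts, while the strict ring-level view counts only r1, which is why reporting both granularities is useful.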

2. Hardness knob (--hardness {low,medium,high}).
A difficulty preset, wired through Config, that makes the data harder to
separate by trivial heuristics:

- amount_jitter: spread fraud amounts around the sentinel so the amount alone
  is no longer a giveaway,
- ring_overlap: let rings share accounts (overlapping rings, not just disjoint
  ones),
- decoy_ratio: inject legitimate high-value cycles into the normal transaction
  stream (transactions/transactions_decoy.csv, never written to
  fraud_cases.csv), so pure amount-thresholding and pure cycle-topology each
  fail.

low (the default) reproduces the original behaviour exactly, so existing users
see no change.
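
As a rough illustration of how a preset could map to the three parameters, consider the sketch below. The numeric values, the `HardnessPreset` class, the `jittered_amount` helper, and the choice of relative uniform jitter are all hypothetical; the actual presets live in Config and may differ. The only property taken from this PR is that low leaves the sentinel amount untouched.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class HardnessPreset:
    amount_jitter: float  # assumed: relative spread around the sentinel amount
    ring_overlap: float   # assumed: share of ring accounts reused across rings
    decoy_ratio: float    # assumed: decoy legitimate cycles per fraud ring


# Hypothetical values, chosen only to show the shape of the mapping.
PRESETS = {
    "low": HardnessPreset(amount_jitter=0.0, ring_overlap=0.0, decoy_ratio=0.0),
    "medium": HardnessPreset(amount_jitter=0.10, ring_overlap=0.10, decoy_ratio=0.25),
    "high": HardnessPreset(amount_jitter=0.30, ring_overlap=0.25, decoy_ratio=1.0),
}

SENTINEL_AMOUNT = 9999.00


def jittered_amount(rng, preset, sentinel=SENTINEL_AMOUNT):
    """Return a fraud amount; with zero jitter this is exactly the sentinel."""
    if preset.amount_jitter == 0.0:
        return sentinel
    spread = sentinel * preset.amount_jitter
    return round(rng.uniform(sentinel - spread, sentinel + spread), 2)


rng = random.Random(0)  # seeded so runs are reproducible
low_amounts = [jittered_amount(rng, PRESETS["low"]) for _ in range(5)]
high_amounts = [jittered_amount(rng, PRESETS["high"]) for _ in range(5)]
```

With zero jitter the helper short-circuits and never consumes random numbers, which is one simple way to keep the low preset identical to the original output.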

Tests

tests/test_evaluate.py adds 12 tests: exact precision/recall/F1 on a hand-built
fixture, ring-threshold behaviour, the loaders, an end-to-end evaluate_dataset,
the hardness presets, and a generator smoke test asserting that high hardness
writes decoy cycles and jitters fraud amounts while low stays backward
compatible. The full suite (54 tests) passes.

Notes

- Backward compatible: defaults are unchanged, no existing module or output is
  modified.
- Contributor License Agreement: I understand this repository requires signing
  the Santander CLA before merge, and I will sign it on this PR.

The README frames this generator as a way to benchmark fraud detectors, but it ships no scoring and the default data is easy to separate by trivial heuristics (single sentinel fraud amount, disjoint rings, no legitimate cycles).

This change is additive and backward compatible (low hardness reproduces the original behaviour exactly):

- evaluate.py + gen-fraud-graph-evaluate CLI: precision/recall/F1 at account and ring level against fraud_cases.csv, with a confusion summary.
- --hardness {low,medium,high}: jitters fraud amounts, overlaps rings, and injects decoy legitimate high-value cycles so amount-thresholding and cycle-topology each fail.
- tests/test_evaluate.py: 12 tests (exact metrics on a fixture, ring threshold, loaders, hardness presets, generator smoke test). Full suite 54 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-23T20:40:05Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

fabio-rovai · 2026-06-23T20:50:36Z

recheck

fabio-rovai requested a review from a team as a code owner June 23, 2026 20:39

fabio-rovai wants to merge 1 commit into SantanderAI:main from fabio-rovai:benchmark-eval-and-hardness
