Add evaluation harness and hardness knob for detector benchmarking #16
Open
fabio-rovai wants to merge 1 commit into
The README frames this generator as a way to benchmark fraud detectors, but it
ships no scoring and the default data is easy to separate by trivial heuristics
(single sentinel fraud amount, disjoint rings, no legitimate cycles).
This change is additive and backward compatible (low hardness reproduces the
original behaviour exactly):
- evaluate.py + gen-fraud-graph-evaluate CLI: precision/recall/F1 at account and
ring level against fraud_cases.csv, with a confusion summary.
- --hardness {low,medium,high}: jitters fraud amounts, overlaps rings, and
injects decoy legitimate high-value cycles so amount-thresholding and
cycle-topology each fail.
- tests/test_evaluate.py: 12 tests (exact metrics on a fixture, ring threshold,
loaders, hardness presets, generator smoke test). Full suite 54 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All contributors have signed the CLA ✍️ ✅
Author
recheck
Add an evaluation harness and a hardness knob
Why
The README describes this project as a way to "benchmark graph-based fraud
detection models". Today the generator produces data but ships no way to score a
detector against the ground truth, and the default data is easy to separate by
trivial heuristics: every fraud edge carries the exact same sentinel amount
(9999.00), rings are disjoint, and there are no legitimate cycles. A naive
"flag every account on a high-value cycle" baseline scores near-perfectly on the
defaults, so the generator cannot currently tell a good detector from a bad one.
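To make the separability problem concrete, here is a minimal sketch of an amount-only baseline, which is even simpler than the cycle baseline mentioned above. The transaction tuples, the account ids and the flag_by_amount helper are hypothetical illustrations and not part of the generator; only the sentinel value 9999.00 comes from the description.

```python
from decimal import Decimal

SENTINEL = Decimal("9999.00")  # sentinel fraud amount named in the description


def flag_by_amount(transactions, sentinel=SENTINEL):
    """Flag every account that sends or receives a sentinel-amount transfer."""
    flagged = set()
    for src, dst, amount in transactions:
        if Decimal(str(amount)) == sentinel:
            flagged.update((src, dst))
    return flagged


# Hypothetical toy data: (source, destination, amount).
transactions = [
    ("A1", "A2", "9999.00"),  # disjoint fraud ring A1 -> A2 -> A3 -> A1
    ("A2", "A3", "9999.00"),
    ("A3", "A1", "9999.00"),
    ("B1", "B2", "120.50"),   # ordinary traffic
    ("B2", "A1", "43.10"),
]
fraud_accounts = {"A1", "A2", "A3"}

flagged = flag_by_amount(transactions)
print(flagged == fraud_accounts)
```

On data shaped like this, a single equality check recovers the fraud accounts with no false positives, which is why a benchmark built on the defaults says little about detector quality.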
This PR closes both gaps, additively and backward compatibly (defaults
unchanged).
What
1. Evaluation harness (evaluate.py, new).
Given a generated dataset directory and a set of detector-flagged account ids (a
plain list or a CSV), it computes precision, recall and F1 at two granularities:
- account level,
- ring level (a ring counts as detected when at least a ring_threshold fraction
  of its accounts are flagged; default 1.0 = all).
It reads ground truth from fraud/fraud_cases.csv, reports a confusion summary
(with true negatives when an accounts/ directory is present), and exposes a CLI
via a new gen-fraud-graph-evaluate console script (--data, --flagged,
--ring-threshold, --json).
2. Hardness knob (--hardness {low,medium,high}).
A difficulty preset, wired through Config, that makes the data harder to
separate by trivial heuristics:
- amount_jitter: spread fraud amounts around the sentinel so the amount alone
  is no longer a giveaway,
- ring_overlap: let rings share accounts (overlapping rings, not just disjoint
  ones),
- decoy_ratio: inject legitimate high-value cycles into the normal transaction
  stream (transactions/transactions_decoy.csv, never written to
  fraud_cases.csv), so pure amount-thresholding and pure cycle-topology each
  fail.
low (the default) reproduces the original behaviour exactly, so existing users
see no change.
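The preset idea can be sketched roughly as follows. The field names amount_jitter, ring_overlap and decoy_ratio come from the description above, but the numeric values, the HardnessPreset class, the fraud_amount helper and the reading of amount_jitter as a relative spread are all assumptions made for illustration; the PR's actual Config wiring may differ.

```python
import random
from dataclasses import dataclass

SENTINEL = 9999.00  # sentinel fraud amount named in the description


@dataclass(frozen=True)
class HardnessPreset:
    amount_jitter: float  # assumed: relative spread around the sentinel
    ring_overlap: float   # assumed: share of ring accounts reused across rings
    decoy_ratio: float    # assumed: decoy legitimate cycles per fraud ring


# Hypothetical values; only the three preset names are taken from the PR.
PRESETS = {
    "low": HardnessPreset(amount_jitter=0.0, ring_overlap=0.0, decoy_ratio=0.0),
    "medium": HardnessPreset(amount_jitter=0.05, ring_overlap=0.1, decoy_ratio=0.5),
    "high": HardnessPreset(amount_jitter=0.2, ring_overlap=0.3, decoy_ratio=1.0),
}


def fraud_amount(preset, rng):
    """Return a fraud amount, jittered around the sentinel when the preset asks for it."""
    if preset.amount_jitter == 0.0:
        return SENTINEL  # no jitter: every fraud edge keeps the sentinel amount
    factor = 1.0 + rng.uniform(-preset.amount_jitter, preset.amount_jitter)
    return round(SENTINEL * factor, 2)


rng = random.Random(0)
low_amounts = [fraud_amount(PRESETS["low"], rng) for _ in range(5)]
high_amounts = [fraud_amount(PRESETS["high"], rng) for _ in range(5)]
```

Keeping every low value at zero is one simple way to guarantee that the default path stays identical to the pre-change behaviour, since no random draw is consumed and no amount is altered.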
Tests
tests/test_evaluate.py adds 12 tests: exact precision/recall/F1 on a hand-built
fixture, ring-threshold behaviour, the loaders, an end-to-end evaluate_dataset,
the hardness presets, and a generator smoke test asserting that high hardness
writes decoy cycles and jitters fraud amounts while low stays backward
compatible. The full suite (54 tests) passes.
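As an illustration of what an exact-metrics check on a hand-built fixture can look like, here is a minimal sketch. The helper names (prf, account_metrics, ring_recall), the fixture and the ring-level definition are assumptions, not the code in evaluate.py. Because the description does not say how a ring-level false positive is defined, the sketch covers only ring-level recall.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def account_metrics(rings, flagged):
    """Account-level metrics: every member of any ring is a positive."""
    fraud = set().union(*rings.values())
    return prf(len(fraud & flagged), len(flagged - fraud), len(fraud - flagged))


def ring_recall(rings, flagged, ring_threshold=1.0):
    """Share of rings whose flagged fraction reaches ring_threshold."""
    detected = [
        ring_id
        for ring_id, members in rings.items()
        if len(members & flagged) / len(members) >= ring_threshold
    ]
    return len(detected) / len(rings)


# Hypothetical fixture: two rings, one missed account, one false positive.
rings = {"R1": {"a", "b", "c"}, "R2": {"d", "e"}}
flagged = {"a", "b", "c", "d", "x"}

precision, recall, f1 = account_metrics(rings, flagged)  # tp=4, fp=1, fn=1
strict = ring_recall(rings, flagged)                       # only R1 fully flagged
lenient = ring_recall(rings, flagged, ring_threshold=0.5)  # R2 now counts too
```

With the default threshold of 1.0 a single missed account hides the whole ring, so lowering ring_threshold trades strictness for credit on partially recovered rings.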
Notes
modified.
the Santander CLA before merge, and I will sign it on this PR.