fix(factor_coder): prevent per-instrument ML training hang in factor code #1410
genisis0x wants to merge 1 commit into
…code

When the factor coder generates ML-based factors (LSTM, RandomForest, XGBoost, etc.), the LLM occasionally wraps model construction and .fit() inside a nested loop over instruments and trading days. On a realistic A-share panel (~5K instruments x ~1K days x N epochs) that produces one training run per (instrument, day) pair, i.e. O(instruments x days) runs of N epochs each, and the run hangs at 100% CPU for hours instead of completing in minutes. Critic feedback already rejects the pattern in retrospect, but the LLM keeps regenerating similar code and each iteration pays the full execution cost.

This change introduces a two-layer guard for the anti-pattern:

1) Static evaluator guardrail

detect_per_instrument_training_antipattern walks the AST of the generated factor code and looks for a ``for`` loop whose target or iterable identifier contains an instrument-like hint (instrument, stock, ticker, symbol, code) AND that contains a nested ``for`` whose body matches an ML estimator constructor or training call (``.fit``, ``.partial_fit``, ``.train``, ``train_*(...)``, LSTM/GRU/RNN/Transformer/RandomForest/XGB/LGBM/CatBoost/GradientBoosting/SVR/SVC/MLP/Sequential). When a match is found, FactorEvaluatorForCoder short-circuits with critic-style feedback so CoSTEER can repair without paying for a multi-hour execute() call.

2) Prompt hardening

qlib_factor_strategy gains an explicit rule against per-instrument or per-day model retraining, with a reference panel-fit + batch-predict pattern and a groupby/rolling vectorized alternative. This pushes the LLM toward correct ML factor code on first generation.

Test cases cover positive detection for LSTM, RandomForest and XGBoost variants and negative detection for the recommended panel-fit pattern, groupby/rolling pipelines, nested non-training loops, empty input, and syntax-broken code.

Fixes microsoft#1407
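The static check described in the commit message can be sketched with the standard `ast` module. This is a minimal illustration, not the code from this PR: the helper names (`detect_sketch`, `_is_instrument_loop`, `_is_training_call`), the exact hint lists, the feedback wording, and the choice to return `None` on unparsable input are all assumptions made for the example.

```python
import ast
import re

# Hint lists copied from the description above; the real guard may differ.
INSTRUMENT_HINTS = ("instrument", "stock", "ticker", "symbol", "code")
TRAIN_METHODS = {"fit", "partial_fit", "train"}
ESTIMATOR_RE = re.compile(
    r"LSTM|GRU|RNN|Transformer|RandomForest|XGB|LGBM|CatBoost|"
    r"GradientBoosting|SVR|SVC|MLP|Sequential"
)


def _names(node):
    """Collect lower-cased identifiers (names and attribute names) under a node."""
    found = []
    for n in ast.walk(node):
        if isinstance(n, ast.Name):
            found.append(n.id.lower())
        elif isinstance(n, ast.Attribute):
            found.append(n.attr.lower())
    return found


def _is_instrument_loop(loop):
    """True if the loop target or iterable mentions an instrument-like hint."""
    idents = _names(loop.target) + _names(loop.iter)
    return any(hint in ident for ident in idents for hint in INSTRUMENT_HINTS)


def _is_training_call(call):
    """True for estimator constructors and training calls such as model.fit(...)."""
    func = call.func
    if isinstance(func, ast.Attribute):
        return func.attr in TRAIN_METHODS or bool(ESTIMATOR_RE.search(func.attr))
    if isinstance(func, ast.Name):
        return func.id.startswith("train_") or bool(ESTIMATOR_RE.search(func.id))
    return False


def _body_trains(loop):
    """True if any statement in the loop body contains a training call."""
    return any(
        isinstance(n, ast.Call) and _is_training_call(n)
        for stmt in loop.body
        for n in ast.walk(stmt)
    )


def detect_sketch(code):
    """Return a feedback string if the anti-pattern is found, else None."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return None  # assumption: unparsable code is left to other checks
    for outer in ast.walk(tree):
        if not isinstance(outer, ast.For) or not _is_instrument_loop(outer):
            continue
        for stmt in outer.body:
            for inner in ast.walk(stmt):
                if isinstance(inner, ast.For) and _body_trains(inner):
                    return (
                        f"line {outer.lineno}: model training inside a nested "
                        "per-instrument loop; fit once on the panel instead"
                    )
    return None
```

Substring matching on hints is deliberately loose in this sketch (a variable named `encoder` would match `code`), so a production check would likely need a tighter rule.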
@genisis0x please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contributor License Agreement
Contribution License Agreement
This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
CLA reviewed and good to go. @microsoft-github-policy-service agree
Hi @XianBW, this PR has also been open for ~7 days.
Summary
Two-layer guard
1. Static evaluator guardrail.

`detect_per_instrument_training_antipattern` walks the AST of the generated factor code and matches:

- a `for` whose target or iterable identifier contains an instrument-like hint (`instrument`, `stock`, `ticker`, `symbol`, `code`),
- a nested `for` whose body contains an ML estimator constructor or training call (`.fit`, `.partial_fit`, `.train`, `train_*(`, LSTM/GRU/RNN/Transformer/RandomForest/XGB/LGBM/CatBoost/GradientBoosting/SVR/SVC/MLP/Sequential).

When matched, `FactorEvaluatorForCoder.evaluate` short-circuits with critic-style feedback so CoSTEER repair runs against the guidance instead of paying for a hung `implementation.execute()` call.

2. Prompt hardening.

`qlib_factor_strategy` gains an explicit rule against per-instrument or per-day model retraining, with reference patterns:

- a panel-fit + batch-predict pattern,
- a `groupby(level='instrument').rolling(...).apply(...)` vectorized form.

This pushes the LLM toward correct ML factor code on first generation.
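To make the cost difference between the rejected loop and the panel-fit + batch-predict shape concrete, here is a toy sketch using only the standard library. `CountingMeanModel`, the tiny `panel` dictionary, and both functions are hypothetical stand-ins invented for this example; they are not the reference patterns added to the prompt.

```python
from statistics import fmean


class CountingMeanModel:
    """Toy estimator: predicts the training-label mean and counts fit() calls."""

    fit_calls = 0  # class-level counter shared by all instances

    def fit(self, X, y):
        type(self).fit_calls += 1
        self.mean_ = fmean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


# Hypothetical tiny panel: (instrument, day) -> (feature vector, label).
instruments = ["A", "B", "C"]
days = range(4)
panel = {(i, d): ([float(d)], float(d)) for i in instruments for d in days}


def per_instrument_per_day():
    """Anti-pattern: one model construction and fit per (instrument, day)."""
    CountingMeanModel.fit_calls = 0
    preds = {}
    for inst in instruments:
        for day in days:
            X, y = panel[(inst, day)]
            model = CountingMeanModel().fit([X], [y])
            preds[(inst, day)] = model.predict([X])[0]
    return CountingMeanModel.fit_calls, preds


def panel_fit_batch_predict():
    """Preferred shape: fit once on the stacked panel, then predict in one batch."""
    CountingMeanModel.fit_calls = 0
    keys = list(panel)
    X_all = [panel[k][0] for k in keys]
    y_all = [panel[k][1] for k in keys]
    model = CountingMeanModel().fit(X_all, y_all)
    preds = dict(zip(keys, model.predict(X_all)))
    return CountingMeanModel.fit_calls, preds
```

With 3 instruments and 4 days the nested loop performs 12 fits while the panel version performs 1; at ~5K instruments and ~1K days the same ratio is what turns minutes into hours.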
Validation
- `uv run pytest test/utils/coder/test_factor_antipattern.py`: 8 passed
- `groupby/rolling/apply` pipelines are allowed.
- … `None`.
- `uv run black --check rdagent/components/coder/factor_coder/evaluators.py test/utils/coder/test_factor_antipattern.py -l 120`: clean.

Notes

- `prompts.yaml` is YAML; black/ruff do not format it, so only Python files are part of the lint surface.

Fixes #1407
📚 Documentation preview 📚: https://RDAgent--1410.org.readthedocs.build/en/1410/