Symptom
tests/pipeline/test_optimization.py::test_dump_modules[description_no_llm] fails intermittently with:
ValueError: CategoricalDistribution does not support dynamic value space.
The test sometimes passes, sometimes fails, on identical commits — i.e. it is a real flake, not a regression.
Where it surfaces
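Observed CI runs: 27102499946 green, prior run 27101359057 red, run before that 27098583531 green.
The failure also appears on changes that do not touch tests/pipeline/ at all, confirming it's not driven by the diff.
Example failing job logs: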
Likely root cause
The failure is preceded by warnings emitted from src/autointent/nodes/_node_optimizer.py lines 386, 389, 392, 394:
UserWarning: Choices for a categorical distribution should be a tuple of
None, bool, int, float and str for persistent storage but contains
{'model_name': 'sergeyzh/rubert-tiny-turbo', 'device': 'cpu'} which is of type dict.
Optuna expects CategoricalDistribution choices to be None, bool, int, float, or str for persistent storage. The search space currently passes dict objects (embedder configs) as choices, which Optuna then has to serialize. The error itself is raised when the choices suggested for a parameter do not compare equal to the choices already recorded for that parameter in the study. The working hypothesis is that, once the persisted study is loaded back, the value space reconstructed from the serialized dicts no longer matches the freshly built one, and that the description_no_llm variant's particular ordering of choices trips this equality check non-deterministically. Dict insertion order in Python is stable, but the ordering of the serialized choices may interact with set/list de-duplication further down in Optuna.
The flakiness would then come from this check being sensitive to ordering decisions inside Optuna's storage layer, which can shift between runs.
Suggested fix direction
Convert dict-typed categorical choices into a hashable, scalar representation before handing them to Optuna — e.g. a stable string key ("st:sergeyzh/rubert-tiny-turbo:cpu") or an index into a side-table of full configs. Resolve the chosen key back to the dict at trial time.
This would:
- Eliminate the four UserWarnings pointing at _node_optimizer.py:386-394.
- Make the persisted study round-trippable cleanly.
- Remove the flake.
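A minimal sketch of the side-table variant of this fix, assuming the dict choices are JSON-serializable like the embedder config in the warning. The helper names encode_choices and resolve_choice are hypothetical and exist in neither autointent nor Optuna. The key here is canonical JSON rather than the "st:..." format, which is just one workable encoding, and the "cuda" config is made up for illustration.

```python
import json
from typing import Any


def encode_choices(choices: list[Any]) -> tuple[list[Any], dict[str, Any]]:
    """Replace dict choices with stable string keys.

    Returns (scalar_choices, side_table), where side_table maps each key
    back to the original dict.
    """
    scalar_choices: list[Any] = []
    side_table: dict[str, Any] = {}
    for choice in choices:
        if isinstance(choice, dict):
            # sort_keys makes the key independent of dict insertion order.
            key = json.dumps(choice, sort_keys=True)
            side_table[key] = choice
            scalar_choices.append(key)
        else:
            # Scalars (None, bool, int, float, str) pass through unchanged.
            scalar_choices.append(choice)
    return scalar_choices, side_table


def resolve_choice(value: Any, side_table: dict[str, Any]) -> Any:
    """Map a suggested value back to its full config, if it was encoded."""
    if isinstance(value, str) and value in side_table:
        return side_table[value]
    return value


embedders = [
    {"model_name": "sergeyzh/rubert-tiny-turbo", "device": "cpu"},
    {"model_name": "sergeyzh/rubert-tiny-turbo", "device": "cuda"},  # illustrative
]
keys, table = encode_choices(embedders)
chosen = resolve_choice(keys[0], table)
```

With Optuna, keys would be what gets passed to trial.suggest_categorical, and resolve_choice would run on the returned value before the config is used.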
Workaround
Rerun the failing job. Most of the diffs on which CI flags this test do not touch it.
Discovered during
The strict-mypy-on-tests workstream (Phase B fan-out). Not blocking that workstream — the failing test is outside every subagent's diff.