Flaky test: test_dump_modules[description_no_llm] — CategoricalDistribution does not support dynamic value space #314

@voorhs

Description

Symptom

tests/pipeline/test_optimization.py::test_dump_modules[description_no_llm] fails intermittently with:

ValueError: CategoricalDistribution does not support dynamic value space.

The test sometimes passes, sometimes fails, on identical commits — i.e. it is a real flake, not a regression.

Where it surfaces

Example failing job logs:

Likely root cause

The failure is preceded by warnings emitted from src/autointent/nodes/_node_optimizer.py lines 386, 389, 392, 394:

UserWarning: Choices for a categorical distribution should be a tuple of
None, bool, int, float and str for persistent storage but contains
{'model_name': 'sergeyzh/rubert-tiny-turbo', 'device': 'cpu'} which is of type dict.

Optuna expects CategoricalDistribution choices to be scalar values (None, bool, int, float, or str) so that they can be persisted. The search space currently passes dict objects (embedder configs) as choices. The working hypothesis is that Optuna stores these via repr(), so when the persisted study is loaded back, the "value space" reconstructed from the string reprs of the dicts no longer matches the original choices and the distribution equality check fails. The description_no_llm variant's particular ordering of choices appears to trip that check non-deterministically: insertion order in Python dicts is stable, but the dict reprs may interact with set/list de-duplication downstream in Optuna.

If so, the flakiness comes from that check being sensitive to ordering decisions inside Optuna's storage layer, which can shift between runs.
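For reference, the same ValueError can be triggered on purpose whenever one parameter name is suggested with different choices in two trials of the same study. The sketch below shows only that general trigger. It does not confirm that the failing test takes this path. The function name, the parameter name "x", and the choices are all made up for illustration.

```python
def reproduce_dynamic_value_space_error() -> str:
    """Suggest one parameter with different choices in two trials.

    Returns the text of the ValueError raised by Optuna, or "" if none was raised.
    """
    # Imported lazily so the sketch can be read and loaded without Optuna installed.
    import optuna

    optuna.logging.set_verbosity(optuna.logging.WARNING)
    study = optuna.create_study()

    def objective(trial: "optuna.Trial") -> float:
        # The value space for "x" changes between trial 0 and trial 1.
        choices = ("a", "b") if trial.number == 0 else ("a", "b", "c")
        trial.suggest_categorical("x", choices)
        return 0.0

    try:
        study.optimize(objective, n_trials=2)
    except ValueError as exc:
        return str(exc)
    return ""
```

If the choices seen by Optuna for a given parameter name can differ between trials in the test (for example because dict-valued choices do not compare equal after a round trip), this is the error one would expect to see.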

Suggested fix direction

Convert dict-typed categorical choices into a hashable, scalar representation before handing them to Optuna — e.g. a stable string key ("st:sergeyzh/rubert-tiny-turbo:cpu") or an index into a side-table of full configs. Resolve the chosen key back to the dict at trial time.

This would:

  1. Eliminate the four UserWarnings pointing at _node_optimizer.py:386-394.
  2. Make the persisted study round-trippable cleanly.
  3. Remove the flake.
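A minimal sketch of the side-table variant of this fix, assuming the embedder configs are JSON-serializable dicts. The helper names (config_key, build_choice_table) and the parameter name "embedder_config" are hypothetical and do not come from the autointent codebase. The second config in the list is an illustrative duplicate with reordered keys.

```python
import json
from typing import Any


def config_key(config: dict[str, Any]) -> str:
    """Return a stable string key for a JSON-serializable config dict."""
    # sort_keys makes the key independent of dict insertion order.
    return json.dumps(config, sort_keys=True, separators=(",", ":"))


def build_choice_table(configs: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Map stable keys to configs, keeping the first occurrence of each key."""
    table: dict[str, dict[str, Any]] = {}
    for config in configs:
        table.setdefault(config_key(config), config)
    return table


embedder_configs = [
    {"model_name": "sergeyzh/rubert-tiny-turbo", "device": "cpu"},
    # Same config with a different key order: collapses to the same key.
    {"device": "cpu", "model_name": "sergeyzh/rubert-tiny-turbo"},
]

table = build_choice_table(embedder_configs)
choices = tuple(table)  # tuple of str, a type Optuna accepts for persistence

# Inside the objective, the key would be resolved back to the dict, roughly:
#     key = trial.suggest_categorical("embedder_config", choices)
#     config = table[key]
```

Using sorted JSON rather than a hand-built string such as "st:sergeyzh/rubert-tiny-turbo:cpu" avoids having to design a key format per config type, at the cost of less readable keys in the study storage.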

Workaround

Rerun the failing job. The failure is unrelated to most of the diffs on which CI reports it.

Discovered during

The strict-mypy-on-tests workstream (Phase B fan-out). Not blocking that workstream — the failing test is outside every subagent's diff.
