Does the action head matter? Flow vs. Regression vs. Diffusion on a VLA

A controlled comparison of three action-generation objectives for a Vision-Language-Action policy, on the LIBERO manipulation benchmark. Everything is held fixed — the SmolVLM backbone, the frozen visual-language conditioning, the ~100M action expert itself, the data, and the training recipe — and only the objective varies:

flow matching — SmolVLA's native head (predict a velocity field, integrate it)
regression — a single deterministic forward, trained with L1
diffusion — DDPM training / DDIM sampling (predict noise ε)

Because all three reuse the same transformer expert (same cross-attention to the prefix, same action_out_proj), any difference in performance is attributable to the objective, not the architecture.
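
To make the three objectives concrete, here is a minimal pure-Python sketch of what each one feeds the shared expert and what it asks the expert to predict. It is an illustration, not the code in src/vla/heads/expert_objectives.py: the function names are hypothetical, the flow path convention (t = 1 is pure noise, velocity = noise - action) and the `alpha_bar_t` noise level are assumptions, and an action chunk is flattened to a plain list.

```python
import math


def flow_target(action, noise, t):
    """Flow matching: interpolate between the action (t=0) and noise (t=1).

    Assumed convention: x_t = t * noise + (1 - t) * action, and the expert
    regresses the constant velocity of that path, noise - action.
    """
    x_t = [t * n + (1.0 - t) * a for a, n in zip(action, noise)]
    velocity = [n - a for a, n in zip(action, noise)]
    return x_t, velocity


def regression_target(action):
    """Regression: the expert sees a zero query (no noise) and predicts the action."""
    query = [0.0] * len(action)
    return query, list(action)


def diffusion_target(action, noise, alpha_bar_t):
    """DDPM: corrupt the action to noise level alpha_bar_t; the expert predicts the noise."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    x_t = [signal * a + sigma * n for a, n in zip(action, noise)]
    return x_t, list(noise)


def l1_loss(pred, target):
    """Mean absolute error, the loss the regression head is trained with."""
    return sum(abs(p - q) for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    action, noise = [0.5, -0.2], [1.0, 2.0]
    print(flow_target(action, noise, t=0.5))
    print(regression_target(action))
    print(diffusion_target(action, noise, alpha_bar_t=0.5))
```

In all three cases the network input and output have the same shape; only the corruption applied to the input and the quantity being regressed differ, which is what lets one expert serve all three heads.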

Result

LIBERO-Spatial, 10 tasks × 8 rollouts = 80 episodes per head (best checkpoint of each):

objective	success ± SE	inference latency	action smoothness ↓	train steps
flow matching	73.8% ± 4.9	615 ms / chunk	0.162	30k
L1 regression	70.0% ± 5.1	286 ms / chunk	0.103	30k
DDPM diffusion	38.8% ± 5.4	618 ms / chunk	0.249	60k

(latency = mean ms per action chunk on Apple-Silicon MPS — absolute numbers are device-relative; the ratios are what matter. smoothness = mean ‖aₜ₊₁ − aₜ‖ over a rollout, lower is smoother.)
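
As a concrete reading of the smoothness column, the sketch below computes the mean step-to-step change over a rollout. It assumes the norm is Euclidean and that a rollout is a list of equal-length action vectors; the function name is illustrative and the implementation in src/vla/eval/metrics.py may differ in detail.

```python
import math


def action_smoothness(actions):
    """Mean Euclidean norm of consecutive action differences (lower is smoother)."""
    if len(actions) < 2:
        return 0.0  # a single action has no step-to-step change
    steps = [math.dist(nxt, cur) for cur, nxt in zip(actions, actions[1:])]
    return sum(steps) / len(steps)


if __name__ == "__main__":
    rollout = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
    print(action_smoothness(rollout))
```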

Three findings:

Flow ≈ regression — a statistical tie. The 3.8-point gap is well within the combined standard error (~7), so flow's apparent edge is not significant.
Both decisively beat diffusion (by ~31–35 points, ≈4–5 SE), even though diffusion was given 2× the training budget.
Regression is the practical winner. It matches flow on success while running about 2× faster (one forward pass vs. ten denoising steps) and producing the smoothest trajectories. So for this expert, iterative sampling buys nothing: neither flow's nor diffusion's multi-step decoding improves success over one-shot L1, and diffusion's decoding costs both accuracy and smoothness.
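
The significance claims in these findings can be checked directly from the table. The sketch below expresses each pairwise gap in units of the combined standard error, assuming the heads' rollouts are independent and that a normal approximation is adequate at 80 episodes; the function and dictionary names are illustrative.

```python
import math

# Success rate (%) and standard error (points), copied from the results table.
RESULTS = {
    "flow": (73.8, 4.9),
    "regression": (70.0, 5.1),
    "diffusion": (38.8, 5.4),
}


def gap_in_se(head_a, head_b, results=RESULTS):
    """Return (gap in points, combined SE, gap measured in combined SEs)."""
    (rate_a, se_a), (rate_b, se_b) = results[head_a], results[head_b]
    gap = rate_a - rate_b
    combined = math.sqrt(se_a ** 2 + se_b ** 2)  # SE of a difference of independent rates
    return gap, combined, gap / combined


if __name__ == "__main__":
    for pair in [("flow", "regression"), ("flow", "diffusion"), ("regression", "diffusion")]:
        gap, combined, z = gap_in_se(*pair)
        print(f"{pair[0]} vs {pair[1]}: gap {gap:.1f} pts, combined SE {combined:.1f}, {z:.1f} SE")
```

Flow vs. regression comes out at roughly half a combined SE, while both comparisons against diffusion land between 4 and 5.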

Methodological aside: at 30 episodes the ranking was wrong (regression and diffusion looked tied at 57%, flow ahead at 70%). Going to 80 episodes changed the picture: regression rose to 70% and diffusion fell to 39%. The headline numbers only mean something with the error bars attached.
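
A short sketch of why the 30-episode ranking was unreliable: the binomial standard error shrinks only with the square root of the episode count. The function name is illustrative, and the rates plugged in are the ones quoted in the aside.

```python
import math


def rate_se_points(rate_percent, episodes):
    """Binomial standard error of a success rate, in percentage points."""
    p = rate_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / episodes)


if __name__ == "__main__":
    for rate, n in [(57, 30), (70, 30), (70, 80), (39, 80)]:
        print(f"{rate}% over {n} episodes: +/- {rate_se_points(rate, n):.1f} points")
```

At 30 episodes each rate carries a standard error of roughly 8 to 9 points, so a 13-point gap between two heads is only about one combined SE; at 80 episodes the per-head error drops to about 5 points.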

The setup

Base model: SmolVLA (LeRobot), a SmolVLM2-500M-derived vision-language backbone + a ~100M flow-matching action expert (450M total; SmolVLA keeps only a subset of the backbone's layers).
Task: LIBERO-Spatial (10 pick-and-place tasks), via the lerobot/libero dataset for training and the LIBERO simulator (robosuite/MuJoCo) for evaluation.
What's frozen: the entire VLM backbone (train_expert_only). Only the action expert + its input/output projections are fine-tuned — the same parameters for all three objectives.
The three objectives live in src/vla/heads/expert_objectives.py: flow is LeRobot's native SmolVLAPolicy.forward; regression_loss/regression_sample do a deterministic forward (zero query, t=0) + L1; diffusion_loss/diffusion_sample do DDPM/DDIM (predict ε), reusing the expert's embed_suffix + action_out_proj and caching the prefix KV so DDIM inference stays as cheap as flow.
Training: SmolVLA's own optimizer + LR schedule (AdamW, lr 1e-4, warmup 1k → cosine decay). Flow and regression converge by ~30k steps; diffusion's noisier ε-objective needed a stretched schedule to ~60k (itself a finding — DDPM is the least sample-efficient here).
Evaluation: closed-loop rollouts in the LIBERO sim, reporting task success, inference latency, and action smoothness, with a binomial standard error on the success rate.
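
The inference side of the three objectives listed above can be sketched the same way. This is a toy under stated assumptions, not the repo's regression_sample / diffusion_sample: `expert_fn`, `velocity_fn`, and `eps_fn` are hypothetical stand-ins for the shared expert, flow is integrated with Euler steps from t = 1 (noise) to t = 0, DDIM is the deterministic (eta = 0) variant, and `alpha_bars` is an assumed list of cumulative signal levels ordered from nearly clean to nearly pure noise.

```python
import math


def regression_sample(expert_fn, action_dim):
    """One deterministic forward pass on a zero query at t=0."""
    return expert_fn([0.0] * action_dim, 0.0)


def euler_flow_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=1 down to t=0 with fixed Euler steps."""
    x, t, dt = list(noise), 1.0, -1.0 / num_steps
    for _ in range(num_steps):
        v = velocity_fn(x, t)  # one expert forward pass per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


def ddim_sample(eps_fn, x_start, alpha_bars):
    """Deterministic DDIM: walk alpha_bars from the noisiest index back to clean."""
    x = list(x_start)
    for k in range(len(alpha_bars) - 1, -1, -1):
        ab_t = alpha_bars[k]
        ab_prev = alpha_bars[k - 1] if k > 0 else 1.0  # final step lands on clean data
        eps = eps_fn(x, k)  # one expert forward pass per step
        # Estimate the clean action, then re-noise it to the previous level.
        x0 = [(xi - math.sqrt(1.0 - ab_t) * ei) / math.sqrt(ab_t) for xi, ei in zip(x, eps)]
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * ei for x0i, ei in zip(x0, eps)]
    return x


if __name__ == "__main__":
    target, noise = [0.5, -0.2], [1.0, 2.0]
    # Oracle predictors that know the target, used only to exercise the loops.
    print(regression_sample(lambda query, t: list(target), 2))
    print(euler_flow_sample(lambda x, t: [n - a for a, n in zip(target, noise)], noise))
```

The loops make the latency gap in the results table easy to read: regression calls the expert once per chunk, while flow and DDIM call it once per integration step.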

Reproduce

Evaluate (works on macOS — the eval harness routes around LIBERO's Linux-only deps):

uv sync
bash scripts/setup_libero_macos.sh            # one-time: robosuite + MuJoCo + LIBERO assets

# gate any of the trained checkpoints (success / latency / smoothness):
uv run python scripts/phase0_gate.py --head flow \
  --checkpoint james-steiner/smolvla-libero-flow-v3 --subfolder step_30000 \
  --stats-dataset lerobot/libero --tasks 10 --trials 8
uv run python scripts/phase0_gate.py --head regression \
  --checkpoint james-steiner/smolvla-libero-regression --subfolder step_30000 \
  --stats-dataset lerobot/libero --tasks 10 --trials 8
uv run python scripts/phase0_gate.py --head diffusion \
  --checkpoint james-steiner/smolvla-libero-diffusion-v2 --subfolder final \
  --stats-dataset lerobot/libero --tasks 10 --trials 8

Train (needs a CUDA GPU; one config per head):

# on a fresh cloud pod — clones, installs, verifies CUDA, launches under nohup, auto-pushes ckpts
export WANDB_API_KEY=... HF_TOKEN=... CONFIG=configs/flow.yaml HUB_REPO=<you>/smolvla-libero-flow
git clone <repo> && bash vla/scripts/bootstrap_cloud.sh
# or directly:  uv run python scripts/train.py --config configs/regression.yaml --hub-repo <you>/...

What made this non-trivial (engineering notes)

The interesting part wasn't the heads — it was getting a trustworthy result:

A silent training bug that scored 0% at the gate. The first flow run reached a low loss but succeeded on zero episodes. I ruled out the data, observation construction, action convention, and step count by direct measurement (including replaying ground-truth demo actions in the sim, and comparing our model against a known-good checkpoint on identical frames), and traced the failure to a missing LR schedule: our hand-rolled loop ran a constant LR with no decay, so the policy never annealed and stayed "timid" (under-magnitude actions that never commit to a grasp). Matching SmolVLA's built-in optimizer and scheduler preset took flow from 0% to 70%.
Validating the eval harness before trusting it. A published LIBERO checkpoint scores ~58% through the same harness, and ground-truth demo actions replayed from the matching sim init state succeed, so a 0% score points to the model, not the pipeline.
Error bars over vibes. See the methodological aside above — 30 episodes gave the wrong ranking.
Reproducibility plumbing: macOS LIBERO setup that avoids the Linux-only egl-probe/robomimic build (MuJoCo offscreen via CGL); checkpoints auto-pushed to the Hub during training so an out-of-credits cloud pod never loses progress.
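
The first note above turns on the difference between a constant learning rate and the warmup-then-cosine schedule. A minimal sketch of such a schedule follows, using the peak LR and 1k warmup stated in the setup. The linear warmup shape, the 30k-step decay horizon, the decay to zero, and the names `lr_at`, `decay_steps`, and `final_lr` are assumptions for illustration; LeRobot's actual preset may use different values.

```python
import math


def lr_at(step, peak_lr=1e-4, warmup_steps=1_000, decay_steps=30_000, final_lr=0.0):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped once the horizon is passed.
    progress = min(1.0, (step - warmup_steps) / max(1, decay_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


def constant_lr(step, peak_lr=1e-4):
    """The buggy behaviour described above: no warmup and no decay."""
    return peak_lr


if __name__ == "__main__":
    for step in (0, 1_000, 15_500, 30_000):
        print(step, lr_at(step), constant_lr(step))
```

By the end of training the scheduled rate is far below the constant one, and that late low-LR phase is the annealing the hand-rolled loop was missing.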

Repo layout

src/vla/
  heads/expert_objectives.py  — the 3 objectives on the shared expert (regression/diffusion loss+sample)
  policy/smolvla_wrapper.py    — load SmolVLA, freeze the VLM, contextualised-prefix helpers, processors
  data/libero.py               — lerobot/libero → action-chunked batches + episode-level train/val split
  eval/metrics.py              — latency timer, action smoothness
scripts/
  train.py                     — one loop, objective by --config; SmolVLA optimizer+scheduler preset;
                                 gradient accumulation, resume, auto-push to HF
  phase0_gate.py               — LIBERO sim rollout eval (success ± SE / latency / smoothness; --head)
  bootstrap_cloud.sh           — one-command cloud training
  setup_libero_macos.sh        — macOS LIBERO sim setup
configs/                       — base + flow / regression / diffusion

Limitations

Single suite (LIBERO-Spatial) and a single seed, so the ±5-point error bars reflect rollout variance within one training run, not seed-to-seed variance; latency is measured on MPS (device-relative). Diffusion may still improve past 60k, but it was already given twice the budget of the other two. The point here is the controlled comparison of objectives on a fixed expert, not a leaderboard number.
