A controlled comparison of three action-generation objectives for a Vision-Language-Action policy, on the LIBERO manipulation benchmark. Everything is held fixed — the SmolVLM backbone, the frozen visual-language conditioning, the ~100M action expert itself, the data, and the training recipe — and only the objective varies:
- flow matching — SmolVLA's native head (predict a velocity field, integrate it)
- regression — a single deterministic forward, trained with L1
- diffusion — DDPM training / DDIM sampling (predict noise ε)
Because all three reuse the same transformer expert (same cross-attention to the prefix, same
action_out_proj), any difference in performance is attributable to the objective, not the
architecture.
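To make the "only the objective varies" claim concrete, here is a minimal sketch of what each objective feeds the shared expert and what it asks the expert to predict. It is an illustration under assumptions, not the code in this repo: the helper names are hypothetical, the flow convention (t=1 is pure noise, t=0 is the clean action) is one common choice, and using MSE for flow and diffusion is assumed rather than taken from the source.

```python
import math
import random


def flow_target(action, noise, t):
    """Flow matching: linear path between the action (t=0) and noise (t=1).

    Returns (noisy_input, velocity_target). Convention is an assumption.
    """
    x_t = [t * n + (1.0 - t) * a for a, n in zip(action, noise)]
    velocity = [n - a for a, n in zip(action, noise)]
    return x_t, velocity


def diffusion_target(action, noise, alpha_bar):
    """DDPM: mix action and noise by the cumulative signal level alpha_bar.

    Returns (noisy_input, epsilon_target); the network predicts the noise.
    """
    signal, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [signal * a + sigma * n for a, n in zip(action, noise)]
    return x_t, list(noise)


def regression_target(action):
    """Regression: a zero query goes in, the clean action is the target."""
    return [0.0] * len(action), list(action)


def l1(pred, target):
    """Mean absolute error, used here for the regression head."""
    return sum(abs(p - y) for p, y in zip(pred, target)) / len(target)


def mse(pred, target):
    """Mean squared error, assumed here for flow and diffusion."""
    return sum((p - y) ** 2 for p, y in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    action = [0.2, -0.1, 0.4]          # toy 3-value action chunk
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    print("flow:", flow_target(action, noise, t=0.5))
    print("diffusion:", diffusion_target(action, noise, alpha_bar=0.5))
    print("regression:", regression_target(action))
```

In all three cases the same network would consume the first element and be scored against the second; only the construction of that pair and the loss differ.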
LIBERO-Spatial, 10 tasks × 8 rollouts = 80 episodes per head (best checkpoint of each):
| objective | success ± SE | inference latency | action smoothness ↓ | train steps |
|---|---|---|---|---|
| flow matching | 73.8% ± 4.9 | 615 ms / chunk | 0.162 | 30k |
| L1 regression | 70.0% ± 5.1 | 286 ms / chunk | 0.103 | 30k |
| DDPM diffusion | 38.8% ± 5.4 | 618 ms / chunk | 0.249 | 60k |
(latency = mean ms per action chunk on Apple-Silicon MPS — absolute numbers are device-relative; the ratios are what matter. smoothness = mean ‖aₜ₊₁ − aₜ‖ over a rollout, lower is smoother.)
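The smoothness column follows directly from the definition in the note above. A minimal sketch, assuming each action is a plain list of floats and the norm is Euclidean; the function name is hypothetical, and whether the repo's `eval/metrics.py` includes the gripper dimension or averages across rollouts differently is not stated in the source.

```python
import math


def action_smoothness(actions):
    """Mean Euclidean norm of consecutive action differences over one rollout.

    Lower values mean smoother trajectories. Returns 0.0 when there are
    fewer than two actions, since no difference can be formed.
    """
    if len(actions) < 2:
        return 0.0
    step_sizes = [math.dist(a, b) for a, b in zip(actions, actions[1:])]
    return sum(step_sizes) / len(step_sizes)


if __name__ == "__main__":
    rollout = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
    print(action_smoothness(rollout))  # (5.0 + 0.0) / 2 = 2.5
```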
Three findings:
- Flow ≈ regression — a statistical tie. The 3.8-point gap is well within the combined standard error (~7), so flow's apparent edge is not significant.
- Both decisively beat diffusion (by ~31–35 points, ≈4–5 combined SE), even though diffusion was given 2× the training budget.
- Regression is the practical winner. It matches flow on success (within error) while running about 2× faster (286 ms vs. 615 ms per chunk; one forward pass vs. ten denoising steps) and producing the smoothest trajectories. So for this head, iterative sampling buys nothing: neither flow's nor diffusion's multi-step decoding improves success over one-shot L1, and diffusion additionally costs both accuracy and smoothness.
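The latency gap in the last finding comes from how many times the expert is called per action chunk. The sketch below contrasts a ten-step Euler integration of a velocity field with a single forward pass. It is a toy under assumptions: `velocity_fn` and `predict_fn` stand in for the expert, the time convention (integrate from t=1 down to t=0) is assumed, and the oracle velocity used in the demo is something only a toy has access to.

```python
def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=1 (noise) to t=0 with fixed Euler steps.

    Returns (sample, number_of_network_calls).
    """
    x = list(noise)
    dt = -1.0 / num_steps
    t = 1.0
    calls = 0
    for _ in range(num_steps):
        v = velocity_fn(x, t)
        calls += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x, calls


def one_shot_sample(predict_fn, chunk_len):
    """Regression-style decoding: one call on a zero query."""
    return predict_fn([0.0] * chunk_len), 1


if __name__ == "__main__":
    action = [0.2, -0.1, 0.4]
    noise = [1.0, -0.5, 0.3]
    # Oracle velocity for a straight noise-to-action path (toy only).
    oracle = lambda x, t: [n - a for a, n in zip(action, noise)]
    print(euler_sample(oracle, noise))
    print(one_shot_sample(lambda query: list(action), len(action)))
```

With a perfect velocity field both decoders land on the same action, so the extra nine calls only pay off if the multi-step objective learns something the one-shot head cannot; the table above suggests it does not in this setting.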
Methodological aside: at 30 episodes the ranking was wrong (regression and diffusion looked tied at 57%, flow ahead at 70%). Going to 80 episodes flipped it — regression rose to 70%, diffusion fell to 39%. The headline numbers only mean something with the error bars attached.
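The error bars and the "within ~7" and "≈4–5 SE" statements can be reproduced with a few lines. This sketch assumes a plain binomial standard error and independent runs. The success counts 59, 56, and 31 out of 80 are inferred from the reported percentages rather than given in the text, and the function names are hypothetical.

```python
import math


def binomial_se(successes, n):
    """Standard error of a success rate estimated from n Bernoulli episodes."""
    p = successes / n
    return math.sqrt(p * (1.0 - p) / n)


def gap_in_se(successes_a, successes_b, n):
    """Success-rate gap (a minus b) in units of the combined standard error.

    Assumes both heads were evaluated on n independent episodes.
    """
    gap = (successes_a - successes_b) / n
    combined = math.sqrt(
        binomial_se(successes_a, n) ** 2 + binomial_se(successes_b, n) ** 2
    )
    return gap / combined


if __name__ == "__main__":
    flow, regression, diffusion, n = 59, 56, 31, 80  # inferred counts
    for name, s in [("flow", flow), ("regression", regression), ("diffusion", diffusion)]:
        print(f"{name}: {100 * s / n:.1f}% ± {100 * binomial_se(s, n):.1f}")
    print(f"flow vs regression: {gap_in_se(flow, regression, n):.2f} SE")
    print(f"flow vs diffusion: {gap_in_se(flow, diffusion, n):.2f} SE")
    print(f"regression vs diffusion: {gap_in_se(regression, diffusion, n):.2f} SE")
    # At 30 episodes a rate near 57% carries about ±9 points of standard error.
    print(f"17/30: ± {100 * binomial_se(17, 30):.1f}")
```

At 30 episodes the standard error is roughly 9 points per head, which is why a 13-point gap between heads could not establish a ranking.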
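- Base model: SmolVLA (LeRobot), built on a SmolVLM2-500M vision-language backbone plus a ~100M flow-matching action expert (450M total; SmolVLA keeps only a subset of the VLM's layers, so the total is below the backbone's nominal size).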
- Task: LIBERO-Spatial (10 pick-and-place tasks), via the lerobot/libero dataset for training and the LIBERO simulator (robosuite/MuJoCo) for evaluation.
- What's frozen: the entire VLM backbone (train_expert_only). Only the action expert + its input/output projections are fine-tuned, the same parameters for all three objectives.
- The three objectives live in src/vla/heads/expert_objectives.py: flow is LeRobot's native SmolVLAPolicy.forward; regression_loss / regression_sample do a deterministic forward (zero query, t=0) + L1; diffusion_loss / diffusion_sample do DDPM/DDIM (predict ε), reusing the expert's embed_suffix + action_out_proj and caching the prefix KV so DDIM inference stays as cheap as flow.
- Training: SmolVLA's own optimizer + LR schedule (AdamW, lr 1e-4, warmup 1k → cosine decay). Flow and regression converge by ~30k steps; diffusion's noisier ε-objective needed a stretched schedule to ~60k (itself a finding: DDPM is the least sample-efficient here).
- Evaluation: closed-loop rollouts in the LIBERO sim, reporting task success, inference latency, and action smoothness, with a binomial standard error on the success rate.
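The "warmup 1k → cosine decay" schedule in the training recipe above can be sketched as a pure function of the step. This is an illustration, not LeRobot's scheduler: linear warmup from zero is assumed, `min_lr` is a hypothetical parameter (defaulting to 0 here), and the decay horizon is set to the 30k steps quoted for flow and regression.

```python
import math


def lr_at(step, peak_lr=1e-4, warmup_steps=1000, decay_steps=30000, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine decay.

    step         -- current optimizer step (0-based)
    peak_lr      -- rate reached at the end of warmup
    warmup_steps -- number of linear warmup steps
    decay_steps  -- step at which the cosine reaches min_lr
    min_lr       -- floor held after decay_steps (assumed value)
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    span = max(1, decay_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 500, 1000, 15500, 30000):
        print(step, lr_at(step))
```

A loop that skips this and holds the rate at `peak_lr` for the whole run never anneals, which is the failure mode a hand-rolled training loop is most likely to hit silently.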
Evaluate (works on macOS — the eval harness routes around LIBERO's Linux-only deps):
uv sync
bash scripts/setup_libero_macos.sh # one-time: robosuite + MuJoCo + LIBERO assets
# gate any of the trained checkpoints (success / latency / smoothness):
uv run python scripts/phase0_gate.py --head flow \
--checkpoint james-steiner/smolvla-libero-flow-v3 --subfolder step_30000 \
--stats-dataset lerobot/libero --tasks 10 --trials 8
uv run python scripts/phase0_gate.py --head regression \
--checkpoint james-steiner/smolvla-libero-regression --subfolder step_30000 \
--stats-dataset lerobot/libero --tasks 10 --trials 8
uv run python scripts/phase0_gate.py --head diffusion \
--checkpoint james-steiner/smolvla-libero-diffusion-v2 --subfolder final \
--stats-dataset lerobot/libero --tasks 10 --trials 8
Train (needs a CUDA GPU; one config per head):
# on a fresh cloud pod — clones, installs, verifies CUDA, launches under nohup, auto-pushes ckpts
export WANDB_API_KEY=... HF_TOKEN=... CONFIG=configs/flow.yaml HUB_REPO=<you>/smolvla-libero-flow
git clone <repo> && bash vla/scripts/bootstrap_cloud.sh
# or directly: uv run python scripts/train.py --config configs/regression.yaml --hub-repo <you>/...
The interesting part wasn't the heads; it was getting a trustworthy result:
- A silent training bug that scored 0% at the gate. The first flow run reached a low loss but succeeded on zero episodes. I ruled out the data, observation construction, action convention, and step count by direct measurement (including replaying ground-truth demo actions in the sim, and comparing our model against a known-good checkpoint on identical frames), and traced the failure to a missing LR schedule: our hand-rolled loop ran a constant LR with no decay, so the policy never annealed and stayed "timid" (under-magnitude actions that never commit to a grasp). Matching SmolVLA's built-in optimizer and scheduler preset took flow from 0% to 70%.
- Validating the eval harness before trusting it. A published LIBERO checkpoint scores ~58% through the same harness, and ground-truth demo actions replayed from the matching sim init state succeed — so a 0% means the model, not the pipeline.
- Error bars over vibes. See the methodological aside above — 30 episodes gave the wrong ranking.
- Reproducibility plumbing: macOS LIBERO setup that avoids the Linux-only egl-probe/robomimic build (MuJoCo offscreen via CGL); checkpoints auto-pushed to the Hub during training so an out-of-credits cloud pod never loses progress.
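One way to quantify the "timid" under-magnitude symptom from the first item is to compare the average size of predicted actions against reference actions for the same frames. The helper below is hypothetical and not taken from the repo; it assumes both inputs are equally shaped chunks given as lists of per-step action lists.

```python
def magnitude_ratio(predicted, reference):
    """Mean |predicted| divided by mean |reference| over all steps and dims.

    Values well below 1.0 suggest the policy is producing under-sized
    actions relative to the reference (demo or known-good checkpoint).
    """
    def mean_abs(chunk):
        values = [abs(v) for step in chunk for v in step]
        return sum(values) / len(values)

    ref = mean_abs(reference)
    if ref == 0.0:
        raise ValueError("reference actions are all zero")
    return mean_abs(predicted) / ref


if __name__ == "__main__":
    demo = [[0.2, -0.2], [0.4, -0.4]]
    policy = [[0.1, -0.1], [0.2, -0.2]]
    print(magnitude_ratio(policy, demo))  # policy moves half as far as the demo
```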
src/vla/
  heads/expert_objectives.py   # the 3 objectives on the shared expert (regression/diffusion loss+sample)
  policy/smolvla_wrapper.py    # load SmolVLA, freeze the VLM, contextualised-prefix helpers, processors
  data/libero.py               # lerobot/libero → action-chunked batches + episode-level train/val split
  eval/metrics.py              # latency timer, action smoothness
scripts/
  train.py                     # one loop, objective by --config; SmolVLA optimizer+scheduler preset;
                               # gradient accumulation, resume, auto-push to HF
  phase0_gate.py               # LIBERO sim rollout eval (success ± SE / latency / smoothness; --head)
  bootstrap_cloud.sh           # one-command cloud training
  setup_libero_macos.sh        # macOS LIBERO sim setup
configs/                       # base + flow / regression / diffusion
Limitations: this is a single suite (LIBERO-Spatial) and a single seed, so the roughly ±5-point error bars reflect only rollout-to-rollout variance within one training run, not variance across seeds; latency is measured on MPS (device-relative). Diffusion may still improve past 60k steps, but it was already given twice the budget of the other two. The point here is the controlled comparison of objectives on a fixed expert, not a leaderboard number.