JamesSteiner/vla

Does the action head matter? Flow vs. Regression vs. Diffusion on a VLA

A controlled comparison of three action-generation objectives for a Vision-Language-Action policy, on the LIBERO manipulation benchmark. Everything is held fixed — the SmolVLM backbone, the frozen visual-language conditioning, the ~100M action expert itself, the data, and the training recipe — and only the objective varies:

  • flow matching — SmolVLA's native head (predict a velocity field, integrate it)
  • regression — a single deterministic forward, trained with L1
  • diffusion — DDPM training / DDIM sampling (predict noise ε)

Because all three reuse the same transformer expert (same cross-attention to the prefix, same action_out_proj), any difference in performance is attributable to the objective, not the architecture.
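
As a rough illustration of how the three objectives differ only in what the expert is fed and what it is asked to predict, here is a minimal pure-Python sketch. It is not the code in src/vla/heads/expert_objectives.py: the function names are hypothetical, action chunks are flat lists rather than tensors, and the flow interpolation convention (t = 1 is pure noise) is an assumption.

```python
import math


def flow_matching_pair(actions, noise, t):
    """Flow matching: input is a noise/action blend, target is a velocity.

    Assumed convention: t = 0 is the clean action chunk, t = 1 is pure noise.
    """
    x_t = [t * n + (1.0 - t) * a for a, n in zip(actions, noise)]
    u_t = [n - a for a, n in zip(actions, noise)]
    return x_t, u_t


def regression_pair(actions):
    """Regression: a zero query goes in, the action chunk itself is the target."""
    return [0.0] * len(actions), list(actions)


def diffusion_pair(actions, noise, alpha_bar_t):
    """DDPM: input is the noised chunk, target is the noise epsilon."""
    x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * n
           for a, n in zip(actions, noise)]
    return x_t, list(noise)


def l1_loss(pred, target):
    """Mean absolute error, the loss named for the regression head."""
    return sum(abs(p - q) for p, q in zip(pred, target)) / len(pred)


def mse_loss(pred, target):
    """Mean squared error, the usual choice for velocity and epsilon targets."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
```

In each case the same network would map (input, t) to an output of the action shape; only the input construction, the target, and the loss change.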

Result

LIBERO-Spatial, 10 tasks × 8 rollouts = 80 episodes per head (best checkpoint of each):

| objective | success ± SE | inference latency | action smoothness ↓ | train steps |
|---|---|---|---|---|
| flow matching | 73.8% ± 4.9 | 615 ms / chunk | 0.162 | 30k |
| L1 regression | 70.0% ± 5.1 | 286 ms / chunk | 0.103 | 30k |
| DDPM diffusion | 38.8% ± 5.4 | 618 ms / chunk | 0.249 | 60k |

(latency = mean ms per action chunk on Apple-Silicon MPS — absolute numbers are device-relative; the ratios are what matter. smoothness = mean ‖aₜ₊₁ − aₜ‖ over a rollout, lower is smoother.)
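
The smoothness metric can be sketched in a few lines. This is an illustration of the formula above rather than the contents of eval/metrics.py: the function name is hypothetical, and the Euclidean norm over all action dimensions is an assumption, since the norm is not specified here.

```python
import math


def action_smoothness(actions):
    """Mean Euclidean norm of consecutive action differences over a rollout.

    `actions` is a list of per-step action vectors (lists of floats).
    """
    if len(actions) < 2:
        return 0.0  # no consecutive pair to compare
    total = 0.0
    for prev, nxt in zip(actions, actions[1:]):
        total += math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, nxt)))
    return total / (len(actions) - 1)
```

A rollout that repeats the same action at every step scores 0.0; larger step-to-step jumps raise the score.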

Three findings:

  1. Flow ≈ regression — a statistical tie. The 3.8-point gap is well within the combined standard error (~7), so flow's apparent edge is not significant.
  2. Both decisively beat diffusion (~31–35 points, ≈4–5 SE), even though diffusion was given 2× the training budget.
  3. Regression is the practical winner. It matches flow on success while running about 2× faster (one forward pass vs. ten denoising steps) and producing the smoothest trajectories. For this head, iterative sampling buys nothing: neither flow's nor diffusion's multi-step decoding improves success over one-shot L1, and diffusion's decoding costs both accuracy and smoothness.
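
The "in units of SE" statements can be checked with a short calculation. The sketch below assumes success counts of 59, 56 and 31 out of 80, inferred from the rounded percentages in the table, and treats the three evaluations as independent; the function names are illustrative.

```python
import math


def binomial_se(successes, episodes):
    """Standard error of a success rate estimated from Bernoulli trials."""
    p = successes / episodes
    return math.sqrt(p * (1.0 - p) / episodes)


def gap_in_se(succ_a, succ_b, episodes):
    """Gap between two success rates, in units of the combined standard error."""
    gap = (succ_a - succ_b) / episodes
    combined = math.sqrt(binomial_se(succ_a, episodes) ** 2
                         + binomial_se(succ_b, episodes) ** 2)
    return gap / combined


# Counts inferred from the rounded percentages (an assumption, not logged data).
FLOW, REGRESSION, DIFFUSION, EPISODES = 59, 56, 31, 80

if __name__ == "__main__":
    print("flow vs regression:", gap_in_se(FLOW, REGRESSION, EPISODES))
    print("flow vs diffusion:", gap_in_se(FLOW, DIFFUSION, EPISODES))
    print("regression vs diffusion:", gap_in_se(REGRESSION, DIFFUSION, EPISODES))
```

Under these assumptions the flow/regression gap comes out at roughly half a combined SE, while both gaps to diffusion are between 4 and 5 combined SE.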

Methodological aside: at 30 episodes the ranking was wrong (regression and diffusion looked tied at 57%, flow ahead at 70%). Going to 80 episodes flipped it — regression rose to 70%, diffusion fell to 39%. The headline numbers only mean something with the error bars attached.
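
A worked check of why 30 episodes was too few, using the binomial standard error with the 30-episode rates quoted above (this is arithmetic on the reported rates, not an additional measurement):

```latex
\mathrm{SE}(p, n) = \sqrt{\frac{p\,(1 - p)}{n}}, \qquad
\mathrm{SE}(0.57,\ 30) \approx 0.090, \qquad
\mathrm{SE}(0.70,\ 30) \approx 0.084, \qquad
\sqrt{0.090^2 + 0.084^2} \approx 0.123
```

At n = 30 the 13-point gap between flow and the other two heads is therefore only about one combined standard error, so that ranking was never well supported.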

The setup

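  • Base model: SmolVLA (LeRobot), which pairs a SmolVLM2-500M vision-language backbone with a ~100M flow-matching action expert (~450M parameters total; less than the nominal 500M + 100M because SmolVLA runs only part of the backbone's layers).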
  • Task: LIBERO-Spatial (10 pick-and-place tasks), via the lerobot/libero dataset for training and the LIBERO simulator (robosuite/MuJoCo) for evaluation.
  • What's frozen: the entire VLM backbone (train_expert_only). Only the action expert + its input/output projections are fine-tuned — the same parameters for all three objectives.
  • The three objectives live in src/vla/heads/expert_objectives.py: flow is LeRobot's native SmolVLAPolicy.forward; regression_loss/regression_sample do a deterministic forward (zero query, t=0) + L1; diffusion_loss/diffusion_sample do DDPM/DDIM (predict ε), reusing the expert's embed_suffix + action_out_proj and caching the prefix KV so DDIM inference stays as cheap as flow.
  • Training: SmolVLA's own optimizer + LR schedule (AdamW, lr 1e-4, warmup 1k → cosine decay). Flow and regression converge by ~30k steps; diffusion's noisier ε-objective needed a stretched schedule to ~60k (itself a finding — DDPM is the least sample-efficient here).
  • Evaluation: closed-loop rollouts in the LIBERO sim, reporting task success, inference latency, and action smoothness, with a binomial standard error on the success rate.
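
To make the inference-cost difference between the heads concrete, here is a minimal sketch of the three sampling procedures, written against a generic callable instead of the real expert. The function signatures, the flow time convention (integrating from t = 1 down to t = 0), and the way the DDIM noise levels are passed in are all assumptions made for illustration; the repo's regression_sample and diffusion_sample may be structured differently.

```python
import math


def regression_sample(expert, dim):
    """One forward pass: a zero query at t = 0 maps straight to actions."""
    return expert([0.0] * dim, 0.0)


def flow_sample(velocity, noise, num_steps=10):
    """Euler-integrate a velocity field from t = 1 (noise) down to t = 0."""
    x, dt = list(noise), -1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 + i * dt
        v = velocity(x, t)  # one expert call per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def ddim_sample(eps_model, x_T, alpha_bars):
    """Deterministic DDIM (eta = 0).

    `alpha_bars` lists the cumulative alpha products from the noisiest level
    (smallest value) to the cleanest; the last step targets alpha_bar = 1.
    """
    x = list(x_T)
    levels = list(alpha_bars) + [1.0]
    for ab_t, ab_prev in zip(levels, levels[1:]):
        eps = eps_model(x, ab_t)  # one expert call per step
        # Estimate the clean chunk, then re-noise it to the next level.
        x0 = [(xi - math.sqrt(1.0 - ab_t) * ei) / math.sqrt(ab_t)
              for xi, ei in zip(x, eps)]
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x
```

With ten steps, both iterative samplers call the expert ten times per chunk and regression calls it once, which is consistent with flow and diffusion having near-identical latency in the table. The measured gap is about 2× rather than 10×, presumably because the vision-language prefix is encoded once per chunk and dominates the fixed cost.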

Reproduce

Evaluate (works on macOS — the eval harness routes around LIBERO's Linux-only deps):

uv sync
bash scripts/setup_libero_macos.sh            # one-time: robosuite + MuJoCo + LIBERO assets

# gate any of the trained checkpoints (success / latency / smoothness):
uv run python scripts/phase0_gate.py --head flow \
  --checkpoint james-steiner/smolvla-libero-flow-v3 --subfolder step_30000 \
  --stats-dataset lerobot/libero --tasks 10 --trials 8
uv run python scripts/phase0_gate.py --head regression \
  --checkpoint james-steiner/smolvla-libero-regression --subfolder step_30000 \
  --stats-dataset lerobot/libero --tasks 10 --trials 8
uv run python scripts/phase0_gate.py --head diffusion \
  --checkpoint james-steiner/smolvla-libero-diffusion-v2 --subfolder final \
  --stats-dataset lerobot/libero --tasks 10 --trials 8

Train (needs a CUDA GPU; one config per head):

# on a fresh cloud pod — clones, installs, verifies CUDA, launches under nohup, auto-pushes ckpts
export WANDB_API_KEY=... HF_TOKEN=... CONFIG=configs/flow.yaml HUB_REPO=<you>/smolvla-libero-flow
git clone <repo> && bash vla/scripts/bootstrap_cloud.sh
# or directly:  uv run python scripts/train.py --config configs/regression.yaml --hub-repo <you>/...

What made this non-trivial (engineering notes)

The interesting part wasn't the heads — it was getting a trustworthy result:

  • A silent training bug that scored 0% at the gate. The first flow run reached a low loss but scored zero. I ruled out the data, observation construction, action convention, and step count by direct measurement (including replaying ground-truth demo actions in the sim, and comparing our model against a known-good checkpoint on identical frames). The cause was a missing LR schedule: our hand-rolled loop ran a constant LR with no decay, so the policy never annealed and stayed "timid" (under-magnitude actions that never commit to a grasp). Matching SmolVLA's built-in optimizer and scheduler preset took flow from 0% → 70%.
  • Validating the eval harness before trusting it. A published LIBERO checkpoint scores ~58% through the same harness, and ground-truth demo actions replayed from the matching sim init state succeed — so a 0% means the model, not the pipeline.
  • Error bars over vibes. See the methodological aside above — 30 episodes gave the wrong ranking.
  • Reproducibility plumbing: macOS LIBERO setup that avoids the Linux-only egl-probe/robomimic build (MuJoCo offscreen via CGL); checkpoints auto-pushed to the Hub during training so an out-of-credits cloud pod never loses progress.
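
For reference, the shape of the schedule that the constant-LR loop was missing can be sketched as follows. The peak LR and warmup length are the values stated in the setup section; the final LR floor and the exact warmup and decay formulas are assumptions for illustration and may not match SmolVLA's preset.

```python
import math

PEAK_LR = 1e-4        # stated in the training recipe
WARMUP_STEPS = 1_000  # stated in the training recipe
TOTAL_STEPS = 30_000  # flow / regression budget
FINAL_LR = 2.5e-6     # hypothetical floor; not stated in this document


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to FINAL_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine
```

A constant-LR loop corresponds to returning PEAK_LR at every step, so the late, low-LR phase of training never happens.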

Repo layout

src/vla/
  heads/expert_objectives.py  — the 3 objectives on the shared expert (regression/diffusion loss+sample)
  policy/smolvla_wrapper.py    — load SmolVLA, freeze the VLM, contextualised-prefix helpers, processors
  data/libero.py               — lerobot/libero → action-chunked batches + episode-level train/val split
  eval/metrics.py              — latency timer, action smoothness
scripts/
  train.py                     — one loop, objective by --config; SmolVLA optimizer+scheduler preset;
                                 gradient accumulation, resume, auto-push to HF
  phase0_gate.py               — LIBERO sim rollout eval (success ± SE / latency / smoothness; --head)
  bootstrap_cloud.sh           — one-command cloud training
  setup_libero_macos.sh        — macOS LIBERO sim setup
configs/                       — base + flow / regression / diffusion
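
The two data-side ideas named for data/libero.py, an episode-level split and action chunking, can be sketched as below. This is not the repo's implementation: the function names, the validation fraction, the seed, and the repeat-last-action padding are all hypothetical choices.

```python
import random


def split_episodes(episode_ids, val_fraction=0.1, seed=0):
    """Split by episode so no episode contributes frames to both sets.

    Returns (train_ids, val_ids) as sets of episode ids.
    """
    unique = sorted(set(episode_ids))
    random.Random(seed).shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction))
    return set(unique[n_val:]), set(unique[:n_val])


def action_chunk(actions, start, chunk_size):
    """Actions [start, start + chunk_size), padded by repeating the last one."""
    chunk = actions[start:start + chunk_size]
    pad = [actions[-1]] * (chunk_size - len(chunk))
    return chunk + pad
```

Splitting on episode ids rather than on individual frames keeps near-duplicate neighbouring frames from leaking between train and validation.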

Limitations

Single suite (LIBERO-Spatial) and a single seed, so the ±5-point error bars reflect only rollout-to-rollout variance within one training run, not seed-to-seed variance; latency is measured on MPS (device-relative). Diffusion may still improve past 60k steps, but it was already given twice the budget of the other two. The point here is the controlled comparison of objectives on a fixed expert, not a leaderboard number.
