Update main from release-2026.05.01 #468

Open
github-actions[bot] wants to merge 23 commits into main from release-2026.05.01

@github-actions

🚀 An automated PR - Please make any changes needed to resolve merge conflicts and then approve and merge!

hhuuggoo and others added 23 commits May 26, 2026 19:59
These are pulled in transitively by unsloth, but declaring them
explicitly makes the image's training API surface stable against
unsloth version bumps. The Token Factory fine-tune training script
(in a separate repo) imports trl.SFTTrainer and peft directly, and
needs to be able to rely on those being present and version-compatible
with the rest of the image's HF stack.
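
A minimal sketch of how a consumer such as the training script could confirm that these packages are present before relying on them. The helper name `missing_distributions` is hypothetical; only the package names (trl, peft, datasets) come from the commit message.

```python
from importlib import metadata


def missing_distributions(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises PackageNotFoundError when absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Package names taken from the commit message above.
REQUIRED = ["trl", "peft", "datasets"]
```

Calling `missing_distributions(REQUIRED)` inside the image should return an empty list if the explicit declarations took effect.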

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-deps

saturn-python-llm: declare trl, peft, datasets explicitly

Five recipe-template.json files were using bogus field names
(recipeName/image/gpu/saturnVersion) introduced when the 12.4 images
were added in Aug 2025 and propagated forward to 12.9 and the AMD
image. The release-images builder uploads these as-is to S3 with
schema_version 2022.03.01, which causes the legacy pre_load in
saturn's BaseRecipeSchema to wrap them under "spec" — at which point
they fail ImageSpecSchema validation (Missing required field "name";
Unknown fields recipeName/gpu/image/saturnVersion).

Switch all five templates to the established shape
(name/description/hardware_type/supports). The AMD template uses
hardware_type=AMD; the others use gpu/cpu as appropriate, matching
the existing CUDA 11.8 / 12.1 templates.
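
The field-name mismatch can be illustrated with a small check. This is only a sketch: the real ImageSpecSchema is not shown here, so treating the four established field names as the complete allowed set is an assumption, and `check_template` plus both sample templates are hypothetical.

```python
# Field names taken from the commit message. Treating them as the full
# allowed set is an assumption; the real ImageSpecSchema is not shown.
EXPECTED_FIELDS = {"name", "description", "hardware_type", "supports"}


def check_template(template):
    """Return (missing, unknown) field names for a recipe-template dict."""
    keys = set(template)
    missing = sorted(EXPECTED_FIELDS - keys)
    unknown = sorted(keys - EXPECTED_FIELDS)
    return missing, unknown


# Hypothetical template using the bogus field names described above.
bad = {"recipeName": "example", "image": "example", "gpu": True, "saturnVersion": "x"}
# Hypothetical template in the established shape.
good = {"name": "example", "description": "", "hardware_type": "gpu", "supports": []}
```

For `bad`, the check reports `name` among the missing fields and all four bogus names as unknown, which mirrors the validation errors quoted in the commit message.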
The Token Factory fine-tuning service wraps axolotl: an in-pod shim
invokes `axolotl train <config>` via subprocess, so the binary needs
to be on the image's Python path.

Pinned exactly to 0.16.1 because Atlas's config renderer is keyed to
specific axolotl YAML field names that change across versions; a
loose pin or unpinned spec would silently break rendered configs on
the next axolotl release.
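
As a rough illustration of the difference between an exact pin and a loose one, the sketch below classifies requirement strings with a regular expression. The helper `is_exact_pin` is hypothetical and deliberately narrow: it only recognizes `name==X.Y.Z` with a literal numeric version.

```python
import re

# Matches "name==X.Y.Z" (optionally with extras) where the version is fully
# literal: no wildcard, no range operator.
_EXACT_PIN = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?==[0-9]+(\.[0-9]+)*$")


def is_exact_pin(spec):
    """True if the requirement string pins one exact numeric version."""
    return bool(_EXACT_PIN.match(spec.strip()))
```

A guard like this could run in CI so that a later edit cannot quietly loosen the axolotl pin.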

Context: saturncloud/saturn#6394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…olotl

saturn-python-llm: add axolotl 0.16.1

…rift

Fix recipe-template field names to match ImageSpecSchema

Without an explicit python constraint, mamba now resolves python=3.14 on
conda-forge. That breaks both envs on release-2026.05.01:

- saturn-python-tensorflow: pip can't find tensorflow[and-cuda] for
  cp314 (no wheels yet) and snowflake-connector-python resolves to a
  cp314 wheel, so the pip step in `mamba env update` fails.
- saturn-python-pytorch: the multi-channel solve
  (pytorch + rapidsai + nvidia + conda-forge) blows up with
  "queue count overflow" and SIGABRTs after ~2.5h.

Pinning python=3.13 keeps us close to the latest while staying on a
version everything in the env list ships wheels/builds for. We can
revisit 3.14 once tensorflow et al. catch up.
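
One way to catch an env that has no explicit python constraint is a line scan over environment.yml. The sketch below is an assumption-laden simplification: it uses a regular expression rather than a YAML parser, only understands entries written like `- python=3.13`, and the helper name `python_constraint` is hypothetical.

```python
import re

# Matches a conda dependency entry such as "- python=3.13" or "- python>=3.12".
# A bare "- python" also matches, but with no captured constraint.
_PYTHON_DEP = re.compile(r"^\s*-\s*python\s*([=<>!~]=?\s*\S+)?\s*$")


def python_constraint(env_yml_text):
    """Return the python constraint found in environment.yml text.

    Returns None when python is absent or listed without a version bound.
    """
    for line in env_yml_text.splitlines():
        match = _PYTHON_DEP.match(line)
        if match and match.group(1):
            return match.group(1).replace(" ", "")
    return None
```

A `None` result marks an env where the solver is free to pick the newest python available on the channel.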

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

….05.01

Pin python=3.13 in pytorch and tensorflow envs

…hannel

The pytorch conda channel is frozen at 2.5.1 (Oct 2024 — the team
announced 2.5 as the last release on the channel) and has no
cp313 builds. With the python=3.13 pin we landed in #471, the conda
solve for `pytorch::pytorch` against the frozen channel has nothing
to resolve. The rapidsai + pytorch + nvidia + conda-forge channel mix
was also what caused the previous solve to hit `queue count overflow`
and SIGABRT after 2h41m.

Switch torch / torchvision / torchaudio to PyPI cu129 wheels — pip
installs them after the conda env update, so we get the modern PyTorch
2.11 + CUDA 12.9 stack against the matching gpu-12.9 base image. Drop
the pytorch + nvidia + rapidsai conda channels and the pytorch::*
deps. Drop dask-cuda along with rapidsai (it's the only thing here
that needed that channel).

Pairs with a release-images change to point the
saturn-python-pytorch build at saturnbase-python-gpu-12.9 instead of
saturnbase-python-gpu-12.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

saturn-python-pytorch: PyPI torch on cu129, drop pytorch conda channel

Mamba was resolving python=3.14 on conda-forge for any env without an
explicit python pin (3.14 released 2025-10-07). #471 already pinned
pytorch and tensorflow. This finishes the sweep:

- saturn-python: 3.11 -> 3.13.
- saturn-python-rapids: unpinned -> 3.13 (this image was on a path to
  hit the same 3.14 problem next build).
- saturn-python-llm: 3.11 -> 3.13. Also drops pytorch/nvidia conda
  channels and the conda pytorch/pytorch-cuda/cuda-toolkit deps for
  the same reason the saturn-python-pytorch image did in #472: the
  pytorch conda channel is frozen at 2.5.1 and has no cp313 builds.
  torch/torchvision/torchaudio move to pip via the PyPI cu129 index.
  flash_attn URL swaps cp311 -> cp313 (same build family already
  publishes a cp313 wheel at that tag).
- saturn-python-pytorch: collapses the two-line --extra-index-url
  pip arg into the correct single-line form. The split-line form is
  not the conda env yml pip-args grammar -- pip parses it as a
  bare --extra-index-url with no value.
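
The split-line problem from the last bullet can be modeled with a toy line-by-line parser. This is not pip's actual parser; it only mirrors the behavior described above, where each entry is handled on its own. The function name `parse_option_line` and the index URL are placeholders.

```python
import shlex


def parse_option_line(line):
    """Parse one requirements-style line into (option, value).

    Toy model only: each line is parsed on its own, so a value placed on the
    next line is never attached to the option.
    """
    tokens = shlex.split(line)
    if not tokens or not tokens[0].startswith("--"):
        return None, line.strip() or None
    option, _, inline_value = tokens[0].partition("=")
    if inline_value:
        return option, inline_value
    value = tokens[1] if len(tokens) > 1 else None
    return option, value


# Placeholder URL; the real index URL is not given in the commit message.
INDEX = "https://example.invalid/whl/cu129"

single = [parse_option_line(f"--extra-index-url {INDEX}")]
split = [parse_option_line("--extra-index-url"), parse_option_line(INDEX)]
```

In `single` the option carries its URL. In `split` the option has no value and the URL becomes a separate, unrelated entry.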

Leaving saturn-python-312-slim* alone: their names encode python312
and they're already pinned to 3.12. R images keep python=3.11 since
python there is secondary tooling, not the image purpose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…thon-3.13-sweep

# Conflicts:
#	saturn-python-pytorch/environment.yml

Pin python=3.13 across py images, fix pytorch index-url syntax

…h_attn

The python=3.13 path is blocked by axolotl 0.16.1's unconditional
zstandard==0.22.0 transitive pin -- that wheel only ships cp310-cp312,
no cp313. Stepping down to python=3.12 keeps the rest of the axolotl
0.16.1 pin chain (torch==2.8.0, transformers==5.5.0, accelerate==1.13.0,
bitsandbytes==0.49.1, datasets==4.5.0, trl==0.29.0) resolvable from
pre-built wheels.
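
To show why a cp310-cp312 wheel set blocks python=3.13, here is a small sketch that reads the python tag out of wheel file names. The file names are hypothetical stand-ins shaped like the set described above, and the check ignores `py3` and `abi3` wheels, which would need extra handling.

```python
def wheel_python_tags(filename):
    """Return the set of python tags encoded in a wheel filename.

    Wheel names follow {dist}-{version}[-{build}]-{python}-{abi}-{platform}.whl,
    so the python tag is always third from the end.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel name: {filename}")
    return set(parts[-3].split("."))


def has_wheel_for(filenames, python_tag):
    """True if any wheel in the list targets the given CPython tag exactly."""
    return any(python_tag in wheel_python_tags(name) for name in filenames)


# Hypothetical file names, shaped like the cp310-cp312 set described above.
wheels = [
    "examplepkg-0.22.0-cp310-cp310-manylinux_2_17_x86_64.whl",
    "examplepkg-0.22.0-cp311-cp311-manylinux_2_17_x86_64.whl",
    "examplepkg-0.22.0-cp312-cp312-manylinux_2_17_x86_64.whl",
]
```

With this list, `has_wheel_for(wheels, "cp312")` holds and `has_wheel_for(wheels, "cp313")` does not, so a 3.13 env would have to build from source or fail.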

Also:

- Drop auto-gptq and autoawq. auto-gptq's sdist runs `import torch`
  at build-deps phase before torch is installed, breaking the env.
  autoawq has the same kind of issue. Neither is referenced by any
  saturncloud code; vllm handles GPTQ/AWQ checkpoint loading via
  compressed-tensors without these libs.
- Bump flash_attn to v2.8.3 with the cu12torch2.8 cp312 wheel. The
  cu12torch2.7 wheel was ABI-incompatible with torch 2.8.0 (undefined
  symbol _ZN3c104cuda9SetDeviceEa). v2.8.3 is the version axolotl
  0.16.1 itself wants under its flash-attn extra.
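
A mismatch like cu12torch2.7 against torch 2.8.0 can be caught before install by comparing the torch tag in the wheel name with the pinned torch version. The sketch assumes the wheel name embeds a tag of the form `torch2.8`, as the tags quoted above suggest. The helper `built_for_torch` and the full file names in the usage note are hypothetical.

```python
import re

# Captures the major and minor torch version from a tag like "cu12torch2.8".
_TORCH_TAG = re.compile(r"torch(\d+)\.(\d+)")


def built_for_torch(wheel_name, torch_version):
    """True if the wheel's torch tag matches torch_version's major.minor."""
    match = _TORCH_TAG.search(wheel_name)
    if match is None:
        return False
    # Drop any local suffix such as "+cu129" before comparing.
    wanted = tuple(torch_version.split("+")[0].split(".")[:2])
    return match.groups() == wanted
```

For example, a name containing `cu12torch2.8` passes against `"2.8.0"`, while one containing `cu12torch2.7` does not.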

Verified locally: env builds cleanly, all heavy imports (torch,
transformers, axolotl, flash_attn, vllm, peft, trl, datasets,
accelerate, bitsandbytes) succeed.
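
The local verification could look roughly like the import smoke test below. It is a sketch, not the script that was actually run. The module list is copied from the commit message, and `failed_imports` is a hypothetical helper.

```python
import importlib


def failed_imports(module_names):
    """Try each import and return {module_name: error message} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # ImportError, or ABI errors raised at import
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Module list copied from the commit message above.
HEAVY_IMPORTS = ["torch", "transformers", "axolotl", "flash_attn", "vllm",
                 "peft", "trl", "datasets", "accelerate", "bitsandbytes"]
```

Running `failed_imports(HEAVY_IMPORTS)` inside the built image and requiring an empty result would also surface ABI problems such as the undefined-symbol error, since those appear at import time.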

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

saturn-python-llm: pin python=3.12 (axolotl 0.16.1 transitive blocker)

…6.02

Three independent failures in the existing env on python=3.13:

- cuda-version=12.0 is no longer on conda-forge (only 12.4+ ships now),
  so the previous CI build couldn't even resolve a CUDA package set.
  Bump to 12.9 to match the gpu-12.9 base we already have for the
  pytorch image; rapids' bundled CUDA libs don't actually need to match
  the runtime base.
- dask-sql is abandoned upstream and tops out at python 3.12 on
  conda-forge. With our python=3.13 sweep it pulled the solver into
  2021-era versions. No saturncloud code references dask-sql; users
  who want SQL-on-dask can pip-install it on demand.
- Listing rapids without a version bound resolved to 25.08, which bundles
  an older cuml that broke at import time against the newer scikit-learn
  1.8 that conda-forge ships (BaseEstimator._get_default_requests was
  renamed to _get_metadata_request). rapids>=26.02 resolves to 26.04,
  whose cuml matches.

Verified locally: env builds clean, all heavy imports (cudf, cuml,
cupy, dask, dask_ml, sklearn, pyarrow, cvxpy, prefect, numba) succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

saturn-python-rapids: bump cuda to 12.9, drop dask-sql, pin rapids>=26.02

CI's pip resolver walks vllm back through versions looking for one that
satisfies the full constraint set (torch==2.8.0 from axolotl 0.16.1
narrows the window significantly). Locally it lands at 0.11.0; in the
build container it kept walking past 0.11.0's manylinux1 wheel and
eventually fell into vllm 0.5.x sdists, which try to call /usr/local/cuda/bin/nvcc
at metadata-extraction time — but the runtime base image doesn't ship nvcc.

Pinning to 0.11.0 (the version the local solve already lands on)
short-circuits the backtracking and keeps the resolution wheel-only.
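
The effect of the pin on backtracking can be sketched with a toy newest-first search. This is not pip's resolver, and it does not explain why CI walked past 0.11.0. The function `resolve`, the candidate list in the usage note, and the compatibility callbacks are all illustrative assumptions.

```python
def resolve(candidates, is_compatible, pin=None):
    """Pick a candidate the way a newest-first backtracking search would.

    candidates: list of (version, kind) with kind "wheel" or "sdist",
                ordered newest first.
    is_compatible: callable deciding whether a version fits the other
                   constraints in the environment.
    pin: when given, only that exact version is considered.
    Returns (version, kind, versions_tried).
    """
    tried = []
    for version, kind in candidates:
        if pin is not None and version != pin:
            continue
        tried.append(version)
        if is_compatible(version):
            return version, kind, tried
    return None, None, tried
```

With candidates such as `[("0.11.0", "wheel"), ("0.5.0", "sdist")]`, an unpinned search that rejects 0.11.0 falls through to the sdist. With `pin="0.11.0"` the search never visits the sdist-only versions: it either accepts 0.11.0 or fails fast.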

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

saturn-python-llm: pin vllm==0.11.0 (avoid sdist fallback in CI)

vLLM 0.11.0 reads tokenizer.all_special_tokens_extended at startup, which
transformers 5.x removed -> AttributeError -> CrashLoopBackOff. The env was
leaving transformers unpinned and resolving to 5.5.0, breaking every vLLM
serve pod built on this image.

Pin transformers>=4.55,<5 (conda + repeated in the pip: block). 4.57.6 was
empirically verified to boot vLLM 0.11.0, load Qwen2.5-7B + a LoRA adapter,
and serve /v1/chat/completions.
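
A startup guard for the `>=4.55,<5` pin could be as small as the sketch below. It compares plain numeric release tuples and does not handle pre-release suffixes. The helper names are hypothetical, and only the bounds and the sample versions come from the commit message.

```python
def parse_release(version):
    """Turn '4.57.6' into (4, 57, 6); pre-release suffixes are not handled."""
    return tuple(int(part) for part in version.split("."))


def within_pin(version, lower="4.55", upper="5"):
    """True if lower <= version < upper, matching the >=4.55,<5 pin above."""
    release = parse_release(version)
    return parse_release(lower) <= release < parse_release(upper)
```

Under this check 4.57.6 is accepted and 5.5.0 is rejected, which matches the versions named above.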

axolotl 0.16.1 (kept exactly pinned: Token Factory's Atlas YAML renderer is
keyed to its field names) carries an over-strict transformers==5.5.0 metadata pin that
would drag 5.x back in. It runs fine on transformers 4.57.x, so it is now
installed --no-deps in a separate Dockerfile step, with its real transitive
deps declared explicitly in environment.yml. The cross-constraint still holds:
axolotl 0.16.1 forces torch==2.8.0 and only vLLM 0.10-0.11 satisfy that, so
vLLM stays pinned at 0.11.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…formers-pin

Pin transformers <5 in saturn-python-llm so vLLM 0.11 boots

…hon-llm)

saturn-python-llm tried to be one image for both vLLM serving and axolotl
fine-tuning, but the dep stacks are incompatible: vLLM 0.11 needs transformers<5
(5.x removed tokenizer.all_special_tokens_extended -> CrashLoopBackOff), while
axolotl 0.16.1 needs the transformers 5.x API (Trainer.create_optimizer(model=)
is 5.x-only; on 4.57 training dies inside the loop). Split by engine, extensible
to future inference engines / fine-tuning frameworks:

- saturn-python-vllm:    inference (vLLM, transformers<5; axolotl deps removed,
                         no --no-deps hack)
- saturn-python-axolotl: fine-tuning (axolotl installed WITH deps, so it pulls
                         the correct transformers 5.5 / datasets 4.5 / trl 0.29
                         / hf-hub>=1 stack; flash-attn + deepspeed + mlflow extras)

Both build on the cu129 GPU base. Registered for building in saturncloud/release-images
(PR adds them to data_science.py / main_release.py / the build matrix).
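
The incompatibility that motivates the split can be expressed as an empty intersection of version ranges. In this sketch the vLLM bound (`transformers<5`) comes from the commit message. Modeling axolotl's need for the 5.x API as `>=5` is an assumption, and `intersect` is a hypothetical helper.

```python
def intersect(a, b):
    """Intersect two half-open version ranges (lower inclusive, upper exclusive).

    Each bound is a tuple such as (4, 55), or None for "unbounded".
    Returns the overlapping range, or None when no version fits both.
    """
    lowers = [x for x in (a[0], b[0]) if x is not None]
    uppers = [x for x in (a[1], b[1]) if x is not None]
    lower = max(lowers) if lowers else None
    upper = min(uppers) if uppers else None
    if lower is not None and upper is not None and lower >= upper:
        return None
    return (lower, upper)


VLLM_STACK = (None, (5,))      # transformers<5
AXOLOTL_STACK = ((5,), None)   # transformers 5.x API, modeled here as >=5
```

`intersect(VLLM_STACK, AXOLOTL_STACK)` returns `None`: no single transformers version can serve both stacks, so one image per engine is the simpler arrangement.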

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…mages

Add saturn-python-vllm (inference) + saturn-python-axolotl (training), split from saturn-python-llm