Update main from release-2026.05.01 by github-actions[bot] · Pull Request #468 · saturncloud/images

github-actions · 2026-05-26T20:09:27Z

🚀 An automated PR - Please make any changes needed to resolve merge conflicts and then approve and merge!

trl, peft, and datasets are pulled in transitively by unsloth, but declaring them explicitly makes the image's training API surface stable against unsloth version bumps. The Token Factory fine-tune training script (in a separate repo) imports trl.SFTTrainer and peft directly, and needs to be able to rely on those being present and version-compatible with the rest of the image's HF stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-deps saturn-python-llm: declare trl, peft, datasets explicitly

Five recipe-template.json files were using bogus field names (recipeName/image/gpu/saturnVersion) introduced when the 12.4 images were added in Aug 2025 and propagated forward to 12.9 and the AMD image. The release-images builder uploads these as-is to S3 with schema_version 2022.03.01, which causes the legacy pre_load in saturn's BaseRecipeSchema to wrap them under "spec", at which point they fail ImageSpecSchema validation (Missing required field "name"; Unknown fields recipeName/gpu/image/saturnVersion).

Switch all five templates to the established shape (name/description/hardware_type/supports). The AMD template uses hardware_type=AMD; the others use gpu/cpu as appropriate, matching the existing CUDA 11.8 / 12.1 templates.
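The field-name drift described above could be caught before upload with a small pre-flight check. The following is only a sketch: it is not saturn's `ImageSpecSchema`, the names `REQUIRED_FIELDS`, `ALLOWED_FIELDS`, and `check_template` are hypothetical, and the field sets are assumptions inferred from the error messages and the "established shape" quoted in the commit message.

```python
import json

# Assumption: these sets are inferred from the commit message, not copied
# from saturn's real ImageSpecSchema.
REQUIRED_FIELDS = {"name"}
ALLOWED_FIELDS = {"name", "description", "hardware_type", "supports"}


def check_template(raw: str) -> list[str]:
    """Return human-readable problems for one recipe-template.json body."""
    present = set(json.loads(raw))
    problems = []
    # Report each required field that is absent.
    for field in sorted(REQUIRED_FIELDS - present):
        problems.append(f'Missing required field "{field}"')
    # Report every field outside the allowed set in one message.
    unknown = sorted(present - ALLOWED_FIELDS)
    if unknown:
        problems.append("Unknown fields " + "/".join(unknown))
    return problems
```

Run against a template that still uses recipeName/image/gpu/saturnVersion, this sketch reports the missing `name` and the four unknown fields; a template in the name/description/hardware_type/supports shape yields an empty list.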

The Token Factory fine-tuning service wraps axolotl: an in-pod shim invokes `axolotl train <config>` via subprocess, so the binary needs to be on the image's Python path. Pinned exactly to 0.16.1 because Atlas's config renderer is keyed to specific axolotl YAML field names that change across versions; a loose pin or unpinned spec would silently break rendered configs on the next axolotl release.

Context: saturncloud/saturn#6394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
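For readers unfamiliar with the shim pattern mentioned above, a minimal sketch of invoking `axolotl train <config>` through `subprocess` might look like the following. The real in-pod shim is not shown in this PR; the function names `build_command` and `run_training` are hypothetical, and the up-front `shutil.which` check is an assumption about how a missing executable could be surfaced early.

```python
import shutil
import subprocess


def build_command(config_path: str) -> list[str]:
    """Build the argv for `axolotl train <config>`."""
    return ["axolotl", "train", config_path]


def run_training(config_path: str) -> int:
    """Run axolotl in a child process and return its exit code."""
    # Fail with a clear message if the image does not provide the executable.
    if shutil.which("axolotl") is None:
        raise FileNotFoundError("axolotl executable not found on PATH")
    completed = subprocess.run(build_command(config_path), check=False)
    return completed.returncode
```

Passing the command as a list rather than a shell string avoids quoting issues with the config path.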

…olotl saturn-python-llm: add axolotl 0.16.1

…rift Fix recipe-template field names to match ImageSpecSchema

Without an explicit python constraint, mamba now resolves python=3.14 on conda-forge. That breaks both envs on release-2026.05.01:

- saturn-python-tensorflow: pip can't find tensorflow[and-cuda] for cp314 (no wheels yet) and snowflake-connector-python resolves to a cp314 wheel, so the pip step in `mamba env update` fails.
- saturn-python-pytorch: the multi-channel solve (pytorch + rapidsai + nvidia + conda-forge) blows up with "queue count overflow" and SIGABRTs after ~2.5h.

Pinning python=3.13 keeps us close to the latest while staying on a version everything in the env list ships wheels/builds for. We can revisit 3.14 once tensorflow et al catch up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

….05.01 Pin python=3.13 in pytorch and tensorflow envs

…hannel

The pytorch conda channel is frozen at 2.5.1 (Oct 2024; the team announced 2.5 as the last release on the channel) and has no cp313 builds. With the python=3.13 pin we landed in #471, the conda solve for `pytorch::pytorch` against the frozen channel has nothing to resolve. The rapidsai + pytorch + nvidia + conda-forge channel mix was also what caused the previous solve to hit `queue count overflow` and SIGABRT after 2h41m.

Switch torch / torchvision / torchaudio to PyPI cu129 wheels: pip installs them after the conda env update, so we get the modern PyTorch 2.11 + CUDA 12.9 stack against the matching gpu-12.9 base image. Drop the pytorch + nvidia + rapidsai conda channels and the pytorch::* deps. Drop dask-cuda along with rapidsai (it's the only thing here that needed that channel).

Pairs with a release-images change to point the saturn-python-pytorch build at saturnbase-python-gpu-12.9 instead of saturnbase-python-gpu-12.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

saturn-python-pytorch: PyPI torch on cu129, drop pytorch conda channel

Mamba was resolving python=3.14 on conda-forge for any env without an explicit python pin (3.14 released 2025-10-07). #471 already pinned pytorch and tensorflow. This finishes the sweep:

- saturn-python: 3.11 -> 3.13.
- saturn-python-rapids: unpinned -> 3.13 (this image was on a path to hit the same 3.14 problem next build).
- saturn-python-llm: 3.11 -> 3.13. Also drops pytorch/nvidia conda channels and the conda pytorch/pytorch-cuda/cuda-toolkit deps for the same reason the saturn-python-pytorch image did in #472: the pytorch conda channel is frozen at 2.5.1 and has no cp313 builds. torch/torchvision/torchaudio move to pip via the PyPI cu129 index. flash_attn URL swaps cp311 -> cp313 (same build family already publishes a cp313 wheel at that tag).
- saturn-python-pytorch: collapses the two-line --extra-index-url pip arg into the correct single-line form. The split-line form is not the conda env yml pip-args grammar -- pip parses it as a bare --extra-index-url with no value.

Leaving saturn-python-312-slim* alone: their names encode python312 and they're already pinned to 3.12. R images keep python=3.11 since python there is secondary tooling, not the image purpose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
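The split-line `--extra-index-url` mistake described above is easy to lint for. As a rough sketch, assuming the `pip:` entries of an environment.yml have already been loaded into a list of strings, the check below flags any entry that is a value-taking option with nothing after it. `VALUE_OPTIONS` and `find_bare_options` are hypothetical names, the option set is only an illustrative subset of pip's options, and the index URL in the usage note is an illustrative value that does not appear in this PR.

```python
# Assumption: a short illustrative subset of pip options that require a value.
VALUE_OPTIONS = {"--extra-index-url", "--index-url", "--find-links"}


def find_bare_options(pip_entries: list[str]) -> list[str]:
    """Return options that appear alone on an entry, with no value attached."""
    bare = []
    for entry in pip_entries:
        parts = entry.split()
        # A lone "--extra-index-url" means its URL was split onto the next entry.
        if len(parts) == 1 and parts[0] in VALUE_OPTIONS:
            bare.append(parts[0])
    return bare
```

For example, `find_bare_options(["--extra-index-url", "https://download.pytorch.org/whl/cu129", "torch"])` flags the option, while the single-entry form `"--extra-index-url https://download.pytorch.org/whl/cu129"` passes.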

…thon-3.13-sweep # Conflicts: # saturn-python-pytorch/environment.yml

Pin python=3.13 across py images, fix pytorch index-url syntax

…h_attn

The python=3.13 path is blocked by axolotl 0.16.1's unconditional zstandard==0.22.0 transitive pin -- that wheel only ships cp310-cp312, no cp313. Stepping down to python=3.12 keeps the rest of the axolotl 0.16.1 pin chain (torch==2.8.0, transformers==5.5.0, accelerate==1.13.0, bitsandbytes==0.49.1, datasets==4.5.0, trl==0.29.0) resolvable from pre-built wheels.

Also:

- Drop auto-gptq and autoawq. auto-gptq's sdist runs `import torch` at build-deps phase before torch is installed, breaking the env. autoawq has the same kind of issue. Neither is referenced by any saturncloud code; vllm handles GPTQ/AWQ checkpoint loading via compressed-tensors without these libs.
- Bump flash_attn to v2.8.3 with the cu12torch2.8 cp312 wheel. The cu12torch2.7 wheel was ABI-incompatible with torch 2.8.0 (undefined symbol _ZN3c104cuda9SetDeviceEa). v2.8.3 is the version axolotl 0.16.1 itself wants under its flash-attn extra.

Verified locally: env builds cleanly, all heavy imports (torch, transformers, axolotl, flash_attn, vllm, peft, trl, datasets, accelerate, bitsandbytes) succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
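The "ships cp310-cp312, no cp313" reasoning above comes down to reading the Python tag out of wheel filenames. Here is a minimal sketch, assuming a list of release filenames is already at hand (for instance copied from a package index page). The helper names are hypothetical, the filenames in the usage note are illustrative rather than an actual listing, and the parsing relies only on the standard wheel naming layout `name-version[-build]-python-abi-platform.whl`.

```python
def wheel_python_tags(filename: str) -> set[str]:
    """Extract the Python tag(s) from a wheel filename."""
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    # The Python tag is third from the end; compressed tag sets use dots,
    # for example "py2.py3".
    return set(parts[-3].split("."))


def has_wheel_for(filenames: list[str], tag: str) -> bool:
    """True if any wheel in the list is built for the given Python tag."""
    return any(
        name.endswith(".whl") and tag in wheel_python_tags(name)
        for name in filenames
    )
```

Given illustrative names such as `zstandard-0.22.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl`, `has_wheel_for(files, "cp312")` is true while `has_wheel_for(files, "cp313")` is false, which is the condition that forces either a source build or a lower Python version.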

saturn-python-llm: pin python=3.12 (axolotl 0.16.1 transitive blocker)

…6.02

Three independent failures in the existing env on python=3.13:

- cuda-version=12.0 was no longer in conda-forge (only 12.4+ ships now), so the previous CI build couldn't even resolve a CUDA package set. Bump to 12.9 to match the gpu-12.9 base we already have for the pytorch image; rapids' bundled CUDA libs don't actually need to match the runtime base.
- dask-sql is abandoned upstream and tops out at python 3.12 on conda-forge. With our python=3.13 sweep it pulled the solver into 2021-era versions. No saturncloud code references dask-sql; users who want SQL-on-dask can pip-install it on demand.
- Pinning rapids unbounded resolved to 25.08, which embeds an older cuml that broke at import time against the newer scikit-learn 1.8 conda-forge ships (BaseEstimator._get_default_requests was renamed to _get_metadata_request). rapids>=26.02 resolves to 26.04 with a cuml that matches.

Verified locally: env builds clean, all heavy imports (cudf, cuml, cupy, dask, dask_ml, sklearn, pyarrow, cvxpy, prefect, numba) succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
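The "all heavy imports succeed" verification mentioned above can be scripted so it runs the same way locally and in CI. The sketch below is an assumption about how such a check might be written, not the procedure actually used for this PR; `smoke_test` is a hypothetical name.

```python
import importlib


def smoke_test(modules: list[str]) -> dict[str, str]:
    """Try to import each module and map every failure to its error text."""
    failures = {}
    for name in modules:
        try:
            importlib.import_module(name)
        except Exception as exc:
            # Catch broadly: import-time breakage can surface as ImportError,
            # AttributeError (renamed internals), or OSError (missing shared libs).
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

Inside the built image, calling `smoke_test(["cudf", "cuml", "cupy", "dask", "dask_ml", "sklearn", "pyarrow", "cvxpy", "prefect", "numba"])` and failing the build when the result is non-empty would catch import-time breakage such as the cuml and scikit-learn mismatch described above.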

saturn-python-rapids: bump cuda to 12.9, drop dask-sql, pin rapids>=26.02

CI's pip resolver walks vllm back through versions looking for one that satisfies the full constraint set (torch==2.8.0 from axolotl 0.16.1 narrows the window significantly). Locally it lands at 0.11.0; in the build container it kept walking past 0.11.0's manylinux1 wheel and eventually fell into vllm 0.5.x sdists, which try to call /usr/local/cuda/bin/nvcc at metadata-extraction time, but the runtime base image doesn't ship nvcc. Pinning to 0.11.0 (the version the local solve already lands on) short-circuits the backtracking and keeps the resolution wheel-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

saturn-python-llm: pin vllm==0.11.0 (avoid sdist fallback in CI)

vLLM 0.11.0 reads tokenizer.all_special_tokens_extended at startup, which transformers 5.x removed -> AttributeError -> CrashLoopBackOff. The env was leaving transformers unpinned and resolving to 5.5.0, breaking every vLLM serve pod built on this image.

Pin transformers>=4.55,<5 (conda + repeated in the pip: block). 4.57.6 was empirically verified to boot vLLM 0.11.0, load Qwen2.5-7B + a LoRA adapter, and serve /v1/chat/completions.

axolotl 0.16.1 (kept exactly pinned: Token Factory's Atlas YAML renderer is keyed to its field names) carries an over-strict transformers==5.5.0 metadata pin that would drag 5.x back in. It runs fine on transformers 4.57.x, so it is now installed --no-deps in a separate Dockerfile step, with its real transitive deps declared explicitly in environment.yml.

The cross-constraint still holds: axolotl 0.16.1 forces torch==2.8.0 and only vLLM 0.10-0.11 satisfy that, so vLLM stays pinned at 0.11.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
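A pre-flight guard for the `transformers>=4.55,<5` window could fail a serve pod fast with a readable message instead of an AttributeError deep in startup. The following is a simplified sketch, not code from this PR: `parse_release` and `within_vllm_pin` are hypothetical names, and the comparison looks only at the leading numeric release segment, so pre-release and dev suffixes are not ordered the way a full PEP 440 implementation would order them.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. "4.57.6" -> (4, 57, 6)."""
    numbers = []
    for part in version.split("."):
        digits = ""
        for ch in part:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # a non-numeric component such as "dev0" ends the release
        numbers.append(int(digits))
        if len(digits) != len(part):
            break  # a suffix such as "rc1" ends the release segment
    return tuple(numbers)


def within_vllm_pin(version: str) -> bool:
    """True if the version satisfies >=4.55,<5 (release segment only)."""
    return (4, 55) <= parse_release(version) < (5,)
```

At container start the installed version could be read with `importlib.metadata.version("transformers")` and passed to `within_vllm_pin`; 4.57.6 passes and 5.5.0 does not.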

…formers-pin Pin transformers <5 in saturn-python-llm so vLLM 0.11 boots

…hon-llm)

saturn-python-llm tried to be one image for both vLLM serving and axolotl fine-tuning, but the dep stacks are incompatible: vLLM 0.11 needs transformers<5 (5.x removed tokenizer.all_special_tokens_extended -> CrashLoopBackOff), while axolotl 0.16.1 needs the transformers 5.x API (Trainer.create_optimizer(model=) is 5.x-only; on 4.57 training dies inside the loop).

Split by engine, extensible to future inference engines / fine-tuning frameworks:

- saturn-python-vllm: inference (vLLM, transformers<5; axolotl deps removed, no --no-deps hack)
- saturn-python-axolotl: fine-tuning (axolotl installed WITH deps, so it pulls the correct transformers 5.5 / datasets 4.5 / trl 0.29 / hf-hub>=1 stack; flash-attn + deepspeed + mlflow extras)

Both build on the cu129 GPU base. Registered for building in saturncloud/release-images (PR adds them to data_science.py / main_release.py / the build matrix).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…mages Add saturn-python-vllm (inference) + saturn-python-axolotl (training), split from saturn-python-llm

hhuuggoo and others added 23 commits May 26, 2026 19:59

Merge pull request #467 from saturncloud/feature/saturn-python-llm-tf…

8d11560

…-deps saturn-python-llm: declare trl, peft, datasets explicitly

Merge pull request #470 from saturncloud/feature/saturn-python-llm-ax…

e1e7efe

…olotl saturn-python-llm: add axolotl 0.16.1

Merge pull request #469 from saturncloud/fix/recipe-template-schema-d…

36a5b6f

…rift Fix recipe-template field names to match ImageSpecSchema

Merge pull request #471 from saturncloud/hugo/pin-python-3.13-ds-2026…

7d8622c

….05.01 Pin python=3.13 in pytorch and tensorflow envs

Merge pull request #472 from saturncloud/hugo/pytorch-pypi-cu129-py313

bf56f58

saturn-python-pytorch: PyPI torch on cu129, drop pytorch conda channel

Merge remote-tracking branch 'origin/release-2026.05.01' into hugo/py…

e325830

…thon-3.13-sweep # Conflicts: # saturn-python-pytorch/environment.yml

Merge pull request #473 from saturncloud/hugo/python-3.13-sweep

1fca62f

Pin python=3.13 across py images, fix pytorch index-url syntax

Merge pull request #474 from saturncloud/hugo/llm-python-3.12

b22fa1e

saturn-python-llm: pin python=3.12 (axolotl 0.16.1 transitive blocker)

Merge pull request #475 from saturncloud/hugo/rapids-cuda-12.9-py313

823c550

saturn-python-rapids: bump cuda to 12.9, drop dask-sql, pin rapids>=26.02

Merge pull request #476 from saturncloud/hugo/llm-pin-vllm-0.11.0

5060187

saturn-python-llm: pin vllm==0.11.0 (avoid sdist fallback in CI)

Merge pull request #477 from saturncloud/hugo/saturn-python-llm-trans…

0451ead

…formers-pin Pin transformers <5 in saturn-python-llm so vLLM 0.11 boots

Merge pull request #478 from saturncloud/hugo/tf-split-vllm-axolotl-i…

e7d44d0

…mages Add saturn-python-vllm (inference) + saturn-python-axolotl (training), split from saturn-python-llm

github-actions[bot] wants to merge 23 commits into main from release-2026.05.01
