Update main from release-2026.05.01 #468
Open
github-actions[bot] wants to merge 23 commits into
These are pulled in transitively by unsloth, but declaring them explicitly makes the image's training API surface stable against unsloth version bumps. The Token Factory fine-tune training script (in a separate repo) imports trl.SFTTrainer and peft directly, and needs to be able to rely on those being present and version-compatible with the rest of the image's HF stack. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-deps saturn-python-llm: declare trl, peft, datasets explicitly
Five recipe-template.json files were using bogus field names (recipeName/image/gpu/saturnVersion) introduced when the 12.4 images were added in Aug 2025 and propagated forward to 12.9 and the AMD image. The release-images builder uploads these as-is to S3 with schema_version 2022.03.01, which causes the legacy pre_load in saturn's BaseRecipeSchema to wrap them under "spec" — at which point they fail ImageSpecSchema validation (Missing required field "name"; Unknown fields recipeName/gpu/image/saturnVersion). Switch all five templates to the established shape (name/description/hardware_type/supports). The AMD template uses hardware_type=AMD; the others use gpu/cpu as appropriate, matching the existing CUDA 11.8 / 12.1 templates.
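The field-name mismatch described above can be caught before upload with a small key check. This is a minimal sketch, not saturn's actual validation: `check_template` and `EXPECTED_FIELDS` are hypothetical names, and treating all four fields as expected is an assumption (the commit only confirms that `name` is required). The placeholder values are illustrative.

```python
# Hypothetical pre-upload check; the real validation lives in saturn's ImageSpecSchema.
# Assumption: every field in the established shape is expected to be present.
EXPECTED_FIELDS = {"name", "description", "hardware_type", "supports"}


def check_template(template: dict) -> dict:
    """Report expected fields that are absent and fields outside the established shape."""
    keys = set(template)
    return {
        "missing": sorted(EXPECTED_FIELDS - keys),
        "unknown": sorted(keys - EXPECTED_FIELDS),
    }


# Shape that used the bogus field names (values are placeholders).
bad_template = {"recipeName": "x", "image": "x", "gpu": True, "saturnVersion": "x"}

# Established shape (values are placeholders, except hardware_type=AMD from the commit).
good_template = {"name": "x", "description": "x", "hardware_type": "AMD", "supports": []}

bad_report = check_template(bad_template)
good_report = check_template(good_template)
```

Running a check like this over each recipe-template.json in CI would surface the drift at review time rather than at schema validation after upload.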
The Token Factory fine-tuning service wraps axolotl: an in-pod shim invokes `axolotl train <config>` via subprocess, so the binary needs to be on the image's PATH. Pinned exactly to 0.16.1 because Atlas's config renderer is keyed to specific axolotl YAML field names that change across versions; a loose pin or unpinned spec would silently break rendered configs on the next axolotl release.
Context: saturncloud/saturn#6394.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
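For illustration, a shim of the kind described might look like the sketch below. The function names are hypothetical and the real shim lives in a separate repo; the only detail taken from the commit is the `axolotl train <config>` invocation via subprocess. The sketch assumes the executable is resolved through PATH.

```python
import shutil
import subprocess


def build_axolotl_command(config_path: str) -> list:
    """Build the argv for `axolotl train <config>` (hypothetical helper)."""
    return ["axolotl", "train", config_path]


def run_training(config_path: str) -> int:
    """Run axolotl as a child process and return its exit code."""
    # Fail early with a clear message if the image does not provide the binary.
    if shutil.which("axolotl") is None:
        raise FileNotFoundError("axolotl executable not found on PATH")
    completed = subprocess.run(build_axolotl_command(config_path), check=False)
    return completed.returncode
```

Checking with `shutil.which` first turns a missing install into an explicit error instead of a less obvious failure inside `subprocess.run`.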
…olotl saturn-python-llm: add axolotl 0.16.1
…rift Fix recipe-template field names to match ImageSpecSchema
Without an explicit python constraint, mamba now resolves python=3.14 on conda-forge. That breaks both envs on release-2026.05.01:
- saturn-python-tensorflow: pip can't find tensorflow[and-cuda] for cp314 (no wheels yet) and snowflake-connector-python resolves to a cp314 wheel, so the pip step in `mamba env update` fails.
- saturn-python-pytorch: the multi-channel solve (pytorch + rapidsai + nvidia + conda-forge) blows up with "queue count overflow" and SIGABRTs after ~2.5h.
Pinning python=3.13 keeps us close to the latest while staying on a version everything in the env list ships wheels/builds for. We can revisit 3.14 once tensorflow et al catch up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
….05.01 Pin python=3.13 in pytorch and tensorflow envs
…hannel The pytorch conda channel is frozen at 2.5.1 (Oct 2024; the team announced 2.5 as the last release on the channel) and has no cp313 builds. With the python=3.13 pin we landed in #471, the conda solve for `pytorch::pytorch` against the frozen channel has nothing to resolve. The rapidsai + pytorch + nvidia + conda-forge channel mix was also what caused the previous solve to hit `queue count overflow` and SIGABRT after 2h41m.
Switch torch / torchvision / torchaudio to PyPI cu129 wheels. pip installs them after the conda env update, so we get the modern PyTorch 2.11 + CUDA 12.9 stack against the matching gpu-12.9 base image. Drop the pytorch + nvidia + rapidsai conda channels and the pytorch::* deps. Drop dask-cuda along with rapidsai (it's the only thing here that needed that channel).
Pairs with a release-images change to point the saturn-python-pytorch build at saturnbase-python-gpu-12.9 instead of saturnbase-python-gpu-12.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saturn-python-pytorch: PyPI torch on cu129, drop pytorch conda channel
Mamba was resolving python=3.14 on conda-forge for any env without an explicit python pin (3.14 released 2025-10-07). #471 already pinned pytorch and tensorflow. This finishes the sweep:
- saturn-python: 3.11 -> 3.13.
- saturn-python-rapids: unpinned -> 3.13 (this image was on a path to hit the same 3.14 problem next build).
- saturn-python-llm: 3.11 -> 3.13. Also drops pytorch/nvidia conda channels and the conda pytorch/pytorch-cuda/cuda-toolkit deps for the same reason the saturn-python-pytorch image did in #472: the pytorch conda channel is frozen at 2.5.1 and has no cp313 builds. torch/torchvision/torchaudio move to pip via the PyPI cu129 index. flash_attn URL swaps cp311 -> cp313 (same build family already publishes a cp313 wheel at that tag).
- saturn-python-pytorch: collapses the two-line --extra-index-url pip arg into the correct single-line form. The split-line form is not the conda env yml pip-args grammar -- pip parses it as a bare --extra-index-url with no value.
Leaving saturn-python-312-slim* alone: their names encode python312 and they're already pinned to 3.12. R images keep python=3.11 since python there is secondary tooling, not the image purpose.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
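The split-line `--extra-index-url` problem is easy to reintroduce, so a lint over the `pip:` entries of an environment.yml may help. The sketch below assumes the entries have already been loaded as a list of strings; `find_bare_index_flags` is a hypothetical helper, and the index URL is illustrative rather than copied from the repo.

```python
# Flags that must carry their value on the same entry of the `pip:` list.
INDEX_FLAGS = ("--extra-index-url", "--index-url")


def find_bare_index_flags(pip_entries: list) -> list:
    """Return entries that consist of an index flag with no value attached."""
    return [entry for entry in pip_entries if entry.strip() in INDEX_FLAGS]


# Split-line form: the flag and its URL are separate list entries.
split_form = [
    "--extra-index-url",
    "https://download.pytorch.org/whl/cu129",  # illustrative URL
    "torch",
]

# Single-line form: flag and URL share one entry.
single_form = [
    "--extra-index-url https://download.pytorch.org/whl/cu129",  # illustrative URL
    "torch",
]

split_hits = find_bare_index_flags(split_form)
single_hits = find_bare_index_flags(single_form)
```

A non-empty result for any environment.yml would indicate the two-line form that this commit collapses.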
…thon-3.13-sweep # Conflicts: # saturn-python-pytorch/environment.yml
Pin python=3.13 across py images, fix pytorch index-url syntax
…h_attn The python=3.13 path is blocked by axolotl 0.16.1's unconditional zstandard==0.22.0 transitive pin -- that wheel only ships cp310-cp312, no cp313. Stepping down to python=3.12 keeps the rest of the axolotl 0.16.1 pin chain (torch==2.8.0, transformers==5.5.0, accelerate==1.13.0, bitsandbytes==0.49.1, datasets==4.5.0, trl==0.29.0) resolvable from pre-built wheels.
Also:
- Drop auto-gptq and autoawq. auto-gptq's sdist runs `import torch` at build-deps phase before torch is installed, breaking the env. autoawq has the same kind of issue. Neither is referenced by any saturncloud code; vllm handles GPTQ/AWQ checkpoint loading via compressed-tensors without these libs.
- Bump flash_attn to v2.8.3 with the cu12torch2.8 cp312 wheel. The cu12torch2.7 wheel was ABI-incompatible with torch 2.8.0 (undefined symbol _ZN3c104cuda9SetDeviceEa). v2.8.3 is the version axolotl 0.16.1 itself wants under its flash-attn extra.
Verified locally: env builds cleanly, all heavy imports (torch, transformers, axolotl, flash_attn, vllm, peft, trl, datasets, accelerate, bitsandbytes) succeed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
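The "no cp313 wheel" blocker comes down to reading the Python tag out of wheel filenames. A rough sketch of that check follows; the helper names are hypothetical, and the filenames are illustrative examples of the wheel naming convention, not a real listing of what zstandard publishes.

```python
def python_tags(wheel_filename: str) -> set:
    """Extract the Python tag(s) from a wheel filename.

    Wheel names follow name-version[-build]-python-abi-platform.whl, so the
    Python tag is the third field from the end. Compressed tag sets such as
    "py2.py3" are split on ".".
    """
    stem = wheel_filename[: -len(".whl")]
    return set(stem.split("-")[-3].split("."))


def has_wheel_for(wheel_filenames: list, tag: str) -> bool:
    """True if any listed wheel is built for the given Python tag."""
    return any(tag in python_tags(name) for name in wheel_filenames)


# Illustrative filenames only (assumed, not fetched from an index).
example_wheels = [
    "zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "zstandard-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "zstandard-0.22.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
]

cp312_available = has_wheel_for(example_wheels, "cp312")
cp313_available = has_wheel_for(example_wheels, "cp313")
```

When no wheel matches the interpreter, pip falls back to building from an sdist if one exists, which is the situation the python=3.12 pin avoids.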
saturn-python-llm: pin python=3.12 (axolotl 0.16.1 transitive blocker)
…6.02 Three independent failures in the existing env on python=3.13:
- cuda-version=12.0 was no longer in conda-forge (only 12.4+ ships now), so the previous CI build couldn't even resolve a CUDA package set. Bump to 12.9 to match the gpu-12.9 base we already have for the pytorch image; rapids' bundled CUDA libs don't actually need to match the runtime base.
- dask-sql is abandoned upstream and tops out at python 3.12 on conda-forge. With our python=3.13 sweep it pulled the solver into 2021-era versions. No saturncloud code references dask-sql; users who want SQL-on-dask can pip-install it on demand.
- Pinning rapids unbounded resolved to 25.08, which embeds an older cuml that broke at import time against the newer scikit-learn 1.8 conda-forge ships (BaseEstimator._get_default_requests was renamed to _get_metadata_request). rapids>=26.02 resolves to 26.04 with a cuml that matches.
Verified locally: env builds clean, all heavy imports (cudf, cuml, cupy, dask, dask_ml, sklearn, pyarrow, cvxpy, prefect, numba) succeed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saturn-python-rapids: bump cuda to 12.9, drop dask-sql, pin rapids>=26.02
CI's pip resolver walks vllm back through versions looking for one that satisfies the full constraint set (torch==2.8.0 from axolotl 0.16.1 narrows the window significantly). Locally it lands at 0.11.0; in the build container it kept walking past 0.11.0's manylinux1 wheel and eventually fell into vllm 0.5.x sdists, which try to call /usr/local/cuda/bin/nvcc at metadata-extraction time — but the runtime base image doesn't ship nvcc. Pinning to 0.11.0 (the version the local solve already lands on) short-circuits the backtracking and keeps the resolution wheel-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saturn-python-llm: pin vllm==0.11.0 (avoid sdist fallback in CI)
vLLM 0.11.0 reads tokenizer.all_special_tokens_extended at startup, which transformers 5.x removed -> AttributeError -> CrashLoopBackOff. The env was leaving transformers unpinned and resolving to 5.5.0, breaking every vLLM serve pod built on this image.
Pin transformers>=4.55,<5 (conda + repeated in the pip: block). 4.57.6 was empirically verified to boot vLLM 0.11.0, load Qwen2.5-7B + a LoRA adapter, and serve /v1/chat/completions.
axolotl 0.16.1 (kept exactly pinned: Token Factory's Atlas YAML renderer is keyed to its field names) carries an over-strict transformers==5.5.0 metadata pin that would drag 5.x back in. It runs fine on transformers 4.57.x, so it is now installed --no-deps in a separate Dockerfile step, with its real transitive deps declared explicitly in environment.yml.
The cross-constraint still holds: axolotl 0.16.1 forces torch==2.8.0 and only vLLM 0.10-0.11 satisfy that, so vLLM stays pinned at 0.11.0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
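To make the `transformers>=4.55,<5` window concrete, the sketch below checks a few versions against it with plain tuple comparison. This is a simplification that only handles dotted numeric versions, not full PEP 440 specifiers; `parse_release` and `in_pinned_range` are hypothetical names.

```python
def parse_release(version: str) -> tuple:
    """Turn a dotted numeric version such as "4.57.6" into (4, 57, 6)."""
    return tuple(int(part) for part in version.split("."))


def in_pinned_range(version: str, lower: str = "4.55", upper: str = "5") -> bool:
    """True if lower <= version < upper, mirroring the >=4.55,<5 pin."""
    return parse_release(lower) <= parse_release(version) < parse_release(upper)


verified_ok = in_pinned_range("4.57.6")   # version the commit reports as verified
resolved_bad = in_pinned_range("5.5.0")   # version the unpinned env resolved to
too_old = in_pinned_range("4.54.0")       # below the lower bound
```

A real resolver applies the same bounds through pip or conda; the point here is only that 4.57.6 falls inside the window and 5.5.0 does not.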
…formers-pin Pin transformers <5 in saturn-python-llm so vLLM 0.11 boots
…hon-llm)
saturn-python-llm tried to be one image for both vLLM serving and axolotl
fine-tuning, but the dep stacks are incompatible: vLLM 0.11 needs transformers<5
(5.x removed tokenizer.all_special_tokens_extended -> CrashLoopBackOff), while
axolotl 0.16.1 needs the transformers 5.x API (Trainer.create_optimizer(model=)
is 5.x-only; on 4.57 training dies inside the loop). Split by engine, extensible
to future inference engines / fine-tuning frameworks:
- saturn-python-vllm: inference (vLLM, transformers<5; axolotl deps removed,
no --no-deps hack)
- saturn-python-axolotl: fine-tuning (axolotl installed WITH deps, so it pulls
the correct transformers 5.5 / datasets 4.5 / trl 0.29
/ hf-hub>=1 stack; flash-attn + deepspeed + mlflow extras)
Both build on the cu129 GPU base. Registered for building in saturncloud/release-images
(PR adds them to data_science.py / main_release.py / the build matrix).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mages Add saturn-python-vllm (inference) + saturn-python-axolotl (training), split from saturn-python-llm
🚀 An automated PR - Please make any changes needed to resolve merge conflicts and then approve and merge!