Bump transformers from 4.57.1 to 5.0.0rc3 #3
Open
dependabot[bot] wants to merge 1 commit into
Bumps [transformers](https://github.com/huggingface/transformers) from 4.57.1 to 5.0.0rc3.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.57.1...v5.0.0rc3)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.0.0rc3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
CityChan added a commit to CityChan/MemexRL that referenced this pull request May 29, 2026
Smoke test smoke1h_hard_lossless_20260528_204913 ran rollout 0 cleanly
(truncated_ratio=9.6%, 208 samples gathered, healthy reward signal at
raw_reward=-0.139) and completed train step 0 (pg_loss=-0.42,
entropy=0.38). Train step 1 then OOMed at:

  slime/utils/ppo_utils.py:173 inside _VocabParallelEntropy.forward
    normalized_vocab_parallel_logits = vocab_parallel_logits - logits_max
  torch.OutOfMemoryError: Tried to allocate 6.11 GiB

This subtraction creates a full (N_tokens, 152064) intermediate. The
chunked-entropy monkey-patch from memex-30b-sbatch-patches Accenture#3 only
chunks the *later* mul_reduce call; the subtraction at line 173 still
materializes a full-vocab tensor. With max-tokens-per-gpu=2048, that
tensor was ~2.5 GiB raw, on top of the entropy clone copy and the
normalized copy, and the 12 GiB of reserved-but-unallocated PyTorch
fragmentation left over from step 0 meant the 6 GiB step-1 allocation
could not fit into the 5.37 GiB still free on the 95 GiB GH200.
Step 0 succeeding proves the algorithm/gradient chain works end-to-end
at this batch size. The fix is just to halve per-iter peak memory
again so step 1+ have headroom against the fragmentation that
accumulated from step 0. Doubling the microbatch count means ~30%
slower actor_train (per-iter scales sub-linearly with token count
because of fixed kernel-launch overheads), which trades training
throughput for elimination of the iter-1 OOM that wiped out the
remaining 12 iters of rollout 0 anyway.
This is the third halving of MAX_TOKENS_PER_GPU since the first
post-truncation-patch run:

  8192 (original)        -> too big, lossless_db GPU OOM in
                            compute_log_probs(logits.clone())
  4096 (commit 2b75d3b)  -> too big, same path
  2048 (commit 08306fd)  -> still OOM but in entropy.forward
  1024 (this commit)     -> per-iter peak ~halves the entropy
                            subtraction; should fit alongside
                            step-0 fragmentation

If 1024 still OOMs the proper fix is to extend the chunked-entropy
monkey-patch to also chunk the line-173 subtraction (and the
subsequent exp / div paths). Doing that requires ~50 lines of patch
because the in-place exp_/div_ pattern in the existing forward can't
naively be applied to a chunked subtractor without losing the
backward-pass requirement that ctx.save_for_backward gets the
original vocab_parallel_logits. Reserving for a follow-up if 1024
isn't enough.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
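
For reference, the size of the full (N_tokens, 152064) intermediate can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes bf16 logits (2 bytes per element, as the follow-up commit states) and the helper names are hypothetical. Under that assumption a 2048-token microbatch would need roughly 0.58 GiB for one such tensor, while the 6.11 GiB allocation in the traceback corresponds to roughly 21.5K tokens, in line with the ~20K-token single-sample microbatches described in the follow-up commit.

```python
VOCAB = 152064      # vocab size quoted in the commit message
BF16_BYTES = 2      # assumption: logits are bf16, 2 bytes per element
GIB = 2 ** 30


def full_vocab_bytes(n_tokens: int, vocab: int = VOCAB,
                     elem_bytes: int = BF16_BYTES) -> int:
    """Bytes needed for one dense (n_tokens, vocab) tensor."""
    return n_tokens * vocab * elem_bytes


def tokens_for_alloc(alloc_gib: float, vocab: int = VOCAB,
                     elem_bytes: int = BF16_BYTES) -> int:
    """Largest token count whose (n_tokens, vocab) tensor fits in alloc_gib."""
    return int(alloc_gib * GIB) // (vocab * elem_bytes)


if __name__ == "__main__":
    # Size of the intermediate if the token cap actually bounded N.
    for n in (1024, 2048, 4096, 8192):
        print(f"{n:>6} tokens -> {full_vocab_bytes(n) / GIB:.2f} GiB")
    # Token count implied by the failed allocation in the traceback.
    print("6.11 GiB ~", tokens_for_alloc(6.11), "tokens")
```

The gap between the cap and the implied token count is the reason halving MAX_TOKENS_PER_GPU alone may not shrink this allocation when a single sample is longer than the cap.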
CityChan added a commit to CityChan/MemexRL that referenced this pull request May 29, 2026
…ure#5)

Adds a monkey-patch that replaces the entire
_VocabParallelEntropy.forward and backward in slime/utils/ppo_utils.py
with vocab-chunked versions.

Why this matters: the existing chunked-mul_reduce patch (sbatch patch
Accenture#3) only chunks the very last reduction inside the entropy
forward. The earlier line 173:

  normalized_vocab_parallel_logits = vocab_parallel_logits - logits_max

allocates a full (N_tokens, vocab=152064) bf16 tensor. With our
long-context agent samples (~14K tokens each, single-sample microbatches
under slime's dynamic batching because per-sample > max-tokens-per-gpu),
N is ~20K, the alloc is ~6 GB, and Vista's GH200 with SGLang holding
~30 GB of CUDA graph state has no contiguous 6 GB free at train iter 1
after step 0's PyTorch cache fragmentation.

Smokes 727417 and 727488 (2026-05-28) both reproduced this: rollout 0 +
train step 0 succeed, train step 1 OOMs in the line-173 subtraction,
regardless of whether max-tokens-per-gpu is 4096, 2048, or 1024 (the
per-microbatch token count is set by sample length, not the cap).

The new patch eliminates the (N, V) intermediate entirely:

forward (two vocab-chunked passes):
  Pass 1 accumulates sum_exp = sum_v(exp(z_v - max)) chunk by chunk.
  Pass 2 accumulates sum_softmax_times_logits = sum_v(softmax_v * z_v)
  chunk by chunk.
  Per-chunk tensors are (N, 16384) ~= 625 MB, not 6 GB.
  Saves only logits_max + sum_exp + sum_softmax_times_logits +
  vocab_parallel_logits for backward (no softmax_logits buffer kept
  alive).

backward (vocab-chunked, in-place):
  Recomputes softmax chunk-wise from sum_exp + logits_max + chunk of
  saved logits, applies the standard
  -softmax*(z - sum_softmax_times_logits)*grad_out formula, writes the
  result into the vocab_parallel_logits buffer in-place. Safe because
  slime always calls compute_entropy_from_logits(logits.clone(),
  tp_group), so the buffer is private.

Math identity: same as upstream. sum_exp and sum_softmax_times_logits
match the upstream values up to floating-point reduction order. Entropy
formula unchanged. Backward formula unchanged.

Memory: peak alloc inside entropy.forward drops from ~12 GB
(vocab_parallel_logits + normalized_vocab_parallel_logits) to ~8 GB
(vocab_parallel_logits + ~625 MB per-chunk intermediate). That ~4 GB of
saved peak is exactly what we need to fit alongside the post-step-0
fragmentation that crashed the smoke runs.

The legacy chunked-mul_reduce patch (Accenture#3) is kept directly below
as a fallback in case slime upstream changes _VocabParallelEntropy's
source in a way that breaks the new patch's needle match. If the new
patch succeeds, the old patch's needle (which targets the original
`def mul_reduce` inside the original forward) will no longer match and
it will print a harmless WARN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
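
The two-pass forward and the chunked backward described in this commit can be illustrated with a small stand-in. The sketch below is not the patch itself: it uses NumPy on a single process instead of a torch.autograd.Function, omits the tensor-parallel all-reduces that the real `_VocabParallelEntropy` performs, and the function names are hypothetical. It only shows that accumulating `sum_exp` and `sum_softmax_times_logits` over vocab chunks reproduces the dense entropy, and that the gradient can be rebuilt chunk by chunk and written over the logits buffer.

```python
import numpy as np

CHUNK = 16384  # vocab chunk width quoted in the commit message


def entropy_dense(z):
    """Reference: per-row entropy of softmax(z) using full (N, V) temporaries."""
    zn = z - z.max(axis=-1, keepdims=True)
    e = np.exp(zn)
    s = e.sum(axis=-1)
    return np.log(s) - (e * zn).sum(axis=-1) / s


def entropy_forward_chunked(z, chunk=CHUNK):
    """Two vocab-chunked passes; the largest temporary is (N, chunk)."""
    n, v = z.shape
    z_max = z.max(axis=-1)
    sum_exp = np.zeros(n, dtype=z.dtype)
    for lo in range(0, v, chunk):  # pass 1: sum_v exp(z_v - max)
        sum_exp += np.exp(z[:, lo:lo + chunk] - z_max[:, None]).sum(axis=-1)
    sum_pz = np.zeros(n, dtype=z.dtype)
    for lo in range(0, v, chunk):  # pass 2: sum_v softmax_v * z_v
        zc = z[:, lo:lo + chunk]
        p = np.exp(zc - z_max[:, None]) / sum_exp[:, None]
        sum_pz += (p * zc).sum(axis=-1)
    # H = logsumexp(z) - sum_v softmax_v * z_v
    entropy = np.log(sum_exp) + z_max - sum_pz
    return entropy, (z_max, sum_exp, sum_pz)


def entropy_backward_chunked_inplace(z, saved, grad_out, chunk=CHUNK):
    """Overwrites z with dH/dz * grad_out, one vocab chunk at a time."""
    z_max, sum_exp, sum_pz = saved
    for lo in range(0, z.shape[1], chunk):
        zc = z[:, lo:lo + chunk]  # view into the caller's buffer
        p = np.exp(zc - z_max[:, None]) / sum_exp[:, None]
        # Right-hand side is fully evaluated before the chunk is overwritten.
        zc[...] = -p * (zc - sum_pz[:, None]) * grad_out[:, None]
    return z
```

Because the backward overwrites its input, a caller that still needs the logits has to pass a copy, which mirrors the commit's note that slime passes `logits.clone()` into the entropy computation.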
CityChan added a commit to CityChan/MemexRL that referenced this pull request May 31, 2026
The needle-based patch from 6b801e1 / 5203349 didn't match on Vista even
after the heredoc fix, because slime/utils/ppo_utils.py already had the
legacy chunked-mul_reduce patch applied from previous runs. My needle
included the original
`@torch.compile\n def mul_reduce(a, b)\n return (a * b).sum(...)` block,
but the file's actual mul_reduce is now the multi-line chunked version
from sbatch patch Accenture#3. Result: needle missed silently, only
legacy patch was active, training still OOMed at original line 186 (the
`vocab_parallel_logits - logits_max` subtraction).

Job 729970 (2026-05-30) reproduced this: train step 0 completes
(pg_loss=-0.449, entropy=0.401), step 1 OOMs at the same line 186 as
before, with 14.09 GiB reserved-but-unallocated fragmentation and only
6.24 GiB free trying to allocate 5.54 GiB. Identical pattern to 728933.

Rewrite the patch to locate the class boundaries by text markers
(`class _VocabParallelEntropy(torch.autograd.Function):` start,
`def compute_entropy_from_logits(` end) and replace the entire class
wholesale. This is robust to whatever the legacy patch did to the
method body, and idempotent via the chunked_entropy_full marker inside
the new class.

When this patch succeeds, the legacy chunked-mul_reduce patch running
after will find neither its marker nor its needle (the mul_reduce inner
function no longer exists in the chunked forward), print a harmless
WARN, and skip. If this patch fails (class boundaries somehow not found
because slime renamed the class or compute_entropy_from_logits), the
legacy patch still applies as before. That gives only partial relief
and training will still OOM, but at least we're no worse than today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
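
The marker-based replacement described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the actual sbatch patch: the two boundary strings and the `chunked_entropy_full` marker are taken from the commit message, while the function name `replace_class` and the status strings are hypothetical.

```python
START_MARKER = "class _VocabParallelEntropy(torch.autograd.Function):"
END_MARKER = "def compute_entropy_from_logits("
IDEMPOTENCY_MARKER = "chunked_entropy_full"


def replace_class(source: str, new_class_src: str) -> tuple[str, str]:
    """Swap everything between the two markers for new_class_src.

    Returns (patched_source, status), where status is one of
    "patched", "already_patched", or "markers_not_found".
    """
    if IDEMPOTENCY_MARKER not in new_class_src:
        # Without the marker a second run could not detect the first one.
        raise ValueError("replacement class must contain the idempotency marker")
    if IDEMPOTENCY_MARKER in source:
        return source, "already_patched"
    start = source.find(START_MARKER)
    end = source.find(END_MARKER, start + 1) if start != -1 else -1
    if start == -1 or end == -1:
        # Leave the file untouched so a fallback patch can still run.
        return source, "markers_not_found"
    patched = source[:start] + new_class_src.rstrip("\n") + "\n\n\n" + source[end:]
    return patched, "patched"
```

Replacing the whole span between the class header and the next top-level function means the result does not depend on what an earlier patch did to the method bodies, which is the failure mode this commit describes.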
Bumps transformers from 4.57.1 to 5.0.0rc3.
Release notes
Sourced from transformers's releases.
... (truncated)
Commits
- cb5079f v5.0.0rc3
- d1808f2 [ci] Fixing some failing tests for important models (#43231)
- 3d27645 Add LightOnOCR model implementation (#41621)
- 77146cc fix crash in when running FSDP2+TP (#43226)
- 61317f5 [CB] Ensure parallel decoding test passes using FA (#43277)
- 1efe1a6 Fix failing `PegasusX`, `Mvp` & `LED` model integration tests (#43245)
- e8ae373 [consistency] Ensure models are added to the `_toctree.yml` (#43264)
- c85be98 [docs] tensorrt-llm (#43176)
- 38022fd [style] Fix init isort and align makefile and CI (#43260)
- e977446 Fix failing `Hiera`, `SwiftFormer` & `LED` Model integration tests (#43225)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

You can disable automated security fix PRs for this repo from the Security Alerts page.