
Bump transformers from 4.57.1 to 5.0.0rc3 #3

Open
dependabot[bot] wants to merge 1 commit into main from dependabot/pip/transformers-5.0.0rc3

dependabot[bot] commented on behalf of github on Apr 10, 2026

Bumps transformers from 4.57.1 to 5.0.0rc3.

Release notes

Sourced from transformers's releases.

Release candidate v5.0.0rc3

New models:

What's Changed

We are getting closer and closer to the official release! This RC is focused on removing more of the deprecated stuff, fixing some minor issues, and updating docs.

... (truncated)


Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    You can disable automated security fix PRs for this repo from the Security Alerts page.

Bumps [transformers](https://github.com/huggingface/transformers) from 4.57.1 to 5.0.0rc3.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.57.1...v5.0.0rc3)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.0.0rc3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot[bot] added the dependencies (Pull requests that update a dependency file) and python (Pull requests that update python code) labels on Apr 10, 2026
CityChan added a commit to CityChan/MemexRL that referenced this pull request May 29, 2026
Smoke test smoke1h_hard_lossless_20260528_204913 ran rollout 0 cleanly
(truncated_ratio=9.6%, 208 samples gathered, healthy reward signal at
raw_reward=-0.139) and completed train step 0 (pg_loss=-0.42, entropy
=0.38). Train step 1 then OOMed at:

  slime/utils/ppo_utils.py:173 inside _VocabParallelEntropy.forward
    normalized_vocab_parallel_logits = vocab_parallel_logits - logits_max
  torch.OutOfMemoryError: Tried to allocate 6.11 GiB

This subtraction creates a full (N_tokens, 152064) intermediate. The
chunked-entropy monkey-patch from memex-30b-sbatch-patches Accenture#3 only
chunks the *later* mul_reduce call; the subtraction at line 173 still
materializes a full-vocab tensor. With max-tokens-per-gpu=2048, that
tensor was ~2.5 GiB raw plus the entropy clone copy plus the
normalized copy. The 12 GiB of reserved-but-unallocated PyTorch memory
left fragmented by step 0 meant the ~6 GiB allocation in step 1 could
not fit into the 5.37 GiB still free on the 95 GiB GH200.

Step 0 succeeding proves the algorithm/gradient chain works end-to-end
at this batch size. The fix is just to halve per-iter peak memory
again so step 1+ have headroom against the fragmentation that
accumulated from step 0. Doubling the microbatch count means ~30%
slower actor_train (per-iter scales sub-linearly with token count
because of fixed kernel-launch overheads), which trades training
throughput for elimination of the iter-1 OOM that wiped out the
remaining 12 iters of rollout 0 anyway.

This is the third halving of MAX_TOKENS_PER_GPU since the first
post-truncation-patch run:
  8192 (original)           -> too big, lossless_db GPU OOM in
                                compute_log_probs(logits.clone())
  4096 (commit 2b75d3b)     -> too big, same path
  2048 (commit 08306fd)     -> still OOM but in entropy.forward
  1024 (this commit)        -> per-iter peak ~halves the entropy
                                subtraction; should fit alongside
                                step-0 fragmentation

If 1024 still OOMs the proper fix is to extend the chunked-entropy
monkey-patch to also chunk the line-173 subtraction (and the
subsequent exp / div paths). Doing that requires ~50 lines of patch
because the in-place exp_/div_ pattern in the existing forward can't
naively be applied to a chunked subtractor without losing the
backward-pass requirement that ctx.save_for_backward gets the
original vocab_parallel_logits. Reserving for a follow-up if 1024
isn't enough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CityChan added a commit to CityChan/MemexRL that referenced this pull request May 29, 2026
…ure#5)

Adds a monkey-patch that replaces the entire _VocabParallelEntropy.forward
and backward in slime/utils/ppo_utils.py with vocab-chunked versions.

Why this matters: the existing chunked-mul_reduce patch (sbatch patch Accenture#3)
only chunks the very last reduction inside the entropy forward. The
earlier line 173:

    normalized_vocab_parallel_logits = vocab_parallel_logits - logits_max

allocates a full (N_tokens, vocab=152064) bf16 tensor. With our long-
context agent samples (~14K tokens each, single-sample microbatches
under slime's dynamic batching because per-sample > max-tokens-per-gpu),
N is ~20K, the alloc is ~6 GB, and Vista's GH200 with SGLang holding
~30 GB of CUDA graph state has no contiguous 6 GB free at train iter 1
after step 0's PyTorch cache fragmentation. Smokes 727417 and 727488
(2026-05-28) both reproduced this: rollout 0 + train step 0 succeed,
train step 1 OOMs in the line-173 subtraction, regardless of whether
max-tokens-per-gpu is 4096, 2048, or 1024 (the per-microbatch token
count is set by sample length, not the cap).
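
The size arithmetic in the paragraph above can be checked with a short sketch. The helper name is hypothetical, and N = 20,000 is a round stand-in for the "~20K" tokens quoted above; only the vocab size, the bf16 element width and the 16384 chunk width come from this commit message.

```python
def dense_tensor_bytes(n_tokens: int, vocab_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes for a dense (n_tokens, vocab_size) tensor; 2 bytes per element for bf16."""
    return n_tokens * vocab_size * bytes_per_elem


VOCAB = 152064       # vocab size quoted in the commit message
N_TOKENS = 20_000    # assumption: round stand-in for "~20K" tokens

full_bytes = dense_tensor_bytes(N_TOKENS, VOCAB)    # the (N, V) intermediate
chunk_bytes = dense_tensor_bytes(N_TOKENS, 16384)   # one vocab chunk of the patch

print(f"full (N, V) tensor: {full_bytes / 1e9:.2f} GB")
print(f"one (N, 16384) chunk: {chunk_bytes / 2**20:.0f} MiB")
```

Because N is set by the sample length, lowering max-tokens-per-gpu leaves full_bytes unchanged whenever a single sample exceeds the cap.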

The new patch eliminates the (N, V) intermediate entirely:

  forward (two vocab-chunked passes):
    Pass 1 accumulates sum_exp = sum_v(exp(z_v - max)) chunk by chunk.
    Pass 2 accumulates sum_softmax_times_logits = sum_v(softmax_v * z_v)
    chunk by chunk. Per-chunk tensors are (N, 16384) ~= 625 MB, not 6 GB.
    Saves only logits_max + sum_exp + sum_softmax_times_logits +
    vocab_parallel_logits for backward (no softmax_logits buffer kept
    alive).

  backward (vocab-chunked, in-place):
    Recomputes softmax chunk-wise from sum_exp + logits_max + chunk of
    saved logits, applies the standard -softmax*(z - sum)*grad_out
    formula, writes the result into the vocab_parallel_logits buffer
    in-place. Safe because slime always calls
    compute_entropy_from_logits(logits.clone(), tp_group), so the buffer
    is private.
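
The two-pass forward and the in-place backward described above can be sketched in plain Python for a single token row. This is a minimal illustration, not the patch itself: it assumes one process (no tensor-parallel all-reduce), uses Python lists in place of tensors, and all function names are hypothetical. It uses the unshifted logits z in pass 2, so the entropy is log(sum_exp) + max - sum_v(softmax_v * z_v).

```python
import math


def entropy_naive(logits):
    """Reference path: materializes the full softmax row at once."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps if e > 0.0)


def chunks(n, chunk_size):
    """Yield (start, stop) index pairs that cover range(n)."""
    for start in range(0, n, chunk_size):
        yield start, min(start + chunk_size, n)


def entropy_forward_chunked(logits, chunk_size):
    m = max(logits)  # stands in for logits_max
    # Pass 1: sum_exp = sum_v exp(z_v - max), one vocab chunk at a time.
    sum_exp = 0.0
    for a, b in chunks(len(logits), chunk_size):
        sum_exp += sum(math.exp(z - m) for z in logits[a:b])
    # Pass 2: sum_softmax_times_logits = sum_v softmax_v * z_v, chunk by chunk.
    sum_softmax_times_logits = 0.0
    for a, b in chunks(len(logits), chunk_size):
        sum_softmax_times_logits += sum(
            math.exp(z - m) / sum_exp * z for z in logits[a:b]
        )
    entropy = math.log(sum_exp) + m - sum_softmax_times_logits
    # Only three scalars per row are saved; no softmax buffer is kept.
    return entropy, (m, sum_exp, sum_softmax_times_logits)


def entropy_backward_chunked_(buf, saved, grad_out, chunk_size):
    """Overwrite buf (a private copy of the logits) with d(entropy)/d(logits)."""
    m, sum_exp, sum_softmax_times_logits = saved
    for a, b in chunks(len(buf), chunk_size):
        for i in range(a, b):
            z = buf[i]
            p = math.exp(z - m) / sum_exp  # softmax recomputed per chunk
            buf[i] = -p * (z - sum_softmax_times_logits) * grad_out
    return buf


logits = [2.0, -1.0, 0.5, 3.0, 0.0]
entropy, saved = entropy_forward_chunked(logits, chunk_size=2)
# list(logits) plays the role of logits.clone(): the buffer is private.
grad = entropy_backward_chunked_(list(logits), saved, grad_out=1.0, chunk_size=2)
```

The in-place write is only safe here for the same reason given above: the backward receives a copy, so overwriting it cannot corrupt the caller's logits.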

Math identity: same as upstream. sum_exp and sum_softmax_times_logits
match upstream up to floating-point reduction order. The entropy and
backward formulas are unchanged.

Memory: peak alloc inside entropy.forward drops from ~12 GB
(vocab_parallel_logits + normalized_vocab_parallel_logits) to ~8 GB
(vocab_parallel_logits + ~625 MB per-chunk intermediate). That ~4 GB
of saved peak is exactly what we need to fit alongside the post-step-0
fragmentation that crashed the smoke runs.

The legacy chunked-mul_reduce patch (Accenture#3) is kept directly below as a
fallback in case slime upstream changes _VocabParallelEntropy's source
in a way that breaks the new patch's needle match. If the new patch
succeeds, the old patch's needle (which targets the original `def
mul_reduce` inside the original forward) will no longer match and it
will print a harmless WARN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CityChan added a commit to CityChan/MemexRL that referenced this pull request May 31, 2026
The needle-based patch from 6b801e1 / 5203349 didn't match on vista
even after the heredoc fix, because slime/utils/ppo_utils.py already
had the legacy chunked-mul_reduce patch applied from previous runs.
My needle included the original `@torch.compile\n        def
mul_reduce(a, b)\n            return (a * b).sum(...)` block, but the
file's actual mul_reduce is now the multi-line chunked version from
sbatch patch Accenture#3. Result: needle missed silently, only legacy patch
was active, training still OOMed at original line 186 (the
`vocab_parallel_logits - logits_max` subtraction).
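
The silent miss described above is easy to reproduce: `str.replace` returns the input unchanged when the needle is absent. A minimal sketch, using placeholder source strings and a hypothetical helper that fails loudly instead:

```python
def apply_needle_patch(source: str, needle: str, replacement: str) -> str:
    """Replace the first occurrence of needle, or raise if it is absent."""
    if needle not in source:
        raise LookupError("needle not found; file may already be patched")
    return source.replace(needle, replacement, 1)


# Placeholder snippets, not the real ppo_utils.py contents.
NEEDLE = "def mul_reduce(a, b):\n    return (a * b).sum()\n"
pristine = "x = 1\n" + NEEDLE
already_patched = (
    "x = 1\n"
    "def mul_reduce(a, b):\n"
    "    # chunked variant left by an earlier patch (placeholder body)\n"
    "    return chunked_sum(a, b)\n"
)

# Plain str.replace gives no signal that nothing was patched.
silently_unchanged = already_patched.replace(NEEDLE, "NEW\n") == already_patched

try:
    apply_needle_patch(already_patched, NEEDLE, "NEW\n")
    miss_detected = False
except LookupError:
    miss_detected = True
```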

Job 729970 (2026-05-30) reproduced this: train step 0 completes
(pg_loss=-0.449, entropy=0.401), step 1 OOMs at the same line 186 as
before, with 14.09 GiB reserved-but-unallocated fragmentation and
only 6.24 GiB free trying to allocate 5.54 GiB. Identical pattern to
728933.

Rewrite the patch to locate the class boundaries by text markers
(`class _VocabParallelEntropy(torch.autograd.Function):` start,
`def compute_entropy_from_logits(` end) and replace the entire class
wholesale. This is robust to whatever the legacy patch did to the
method body, and idempotent via the chunked_entropy_full marker
inside the new class.
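
A minimal sketch of the boundary-based replacement, assuming the end marker sits at column 0 with no decorator above it. The function name and status strings are hypothetical; the two boundary markers and the chunked_entropy_full idempotency marker are the ones named above, and the replacement class is assumed to contain that marker.

```python
START_MARKER = "class _VocabParallelEntropy(torch.autograd.Function):"
END_MARKER = "def compute_entropy_from_logits("
IDEMPOTENCY_MARKER = "chunked_entropy_full"


def replace_class(source: str, new_class_src: str):
    """Return (new_source, status); source is returned unchanged unless patched."""
    if IDEMPOTENCY_MARKER in source:
        return source, "already-applied"
    start = source.find(START_MARKER)
    end = source.find(END_MARKER, start + 1) if start != -1 else -1
    if start == -1 or end == -1:
        return source, "boundaries-not-found"
    # Keep everything before the class and everything from the end marker on;
    # whatever an earlier patch did to the class body is discarded wholesale.
    patched = source[:start] + new_class_src.rstrip("\n") + "\n\n\n" + source[end:]
    return patched, "patched"


# Placeholder file contents for illustration only.
old_src = (
    "import torch\n\n"
    "class _VocabParallelEntropy(torch.autograd.Function):\n"
    "    def forward(ctx, x):\n"
    "        return legacy(x)\n\n\n"
    "def compute_entropy_from_logits(logits, group):\n"
    "    return _VocabParallelEntropy.apply(logits, group)\n"
)
new_class = (
    "class _VocabParallelEntropy(torch.autograd.Function):\n"
    "    # chunked_entropy_full\n"
    "    pass\n"
)

patched_src, status = replace_class(old_src, new_class)
repatched_src, status_again = replace_class(patched_src, new_class)
```

Running the function a second time is a no-op because the marker inside the new class short-circuits the search.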

When this patch succeeds, the legacy chunked-mul_reduce patch
running after will find neither its marker nor its needle (the
mul_reduce inner function no longer exists in the chunked forward),
print a harmless WARN, and skip.

If this patch fails (class boundaries somehow not found because
slime renamed the class or compute_entropy_from_logits), the legacy
patch still applies as before. That gives only partial relief:
training will still OOM, but at least we're no worse than today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
