Bump transformers from 4.57.1 to 5.0.0rc3 by dependabot[bot] · Pull Request #3 · Accenture/MemexRL

dependabot · 2026-04-10T01:51:55Z

Bumps transformers from 4.57.1 to 5.0.0rc3.

Release notes

Release candidate v5.0.0rc3

New models:

[GLM-4.7] GLM-Lite Support by @zRzRzRzRzRzRzR in huggingface/transformers#43031

[GLM-Image] AR Model Support for GLM-Image by @zRzRzRzRzRzRzR in huggingface/transformers#43100

Add LWDetr model by @sbucaille in huggingface/transformers#40991

Add LightOnOCR model implementation by @baptiste-aubertin in huggingface/transformers#41621

What's Changed

We are getting closer and closer to the official release! This RC focuses on removing more deprecated code, fixing some minor issues, and updating the docs.

Update Japanese README to match English version by @lilin-1 in huggingface/transformers#43069

[docs] Deploying by @stevhliu in huggingface/transformers#42263

[docs] inference engines by @stevhliu in huggingface/transformers#42932

Fix typos: Remove duplicate duplicate words words by @efeecllk in huggingface/transformers#43040

[style] Rework ruff rules and update all files by @Cyrilvallez in huggingface/transformers#43144

[CB] Minor fix in kwargs by @remi-or in huggingface/transformers#43147

[Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT by @sniper35 in huggingface/transformers#43068

Fix some deprecated practices in torch 2.9 by @Cyrilvallez in huggingface/transformers#43167

Fix Fuyu processor width dimension bug in _get_num_multimodal_tokens by @Abhinavexists in huggingface/transformers#43137

Inherit from PreTrainedTokenizerBase by @juliendenize in huggingface/transformers#43143

Generation config boolean defaults by @zucchini-nlp in huggingface/transformers#43000

Fix failing BartModelIntegrationTest by @Sai-Suraj-27 in huggingface/transformers#43160

fix failure of llava/pixtral by @sywangyi in huggingface/transformers#42985

GemmaTokenizer: remove redundant whitespace pre-tokenizer by @vaibhav-research in huggingface/transformers#43106

Support auto_docstring in Processors by @yonigozlan in huggingface/transformers#42101

Fix failing BitModelIntegrationTest by @Sai-Suraj-27 in huggingface/transformers#43164

[Fp8] Fix experts by @vasqu in huggingface/transformers#43154

Docs: improve wording for documentation build instructions by @Sailnagale in huggingface/transformers#43007

[makefile] Cleanup and improve the rules by @Cyrilvallez in huggingface/transformers#43171

Some new models added stuff that was already removed by @Cyrilvallez in huggingface/transformers#43179

Fixes and compilation warning in torchao docs by @merveenoyan in huggingface/transformers#42909

[cache] Remove all deprecated classes by @Cyrilvallez in huggingface/transformers#43168

Bump huggingface_hub minimal version by @Wauplin in huggingface/transformers#43188

Rework check_config_attributes.py by @Cyrilvallez in huggingface/transformers#43191

Fix generation config validation by @zucchini-nlp in huggingface/transformers#43175

[style] Use 'x | y' syntax for processors as well by @Wauplin in huggingface/transformers#43189

Remove deprecated objects by @Cyrilvallez in huggingface/transformers#43170

fix chunked prefill implementation issue-43082 by @marcndo in huggingface/transformers#43132

Reduce add_dates verbosity by @yonigozlan in huggingface/transformers#43184

Add support for MiniMax-M2 by @rogeryoungh in huggingface/transformers#42028

Fix failing salesforce-ctrl, xlm & gpt-neo model generation tests by @Sai-Suraj-27 in huggingface/transformers#43180

Less verbose library helpers by @Cyrilvallez in huggingface/transformers#43197

run all test files on CircleCI by @ydshieh in huggingface/transformers#43146

Clamp temperature to >=1.0 for Dia generation by @Haseebasif7 in huggingface/transformers#43029

Fix spelling typos in comments and code by @raimbekovm in huggingface/transformers#43046

[docs] llama.cpp by @stevhliu in huggingface/transformers#43185

... (truncated)

Commits

cb5079f v5.0.0rc3
d1808f2 [ci] Fixing some failing tests for important models (#43231)
3d27645 Add LightOnOCR model implementation (#41621)
77146cc fix crash in when running FSDP2+TP (#43226)
61317f5 [CB] Ensure parallel decoding test passes using FA (#43277)
1efe1a6 Fix failing PegasusX, Mvp & LED model integration tests (#43245)
e8ae373 [consistency] Ensure models are added to the _toctree.yml (#43264)
c85be98 [docs] tensorrt-llm (#43176)
38022fd [style] Fix init isort and align makefile and CI (#43260)
e977446 Fix failing Hiera, SwiftFormer & LED Model integration tests (#43225)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the Security Alerts page.

Bumps [transformers](https://github.com/huggingface/transformers) from 4.57.1 to 5.0.0rc3.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.57.1...v5.0.0rc3)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.0.0rc3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
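This bump crosses a major version and targets a release candidate, and the release notes above mention removed deprecated classes and objects. A minimal sketch of how a reviewer might flag such a bump is shown below. The helper names (`parse_release`, `is_major_bump`, `is_prerelease`) are hypothetical, and the parser is a simplification that treats any suffix after `X.Y.Z` as a pre-release marker; it is not a full PEP 440 implementation.

```python
import re


def parse_release(version):
    """Split a version string such as '5.0.0rc3' into ((5, 0, 0), 'rc3').

    Simplified on purpose: only handles 'X.Y.Z' followed by an optional suffix.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(.*)", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, suffix = match.groups()
    return (int(major), int(minor), int(patch)), suffix


def is_major_bump(old, new):
    """True when the major component increases between old and new."""
    return parse_release(new)[0][0] > parse_release(old)[0][0]


def is_prerelease(version):
    """True when anything follows X.Y.Z (assumption: suffix means pre-release)."""
    return parse_release(version)[1] != ""


# The two versions named in this PR.
old_version, new_version = "4.57.1", "5.0.0rc3"
print("major bump:", is_major_bump(old_version, new_version))
print("pre-release target:", is_prerelease(new_version))
```

Both checks come out true for this PR, which is one reason to run the project's own tests against the new version before merging a `direct:production` dependency update.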

Smoke test smoke1h_hard_lossless_20260528_204913 ran rollout 0 cleanly (truncated_ratio=9.6%, 208 samples gathered, healthy reward signal at raw_reward=-0.139) and completed train step 0 (pg_loss=-0.42, entropy=0.38). Train step 1 then OOMed at:

    slime/utils/ppo_utils.py:173 inside _VocabParallelEntropy.forward
      normalized_vocab_parallel_logits = vocab_parallel_logits - logits_max
    torch.OutOfMemoryError: Tried to allocate 6.11 GiB

This subtraction creates a full (N_tokens, 152064) intermediate. The chunked-entropy monkey-patch from memex-30b-sbatch-patches Accenture#3 only chunks the *later* mul_reduce call; the subtraction at line 173 still materializes a full-vocab tensor. With max-tokens-per-gpu=2048, that tensor was ~2.5 GiB raw, on top of the entropy clone copy and the normalized copy, and the 12 GiB of reserved-but-unallocated PyTorch fragmentation left over from step 0 meant the ~6 GiB allocation at step 1 could not fit into the 5.37 GiB still free on the 95 GiB GH200.

Step 0 succeeding shows the algorithm/gradient chain works end-to-end at this batch size. The fix is to halve per-iter peak memory again so steps 1+ have headroom against the fragmentation accumulated from step 0. Doubling the microbatch count makes actor_train ~30% slower (per-iter time scales sub-linearly with token count because of fixed kernel-launch overheads). This trades training throughput for eliminating the iter-1 OOM, which wiped out the remaining 12 iters of rollout 0 anyway.

This is the third halving of MAX_TOKENS_PER_GPU since the first post-truncation-patch run:

    8192 (original)       -> too big, lossless_db GPU OOM in compute_log_probs(logits.clone())
    4096 (commit 2b75d3b) -> too big, same path
    2048 (commit 08306fd) -> still OOM but in entropy.forward
    1024 (this commit)    -> per-iter peak ~halves the entropy subtraction; should fit alongside step-0 fragmentation

If 1024 still OOMs, the proper fix is to extend the chunked-entropy monkey-patch to also chunk the line-173 subtraction (and the subsequent exp / div paths). Doing that requires ~50 lines of patch, because the in-place exp_/div_ pattern in the existing forward can't naively be applied to a chunked subtraction without violating the backward-pass requirement that ctx.save_for_backward receives the original vocab_parallel_logits. Reserving that for a follow-up if 1024 isn't enough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
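The size of the full-vocab intermediate can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not taken from the repository: the helper names are hypothetical, and the element widths (2 bytes for bf16, 4 bytes for fp32) are assumptions, since this commit message does not state the dtype of the failing allocation. Only the vocab size (152064) and the 6.11 GiB figure come from the message above.

```python
GIB = 2**30
VOCAB_SIZE = 152064  # vocab dimension quoted in the commit message


def intermediate_gib(n_tokens, bytes_per_elem, vocab_size=VOCAB_SIZE):
    """Size in GiB of one dense (n_tokens, vocab_size) tensor."""
    return n_tokens * vocab_size * bytes_per_elem / GIB


def tokens_for_allocation(alloc_gib, bytes_per_elem, vocab_size=VOCAB_SIZE):
    """Number of token rows implied by a single allocation of alloc_gib."""
    return alloc_gib * GIB / (vocab_size * bytes_per_elem)


# Assumed element widths; the real dtype should be confirmed in the trace.
for name, width in (("bf16", 2), ("fp32", 4)):
    rows = tokens_for_allocation(6.11, width)
    size_at_cap = intermediate_gib(2048, width)
    print(f"{name}: 6.11 GiB ~ {rows:,.0f} rows; 2048 rows ~ {size_at_cap:.2f} GiB")
```

Comparing the implied row count against the configured max-tokens-per-gpu is a quick way to check whether the cap is actually what bounds the tensor in a given microbatch.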

…ure#5)

Adds a monkey-patch that replaces the entire _VocabParallelEntropy.forward and backward in slime/utils/ppo_utils.py with vocab-chunked versions.

Why this matters: the existing chunked-mul_reduce patch (sbatch patch Accenture#3) only chunks the very last reduction inside the entropy forward. The earlier line 173:

    normalized_vocab_parallel_logits = vocab_parallel_logits - logits_max

allocates a full (N_tokens, vocab=152064) bf16 tensor. With our long-context agent samples (~14K tokens each, single-sample microbatches under slime's dynamic batching because per-sample > max-tokens-per-gpu), N is ~20K, the alloc is ~6 GB, and Vista's GH200 with SGLang holding ~30 GB of CUDA graph state has no contiguous 6 GB free at train iter 1 after step 0's PyTorch cache fragmentation.

Smokes 727417 and 727488 (2026-05-28) both reproduced this: rollout 0 + train step 0 succeed, train step 1 OOMs in the line-173 subtraction, regardless of whether max-tokens-per-gpu is 4096, 2048, or 1024 (the per-microbatch token count is set by sample length, not the cap).

The new patch eliminates the (N, V) intermediate entirely:

  forward (two vocab-chunked passes):
    Pass 1 accumulates sum_exp = sum_v(exp(z_v - max)) chunk by chunk.
    Pass 2 accumulates sum_softmax_times_logits = sum_v(softmax_v * z_v) chunk by chunk.
    Per-chunk tensors are (N, 16384) ~= 625 MB, not 6 GB.
    Saves only logits_max + sum_exp + sum_softmax_times_logits + vocab_parallel_logits for backward (no softmax_logits buffer kept alive).

  backward (vocab-chunked, in-place):
    Recomputes softmax chunk-wise from sum_exp + logits_max + chunk of saved logits, applies the standard -softmax*(z - sum)*grad_out formula, and writes the result into the vocab_parallel_logits buffer in-place.
    Safe because slime always calls compute_entropy_from_logits(logits.clone(), tp_group), so the buffer is private.

Math identity: same as upstream. sum_exp and sum_softmax_times_logits match upstream, modulo floating-point reduction order. Entropy formula unchanged. Backward formula unchanged.

Memory: peak alloc inside entropy.forward drops from ~12 GB (vocab_parallel_logits + normalized_vocab_parallel_logits) to ~8 GB (vocab_parallel_logits + ~625 MB per-chunk intermediate). That ~4 GB of saved peak is exactly what we need to fit alongside the post-step-0 fragmentation that crashed the smoke runs.

The legacy chunked-mul_reduce patch (Accenture#3) is kept directly below as a fallback in case slime upstream changes _VocabParallelEntropy's source in a way that breaks the new patch's needle match. If the new patch succeeds, the old patch's needle (which targets the original `def mul_reduce` inside the original forward) will no longer match and it will print a harmless WARN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
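The two-pass forward described in this commit message can be illustrated for a single token's logits in plain Python. This is a sketch of the arithmetic only, not the patch itself: the function names are hypothetical, it ignores tensor parallelism and autograd, and it assumes the entropy is assembled as logits_max + log(sum_exp) - sum_softmax_times_logits, which follows from H = -sum_v(softmax_v * log(softmax_v)).

```python
import math


def chunked_entropy(logits, chunk_size):
    """Softmax entropy for one token, visiting the vocab in fixed-size chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    logits_max = max(logits)

    # Pass 1: accumulate sum_exp = sum_v exp(z_v - max), one chunk at a time.
    sum_exp = 0.0
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        sum_exp += sum(math.exp(z - logits_max) for z in chunk)

    # Pass 2: accumulate sum_v softmax_v * z_v, recomputing exp per chunk
    # so that no full-vocab softmax buffer is ever held.
    sum_softmax_times_logits = 0.0
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        sum_softmax_times_logits += sum(
            math.exp(z - logits_max) / sum_exp * z for z in chunk
        )

    return logits_max + math.log(sum_exp) - sum_softmax_times_logits


def reference_entropy(logits):
    """Unchunked reference: materializes the full softmax, then -sum(p log p)."""
    logits_max = max(logits)
    exps = [math.exp(z - logits_max) for z in logits]
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps)


example = [0.5, -1.25, 3.0, 2.0, -0.75, 1.5, 0.0]
print(chunked_entropy(example, chunk_size=3), reference_entropy(example))
```

The chunk size changes only the order in which partial sums are accumulated, so results agree with the unchunked reference up to floating-point rounding. The cost is that exp is evaluated twice per element, which is the usual trade of extra compute for lower peak memory.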

The needle-based patch from 6b801e1 / 5203349 didn't match on vista even after the heredoc fix, because slime/utils/ppo_utils.py already had the legacy chunked-mul_reduce patch applied from previous runs. My needle included the original `@torch.compile\n def mul_reduce(a, b)\n return (a * b).sum(...)` block, but the file's actual mul_reduce is now the multi-line chunked version from sbatch patch Accenture#3. Result: the needle missed silently, only the legacy patch was active, and training still OOMed at original line 186 (the `vocab_parallel_logits - logits_max` subtraction).

Job 729970 (2026-05-30) reproduced this: train step 0 completes (pg_loss=-0.449, entropy=0.401), step 1 OOMs at the same line 186 as before, with 14.09 GiB reserved-but-unallocated fragmentation and only 6.24 GiB free trying to allocate 5.54 GiB. Identical pattern to 728933.

Rewrite the patch to locate the class boundaries by text markers (`class _VocabParallelEntropy(torch.autograd.Function):` start, `def compute_entropy_from_logits(` end) and replace the entire class wholesale. This is robust to whatever the legacy patch did to the method body, and idempotent via the chunked_entropy_full marker inside the new class.

When this patch succeeds, the legacy chunked-mul_reduce patch running after it will find neither its marker nor its needle (the mul_reduce inner function no longer exists in the chunked forward), print a harmless WARN, and skip.

If this patch fails (class boundaries not found because slime renamed the class or compute_entropy_from_logits), the legacy patch still applies as before: partial relief, training will still OOM, but at least we're no worse than today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
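The boundary-marker approach described above can be sketched as a small pure-Python text transform. This is an illustration under stated assumptions, not the actual sbatch patch: the function name `replace_class` and the status strings are hypothetical, and it assumes the end marker is an undecorated top-level `def` that directly follows the class. The three marker strings are the ones quoted in the commit message.

```python
START_MARKER = "class _VocabParallelEntropy(torch.autograd.Function):"
END_MARKER = "def compute_entropy_from_logits("
IDEMPOTENCY_MARKER = "chunked_entropy_full"


def replace_class(source, new_class_source):
    """Swap everything from START_MARKER up to END_MARKER for new_class_source.

    Returns (new_source, status), where status is one of
    'patched', 'already-patched', or 'not-found'.
    """
    if IDEMPOTENCY_MARKER not in new_class_source:
        raise ValueError("replacement must carry the idempotency marker")
    if IDEMPOTENCY_MARKER in source:
        # A previous run already installed the new class; do nothing.
        return source, "already-patched"
    start = source.find(START_MARKER)
    end = source.find(END_MARKER, start + 1) if start != -1 else -1
    if start == -1 or end == -1:
        # Upstream renamed something; leave the file for the fallback patch.
        return source, "not-found"
    body = new_class_source.rstrip("\n") + "\n\n\n"
    return source[:start] + body + source[end:], "patched"


original = (
    "import torch\n\n\n"
    "class _VocabParallelEntropy(torch.autograd.Function):\n"
    "    def forward(ctx, x):\n"
    "        return x\n\n\n"
    "def compute_entropy_from_logits(logits, group):\n"
    "    return logits\n"
)
replacement = (
    "class _VocabParallelEntropy(torch.autograd.Function):\n"
    "    # chunked_entropy_full\n"
    "    pass\n"
)
patched, status = replace_class(original, replacement)
print(status)
```

Because the match depends only on the two boundary lines, edits that an earlier patch made inside the method bodies do not affect it, and the marker check makes a second run a no-op.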

dependabot Bot added the dependencies (Pull requests that update a dependency file) and python (Pull requests that update python code) labels Apr 10, 2026

dependabot[bot] wants to merge 1 commit into main from dependabot/pip/transformers-5.0.0rc3
