
[build] fix: restore ModelOpt-compatible TransformerEngine revision #4615

Closed
yaoyu-33 wants to merge 2 commits into r0.5.0 from yuya/mcore-release-r0.5.0-autofix-20260701-pr4613


yaoyu-33 commented Jul 1, 2026

Contributor

Summary

Mirrors the automated release bump from #4613 and restores the TransformerEngine override to the ModelOpt-compatible d64bc14dc87eb658ab98839e4b7687595ee53e2d revision.

  • Target: release-r0.5.0 (base branch r0.5.0)
  • Classification: Bridge broke itself
  • Guards: none added or removed

Root cause

On 2026-06-26, release PR #4535 updated the Bridge override to TransformerEngine b9d690e0 while retaining nvidia-modelopt==0.44.0rc5. TransformerEngine now passes m_splits as an explicit grouped-linear argument, but that ModelOpt release still reads the first non_tensor_args item as the split sequence and raises TypeError: object of type 'bool' has no len().
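The failure mode described above can be reproduced in isolation. The sketch below is a minimal illustration only: `old_style_consumer` and both argument tuples are hypothetical stand-ins, not the actual ModelOpt or TransformerEngine code, and the tuple contents are assumed purely to show how a leading boolean triggers the error.

```python
# Hypothetical stand-ins; this is NOT the real TransformerEngine or ModelOpt code.

def old_style_consumer(non_tensor_args):
    """Assumes the first non-tensor argument is the split sequence."""
    m_splits = non_tensor_args[0]
    return len(m_splits)

# Assumed older layout: the split sequence is the first non-tensor argument.
old_layout = ([4, 4, 8], True)

# Assumed newer layout: the splits are passed separately, so a flag comes first.
new_layout = (True, False)

num_groups = old_style_consumer(old_layout)  # works: three split entries

try:
    old_style_consumer(new_layout)
    message = None
except TypeError as exc:
    # len() of a bool fails with the same message seen in the quantization jobs.
    message = str(exc)

print(num_groups, message)
```

The consumer itself is unchanged between the two calls; only the position of the split sequence moved, which is why pinning the producer back to a compatible revision is enough to resolve the error.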

The 2026-07-01 automated MCore bump #4613 exposes the already-present incompatibility in both H100 and GB200 Qwen3 MoE quantization jobs.

Fix

This is the release-line counterpart of #4600, whose H100 and GB200 Qwen quantization jobs both pass with the same TransformerEngine revision.

Validation

  • uv run pre-commit run --all-files — passed on 2026-07-01.
  • CW interactive job 13313466 on 2026-07-01:
    • uv lock --check — passed (344 packages).
    • NVTE_CUDA_ARCHS=90 uv sync --locked --group dev --group test --extra te — passed and installed TransformerEngine 2.16.0+d64bc14d.
    • uv run --no-sync python -m pytest tests/unit_tests/models/gpt/test_gpt_builder.py tests/unit_tests/models/test_gpt_provider.py -v — 81 passed.
    • Installed grouped-linear compatibility smoke — passed; signature is (ctx, inp, non_tensor_args, *weights_and_biases).
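A signature smoke check of the kind listed above could look roughly like the following sketch. It uses stand-in functions instead of importing TransformerEngine; `forward_compatible`, `forward_explicit_splits`, and `has_expected_signature` are hypothetical names, and the second signature is an assumed shape for the incompatible layout, not one taken from the library.

```python
import inspect

# Parameter names reported by the smoke check in this PR.
EXPECTED = ["ctx", "inp", "non_tensor_args", "weights_and_biases"]

def forward_compatible(ctx, inp, non_tensor_args, *weights_and_biases):
    """Stand-in with the signature the compatible revision reports."""

def forward_explicit_splits(ctx, inp, m_splits, non_tensor_args, *weights_and_biases):
    """Stand-in with an assumed signature that passes m_splits explicitly."""

def has_expected_signature(fn):
    # Compare parameter names in order; a varargs parameter shows up by name.
    return list(inspect.signature(fn).parameters) == EXPECTED

print(has_expected_signature(forward_compatible))
print(has_expected_signature(forward_explicit_splits))
```

In a real environment the function under test would be the installed grouped-linear forward rather than a stand-in.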

dimapihtar and others added 2 commits July 1, 2026 06:46
…-07-01)

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
@yaoyu-33 yaoyu-33 requested a review from a team as a code owner July 1, 2026 15:12

copy-pr-bot Bot commented Jul 1, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


yaoyu-33 commented Jul 1, 2026

Contributor Author

/ok to test 7ea9c2f


claude Bot commented Jul 1, 2026

Contributor

LGTM — clean, surgical dependency fix.

The PR correctly reverts the TE override to the ModelOpt-compatible revision (d64bc14d) while keeping the automated MCore bump from #4613. The lockfile delta is consistent: only the intended TE change plus expected transitive movements (e.g. ast-serialize 0.5.0→0.6.0, bracex 2.6→3.0) from floating CVE floors.

The full-test-suite label is applied, which is appropriate for a TE+MCore bump on a release branch.

Suggested test cases: No perf tests impacted.

