
chore(beep boop 🤖): Bump uv.lock (r0.5.0, mcore-core_r0.18.0) (2026-07-03)#4642

Open
svcnvidia-nemo-ci wants to merge 1 commit into r0.5.0 from bump-ci-container-2026-07-03-r0.5.0-core_r0.18.0


@svcnvidia-nemo-ci

Contributor

🚀 PR to bump uv.lock in r0.5.0.

🤖 This PR will be merged automatically once CI passes.

…-07-03)

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@svcnvidia-nemo-ci

Contributor Author

/ok to test a93bcdc

@copy-pr-bot

copy-pr-bot Bot commented Jul 3, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33

yaoyu-33 commented Jul 3, 2026

Contributor

MCore bump auto-fix status for release-r0.5.0:

Classification: "Bridge broke itself" for the deterministic H100 and GB200 Qwen quantization failures. No code fix is proposed for the separate GB200 GPT-OSS signal-11 failure, because current evidence points to a transient CI/GPU-runtime failure.
Evidence: On 2026-07-03, CICD NeMo run https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/runs/28655262318 failed two jobs, L2_Launch_models_qwen_quantization (job https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/runs/28655262318/job/84995090998) and gb200_L2_Launch_models_qwen_quantization (job https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/runs/28655262318/job/84995090925), both at modelopt/torch/quantization/plugins/transformer_engine.py:178 with TypeError: object of type bool has no len(). The MCore range d30c93ffae858b22eece3fa71c734c8f43161eff...458c8d0ecafdf6d9e36771600d62ade27f2a67b7 spans two commits and introduces TransformerEngine b9d690e042b1c4e455214e7dab65d6d3512c05d6; the live r0.5.0 branch still combines that revision with nvidia-modelopt==0.44.0rc5. The GB200 GPT-OSS job (https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/runs/28655262318/job/84995091352) exited with signal 11 after NCCL initialization, whereas its H100 counterpart passed on 2026-07-03 and the same GB200 job passed on PR #4627 on 2026-07-02.
Fix PR: not opened. Closed-unmerged PR #4615 already covers this exact release failure and MCore target SHA by restoring the ModelOpt-compatible TransformerEngine revision. Main-branch fix #4600 merged on 2026-07-02, but it was not backported and the live r0.5.0 pin remains unchanged. Per the duplicate/replacement policy, a replacement PR needs maintainer direction.
Guards: none; this is a dependency-pin compatibility issue.
Validation: PR #4642 completed on 2026-07-03 with import, core unit, lint, installation, and all functional checks passing except the two deterministic Qwen quantization jobs, the GB200 GPT-OSS signal-11 job, and the aggregate summary. No new local or CW interactive validation was run because the closed-unmerged #4615 blocks an unauthorized replacement. Prior #4615 validation on 2026-07-01 passed uv run pre-commit run --all-files, uv lock --check, 81 focused unit tests, and a grouped-linear compatibility smoke test in CW interactive job 13313466.
Next action: maintainer decision needed — reopen/rebase #4615 onto #4642, explicitly backport #4600 to r0.5.0, or authorize a replacement release fix PR. Rerun only gb200_L1_Launch_recipes_gpt_oss if the signal-11 failure needs confirmation.
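
The deterministic-versus-transient reasoning in the status above can be sketched roughly as follows. This is only an illustration of the triage heuristic, not the tooling that produced the comment: the `JobResult` type, the `classify` function, the short job keys, and the error-signature strings are all hypothetical names introduced for the example. The rule assumed here is that the same error on more than one platform counts as deterministic, while a failure whose counterpart on another platform passed and which itself passed on an earlier run is treated as likely transient.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobResult:
    """One CI job outcome. All field names are hypothetical."""
    job: str                      # short job key, e.g. "qwen_quantization"
    platform: str                 # e.g. "H100" or "GB200"
    passed: bool
    error: Optional[str] = None   # short error signature, None when passed


def classify(failure: JobResult,
             current: List[JobResult],
             previous: List[JobResult]) -> str:
    """Return "deterministic", "transient", or "unknown" for one failed job."""
    if failure.passed:
        raise ValueError("classify() expects a failed job")

    # Results for the same job on the other platforms in the current run.
    counterparts = [r for r in current
                    if r.job == failure.job and r.platform != failure.platform]

    # Same error signature on another platform: treat as deterministic.
    if any(not r.passed and r.error == failure.error for r in counterparts):
        return "deterministic"

    # A counterpart passed now, and this exact job passed on an earlier run:
    # treat as likely transient.
    counterpart_passed = any(r.passed for r in counterparts)
    passed_before = any(r.passed and r.job == failure.job
                        and r.platform == failure.platform for r in previous)
    if counterpart_passed and passed_before:
        return "transient"
    return "unknown"


# Outcomes as described in the evidence above, using hypothetical short keys.
QWEN_ERROR = "TypeError: object of type bool has no len()"
current = [
    JobResult("qwen_quantization", "H100", False, QWEN_ERROR),
    JobResult("qwen_quantization", "GB200", False, QWEN_ERROR),
    JobResult("gpt_oss", "H100", True),
    JobResult("gpt_oss", "GB200", False, "signal 11"),
]
previous = [JobResult("gpt_oss", "GB200", True)]  # PR #4627, 2026-07-02

for result in current:
    if not result.passed:
        print(result.job, result.platform, classify(result, current, previous))
```

With these inputs the two Qwen quantization failures come out as deterministic and the GB200 GPT-OSS failure as transient, which matches the classification stated in the comment. A heuristic like this can only suggest a label; a rerun of the GB200 job remains the way to confirm it.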

3 participants