Skip to content

[Common] Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK#3150

Open
phu0ngng wants to merge 8 commits into
NVIDIA:mainfrom
phu0ngng:update_nccl
Open

[Common] Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK#3150
phu0ngng wants to merge 8 commits into
NVIDIA:mainfrom
phu0ngng:update_nccl

Conversation

@phu0ngng

Copy link
Copy Markdown
Collaborator

Description

Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR bumps the 3rdparty/nccl submodule to pick up the MAX_SUPPORTED_TOKENS_PER_RANK fix, and bundles two related improvements: reordering NCCL discovery in setup.py so ldconfig is consulted before well-known prefixes and the pip wheel (reducing ABI mismatch risk), and adding NVLink presence guards to all three EP test/bench scripts so they skip gracefully on hardware without NVLink P2P.

  • Submodule bump (3rdparty/nccl): advances from 808d2433 to a6b5de08 to carry the upstream NCCL fix for MAX_SUPPORTED_TOKENS_PER_RANK.
  • setup.py NCCL discovery reorder: ldconfig is now the first probe after NCCL_HOME, matching the dynamic loader's runtime resolution order and avoiding ABI mismatches when a pip wheel shadows a system NCCL.
  • NVLink guard in shell scripts: all three EP scripts (run_test_ep.sh, run_ep_bench.sh, multi_process_launch_ep.sh) now call nvidia-smi nvlink --status and skip with exit 0 when NVLink is absent, preventing misleading failures on NVLink-less CI runners.

Confidence Score: 5/5

Safe to merge — the submodule bump carries a targeted upstream fix, the setup.py reorder aligns build-time and runtime NCCL resolution, and the NVLink guards are defensive skip-on-absence checks.

All changes are narrowly scoped: the submodule update is a single-commit upstream fix, the setup.py probe reordering preserves existing fallback paths and adds no new failure modes, and the NVLink shell guards only cause a test to skip gracefully rather than alter any test logic.

No files require special attention.

Important Files Changed

Filename Overview
3rdparty/nccl Submodule bump from 808d2433 to a6b5de08 to pick up the MAX_SUPPORTED_TOKENS_PER_RANK fix in NCCL.
setup.py Reorders NCCL discovery: ldconfig is now consulted before well-known prefixes, and the pip wheel (nvidia-nccl-cu*) is moved to last resort; prevents ABI mismatches when the system loader resolves a different libnccl than the one used at build time.
tests/cpp_distributed/run_test_ep.sh Adds an NVLink detection guard that skips the EP test when NVLink P2P is unavailable; uses consistent [[ ]] bash double-bracket style throughout.
examples/jax/ep/bench/run_ep_bench.sh Adds the same NVLink detection guard as the test scripts; uses POSIX single-bracket for $? and -z checks, mixing styles with [[ ]] for pattern matching — inconsistent but functionally correct.
tests/jax/multi_process_launch_ep.sh Adds NVLink detection guard matching the same pattern as the other scripts; mixed bracket styles (POSIX [ ] alongside [[ ]]) are consistent with the rest of this file.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_discover_nccl_home] --> B{NCCL_HOME\nenv var set?}
    B -- yes --> C{nccl.h + libnccl\nexist at path?}
    C -- yes --> D[Return NCCL_HOME]
    C -- no --> E[Warning: NCCL_HOME\nset but invalid]
    B -- no --> F[ldconfig -p probe\nnew: checked first]
    E --> F
    F -- found --> G[Return ldconfig path]
    F -- not found --> H[Well-known prefix scan\n/opt/nvidia/nccl\n/usr/local/nccl\n/usr]
    H -- found --> I[Return prefix path]
    H -- not found --> J[pip wheel probe\nnvidia.nccl site-packages\nnew: last resort]
    J -- found --> K[Return pip wheel path]
    J -- not found --> L[RuntimeError:\nCould not locate NCCL]

    style F fill:#d4edda,stroke:#28a745
    style J fill:#fff3cd,stroke:#ffc107
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[_discover_nccl_home] --> B{NCCL_HOME\nenv var set?}
    B -- yes --> C{nccl.h + libnccl\nexist at path?}
    C -- yes --> D[Return NCCL_HOME]
    C -- no --> E[Warning: NCCL_HOME\nset but invalid]
    B -- no --> F[ldconfig -p probe\nnew: checked first]
    E --> F
    F -- found --> G[Return ldconfig path]
    F -- not found --> H[Well-known prefix scan\n/opt/nvidia/nccl\n/usr/local/nccl\n/usr]
    H -- found --> I[Return prefix path]
    H -- not found --> J[pip wheel probe\nnvidia.nccl site-packages\nnew: last resort]
    J -- found --> K[Return pip wheel path]
    J -- not found --> L[RuntimeError:\nCould not locate NCCL]

    style F fill:#d4edda,stroke:#28a745
    style J fill:#fff3cd,stroke:#ffc107
Loading

Reviews (4): Last reviewed commit: "Merge branch 'main' into update_nccl" | Re-trigger Greptile

Comment thread 3rdparty/nccl Outdated
@phu0ngng

Copy link
Copy Markdown
Collaborator Author

/te-ci L1

@phu0ngng

Copy link
Copy Markdown
Collaborator Author

/te-ci L1

@phu0ngng

Copy link
Copy Markdown
Collaborator Author

/te-ci L1

phu0ngng added 3 commits June 29, 2026 01:30
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng

Copy link
Copy Markdown
Collaborator Author

/te-ci L1

@jberchtold-nvidia jberchtold-nvidia left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants