Skip to content

Add more tests#537

Open
micmelesse wants to merge 2 commits into
ROCm:mainfrom
micmelesse:micmelesse/ccl_capture_main
Open

Add more tests#537
micmelesse wants to merge 2 commits into
ROCm:mainfrom
micmelesse:micmelesse/ccl_capture_main

Conversation

@micmelesse

@micmelesse micmelesse commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Motivation

This pr expands test coverage from a single eager call to back to back calls with both eager and graph capture. We also add a flag for varying input. The additional tests fail locally. It seems there is a correctness issue with the kernels when running in non eager modes.

Technical Details

Test Plan

Test Result

Submission Checklist

@micmelesse micmelesse force-pushed the micmelesse/ccl_capture_main branch from bca0178 to cc51e19 Compare June 17, 2026 23:37
@micmelesse micmelesse force-pushed the micmelesse/ccl_capture_main branch from cc51e19 to dda5cc2 Compare June 17, 2026 23:43
@micmelesse micmelesse marked this pull request as ready for review June 17, 2026 23:44
@micmelesse micmelesse requested a review from mawad-amd as a code owner June 17, 2026 23:44
Copilot AI review requested due to automatic review settings June 17, 2026 23:44
@micmelesse micmelesse requested review from BKP and neoblizz as code owners June 17, 2026 23:44
@micmelesse micmelesse requested a review from aamarnat June 17, 2026 23:44
@micmelesse micmelesse changed the title Add tests for varying graph replay Add more tests Jun 17, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Expands collective-communication test coverage to validate correctness across different implementations and execution modes, including repeated graph replays with varying inputs.

Changes:

  • Refactors all-gather and all-to-all tests into a replay-based harness that checks per-rank/per-chunk correctness across many iterations.
  • Adds parameterization over implementation (torch/triton/gluon), execution mode (eager w/ barrier, eager w/o barrier, graph), and whether inputs vary between replays.
  • Introduces CUDA graph capture + replay paths for the Iris-based collectives (and torch in all-gather).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File Description
tests/ccl/test_all_to_all_gluon.py Adds replay/graph coverage for all-to-all across triton vs gluon and varying inputs, with per-source-chunk validation.
tests/ccl/test_all_gather_gluon.py Refactors all-gather into a shared replay harness, expands parameter matrix (incl. torch), and adds graph replay + varying-input coverage.

Comment thread tests/ccl/test_all_to_all_gluon.py
Comment thread tests/ccl/test_all_to_all_gluon.py
Comment thread tests/ccl/test_all_gather_gluon.py Outdated
Comment thread tests/ccl/test_all_gather_gluon.py
Comment thread tests/ccl/test_all_gather_gluon.py
Comment thread tests/ccl/test_all_gather_gluon.py
Comment thread tests/ccl/test_all_gather_gluon.py Outdated
Comment thread tests/ccl/test_all_to_all_gluon.py
@micmelesse micmelesse force-pushed the micmelesse/ccl_capture_main branch from 9b4c800 to a0360ec Compare June 18, 2026 00:20

@mawad-amd mawad-amd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had some issues with CI the past couple of days. Should be fixed now. Re-triggering the tests.

@micmelesse

Copy link
Copy Markdown
Contributor Author

@mawad-amd The test failures are the cases that I added. @aamarnat wanted me to see if I could recreate the correctness issues that I was seeing on top of main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants