Redesign GPU-STUMP correlation computation using sliding covariance#1145
Redesign GPU-STUMP correlation computation using sliding covariance#1145Tejaswa-Shrivastava wants to merge 6 commits into
Conversation
Replace QT-based correlation computation in GPU-STUMP with direct sliding covariance based Pearson correlation. This redesign removes QT buffer dependencies while preserving output behavior and API compatibility. Validated against the baseline implementation with matching matrix profiles and indices. Full STUMPY test suite passed successfully.
|
Review these changes at https://app.gitnotebooks.com/stumpy-dev/stumpy/pull/1145 |
Updated the docstring for the _compute_and_update_PI_kernel and gpu_stump functions to provide detailed parameter descriptions and usage examples.
@Tejaswa-Shrivastava Can you be more specific when you say "validated"? Did you run this on an NVIDIA GPU? How much faster is it? Did you also execute the GPU tests? |
Thanks for asking for clarification. No, I did not run this on physical NVIDIA GPU hardware since development was done on macOS, where CUDA is unavailable. To validate the implementation, I used Numba's CUDA simulator ( pytest tests/test_gpu_stump.pyAll 39 GPU tests passed successfully. In addition, the complete STUMPY test suite passed (1580 passed). I also compared the redesigned implementation against the current implementation and verified that: Matrix profile values match within floating-point precision (~1e-15). Matrix profile indices match exactly across all evaluated datasets. Self-joins, AB-joins, top-k computations, and multi-GPU partitioning logic produced identical outputs. Regarding performance, I have not benchmarked the implementation on real NVIDIA hardware, so I cannot make definitive claims about GPU runtime improvements yet. The benchmarks included in the PR were obtained under the CUDA simulator and therefore should not be interpreted as representative of real GPU performance. My primary motivation for this redesign was to investigate a direct sliding covariance based correlation computation while preserving correctness and potentially improving numerical stability by avoiding dependence on the QT recurrence. |
@Tejaswa-Shrivastava for this issue, it is insufficient to use the CUDA simulator. Can you try using Google Colab, which has NVIDIA GPUs, to run our unit test suite on and compare the performance on a variety of time series lengths and report back? Please see/copy this example notebook and ask follow up questions. |
| if p_norm < profile[i, -1]: | ||
| idx = core._gpu_searchsorted_right(profile[i], p_norm, bfs, nlevel) | ||
| for g in range(k - 1, idx, -1): | ||
| idx_pos = core._gpu_searchsorted_right(profile[i], p_norm, bfs, nlevel) |
There was a problem hiding this comment.
why did you change this from idx to idx_pos? what is the benefit?
| profile_L_fname : str | ||
| The file name for the left matrix profile | ||
|
|
||
| Notes |
There was a problem hiding this comment.
Why did you remove the note?
| T_subseq_isconstant = np.load(T_subseq_isconstant_fname, allow_pickle=False) | ||
|
|
||
| nlevel = np.floor(np.log2(k) + 1).astype(np.int64) | ||
| # number of levels in binary search tree from which `bfs` is constructed. |
There was a problem hiding this comment.
why did you remove this comment?
| ) # See Definition 3 and Figure 3 | ||
|
|
||
| # Precalculate for sliding covariance | ||
| M_T_clean, _ = core.compute_mean_std(T_B, m) |
There was a problem hiding this comment.
Why are we appending _clean to the end? How is this consistent with the stump function? This code feels arbitrary and A.I. generated/aided.
Updated variable names and comments to reflect inverse standard deviation usage.
|
@Tejaswa-Shrivastava Instead of submitting new code commits, can you please address my comments and questions first? Otherwise, this defeats the purpose of the review and we will be unable to accept your PR. |
|
@seanlaw Thanks for the suggestion! I reran the implementation on Google Colab using an NVIDIA GPU (instead of the CUDA simulator) and executed the GPU test suite successfully. I also benchmarked the implementation against the current
The redesigned implementation continued to produce matrix profiles that matched the baseline implementation within floating-point precision, and the GPU test suite completed successfully. From these benchmarks, the redesign does not show a consistent runtime improvement on real NVIDIA hardware. While it performs better for the 5,000-point workload, it is slower for the larger workloads that I evaluated. Based on these results, I agree that the current implementation does not provide sufficient performance benefits to justify replacing the existing QT-based approach. I appreciate the suggestion to validate on real GPU hardware—it was very helpful in evaluating the redesign beyond the CUDA simulator. |
Replace QT-based correlation computation in GPU-STUMP with direct sliding covariance based Pearson correlation.
This redesign removes QT buffer dependencies while preserving output behavior and API compatibility.
Validated against the baseline implementation with matching matrix profiles and indices. Full STUMPY test suite passed successfully.
This change fixes the issue #256
Changes
QT_even,QT_odd, andQT_firstfor correlation computation.Pull Request Checklist
Below is a simple checklist but please do not hesitate to ask for assistance!
black(i.e.,python -m pip install blackorconda install -c conda-forge black)flake8(i.e.,python -m pip install flake8orconda install -c conda-forge flake8)pytest-cov(i.e.,python -m pip install pytest-covorconda install -c conda-forge pytest-cov)black --exclude=".*\.ipynb" --extend-exclude=".venv" --diff ./in the root stumpy directoryflake8 --extend-exclude=.venv ./in the root stumpy directory./setup.sh dev && ./test.shin the root stumpy directory