Skip to content

Redesign GPU-STUMP correlation computation using sliding covariance#1145

Closed
Tejaswa-Shrivastava wants to merge 6 commits into
stumpy-dev:mainfrom
Tejaswa-Shrivastava:feature/pearson-correlation
Closed

Redesign GPU-STUMP correlation computation using sliding covariance#1145
Tejaswa-Shrivastava wants to merge 6 commits into
stumpy-dev:mainfrom
Tejaswa-Shrivastava:feature/pearson-correlation

Conversation

@Tejaswa-Shrivastava

@Tejaswa-Shrivastava Tejaswa-Shrivastava commented Jul 2, 2026

Copy link
Copy Markdown

Replace QT-based correlation computation in GPU-STUMP with direct sliding covariance based Pearson correlation.

This redesign removes QT buffer dependencies while preserving output behavior and API compatibility.

Validated against the baseline implementation with matching matrix profiles and indices. Full STUMPY test suite passed successfully.

This change fixes the issue #256

Changes

  • Removed dependence on QT_even, QT_odd, and QT_first for correlation computation.
  • Replaced QT-based correlation computation with direct sliding covariance based Pearson correlation.
  • Updated GPU memory allocation and temporary buffer management to support covariance buffers.
  • Preserved the existing matrix profile distance computation and public API.

Pull Request Checklist

Below is a simple checklist but please do not hesitate to ask for assistance!

  • Fork, clone, and checkout the newest version of the code
  • Create a new branch
  • Make necessary code changes
  • Install black (i.e., python -m pip install black or conda install -c conda-forge black)
  • Install flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
  • Install pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
  • Run black --exclude=".*\.ipynb" --extend-exclude=".venv" --diff ./ in the root stumpy directory
  • Run flake8 --extend-exclude=.venv ./ in the root stumpy directory
  • Run ./setup.sh dev && ./test.sh in the root stumpy directory
  • Reference a Github issue (and create one if one doesn't already exist)

Replace QT-based correlation computation in GPU-STUMP with direct sliding covariance based Pearson correlation.

This redesign removes QT buffer dependencies while preserving output behavior and API compatibility.

Validated against the baseline implementation with matching matrix profiles and indices. Full STUMPY test suite passed successfully.
@gitnotebooks

gitnotebooks Bot commented Jul 2, 2026

Copy link
Copy Markdown

Review these changes at https://app.gitnotebooks.com/stumpy-dev/stumpy/pull/1145

Updated the docstring for the _compute_and_update_PI_kernel and gpu_stump functions to provide detailed parameter descriptions and usage examples.
@seanlaw

seanlaw commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Validated against the baseline implementation with matching matrix profiles and indices. Full STUMPY test suite passed successfully.

@Tejaswa-Shrivastava Can you be more specific when you say "validated"? Did you run this on an NVIDIA GPU? How much faster is it?

Did you also execute the GPU tests?

@Tejaswa-Shrivastava

Tejaswa-Shrivastava commented Jul 2, 2026

Copy link
Copy Markdown
Author

Validated against the baseline implementation with matching matrix profiles and indices. Full STUMPY test suite passed successfully.

@Tejaswa-Shrivastava Can you be more specific when you say "validated"? Did you run this on an NVIDIA GPU? How much faster is it?

Did you also execute the GPU tests?

Thanks for asking for clarification.

No, I did not run this on physical NVIDIA GPU hardware since development was done on macOS, where CUDA is unavailable.

To validate the implementation, I used Numba's CUDA simulator (NUMBA_ENABLE_CUDASIM=1). Under the simulator, I executed the GPU-specific test suite:

pytest tests/test_gpu_stump.py

All 39 GPU tests passed successfully. In addition, the complete STUMPY test suite passed (1580 passed).

I also compared the redesigned implementation against the current implementation and verified that:

Matrix profile values match within floating-point precision (~1e-15).

Matrix profile indices match exactly across all evaluated datasets.

Self-joins, AB-joins, top-k computations, and multi-GPU partitioning logic produced identical outputs.

Regarding performance, I have not benchmarked the implementation on real NVIDIA hardware, so I cannot make definitive claims about GPU runtime improvements yet. The benchmarks included in the PR were obtained under the CUDA simulator and therefore should not be interpreted as representative of real GPU performance.

My primary motivation for this redesign was to investigate a direct sliding covariance based correlation computation while preserving correctness and potentially improving numerical stability by avoiding dependence on the QT recurrence.

@seanlaw

seanlaw commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

No, I did not run this on physical NVIDIA GPU hardware since development was done on macOS, where CUDA is unavailable.

To validate the implementation, I used Numba's CUDA simulator (NUMBA_ENABLE_CUDASIM=1). Under the simulator, I executed the GPU-specific test suite:

@Tejaswa-Shrivastava for this issue, it is insufficient to use the CUDA simulator. Can you try using Google Colab, which has NVIDIA GPUs, to run our unit test suite on and compare the performance on a variety of time series lengths and report back? Please see/copy this example notebook and ask follow up questions.

Comment thread stumpy/gpu_stump.py Outdated
if p_norm < profile[i, -1]:
idx = core._gpu_searchsorted_right(profile[i], p_norm, bfs, nlevel)
for g in range(k - 1, idx, -1):
idx_pos = core._gpu_searchsorted_right(profile[i], p_norm, bfs, nlevel)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you change this from idx to idx_pos? what is the benefit?

Comment thread stumpy/gpu_stump.py
profile_L_fname : str
The file name for the left matrix profile

Notes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove the note?

Comment thread stumpy/gpu_stump.py
T_subseq_isconstant = np.load(T_subseq_isconstant_fname, allow_pickle=False)

nlevel = np.floor(np.log2(k) + 1).astype(np.int64)
# number of levels in binary search tree from which `bfs` is constructed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you remove this comment?

Comment thread stumpy/gpu_stump.py Outdated
) # See Definition 3 and Figure 3

# Precalculate for sliding covariance
M_T_clean, _ = core.compute_mean_std(T_B, m)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we appending _clean to the end? How is this consistent with the stump function? This code feels arbitrary and A.I. generated/aided.

@seanlaw

seanlaw commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

@Tejaswa-Shrivastava Instead of submitting new code commits, can you please address my comments and questions first? Otherwise, this defeats the purpose of the review and we will be unable to accept your PR.

@Tejaswa-Shrivastava

Copy link
Copy Markdown
Author

No, I did not run this on physical NVIDIA GPU hardware since development was done on macOS, where CUDA is unavailable.
To validate the implementation, I used Numba's CUDA simulator (NUMBA_ENABLE_CUDASIM=1). Under the simulator, I executed the GPU-specific test suite:

@Tejaswa-Shrivastava for this issue, it is insufficient to use the CUDA simulator. Can you try using Google Colab, which has NVIDIA GPUs, to run our unit test suite on and compare the performance on a variety of time series lengths and report back? Please see/copy this example notebook and ask follow up questions.

@Tejaswa-Shrivastava

Copy link
Copy Markdown
Author

No, I did not run this on physical NVIDIA GPU hardware since development was done on macOS, where CUDA is unavailable.
To validate the implementation, I used Numba's CUDA simulator (NUMBA_ENABLE_CUDASIM=1). Under the simulator, I executed the GPU-specific test suite:

@Tejaswa-Shrivastava for this issue, it is insufficient to use the CUDA simulator. Can you try using Google Colab, which has NVIDIA GPUs, to run our unit test suite on and compare the performance on a variety of time series lengths and report back? Please see/copy this example notebook and ask follow up questions.

@seanlaw Thanks for the suggestion!

I reran the implementation on Google Colab using an NVIDIA GPU (instead of the CUDA simulator) and executed the GPU test suite successfully.

I also benchmarked the implementation against the current main branch using the same Colab environment and GPU across a range of time series lengths.

Time Series Length Current Implementation (s) Proposed Implementation (s)
1,000 0.154 0.184
5,000 1.092 0.844
10,000 1.511 2.099
25,000 4.131 4.726
50,000 8.299 9.144

The redesigned implementation continued to produce matrix profiles that matched the baseline implementation within floating-point precision, and the GPU test suite completed successfully.

From these benchmarks, the redesign does not show a consistent runtime improvement on real NVIDIA hardware. While it performs better for the 5,000-point workload, it is slower for the larger workloads that I evaluated.

Based on these results, I agree that the current implementation does not provide sufficient performance benefits to justify replacing the existing QT-based approach. I appreciate the suggestion to validate on real GPU hardware—it was very helpful in evaluating the redesign beyond the CUDA simulator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants