Redesign GPU-STUMP correlation computation using sliding covariance by Tejaswa-Shrivastava · Pull Request #1145 · stumpy-dev/stumpy

Tejaswa-Shrivastava · 2026-07-02T19:35:10Z

Replace the QT-based correlation computation in GPU-STUMP with a direct, sliding-covariance-based Pearson correlation.

This redesign removes QT buffer dependencies while preserving output behavior and API compatibility.

Validated against the baseline implementation with matching matrix profiles and indices. Full STUMPY test suite passed successfully.

This change fixes issue #256.

Changes

Removed dependence on QT_even, QT_odd, and QT_first for correlation computation.
Replaced the QT-based correlation computation with a direct, sliding-covariance-based Pearson correlation.
Updated GPU memory allocation and temporary buffer management to support covariance buffers.
Preserved the existing matrix profile distance computation and public API.
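
As an illustration of the idea only (this is not the PR's CUDA kernel), the sketch below walks one diagonal of the correlation matrix on the CPU with NumPy. It updates the covariance from the previous pair of subsequences instead of carrying a QT dot product. The function names `pearson_diagonal_sliding` and `pearson_diagonal_naive` are hypothetical, and the Welford-style update is an assumption about what "sliding covariance" means in this PR.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def pearson_diagonal_naive(T, m, k):
    """Reference: Pearson correlation of T[i:i+m] vs T[i+k:i+k+m], one pair at a time."""
    n_sub = len(T) - m + 1
    return np.array(
        [np.corrcoef(T[i : i + m], T[i + k : i + k + m])[0, 1] for i in range(n_sub - k)]
    )


def pearson_diagonal_sliding(T, m, k):
    """Same diagonal, but the covariance is updated in O(1) per step (requires m >= 2).

    Assumes no subsequence is constant (a zero standard deviation would divide by zero).
    """
    n_sub = len(T) - m + 1
    windows = sliding_window_view(T, m)
    mu = windows.mean(axis=1)      # mean of each length-m subsequence
    sigma = windows.std(axis=1)    # population std of each length-m subsequence
    # Mean of the m-1 points shared by two consecutive windows: T[i : i+m-1]
    mu_shared = sliding_window_view(T, m - 1).mean(axis=1)

    rho = np.empty(n_sub - k)
    # Covariance of the first pair on the diagonal, computed directly
    cov = np.dot(T[0:m] - mu[0], T[k : k + m] - mu[k]) / m
    rho[0] = cov / (sigma[0] * sigma[k])

    c = (m - 1) / (m * m)
    for i in range(1, n_sub - k):
        j = i + k
        # Add the contribution of the entering points, remove that of the leaving points
        entering = (T[i + m - 1] - mu_shared[i]) * (T[j + m - 1] - mu_shared[j])
        leaving = (T[i - 1] - mu_shared[i]) * (T[j - 1] - mu_shared[j])
        cov += c * (entering - leaving)
        rho[i] = cov / (sigma[i] * sigma[j])
    return rho
```

On random data the two functions should agree to within floating-point tolerance, which is the kind of check the validation notes in this thread rely on.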

Pull Request Checklist

Below is a simple checklist but please do not hesitate to ask for assistance!

gitnotebooks · 2026-07-02T19:35:15Z

Review these changes at https://app.gitnotebooks.com/stumpy-dev/stumpy/pull/1145

seanlaw · 2026-07-02T20:37:11Z

> Validated against the baseline implementation with matching matrix profiles and indices. Full STUMPY test suite passed successfully.

@Tejaswa-Shrivastava Can you be more specific when you say "validated"? Did you run this on an NVIDIA GPU? How much faster is it?

Did you also execute the GPU tests?

Tejaswa-Shrivastava · 2026-07-02T21:19:01Z

> > Validated against the baseline implementation with matching matrix profiles and indices. Full STUMPY test suite passed successfully.
>
> @Tejaswa-Shrivastava Can you be more specific when you say "validated"? Did you run this on an NVIDIA GPU? How much faster is it?
>
> Did you also execute the GPU tests?

Thanks for asking for clarification.

No, I did not run this on physical NVIDIA GPU hardware since development was done on macOS, where CUDA is unavailable.

To validate the implementation, I used Numba's CUDA simulator (NUMBA_ENABLE_CUDASIM=1). Under the simulator, I executed the GPU-specific test suite:

pytest tests/test_gpu_stump.py

All 39 GPU tests passed successfully. In addition, the complete STUMPY test suite passed (1580 passed).

I also compared the redesigned implementation against the current implementation and verified that:

Matrix profile values match within floating-point precision (~1e-15).

Matrix profile indices match exactly across all evaluated datasets.

Self-joins, AB-joins, top-k computations, and multi-GPU partitioning logic produced identical outputs.
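
A comparison of this kind could be sketched as follows. This is a hypothetical helper, not code from the PR: the name `profiles_match` and the default tolerance are assumptions, and the inputs are plain arrays of distances and indices.

```python
import numpy as np


def profiles_match(P_ref, I_ref, P_new, I_new, atol=1e-12):
    """Return True if distances agree within `atol` and indices agree exactly."""
    P_ref = np.asarray(P_ref, dtype=np.float64)
    P_new = np.asarray(P_new, dtype=np.float64)
    # Absolute tolerance only; matching infinities and NaNs count as equal
    distances_ok = np.allclose(P_ref, P_new, rtol=0.0, atol=atol, equal_nan=True)
    # Indices are integers, so they must be identical element by element
    indices_ok = np.array_equal(np.asarray(I_ref), np.asarray(I_new))
    return bool(distances_ok and indices_ok)
```

With STUMPY's usual output array `mp`, the distances and indices would be passed as `mp[:, 0]` and `mp[:, 1]`.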

Regarding performance, I have not benchmarked the implementation on real NVIDIA hardware, so I cannot make definitive claims about GPU runtime improvements yet. The benchmarks included in the PR were obtained under the CUDA simulator and therefore should not be interpreted as representative of real GPU performance.

My primary motivation for this redesign was to investigate a direct sliding covariance based correlation computation while preserving correctness and potentially improving numerical stability by avoiding dependence on the QT recurrence.

seanlaw · 2026-07-03T08:37:33Z

> No, I did not run this on physical NVIDIA GPU hardware since development was done on macOS, where CUDA is unavailable.
>
> To validate the implementation, I used Numba's CUDA simulator (NUMBA_ENABLE_CUDASIM=1). Under the simulator, I executed the GPU-specific test suite:

@Tejaswa-Shrivastava for this issue, it is insufficient to use the CUDA simulator. Can you try using Google Colab, which has NVIDIA GPUs, to run our unit test suite on and compare the performance on a variety of time series lengths and report back? Please see/copy this example notebook and ask follow up questions.
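
One way to collect such timings is sketched below. The helper name `time_over_lengths`, the window size, and the repeat count are assumptions for illustration; it is not the example notebook referenced above. It accepts any callable with the signature `fn(T, m)`, so on a CUDA machine `stumpy.gpu_stump` could be passed in.

```python
import time

import numpy as np


def time_over_lengths(fn, lengths, m=50, repeats=3, seed=0):
    """Return {length: best wall-clock seconds} for fn(T, m) on random series."""
    rng = np.random.default_rng(seed)
    results = {}
    for n in lengths:
        T = rng.standard_normal(n)
        fn(T, m)  # warm-up call so one-time costs (e.g. JIT compilation) are not timed
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(T, m)
            best = min(best, time.perf_counter() - start)
        results[n] = best
    return results
```

Running the same call once on the main branch and once on the PR branch, in the same Colab session and on the same GPU, gives directly comparable numbers.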

seanlaw · 2026-07-03T09:13:11Z

        if p_norm < profile[i, -1]:
-            idx = core._gpu_searchsorted_right(profile[i], p_norm, bfs, nlevel)
-            for g in range(k - 1, idx, -1):
+            idx_pos = core._gpu_searchsorted_right(profile[i], p_norm, bfs, nlevel)


Why did you change this from idx to idx_pos? What is the benefit?
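
For context, the hunk above is part of a sorted top-k insertion. A CPU-only sketch of that pattern, using `np.searchsorted` in place of `core._gpu_searchsorted_right` and a hypothetical function name `insert_topk`, might look like this:

```python
import numpy as np


def insert_topk(row, idx_row, value, value_idx):
    """Insert (value, value_idx) into an ascending top-k row, dropping the largest entry."""
    k = row.shape[0]
    if value < row[-1]:
        # Position after any entries equal to `value` (right-sided search)
        pos = np.searchsorted(row, value, side="right")
        # Shift the tail one slot to the right, starting from the end
        for g in range(k - 1, pos, -1):
            row[g] = row[g - 1]
            idx_row[g] = idx_row[g - 1]
        row[pos] = value
        idx_row[pos] = value_idx
```

Whether the search result is called `idx` or `idx_pos` does not change this logic, as long as the same name is used in the loop bound and the final assignment.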

seanlaw · 2026-07-03T09:14:33Z

+    profile_L_fname : str
+        The file name for the left matrix profile

-    Notes


Why did you remove the note?

seanlaw · 2026-07-03T09:15:21Z

    T_subseq_isconstant = np.load(T_subseq_isconstant_fname, allow_pickle=False)

    nlevel = np.floor(np.log2(k) + 1).astype(np.int64)
-    # number of levels in binary search tree from which `bfs` is constructed.


Why did you remove this comment?
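
To illustrate what the removed comment documented, here is a small sketch of the `nlevel` formula on its own. The helper name `n_levels` is hypothetical.

```python
import numpy as np


def n_levels(k):
    """Number of levels in a binary search tree layout holding k entries."""
    return int(np.floor(np.log2(k) + 1))
```

For small `k` this is the same as the bit length of `k`: one level for `k = 1`, two levels for `k = 2` or `3`, three levels for `k = 4` through `7`, and so on.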

seanlaw · 2026-07-03T09:17:49Z

    )  # See Definition 3 and Figure 3

+    # Precalculate for sliding covariance
+    M_T_clean, _ = core.compute_mean_std(T_B, m)


Why are we appending _clean to the end? How is this consistent with the stump function? This code feels arbitrary and A.I. generated/aided.

seanlaw · 2026-07-04T07:56:37Z

@Tejaswa-Shrivastava Instead of submitting new code commits, can you please address my comments and questions first? Otherwise, this defeats the purpose of the review and we will be unable to accept your PR.

Tejaswa-Shrivastava · 2026-07-04T10:14:41Z

> > No, I did not run this on physical NVIDIA GPU hardware since development was done on macOS, where CUDA is unavailable.
> > To validate the implementation, I used Numba's CUDA simulator (NUMBA_ENABLE_CUDASIM=1). Under the simulator, I executed the GPU-specific test suite:
>
> @Tejaswa-Shrivastava for this issue, it is insufficient to use the CUDA simulator. Can you try using Google Colab, which has NVIDIA GPUs, to run our unit test suite on and compare the performance on a variety of time series lengths and report back? Please see/copy this example notebook and ask follow up questions.

Tejaswa-Shrivastava · 2026-07-04T10:27:04Z

> > No, I did not run this on physical NVIDIA GPU hardware since development was done on macOS, where CUDA is unavailable.
> > To validate the implementation, I used Numba's CUDA simulator (NUMBA_ENABLE_CUDASIM=1). Under the simulator, I executed the GPU-specific test suite:
>
> @Tejaswa-Shrivastava for this issue, it is insufficient to use the CUDA simulator. Can you try using Google Colab, which has NVIDIA GPUs, to run our unit test suite on and compare the performance on a variety of time series lengths and report back? Please see/copy this example notebook and ask follow up questions.

@seanlaw Thanks for the suggestion!

I reran the implementation on Google Colab using an NVIDIA GPU (instead of the CUDA simulator) and executed the GPU test suite successfully.

I also benchmarked the implementation against the current main branch using the same Colab environment and GPU across a range of time series lengths.

| Time Series Length | Current Implementation (s) | Proposed Implementation (s) |
|---|---|---|
| 1,000 | 0.154 | 0.184 |
| 5,000 | 1.092 | 0.844 |
| 10,000 | 1.511 | 2.099 |
| 25,000 | 4.131 | 4.726 |
| 50,000 | 8.299 | 9.144 |

The redesigned implementation continued to produce matrix profiles that matched the baseline implementation within floating-point precision, and the GPU test suite completed successfully.

From these benchmarks, the redesign does not show a consistent runtime improvement on real NVIDIA hardware. It is faster only for the 5,000-point workload and slower at every other length that I evaluated, including the smallest (1,000 points) and all of the larger workloads.
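
The relative speed at each length can be computed directly from the reported timings. The snippet below is only a small arithmetic sketch: the numbers are copied from the benchmark above, and the helper name `speedups` is hypothetical.

```python
# (current seconds, proposed seconds) per time series length, as reported above
timings = {
    1_000: (0.154, 0.184),
    5_000: (1.092, 0.844),
    10_000: (1.511, 2.099),
    25_000: (4.131, 4.726),
    50_000: (8.299, 9.144),
}


def speedups(timings):
    """Ratio current / proposed; values above 1.0 mean the proposed code is faster."""
    return {n: current / proposed for n, (current, proposed) in timings.items()}


faster_lengths = [n for n, ratio in speedups(timings).items() if ratio > 1.0]
print(faster_lengths)  # -> [5000]
```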

Based on these results, I agree that the redesigned implementation does not provide sufficient performance benefit to justify replacing the existing QT-based approach. I appreciate the suggestion to validate on real GPU hardware; it was very helpful in evaluating the redesign beyond the CUDA simulator.

Tejaswa-Shrivastava requested a review from seanlaw as a code owner July 2, 2026 19:35

Enhance documentation for GPU STOMP functions

8a2b6bd

Updated the docstring for the _compute_and_update_PI_kernel and gpu_stump functions to provide detailed parameter descriptions and usage examples.

seanlaw reviewed Jul 3, 2026

Tejaswa-Shrivastava added 4 commits July 3, 2026 19:31

Refactor variable names and enhance docstrings

2d1ecc8

Refactor to use inverse standard deviation for Q and T

ad95373

Updated variable names and comments to reflect inverse standard deviation usage.

Refactor covariance calculations in gpu_stump.py

a3fc022

Enhance covariance calculations

e18e6ff

Tejaswa-Shrivastava closed this Jul 4, 2026

Tejaswa-Shrivastava wants to merge 6 commits into stumpy-dev:main from Tejaswa-Shrivastava:feature/pearson-correlation
