Refactor gpu_stump to Use Covariance-Based Pearson Correlation #1146
Tejaswa-Shrivastava wants to merge 1 commit into
Overview

This PR transitions the core distance computation within the gpu_stump implementation from the original sliding dot-product (QT) approach to a sliding covariance-based Pearson correlation approach.

The original sliding dot-product implementation was mathematically sound for most typical scenarios but was susceptible to catastrophic cancellation in extreme edge cases (where QT and m * μ_Q * M_T are both very large numbers but their difference is very small). By maintaining a sliding covariance directly within the kernel, we avoid this cancellation and guarantee significantly higher numerical stability for extreme-valued inputs.

Key Changes

Covariance Kernel Implementation (_compute_and_update_PI_kernel):
- Replaced the inner sliding QT loop with a direct sliding cov_out update.
- Introduced μ_Q_m_1 and M_T_m_1 as kernel arguments to correctly calculate the sliding update across moving windows.
- Computes Pearson correlation and subsequent Euclidean distances directly from the stabilized sliding covariance arrays.

NaN and Inf Data Correctness:
- Maintained rigorous correctness for NaN and Inf inputs by ensuring the algebraic mean of the zero-filled overlapping region (μ_Q_m_1) is read explicitly from global memory.
- core.preprocess explicitly flags NaN-containing windows with np.inf. By retaining the dedicated μ_Q_m_1 array (which operates cleanly on the zero-filled T_A_pre underneath), we prevent these np.inf flags from poisoning the sliding covariance diagonals, preserving STUMPY's expected NaN masking logic and unit test parity.

Multi-GPU / Process Pool Updates:
- Plumbed the new required precalculated sliding-mean arrays (μ_Q_m_1_fname, M_T_m_1_fname) and base covariance files (cov_fname, cov_first_fname) through the _gpu_stump multi-process driver loop and temporary file system.

Performance & Math Considerations:
- The shift from a dot-product update to a full covariance update inherently adds an ~8% computational and memory bandwidth overhead (due to the physically unavoidable μ_Q_m_1 extra global memory load per thread required to maintain NaN-stability).
- The inner loop mathematics use a 5-op formula (adj_cov_a_j * cov_b_i - adj_cov_c_j * cov_d_i) to compute the exact covariance differential while preserving all necessary algebraic boundaries.

Testing
- ✅ Passed black, isort, and flake8 compliance.
- ✅ Custom docstring.py fully conforms with the updated kernel signatures.
- ✅ Successfully passes all NUMBA_ENABLE_CUDASIM=1 tests (test_gpu_stump.py).
- ✅ Perfect parity with stumpy.stump (CPU) matrix profile results (including robust test_gpu_stump_nan_inf_A_B_join validation tests).
- ✅ Maintained 100% Code Coverage across the test suite.

Impact

This brings the numerical stability of gpu_stump perfectly in line with STUMPY's CPU implementations, eliminating catastrophic cancellation vulnerabilities at the cost of a minor (~8%) acceptable kernel overhead.
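For readers who want to see the cancellation concretely, here is a minimal NumPy sketch. It is not the CUDA kernel from this PR: the function names `pearson_from_qt` and `sliding_cov`, the constant offset of 1e8, and the window settings are all assumptions chosen for illustration. The incremental update uses the means of the (m-1)-point overlap between consecutive windows, which seems to be the role that μ_Q_m_1 and M_T_m_1 play in the description above. The last line converts the correlation to a z-normalized Euclidean distance with the standard identity D = sqrt(2m(1 - ρ)).

```python
import numpy as np


def pearson_from_qt(T, i, j, m):
    """Pearson correlation from the raw dot product QT (cancellation-prone)."""
    Q, W = T[i : i + m], T[j : j + m]
    QT = np.dot(Q, W)
    # With a large offset, QT and m * mean(Q) * mean(W) are both huge and nearly equal
    return (QT - m * Q.mean() * W.mean()) / (m * Q.std() * W.std())


def sliding_cov(T, k, m):
    """Covariance of T[i:i+m] and T[i+k:i+k+m] for each i, updated incrementally.

    Hypothetical helper: only the first covariance is computed directly.
    """
    n = T.shape[0] - m + 1 - k  # number of window pairs on diagonal k
    cov = np.empty(n)
    cov[0] = np.mean((T[:m] - T[:m].mean()) * (T[k : k + m] - T[k : k + m].mean()))
    c = (m - 1) / (m * m)
    for i in range(1, n):
        j = i + k
        a = T[i : i + m - 1].mean()  # mean of the (m-1)-point overlap, first window
        b = T[j : j + m - 1].mean()  # mean of the (m-1)-point overlap, second window
        # Add the entering points, remove the leaving points, both centered on the overlap means
        cov[i] = cov[i - 1] + c * (
            (T[i + m - 1] - a) * (T[j + m - 1] - b) - (T[i - 1] - a) * (T[j - 1] - b)
        )
    return cov


rng = np.random.default_rng(0)
x = rng.standard_normal(500)  # well-scaled signal used as the reference
T = x + 1e8                   # same signal with an extreme constant offset (assumed)
m, k = 50, 100
n = T.shape[0] - m + 1 - k

# Reference correlations computed on the offset-free signal
ref = np.array([np.corrcoef(x[i : i + m], x[i + k : i + k + m])[0, 1] for i in range(n)])

sigma = np.array([T[i : i + m].std() for i in range(T.shape[0] - m + 1)])
rho_cov = sliding_cov(T, k, m) / (sigma[:n] * sigma[k : k + n])
rho_qt = np.array([pearson_from_qt(T, i, i + k, m) for i in range(n)])

err_cov = np.max(np.abs(rho_cov - ref))
err_qt = np.max(np.abs(rho_qt - ref))
print("max |error|, covariance-based:", err_cov)
print("max |error|, QT-based:", err_qt)

# z-normalized Euclidean distance from the Pearson correlation
D = np.sqrt(2 * m * (1 - np.clip(rho_cov, -1.0, 1.0)))
```

On inputs like this, the QT-based error should be many orders of magnitude larger than the covariance-based one, because the dot product near 5e17 can only be represented in steps of 64 while the true numerator is of order m. The sketch ignores NaN/Inf handling and GPU memory layout, both of which the PR treats separately.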
Review these changes at https://app.gitnotebooks.com/stumpy-dev/stumpy/pull/1146
@seanlaw I created a fresh PR with the latest changes and reran the validation on Google Colab using an NVIDIA GPU. I also compared the implementation against the current
The GPU test suite also passed, and the matrix profiles and profile indices matched the baseline implementation, so the behavior remains consistent. While the redesign didn't show a consistent performance improvement on real GPU hardware, running it on Colab helped validate the implementation beyond the CUDA simulator. Let me know if you'd like me to evaluate any other workloads or benchmarks.
Thanks @Tejaswa-Shrivastava. I think that this exploration was valuable in helping us close out issue #256 (as "the proposed enhancement does not meaningfully improve the performance"). The only benefit is that the Based on your exploration, I think this issue is resolved.
Pull Request Checklist
Below is a simple checklist but please do not hesitate to ask for assistance!
- Install black (i.e., python -m pip install black or conda install -c conda-forge black)
- Install flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
- Install pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
- Run black --exclude=".*\.ipynb" --extend-exclude=".venv" --diff ./ in the root stumpy directory
- Run flake8 --extend-exclude=.venv ./ in the root stumpy directory
- Run ./setup.sh dev && ./test.sh in the root stumpy directory