Improved sliding windows#1116
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1116 +/- ##
==========================================
+ Coverage 76.55% 76.63% +0.08%
==========================================
Files 63 63
Lines 9067 9122 +55
Branches 1521 1534 +13
==========================================
+ Hits 6941 6991 +50
- Misses 1541 1543 +2
- Partials 585 588 +3
🚀 New features to boost your workflow:
|
|
Hey @schrf, thank you for your contribution! Quite a lot of the changes are inconsequential (removing comments, renaming things, etc) which I'd guess are introduced by some LLM. While that's totally acceptable, I'd be great if you could clean up this PR to only touch things that are actually relevant to the new functionality you're implementing. Thank you! |
|
yes also if you could make the notebook changes in a separate PR for the notebook repo I would appreciate it |
|
Hey, thanks for the feedback. I'll simplify the changes as soon as I have time. @selmanozleyen I made a pull request for the notebook, I just didn't mention it here. It is #156 |
|
@timtreis I cleaned up the code. Let me know what you think |
| If True, drop windows that are smaller than the window size at the borders. | ||
| partial_windows: Literal["adaptive", "drop", "split"] | None | ||
| If None, possibly small windows at the edges are kept. | ||
| If `adaptive`, all windows might be shrunken a bit to avoid small windows at the edges. |
There was a problem hiding this comment.
About adaptive mode. This mode still yields incomplete windows right? See this example:
Example (window_size=40, extent 0–100): number_x_windows = ceil(100/40) = 3, x_window_size = ceil(100/3) = 34 → starts [0, 34, 68], ends [34, 68, 100]. The last window is 32 wide, the others 34.
If all window sizes aren't guaranteed to be the same this should be noted here. We also need better explanation how the new window size is going to be calculated. Also we need to point that the main goal here is to make all window sizes closer to each other. This can be useful in your workflow but I'd like to see how this can be useful for researchers in general. Could you explain the use case with a concrete example to me?
| indices = ((x_start, x_middle, y_start, y_end), (x_middle, x_end, y_start, y_end)) | ||
| else: | ||
| # horizontal split | ||
| y_middle = sub_coord_y_sorted.loc[middle_pos, y_col] |
There was a problem hiding this comment.
Please be consistent with these and choose one of them
x_middle = sub_coord_x_sorted[x_col].iloc[middle_pos]
y_middle = sub_coord_y_sorted.loc[middle_pos, y_col]
| @@ -45,8 +47,14 @@ def sliding_window( | |||
| overlap: int | |||
There was a problem hiding this comment.
overlap and window_size should be optional if they are not required by partial_windows=split . It doesn't make sense to expect window_size from the user if aren't going to even use it right?
so it should be
window_size: int | None = None,
also the same for overlap.
| """ | ||
| x_col, y_col = coord_columns | ||
|
|
||
| mask = ( |
There was a problem hiding this comment.
The inclusive boundaries here can cause an infinite recursion. I made claude create a counterexample
The counterexample
Trigger: any window where more than max_nr_cells cells share the same coordinate on the split axis — i.e. a dense cluster of tied/duplicate coordinates. With quantized or integer-binned spatial data this is realistic, not just pathological.
Minimal case: 10 cells all at (0,0), max_nr_cells=5 → RecursionError.
Realistic case: a normal spread of points plus 8 cells piled at integer (7,7), max_nr_cells=5 → RecursionError.
Why it never terminates
Walk the recursion on a window that contains the tied pile (say all cells share x = m, width > height so it's a vertical split):
middle_pos = len // 2, x_middle = ...iloc[middle_pos] → x_middle = m (line 423).
Children are (x_start, m, …) and (m, x_end, …) (line 425).
The mask is inclusive on both ends (>= and <=, _get_window_mask:243-248), so every cell at x = m falls into the first child (x_start, m).
That child has the same cell count as the parent → step 1 again, with no progress. → infinite recursion until the stack limit.
The base case at line 407 (n_cells <= max_cells) can never be reached because the split fails to reduce the count. The same happens on the horizontal branch (line 428).
Root cause
The median value is used as a shared, inclusive boundary, so cells equal to the median can't be partitioned. A robust fix splits by rank/index rather than value (and makes the boundary half-open), or bails out when a window can't be reduced. Rough options:
Half-open boundaries: make sub-windows [start, mid) and [mid, end) so each cell lands in exactly one side. (Needs care so the overall coverage still includes max_x/max_y.)
Terminate on no-progress: if a child ends up with the same n_cells as its parent (all cells tied on both axes), stop and emit it as a single window even though it exceeds max_cells — and ideally log() a warning that the cap couldn't be honored.
Guard up front: detect that the split made no progress (child count == parent count) and break.
|
One thing I noticed in the notebooks is the Nans here: Is this expected? Previous notebook didn't have these nans. Also @FrancescaDr , do you ever use this function at large scale? I feel like this dense representation might not be very scalable. Also another point I wan't to make is for |
| partial_windows: Literal["adaptive", "drop", "split"] | None | ||
| If None, possibly small windows at the edges are kept. | ||
| If `adaptive`, all windows might be shrunken a bit to avoid small windows at the edges. | ||
| If `drop`, possibly small windows at the edges are removed. |
There was a problem hiding this comment.
drop argument shouldn't be so easily removed. We should first deprecate it and warn users if it's being used. Then in another release we can remove it. Because this is a breaking change.
selmanozleyen
left a comment
There was a problem hiding this comment.
@schrf could you apply the changes I mentioned? Also please don't include the notebook changes yet in this PR. Also the "drop, adaptive, split" names aren't very clear.
@FrancescaDr @timtreis @mossishahi , this function doesn't seem scalable but I am curious of your experience. If you had problems, there is a chance that you will have worse problems when trying to split by the number of cells evenly.

Description
replaced the
drop_partial_windowsargument withpartial_windows, adding two new options:max_nr_cells.The old arguments are
Noneanddropnow.How has this been tested?
Tested using imc pancreas and visium datasets, running sliding_windows with different argument combinations and visually inspecting the windows by plotting them.
Closes
closes #908