fix: Improve consistency of per-column stats on FilterExec output#22718
Conversation
asolimando
left a comment
There was a problem hiding this comment.
LGTM, thanks @neilconway for the PR, I left a few comments but nothing major/blocking
| /// input, rows where that column is NULL cannot survive. | ||
| /// | ||
| /// This analysis is conservative; for example, OR clauses are not considered | ||
| /// null-rejecting; neither are indirect operands like `a + 1 < 10`. |
There was a problem hiding this comment.
Nit: this is incomplete and it's fine, but we might add support at least for IS NOT NULL as it should be both easy and what people would intuitively expect. wdyt?
There was a problem hiding this comment.
Sure, that's pretty easy to add.
It's a shame that there is extract_null_rejecting_columns in the logical planner, but we can't easily reuse it...
| /// filtered row estimate, since a column cannot have more nulls or distinct | ||
| /// values than it has rows. Known counts are demoted to inexact because the | ||
| /// filtered row count is itself an estimate. | ||
| fn cap_at_rows( |
There was a problem hiding this comment.
Not a must-have, but you might be interested in checking what is implemented in Apache Calcite for the same case: https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/metadata/RelMdUtil.java#L317
This is a more refined estimator than the proposed one, but we can postpone to a follow-up PR (this was on my radar already, I will get to it at some point).
There was a problem hiding this comment.
Cool -- yeah, I agree we could do better here. Makes sense to defer to a followup PR.
|
@asolimando Thanks for the review! |
Which issue does this PR close?
Rationale for this change
#21081 capped the NDV at the row count when computing statistics for several operators. This PR extends that work and ensures that per-column statistics for filter operators are consistent with the estimated output row count. In particular:
We also extend the analysis to consider null-rejecting predicates; for example, the clause
a = 10as a top-level conjunct implies that the null-count of the surviving rows is exactly 0.What changes are included in this PR?
Are these changes tested?
Yes; new tests added.
Are there any user-facing changes?
No.