
fix: regex simplification of anchored patterns produces wrong results #22727

Open
lyne7-sc wants to merge 5 commits into apache:main from lyne7-sc:fix/predicates_regex


@lyne7-sc
Contributor

@lyne7-sc lyne7-sc commented Jun 3, 2026

Which issue does this PR close?

Rationale for this change

The regex simplification rule rewrites anchored regex matches (^literal$, ^(a|b)$) into cheaper = / IN / LIKE expressions. Two bugs in that path:

  1. The literal was always built as Utf8 via lit(...), so on a Utf8View / LargeUtf8 column the rewritten comparison failed at execution with Invalid comparison operation: Utf8View == Utf8.
  2. A ~* (case-insensitive) anchored literal was rewritten to a case-sensitive =, silently dropping rows that differ only in case.
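
The first failure mode can be pictured with a toy model. This is only a sketch: `StrType`, `TypedStr`, `strict_eq`, and the two literal builders are hypothetical stand-ins written for illustration, not DataFusion types or functions. It assumes a comparison that refuses to compare strings of different physical types, which is the behaviour the quoted error message suggests.

```rust
// Toy model only: every name here is a hypothetical stand-in, not DataFusion's API.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

#[derive(Debug, Clone, PartialEq)]
struct TypedStr {
    ty: StrType,
    value: String,
}

/// Strict equality: both sides must share a string type, otherwise an error
/// shaped like the one quoted in the PR description is returned.
fn strict_eq(left: &TypedStr, right: &TypedStr) -> Result<bool, String> {
    if left.ty != right.ty {
        return Err(format!(
            "Invalid comparison operation: {:?} == {:?}",
            left.ty, right.ty
        ));
    }
    Ok(left.value == right.value)
}

/// Behaviour described as buggy: the literal is always typed Utf8.
fn literal_always_utf8(value: &str) -> TypedStr {
    TypedStr { ty: StrType::Utf8, value: value.to_string() }
}

/// Behaviour described as the fix: the literal takes the column's type.
fn literal_like_column(column_ty: StrType, value: &str) -> TypedStr {
    TypedStr { ty: column_ty, value: value.to_string() }
}

fn main() {
    let cell = TypedStr { ty: StrType::Utf8View, value: "foo".to_string() };
    // Mismatched types: the comparison is rejected.
    println!("{:?}", strict_eq(&cell, &literal_always_utf8("foo")));
    // Matching types: the comparison succeeds.
    println!("{:?}", strict_eq(&cell, &literal_like_column(cell.ty, "foo")));
}
```

In this model the error disappears as soon as the literal is built from the column's type rather than from a fixed default.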

What changes are included in this PR?

  • Build the extracted literal with string_scalar.to_expr(...) so its type follows the column type (Utf8 / LargeUtf8 / Utf8View), consistent with the existing LIKE branches.
  • Rewrite ~* anchored literals to ILIKE instead of =. The existing is_safe_for_like guard ensures the literal has no % / _, so this is an exact case-insensitive match. (Anchored alternations under ~* still fall back to regex evaluation.)
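
The rewrite decision for a single anchored literal, as described in the second bullet, might be sketched roughly as follows. All names (`Rewrite`, `anchored_literal`, `safe_for_like`, `simplify`) are hypothetical and exist only for this illustration; `safe_for_like` checks just the two wildcards named above and is not the real `is_safe_for_like`. The sketch assumes the case-sensitive path does not need the wildcard guard, and it deliberately leaves out the `IN` rewrite for case-sensitive alternations.

```rust
// Hypothetical model of the rewrite decision; names are illustrative only.
#[derive(Debug, PartialEq)]
enum Rewrite {
    Eq(String),    // column = 'literal'
    ILike(String), // column ILIKE 'literal'
    KeepRegex,     // leave the regex match untouched
}

/// Returns the body of `^body$` when it contains no regex syntax.
/// Conservative: only ASCII alphanumerics, spaces, `_` and `%` are accepted.
fn anchored_literal(pattern: &str) -> Option<&str> {
    let body = pattern.strip_prefix('^')?.strip_suffix('$')?;
    let plain = |c: char| c.is_ascii_alphanumeric() || c == ' ' || c == '_' || c == '%';
    if !body.is_empty() && body.chars().all(plain) {
        Some(body)
    } else {
        None
    }
}

/// Stand-in for the guard: a literal with LIKE wildcards cannot be used as an
/// exact-match ILIKE pattern.
fn safe_for_like(literal: &str) -> bool {
    !literal.contains('%') && !literal.contains('_')
}

fn simplify(pattern: &str, case_insensitive: bool) -> Rewrite {
    match anchored_literal(pattern) {
        // `~` with an anchored literal: exact, case-sensitive equality.
        Some(lit) if !case_insensitive => Rewrite::Eq(lit.to_string()),
        // `~*` with an anchored literal: ILIKE, but only without wildcards.
        Some(lit) if safe_for_like(lit) => Rewrite::ILike(lit.to_string()),
        // Everything else falls back to regex evaluation.
        _ => Rewrite::KeepRegex,
    }
}

fn main() {
    println!("{:?}", simplify("^bazzz$", false));
    println!("{:?}", simplify("^bazzz$", true));
    println!("{:?}", simplify("^(barrr|bazzz)$", true));
}
```

The guard matters because `_` and `%` are ordinary characters in a regex literal but wildcards in ILIKE, so a literal containing either one stays on the regex path in this sketch.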

Are these changes tested?

Yes. predicates.slt now covers anchored ~ / ~*, single literals and alternations, over both Utf8 and Utf8View columns. Existing regex.rs unit tests still pass.

Are there any user-facing changes?

Yes, bug fixes only

@github-actions github-actions Bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels Jun 3, 2026
Barrr

query T
SELECT * FROM test WHERE column1 ~* '^(barrr|bazzz)$'
Member

Tests with negation+regex are missing (!~ and !~*).

Bazzz

statement ok
CREATE TABLE test_regex_utf8view(s VARCHAR) AS VALUES ('foo'), ('Bazzz');
Member

A question to educate myself: how are the values here Utf8View?
I'd expect some casting to be needed to achieve that.

query T
SELECT * FROM test_regex_utf8view WHERE s ~* '^bazzz$'
----
Bazzz
Member

How does this assert the expected result?
Neither the optimization nor the type is asserted.
Maybe use EXPLAIN ... and assert its output instead?



Development

Successfully merging this pull request may close these issues.

Regex simplification of anchored patterns produces wrong results
