Adrien Lavoillotte created SPARK-22183:
------------------------------------------
Summary: Inconsistency in LIKE escaping between literal values and column-based ones
Key: SPARK-22183
URL: https://issues.apache.org/jira/browse/SPARK-22183
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: Adrien Lavoillotte
Priority: Minor
I'm trying to implement auto-escaping for {{LIKE}} expressions, in order to have filters &
join conditions like:
* Column A's value contains column B's
* Column A's value contains some literal string
So I need to escape the {{LIKE}}-significant characters {{%}} and {{_}}. Since Spark SQL does not
support {{LIKE expr ESCAPE char}}, I need to escape using \, and presumably also \ itself
(twice in the case of literals, since '\\' represents a single \).
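As an illustration, the escaping I'm after could be sketched as a plain Scala helper (hypothetical, my own name and code, not part of Spark; it mirrors the {{REGEXP_REPLACE}} calls used below):

```scala
// Hypothetical helper: prefix the LIKE metacharacters % and _, and the
// escape character \ itself, with a backslash. The quadruple backslashes
// reflect two layers of escaping (Scala string syntax + regex syntax).
def escapeLike(s: String): String =
  s.replaceAll("([%_\\\\])", "\\\\$1")
```

So {{a_b}} becomes {{a\_b}} and a value containing one \ gains a second one.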
But it seems that a literal pattern in a {{LIKE}} expression is not escaped quite the same way as
other string literals or as column-based {{LIKE}} patterns, apparently depending on whether the
left-hand side and/or the right-hand side is a literal or a column.
Note: I'm using triple quotes below to avoid Scala-level \ escaping. And in the body of this
description, I'm deliberately using zero-width spaces to keep Jira from transforming my \.
On Spark 2.2.0:
{code}
// both LHS & RHS literals
scala> spark.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
+---+---------+
| \|\ LIKE \\|
+---+---------+
| \| true|
+---+---------+
scala> spark.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
org.apache.spark.sql.AnalysisException: the pattern '\' is invalid, it is not allowed to end
with the escape character;
at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:53)
at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.compile(regexpExpressions.scala:50)
at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.pattern(regexpExpressions.scala:53)
at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.nullSafeEval(regexpExpressions.scala:56)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
...
scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
+---+-------------+
|a\b|a\b LIKE a\\b|
+---+-------------+
|a\b| true|
+---+-------------+
scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
org.apache.spark.sql.AnalysisException: the pattern 'a\b' is invalid, the escape character
is not allowed to precede 'b';
at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
...
// test data
spark.sql("""SELECT * FROM test""").show()
+----+----+
| _1| _2|
+----+----+
| ok| ok|
| Ok| ok|
| a_b| a_b|
| aab| a_b|
| c%d| c%d|
|caad| c%d|
|e\nf|e\nf|
| e
f|e\nf|
+----+----+
// both column-based
// not escaping \
scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])',
'\\\\$1')""").show()
ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.sql.AnalysisException: the pattern 'e\nf' is invalid, the escape character
is not allowed to precede 'n';
at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
at org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
...
// escaping \
scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])',
'\\\\$1')""").show()
+----+----+
| _1| _2|
+----+----+
| ok| ok|
| a_b| a_b|
| c%d| c%d|
|e\nf|e\nf|
+----+----+
// LHS column-based, RHS literal
scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
+----+----+
| _1| _2|
+----+----+
|e\nf|e\nf|
+----+----+
scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
+---+---+
| _1| _2|
+---+---+
+---+---+
{code}
So in Spark 2.2:
* Escaping \ in the RHS pattern is mandatory when the RHS is column-based; otherwise we get a
runtime error. So far so good.
* The same applies when the LHS is also a literal.
* But if the RHS is a literal and the LHS is a column, the string-literal escaping of \ seems to
interfere with the {{LIKE}} escaping, since re-escaping the \ no longer matches. I would expect
to need \\\\ between quotes to match a single \ (one level of escaping for the string literal,
one for the {{LIKE}} pattern).
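The double-escaping I would expect when building a literal pattern can be sketched in plain Scala (helper names are mine and purely illustrative):

```scala
// Layer 1: escape LIKE metacharacters (% _ \) with a backslash.
def escapeLike(s: String): String =
  s.replaceAll("([%_\\\\])", "\\\\$1")

// Layer 2: escape backslashes and single quotes for embedding
// the result in a SQL string literal.
def sqlStringLiteral(s: String): String =
  "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'"
```

Under this reading, a value containing one actual \ should end up as four backslashes between the quotes of the SQL literal, which is not what the LHS-column/RHS-literal case above accepts.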
On Spark 1.6 (and, it seems, on every version up to 2.2.0):
{code}
// both LHS & RHS literals
scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
+---+-----+
|_c0| _c1|
+---+-----+
| \|false|
+---+-----+
scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
+---+-----+
|_c0| _c1|
+---+-----+
| \|false|
+---+-----+
scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
+---+----+
|_c0| _c1|
+---+----+
|a\b|true|
+---+----+
scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
+---+----+
|_c0| _c1|
+---+----+
|a\b|true|
+---+----+
// Same test data as for 2.2
// both column-based
// not escaping \
scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])',
'\\\\$1')""").show()
+----+----+
| _1| _2|
+----+----+
| ok| ok|
| a_b| a_b|
| c%d| c%d|
|e\nf|e\nf|
+----+----+
// escaping \
scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])',
'\\\\$1')""").show()
+----+----+
| _1| _2|
+----+----+
| ok| ok|
| a_b| a_b|
| c%d| c%d|
|e\nf|e\nf|
+----+----+
// LHS column-based, RHS literal
scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
+----+----+
| _1| _2|
+----+----+
|e\nf|e\nf|
+----+----+
scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
+---+---+
| _1| _2|
+---+---+
+---+---+
{code}
So in Spark 1.6:
* A literal string consisting of a single \ cannot be matched at all.
* Matching a string (literal or column-based) containing \ works whether or not the \ is escaped,
which makes me think there is room for ambiguity (would \\n match \n?)
* If the RHS is a literal and the LHS is a column, the same issue as in Spark 2.2 applies.
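To make the \\n question concrete, here is a hedged sketch of a LIKE-to-regex translation in the spirit of Spark's {{StringUtils.escapeLikeRegex}} (the exact behavior is my assumption, used only to illustrate the ambiguity):

```scala
import java.util.regex.Pattern

// Illustrative LIKE-to-regex translation (assumed semantics, not
// Spark's actual code): % -> .*, _ -> ., \x -> literal x,
// everything else matched literally.
def likeToRegex(pattern: String): String = {
  val sb = new StringBuilder("(?s)") // DOTALL, so _ can match a newline
  var i = 0
  while (i < pattern.length) {
    pattern.charAt(i) match {
      case '\\' if i + 1 < pattern.length =>
        // escaped character: match it literally, skip the escape
        sb.append(Pattern.quote(pattern.charAt(i + 1).toString)); i += 2
      case '%' => sb.append(".*"); i += 1
      case '_' => sb.append("."); i += 1
      case c   => sb.append(Pattern.quote(c.toString)); i += 1
    }
  }
  sb.toString
}
```

Under this reading, the pattern e\nf (backslash then n) matches the three-character string enf, not e followed by a real newline and f, which is exactly where the ambiguity would bite.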
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)