From "Adrien Lavoillotte (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-22183) Inconsistency in LIKE escaping between literal values and column-based ones
Date Mon, 02 Oct 2017 13:28:00 GMT
Adrien Lavoillotte created SPARK-22183:
------------------------------------------

             Summary: Inconsistency in LIKE escaping between literal values and column-based ones
                 Key: SPARK-22183
                 URL: https://issues.apache.org/jira/browse/SPARK-22183
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Adrien Lavoillotte
            Priority: Minor


I'm trying to implement auto-escaping for {{LIKE}} expressions, in order to have filters &
join conditions like:
* Column A's value contains column B's
* Column A's value contains some literal string

So I need to escape the {{LIKE}}-significant characters {{%}} and {{_}}. Since Spark SQL does not
support {{LIKE expr ESCAPE char}}, I need to escape using \, and presumably also escape \ itself
(twice in the case of literals, since '\\' represents a single \).
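
For concreteness, this is the kind of helper I have in mind (a minimal sketch; {{escapeLikePattern}} is my own name, not a Spark API):
{code}
// Hypothetical client-side helper: prefix the LIKE metacharacters % and _,
// and the escape character \ itself, with a \ before building the pattern.
def escapeLikePattern(s: String): String =
  s.replaceAll("([%_\\\\])", "\\\\$1")

escapeLikePattern("""100%_a\b""")  // yields: 100\%\_a\\b
{code}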

But it seems that a literal in a {{LIKE}} expression does not get quite the same escaping as
other string literals or as non-literal {{LIKE}} patterns; the behaviour apparently depends on
whether the left-hand side and/or right-hand side are literals or columns.

Note: I'm using triple-quotes below to avoid Scala-level \ escaping.
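
As a quick reminder of the two levels of escaping involved below:
{code}
// Triple-quoted Scala strings do not process \ escapes, so the SQL parser
// sees the backslashes exactly as written:
scala> """SELECT '\\'"""
res0: String = SELECT '\\'
// Spark's SQL string-literal parser then reduces '\\' to a single \,
// which is why '\\' denotes the one-character string \.
{code}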

On Spark 2.2.0:
{code}
// both LHS & RHS literals

scala> spark.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
+---+---------+
|  \|\ LIKE \\|
+---+---------+
|  \|     true|
+---+---------+

scala> spark.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
org.apache.spark.sql.AnalysisException: the pattern '\' is invalid, it is not allowed to end with the escape character;
  at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
  at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:53)
  at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
  at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.compile(regexpExpressions.scala:50)
  at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.pattern(regexpExpressions.scala:53)
  at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.nullSafeEval(regexpExpressions.scala:56)
  at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
  ...

scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
+---+-------------+
|a\b|a\b LIKE a\\b|
+---+-------------+
|a\b|         true|
+---+-------------+

scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
org.apache.spark.sql.AnalysisException: the pattern 'a\b' is invalid, the escape character is not allowed to precede 'b';
  at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
  at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
  at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
  ...

// test data
spark.sql("""SELECT * FROM test""").show()
+----+----+
|  _1|  _2|
+----+----+
|  ok|  ok|
|  Ok|  ok|
| a_b| a_b|
| aab| a_b|
| c%d| c%d|
|caad| c%d|
|e\nf|e\nf|
| e
f|e\nf|
+----+----+
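
// Hypothetical setup for this table (the actual DDL wasn't shown); note
// the last two rows: _1 holds the two characters \n in one row and a real
// newline in the other.
// Seq(("ok", "ok"), ("Ok", "ok"), ("a_b", "a_b"), ("aab", "a_b"),
//     ("c%d", "c%d"), ("caad", "c%d"), ("e\\nf", "e\\nf"), ("e\nf", "e\\nf"))
//   .toDF("_1", "_2").createOrReplaceTempView("test")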

// both column-based

// not escaping \
scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.sql.AnalysisException: the pattern 'e\nf' is invalid, the escape character is not allowed to precede 'n';
	at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
	at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
	at org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
	...

// escaping \
scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
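// Note on the quadruple backslashes: with triple quotes there is no
// Scala-level unescaping, so SQL receives '([%_\\\\])' and '\\\\$1'; its
// string parser halves these to the regex ([%_\\]) and the replacement
// \\$1, i.e. "prefix every %, _ or \ with a single \".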
+----+----+
|  _1|  _2|
+----+----+
|  ok|  ok|
| a_b| a_b|
| c%d| c%d|
|e\nf|e\nf|
+----+----+

// LHS column-based, RHS literal

scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
+----+----+
|  _1|  _2|
+----+----+
|e\nf|e\nf|
+----+----+

scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
+---+---+
| _1| _2|
+---+---+
+---+---+
{code}

So in Spark 2.2:
* Escaping \ in the RHS is mandatory when the RHS is not a literal; otherwise we get an error.
So far so good.
* When both sides are literals, the same applies.
* If the RHS is a literal and the LHS is a column, the string-literal escaping of \ seems to
interfere with the {{LIKE}} escaping, since re-escaping \ no longer matches. I would expect to
need \\\\ between quotes to match a single \ (one level of escaping for the string literal,
one for the {{LIKE}} escaping). See the sketch below.
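
For reference, these errors are consistent with roughly the following validation in {{StringUtils.escapeLikeRegex}} (my sketch, reconstructed from the error messages and stack traces above, not a verbatim copy of the Spark source). Note that the LHS-column / RHS-literal queries raised no error for the very same pattern 'e\nf', so literal patterns apparently take a different code path:
{code}
import java.util.regex.Pattern

// Sketch of the 2.2 validation suggested by the errors above: after the
// escape character \, only _, % or \ is accepted, and a trailing \ fails.
def escapeLikeRegex22(v: String): String = {
  def fail(message: String) =  // Spark raises an AnalysisException here
    throw new IllegalArgumentException(s"the pattern '$v' is invalid, $message")
  val in = v.iterator
  val out = new StringBuilder()
  while (in.hasNext) {
    in.next() match {
      case '\\' if in.hasNext => in.next() match {
        case c @ ('_' | '%' | '\\') => out ++= Pattern.quote(c.toString)
        case c => fail(s"the escape character is not allowed to precede '$c'")
      }
      case '\\' => fail("it is not allowed to end with the escape character")
      case '_'  => out ++= "."   // unescaped _ matches any one character
      case '%'  => out ++= ".*"  // unescaped % matches any sequence
      case c    => out ++= Pattern.quote(c.toString)
    }
  }
  "(?s)" + out.result()
}

escapeLikeRegex22("""\\""")   // (?s)\Q\\E -- hence '\\' LIKE '\\\\' is true
escapeLikeRegex22("""\""")    // fails: ends with the escape character
escapeLikeRegex22("""a\b""")  // fails: escape character precedes 'b'
{code}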

On Spark 1.6 (and seemingly on every version up to, but not including, 2.2.0):
{code}
// both LHS & RHS literals

scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
+---+-----+
|_c0|  _c1|
+---+-----+
|  \|false|
+---+-----+

scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
+---+-----+
|_c0|  _c1|
+---+-----+
|  \|false|
+---+-----+

scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
+---+----+
|_c0| _c1|
+---+----+
|a\b|true|
+---+----+

scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
+---+----+
|_c0| _c1|
+---+----+
|a\b|true|
+---+----+

// Same test data as for 2.2

// both column-based

// not escaping \
scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
+----+----+
|  _1|  _2|
+----+----+
|  ok|  ok|
| a_b| a_b|
| c%d| c%d|
|e\nf|e\nf|
+----+----+

// escaping \
scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
+----+----+
|  _1|  _2|
+----+----+
|  ok|  ok|
| a_b| a_b|
| c%d| c%d|
|e\nf|e\nf|
+----+----+

// LHS column-based, RHS literal

scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
+----+----+
|  _1|  _2|
+----+----+
|e\nf|e\nf|
+----+----+

scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
+---+---+
| _1| _2|
+---+---+
+---+---+
{code}

So in Spark 1.6:
* A literal pattern consisting of a single \ cannot be matched at all.
* Matching a string (literal or column) containing \ works whether you escape the \ or not,
which makes me think there is room for ambiguity (would \\n match \n?). See the sketch below.
* If the RHS is a literal and the LHS is a column, same issue as in Spark 2.2.
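
For what it's worth, all of the 1.6 results above are consistent with a translation that emits nothing for each \ and keeps the character after it literally, along the lines of this sketch (my reconstruction from the outputs above, not the actual 1.6 source):
{code}
import java.util.regex.Pattern

// Each \ emits nothing on its own, and the character following a \ is kept
// literally, so the patterns \n and \\n compile to the same regex -- which
// is exactly the ambiguity noted above.
def escapeLikeRegex16(v: String): String =
  if (v.isEmpty) v
  else "(?s)" + (' ' +: v.init).zip(v).map {
    case (_, '\\') => ""                          // the \ itself is dropped
    case ('\\', c) => c match {
      case '_' | '%' => c.toString                // escaped wildcard, literal
      case _         => Pattern.quote("\\" + c)   // \c stays the two chars \c
    }
    case (_, '_') => "."
    case (_, '%') => ".*"
    case (_, c)   => Pattern.quote(c.toString)
  }.mkString

escapeLikeRegex16("""e\nf""")   // matches the literal e\nf
escapeLikeRegex16("""e\\nf""")  // same regex: \\n and \n are indistinguishable
escapeLikeRegex16("""\""")      // just (?s): matches only the empty string
{code}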


