spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
Date Fri, 23 Sep 2016 21:13:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517589#comment-15517589 ]

Josh Rosen commented on SPARK-17647:
------------------------------------

I think that the first case is clearly a bug (and have a fix) but I'm not so sure about the
second case. Consider:

{code}
scala> ".*\\\\\\\\.*".r.findFirstIn("\\\\")
res8: Option[String] = Some(\\)
{code}

In a regular expression, two backslashes denote an escaped backslash. Setting Java strings
aside for a moment, consider using pencil and paper to write a regex which matches a single
backslash character: within a regex, a backslash acts as the escape character, so you need
two consecutive backslashes. When we take our handwritten regex with two backslashes and
encode it as a Java string literal, we need an additional layer of backslash escaping to
account for the string literal's own escape processing, yielding four consecutive backslashes.
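
The two layers can be checked directly. The following is a minimal sketch rather than anything
taken from Spark itself; the object and value names are made up for illustration. It relies only
on {{java.util.regex.Pattern}} and on Scala's triple-quoted strings, which do not process
backslash escapes.

```scala
import java.util.regex.Pattern

// Illustrative names only; nothing here comes from the Spark code base.
object BackslashLayers {
  // The regex as written on paper: two backslash characters, matching one literal backslash.
  // In an ordinary string literal each of them is doubled, giving four in the source.
  val regexOnPaper: String = "\\\\"

  // The same two characters written as a triple-quoted string, which skips escape processing.
  val regexRaw: String = """\\"""

  // The input to match: one backslash character (two in the source literal).
  val singleBackslash: String = "\\"

  def main(args: Array[String]): Unit = {
    println(regexOnPaper.length)                             // number of characters in the regex
    println(regexOnPaper == regexRaw)                        // both spellings denote the same string
    println(Pattern.matches(regexOnPaper, singleBackslash))  // does the regex match one backslash?
  }
}
```

Counting characters after each layer is usually the quickest way to see which layer consumed
which backslash.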

One illustration of this is the fact that the Java string literal {code}"\\"{code}, which holds
a single backslash character, is not considered a valid regex:

{code}
scala> "\\".r
java.util.regex.PatternSyntaxException: Unexpected internal error near index 1
\
 ^
  at java.util.regex.Pattern.error(Pattern.java:1955)
  at java.util.regex.Pattern.compile(Pattern.java:1702)
  at java.util.regex.Pattern.<init>(Pattern.java:1351)
  at java.util.regex.Pattern.compile(Pattern.java:1028)
  at scala.util.matching.Regex.<init>(Regex.scala:191)
  at scala.collection.immutable.StringLike$class.r(StringLike.scala:284)
  at scala.collection.immutable.StringOps.r(StringOps.scala:29)
  at scala.collection.immutable.StringLike$class.r(StringLike.scala:273)
  at scala.collection.immutable.StringOps.r(StringOps.scala:29)
  ... 28 elided
{code}

The second example returns {{true}} on MySQL.

On MySQL, running {code}select '\\' rlike '\\'{code} will fail with a syntax error because
the pattern, a single backslash once the string literal has been unescaped, is interpreted as
a trailing escape character rather than as a backslash literal, while
{code}select '\\' rlike '\\\\'{code} will return true.
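
The MySQL behaviour described above can be modelled as two steps: unescape the SQL string
literals, then hand the resulting pattern to the regex engine. The sketch below is a simplified
model under stated assumptions, not MySQL's or Spark's actual parser: {{unescapeLiteral}} and
{{rlike}} are hypothetical helpers, only the backslash-backslash escape is handled, and
{{java.util.regex}} stands in for MySQL's regex engine.

```scala
import java.util.regex.Pattern
import scala.util.Try

// Simplified model; helper names are hypothetical and not part of MySQL or Spark.
object SqlLiteralSketch {
  // Assumption: the only escape handled is a doubled backslash, collapsed to a single one.
  // A real SQL parser also handles sequences such as \n, \t and \'.
  def unescapeLiteral(body: String): String = body.replace("\\\\", "\\")

  // Unescape both literal bodies, then run an unanchored regex search.
  // An invalid pattern is reported as a Failure instead of an exception.
  def rlike(valueBody: String, patternBody: String): Try[Boolean] =
    Try(Pattern.compile(unescapeLiteral(patternBody)).matcher(unescapeLiteral(valueBody)).find())

  def main(args: Array[String]): Unit = {
    // Triple-quoted strings hold the SQL text between the quotes exactly as typed.
    println(rlike("""\\""", """\\"""))              // pattern becomes a lone backslash
    println(rlike("""\\""", """\\\\"""))            // pattern becomes an escaped backslash
    println(rlike("""\\\\""", """.*\\\\\\\\.*"""))  // second query from the issue description
  }
}
```

Under this model the first call fails to compile the pattern, while the other two report a match,
which is consistent with the MySQL results quoted above.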

> SQL LIKE/RLIKE do not handle backslashes correctly
> --------------------------------------------------
>
>                 Key: SPARK-17647
>                 URL: https://issues.apache.org/jira/browse/SPARK-17647
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Xiangrui Meng
>              Labels: correctness
>
> Try the following in SQL shell:
> {code}
> select '\\\\' like '%\\%';
> select '\\\\' rlike '.*\\\\\\\\.*';
> {code}
> The first returned false and the second returned true. Both are wrong.
> cc: [~yhuai] [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

