spark-issues mailing list archives

From "Cheng Lian (JIRA)"
Subject [jira] [Created] (SPARK-6319) DISTINCT doesn't work for binary type
Date Fri, 13 Mar 2015 12:10:38 GMT
Cheng Lian created SPARK-6319:

             Summary: DISTINCT doesn't work for binary type
                 Key: SPARK-6319
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.1, 1.1.1, 1.0.2, 1.3.0
            Reporter: Cheng Lian

Spark shell session for reproduction:
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType)
CAST(c, BinaryType)
Spark SQL uses plain byte arrays to represent binary values, but JVM arrays are compared
by reference rather than by value. The DISTINCT operator, in turn, uses a {{HashSet}}
and its {{.contains}} method to check for duplicated values. Together, these two facts cause
the problem: two byte arrays with identical contents are never recognized as duplicates.
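
The underlying JVM behaviour can be seen without Spark. The following is a minimal sketch in plain Scala, not Spark's actual DISTINCT implementation; the object name {{BinaryDistinctSketch}} and its fields are hypothetical and exist only for illustration. It shows that a {{HashSet}} of byte arrays fails to detect a duplicate, while wrapping each array in a {{Seq}} (one possible way to get value-based equality, not necessarily how Spark addresses the issue) does detect it.

```scala
import scala.collection.mutable

// Hypothetical illustration only; this is not Spark code.
object BinaryDistinctSketch {
  // Two separate arrays holding the same bytes.
  val a: Array[Byte] = "1".getBytes("UTF-8")
  val b: Array[Byte] = "1".getBytes("UTF-8")

  // A HashSet of raw arrays relies on Array's identity-based equals/hashCode.
  val seen: mutable.HashSet[Array[Byte]] = mutable.HashSet.empty[Array[Byte]]
  seen += a

  // Wrapping the arrays in Seq gives value-based equals/hashCode.
  val seenByValue: mutable.HashSet[Seq[Byte]] = mutable.HashSet.empty[Seq[Byte]]
  seenByValue += a.toSeq

  def main(args: Array[String]): Unit = {
    println(java.util.Arrays.equals(a, b))  // element-wise comparison
    println(seen.contains(b))               // lookup by reference
    println(seenByValue.contains(b.toSeq))  // lookup by value
  }
}
```

Here {{seen.contains(b)}} is false even though {{a}} and {{b}} hold the same bytes, which mirrors why the duplicated binary values survive DISTINCT.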

This message was sent by Atlassian JIRA
