spark-user mailing list archives

From Cheng Lian <l...@databricks.com>
Subject Re: Dataframe schema...
Date Fri, 21 Oct 2016 22:15:20 GMT
Hi Muthu,

What version of Spark are you using? This seems to be a bug in the 
analysis phase.
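
In the meantime, one possible workaround (a rough, untested sketch, 
assuming Spark 2.0.x and the d1/d2 DataFrames quoted below) is to 
re-apply the parquet-side schema, whose fields are all nullable, to the 
in-memory DataFrame so that both sides of the union agree on nullability:

// Sketch only: createDataFrame stamps the given schema onto the rows
// without rewriting the data, so the nested containsNull flags of d2
// end up matching those of d1 before the union.
val d2Relaxed = spark.createDataFrame(d2.rdd, d1.schema)
d1.union(d2Relaxed).printSchema()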

Cheng


On 10/21/16 12:50 PM, Muthu Jayakumar wrote:
> Sorry for the late response. Here is what I am seeing...
>
>
> Schema from parquet file.
> d1.printSchema()
> root
>  |-- task_id: string (nullable = true)
>  |-- task_name: string (nullable = true)
>  |-- some_histogram: struct (nullable = true)
>  |    |-- values: array (nullable = true)
>  |    |    |-- element: double (containsNull = true)
>  |    |-- freq: array (nullable = true)
>  |    |    |-- element: long (containsNull = true)
>
> d2.printSchema() // Data created as an in-memory DataFrame and/or processed before being written to a parquet file.
> root
>  |-- task_id: string (nullable = true)
>  |-- task_name: string (nullable = true)
>  |-- some_histogram: struct (nullable = true)
>  |    |-- values: array (nullable = true)
>  |    |    |-- element: double (containsNull = false)
>  |    |-- freq: array (nullable = true)
>  |    |    |-- element: long (containsNull = false)
>
> d1.union(d2).printSchema()
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
> at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
> at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
> at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
> at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
>
> Please advise,
> Muthu
>
> On Thu, Oct 20, 2016 at 1:46 AM, Michael Armbrust 
> <michael@databricks.com <mailto:michael@databricks.com>> wrote:
>
>     What is the issue you see when unioning?
>
>     On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar
>     <babloo80@gmail.com <mailto:babloo80@gmail.com>> wrote:
>
>         Hello Michael,
>
>         Thank you for looking into this query. In my case there seems
>         to be an issue when I union a DataFrame read from a parquet
>         file on disk with another DataFrame that I construct in-memory.
>         The only difference I see between the two schemas is in
>         containsNull = true vs. false. In fact, I do not see any errors
>         with union on the simple schema of "col1 through col4" above.
>         The problem seems to exist only on the "some_histogram" column,
>         which mixes containsNull = true and false.
>         Let me know if this helps.
>
>         Thanks,
>         Muthu
>
>
>
>         On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust
>         <michael@databricks.com <mailto:michael@databricks.com>> wrote:
>
>             Nullable is just a hint to the optimizer: nullable = false
>             tells it that it is impossible for there to be a null value
>             in that column, so it can avoid generating code for
>             null-checks. When in doubt, we set nullable = true since it
>             is always safer to check.
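>
>             For reference, a minimal sketch (assuming Spark 2.x) of
>             where this flag lives: each StructField in a schema carries
>             its own nullable flag, which can be set explicitly when
>             building a schema by hand.
>
>             import org.apache.spark.sql.types._
>
>             // Build a schema with explicit nullability per field.
>             val schema = StructType(Seq(
>               StructField("col1", IntegerType, nullable = false),
>               StructField("col2", StringType, nullable = true)))
>
>             // Print the flags that printSchema() would report.
>             schema.fields.foreach(f =>
>               println(s"${f.name}: nullable = ${f.nullable}"))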
>
>             Why in particular are you trying to change the nullability
>             of the column?
>
>             On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar
>             <babloo80@gmail.com <mailto:babloo80@gmail.com>> wrote:
>
>                 Hello there,
>
>                 I am trying to understand how and when a DataFrame
>                 (or Dataset) sets nullable = true vs. false in a schema.
>
>                 Here is my observation from some sample code I tried...
>
>
>                 scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4", lit("bla")).printSchema()
>                 root
>                  |-- col1: integer (nullable = false)
>                  |-- col2: string (nullable = true)
>                  |-- col3: double (nullable = false)
>                  |-- col4: string (nullable = false)
>
>                 scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4", lit("bla")).write.parquet("/tmp/sample.parquet")
>
>                 scala> spark.read.parquet("/tmp/sample.parquet").printSchema()
>                 root
>                  |-- col1: integer (nullable = true)
>                  |-- col2: string (nullable = true)
>                  |-- col3: double (nullable = true)
>                  |-- col4: string (nullable = true)
>
>
>                 The place where this seems to get me into trouble is
>                 when I try to union one DataFrame constructed in-memory
>                 (whose corresponding array elements carry containsNull =
>                 false in the schema) with one read from a file, which
>                 starts out with a schema like the one below...
>
>                  |-- some_histogram: struct (nullable = true)
>                  |    |-- values: array (nullable = true)
>                  |    |    |-- element: double (containsNull = true)
>                  |    |-- freq: array (nullable = true)
>                  |    |    |-- element: long (containsNull = true)
>
>                 Is there a way to convert this attribute from true to
>                 false without running any mapping / udf on that column?
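>
>                 One rough sketch of a way to do this without any
>                 mapping/UDF, assuming Spark 2.x and a DataFrame named
>                 df with the schema above (the setNullable helper below
>                 is hypothetical, not part of Spark's API): rewrite the
>                 flags in the schema itself, then re-apply it with
>                 createDataFrame.
>
>                 import org.apache.spark.sql.types._
>
>                 // Hypothetical helper: set every nullable / containsNull
>                 // flag in a schema to one fixed value.
>                 def setNullable(dt: DataType, value: Boolean): DataType = dt match {
>                   case s: StructType => StructType(s.map(f =>
>                     f.copy(dataType = setNullable(f.dataType, value), nullable = value)))
>                   case a: ArrayType => ArrayType(setNullable(a.elementType, value), value)
>                   case m: MapType =>
>                     MapType(setNullable(m.keyType, value), setNullable(m.valueType, value), value)
>                   case other => other
>                 }
>
>                 // Re-apply the rewritten schema without mapping the rows.
>                 // Note: flipping a flag to false is only a promise to Spark;
>                 // it does not verify that the column really has no nulls.
>                 val aligned = spark.createDataFrame(df.rdd,
>                   setNullable(df.schema, false).asInstanceOf[StructType])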
>
>                 Please advise,
>                 Muthu
>
>
>
>
>

