spark-user mailing list archives

From Yin Huai <huaiyin....@gmail.com>
Subject Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
Date Thu, 27 Nov 2014 05:37:30 GMT
Hello Jonathan,

There was a bug in how data types were cast before inserting into a Hive
table. Hive has no notion of "containsNull" for array values, so for a Hive
table containsNull is always true for an array, and we should ignore this
field when writing to Hive. This issue has been fixed by
https://issues.apache.org/jira/browse/SPARK-4245, which will be released
with 1.2.
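
Until then, the workaround of supplying a schema by hand can be generalized
by setting containsNull to true on every array in an inferred schema before
inserting. The following is only a minimal sketch of that idea. It uses
stand-in case classes that mimic the shape of Spark SQL's schema types so
that it runs without Spark; the stand-ins and the name SchemaRelax.relax are
hypothetical and are not part of Spark's API.

```scala
// Stand-in types that mimic the shape of Spark SQL's schema classes.
// They are hypothetical and are NOT the real org.apache.spark.sql types.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
case class StructField(name: String, dataType: DataType, nullable: Boolean)
case class StructType(fields: Seq[StructField]) extends DataType

object SchemaRelax {
  // Recursively mark every array as possibly containing nulls, which is
  // what a Hive table assumes for all of its array columns.
  def relax(dt: DataType): DataType = dt match {
    case ArrayType(elem, _) =>
      ArrayType(relax(elem), containsNull = true)
    case StructType(fields) =>
      StructType(fields.map(f => f.copy(dataType = relax(f.dataType))))
    case other =>
      other // primitive types carry no containsNull flag
  }
}
```

Applied to the schema inferred for {"values":[1,2,3]}, relax turns
ArrayType(IntegerType, false) into ArrayType(IntegerType, true), which
matches the schema the Hive table reports. The same pattern match should
carry over to the real schema classes, though that is an assumption and has
not been checked against 1.1.0.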

Thanks,

Yin

On Wed, Nov 26, 2014 at 9:01 PM, Kelly, Jonathan <jonathak@amazon.com>
wrote:

> After playing around with this a little more, I discovered that:
>
> 1. If test.json contains something like {"values":[null,1,2,3]}, the
> schema auto-determined by SchemaRDD.jsonFile() will have "element: integer
> (containsNull = true)", and then
> SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course
> makes sense but doesn't really help).
> 2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json",
> StructType(Seq(StructField("values", ArrayType(IntegerType, true),
> true))))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto()
> work, though as I mentioned before, this is less than ideal.
>
> Why don't saveAsTable/insertInto work when the containsNull properties
> don't match?  I can understand how inserting data with containsNull=true
> into a column where containsNull=false might fail, but I think the other
> way around (which is the case here) should work.
>
> ~ Jonathan
>
>
> On 11/26/14, 5:23 PM, "Kelly, Jonathan" <jonathak@amazon.com> wrote:
>
> >I've noticed some strange behavior when I try to use
> >SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
> >that contains elements with nested arrays.  For example, with a file
> >test.json that contains the single line:
> >
> >       {"values":[1,2,3]}
> >
> >and with code like the following:
> >
> >scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> >scala> val test = sqlContext.jsonFile("test.json")
> >scala> test.saveAsTable("test")
> >
> >it creates the table but fails when inserting the data into it.  Here's
> >the exception:
> >
> >scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
> >       at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
> >       at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
> >       at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
> >       at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
> >       at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
> >       at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
> >       at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >       at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >       at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
> >       at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
> >       at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
> >       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> >       at org.apache.spark.scheduler.Task.run(Task.scala:54)
> >       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> >       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >       at java.lang.Thread.run(Thread.java:745)
> >
> >I'm guessing that this is due to the slight difference in the schemas of
> >these tables:
> >
> >scala> test.printSchema
> >root
> > |-- values: array (nullable = true)
> > |    |-- element: integer (containsNull = false)
> >
> >
> >scala> sqlContext.table("test").printSchema
> >root
> > |-- values: array (nullable = true)
> > |    |-- element: integer (containsNull = true)
> >
> >If I reload the file using the schema that was created for the Hive table
> >then try inserting the data into the table, it works:
> >
> >scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
> >sqlContext.table("test").schema).insertInto("test")
> >scala> sqlContext.sql("select * from test").collect().foreach(println)
> >[ArrayBuffer(1, 2, 3)]
> >
> >Does this mean that there is a bug with how the schema is being
> >automatically determined when you use HiveContext.jsonFile() for JSON
> >files that contain nested arrays?  (i.e., should containsNull be true for
> >the array elements?)  Or is there a bug with how the Hive table is created
> >from the SchemaRDD?  (i.e., should containsNull in fact be false?)  I can
> >probably get around this by defining the schema myself rather than using
> >auto-detection, but for now I'd like to use auto-detection.
> >
> >By the way, I'm using Spark 1.1.0.
> >
> >Thanks,
> >Jonathan
> >
>
>
