spark-user mailing list archives

From "Kelly, Jonathan" <jonat...@amazon.com>
Subject SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
Date Thu, 27 Nov 2014 01:23:58 GMT
I've noticed some strange behavior when I try to use
SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
that contains elements with nested arrays.  For example, with a file
test.json that contains the single line:

	{"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it.  Here's
the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
	at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
	at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
	at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
	at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.scheduler.Task.run(Task.scala:54)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of
these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)


scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table
and then insert the data into the table, it works:

scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
sqlContext.table("test").schema).insertInto("test")
scala> sqlContext.sql("select * from test").collect().foreach(println)
[ArrayBuffer(1, 2, 3)]

Does this mean that there is a bug with how the schema is being
automatically determined when you use HiveContext.jsonFile() for JSON
files that contain nested arrays?  (i.e., should containsNull be true for
the array elements?)  Or is there a bug with how the Hive table is created
from the SchemaRDD?  (i.e., should containsNull in fact be false?)  I can
probably get around this by defining the schema myself rather than using
auto-detection, but for now I'd like to use auto-detection.
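
For reference, a minimal sketch of that explicit-schema workaround. It
assumes the Spark 1.1.0 type aliases exposed by org.apache.spark.sql._ and
the same sqlContext, file path, and existing "test" table as above; the
names explicitSchema and testExplicit are illustrative only. The session
above only confirms the equivalent flow using the schema read back from
the Hive table, so treat this as untested:

```scala
// Spark 1.1.0 exposes the SQL data types through the org.apache.spark.sql package.
import org.apache.spark.sql._

// Hypothetical name. Mirrors the schema reported for the Hive table "test":
// a nullable array column whose elements may be null (containsNull = true).
val explicitSchema = StructType(Seq(
  StructField("values", ArrayType(IntegerType, true), true)
))

// Load the JSON with the explicit schema instead of relying on auto-detection,
// then insert into the existing table, as in the working example above.
val testExplicit = sqlContext.jsonFile("file:///home/hadoop/test.json", explicitSchema)
testExplicit.printSchema
testExplicit.insertInto("test")
```

If this behaves like the reload shown earlier, printSchema should report
containsNull = true for the array elements, matching the Hive table.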

By the way, I'm using Spark 1.1.0.

Thanks,
Jonathan



