spark-dev mailing list archives

From Ewan Leith <ewan.le...@realitymine.com>
Subject Selecting column in dataframe created with incompatible schema causes AnalysisException
Date Wed, 02 Mar 2016 09:44:26 GMT
When you create a DataFrame using the sqlContext.read.schema() API and pass in a schema that's
compatible with some of the records but incompatible with others, it seems you can't do a
.select on the problematic columns; instead you get an AnalysisException.
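
Roughly, the pattern looks like this (the paths and column name here are just placeholders; the full reproduction is further down):

// Sketch of the pattern only - placeholder paths and column name
val inferredSchema = sqlContext.read.json("/tmp/good.json").schema        // schema inferred from records it fits
val df = sqlContext.read.schema(inferredSchema).json("/tmp/mixed.json")   // same schema applied to records it doesn't fit
df.select("some.nested.column")                                           // throws AnalysisException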

I know loading the wrong data isn't good behaviour, but if you're reading data from (for example)
JSON files, there are going to be malformed files along the way. It would be nice to handle
this error more gracefully, though I don't know the best way to approach it.

Before I raise a JIRA ticket about it, would people consider this to be a bug or expected
behaviour?

I've attached a couple of sample JSON files, and the steps below reproduce the problem by taking
the schema inferred from simple1.json and applying it to a union of simple1.json and simple2.json.
If you do a .select on the parent column and print the output, you can see the data has been
parsed as I think you'd want, but a .select on the problem column throws an exception instead.

scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)
s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at <console>:27

scala> val s1schema = sqlContext.read.json(s1Rdd).schema
s1schema: org.apache.spark.sql.types.StructType = StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
StructField(name,StringType,true)),true),true), StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
StructField(id,LongType,true)),true),true)),true),true)),true),true))
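
(The definition of s2Rdd didn't make it into the transcript; it's just the union of the raw contents of the two files, something like:)

val s2Rdd = s1Rdd.union(sc.wholeTextFiles("/tmp/simple2.json").map(x => x._2))  // union of the raw text of simple1.json and simple2.json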

scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)
[WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don Joeh]),null], [null,WrappedArray([ACME,2])]))]
[WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], [WrappedArray([2,null]),null]))]

scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")
org.apache.spark.sql.AnalysisException: cannot resolve 'data.stuff[onetype]' due to data type
mismatch: argument 2 requires integral type, however, 'onetype' is of string type.;
                at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
                at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
                at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
                at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
                at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)

(The full exception is attached too).

What do people think? Is this a bug?

Thanks,
Ewan
