spark-user mailing list archives

From Ewan Leith <ewan.le...@realitymine.com>
Subject Re: Selecting column in dataframe created with incompatible schema causes AnalysisException
Date Wed, 02 Mar 2016 19:51:34 GMT
Hi Reynold, yes that would be perfect for our use case.

I assume it doesn't exist though, otherwise I really need to go re-read the docs!

Thanks to both of you for replying by the way, I know you must be hugely busy.

Ewan

Are you looking for a "relaxed" mode that simply returns nulls for fields that don't exist
or have an incompatible schema?
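
(As far as I know such a relaxed mode isn't built in. Purely to illustrate the idea, a hand-rolled
approximation for top-level fields might look like the sketch below; the selectOrNull name is made
up, and nested paths like "data.stuff" aren't handled.)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical helper: select each named top-level column if it exists in the
// DataFrame's schema, otherwise substitute a null literal under that name.
def selectOrNull(df: DataFrame, names: String*): DataFrame = {
  val present = df.schema.fieldNames.toSet
  val cols = names.map(n => if (present.contains(n)) col(n) else lit(null).as(n))
  df.select(cols: _*)
}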


On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith <ewan.leith@realitymine.com> wrote:

Thanks, Michael. It's not a great example really, as the data I'm working with has some source
files that fit the schema and some that don't (out of millions that work, perhaps 10 might not).

In an ideal world for us the select would probably return the valid records only.

We're trying out the new dataset APIs to see if we can do some pre-filtering that way.
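
(For what it's worth, one rough pre-filter sketch, reusing the s1schema and s2Rdd names from the
example further down, would be to parse with the known schema and then keep only rows where a leaf
that every valid record should populate came back non-null; the choice of data[0].stuff[0].onetype
here is purely illustrative.)

import org.apache.spark.sql.functions.col

val parsed = sqlContext.read.schema(s1schema).json(s2Rdd)
// Keep rows where the illustrative "required" leaf actually materialised after parsing.
val valid = parsed.filter(
  col("data").getItem(0).getField("stuff").getItem(0).getField("onetype").isNotNull)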

Thanks,
Ewan

-dev +user

StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
StructField(name,StringType,true)),true),true), StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
StructField(id,LongType,true)),true),true)),true),true)),true),true))

It's not a great error message, but as the schema above shows, stuff is an array, not a struct.
So you need to pick a particular element (using []) before you can pull out a specific field.
It would be easier to see this if you ran sqlContext.read.json(s1Rdd).printSchema(), which
gives you a tree view. Try the following.

sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")
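
(If you want the field from every element of the arrays rather than just the first, one possible
alternative, shown only as a sketch using the same names from the thread, is to explode the nested
arrays first:)

import org.apache.spark.sql.functions.{col, explode}

sqlContext.read.schema(s1schema).json(s1Rdd)
  .select(explode(col("data")).as("d"))      // one row per element of data
  .select(explode(col("d.stuff")).as("s"))   // one row per element of stuff
  .select("s.onetype")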

On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith <ewan.leith@realitymine.com> wrote:
When you create a DataFrame using the sqlContext.read.schema() API, if you pass in a schema
that’s compatible with some of the records but incompatible with others, it seems you can’t
do a .select on the problematic columns; instead you get an AnalysisException.

I know loading the wrong data isn’t good behaviour, but if you’re reading data from (for
example) JSON files, there are going to be malformed files along the way. I think it would
be good to handle this error more gracefully, though I don’t know the best way to approach
it.
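
(One interim way to stop the error propagating, shown only as a sketch using the names from the
example below, is to wrap the select in a Try and treat an analysis failure as "no result":)

import scala.util.Try

val df = sqlContext.read.schema(s1schema).json(s2Rdd)
// None if the select fails analysis, Some(DataFrame) otherwise.
val onetype = Try(df.select("data.stuff.onetype")).toOption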

Before I raise a JIRA ticket about it, would people consider this to be a bug or expected
behaviour?

I’ve attached a couple of sample JSON files and the steps below to reproduce it, by taking
the inferred schema from the simple1.json file, and applying it to a union of simple1.json
and simple2.json. You can visually see the data has been parsed as I think you’d want if
you do a .select on the parent column and print out the output, but when you do a select on
the problem column you instead get an exception.

scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)
s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at <console>:27

scala> val s1schema = sqlContext.read.json(s1Rdd).schema
s1schema: org.apache.spark.sql.types.StructType = StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
StructField(name,StringType,true)),true),true), StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
StructField(id,LongType,true)),true),true)),true),true)),true),true))
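
(The definition of s2Rdd isn’t shown above; presumably it is the union of both sample files,
something along the lines of the following reconstruction rather than a line from the original
session.)

val s2Rdd = s1Rdd.union(sc.wholeTextFiles("/tmp/simple2.json").map(x => x._2))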

scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)
[WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don Joeh]),null], [null,WrappedArray([ACME,2])]))]
[WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], [WrappedArray([2,null]),null]))]

scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")
org.apache.spark.sql.AnalysisException: cannot resolve 'data.stuff[onetype]' due to data type
mismatch: argument 2 requires integral type, however, 'onetype' is of string type.;
                at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
                at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
                at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
                at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
                at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)

(The full exception is attached too).

What do people think, is this a bug?

Thanks,
Ewan




