spark-dev mailing list archives

From Ewan Leith <ewan.le...@realitymine.com>
Subject RE: Dataframe nested schema inference from Json without type conflicts
Date Mon, 05 Oct 2015 16:04:33 GMT
I've done some digging today and, as a quick and ugly fix, altering the case statement of
the JSON inferField function in InferSchema.scala

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala

to

case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE | VALUE_FALSE =>
  StringType

rather than the existing per-type rules works as we'd want: every primitive value is
inferred as a String, while arrays and structs are still picked up as normal.
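
For context, a rough standalone sketch of the idea (this is not the actual Spark source,
just the collapsed primitive handling; structural tokens like START_OBJECT and START_ARRAY
would keep their existing recursive treatment):

import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.core.JsonToken._
import org.apache.spark.sql.types.{DataType, NullType, StringType}

// Collapsed primitive inference: every scalar JSON token becomes StringType.
def inferPrimitive(parser: JsonParser): DataType =
  parser.getCurrentToken match {
    case null | VALUE_NULL => NullType
    case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT
       | VALUE_TRUE | VALUE_FALSE => StringType
    case other =>
      // START_OBJECT / START_ARRAY etc. are handled by the existing
      // struct and array inference, which this change leaves untouched.
      sys.error(s"structural token $other: unchanged from the current code")
  }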

If we were to wrap this up in a configuration setting in JSONRelation, like the existing
samplingRatio setting, with the default preserving the current behaviour, does anyone think
a pull request would plausibly get into the main Spark codebase?
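
As a sketch of what that might look like (mirroring how JSONRelation already reads
samplingRatio from its parameters map; "primitivesAsString" is just a hypothetical name
for the new option):

// Inside JSONRelation, where parameters: Map[String, String] is in scope.
val samplingRatio =
  parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)
// Hypothetical new option; defaults to false so existing inference is unchanged.
val primitivesAsString =
  parameters.get("primitivesAsString").map(_.toBoolean).getOrElse(false)

A caller would then opt in with something like:

sqlContext.read.format("json").option("primitivesAsString", "true").load("events.json")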

Thanks,
Ewan



From: Ewan Leith [mailto:ewan.leith@realitymine.com]
Sent: 02 October 2015 01:57
To: yhuai@databricks.com
Cc: rxin@databricks.com; dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts


Exactly, that's a much better way to put it.



Thanks,

Ewan



------ Original message ------
From: Yin Huai
Date: Thu, 1 Oct 2015 23:54
To: Ewan Leith
Cc: rxin@databricks.com; dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure of your data
(basically you want Spark SQL to infer the type of complex values like arrays and structs
but keep the type of primitive values as strings), right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith <ewan.leith@realitymine.com> wrote:

We could, but if a client sends some unexpected records in the schema (which happens more
than I'd like; our schema seems to constantly evolve), it's fantastic how Spark picks up on
that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what we'll probably
have to adopt if we can't come up with a way to keep the inference working.



Thanks,

Ewan



------ Original message ------
From: Reynold Xin
Date: Thu, 1 Oct 2015 22:12
To: Ewan Leith
Cc: dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?
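
For example, something along these lines, assuming a SQLContext named sqlContext and
made-up field and file names:

import org.apache.spark.sql.types._

// Supplying the schema up front skips inference entirely.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("payload", StructType(Seq(
    StructField("value", StringType)
  )))
))

val df = sqlContext.read.schema(schema).json("events.json")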

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.leith@realitymine.com> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but when we're
using Spark Streaming on small batches of data, we sometimes find that Spark infers a more
specific type than it should. For example, if the JSON in a small batch contains only
integer values for what is really a String field, it'll class the field as an integer type
on one streaming batch, then as a String on the next.
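
A minimal reproduction of the conflict, assuming a SparkContext named sc and a SQLContext
named sqlContext are in scope:

// Batch 1: "id" happens to contain only digits, so inference picks a numeric type.
val batch1 = sc.parallelize(Seq("""{"id": 123}"""))
sqlContext.read.json(batch1).printSchema()   // id: long

// Batch 2: the same field arrives as free text, so inference picks StringType.
val batch2 = sc.parallelize(Seq("""{"id": "abc123"}"""))
sqlContext.read.json(batch2).printSchema()   // id: string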

Instead, we'd rather match every value as a String type, then handle any casting to a desired
type later in the process.

I can't see any simple way to avoid this at the moment, but we could add the functionality
in the JacksonParser.scala file, probably in convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
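
The idea, very roughly, would be that when the requested type is StringType, convertField
accepts any scalar token and returns its textual form (a hedged sketch, not the actual
Spark code):

import com.fasterxml.jackson.core.JsonParser
import org.apache.spark.unsafe.types.UTF8String

// parser.getText returns the text form of the current token, so the JSON
// values 123 and "123" would both come back as the string "123".
def convertAnyScalarToString(parser: JsonParser): UTF8String =
  UTF8String.fromString(parser.getText)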

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan


