spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From kant kodali <kanth...@gmail.com>
Subject Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL
Date Mon, 05 Dec 2016 10:51:31 GMT
Hi Michael,

" Personally, I usually take a small sample of data and use schema
inference on that.  I then hardcode that schema into my program.  This
makes your spark jobs much faster and removes the possibility of the schema
changing underneath the covers."

This may or may not work for us. Not all rows have the same schema. The
number of distinct schemas we have now may be small but going forward this
can go to any number moreover a distinct call can lead to a table scan
which can be billions of rows for us.

I also would agree to keep the API consistent than making an exception
however I wonder if it make sense to provide an action call to infer the
schema which would return a new dataframe after the action call finishes
(after schema inference)? For example, something like below ?

val inferedDF = df.inferSchema(col1);

Thanks,




On Mon, Nov 28, 2016 at 6:12 PM, Michael Armbrust <michael@databricks.com>
wrote:

> You could open up a JIRA to add a version of from_json that supports
> schema inference, but unfortunately that would not be super easy to
> implement.  In particular, it would introduce a weird case where only this
> specific function would block for a long time while we infer the schema
> (instead of waiting for an action).  This blocking would be kind of odd for
> a call like df.select(...).  If there is enough interest, though, we
> should still do it.
>
> To give a little more detail, your version of the code is actually doing
> two passes over the data: one to infer the schema and a second for whatever
> processing you are asking it to do.  We have to know the schema at each
> step of DataFrame construction, so we'd have to do this even before you
> called an action.
>
> Personally, I usually take a small sample of data and use schema inference
> on that.  I then hardcode that schema into my program.  This makes your
> spark jobs much faster and removes the possibility of the schema changing
> underneath the covers.
>
> Here's some code I use to build the static schema code automatically
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1128172975083446/2840265927289860/latest.html>
> .
>
> Would that work for you? If not, why not?
>
> On Wed, Nov 23, 2016 at 2:48 AM, kant kodali <kanth909@gmail.com> wrote:
>
>> Hi Michael,
>>
>> Looks like all from_json functions will require me to pass schema and
>> that can be little tricky for us but the code below doesn't require me to
>> pass schema at all.
>>
>> import org.apache.spark.sql._
>> val rdd = df2.rdd.map { case Row(j: String) => j }
>> spark.read.json(rdd).show()
>>
>>
>> On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> The first release candidate should be coming out this week. You can
>>> subscribe to the dev list if you want to follow the release schedule.
>>>
>>> On Mon, Nov 21, 2016 at 9:34 PM, kant kodali <kanth909@gmail.com> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> I only see spark 2.0.2 which is what I am using currently. Any idea on
>>>> when 2.1 will be released?
>>>>
>>>> Thanks,
>>>> kant
>>>>
>>>> On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>>> In Spark 2.1 we've added a from_json
>>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2902>
>>>>> function that I think will do what you want.
>>>>>
>>>>> On Fri, Nov 18, 2016 at 2:29 AM, kant kodali <kanth909@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> This seem to work
>>>>>>
>>>>>> import org.apache.spark.sql._
>>>>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>>>>> spark.read.json(rdd).show()
>>>>>>
>>>>>> However I wonder if this any inefficiency here ? since I have to
>>>>>> apply this function for billion rows.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message