Hi Michael,

"Personally, I usually take a small sample of data and use schema inference on that. I then hardcode that schema into my program. This makes your spark jobs much faster and removes the possibility of the schema changing underneath the covers."

This may or may not work for us. Not all rows have the same schema. The number of distinct schemas we have now may be small, but going forward it can grow to any number; moreover, a distinct call can lead to a table scan, which for us can mean billions of rows.

I would also agree that it is better to keep the API consistent than to make an exception; however, I wonder if it makes sense to provide an action call to infer the schema, which would return a new DataFrame after the action finishes (after schema inference)? For example, something like below?

val inferredDF = df.inferSchema(col1)
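
With today's APIs I imagine such a call doing roughly the following under the hood (just a rough sketch to illustrate the idea, assuming Spark 2.1's from_json; the helper and names below are made up):

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, from_json}

// Sketch only: pull the JSON string column out, let spark.read.json infer the
// schema from it, then re-parse the column with that inferred schema.
def inferAndParse(df: DataFrame, jsonCol: String): DataFrame = {
  import df.sparkSession.implicits._
  val jsonStrings: Dataset[String] = df.select(col(jsonCol)).as[String]
  val inferredSchema = df.sparkSession.read.json(jsonStrings.rdd).schema
  df.withColumn(jsonCol, from_json(col(jsonCol), inferredSchema))
}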

Thanks,

On Mon, Nov 28, 2016 at 6:12 PM, Michael Armbrust <michael@databricks.com> wrote:
You could open up a JIRA to add a version of from_json that supports schema inference, but unfortunately that would not be super easy to implement.  In particular, it would introduce a weird case where only this specific function would block for a long time while we infer the schema (instead of waiting for an action).  This blocking would be kind of odd for a call like df.select(...).  If there is enough interest, though, we should still do it.

To give a little more detail, your version of the code is actually doing two passes over the data: one to infer the schema and a second for whatever processing you are asking it to do.  We have to know the schema at each step of DataFrame construction, so we'd have to do this even before you called an action.

Personally, I usually take a small sample of data and use schema inference on that.  I then hardcode that schema into my program.  This makes your spark jobs much faster and removes the possibility of the schema changing underneath the covers.

Here's some code I use to build the static schema code automatically.
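
(A rough sketch of that idea, using the schema's JSON representation rather than generated Scala code; the sample path and fields below are only placeholders.)

import org.apache.spark.sql.types.{DataType, StructType}

// One-off step: infer the schema from a small sample and print it in a form
// that can be pasted back into the program.
val sampleSchema = spark.read.json("/path/to/small-sample.json").schema
println(sampleSchema.prettyJson)

// In the actual job: hardcode the printed JSON and rebuild the StructType from
// it, so no inference pass ever runs over the full data.
val schemaJson =
  """{"type":"struct","fields":[
    |{"name":"id","type":"string","nullable":true,"metadata":{}},
    |{"name":"amount","type":"double","nullable":true,"metadata":{}}]}""".stripMargin
val staticSchema = DataType.fromJson(schemaJson).asInstanceOf[StructType]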

Would that work for you? If not, why not?

On Wed, Nov 23, 2016 at 2:48 AM, kant kodali <kanth909@gmail.com> wrote:
Hi Michael,

Looks like all of the from_json functions will require me to pass a schema, and that can be a little tricky for us, but the code below doesn't require me to pass a schema at all.

import org.apache.spark.sql._
// Pull the raw JSON string out of each Row and let spark.read.json infer the schema.
val rdd = df2.rdd.map { case Row(j: String) => j }
spark.read.json(rdd).show()

On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <michael@databricks.com> wrote:
The first release candidate should be coming out this week. You can subscribe to the dev list if you want to follow the release schedule.

On Mon, Nov 21, 2016 at 9:34 PM, kant kodali <kanth909@gmail.com> wrote:
Hi Michael,

I only see Spark 2.0.2, which is what I am using currently. Any idea when 2.1 will be released?

Thanks,
kant

On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust <michael@databricks.com> wrote:
In Spark 2.1 we've added a from_json function that I think will do what you want.
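
A minimal example of how it can be used (the column name and schema fields here are just illustrative):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Assume df has a string column named "json" holding the raw JSON payload.
val schema = new StructType()
  .add("id", StringType)
  .add("amount", DoubleType)

val parsed = df
  .select(from_json(col("json"), schema).as("data"))
  .select("data.*")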

On Fri, Nov 18, 2016 at 2:29 AM, kant kodali <kanth909@gmail.com> wrote:
This seems to work:

import org.apache.spark.sql._
// Pull the raw JSON string out of each Row and let spark.read.json infer the schema.
val rdd = df2.rdd.map { case Row(j: String) => j }
spark.read.json(rdd).show()

However, I wonder if there is any inefficiency here, since I have to apply this to billions of rows.