spark-user mailing list archives

From Jeff Zhang <zjf...@gmail.com>
Subject Re: DataFrame#show cost 2 Spark Jobs ?
Date Tue, 25 Aug 2015 00:11:16 GMT
Hi Cheng,

I know that sqlContext.read will trigger one Spark job to infer the schema.
What I mean is that DataFrame#show by itself costs 2 Spark jobs, so overall it
costs 3 jobs.

Here's the command I use:

>> val df = sqlContext.read.json("file:///Users/hadoop/github/spark/examples/src/main/resources/people.json")
       // triggers one Spark job to infer the schema
>> df.show()            // triggers 2 Spark jobs, which is weird
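
One way to check the job counts described above is to register a SparkListener and count job-start events around each call. This is only a rough sketch under a few assumptions: the object name CountJobs, the counted helper, the local[2] master and the one-second sleep are illustrative and not from the thread, and the path to people.json is passed as the first argument. Listener events are delivered asynchronously, so the counts are approximate.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SQLContext

// Hypothetical helper program: counts the Spark jobs started by each call.
object CountJobs {
  def main(args: Array[String]): Unit = {
    // local[2] is an assumption made only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("count-jobs").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Incremented once for every job the scheduler starts.
    val jobsStarted = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        jobsStarted.incrementAndGet()
    })

    // Runs `action` and prints how many jobs started while it ran.
    // The sleep gives the asynchronous listener bus time to deliver events.
    def counted[T](label: String)(action: => T): T = {
      val before = jobsStarted.get()
      val result = action
      Thread.sleep(1000)
      println(s"$label: ${jobsStarted.get() - before} job(s)")
      result
    }

    val path = args(0) // e.g. the people.json path used in the thread
    val df = counted("read.json") { sqlContext.read.json(path) }
    counted("show") { df.show() }

    sc.stop()
  }
}
```

The same counts are also visible on the Jobs tab of the Spark web UI, which avoids the timing caveat.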




On Mon, Aug 24, 2015 at 10:56 PM, Cheng, Hao <hao.cheng@intel.com> wrote:

> The first job is to infer the JSON schema, and the second one is the query
> itself.
>
> You can provide the schema while loading the json file, like below:
>
>
>
> sqlContext.read.schema(xxx).json("…")?
>
>
>
> Hao
>
> *From:* Jeff Zhang [mailto:zjffdu@gmail.com]
> *Sent:* Monday, August 24, 2015 6:20 PM
> *To:* user@spark.apache.org
> *Subject:* DataFrame#show cost 2 Spark Jobs ?
>
>
>
> It's weird to me that the simple show function costs 2 Spark jobs.
> DataFrame#explain shows it is a very simple operation, so I'm not sure why it
> needs 2 jobs.
>
>
>
> == Parsed Logical Plan ==
> Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>
> == Analyzed Logical Plan ==
> age: bigint, name: string
> Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>
> == Optimized Logical Plan ==
> Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>
> == Physical Plan ==
> Scan JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
>
>
>
>
>
>
>
> --
>
> Best Regards
>
> Jeff Zhang
>
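
For reference, the suggestion quoted above (passing a schema to the reader so that no inference job is needed) might look roughly like the following. This is a sketch, not something posted in the thread: the field names and types are taken from the analyzed plan quoted above (age: bigint, name: string), while the object name ReadWithSchema, the local[2] master and the nullable flags are assumptions. The path to people.json is passed as the first argument.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical example program: loads JSON with an explicit schema.
object ReadWithSchema {
  def main(args: Array[String]): Unit = {
    // local[2] is an assumption made only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("read-with-schema").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Mirrors "age: bigint, name: string" from the analyzed logical plan.
    // Marking both fields nullable is an assumption.
    val schema = StructType(Seq(
      StructField("age", LongType, nullable = true),
      StructField("name", StringType, nullable = true)
    ))

    // With the schema supplied up front, the reader does not have to scan
    // the file to infer it.
    val df = sqlContext.read.schema(schema).json(args(0))
    df.show()

    sc.stop()
  }
}
```

This only addresses the schema-inference job; it does not by itself explain the two jobs reported for df.show().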



-- 
Best Regards

Jeff Zhang
