spark-user mailing list archives

From Steve Loughran <>
Subject Re: Spark querying parquet data partitioned in S3
Date Wed, 05 Jul 2017 11:36:07 GMT

> On 29 Jun 2017, at 17:44, fran <> wrote:
> We have got data stored in S3 partitioned by several columns. Let's say
> following this hierarchy:
> s3://bucket/data/column1=X/column2=Y/parquet-files
> We run a Spark job in an EMR cluster (1 master, 3 slaves) and realised the
> following:
> A) - When we declare the initial dataframe to be the whole dataset (val df =
> "s3://bucket/data/") then the driver splits the job
> into several tasks (259) that are performed by the executors and we believe
> the driver gets back the parquet metadata.
> Question: The above takes about 25 minutes for our dataset, we believe it
> should be a lazy query (as we are not performing any actions) however it
> looks like something is happening, all the executors are reading from S3. We
> have tried mergeData=false and setting the schema explicitly via
> .schema(someSchema). Is there any way to speed this up?
> B) - When we declare the initial dataframe to be scoped by the first column
> (val df = "s3://bucket/data/column1=X") then it seems
> that all the work (getting the parquet metadata) is done by the driver and
> there is no job submitted to Spark. 
> Question: Why does (A) send the work to executors but (B) does not?
> The above is for EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.

Split calculation can be very slow against object stores, especially if the directory structure
is deep: the treewalk used to enumerate the files is pretty inefficient against an object store.
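One way to keep that listing down to a narrow subtree, while still getting the partition value back as a column, is to point the read at a single partition directory and supply the `basePath` option that Spark 2.x uses for partition discovery. A sketch reusing the bucket layout from the question (paths are the question's, not real ones):

```scala
// Sketch: list only the column1=X subtree, but keep column1 as a column
// by telling Spark where the partitioned table's root is.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .option("basePath", "s3://bucket/data/")   // root for partition discovery
  .parquet("s3://bucket/data/column1=X")     // only this subtree is walked
```

Without `basePath`, scoping the path as in (B) drops `column1` from the schema entirely.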

Then there's the schema merge, which looks at the tail of every file, so has to do a seek()
against all of them. That is work Spark parallelises around the cluster, before your
job actually gets scheduled.

Turning that off with spark.sql.parquet.mergeSchema = false should make that phase go away, but clearly something is still taking the time.

A call to jstack against the driver will show where it is; you'll probably have to start
from there.

I know if you are using EMR you are stuck using Amazon's own S3 clients; if you were on Apache's
own artifacts you could move up to Hadoop 2.8 and set the spark.hadoop.fs.s3a.experimental.input.fadvise=random
option for high-speed random access. You can also turn off job summary creation in Spark.
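A sketch of those two settings together, assuming an Apache Hadoop 2.8+ build with the s3a connector (so not applicable to EMR's own client):

```scala
// Sketch, s3a connector only (no effect on EMR's EMRFS):
// - random-access fadvise for columnar seek-heavy reads (Hadoop 2.8+)
// - no _metadata / _common_metadata summary files written at job commit
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .getOrCreate()
```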
