spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Spark querying parquet data partitioned in S3
Date Wed, 05 Jul 2017 11:36:07 GMT

> On 29 Jun 2017, at 17:44, fran <francisco.blaya@hivehome.com> wrote:
> 
> We have got data stored in S3 partitioned by several columns. Let's say
> following this hierarchy:
> s3://bucket/data/column1=X/column2=Y/parquet-files
> 
> We run a Spark job in an EMR cluster (1 master, 3 slaves) and realised the
> following:
> 
> A) When we declare the initial dataframe to be the whole dataset (val df =
> sqlContext.read.parquet("s3://bucket/data/")), the driver splits the job
> into several tasks (259) that are performed by the executors, and we believe
> the driver gets back the parquet metadata.
> 
> Question: The above takes about 25 minutes for our dataset. We believe it
> should be a lazy query (as we are not performing any actions), however it
> looks like something is happening: all the executors are reading from S3. We
> have tried mergeSchema=false and setting the schema explicitly via
> .schema(someSchema). Is there any way to speed this up?
> 
> B) When we declare the initial dataframe to be scoped by the first column
> (val df = sqlContext.read.parquet("s3://bucket/data/column1=X")), it seems
> that all the work (getting the parquet metadata) is done by the driver and
> no job is submitted to Spark.
> 
> Question: Why does (A) send the work to executors but (B) does not?
> 
> The above is for EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.
> 
> 

Split calculation can be very slow against object stores, especially if the directory structure
is deep: the treewalking done there is pretty inefficient against an object store.
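
If you only need part of the tree, scoping the read down while declaring the partition root
avoids most of that walk. A minimal sketch in the style of your snippets; basePath is the
standard partition-discovery option and the paths are the ones from your layout:

  // read a single partition, but keep column1 as a column in the schema;
  // basePath tells partition discovery where the partitioned layout starts
  val df = sqlContext.read
    .option("basePath", "s3://bucket/data/")
    .parquet("s3://bucket/data/column1=X")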

Then there's the schema merge, which looks at the tail of every file and so has to do a seek()
against each of them. That work is parallelised around the cluster before your job actually
gets scheduled.

Turning that off with spark.sql.parquet.mergeSchema = false should make that phase go away,
but clearly it hasn't here.
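
For reference, a sketch of the fully turned-off version: merging disabled per read and the
schema supplied up front, so no file footers need reading at planning time. The fields here
are placeholders for whatever your data actually holds:

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("id", LongType),
    StructField("value", StringType)))

  val df = sqlContext.read
    .option("mergeSchema", "false")   // per-read override of spark.sql.parquet.mergeSchema
    .schema(schema)                   // skip schema inference entirely
    .parquet("s3://bucket/data/")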

A call to jstack against the driver will show where it is stuck; you'll probably have to start
from there.
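
Something like this, run on the node hosting the driver (jps and jstack ship with the JDK;
the exact process name depends on how the job was launched):

  jps -lm              # find the driver JVM, typically org.apache.spark.deploy.SparkSubmit
  jstack <driver-pid>  # thread dump: look for threads sitting in S3 listing or footer reads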


I know that if you are using EMR you are stuck with Amazon's own S3 clients; if you were on the
Apache artifacts you could move up to Hadoop 2.8 and set spark.hadoop.fs.s3a.experimental.fadvise=random
for high-speed random access. You can also turn off job summary creation in Spark.
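
As a sketch, assuming a build with the Apache Hadoop 2.8+ s3a client rather than EMR's, and
that "job summary creation" here means the parquet _metadata/_common_metadata summary files:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    // s3a random-access mode for columnar formats (Hadoop 2.8+ only)
    .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
    // skip writing parquet summary files at job commit
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .getOrCreate()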


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

