spark-user mailing list archives

From Adrian Bridgett <adr...@opensignal.com>
Subject coalesce serialising earlier work
Date Tue, 09 Aug 2016 07:11:13 GMT
In short:  df.coalesce(1).write seems to make all of the earlier 
calculations on the dataframe run in a single task (rather than running 
them across multiple tasks and then sending only the final dataframe 
through a single worker).  Any idea how we can force the job to run in 
parallel?

In more detail:

We have a job whose output we want to write as a single CSV file.  We 
have two approaches (code below):

df = ....(filtering, calculations)
df.coalesce(num).write \
    .format("com.databricks.spark.csv") \
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .save(output_path)

Option A: (num=100)
- dataframe calculated in parallel
- upload in parallel
- gzip in parallel
- but we then have to run "hdfs dfs -getmerge" to download all of the 
data and then write it back out again (roughly the steps sketched below).
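
For reference, the merge step that Option A forces on us is roughly the 
following (paths are placeholders, and driving it from Python is just 
for illustration):

import subprocess

output_path = "hdfs:///tmp/report_parts"   # placeholder for our real output dir
merged_path = "hdfs:///tmp/report.csv.gz"  # placeholder for the final single file

# Pull the 100 gzipped part files down and concatenate them locally
# (gunzip reads concatenated gzip files as a single stream), then push
# the merged file back up to HDFS.
subprocess.check_call(["hdfs", "dfs", "-getmerge", output_path, "report.csv.gz"])
subprocess.check_call(["hdfs", "dfs", "-put", "report.csv.gz", merged_path])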

Option B: (num=1)
- single gzip (but gzip is pretty quick)
- uploads go through a single machine
- no HDFS commands
- dataframe is _not_ calculated in parallel (we can see the filters 
running as just a single task)
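
As a rough illustration of that last point, the partition count 
collapses to one as soon as coalesce(1) is applied (numbers here are 
illustrative):

# Many partitions after the filtering/calculations...
print(df.rdd.getNumPartitions())              # e.g. 100+
# ...but a single partition once coalesce(1) is applied, which lines up
# with the single task we see for the filters.
print(df.coalesce(1).rdd.getNumPartitions())  # 1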

What I'm not sure about is why Spark (1.6.1) decides to run just a single 
task for the calculation, and what we can do about it.  We really want 
the df to be calculated in parallel and only _then_ coalesced before 
being written.  (It may be that the -getmerge approach will still be 
faster.)

df.coalesce(100).coalesce(1).write.....  doesn't look very likely to help!
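
What we may try instead is swapping the coalesce(1) for repartition(1): 
repartition always inserts a shuffle, so in theory the filtering and 
calculations keep their original parallelism and only the single 
post-shuffle partition is written out by one task.  Untested on our 
side, so just a sketch:

# Sketch (untested): repartition(1) forces a shuffle, so the upstream
# stage should still run across many tasks; only the one post-shuffle
# partition is gzipped and uploaded by a single task.
df.repartition(1).write \
    .format("com.databricks.spark.csv") \
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .save(output_path)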

Adrian

-- 
*Adrian Bridgett*
