spark-user mailing list archives

From Aaron Davidson <>
Subject Re: performance of saveAsTextFile moving files from _temporary
Date Wed, 28 Jan 2015 09:28:28 GMT
So when the 2-hour part of the run completed, the files did not yet exist
in the output directory? One step that is done serially is deleting any
remaining files from _temporary, so perhaps a lot of data was still sitting
in _temporary even though the committed data had already been moved.
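
One way to check this is to compare what has already landed in the output
directory with what is still under _temporary. The following is a minimal
sketch, assuming the output location is visible as a local directory (for
example through a mount). For a gs:// path the same counts would have to
come from the Hadoop FileSystem API or a storage command-line tool instead.
The function name summarize_output is hypothetical; the part-* pattern is
the usual naming of saveAsTextFile output files.

```python
from pathlib import Path


def summarize_output(output_dir):
    """Count committed part files versus files still under _temporary.

    Assumption: output_dir is reachable as a local filesystem path.
    """
    root = Path(output_dir)

    # Committed output: part-* files directly in the output directory.
    committed = [p for p in root.glob("part-*") if p.is_file()]

    # Uncommitted or leftover data: any file anywhere below _temporary.
    temp_root = root / "_temporary"
    leftover = []
    if temp_root.is_dir():
        leftover = [p for p in temp_root.rglob("*") if p.is_file()]

    return {
        "committed_files": len(committed),
        "committed_bytes": sum(p.stat().st_size for p in committed),
        "temporary_files": len(leftover),
        "temporary_bytes": sum(p.stat().st_size for p in leftover),
    }
```

If committed_files is already close to the partition count while
temporary_bytes is still large, the remaining time is probably being spent
on cleanup rather than on moving committed output.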

I am, unfortunately, not aware of other issues that would cause this to be
so slow.
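
For a sense of scale, the figures quoted below in this thread (about 1/3 of
~9k files moved in 1.5 hours) work out to roughly 1.8 seconds per file if
the files are handled one at a time, or about 4.5 hours for the full set,
which is in line with the roughly four hours reported. A minimal sketch of
that arithmetic, assuming a constant, strictly serial rate (the function
name estimate_serial_move is hypothetical):

```python
def estimate_serial_move(total_files, files_moved, hours_elapsed):
    """Extrapolate a serial move from the progress observed so far.

    Assumption: every file takes the same time and files are moved
    one after another.
    """
    seconds_per_file = hours_elapsed * 3600 / files_moved
    total_hours = seconds_per_file * total_files / 3600
    return seconds_per_file, total_hours


# Figures from the thread: ~9k files, about a third moved in 1.5 hours.
seconds_per_file, total_hours = estimate_serial_move(9000, 3000, 1.5)
print(f"{seconds_per_file:.1f} s per file, {total_hours:.1f} h in total")
```

A per-file cost on the order of seconds is what one would expect if each
move is a separate round trip to the object store rather than a local
metadata update, so the total grows linearly with the number of partitions.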

On Tue, Jan 27, 2015 at 6:54 PM, Josh Walton <> wrote:

> I'm not sure how to confirm how the moving is happening. However, one of
> the jobs I was talking about, with 9k files of 4 MB each, just completed.
> The Spark UI showed the job as complete after ~2 hours. The last four hours
> of the run were spent just moving the files from _temporary to their final
> destination. The tasks for the write were definitely shown as complete, and
> no logging is happening on the master or workers. The last line of my Java
> code logs, but the job sits there while the files are moved.
> On Tue, Jan 27, 2015 at 7:24 PM, Aaron Davidson <>
> wrote:
>> This renaming from _temporary to the final location is actually done by
>> executors, in parallel, for saveAsTextFile. It should be performed by each
>> task individually before it returns.
>> I have seen an issue similar to what you mention dealing with Hive code
>> which did the renaming serially on the driver, which is very slow for S3
>> (and possibly Google Storage as well), as it actually copies the data
>> rather than doing a metadata-only operation during rename. However, this
>> should not be an issue in this case.
>> Could you confirm how the moving is happening -- i.e., on the executors
>> or the driver?
>> On Tue, Jan 27, 2015 at 4:31 PM, jwalton <> wrote:
>>> We are running Spark in Google Compute Engine using their One-Click
>>> Deploy. By doing so, we get their Google Cloud Storage connector for
>>> Hadoop for free, meaning we can specify gs:// paths for input and output.
>>> We have jobs that take a couple of hours and end up with ~9k partitions,
>>> which means 9k output files. After the job is "complete" it then moves
>>> the output files from our $output_path/_temporary to $output_path. That
>>> process can take longer than the job itself, depending on the
>>> circumstances. The job I mentioned previously outputs ~4 MB files, and so
>>> far has copied 1/3 of the files in 1.5 hours from _temporary to the final
>>> destination.
>>> Is there a solution to this besides reducing the number of partitions?
>>> Has anyone else run into similar issues elsewhere? I don't remember this
>>> being an issue with MapReduce jobs and Hadoop; however, I probably wasn't
>>> tracking the transfer of the output files like I am with Spark.
