spark-dev mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Output Committers for S3
Date Tue, 21 Feb 2017 14:32:54 GMT

On 21 Feb 2017, at 14:15, Steve Loughran <stevel@hortonworks.com> wrote:

What your patch has made me realise is that I could also do a delayed-commit copy by reading
in a file, doing a multipart put to its final destination, and again postponing the final
commit. This is something which tasks could do in their commit rather than a normal COPY+DELETE
rename, passing the final pending commit information to the job committer. This'd make the
rename() slower, as it will read and write the data again rather than get the 6-10 MB/s of
in-S3 copies, but as these happen in task commit rather than in job commit, they slow down the
overall job less. That could be used for the absolute path commit phase.
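
A minimal sketch of that delayed-commit idea in Python, not the actual committer code. It assumes `s3` behaves like a boto3 S3 client (`create_multipart_upload`, `upload_part`, `complete_multipart_upload`, `abort_multipart_upload`); the function names `task_commit` and `job_commit` and the shape of the pending-commit record are made up for illustration.

```python
def task_commit(s3, bucket, dest_key, chunks):
    """Upload `chunks` as a multipart put to the final key, but do NOT complete it.

    `s3` is assumed to behave like a boto3 S3 client. S3 requires every part
    except the last to be at least 5 MiB, so real chunks must respect that.
    Returns a pending-commit record (hypothetical shape) for the job committer.
    """
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)["UploadId"]
    parts = []
    try:
        for number, chunk in enumerate(chunks, start=1):
            etag = s3.upload_part(
                Bucket=bucket, Key=dest_key, PartNumber=number,
                UploadId=upload_id, Body=chunk,
            )["ETag"]
            parts.append({"ETag": etag, "PartNumber": number})
    except Exception:
        # A failed task should not leave a dangling pending upload behind.
        s3.abort_multipart_upload(Bucket=bucket, Key=dest_key, UploadId=upload_id)
        raise
    return {"bucket": bucket, "key": dest_key, "upload_id": upload_id, "parts": parts}


def job_commit(s3, pending_commits):
    """Materialize every pending upload; only now do the objects become visible."""
    for p in pending_commits:
        s3.complete_multipart_upload(
            Bucket=p["bucket"], Key=p["key"], UploadId=p["upload_id"],
            MultipartUpload={"Parts": p["parts"]},
        )
```

The expensive data movement happens in `task_commit`, which runs in parallel across tasks; `job_commit` only issues one small completion request per file.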


Though as you can specify a copy-range in a multipart put, you could do parallelized
copies of parts of a file in the S3 filestore itself and leave the result pending, reducing
copy time in seconds to ~ filesize / (parts * 6e6), the same as you get from a parallel copy
in S3 today. That is: same time as a rename, merely not visible until the job committer chooses
to materialize the object.
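
To put numbers on that estimate, here is a rough Python sketch. The helper names are hypothetical, the 6e6 bytes/s default is just the low end of the 6-10 MB/s figure above, and the model assumes all parts copy fully in parallel. The range strings use the `bytes=first-last` form that S3 part copies accept (for example boto3's `upload_part_copy(..., CopySourceRange=...)`).

```python
def plan_copy_ranges(file_size, part_size):
    """Split [0, file_size) into inclusive (first, last) byte ranges, one per part."""
    if file_size <= 0 or part_size <= 0:
        raise ValueError("file_size and part_size must be positive")
    ranges = []
    start = 0
    while start < file_size:
        last = min(start + part_size, file_size) - 1
        ranges.append((first_last := (start, last)))
        start = last + 1
    return ranges


def copy_source_range(first, last):
    """Format an inclusive byte range as a copy-range string."""
    return "bytes=%d-%d" % (first, last)


def estimated_copy_seconds(file_size, parts, bytes_per_second=6e6):
    """~ filesize / (parts * 6e6): assumes every part copies in parallel."""
    return file_size / (parts * bytes_per_second)


if __name__ == "__main__":
    size = 6_000_000_000  # a 6 GB file, purely illustrative
    ranges = plan_copy_ranges(size, 600_000_000)
    print(len(ranges), "parts, first range:", copy_source_range(*ranges[0]))
    print("estimated seconds:", estimated_copy_seconds(size, len(ranges)))
```

Under these assumptions a single-stream copy of that file would take around 1000 s, while ten parallel part copies bring the estimate down to around 100 s.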
