spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: S3A + EMR failure when writing Parquet?
Date Mon, 05 Sep 2016 16:41:49 GMT

On 4 Sep 2016, at 18:05, Everett Anderson <everett@nuna.com> wrote:

My impression from reading your various other replies on S3A is that it's also best to use
mapreduce.fileoutputcommitter.algorithm.version=2 (which might someday be the default:
https://issues.apache.org/jira/browse/MAPREDUCE-6336) and,

for now, yes; there's work under way by various people to implement consistency and cache
performance: S3Guard, https://issues.apache.org/jira/browse/HADOOP-13345. That will need to come
with a new commit algorithm which works with it and with other object stores that have similar
semantics (Azure WASB). I want an O(1) commit there, with a very small 1.
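
For anyone wanting to wire that up, a minimal sketch of routing the committer setting through
Spark's spark.hadoop.* pass-through; the property name is the one from MAPREDUCE-6336, while the
application name and output path are placeholders, not anything from this thread:

import org.apache.spark.sql.SparkSession

// Sketch: pass the Hadoop committer setting into the Hadoop configuration
// via Spark's spark.hadoop.* prefix. Property name is from MAPREDUCE-6336;
// the app name and paths below are purely illustrative.
val spark = SparkSession.builder()
  .appName("parquet-to-s3a-example")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// Illustrative write; s3a://my-bucket/out is a hypothetical destination.
spark.range(1000).toDF("id")
  .write
  .parquet("s3a://my-bucket/out")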

presumably if your data fits well in memory, use fs.s3a.fast.upload=true. Is that right?


as of last week: no.

Having written a test which uploads multi-GB files generated at the speed of memory copies, I
don't think that holds at scale: if you are generating data faster than it can be uploaded, you
will OOM.


For small datasets running in-EC2 on large instances, or for installations where you have a local
object store supporting the S3 API, you should get away with it. Bulk uploads over long-haul
networks: no.

Keep an eye on: https://issues.apache.org/jira/browse/HADOOP-13560
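
In the meantime, if you do experiment with the current fast-upload path on an in-EC2 instance,
here is a minimal sketch of the relevant settings, assuming the Hadoop 2.7-era S3A property names
(the values and the application name are purely illustrative, not recommendations):

import org.apache.spark.sql.SparkSession

// Sketch: the pre-HADOOP-13560 fast-upload path. It buffers blocks in
// memory, so only enable it where the data comfortably fits and uploads
// keep pace with generation. Values below are illustrative only.
val spark = SparkSession.builder()
  .appName("s3a-fast-upload-example")                        // placeholder
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .config("spark.hadoop.fs.s3a.multipart.size", "67108864")  // 64 MB parts
  .config("spark.hadoop.fs.s3a.threads.max", "16")           // upload thread pool
  .getOrCreate()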


