spark-user mailing list archives

From Steve Loughran <>
Subject Re: S3A + EMR failure when writing Parquet?
Date Mon, 05 Sep 2016 16:41:49 GMT

On 4 Sep 2016, at 18:05, Everett Anderson <> wrote:

My impression from reading your various other replies on S3A is that it's also best to use
mapreduce.fileoutputcommitter.algorithm.version=2 (which might someday be the default)

For now, yes. There's work under way by various people to implement consistency and cache
performance: S3Guard. That'll need to come with a new commit algorithm which works with it
and with other object stores with similar semantics (Azure WASB). I want an O(1) commit
there, with a very small constant factor.
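A minimal sketch of how the v2 committer setting mentioned above can be passed to a Spark job (the property name is the standard Hadoop one; the app name and the `spark.hadoop.` prefix for forwarding Hadoop properties are standard Spark usage, but the rest of the job is a placeholder, not code from this thread):

```python
# Sketch: configure the v2 FileOutputCommitter algorithm in a PySpark job.
# Requires a Spark installation; property names are the standard ones.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-write-example")  # placeholder name
    # Forward the Hadoop property to the job's Hadoop configuration:
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

# Verify the setting landed in the Hadoop configuration.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
print(hconf.get("mapreduce.fileoutputcommitter.algorithm.version"))  # "2"
```

The v2 algorithm commits task output directly into the destination directory, avoiding the O(files) rename in the job-commit phase that is especially slow against object stores.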

presumably if your data fits well in memory, use […]. Is that right?

as of last week: no.

Having written a test that uploads multi-GB files generated at the speed of memory copies, I
think that breaks at scale: if you are generating data faster than it can be uploaded, you
will OOM.

For small datasets running in-EC2 on large instances, or for installations where you have a
local object store supporting the S3 API, you should get away with it. For bulk uploads over
long-haul networks:
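A back-of-envelope model of the OOM risk described above (my own illustration, not code from the thread; the rates are made-up examples):

```python
# If data is generated faster than it can be uploaded, the in-memory
# backlog grows linearly with time until the heap is exhausted.

def backlog_mb(gen_mb_s: float, upload_mb_s: float, seconds: float) -> float:
    """Megabytes buffered in memory after `seconds` of sustained writing."""
    return max(0.0, (gen_mb_s - upload_mb_s) * seconds)

# Generating at 500 MB/s over a 100 MB/s long-haul link: after just 60 s
# the writer is holding (500 - 100) * 60 = 24000 MB (~24 GB) of pending
# data, far beyond a typical executor heap.
print(backlog_mb(500, 100, 60))   # 24000.0

# In-EC2 with a fast path to the store (upload keeps up), nothing accumulates:
print(backlog_mb(100, 200, 60))   # 0.0
```

This is why in-memory buffering is fine when the network can keep up, and dangerous on slow long-haul links.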

Keep an eye on:
