spark-user mailing list archives

From Steve Loughran <>
Subject Re: Bad Digest error while doing aws s3 put
Date Tue, 09 Feb 2016 11:13:27 GMT

> On 9 Feb 2016, at 07:19, lmk <> wrote:
> Hi Dhimant,
> As I indicated in my next mail, my problem was due to the disk getting full
> with log messages (these were dumped onto the slaves) and had nothing
> to do with the content pushed into s3. So it looks like this error
> message is very generic and is thrown for various reasons. You may
> have to do some more research to find out the cause of your problem.
> Please keep me posted once you fix this issue. Sorry, I could not be of much
> help to you.
> Regards

that's fun.

s3n/s3a buffer their output locally until close() is called, and only then upload the whole file.

This breaks every assumption people have about file IO, especially the assumption baked into
code everywhere that close() is fast and harmless: now it's O(data), and bad news if it fails.
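To make the failure mode concrete, here is a minimal sketch (not the real Hadoop code; class and attribute names are made up for illustration) of that write path: write() only appends to a local buffer, and the entire upload is deferred to close():

```python
import io

class BufferingS3OutputStream(io.RawIOBase):
    """Illustrative sketch of the s3n/s3a write path: write() buffers
    locally, and the whole upload happens inside close()."""

    def __init__(self):
        self._buffer = bytearray()
        self.uploaded_bytes = None  # set only when the "upload" runs

    def write(self, data):
        # cheap: just local buffering, nothing goes over the network
        self._buffer.extend(data)
        return len(data)

    def close(self):
        # the full object is pushed to the store here, so close() is
        # O(data) -- and this is also where a full local disk would
        # surface as a failure
        self.uploaded_bytes = len(self._buffer)
        super().close()

stream = BufferingS3OutputStream()
stream.write(b"x" * 1024)               # returns immediately, nothing sent
assert stream.uploaded_bytes is None    # no upload has happened yet
stream.close()                          # the actual upload happens now
assert stream.uploaded_bytes == 1024
```

The point of the sketch: any code that treats close() as a cheap, infallible cleanup step is wrong against this filesystem client.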

If your close() was failing due to lack of HDD space, it means that your tmp dir and log dir
were on the same disk/volume, and that volume ran out of capacity.

HADOOP-11183 added an output variant which buffers in memory, primarily for faster output
to rack-local storage supporting the s3 protocol. This is in ASF Hadoop 2.7, recent HDP and
CDH releases. 
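If you are on a release with that patch, the memory-buffered variant can be switched on in core-site.xml (property name as it appears in Hadoop 2.7; check your distribution's documentation before relying on it):

```xml
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
```

Note the trade-off: buffering in memory instead of on disk avoids the full-disk failure above, but large uploads now consume heap instead.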

I don't know if it's in Amazon EMR, because they have their own closed-source S3 client (believed
to be a modified ASF one with some special hooks into unstable S3 APIs).

Anyway: I would run, not walk, to using s3a on Hadoop 2.7+, as it's already better than s3n
and getting better with every release.

