spark-user mailing list archives

From Vadim Semenov
Subject Re: Spark, S3A, and 503 SlowDown / rate limit issues
Date Wed, 05 Jul 2017 13:40:10 GMT
Are you sure that you're actually using S3A?
EMR says that they do not support it:
> Amazon EMR does not currently support use of the Apache Hadoop S3A file

I think the HEAD requests come from `createBucketIfNotExists` in the AWS S3
library, which checks whether the bucket exists every time you do a PUT
request, i.e. it issues a HEAD request for each PUT.

You can disable that check by setting `fs.s3.buckets.create.enabled` to `false`.
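
One possible way to pass that setting is per job on the command line. This is a minimal sketch, not a tested EMR recipe. It relies on Spark copying any property prefixed with `spark.hadoop.` into the job's Hadoop configuration. Whether EMR's S3 filesystem honors the value when set this way, rather than in the cluster-level configuration, is an assumption I have not verified. The helper `hadoop_conf_args` and the script name `my_job.py` are made up for illustration.

```python
import shlex


def hadoop_conf_args(props):
    """Turn Hadoop properties into spark-submit --conf arguments.

    Hypothetical helper: it only builds the argument list, it does not
    launch anything.
    """
    args = []
    for key, value in sorted(props.items()):
        # Spark copies "spark.hadoop.*" properties into the Hadoop Configuration.
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# Property name and value are the ones suggested in this message.
props = {"fs.s3.buckets.create.enabled": "false"}

# "my_job.py" is a placeholder for your own application.
cmd = ["spark-submit", *hadoop_conf_args(props), "my_job.py"]
print(shlex.join(cmd))
```

If the per-job setting turns out not to take effect, the same property name can be set in the cluster's Hadoop configuration instead.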

On Thu, Jun 29, 2017 at 4:56 PM, Everett Anderson wrote:

> Hi,
> We're using Spark 2.0.2 + Hadoop 2.7.3 on AWS EMR with S3A for direct I/O
> from/to S3 from our Spark jobs. We set
> mapreduce.fileoutputcommitter.algorithm.version=2 and are using encrypted
> S3 buckets.
> This has been working fine for us, but perhaps as we've been running more
> jobs in parallel, we've started getting errors like
> Status Code: 503, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error
> Code: SlowDown, AWS Error Message: Please reduce your request rate., S3
> Extended Request ID: ...
> We enabled CloudWatch S3 request metrics for one of our buckets and I was
> a little alarmed to see spikes of over 800k S3 requests over a minute or
> so, with the bulk of them HEAD requests.
> We read and write Parquet files, and most tables have around 50
> shards/parts, though some have up to 200. I imagine there's additional
> parallelism when reading a shard in Parquet, though.
> Has anyone else encountered this? How did you solve it?
> I'd sure prefer to avoid copying all our data in and out of HDFS for each
> job, if possible.
> Thanks!
