spark-user mailing list archives

From Aaron Davidson <>
Subject Re: How to read a multipart s3 file?
Date Wed, 07 May 2014 20:44:33 GMT
One way to ensure Spark writes more partitions is to use
RDD#repartition() to make each partition smaller. One Spark partition
always corresponds to one file in the underlying store, and it's usually a
good idea to keep each partition's size somewhere between 64 MB and 256
MB. Too few partitions leads to other problems, such as too little
concurrency -- Spark can only run as many tasks as there are partitions, so
if you don't have enough partitions, your cluster will be underutilized.
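To make that concrete, here is a minimal sketch of how you might pick a partition count from a known dataset size before writing to S3. The helper name `partitions_for` and the 128 MB default are my own choices (a midpoint of the 64-256 MB range above), not anything from Spark itself; only the final `repartition(n)` call is the actual RDD API.

```python
import math

def partitions_for(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Partition count that keeps each output file near the target size.

    128 MB is a hypothetical default, chosen as a midpoint of the
    64-256 MB range that works well for most stores.
    """
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# For a 10 GB dataset, aim for ~80 files of ~128 MB each:
n = partitions_for(10 * 1024 ** 3)
# n == 80

# Then, with an actual RDD (sketch -- requires a running SparkContext):
#   rdd.repartition(n).saveAsTextFile("s3://bucket/path")
```

Note that repartition() triggers a full shuffle; if you only want *fewer* partitions, coalesce() can avoid the shuffle.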

On Tue, May 6, 2014 at 7:07 PM, kamatsuoka <> wrote:

> Yes, I'm using s3:// for both. I was using s3n:// but I got frustrated by
> how slow it is at writing files. In particular, the phase where it moves
> the temporary files to their permanent location takes as long as writing
> the file itself. I can't believe anyone uses this.
