spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: How to read a multipart s3 file?
Date Wed, 07 May 2014 20:44:33 GMT
One way to ensure Spark writes more partitions is to call
RDD#repartition(), which makes each partition smaller. When an RDD is
saved, each Spark partition is written as one file in the underlying store,
and it's usually a good idea to keep each partition somewhere between 64 MB
and 256 MB. Too few partitions lead to other problems, such as too little
concurrency -- Spark can only run as many tasks as there are partitions, so
if you don't have enough partitions, your cluster will be underutilized.
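
A rough sketch of how one might pick a partition count from that size range,
in Python. The helper names (partitions_for, save_with_target_size), the
128 MB default, and the idea of raising the count to at least the number of
cores are my own assumptions for illustration, not anything Spark provides;
only repartition() and saveAsTextFile() are actual RDD methods. You also have
to supply the total data size yourself, since an RDD does not report it.

```python
MB = 1024 * 1024


def partitions_for(total_bytes, target_partition_bytes=128 * MB, min_partitions=1):
    """Return a partition count so each partition holds about
    target_partition_bytes, but never fewer than min_partitions."""
    if total_bytes < 0 or target_partition_bytes <= 0 or min_partitions < 1:
        raise ValueError("sizes must be non-negative and limits positive")
    # Integer ceiling division: avoids float rounding on large byte counts.
    by_size = -(-total_bytes // target_partition_bytes)
    return max(by_size, min_partitions)


def save_with_target_size(rdd, total_bytes, path, total_cores=1):
    """Repartition `rdd` and save it, writing one output file per partition.

    `total_bytes` is the caller's estimate of the data size (assumption).
    `total_cores` is used as a floor so there is at least one task per core.
    """
    n = partitions_for(total_bytes, min_partitions=total_cores)
    rdd.repartition(n).saveAsTextFile(path)
    return n
```

For example, about 10 GB of data at the 128 MB default comes out to 80
partitions, and so 80 output files. Note that using the core count as a floor
can push partitions below 64 MB on small inputs; whether that trade is worth
it depends on the job.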


On Tue, May 6, 2014 at 7:07 PM, kamatsuoka <kenjim@gmail.com> wrote:

> Yes, I'm using s3:// for both. I was using s3n:// but I got frustrated by
> how slow it is at writing files. In particular, the phase where it moves
> the temporary files to their permanent location takes as long as writing
> the file itself.  I can't believe anyone uses this.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463p5470.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
