Note that you may have minSplits set to more than the number of cores in the cluster, and Spark will just run as many as possible at a time. This is better if certain nodes may be slow, for instance.

In general, it is not necessarily the case that doubling the number of cores doing IO will double the throughput, because you could be saturating the throughput with fewer cores. However, S3 is odd in that each connection gets way less bandwidth than your network link can provide, and it does seem to scale linearly with the number of connections. So, yes, taking minSplits up to 4 (or higher) will likely result in a 2x performance improvement.

saveAsTextFile() will use as many partitions (aka splits) as the RDD it's being called on. So for instance:

sc.textFile(myInputFile, 15).map(lambda x: x + "!!!").saveAsTextFile(myOutputFile)

will use 15 partitions to read the text file (i.e., up to 15 cores at a time) and then again to save back to S3.



On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:
So setting minSplits will set the parallelism on the read in SparkContext.textFile(), assuming I have the cores in the cluster to deliver that level of parallelism. And if I don't explicitly provide it, Spark will set the minSplits to 2.

So for example, say I have a cluster with 4 cores total, and it takes 40 minutes to read a single file from S3 with minSplits at 2. Tt should take roughly 20 minutes to read the same file if I up minSplits to 4.

Did I understand that correctly?

RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm guessing that's not an operation the user can tune.


On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
Spark will only use each core for one task at a time, so doing

sc.textFile(<s3 location>, <num reducers>) 

where you set "num reducers" to at least as many as the total number of cores in your cluster, is about as fast you can get out of the box. Same goes for saveAsTextFile.


On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:
Howdy-doody,

I have a single, very large file sitting in S3 that I want to read in with sc.textFile(). What are the best practices for reading in this file as quickly as possible? How do I parallelize the read as much as possible?

Similarly, say I have a single, very large RDD sitting in memory that I want to write out to S3 with RDD.saveAsTextFile(). What are the best practices for writing this file out as quickly as possible?

Nick