spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: Best practices: Parallelized write to / read from S3
Date Mon, 31 Mar 2014 16:29:43 GMT
Spark will only use each core for one task at a time, so doing

sc.textFile(<s3 location>, <num partitions>)

where you set "num partitions" (the second argument, a minimum number of
input partitions) to at least the total number of cores in your cluster,
is about as fast as you can get out of the box. The same goes for
saveAsTextFile, which writes one output file per partition of the RDD.
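
As a rough sketch of the advice above in PySpark: the helper names
partitions_for and copy_text, the tasks_per_core knob, and the src/dst paths
are all hypothetical, and sc is assumed to be an already-created SparkContext.
The second argument to sc.textFile is treated as a minimum partition count, so
the sketch repartitions before writing if the read came back with fewer
partitions than requested.

```python
def partitions_for(total_cores, tasks_per_core=1):
    """Hypothetical helper: return at least one partition per core."""
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be >= 1")
    return total_cores * tasks_per_core


def copy_text(sc, src, dst, total_cores):
    """Read src and write it to dst with at least one partition per core.

    sc is an existing SparkContext (assumed, not created here);
    src and dst are placeholder locations such as S3 paths.
    Returns the number of partitions that were written.
    """
    n = partitions_for(total_cores)

    # The second positional argument is a minimum number of partitions.
    rdd = sc.textFile(src, n)

    # saveAsTextFile writes one part file per partition, so the write is
    # only as parallel as the RDD's partition count.
    if rdd.getNumPartitions() < n:
        rdd = rdd.repartition(n)

    rdd.saveAsTextFile(dst)
    return rdd.getNumPartitions()
```

For example, on a cluster with 32 cores in total, copy_text(sc, src, dst, 32)
would ask for at least 32 input partitions and write at least 32 part files.
Note that repartition triggers a shuffle, so it only pays off when the RDD
really has fewer partitions than cores.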


On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:

> Howdy-doody,
>
> I have a single, very large file sitting in S3 that I want to read in with
> sc.textFile(). What are the best practices for reading in this file as
> quickly as possible? How do I parallelize the read as much as possible?
>
> Similarly, say I have a single, very large RDD sitting in memory that I
> want to write out to S3 with RDD.saveAsTextFile(). What are the best
> practices for writing this file out as quickly as possible?
>
> Nick
