Hi,
You may refer to http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
and http://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections, both of which are about RDD partitions. Since you are going to load data from HDFS, you may also want to read http://spark.apache.org/docs/latest/programming-guide.html#external-datasets.
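For example, here is a minimal sketch in Scala of the two approaches those pages describe (the HDFS path and partition counts below are placeholders, just for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionDemo"))

    // External dataset: textFile takes an optional minimum number of
    // partitions; the default is one partition per HDFS block, so a
    // 10 GB file with 64 MB blocks would start with ~160 partitions.
    val lines = sc.textFile("hdfs:///data/input.tsv", 24)
    println("textFile partitions: " + lines.partitions.length)

    // Parallelized collection: the second argument is the number of slices.
    val nums = sc.parallelize(1 to 1000000, 12)
    println("parallelize partitions: " + nums.partitions.length)

    sc.stop()
  }
}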


On Thu, Jul 31, 2014 at 1:07 PM, Sameer Tilak <sstilak@live.com> wrote:
Hi All,

From the documentation, RDDs are already partitioned and distributed. However, there is a way to repartition a given RDD using the following function. Can someone please point out the best practices for using it? I have a 10 GB TSV file stored in HDFS, and I have a 4-node cluster with 1 master and 3 workers. Each worker has 15 GB of memory and 4 cores. My processing pipeline is not very deep as of now. Can someone please tell me when repartitioning is recommended? When the documentation says "balance", does it refer to memory usage, compute load, or I/O?

repartition(numPartitions): Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
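To make the trade-off concrete, a minimal sketch (the path and partition counts are illustrative, and an existing SparkContext is assumed):

// Assuming an existing SparkContext `sc`.
val tsv = sc.textFile("hdfs:///data/input.tsv")
println("before: " + tsv.partitions.length + " partitions")

// The tuning guide suggests 2-3 tasks per CPU core; with 3 workers x 4
// cores that is roughly 24-36 partitions, so repartition when the input
// gives you far fewer (idle cores) or badly skewed partition sizes.
// Note this triggers a full shuffle over the network.
val balanced = tsv.repartition(24)
println("after: " + balanced.partitions.length + " partitions")

// To merely reduce the partition count, coalesce avoids a full shuffle.
val fewer = balanced.coalesce(6)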