spark-user mailing list archives

From Haiyang Fu <>
Subject Re: Spark partition
Date Thu, 31 Jul 2014 06:29:33 GMT
You may refer to these, both of which are about RDD partitioning. Since you
are going to load data from HDFS, you may also need to know how that affects
the initial partitioning.
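For context, when an RDD is created from an HDFS file, the initial number of partitions follows the HDFS input splits (roughly one per block). A minimal Scala sketch, where the path and the `minPartitions` value are illustrative assumptions, not values from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumes an existing Spark installation; the path is a placeholder.
val conf = new SparkConf().setAppName("partition-demo")
val sc   = new SparkContext(conf)

// textFile creates roughly one partition per HDFS input split
// (about one per block); minPartitions can only raise that floor,
// never lower it.
val lines = sc.textFile("hdfs:///path/to/data.tsv", minPartitions = 48)
println(lines.partitions.size)
```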

On Thu, Jul 31, 2014 at 1:07 PM, Sameer Tilak <> wrote:

> Hi All,
> Hi All,
> From the documentation, RDDs are already partitioned and distributed.
> However, there is a way to repartition a given RDD using the following
> function. Can someone please point out the best practices for using it? I
> have a 10 GB TSV file stored in HDFS and a 4-node cluster with 1 master and
> 3 workers. Each worker has 15 GB of memory and 4 cores. My processing
> pipeline is not very deep as of now. Can someone please tell me when
> repartitioning is recommended? When the documentation says "balance", does
> it refer to memory usage, compute load, or I/O?
> repartition(numPartitions): Reshuffle the data in the RDD randomly to
> create either more or fewer partitions and balance it across them. This
> always shuffles all data over the network.
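To make the trade-off concrete, here is a sketch using the sizing from the question (10 GB TSV, 3 workers with 4 cores each); the partition counts and the filter are illustrative assumptions, and the "2-4 tasks per core" figure is a common rule of thumb rather than a Spark default:

```scala
// With 3 workers x 4 cores there are 12 cores; 2-4 tasks per core
// suggests roughly 24-48 partitions for good parallelism.
val tsv = sc.textFile("hdfs:///path/to/data.tsv")

// Hypothetical narrowing step: after a selective filter, many of the
// original partitions may be nearly empty.
val filtered = tsv.filter(_.nonEmpty)

// coalesce shrinks the partition count WITHOUT a full shuffle, so it is
// the cheaper choice when you only want fewer partitions.
val compacted = filtered.coalesce(24)

// repartition always performs a full shuffle over the network; use it
// when you need MORE partitions, or an even rebalance of skewed data,
// and are willing to pay the shuffle cost.
val rebalanced = filtered.repartition(48)
```

In short: repartitioning is worth it when the current partition count is far from the 2-4x-cores range, or when data skew leaves some cores idle; otherwise the shuffle it forces is pure overhead.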
