spark-user mailing list archives

From Haiyang Fu <haiyangfu...@gmail.com>
Subject Re: Spark partition
Date Thu, 31 Jul 2014 06:29:33 GMT
Hi,
You may refer to
http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
and
http://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections
both of which cover RDD partitioning. Since you are going to load your data
from HDFS, you may also want to read
http://spark.apache.org/docs/latest/programming-guide.html#external-datasets
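
To make those pointers concrete, here is a minimal Scala sketch of the three
ideas (parallelize with numSlices, textFile with minPartitions, and
repartition). The HDFS path and the partition counts are illustrative
assumptions, sized for a cluster like yours with 3 workers x 4 cores = 12
cores, where the tuning guide suggests 2-4 partitions per CPU core:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sketch"))

    // Parallelized collection: numSlices sets the initial partition count.
    val nums = sc.parallelize(1 to 1000000, numSlices = 24)

    // External dataset: textFile accepts a *minimum* number of partitions;
    // HDFS block layout may produce more. The path is a placeholder.
    val lines = sc.textFile("hdfs:///user/data/input.tsv", minPartitions = 24)

    // A shallow pipeline: split each TSV row into fields.
    val rows = lines.map(_.split("\t"))

    // repartition() always performs a full shuffle over the network, so use
    // it when partitions are too few (idle cores) or skewed, e.g. after a
    // selective filter; otherwise the shuffle cost is pure overhead.
    val rebalanced = rows.repartition(48)

    println(s"partitions: ${rebalanced.partitions.length}")
    sc.stop()
  }
}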


On Thu, Jul 31, 2014 at 1:07 PM, Sameer Tilak <sstilak@live.com> wrote:

> Hi All,
>
> From the documentation, RDDs are already partitioned and distributed.
> However, there is a way to repartition a given RDD using the following
> function. Can someone please point out the best practices for using it? I
> have a 10 GB TSV file stored in HDFS, and I have a 4 node cluster with 1
> master and 3 workers. Each worker has 15 GB memory and 4 cores. My
> processing pipeline is not very deep as of now. Can someone please tell me
> when repartitioning is recommended? When the documentation says "balance",
> does it refer to memory usage, compute load, or I/O?
>
> *repartition*(*numPartitions*): Reshuffle the data in the RDD randomly to
> create either more or fewer partitions and balance it across them. This
> always shuffles all data over the network.
