spark-user mailing list archives

From Andrew Ash <>
Subject Re: Preferred RDD Size
Date Mon, 12 May 2014 22:51:06 GMT
At the minimum, to get decent parallelization, you'd want to have some data
on every machine.  If you're reading from HDFS, the smallest you'd
want is one HDFS block per server in your cluster.

Note that Spark will work at smaller sizes, but in order to make use of all
your machines when your partition count is less than your node count, you'd
want to repartition to a higher partition count.
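That rule of thumb can be sketched as a quick calculation. This is a minimal illustration, not anything from Spark itself; the 128 MB block size and the cluster sizes below are assumptions for the example:

```python
import math

def min_partitions(data_bytes, block_bytes, num_nodes):
    """Lower bound on partition count for decent parallelism:
    at least one partition per HDFS block, and at least one per node."""
    blocks = math.ceil(data_bytes / block_bytes)
    return max(blocks, num_nodes)

HDFS_BLOCK = 128 * 1024 * 1024  # a common default; configurable per cluster

# A 256 MB dataset on a 10-node cluster is only 2 HDFS blocks, so
# repartition up to 10 to put some data on every machine.
print(min_partitions(256 * 1024 * 1024, HDFS_BLOCK, 10))  # -> 10

# A 10 GB dataset on the same cluster: the block count dominates.
print(min_partitions(10 * 1024**3, HDFS_BLOCK, 10))       # -> 80
```

In a Spark job you'd then apply the result with something like `rdd.repartition(target)`, or pass a minimum partition count when reading (e.g. the `minPartitions` argument to `sc.textFile`).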


On Wed, May 7, 2014 at 3:52 AM, Sai Prasanna <> wrote:

> Hi,
> Is there any lower bound on the size of an RDD to optimally utilize
> Spark's in-memory framework?
> Say creating an RDD for a very small data set of some 64 MB is not as
> efficient as one of some 256 MB; then the application could be tuned
> accordingly.
> So is there a soft lower bound related to the Hadoop block size, or
> something else?
> Thanks in advance!
