spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Ash <>
Subject Re: Optimal Partition Strategy
Date Fri, 26 Sep 2014 03:23:10 GMT
Hi Vinay,

What I'm guessing is happening is that Spark is taking the locality of
files into account and you don't have node-local data on all your
machines.  This might be the case if you're reading out of HDFS and your
600 files are somehow skewed to only be on about 200 of your 400 machines.
A possible cause could be that your cluster doesn't have both Spark and an
HDFS data node on ever server.  It could also happen if you're loading data
into HDFS from just those 200 machines -- the HDFS block placement strategy
is to store one block locally and then further replicas of that block
elsewhere on the cluster.

If you want to force the data to be evenly distributed across your
executors you can run .repartition() right after the textFile() call.  This
will shuffle data across the network in a hash partitioned way before
continuing with the rest of your computation.

Hope that helps!

On Thu, Sep 25, 2014 at 10:37 AM, Muttineni, Vinay <>

>  Hello,
> A bit of a background.
> I have a dataset with about 200 million records and around 10 columns. The
> size of this dataset is around 1.5Tb and is split into around 600 files.
> When I read this dataset, using sparkContext, by default it creates around
> 3000 partitions if I do not specify the number of partitions in the
> textFile() command.
> Now I see that even though my spark application has around 400 executors
> assigned to it, the data is spread out only to about 200 of them. I am
> using .cache() method to hold my data in-memory.
> Each of these 200 executors, each with a total available memory of 6Gb,
> are now having multiple blocks and are thus using up their entire memory by
> caching the data.
> Even though I have about 400 machines, only about 200 of them are actually
> being used.
> Now, my question is:
> How do I partition my data so all 400 of the executors have some chunks of
> the data, thus better parallelizing my work?
> So, instead of only about 200 machines having about 6Gb of data each, I
> would like to have 400 machines with about 3Gb data each.
> Any idea on how I can set about achieving the above?
> Thanks,
> Vinay

View raw message