spark-user mailing list archives

From "Muttineni, Vinay" <>
Subject Optimal Partition Strategy
Date Thu, 25 Sep 2014 17:37:27 GMT
A bit of background.
I have a dataset of about 200 million records and around 10 columns. It is roughly 1.5 TB
in size, split across around 600 files.
When I read this dataset with sparkContext, it creates around 3000 partitions by default
if I do not specify the number of partitions in the textFile() command.
Now I see that even though my Spark application has around 400 executors assigned to it, the
data is spread across only about 200 of them. I am using the .cache() method to hold my data in memory.
Each of these 200 executors has a total of 6 GB of available memory, holds multiple blocks,
and thus uses up its entire memory caching the data.
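For context, textFile() derives its default partition count from the input splits (HDFS blocks), which is where the ~3000 comes from. A minimal sketch of overriding that at read time (the path and numbers are hypothetical, not from the original message):

```scala
// Ask for an explicit minimum partition count instead of relying on the
// HDFS-block-derived default. Spark may still create more partitions than
// this, but never fewer. Path and count are illustrative only.
val raw = sc.textFile("hdfs:///data/records", minPartitions = 3000)
println(raw.getNumPartitions)  // inspect how many partitions were actually created
```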

Even though I have about 400 machines, only about 200 of them are actually being used.
Now, my question is:

How do I partition my data so that all 400 executors hold some chunk of it, thus
parallelizing my work better?
In other words, instead of about 200 machines holding about 6 GB of data each, I would like
400 machines holding about 3 GB each.

Any idea on how I can set about achieving the above?
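One approach that matches what is being asked for (a sketch under assumed names and numbers, not a definitive answer) is to repartition the RDD before caching, forcing a shuffle that redistributes blocks across all executors:

```scala
// Repartition before caching so cached blocks spread across all 400
// executors rather than only the ones that held the original input splits.
// A common rule of thumb is 2-4 partitions per available core; the
// path and multiplier here are illustrative.
val data = sc.textFile("hdfs:///data/records")
val spread = data.repartition(400 * 2)  // full shuffle into 800 partitions
spread.cache()
spread.count()  // force an action so the cache is materialized
```

The trade-off is that repartition() performs a full shuffle of the 1.5 TB dataset, which is expensive up front but pays off if the cached data is reused across many subsequent actions.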
