spark-user mailing list archives

From SK <>
Subject Spark performance optimization examples
Date Tue, 25 Nov 2014 02:32:17 GMT

Is there any document that provides guidelines, with examples, illustrating when
the different performance optimizations are useful? I am interested in guidelines
for using optimizations like cache(), persist(), repartition(), coalesce(), and
broadcast variables. I have studied the online programming guide, but I would like
more detail (something along the lines of Aaron Davidson's Spark Summit talk,
which illustrates the use of repartition() with an example).
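(For reference, here is a minimal Scala sketch of the operations I am asking about. The paths, partition counts, and data are placeholders, not from a real job; the point is only where each optimization applies.)

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object OptimizationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("opt-sketch"))

    // Hypothetical input path.
    val lines = sc.textFile("hdfs:///data/input")

    // cache(): keep an RDD in memory when it is reused across actions.
    val words = lines.flatMap(_.split("\\s+")).cache()
    val total    = words.count()            // first action materializes the cache
    val distinct = words.distinct().count() // reuses the cached RDD

    // persist(): like cache(), but with an explicit storage level;
    // MEMORY_AND_DISK spills partitions that do not fit in memory.
    val pairs = words.map(w => (w, 1L)).persist(StorageLevel.MEMORY_AND_DISK)

    // repartition(): increase parallelism (at the cost of a full shuffle)
    // when partitions are too few or too large for the cluster's cores.
    val counts = pairs.repartition(400).reduceByKey(_ + _)

    // coalesce(): decrease partitions without a full shuffle, e.g. before
    // writing a small result out.
    counts.coalesce(16).saveAsTextFile("hdfs:///data/output")

    // Broadcast variables: ship a read-only lookup table to every executor
    // once, instead of inside every task closure.
    val stopWords = sc.broadcast(Set("the", "a", "an"))
    val filtered  = words.filter(w => !stopWords.value.contains(w)).count()

    sc.stop()
  }
}
```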

In particular, I have a dataset of about 1.2 TB (roughly 30 files) that I am
trying to load using sc.textFile on a cluster with 3 TB of total memory
(170 GB per node), but I am not able to complete the load. The program stays
continuously active in the mapPartitions task and does not get past it, even
after a long time. I have tried some of the above optimizations, but that has
not helped, and I am not sure whether I am using them correctly or which of
them would be most appropriate for this problem. I would appreciate any
guidelines.
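(For context, a sketch of the kind of load I mean; the path, partition count, and storage level below are illustrative, not my exact settings.)

```scala
import org.apache.spark.storage.StorageLevel

// sc.textFile splits input by HDFS block, so ~30 large files can still yield
// many partitions; the minPartitions hint can raise parallelism further.
// Persisting with a serialized, disk-spilling level is one common way to
// avoid stalls when the deserialized data exceeds executor memory.
val data = sc.textFile("hdfs:///data/big", minPartitions = 4800)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
data.count() // action that forces the load
```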

