spark-user mailing list archives

From nsareen <>
Subject input size too large | Performance issues with Spark
Date Sat, 28 Mar 2015 14:03:40 GMT
Hi All,

I'm facing performance issues with our Spark implementation. While briefly
investigating the Web UI, I noticed that my RDD size is 55 GB, the
Shuffle Write is 10 GB and the Input Size is 200 GB. The application is a web
application that does predictive analytics, so we keep most of our data in
memory. These figures come from only 30 minutes of use by a
single user. We anticipate at least 10-15 users of the application sending
requests in parallel, which makes me a bit nervous.

One constraint we have is that we cannot have many nodes in the cluster:
we may end up with 3-4 machines at best, but they can be scaled up
vertically, each having 24 cores / 512 GB RAM etc., which would allow us to
run a virtual 10-15 node cluster.
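
To put rough numbers on that split, here is a minimal sketch of the
arithmetic. The class and method names are hypothetical, the figures are
only the ones quoted above, and it assumes an even split with nothing
reserved for the OS or other overhead.

```java
public class ClusterSizing {
    // Memory (GB) per virtual node if all machines are split evenly.
    // Integer division: any fractional GB is dropped.
    public static long memoryPerNodeGb(int machines, int ramPerMachineGb, int virtualNodes) {
        return (long) machines * ramPerMachineGb / virtualNodes;
    }

    // Average number of cores per virtual node for the same even split.
    public static double coresPerNode(int machines, int coresPerMachine, int virtualNodes) {
        return (double) machines * coresPerMachine / virtualNodes;
    }

    public static void main(String[] args) {
        // Assumed case: 4 machines (24 cores / 512 GB each) as 15 virtual nodes.
        System.out.println(memoryPerNodeGb(4, 512, 15) + " GB per node");
        System.out.println(coresPerNode(4, 24, 15) + " cores per node");
    }
}
```

With 4 machines and 15 virtual nodes this works out to roughly 136 GB and
6.4 cores per node, before any overhead is subtracted.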

Even then, the input size and shuffle write are too high for my liking. Any
suggestions in this regard would be greatly appreciated, as there aren't many
resources on the net for handling performance issues such as these.

Some pointers on my application's data structures and design:

1) The RDD is a JavaPairRDD with Key and Value as custom POJOs, the Key
containing 3-4 HashMaps and the Value containing 1 HashMap.
2) Data is loaded via JdbcRDD during application startup, which also tends
to take a lot of time, since we massage the data once it is fetched from the
DB and then save it as a JavaPairRDD.
3) Most of the data is structured, but we are still using JavaPairRDD and
have not yet explored Spark SQL.
4) We have only one SparkContext which caters to all the requests coming
into the application from various users.
5) During a single user session, the user can trigger 3-4 parallel stages
consisting of Map / Group By / Join / Reduce etc.
6) We have to change the RDD structure using different types of group-by
operations, since the user can drill down or drill up through the data
(aggregation at a lower / higher level). This is where we make use of
group-bys, but there is a cost associated with them.
7) We have observed that the initial RDDs we create have 40-odd
partitions, but after some stage executions such as group-bys the partition
count increases to 200 or so. This was odd, and we haven't figured out why
this happens.
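
To illustrate the drill-up in point 6, here is a minimal sketch in plain
Java rather than Spark. The class name, the "country/city" key format and
the sample values are all hypothetical. Each fine-grained key is mapped to
its coarser key and values are merged as they arrive, so only one running
total per coarse key is held. In Spark the closest analogue would be
mapToPair followed by reduceByKey, which combines values on each partition
before the shuffle, whereas groupByKey ships every value across the network
first.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class DrillUpSketch {
    // Re-key each entry to a coarser level and sum values per coarse key.
    public static Map<String, Double> rollUp(Map<String, Double> fine,
                                             Function<String, String> toCoarseKey) {
        Map<String, Double> coarse = new HashMap<>();
        for (Map.Entry<String, Double> e : fine.entrySet()) {
            // merge() inserts the value or adds it to the existing total.
            coarse.merge(toCoarseKey.apply(e.getKey()), e.getValue(), Double::sum);
        }
        return coarse;
    }

    public static void main(String[] args) {
        // Hypothetical sample data keyed as "country/city".
        Map<String, Double> byCity = new HashMap<>();
        byCity.put("IN/Delhi", 10.0);
        byCity.put("IN/Mumbai", 5.0);
        byCity.put("US/Boston", 2.0);

        // Drill up from city level to country level.
        Map<String, Double> byCountry =
                rollUp(byCity, k -> k.substring(0, k.indexOf('/')));
        System.out.println(byCountry);
    }
}
```

Whether this helps in our case depends on the aggregation being expressible
as an associative merge such as a sum; where it is not, a group-by may still
be needed.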

In summary, we want to use Spark to process our in-memory data structures
very fast, and to scale to a larger volume when required in the future.
