From: Neil Mayo <Neil.M...@velocityww.com>
Subject: Re: Spark driver using Spark Streaming shows increasing memory/CPU usage
Date: Wed, 01 Jul 2015 17:51:08 GMT
Hi Tathagata,

Thanks for your quick reply! I’ll add some more detail below about what I’m doing - I’ve
tried a lot of variations on the code to debug this, with monitoring enabled, but I didn’t
want to overwhelm the issue description to start with ;-)


On 30 Jun 2015, at 19:30, Tathagata Das <tdas@databricks.com> wrote:

Could you give more information on the operations that you are using? The code outline?

And what do you mean by "Spark Driver receiver events"? If the driver is receiving events,
how are they being sent to the executors?

The events are just objects that represent actions a user takes. They contain a user id, a
type and some other info, and get dumped into a MongoDB collection, then picked out by the
Receiver. This Receiver<BSONObject> runs a thread which periodically polls the db, processes
new events into DBObjects, and calls Receiver.store() to hand each one off to an Executor.
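
For concreteness, a rough sketch of that receiver (class, field and collection names here are
made up for illustration; the real code is in the attached outline):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;
import org.bson.BSONObject;

// Illustrative sketch only; names and polling details are invented, not the
// actual attached code.
public class MongoPollingReceiver extends Receiver<BSONObject> {

    private final String host;
    private final long pollIntervalMs;

    public MongoPollingReceiver(String host, long pollIntervalMs) {
        super(StorageLevel.MEMORY_AND_DISK_SER());
        this.host = host;
        this.pollIntervalMs = pollIntervalMs;
    }

    @Override
    public void onStart() {
        // Poll on a separate thread so onStart() returns promptly.
        new Thread(this::poll).start();
    }

    @Override
    public void onStop() {
        // The polling loop checks isStopped(), so nothing to do here.
    }

    private void poll() {
        try {
            MongoClient client = new MongoClient(host);
            DBCollection events = client.getDB("events").getCollection("userEvents");
            long lastSeen = 0L;
            while (!isStopped()) {
                // Fetch only events newer than the last one handed to Spark.
                DBObject query = new BasicDBObject("timestamp",
                        new BasicDBObject("$gt", lastSeen));
                DBCursor cursor = events.find(query).sort(new BasicDBObject("timestamp", 1));
                while (cursor.hasNext()) {
                    DBObject event = cursor.next();
                    lastSeen = ((Number) event.get("timestamp")).longValue();
                    store(event);  // hand the event off to an executor
                }
                cursor.close();
                Thread.sleep(pollIntervalMs);
            }
            client.close();
        } catch (Exception e) {
            // Let Spark tear down and restart the receiver on failure.
            restart("Error polling MongoDB", e);
        }
    }
}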


BTW, for memory usage, I strongly recommend using jmap -histo:live to see which types of
objects are causing most of the memory usage.

I’ve been running both jconsole and VisualVM to monitor the processes, and when memory usage
is high it is overwhelmingly due to byte arrays. I’ve read that sometimes performing operations
like sorting an RDD can lead to unreachable byte arrays (https://spark-project.atlassian.net/browse/SPARK-1001).
I’ve not come across any reports that quite match our use case though. The groupByKey step
seems to be a significant creator of byte arrays in my case.

I’ll attach an outline of the code I’m using - I’ve tried to reduce this to the essentials;
it won’t compile but should display ok in an IDE.
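
Roughly, the job has this shape (simplified here with made-up names, and using the receiver
sketched above; the attachment has the real thing):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.bson.BSONObject;
import scala.Tuple2;

// Simplified, hypothetical outline of the streaming job.
public class EventStreamOutline {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("user-events");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Events come in via the custom MongoDB-polling receiver.
        JavaReceiverInputDStream<BSONObject> events =
                ssc.receiverStream(new MongoPollingReceiver("localhost", 5000));

        // Key each event by user id, then group each batch's events per user.
        // This groupByKey is the step that appears to generate the byte arrays.
        JavaPairDStream<String, Iterable<BSONObject>> byUser = events
                .mapToPair(e -> new Tuple2<>((String) e.get("userId"), e))
                .groupByKey();

        // Per-batch processing of each user's events happens here.
        byUser.foreachRDD(rdd -> rdd.foreach(pair -> {
            // ... aggregate / persist pair._2() for user pair._1() ...
        }));

        ssc.start();
        ssc.awaitTermination();
    }
}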
