spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrian Bridgett <>
Subject very high maxresults setting (no collect())
Date Mon, 19 Sep 2016 21:05:50 GMT

We've recently started seeing a huge increase in 
spark.driver.maxResultSize - we are starting to set it at 3GB (and 
increase our driver memory a lot to 12GB or so).  This is on v1.6.1 with 
Mesos scheduler.

All the docs I can see is that this is to do with .collect() being 
called on a large RDD (which isn't the case AFAIK - certainly nothing in 
the code) and it's rather puzzling me as to what's going on.  I thought 
that the number of tasks was coming into it (about 14000 tasks in each 
of about a dozen stages).  Adding a coalesce seemed to help but now we 
are hitting the problem again after a few minor code tweaks.

What else could be contributing to this?   Thoughts I've had:
- number of tasks
- metrics?
- um, a bit stuck!

The code looks like this:
val rows = df.count()

// actually we loop over this a few times
val output = df. groupBy("id").agg(

Cheers for any help/pointers!  There are a couple of memory leak tickets 
fixed in v1.6.2 that may affect the driver so I may try an upgrade (the 
executors are fine).


To unsubscribe e-mail:

View raw message