spark-user mailing list archives

From <andrew.row...@thomsonreuters.com>
Subject Driver running out of memory - caused by many tasks?
Date Thu, 27 Aug 2015 10:53:16 GMT
I have a Spark 1.4.1 on YARN job where the first stage has ~149,000 tasks (it’s reading
a few TB of data). The job itself is fairly simple - it’s just getting a list of distinct
values:

    val days = spark
      .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
      .sample(withReplacement = false, fraction = 0.01)
      .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
      .distinct()
      .collect()

The cardinality of ‘day’ is quite small - there are only a handful of distinct values. However,
I’m frequently running into OutOfMemory errors on the driver. I’ve had it fail with 24 GB of
RAM, and am currently nudging it upwards to find out where it starts working. The ratio between
input and shuffle write in the distinct stage is about 3 TB to 7 MB. On a smaller dataset, it
works without issue on a smaller (4 GB) heap. In YARN cluster mode, I get a failure message
similar to:

    Container [pid=36844,containerID=container_e15_1438040390147_4982_01_000001] is running
    beyond physical memory limits. Current usage: 27.6 GB of 27 GB physical memory used;
    29.5 GB of 56.7 GB virtual memory used. Killing container.
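
I assume the 27 GB container limit is the driver heap plus the spark.yarn.driver.memoryOverhead
allowance (which I believe defaults to 10% of the heap on YARN), so presumably the knobs to
adjust are something like the following, either in spark-defaults.conf or via --conf at submit
time - the values here are just illustrative:

    spark.driver.memory                24g
    spark.yarn.driver.memoryOverhead   4096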


Is the driver running out of memory simply due to the number of tasks, or is there something
about the job itself that’s causing it to put a lot of data into the driver heap and go OOM?
If the former, is there any general guidance on how much memory to give the driver as a
function of the number of tasks?
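
For what it’s worth, one experiment I was thinking of to separate the two cases is to merge the
input splits so that the first stage runs far fewer tasks - a rough sketch (the coalesce target
of 10,000 is arbitrary):

    val days = spark
      .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
      // merge the ~149k input splits into fewer, larger partitions so the first
      // stage runs fewer tasks (coalesce without shuffle keeps this a narrow dependency)
      .coalesce(10000)
      .sample(withReplacement = false, fraction = 0.01)
      .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
      .distinct()
      .collect()

If the driver still falls over with ~10,000 tasks, that would suggest it isn’t per-task
bookkeeping on the driver that’s eating the heap.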

Andrew