spark-user mailing list archives

From Yang Cao <>
Subject physical memory usage keep increasing for spark app on Yarn
Date Fri, 20 Jan 2017 09:35:43 GMT
Hi all,

I am running a Spark application on YARN in client mode with 6 executors (4 cores each,
executor memory = 6G, overhead = 4G; Spark versions 1.6.3 / 2.1.0). The executors' physical
memory usage keeps increasing until they are killed by the node manager, which logs a message
telling me to boost spark.yarn.executor.memoryOverhead. I know that this parameter mainly
controls the size of the memory allocated off-heap, but I don't know when and how the Spark
engine uses this part of memory. Also, increasing it does not always solve my problem:
sometimes it works and sometimes it does not. It tends to be useless when the input data is large.
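
For reference, the setup described above could be written roughly as the following configuration sketch. It assumes the settings are applied through SparkConf rather than spark-submit flags, and the application name is hypothetical. With these values YARN is asked for roughly 6 GiB of heap plus 4 GiB of overhead per executor, and the node manager kills a container whose physical memory grows past its allocation.

```scala
import org.apache.spark.SparkConf

// Sketch only: mirrors the numbers quoted in this message.
// The app name is hypothetical; the master and deploy mode are assumed
// to be supplied by spark-submit.
val conf = new SparkConf()
  .setAppName("daily-small-file-merge")              // hypothetical name
  .set("spark.executor.instances", "6")              // 6 executors
  .set("spark.executor.cores", "4")                  // 4 cores each
  .set("spark.executor.memory", "6g")                // JVM heap per executor
  .set("spark.yarn.executor.memoryOverhead", "4096") // off-heap headroom per executor, in MiB

// Requested container size per executor is heap + overhead,
// i.e. about 6 GiB + 4 GiB = 10 GiB before YARN rounds the request up.
```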

FYI, my app's logic is quite simple. It combines the small files generated in a single day
(one directory per day) into a single file and writes it back to HDFS. Here is the core code:
val df ="m = ${ts.month} AND d = ${}").coalesce(400)
val dropDF = df.drop("hh").drop("mm").drop("mode").drop("y").drop("m").drop("d")
The source may have hundreds to thousands of partitions, and the total Parquet data is
around 1 to 5 GB. I also find that in the step that reads shuffle data from different
machines, the shuffle read size is about 4 times larger than the input size, which is either
weird or follows some principle I don't know.

Anyway, I have done some searching on this problem myself. Some articles say the cause is
direct buffer memory (which I don't set myself). Others say people solved it with more
frequent full GCs. I also found one person on SO in a very similar situation:
This person claimed that it is a bug in Parquet, but a commenter questioned that. People on
this mailing list may also have received an email a few hours ago from blondowski, who
described this problem while writing JSON:

So this looks like a common problem across different output formats. I hope someone with
experience of this problem can explain why it happens and what a reliable way to solve it
would be.

