spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk
Date Sat, 07 Apr 2018 18:37:37 GMT
As far as I know, TableSnapshotInputFormat relies on a temporary folder:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html


Unfortunately, some input formats need a (local) tmp directory. Sometimes this cannot be avoided.

See also the source:
https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapred/TableSnapshotInputFormat.java
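
For reference, here is a minimal Scala sketch of how this input format is typically wired into a Spark job; the snapshot name, the S3 paths, and the empty Scan are illustrative placeholders and not taken from the job described below. The restoreDir argument of setInput is the temporary folder mentioned above.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil, TableSnapshotInputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

object SnapshotReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-snapshot-etl").getOrCreate()
    val sc = spark.sparkContext

    // Assumes hbase.rootdir etc. already point at the S3-backed HBase files.
    val conf = HBaseConfiguration.create()
    val job = Job.getInstance(conf)

    // The input format needs a serialized Scan in the configuration.
    val scan = new Scan()
    job.getConfiguration.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    // setInput restores the snapshot under restoreDir before any scan runs;
    // this restore directory is the temporary folder the input format depends on.
    TableSnapshotInputFormat.setInput(
      job,
      "my_snapshot",                                   // hypothetical snapshot name
      new Path("s3://my-bucket/tmp/snapshot-restore")) // hypothetical restore location

    val hbaseRdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[TableSnapshotInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"rows scanned: ${hbaseRdd.count()}")
    spark.stop()
  }
}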

> On 7. Apr 2018, at 20:26, Saad Mufti <saad.mufti@gmail.com> wrote:
> 
> Hi,
> 
> I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The input data is
> HBase files in AWS S3 accessed via EMRFS, but there is no HBase running on the Spark cluster
> itself. The job restores the HBase snapshot into files in another S3 folder used for temporary
> storage, then creates an RDD over those files using HBase's TableSnapshotInputFormat class.
> There is a large number of HBase regions, around 12000, and each region gets translated to
> one Spark task/partition. We are running in YARN mode, with one core per executor, so on our
> 120-node cluster we have around 1680 executors running (not the full 1960, as YARN only gives
> us so many containers due to memory limits).
> 
> This is a simple ETL job that transforms the HBase data into Avro/Parquet and writes it
> to disk; there are no reduces or joins of any kind. The output Parquet data uses Snappy
> compression; the total output is around 7 TB, while we have about 28 TB of total disk provisioned
> in the cluster. The Spark UI shows no disk storage being used for cached data, and not much
> heap being used for caching either, which makes sense because in this simple job we have no
> need to call RDD.cache, as the RDD is not reused at all.
> 
> Lately the job has started failing because, close to finishing, some of the YARN nodes
> start running low on disk and YARN marks them as unhealthy, then kills all the executors on
> that node. The problem just moves to another node, where the tasks are relaunched for another
> attempt, until, after 4 failures of a given task, the whole job fails.
> 
> So I am trying to understand where all this disk usage is coming from. I can see in Ganglia
> that disk keeps running lower the longer the job runs, no matter which node I look at. Like I
> said, the total size of the final output in HDFS is only around 7 TB, while we have around
> 28 TB of disk provisioned for HDFS.
> 
> Any advice or pointers for where to look for the large disk usage would be most appreciated.
> 
> Thanks.
> 
> ----
> Saad
> 
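
As an illustration of the write side described in the quoted message (HBase rows converted and written out as Snappy-compressed Parquet), here is a minimal Scala sketch; the column family "cf", qualifier "q", output path, and flat record shape are hypothetical, not details of the actual job.

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical flat record shape for the Parquet output.
case class OutputRecord(rowKey: String, value: String)

object SnapshotWriteSketch {
  def writeParquet(spark: SparkSession,
                   hbaseRdd: RDD[(ImmutableBytesWritable, Result)]): Unit = {
    import spark.implicits._

    // Extract plain Strings immediately, since the Hadoop RDD may reuse objects.
    val records = hbaseRdd.map { case (key, result) =>
      OutputRecord(
        Bytes.toString(key.copyBytes()),
        // "cf" and "q" are hypothetical column family / qualifier names.
        Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"))))
    }

    records.toDF()
      .write
      .option("compression", "snappy")          // Snappy-compressed Parquet, as in the job described
      .parquet("s3://my-bucket/output/parquet") // hypothetical output location
  }
}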
