spark-user mailing list archives

From Konstantinos Kougios
Subject Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()
Date Thu, 11 Jun 2015 12:01:24 GMT
Now I am profiling the executor.

There seems to be a memory leak.

20 mins after the run there were:

157k byte[] for 75MB
519k java.lang.ref.Finalizer for 31MB
291k for 17MB
487k for 11MB

An hour after the run I got:

186k byte[] for 106MB
863k Finalizer for 52MB
475k Inflater for 29MB
354k Deflater for 24MB
829k ZStreamRef for 19MB
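
To make the growth between the two snapshots explicit, here is a small sketch that subtracts one from the other. It only covers the two classes that are named in both listings, and it assumes that "Finalizer" in the second snapshot is the same class as java.lang.ref.Finalizer in the first. The dictionary and function names are made up for illustration.

```python
# Figures copied from the two snapshots above:
# (instance count in thousands, retained size in MB).
# Assumption: "Finalizer" in the later snapshot is java.lang.ref.Finalizer.
after_20_min = {"byte[]": (157, 75), "java.lang.ref.Finalizer": (519, 31)}
after_60_min = {"byte[]": (186, 106), "java.lang.ref.Finalizer": (863, 52)}


def growth(before, after):
    """Return {class name: (count delta in k, size delta in MB)}
    for every class that appears in both snapshots."""
    return {
        name: (after[name][0] - before[name][0],
               after[name][1] - before[name][1])
        for name in before
        if name in after
    }


delta = growth(after_20_min, after_60_min)
for name in sorted(delta):
    d_count, d_mb = delta[name]
    print("%s: +%dk instances, +%dMB" % (name, d_count, d_mb))
```

Run this way, the Finalizer count grows by 344k instances (21MB) between the two snapshots, while byte[] grows by 29k instances (31MB).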

I don't see why those zip classes are leaking. I am not doing any
compression myself (I am reading plain-text XML files, extracting a few
elements and reducing them). I assume it must be the Hadoop streams,
maybe when I call rdd.saveAsObjectFile().

I am using Hadoop 2.7.0 with Spark 1.3.1-hadoop


On 10/06/15 17:14, Marcelo Vanzin wrote:
> So, I don't have an explicit solution to your problem, but...
> On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios wrote:
>     I am profiling the driver. It currently has 564MB of strings, which
>     might be the 1mil file names. But it also has 2.34 GB of long[]!
>     That's so far; it is still running. What are those long[] used for?
> When Spark lists files it also needs all the extra metadata about 
> where the files are in the HDFS cluster. That is a lot more than just 
> the file's name - see the "LocatedFileStatus" class in the Hadoop docs 
> for an idea.
> What you could try is to somehow break that input down into smaller 
> batches, if that's feasible for your app. e.g. organize the files by 
> directory and use separate directories in different calls to 
> "binaryFiles()", things like that.
> -- 
> Marcelo
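
The suggestion above, one binaryFiles() call per directory, could be sketched roughly as follows. This is an illustration only, not code from the thread. It is written against the PySpark API, and the function name, the output naming scheme, and the `extract` and `combine` callbacks are all hypothetical. It uses saveAsPickleFile() because PySpark has no saveAsObjectFile(). An existing SparkContext is passed in as `sc`.

```python
def process_by_directory(sc, input_dirs, output_root, extract, combine):
    """Run one small job per input directory instead of one job over all files.

    sc          -- an existing SparkContext (created elsewhere)
    input_dirs  -- list of directories to read, one binaryFiles() call each
    output_root -- hypothetical output location; one subdirectory per input
    extract     -- user function: (path, content) -> iterable of (key, value)
    combine     -- user function merging two values that share a key
    """
    outputs = []
    for i, directory in enumerate(input_dirs):
        # Only the files under this one directory are listed for this job.
        files = sc.binaryFiles(directory)
        reduced = (files
                   .flatMap(lambda kv: extract(kv[0], kv[1]))
                   .reduceByKey(combine))
        out = "%s/dir-%05d" % (output_root, i)
        reduced.saveAsPickleFile(out)
        outputs.append(out)
    return outputs
```

Each directory is written to its own output path, so a key that occurs in several directories still has to be merged in a final pass over the much smaller per-directory results.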
