Now I am profiling the executor.
There seems to be a memory leak.
20 minutes into the run there were:
157k byte[] instances for 75MB,
519k java.lang.ref.Finalizer for 31MB,
291k java.util.zip.Inflater for 17MB,
487k java.util.zip.ZStreamRef for 11MB.
An hour into the run I got:
186k byte[] for 106MB,
863k Finalizer for 52MB,
475k Inflater for 29MB,
354k Deflater for 24MB,
829k ZStreamRef for 19MB.
I don't see why those zip classes are leaking. I am not doing any
compression myself (I am reading plain-text XML files, extracting a few
elements and reducing them), so I assume it must be the Hadoop streams,
maybe when I do rdd.saveAsObjectFile().
I am using Hadoop 2.7.0 with Spark 1.3.1-hadoop.
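
For context, the job is roughly this shape (a rough sketch, not the actual
code - the paths and extractElements are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object XmlReduceJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("xml-reduce"))

    // read the plain-text XML files as streams (placeholder input path)
    val files = sc.binaryFiles("hdfs:///data/xml/")

    val reduced = files
      // materialise each file as text and pull out the few elements of
      // interest; extractElements stands in for the real parsing code
      .flatMap { case (_, stream) =>
        extractElements(new String(stream.toArray(), "UTF-8"))
      }
      .reduceByKey(_ + _)

    // persist the result; saveAsObjectFile writes Hadoop SequenceFiles of
    // serialized objects under the hood
    reduced.saveAsObjectFile("hdfs:///data/out/reduced")

    sc.stop()
  }

  // placeholder for the real extraction logic
  def extractElements(xml: String): Seq[(String, Long)] = Seq.empty
}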
Cheers
On 10/06/15 17:14, Marcelo Vanzin wrote:
> So, I don't have an explicit solution to your problem, but...
>
> On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios
> <kostas.kougios@googlemail.com>
> wrote:
>
> I am profiling the driver. It currently has 564MB of strings, which
> might be the 1 million file names. But it also has 2.34 GB of long[]!
> That's so far - it is still running. What are those long[] used for?
>
>
> When Spark lists files it also needs all the extra metadata about
> where the files are in the HDFS cluster. That is a lot more than just
> the file's name - see the "LocatedFileStatus" class in the Hadoop docs
> for an idea.
>
> What you could try is to somehow break that input down into smaller
> batches, if that's feasible for your app - e.g. organize the files by
> directory and use separate directories in different calls to
> "binaryFiles()", things like that.
>
> --
> Marcelo
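
Re the per-directory batching suggested above, something like this is
roughly what I will try (a rough sketch - the directory layout is made up,
and sc / extractElements are as in the snippet earlier in this mail):

// process the input one directory at a time instead of one huge
// binaryFiles() call over everything
val dirs = Seq("hdfs:///data/xml/2015-06-01", "hdfs:///data/xml/2015-06-02")

dirs.foreach { dir =>
  val batch = sc.binaryFiles(dir)
    .flatMap { case (_, stream) =>
      extractElements(new String(stream.toArray(), "UTF-8"))
    }
    .reduceByKey(_ + _)

  // one output per input directory keeps the driver's file listing small
  batch.saveAsObjectFile(dir.replace("/xml/", "/out/"))
}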