mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: spark-rowsimilarity java.lang.OutOfMemoryError: Java heap space
Date Tue, 19 May 2015 19:57:58 GMT
The way the code works is:
1) create a BiMap for every id space in the client code (users and items). This is non-distributed
code, typically run on the machine you launch from although in yarn-cluster mode the actual
machine may be different. In any case the heap used is associated with the driver itself,
not distributed code.
2) the BiMap is broadcast (copied) to every worker. This instantiates it in memory shared
with all executors on the worker so there is only one copy per machine. Since it may be large
this is the best way to handle it.
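
As a rough illustration of step 1, the sketch below builds a small two-way dictionary from external string IDs to dense int IDs using two HashMaps. The class name IdDictionary and its methods are hypothetical, chosen for this example, and are not Mahout's actual code; step 2, the Spark broadcast, is not shown.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a BiMap: external string ID <-> dense int ID. */
public class IdDictionary {
    // Two maps, one per lookup direction, both held in the heap of this JVM.
    private final Map<String, Integer> toInt = new HashMap<>();
    private final Map<Integer, String> toExternal = new HashMap<>();

    /** Returns the int ID for an external ID, assigning the next free int on first sight. */
    public int idFor(String externalId) {
        Integer existing = toInt.get(externalId);
        if (existing != null) {
            return existing;
        }
        int next = toInt.size();
        toInt.put(externalId, next);
        toExternal.put(next, externalId);
        return next;
    }

    /** Returns the external ID for an int ID, or null if the int ID is unknown. */
    public String externalIdFor(int id) {
        return toExternal.get(id);
    }

    /** Number of distinct external IDs seen so far. */
    public int size() {
        return toInt.size();
    }

    public static void main(String[] args) {
        IdDictionary items = new IdDictionary();
        for (String s : new String[] {"item-a", "item-b", "item-a"}) {
            System.out.println(s + " -> " + items.idFor(s));
        }
        System.out.println("distinct IDs: " + items.size());
    }
}
```

Every distinct ID adds an entry to both maps, which is why the whole structure has to fit in the heap of the process that builds it.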

#1 requires that you have enough memory in the driver to create the BiMap. This memory is
allocated when the driver is launched and is available as heap. If you are not using yarn this
is plain JVM memory, so any of the usual methods for setting -Xmx4g (or however much you need)
apply. This will be something like “export JAVA_OPTS=-Xmx4g”. You would have to
have a giant BiMap to use that much memory. A HashMap stores an index plus a copy of every
key/value pair, and a BiMap has two HashMaps. Index aside, the memory needed grows with the
length of your ID strings, since the Mahout IDs they map to are just ints.
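
To get a feel for the numbers, the sketch below prints the maximum heap of the JVM it runs in and makes a back-of-the-envelope estimate of a two-HashMap dictionary's size. The per-object overheads are assumed ballpark figures for a 64-bit JVM, not measurements, and the class and method names are hypothetical.

```java
public class DictionaryHeapEstimate {
    // Assumed ballpark overheads in bytes for a 64-bit JVM; not measured values.
    static final long STRING_OVERHEAD = 40;    // String object plus char array headers
    static final long BYTES_PER_CHAR = 2;      // assumes UTF-16 character storage
    static final long MAP_ENTRY_OVERHEAD = 48; // hash entry, boxed Integer, table slot

    /** Rough size of a string<->int dictionary: one string plus two map entries per ID. */
    static long estimateBytes(long entries, int avgIdChars) {
        long perString = STRING_OVERHEAD + BYTES_PER_CHAR * avgIdChars;
        long perEntry = perString + 2 * MAP_ENTRY_OVERHEAD;
        return entries * perEntry;
    }

    public static void main(String[] args) {
        // maxMemory() reports the heap ceiling this JVM was started with (-Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();
        long estimate = estimateBytes(1_000_000L, 20);
        System.out.println("max heap (bytes): " + maxHeap);
        System.out.println("estimated dictionary size (bytes): " + estimate);
        System.out.println("fits in heap: " + (estimate < maxHeap));
    }
}
```

Under these assumptions the estimate grows linearly with both the number of IDs and their average length, so only the string length term changes when the IDs get longer.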

If you are using spark-submit you can change executor memory there. You can also change it in the
Spark conf files, or through the driver with -D:spark.executor.memory=4g. These use different
mechanisms to change the config but should all work. Feel free to try a different method
if you think -sem isn't taking effect.

Are you using yarn-client or yarn-cluster? Can you share your entire command line and console
error log? The log line you posted also shows about 1.8g free, so we need to pinpoint which memory
chunk is being exhausted. It would also help if you could share a snippet of your data.

On May 18, 2015, at 6:10 AM, Xavier Rampino <xrampino@senscritique.com> wrote:

I just did that but I ran into the same problem. I feel like -sem doesn't
work with my setup. For instance I have:

15/05/18 13:44:39 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
localhost:60596 in memory (size: 2.7 KB, free: *1761.1 MB*)

(Maybe it's not related though)

On Wed, May 13, 2015 at 7:27 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

> There is a bug in mahout 0.10.0 that you can fix if you are able to build
> from source. Get the source tar for 0.10.0, not the current master.
> 
> Go to
> https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157
> 
> remove the line that says: interactions.collect()
> 
> See this Jira https://issues.apache.org/jira/browse/MAHOUT-1707
> 
> There is one other thing that can cause this and is fixed by increasing
> your client JVM heap space, but try the above first.
> 
> BTW, setting the executor memory twice is not necessary.
> 
> 
> On May 13, 2015, at 2:21 AM, Xavier Rampino <xrampino@senscritique.com>
> wrote:
> 
> Hello,
> 
> I've tried spark-rowsimilarity with out-of-the-box setup (downloaded mahout
> distribution and spark, and set up the PATH), and I stumble upon a Java
> Heap space error. My input file is ~100MB. It seems the various parameters
> I tried to give won't change this. I do :
> 
> ~/mahout-distribution-0.10.0/bin/mahout spark-rowsimilarity --input
> ~/query_result.tsv --output ~/work/result -sem 24g
> -D:spark.executor.memory=24g
> 
> Do I just need to allocate more memory, or is there another step I can take to
> solve this?
> 
> 

