mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Vector creation - out of memory error
Date Tue, 21 Jul 2009 00:49:39 GMT



On Jul 20, 2009, at 2:40 PM, Florian Leibert wrote:

> Hi,
> I'm trying to create vectors with Mahout as explained in
> http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text 
> ,
> however I keep running out of heap. My heap is set to 2 GB already  
> and I use
> these parameters:
> "java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind
> --output /user/florian/index-vectors-01 --field content
> --dictOut /user/florian/index-dict-01 --weight TF".

Hmm, 6 GB isn't all that large, but the primary memory usage is going
to be due to the CachedTermInfo, which loads all the terms into
memory.  It sits behind an interface that can be implemented in other,
slower ways, but we'll have to change the Driver program to allow for
that.
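
To illustrate the trade-off described above, here is a minimal sketch of an eager (cached) term lookup next to an on-demand one. Every name in it (TermLookup, CachedLookup, OnDemandLookup) is hypothetical and is not Mahout's actual API; a plain map stands in for the Lucene index. It only shows why the cached variant's memory grows with the number of unique terms while the on-demand variant trades that memory for slower lookups.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical names throughout; this is NOT Mahout's actual API.
public class TermLookupSketch {

    /** Maps a term to its document frequency. */
    interface TermLookup {
        int docFreq(String term);
    }

    /** Eager variant: copies every term into memory up front. */
    static final class CachedLookup implements TermLookup {
        private final Map<String, Integer> cache = new HashMap<>();

        CachedLookup(Map<String, Integer> allTerms) {
            cache.putAll(allTerms); // memory grows with the number of unique terms
        }

        @Override
        public int docFreq(String term) {
            return cache.getOrDefault(term, 0);
        }
    }

    /** Lazy variant: keeps no per-term state and queries the backing store on each call. */
    static final class OnDemandLookup implements TermLookup {
        private final ToIntFunction<String> store;

        OnDemandLookup(ToIntFunction<String> store) {
            this.store = store;
        }

        @Override
        public int docFreq(String term) {
            return store.applyAsInt(term); // slower per call, but constant memory
        }
    }

    public static void main(String[] args) {
        // A tiny map stands in for the index (an assumption for illustration).
        Map<String, Integer> index = new HashMap<>();
        index.put("mahout", 3);
        index.put("vector", 7);

        TermLookup cached = new CachedLookup(index);
        TermLookup onDemand = new OnDemandLookup(t -> index.getOrDefault(t, 0));

        System.out.println(cached.docFreq("vector"));
        System.out.println(onDemand.docFreq("vector"));
    }
}
```

Both variants return the same answers; the difference is only where the term data lives while vectors are being built.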

How many unique terms do you have in the content field?

You have java -Xmx2000M set as the heap size?
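
One quick way to confirm that the heap setting actually reached the JVM is to print the maximum heap from inside a Java process. This is a small standalone sketch; the class name HeapCheck is made up for illustration and is not part of Mahout.

```java
// Standalone check; HeapCheck is a hypothetical name, not part of Mahout.
public class HeapCheck {

    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: "
                + maxHeapMegabytes() + " MB");
    }
}
```

Running it as "java -Xmx2000M HeapCheck" should report a value close to 2000 MB; the exact figure can come out somewhat lower depending on the JVM and garbage collector.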


>
> My index currently is about 6 GB large. Is there any way to compute  
> the
> vectors in a distributed manner?

There will be, but I suspect there isn't yet.


> What's the largest index someone has
> created vectors from?

It's pretty new code; I've only tested it on relatively small indexes
(a few hundred MB) so far, but the only gating issue memory-wise is the
CachedTermInfo.

Sorry I don't have better answers, but I am willing to help improve it.
I will try to use some bigger indexes soon.

-Grant
