mahout-user mailing list archives

From Florian Leibert <...@leibert.de>
Subject Re: Vector creation - out of memory error
Date Tue, 21 Jul 2009 00:54:31 GMT
Hi Grant,
thanks for your answers. It seems to work with a heap of 4 GB, but it is
fairly slow. I'd be interested in seeing whether we could make this process
distributed. It's running standalone right now and is therefore a
bottleneck.

Are there any attempts right now to implement it in an M/R fashion?

Thanks,
Florian

On Mon, Jul 20, 2009 at 5:49 PM, Grant Ingersoll <gsingers@apache.org> wrote:

>
> On Jul 20, 2009, at 2:40 PM, Florian Leibert wrote:
>
>> Hi,
>> I'm trying to create vectors with Mahout as explained in
>> http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
>> but I keep running out of heap. My heap is already set to 2 GB and I use
>> these parameters:
>> "java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output
>> /user/florian/index-vectors-01 --field content --dictOut
>> /user/florian/index-dict-01 --weight TF".
>>
>
> Hmm, 6GB isn't all that large, but the primary memory usage is going to be
> due to the CachedTermInfo, which loads all the terms into memory.  This is
> an interface that can be implemented in other, slower, ways, but we'll have
> to change the Driver program to allow for that.
>
> How many unique terms do you have in the content field?
>
> You have java -Xmx2000M set as the heap size?
>
>
>
>> My index currently is about 6 GB large. Is there any way to compute the
>> vectors in a distributed manner?
>>
>
> There will be, but there isn't yet, I suspect.
>
>
>> What's the largest index someone has
>> created vectors from?
>>
>
> It's pretty new code. I've only tested it on relatively small indexes (a few
> hundred MB) so far, but the only gating issue memory-wise is the
> CachedTermInfo.
>
> Sorry I don't have better answers, but I am willing to help improve.  I
> will try to use some bigger indexes soon.
>
> -Grant
>
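
For the question above about whether the -Xmx setting actually took effect, a
minimal sketch like the following can confirm what the JVM sees. The class name
HeapCheck and its helper methods are hypothetical and not part of Mahout; the
only real API used is Runtime.getRuntime().maxMemory(). The reported value can
be slightly lower than the -Xmx figure, depending on the JVM.

```java
public class HeapCheck {
    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Maximum heap the JVM will attempt to use, in MB; this reflects -Xmx. */
    static long maxHeapMegabytes() {
        return toMegabytes(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        // Print the effective heap ceiling so it can be compared with -Xmx.
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it with the same flag as the vector job, for example
"java -Xmx2000M HeapCheck", shows whether the setting reached the JVM.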
