mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 11:16:20 GMT
AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J directory under the model.  For me, this file is 1.04 GB.  The values in this file are loaded into a List of Doubles (which brings a whole lot of auto-boxing with it, too).  It seems like that should fit in memory, especially since it is the first thing loaded.  I have not yet looked into the structure of the file itself.
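As a rough illustration of why that List of Doubles is so costly — a sketch, not Mahout code; the 500,000-term figure is Ted's rough upper bound from the quoted message below, and the per-object sizes assume a typical 64-bit HotSpot heap:

```java
// Sketch (not Mahout code): why a List<Double> of model weights uses far
// more heap than the raw data size suggests. On a typical 64-bit JVM each
// boxed Double is a full object (~16-byte header + 8-byte value, padded
// to ~24 bytes) plus an 8-byte reference held by the list, versus a flat
// 8 bytes per entry in a double[].
public class BoxingOverhead {
    public static void main(String[] args) {
        long terms = 500_000;                  // rough term-count upper bound

        long primitiveBytes = terms * 8;       // double[] payload
        long boxedBytes = terms * (24 + 8);    // Double object + reference

        System.out.printf("double[]     : ~%d MB%n", primitiveBytes >> 20);
        System.out.printf("List<Double> : ~%d MB%n", boxedBytes >> 20);
        // The boxed layout is ~4x larger before counting ArrayList's own
        // over-allocation, and every read also pays an unboxing step.
    }
}
```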

I guess I will have to dig deeper.  This code has changed a lot since I first wrote it as a very simple naive Bayes model; it now appears to be weighted by TF-IDF, normalization, etc., and I need to understand it better.

On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:

> This is kind of surprising.  It would seem that this model shouldn't have more than a few doubles per unique term, and there should be fewer than half a million terms.  Even with pretty evil data structures, this really shouldn't be more than a few hundred megs for the model alone.
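For what it's worth, the arithmetic behind that estimate works out as follows — a sketch, where "a few doubles per term" is taken as 4, an assumption rather than a measured value:

```java
// Back-of-the-envelope check of the quoted estimate: a few doubles per
// unique term, with fewer than half a million terms.
public class ModelSizeEstimate {
    public static void main(String[] args) {
        long terms = 500_000;        // upper bound on unique terms
        long doublesPerTerm = 4;     // "a few doubles per unique term" (assumed)
        long bytesPerDouble = 8;

        long payload = terms * doublesPerTerm * bytesPerDouble;
        System.out.printf("raw payload: ~%d MB%n", payload >> 20);
        // ~15 MB of raw doubles; even a 10-20x data-structure overhead
        // stays in the low hundreds of megabytes, matching the estimate.
    }
}
```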
> Sparsity *is* a virtue with these models, and I always try to eliminate terms that might as well have zero value, but that doesn't sound like the root problem here.
> Regarding strings or Writables, strings have the wonderful characteristic that they cache their hashed value.  This means that hash maps are nearly as fast as arrays, because you wind up indexing to nearly the right place and then do a few (or one) integer compares to find the right value.  Custom data types rarely do this and thus wind up slow.
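The caching Ted describes can be seen directly: java.lang.String stores its hash in a field after the first hashCode() call, while a naive custom key recomputes it on every probe.  A sketch, where TermKey is a hypothetical example type, not anything in Mahout:

```java
// Sketch: String caches its hash code after the first call, so repeated
// HashMap lookups on the same key skip re-scanning the characters.
// A custom key type without such caching recomputes its hash every time.
public class HashCaching {
    // Hypothetical custom key that does NOT cache its hash.
    static final class TermKey {
        final String term;
        TermKey(String term) { this.term = term; }
        @Override public int hashCode() {
            int h = 0;
            for (int i = 0; i < term.length(); i++) {
                h = 31 * h + term.charAt(i);   // recomputed on every call
            }
            return h;
        }
        @Override public boolean equals(Object o) {
            return o instanceof TermKey && term.equals(((TermKey) o).term);
        }
    }

    public static void main(String[] args) {
        String s = "hash maps are nearly as fast as arrays";
        // First call computes and stores the hash in String's internal
        // field; subsequent calls just return the cached value.
        int h1 = s.hashCode();
        int h2 = s.hashCode();   // cached, no character scan

        TermKey k = new TermKey(s);
        // Same 31-based polynomial as String, but recomputed per call.
        System.out.println(h1 == h2);
        System.out.println(k.hashCode() == s.hashCode());
    }
}
```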
> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <> wrote:
>> I trained on a couple of categories (history and science) on quite a few docs, but now the model is so big, I can't load it, even with almost 3 GB of memory.
> -- 
> Ted Dunning, CTO
> DeepDyve

Grant Ingersoll

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
