mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 11:23:19 GMT
The other thing is, I don't think Sigma_J is even used for Bayes,
only Complementary Bayes.

On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:

> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
> directory under the model.  For me, this file is 1.04 GB.  The
> values in this file are loaded into a List of Doubles (which brings
> with it a whole lot of auto-boxing, too).  It seems like that should
> fit in memory, especially since it is the first thing loaded,
> AFAICT.  I have not looked yet into the structure of the file itself.
>
> I guess I will have to dig deeper.  This code has changed a lot from
> when I first wrote it as a very simple Naive Bayes model to one that
> now appears to be weighted by TF-IDF, normalization, etc., and I need
> to understand it better.
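
[Editor's sketch, not part of the original message] The auto-boxing cost mentioned above can be illustrated with a small Java example. It is a minimal sketch under stated assumptions: the class and method names (WeightStorageSketch, loadBoxed, loadPrimitive) are hypothetical and are not Mahout code, and the three sample weights are made up. On a typical 64-bit HotSpot JVM each boxed Double is a separate heap object of roughly 16 to 24 bytes plus the reference the list holds, against 8 bytes per element in a double[]; the exact figures depend on the JVM and its settings.

```java
import java.util.ArrayList;
import java.util.List;

public class WeightStorageSketch {

    // Boxed storage: every value is wrapped in its own Double object,
    // and the list holds a reference to each of those objects.
    static List<Double> loadBoxed(double[] source) {
        List<Double> weights = new ArrayList<Double>(source.length);
        for (double w : source) {
            weights.add(w); // auto-boxing: equivalent to Double.valueOf(w)
        }
        return weights;
    }

    // Primitive storage: one contiguous array, 8 bytes per value.
    static double[] loadPrimitive(double[] source) {
        return source.clone();
    }

    static double sumBoxed(List<Double> weights) {
        double total = 0.0;
        for (Double w : weights) {
            total += w; // auto-unboxing on every read
        }
        return total;
    }

    static double sumPrimitive(double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample weights, for illustration only.
        double[] source = {0.5, 1.25, 2.0};
        System.out.println(sumBoxed(loadBoxed(source)));
        System.out.println(sumPrimitive(loadPrimitive(source)));
    }
}
```

Both representations hold the same values; the difference is only in per-element overhead, which is why a file of raw doubles can need noticeably more heap than its on-disk size once it is loaded into a List of Doubles.
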
> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>> This is kind of surprising.  It would seem that this model shouldn't
>> have more than a few doubles per unique term and there should be
>> <half a million terms.  Even with pretty evil data structures, this
>> really shouldn't be more than a few hundred megs for the model alone.
>>
>> Sparsity *is* a virtue with these models and I always try to
>> eliminate terms that might as well have zero value, but that doesn't
>> sound like the root problem here.
>>
>> Regarding strings or Writables, strings have the wonderful
>> characteristic that they cache their hashed value.  This means that
>> hash maps are nearly as fast as arrays because you wind up indexing
>> to nearly the right place and then do a few (or one) integer
>> compares to find the right value.  Custom data types rarely do this
>> and thus wind up slow.
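
[Editor's sketch, not part of the original message] The point about cached hash values can be shown with a custom key type that borrows the trick java.lang.String uses: compute the hash once, store it in a field, and return the stored value on later calls. This is a minimal sketch; TermKey, CachedHashSketch, and key are hypothetical names invented for illustration and do not correspond to any Mahout or Hadoop class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CachedHashSketch {

    // Hypothetical immutable key wrapping raw term bytes.
    static final class TermKey {
        private final byte[] bytes;
        private int hash; // 0 means "not computed yet", the same convention String uses

        TermKey(byte[] bytes) {
            this.bytes = bytes.clone(); // defensive copy keeps the key immutable
        }

        @Override
        public int hashCode() {
            int h = hash;
            if (h == 0 && bytes.length > 0) {
                h = Arrays.hashCode(bytes); // full scan of the bytes happens here
                hash = h;                   // later calls return the cached value
            }
            return h;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof TermKey)) {
                return false;
            }
            return Arrays.equals(bytes, ((TermKey) o).bytes);
        }
    }

    // Convenience factory for the example.
    static TermKey key(String term) {
        return new TermKey(term.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Map<TermKey, Double> weights = new HashMap<TermKey, Double>();
        TermKey history = key("history");
        weights.put(history, 1.5);                    // hash computed and cached here
        System.out.println(weights.get(history));     // reuses the cached hash
        System.out.println(weights.get(key("science")));
    }
}
```

A key type without the cached field would rescan its bytes on every put and get, which is the slowdown described above for custom data types. The cache only helps when the same key instance is hashed repeatedly, and it is only safe because the key is immutable.
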
>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <> wrote:
>>> I trained on a couple of categories (history and science) on quite
>>> a few docs, but now the model is so big, I can't load it, even with
>>> almost 3 GB of memory.
>> -- 
>> Ted Dunning, CTO
>> DeepDyve
