mahout-user mailing list archives

From: Grant Ingersoll <>
Subject: Re: Getting Started with Classification
Date: Wed, 22 Jul 2009 15:25:30 GMT

On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:

> Dear Grant,
>
> Could you post some stats like the number of labels and features that
> you have, and the number of unique <label, feature> pairs?

Labels: history and science.
Docs trained on: chunks 1-60, generated using the Wikipedia Splitter
with the WikipediaAnalyzer (MAHOUT-146), with chunk size set to 64.

Where are the <label,feature> values stored?

> Both Naive Bayes and Complementary Naive Bayes use the same data,
> except for the Sigma_j set.

So, why do I need to load it or even calculate it if I am using Bayes?
I think I would like to have the choice. That is, if I plan on using
both, then I can calculate/load both. At a minimum, when classifying
with Bayes, we should not be loading it, even if we did calculate it.

Could you add some writeup on <> about the steps that are taken? Also,
I've read the CNB paper, but do you have a reference for the NB part
using many of these values?
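
(For anyone following along, my reading of the Rennie et al. CNB paper,
so treat this as a sketch rather than gospel: plain NB estimates a
label's feature weights from that label's own counts, while CNB
estimates them from every label *except* that one:

    NB:   theta_{c,j}  = (N_{c,j} + alpha) / (N_c + alpha * |V|)
    CNB:  theta_{~c,j} = (Sigma_j - N_{c,j} + alpha) / (N_{~c} + alpha * |V|)

where N_{c,j} is the weight of feature j in label c, N_c is label c's
total feature weight, |V| is the vocabulary size, and Sigma_j is the sum
of N_{c,j} over all labels. The complement count Sigma_j - N_{c,j} is
the only place Sigma_j enters, which matches your statement that plain
Bayes never needs it.)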

> But regardless, the matrix stored is sparse. I am not surprised that
> the memory limit was crossed with a larger set like the one you have
> taken. Another thing: the number of unique terms in Wikipedia is quite
> large. So the best choice for you right now is to use the HBase
> solution; the large matrix is stored easily on it. I am currently
> writing the distributed version of the HBase classification for
> parallelizing.

HBase isn't an option right now, as it isn't committed and I'm putting  
together a demo on current capabilities.
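
Since the demo has to run on what's committed, I'll plan to keep the
model in memory. A minimal sketch of the kind of sparse <label, feature>
layout I have in mind (a hypothetical class, not the current trainer's
data structure):

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Hypothetical sparse layout for <label, feature> weights: only
     * nonzero pairs are stored, so memory scales with the number of
     * observed pairs rather than labels x vocabulary.
     */
    public class SparseWeights {
      // label -> (feature -> weight); missing entries are implicitly 0.0
      private final Map<String, Map<String, Double>> weights =
          new HashMap<String, Map<String, Double>>();

      public void add(String label, String feature, double w) {
        Map<String, Double> row = weights.get(label);
        if (row == null) {
          row = new HashMap<String, Double>();
          weights.put(label, row);
        }
        Double old = row.get(feature);
        row.put(feature, old == null ? w : old.doubleValue() + w);
      }

      public double get(String label, String feature) {
        Map<String, Double> row = weights.get(label);
        Double w = (row == null) ? null : row.get(feature);
        return w == null ? 0.0 : w.doubleValue();
      }
    }

With only two labels, memory is dominated by the per-label feature maps,
not the label dimension.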

> Robin
> On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <> wrote:
>> The other thing is, I don't think Sigma_J is even used for Bayes,
>> only Complementary Bayes.
>>
>> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>>> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
>>> directory under the model. For me, this file is 1.04 GB. The values
>>> in this file are loaded into a List of Doubles (which brings with it
>>> a whole lot of auto-boxing, too). It seems like that should fit in
>>> memory, especially since it is the first thing loaded, AFAICT. I have
>>> not looked yet into the structure of the file itself.
>>>
>>> I guess I will have to dig deeper; this code has changed a lot from
>>> when I first wrote it as a very simple naive Bayes model to one that
>>> now appears to be weighted by TF-IDF, normalization, etc., and I need
>>> to understand it better.
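
To put a rough number on that boxing cost, a throwaway sketch
(standalone Java, not the actual model-loading code; the byte counts in
the comments are typical JVM figures, not measurements):

    import java.util.ArrayList;
    import java.util.List;

    /** Why a List of boxed Doubles dwarfs a primitive double[]. */
    public class BoxingCost {
      public static void main(String[] args) {
        int n = 10000000; // ten million values, ~80 MB as primitives

        // Primitive array: 8 bytes per value in one contiguous block.
        double[] primitives = new double[n];
        for (int i = 0; i < n; i++) {
          primitives[i] = i;
        }

        // Boxed list: each add() allocates a Double (object header plus
        // the 8-byte payload) and stores a reference to it in the
        // backing Object[], so the same data typically costs 2-3x the
        // heap, with extra GC pressure and worse locality on top.
        List<Double> boxed = new ArrayList<Double>(n);
        for (int i = 0; i < n; i++) {
          boxed.add((double) i); // autoboxes on every add
        }
      }
    }

If the Sigma_J file is mostly raw doubles, a 2-3x multiplier puts
1.04 GB right around the ~3 GB ceiling, which would line up with what
I'm seeing.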
>>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>>> This is kind of surprising. It would seem that this model shouldn't
>>>> have more than a few doubles per unique term, and there should be
>>>> < half a million terms. Even with pretty evil data structures, this
>>>> really shouldn't be more than a few hundred megs for the model alone.
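
(Running that arithmetic with assumed numbers: 500,000 terms x 3 doubles
x 8 bytes is only ~12 MB of raw values, and even at ~100 bytes per entry
of map and boxing overhead it's on the order of 150 MB, so the 1 GB+
Sigma_J file does point at something beyond the per-term counts.)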
>>>> Sparsity *is* a virtue with these models, and I always try to
>>>> eliminate terms that might as well have zero value, but that doesn't
>>>> sound like the root problem here.
>>>> Regarding strings or Writables: strings have the wonderful
>>>> characteristic that they cache their hashed value. This means that
>>>> hash maps are nearly as fast as arrays, because you wind up indexing
>>>> to nearly the right place and then do a few (or one) integer
>>>> compares to find the right value. Custom data types rarely do this
>>>> and thus wind up slow.
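
Ted's caching point in miniature: a toy key class copying String's
trick (not Mahout's actual key type):

    import java.util.HashMap;
    import java.util.Map;

    /**
     * java.lang.String computes its hashCode() once and caches it in a
     * private field; later HashMap lookups reuse the cached value.  A
     * custom key type can copy the trick.  Toy sketch only.
     */
    public final class CachedHashKey {
      private final String value;
      private int hash; // 0 doubles as "not computed yet", as in String

      public CachedHashKey(String value) {
        this.value = value;
      }

      @Override
      public int hashCode() {
        int h = hash;
        if (h == 0) {            // computed at most once, then reused
          h = value.hashCode();
          hash = h;
        }
        return h;
      }

      @Override
      public boolean equals(Object o) {
        return o instanceof CachedHashKey
            && value.equals(((CachedHashKey) o).value);
      }

      public static void main(String[] args) {
        Map<CachedHashKey, Double> weights =
            new HashMap<CachedHashKey, Double>();
        CachedHashKey key = new CachedHashKey("science");
        weights.put(key, 1.0); // first hashCode() computes and caches
        weights.get(key);      // later lookups reuse the cached hash
      }
    }

A Writable that recomputes its hash on every call pays that loop on each
lookup instead.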
>>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <> wrote:
>>>>
>>>>> I trained on a couple of categories (history and science) on quite
>>>>> a few docs, but now the model is so big I can't load it, even with
>>>>> almost 3 GB of memory.
>>>> --
>>>> Ted Dunning, CTO
>>>> DeepDyve

Grant Ingersoll

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
