mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 15:25:30 GMT

On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:

> Dear Grant,
> Could you post some stats, like the number of labels and features
> that you have and the number of unique <label,feature> pairs?

Labels: history and science
Docs trained on: chunks 1-60, generated using the Wikipedia Splitter
with the WikipediaAnalyzer (MAHOUT-146) and the chunk size set to 64

Where are the <label,feature> values stored?

> Both Naive Bayes and Complementary Naive Bayes use the same data,
> except for the Sigma_j set.

So, why do I need to load it or even calculate it if I am using  
Bayes?  I think I would like to have the choice.  That is, if I plan  
on using both, then I can calculate/load both.  At a minimum, when  
classifying with Bayes, we should not be loading it, even if we did  
calculate it.
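
To make it concrete, here is roughly what I have in mind (just a
sketch; these class and method names are made up, not the actual
Mahout API):

    // Sketch only: hypothetical names, not the real Mahout classes.
    import java.io.File;
    import java.io.IOException;

    public class ModelLoader {
      enum Algorithm { BAYES, COMPLEMENTARY_BAYES }

      // Idea: only pull Sigma_J off disk when the algorithm needs it.
      static void load(File modelDir, Algorithm algorithm)
          throws IOException {
        loadFeatureWeights(modelDir);  // needed by both NB and CNB
        loadLabelWeights(modelDir);    // needed by both
        if (algorithm == Algorithm.COMPLEMENTARY_BAYES) {
          loadSigmaJ(modelDir);        // only CNB needs Sigma_J
        }
      }

      // Stubs standing in for the real I/O.
      static void loadFeatureWeights(File dir) throws IOException {}
      static void loadLabelWeights(File dir) throws IOException {}
      static void loadSigmaJ(File dir) throws IOException {}
    }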

Could you add a write-up to http://cwiki.apache.org/MAHOUT/bayesian.html
about the steps that are taken? Also, I've read the CNB paper, but do
you have a reference for the NB part that uses many of these values?
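
For what it's worth, my reading of the CNB paper (Rennie et al.,
"Tackling the Poor Assumptions of Naive Bayes Text Classifiers") is
that Sigma_j is what lets you form the complement counts in a single
pass, something like:

    N_complement(c, j) = Sigma_j - N(c, j)
    weight(c, j) = log( (N_complement(c, j) + alpha)
                      / (sum_k N_complement(c, k) + alpha * |V|) )

i.e., the total weight of feature j across all labels, minus its
weight in label c. If that's right, it would explain why plain NB
never needs it, which is all the more reason not to load it
unconditionally.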

> But regardless, the matrix stored is sparse. I am not surprised that
> the memory limit was crossed with a set as large as the one you have
> taken. Another thing: the number of unique terms in Wikipedia is
> quite large. So the best choice for you right now is to use the HBase
> solution; the large matrix is stored easily in it. I am currently
> writing the distributed version of the HBase classification for
> parallelizing.
>

HBase isn't an option right now, as it isn't committed and I'm putting  
together a demo on current capabilities.


> Robin
>
> On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll  
> <gsingers@apache.org>wrote:
>
>> The other thing is, I don't think Sigma_J is even used for Bayes,
>> only Complementary Bayes.
>>
>>
>>
>> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>>
>>> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
>>> directory under the model. For me, this file is 1.04 GB. The values
>>> in this file are loaded into a List of Doubles (which brings with it
>>> a whole lot of auto-boxing, too). It seems like that should fit in
>>> memory, especially since it is the first thing loaded, AFAICT. I
>>> have not looked yet into the structure of the file itself.
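
To make that auto-boxing cost concrete, here is a toy comparison (not
Mahout code; just List<Double> versus a primitive array):

    // Toy illustration of boxed vs. primitive storage (not Mahout code).
    import java.util.ArrayList;
    import java.util.List;

    public class BoxingDemo {
      public static void main(String[] args) {
        int n = 500000;

        // Boxed: each element is a separate Double object on the heap
        // (object header plus the 8-byte payload), plus a reference in
        // the list's backing array, so several times the raw data size.
        List<Double> boxed = new ArrayList<Double>(n);
        for (int i = 0; i < n; i++) {
          boxed.add((double) i);  // auto-boxing allocates a Double
        }

        // Primitive: 8 bytes per element in one contiguous array.
        double[] primitive = new double[n];
        for (int i = 0; i < n; i++) {
          primitive[i] = i;
        }
      }
    }
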
>>>
>>> I guess I will have to dig deeper. This code has changed a lot from
>>> when I first wrote it as a very simple Naive Bayes model to one that
>>> now appears to be weighted by TF-IDF, normalization, etc., and I
>>> need to understand it better.
>>>
>>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>>
>>>> This is kind of surprising. It would seem that this model shouldn't
>>>> have more than a few doubles per unique term, and there should be
>>>> fewer than half a million terms. Even with pretty evil data
>>>> structures, this really shouldn't be more than a few hundred megs
>>>> for the model alone.
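
Doing the arithmetic on those numbers: 500,000 terms x (say) 4 doubles
x 8 bytes is only about 16 MB of raw data, so even a 10-20x overhead
from the data structures stays in the low hundreds of megs. Given that
my Sigma_J file alone is 1.04 GB, either the unique-term count is much
higher than half a million or something per <label,feature> pair is
being written out.
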
>>>>
>>>> Sparsity *is* a virtue with these models, and I always try to
>>>> eliminate terms that might as well have zero value, but that
>>>> doesn't sound like the root problem here.
>>>>
>>>> Regarding Strings versus Writables: Strings have the wonderful
>>>> characteristic that they cache their hashed value. This means that
>>>> hash maps are nearly as fast as arrays, because you wind up
>>>> indexing to nearly the right place and then doing a few (or one)
>>>> integer compares to find the right value. Custom data types rarely
>>>> do this and thus wind up slow.
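
That's a useful point. For anyone following along, this is roughly the
trick java.lang.String uses, simplified (like the real String, a value
that happens to hash to 0 just gets recomputed on each call):

    // Simplified sketch of the hash caching java.lang.String does.
    import java.util.Arrays;

    public final class CachedKey {
      private final char[] value;
      private int hash;           // 0 means "not computed yet"

      public CachedKey(String s) {
        this.value = s.toCharArray();
      }

      @Override
      public int hashCode() {
        int h = hash;
        if (h == 0) {             // first call: compute and cache
          for (char c : value) {
            h = 31 * h + c;
          }
          hash = h;               // later calls return the cached value
        }
        return h;
      }

      @Override
      public boolean equals(Object o) {
        return o instanceof CachedKey
            && Arrays.equals(value, ((CachedKey) o).value);
      }
    }
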
>>>>
>>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll
>>>> <gsingers@apache.org> wrote:
>>>>
>>>>> I trained on a couple of categories (history and science) on quite
>>>>> a few docs, but now the model is so big that I can't load it, even
>>>>> with almost 3 GB of memory.
>>>>
>>>> --
>>>> Ted Dunning, CTO
>>>> DeepDyve
>>>>
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

