mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 16:14:18 GMT

On Jul 22, 2009, at 11:50 AM, Robin Anil wrote:

> On Wed, Jul 22, 2009 at 8:55 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>>
>> On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:
>>
>>> Dear Grant, could you post some stats, like the number of labels and
>>> features that you have and the number of unique <label, feature>
>>> pairs?
>>>
>>
>> labels: history and science
>> Docs trained on: chunks 1-60, generated using the Wikipedia Splitter
>> with the WikipediaAnalyzer (MAHOUT-146), with chunk size set to 64
>>
>> Where are the <label,feature> values stored?
>>
>
> In the tf-idf folder, the part-**** files.

That's 1.28 GB.  Count: 31216595

(FYI, I modified the SequenceFileDumper to spit out counts from a SeqFile.)
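
For anyone who wants the same count, the change amounts to iterating the
file and tallying records. A minimal standalone sketch of the idea, using
the plain Hadoop SequenceFile.Reader API (this is not the actual
SequenceFileDumper patch; the class name and argument handling are
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileCounter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0] is the SequenceFile to count, e.g. one of the tf-idf part files
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Writable key =
        (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value =
        (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long count = 0;
    while (reader.next(key, value)) {  // visit every <key, value> record
      count++;
    }
    reader.close();
    System.out.println("Count: " + count);
  }
}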

>>> Both Naive Bayes and Complementary Naive Bayes use the same data,
>>> except for the Sigma_j set.
>>
>> So, why do I need to load it or even calculate it if I am using
>> Bayes? I think I would like to have the choice. That is, if I plan on
>> using both, then I can calculate/load both. At a minimum, when
>> classifying with Bayes, we should not be loading it, even if we did
>> calculate it.

Thoughts on this?  Can I disable it for Bayes?
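
Something as simple as gating the load on the requested algorithm would
do it. A rough sketch of the shape I have in mind (maybeLoadSigmaJ(),
loadSigmaJ(), and the "cbayes" flag value are hypothetical names, not
the actual Mahout API):

// Hypothetical: only pull Sigma_j into memory when Complementary NB
// actually needs it; plain NB skips the ~1 GB load entirely.
static List<Double> maybeLoadSigmaJ(String algorithm, Path sigmaJPath,
                                    Configuration conf) throws IOException {
  if (!"cbayes".equals(algorithm)) {
    return null;                        // Bayes never reads Sigma_j
  }
  return loadSigmaJ(sigmaJPath, conf);  // illustrative helper for CNB
}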

>>
>> Could you add some writeup on http://cwiki.apache.org/MAHOUT/bayesian.html
>> about the steps that are taken? Also, I've read the CNB paper, but do
>> you have a reference for the NB part using many of these values?
>
> Sure. I will, ASAP.
>
>>
>>> But regardless, the matrix stored is sparse. I am not surprised
>>> that, with a larger set like the one you have taken, the memory
>>> limit was crossed. Another thing: the number of unique terms in
>>> Wikipedia is quite large. So the best choice for you right now is to
>>> use the HBase solution; the large matrix is stored easily on it. I
>>> am currently writing the distributed version of the HBase
>>> classification for parallelizing.
>>>
>>>
>> HBase isn't an option right now, as it isn't committed and I'm
>> putting together a demo on current capabilities.
>>
>>> Robin
>>> On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>
>>>> The other thing is, I don't think Sigma_j is even used for Bayes,
>>>> only Complementary Bayes.
>>>>
>>>>
>>>> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>>>>
>>>>> AFAICT, it is loading the Sum Feature Weights, stored in the
>>>>> Sigma_j directory under the model. For me, this file is 1.04 GB.
>>>>> The values in this file are loaded into a List of Doubles (which
>>>>> brings with it a whole lot of auto-boxing, too). It seems like
>>>>> that should fit in memory, especially since it is the first thing
>>>>> loaded, AFAICT. I have not yet looked into the structure of the
>>>>> file itself.
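
(A back-of-the-envelope on why that List of Doubles hurts: each
java.lang.Double is its own heap object, typically around 16 bytes, plus
a reference per list slot, versus 8 bytes per entry for a primitive
double[]. A quick illustration; the entry count and per-object sizes
here are assumptions, not measurements:

public class BoxingCost {
  public static void main(String[] args) {
    long n = 30000000L;           // assumed entry count, for illustration
    long primitive = 8L * n;      // double[]: 8 bytes/value, ~240 MB
    // Boxed: ~16 bytes per Double object plus an ~8-byte reference per
    // slot in the backing array -- roughly 3x the primitive cost.
    long boxed = (16L + 8L) * n;  // ~720 MB before any growth slack
    System.out.println(primitive + " bytes vs " + boxed + " bytes");
  }
}

That ratio alone goes a long way toward explaining why a ~1 GB file can
blow a 3 GB heap.)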
>>>>>
>>>>> I guess I will have to dig deeper; this code has changed a lot
>>>>> from when I first wrote it as a very simple Naive Bayes model to
>>>>> one that now appears to be weighted by TF-IDF, normalization,
>>>>> etc., and I need to understand it better.
>>>>>
>>>>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>>>>
>>>>>> This is kind of surprising. It would seem that this model
>>>>>> shouldn't have more than a few doubles per unique term, and there
>>>>>> should be <half a million terms. Even with pretty evil data
>>>>>> structures, this really shouldn't be more than a few hundred megs
>>>>>> for the model alone.
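
(For concreteness: 500,000 terms at, say, 3 doubles each is
500,000 x 3 x 8 bytes = 12 MB of raw payload; even a 10-20x blowup from
boxing and hash-map entry overhead stays in the low hundreds of megs,
which is Ted's point.)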
>>>>>>
>>>>>> Sparsity *is* a virtue with these models, and I always try to
>>>>>> eliminate terms that might as well have zero value, but that
>>>>>> doesn't sound like the root problem here.
>>>>>>
>>>>>> Regarding strings or Writables: strings have the wonderful
>>>>>> characteristic that they cache their hashed value. This means
>>>>>> that hash maps are nearly as fast as arrays, because you wind up
>>>>>> indexing to nearly the right place and then do one or a few
>>>>>> integer compares to find the right value. Custom data types
>>>>>> rarely do this and thus wind up slow.
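
(The caching Ted describes is easy to see: String.hashCode() computes
the hash once, stores it in a private field, and returns the cached
value on every later call, so repeated probes on the same instance never
re-scan the characters. A tiny self-contained illustration:

import java.util.HashMap;
import java.util.Map;

public class CachedHashDemo {
  public static void main(String[] args) {
    Map<String, Double> weights = new HashMap<String, Double>();
    String term = "wikipedia";
    weights.put(term, 1.0);
    // The first hashCode() call hashes the characters and caches the
    // result inside the String; every later get() on the same instance
    // reuses it, leaving just the bucket index plus an equals() check.
    for (int i = 0; i < 1000000; i++) {
      weights.get(term);
    }
  }
}

One caveat: the cache lives on the instance, so equal-but-distinct
String objects each pay for one hash computation of their own.)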
>>>>>>
>>>>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>>>>
>>>>>>> I trained on a couple of categories (history and science) on
>>>>>>> quite a few docs, but now the model is so big I can't load it,
>>>>>>> even with almost 3 GB of memory.
>>>>>>
>>>>>> --
>>>>>> Ted Dunning, CTO
>>>>>> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

