mahout-user mailing list archives

From Stuart Smith <stu24m...@yahoo.com>
Subject Re: Naive Bayes training filling up jobcache
Date Tue, 03 Apr 2012 20:08:34 GMT


Hmm.. I think it's bayes.* ?

I'm invoking the mahout command-line tool to train it, using 'trainclassifier' (the text-driven one).
Is the vector-based one a better bet?

Btw, does either one have a thread-safe, non-mapreduce training API? I didn't see one in bayes.
Just wondering because right now I dump all my text data out of HBase into text files on
HDFS, then train on those. If I could write a threaded program that just scans HBase and
pulls the strings directly, I could save a step and control how much space gets used a little better...
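
To make it concrete, here's a rough sketch of the kind of threaded exporter I'm picturing.
It's untested, the table/column names are made-up stand-ins for my schema (one HBase table
per category is an assumption, not anything Mahout needs), and I'm assuming the old bayes
trainclassifier is happy with one "label <tab> document text" line per document, which is
what the 20newsgroups prep step seems to produce:

  // Hypothetical sketch (not tested): one scanner thread per category, each
  // writing "label \t document text" lines to its own file on HDFS.
  import java.io.PrintWriter;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseToBayesInput {

    private static final byte[] FAMILY = Bytes.toBytes("doc");     // assumed column family
    private static final byte[] QUALIFIER = Bytes.toBytes("text"); // assumed qualifier

    public static void main(String[] args) throws Exception {
      final Configuration conf = HBaseConfiguration.create();
      final FileSystem fs = FileSystem.get(conf);
      final String[] categories = args; // one HBase table per label (assumed layout)

      ExecutorService pool = Executors.newFixedThreadPool(categories.length);
      for (final String label : categories) {
        pool.submit(new Runnable() {
          @Override public void run() {
            try {
              HTable table = new HTable(conf, label);
              Scan scan = new Scan();
              scan.addColumn(FAMILY, QUALIFIER);
              scan.setCaching(500); // fewer RPCs per scan
              ResultScanner scanner = table.getScanner(scan);
              PrintWriter out = new PrintWriter(
                  fs.create(new Path("/user/stu/bayes-input/" + label), true));
              try {
                for (Result row : scanner) {
                  String text = Bytes.toString(row.getValue(FAMILY, QUALIFIER));
                  // one document per line: label, tab, whitespace-normalized text
                  out.println(label + '\t' + text.replaceAll("\\s+", " "));
                }
              } finally {
                out.close();
                scanner.close();
                table.close();
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    }
  }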



Take care, 

   -stu



________________________________
 From: Robin Anil <robin.anil@gmail.com>
To: user@mahout.apache.org; Stuart Smith <stu24mail@yahoo.com> 
Cc: Mahout List <mahout-user@lucene.apache.org> 
Sent: Tuesday, April 3, 2012 1:00 PM
Subject: Re: Naive Bayes training filling up jobcache
 

which version are you using? bayes.* or naivebayes.*
------
Robin Anil



On Tue, Apr 3, 2012 at 2:26 PM, Stuart Smith <stu24mail@yahoo.com> wrote:

> Hello all,
>
> I've got Naive Bayes working pretty well. Now I want to train a much bigger model:
> from about 100,000 samples in each category to about a million.
>
> Everything starts ok - then the map/reduce workers keep filling up the jobcache, and therefore
> the disk, and everything grinds to a halt.
>
> Granted, it may be more of a hadoop question... but it also seems that there's not much
> you can do about it (posted responses to other people include "make sure you have bigger disks"
> - but I don't...). Also, naive bayes is the only task I've run that fills up the jobcache
> on the tasktrackers.. I have 40-50 GB free on the temp dir.. not great, but passable.
>
> So, I'm left with wondering:
>
> Is there any tuning I could do to the Naive Bayes Classifier to make it use less jobcache
> space?
>
> Right now, I'm down to running 1 map task on every machine.. even with 5 it filled up
> the jobcache. I can also run more, wait for it to fill up & crash, then clear the cache
> out by hand, restart... it recovers and gets farther, then crashes, repeat... Not sure which
> approach is faster at this point.. 1 map task per node goes slooow...
>
> Take care,
>   -stu
>
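
For reference, the only knobs I've found so far are tasktracker-side Hadoop settings rather
than anything in Mahout itself: cap concurrent maps per node, compress intermediate map
output so the jobcache spills are smaller, and spread mapred.local.dir across whatever disks
have room. Roughly this in mapred-site.xml on each node (property names are from the Hadoop
0.20/1.x docs; the values are just what I'm trying on my cluster, not recommendations):

  <!-- mapred-site.xml on each tasktracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value> <!-- cap concurrent map tasks per node so the jobcache grows more slowly -->
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value> <!-- compress intermediate map output, a big chunk of what lands under mapred.local.dir -->
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data1/mapred/local,/data2/mapred/local</value> <!-- spread local/jobcache dirs over disks with free space -->
  </property>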