mahout-user mailing list archives

From Stuart Smith
Subject Naive Bayes training filling up jobcache
Date Tue, 03 Apr 2012 19:26:36 GMT
Hello all,

  I've got Naive Bayes working pretty well. Now I want to train a much bigger model, going
from about 100,000 samples in each category to about a million.

Everything starts OK, then the map/reduce workers keep filling up the jobcache, and therefore
the disk, and everything grinds to a halt.

Granted, it may be more of a Hadoop question... but it also seems that there's not much you
can do about it (posted responses to other people include "make sure you have bigger disks",
but I don't have any...). Also, Naive Bayes is the only task I've run that fills up the jobcache
on the tasktrackers. I have 40-50 GB free on the temp dir: not great, but passable.

So, I'm left wondering:

Is there any tuning I could do to the Naive Bayes classifier to make it use less jobcache?

Right now, I'm down to running 1 map task on every machine; even with 5 it filled up the
jobcache. I can also run more, wait for it to fill up and crash, then clear the cache out
by hand and restart. It recovers and gets farther, then crashes again, and so on. I'm not sure
which approach is faster at this point; 1 map task per node goes slooow...
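
For anyone wanting to see which jobs are taking the space before clearing things out by hand, here is a minimal sketch that totals the size of each job's directory under a jobcache directory. It is only an illustration: the function names `dir_size_bytes` and `jobcache_usage` are made up for this example, and the location of the jobcache directory (somewhere under the tasktracker's local dir) depends on the Hadoop version and configuration, so it is passed in rather than assumed.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                # Skip symlinks so linked files are not counted twice.
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # Files can disappear while tasks are running; ignore them.
                pass
    return total


def jobcache_usage(jobcache_dir):
    """Map each immediate subdirectory (assumed: one per job) to its size."""
    usage = {}
    for entry in sorted(os.listdir(jobcache_dir)):
        full = os.path.join(jobcache_dir, entry)
        if os.path.isdir(full):
            usage[entry] = dir_size_bytes(full)
    return usage
```

Calling `jobcache_usage` with the jobcache path on a tasktracker node returns a dict of directory name to bytes, which can be sorted by value to find the largest entries.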

Take care,
