mahout-user mailing list archives

From Robin Anil
Subject Re: Incremental training of Classifier
Date Mon, 28 Dec 2009 19:24:11 GMT
Hi Mani,

Short answer: currently you need to retrain the model.

Long answer: you can use the 600 already-classified docs as a test set to
measure the classifier's accuracy. Then retrain on the entire document set
(old plus new) and evaluate again on a separate test data set. That way, each
day you can choose to include or exclude the 600 documents that come in and
keep your classifier at top performance. After a certain number of documents
you don't get much benefit from retraining; further training would only add
overfitting errors.
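
The daily loop described above could be sketched roughly as follows. This is
not Mahout code: a toy Naive Bayes written in plain Python stands in for the
Mahout trainer, and the names `daily_update`, `corpus`, `new_labeled` and
`held_out` are made up for illustration. It assumes you keep a fixed
held-out test set that is never used for training.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Train a toy multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the most likely label, using add-one (Laplace) smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs):
    """Fraction of (text, label) pairs that the model labels correctly."""
    correct = sum(1 for text, label in docs if classify(model, text) == label)
    return correct / len(docs)


def daily_update(corpus, new_labeled, held_out):
    """Test on the new batch, retrain from scratch, keep the better model."""
    current = train(corpus)
    # Step 1: the new labeled docs double as a fresh test set.
    batch_acc = accuracy(current, new_labeled)
    # Step 2: full retrain on old + new documents (no incremental update).
    candidate_corpus = corpus + new_labeled
    candidate = train(candidate_corpus)
    # Step 3: include the batch only if held-out accuracy does not drop.
    if accuracy(candidate, held_out) >= accuracy(current, held_out):
        return candidate_corpus, candidate, batch_acc
    return corpus, current, batch_acc
```

If `batch_acc` and the held-out accuracy stop improving from day to day, that
is a sign that adding more documents is no longer helping.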


On Tue, Dec 29, 2009 at 12:46 AM, Mani Kumar wrote:

> Hi All,
> I have run the 20newsgroups example and got a very good idea of how the
> cluster works for a defined dataset.
> But I have a slightly different situation here.
> * I have a few thousand documents (50k).
> * Every day I get some new documents, e.g. 1k, of which 600 are already
> classified, so I need to classify only 400 documents each day.
> So my approach would be:
> 1. Get all the documents into HDFS.
> 2. Train the classifier on the data in HDFS.
> 3. Classify the new unclassified documents.
> Right now I don't see a way to add more training documents (the 600 already
> classified docs) into the system. Am I missing something?
> Also, I don't want to remove the training model and then create it again.
> Thanks!
> Mani Kumar
