mahout-user mailing list archives

From David Engel <da...@istwok.net>
Subject Seeking classification advice
Date Thu, 07 Jun 2012 22:31:56 GMT
Hi,

I've been dabbling with Mahout off and on for a few months preparing
for a classification project.  It's now time to stop experimenting and
do something for real.  I've picked up a lot of things from following
this list, but would like some advice regarding a few things before
proceeding.  I'll start with a very brief description of the project
and then follow up with some questions.

We need to classify potentially millions of documents into about 100
or so categories.  Most documents will probably only belong to 1
category, but some will belong to several.  It's also possible for
some documents to not belong to any of the chosen categories.

As noted, we need to handle the case where a document belongs to
multiple categories.  My understanding is the classification
algorithms are primarily geared to classifying an item into one
category and we would need to run multiple classifiers in parallel to
match multiple categories.  Is that correct?  I found something in the
subversion logs referencing "multilabel" support that sounded
interesting, but it was removed a few weeks ago.  Is that of any
relevance?
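
To make the "multiple classifiers in parallel" idea concrete, here is
a rough sketch of what I have in mind.  It is plain Python, not Mahout
code, and the keyword scorers, category names and 0.5 threshold are
all made up for illustration; real binary classifiers would take
their place.

```python
# Hypothetical sketch: one binary scorer per category, each run independently.
# The scorers are toy keyword counters standing in for trained models.

def make_keyword_scorer(keywords):
    """Return a scorer giving the fraction of keywords present in the text."""
    def score(text):
        words = set(text.lower().split())
        return sum(1 for k in keywords if k in words) / len(keywords)
    return score

# Assumed example categories; a real project would have about 100.
SCORERS = {
    "sports": make_keyword_scorer(["game", "team", "score"]),
    "finance": make_keyword_scorer(["stock", "market", "price"]),
}

def classify_multilabel(text, scorers=SCORERS, threshold=0.5):
    """Return every category whose scorer meets the threshold (may be empty)."""
    return sorted(c for c, s in scorers.items() if s(text) >= threshold)
```

A document can come back with zero, one or several categories,
depending on how many scorers clear the threshold.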

Also as noted, we need to handle the case where a document belongs to
no categories.  Do any of the classification algorithms support the
concept of an implicit "other" or "none" category, or do we need to add
an explicit one?  If the latter, how many training samples should that
category have relative to the number of samples for each target
category?
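
By an implicit "none" category I mean something like the sketch
below: take the best-scoring category and fall back to "none" when
its score is too low.  This is plain Python for illustration only;
the function name, the score dictionary and the 0.6 cutoff are my own
assumptions and not anything Mahout provides.

```python
# Hypothetical sketch of an implicit "none" category via a score cutoff.

def classify_with_reject(scores, min_score=0.6, none_label="none"):
    """Pick the top-scoring category, or none_label if it scores too low.

    scores: dict mapping category name to a score in [0, 1].
    """
    if not scores:
        return none_label
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else none_label
```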

Finally, I recall seeing on this list that some of the classification
algorithms break down if more than 20 to 30 categories are used and
that multiple classifiers should be used hierarchically when more
categories are needed.  Is that still correct?  If so, is there any
preferred way to organize the cascaded classifiers?  I'm currently
analyzing the documents we will use for training to see which
categories often, seldom or never occur together.
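
For reference, the co-occurrence analysis amounts to counting how
often each pair of categories appears on the same training document.
A minimal sketch in plain Python follows; the function name and the
input format (one list of category labels per document) are
assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(doc_labels):
    """Count how many documents carry each unordered pair of categories.

    doc_labels: iterable of label lists, one list per document.
    Returns a Counter keyed by alphabetically ordered (label, label) pairs.
    """
    pair_counts = Counter()
    for labels in doc_labels:
        # De-duplicate and sort so each pair is counted once per document.
        for a, b in combinations(sorted(set(labels)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts
```

Pairs with high counts are candidates for grouping under one parent
classifier, and pairs absent from the result never occur together.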

David
-- 
David Engel
david@istwok.net
