mahout-user mailing list archives

From Dmitriy Lyubimov <>
Subject Re: Identify "less similar" documents
Date Wed, 13 Apr 2011 22:41:26 GMT
I suspect part of the problem might be creating the training set for
'other', since those documents are distinctly 'different' from
anything else, including from each other.
I guess the definition of the 'other' category is 'low relevance to
every category trained so far' rather than 'high relevance to some
category called "other"'.

As such, I think it is implied by definition that training for that
category is not possible, but perhaps a cut-off threshold on the
regressed posterior across all categories would help: if no category
scores above the threshold, treat the document as 'other'. But that
would be surgery on the learner itself; I can't recollect whether it
is exposed by the learner API.
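
A minimal sketch of the cut-off idea, assuming the learner can hand
back a per-category posterior vector. The function name, the labels
and the 0.5 threshold are hypothetical and only for illustration; this
is not Mahout's API, and a sensible threshold would have to be tuned
on held-out data.

```python
from typing import Sequence


def classify_with_cutoff(
    posteriors: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.5,      # hypothetical value, tune on held-out data
    other_label: str = "other",  # hypothetical name for the reject category
) -> str:
    """Return the best-scoring label, or other_label if no category
    reaches the threshold."""
    if not labels or len(posteriors) != len(labels):
        raise ValueError("posteriors and labels must be non-empty and equal length")

    # Index of the category with the highest posterior.
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])

    # Low relevance to every trained category means 'other'.
    if posteriors[best] < threshold:
        return other_label
    return labels[best]


if __name__ == "__main__":
    cats = ["sports", "politics", "tech"]
    print(classify_with_cutoff([0.90, 0.05, 0.05], cats))
    print(classify_with_cutoff([0.40, 0.30, 0.30], cats))
```

With this approach no training examples are needed for 'other'; the
category is defined purely by the absence of a confident match.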

On Wed, Apr 13, 2011 at 8:34 AM, Ted Dunning <> wrote:
> I think that what you are doing is inventing an "other" category and
> building a classifier for that category.
> Why not just train with those documents and put a category tag of "other" on
> them and run normal categorization?  If you can distinguish these documents
> by word frequencies, then this should do the trick.
