mahout-user mailing list archives

From Daniel McEnnis <>
Subject Re: Identify "less similar" documents
Date Wed, 13 Apr 2011 23:27:03 GMT
The official solution is to assign outliers in the training set to
"other".  Outliers are defined as points with a high mean distance to
the other points.  A hack to get this to work would be to perform a
kNN-like distance comparison against every trained set and classify as
"other" anything whose distance exceeds a threshold.  This is a
variation of the technique already mentioned.
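
A minimal sketch of that hack, in plain Python rather than Mahout code.
The function names, the Euclidean metric, k and the threshold value are
all assumptions made for illustration, and each trained set is assumed
to be a non-empty list of numeric feature vectors:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_knn_distance(point, examples, k):
    # Mean distance from `point` to its k nearest examples in one set.
    # Assumes `examples` is non-empty.
    nearest = sorted(euclidean(point, e) for e in examples)[:k]
    return sum(nearest) / len(nearest)

def classify_with_other(point, training_sets, k=3, threshold=1.0):
    # Find the category whose examples lie closest to `point`; if even
    # that category is farther away than `threshold`, answer "other".
    best_label, best_dist = None, float("inf")
    for label, examples in training_sets.items():
        d = mean_knn_distance(point, examples, k)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else "other"
```

The threshold would have to be tuned on held-out data; the value above
is only a placeholder.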


On Wed, Apr 13, 2011 at 6:41 PM, Dmitriy Lyubimov <> wrote:
> I suspect part of the problem might be creating the training set for
> 'other', since those documents are distinctly 'different' from
> anything else, including from each other.
> I guess the definition of the 'other' category is 'low relevance
> to everything trained so far', not 'high relevance to some category
> "other"'.
> As such, I think it is implied by definition that training for that
> stuff is not possible, but perhaps some cut-off threshold on the
> regressed posterior for all categories would help. But that's
> surgery on the learner itself; I can't recollect whether it is
> exposed by the learner API.
> On Wed, Apr 13, 2011 at 8:34 AM, Ted Dunning <> wrote:
>> I think that what you are doing is inventing an "other" category and
>> building a classifier for that category.
>> Why not just train with those documents and put a category tag of "other" on
>> them and run normal categorization?  If you can distinguish these documents
>> by word frequencies, then this should do the trick.
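
For comparison, the posterior cut-off suggested in the quoted message
could be sketched as below.  This is plain Python, not the Mahout
learner API; the function name, the dict-of-scores input and the cutoff
value are assumptions for illustration only:

```python
def classify_with_cutoff(posteriors, cutoff=0.6):
    # `posteriors` maps each trained category to its posterior score.
    # Take the top-scoring category, but fall back to "other" when even
    # the best score does not reach the cutoff.
    label, score = max(posteriors.items(), key=lambda kv: kv[1])
    return label if score >= cutoff else "other"
```

This leaves the learner untouched and only post-processes its scores,
so it applies only if the per-category scores are available.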
