mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Identify "less similar" documents
Date Thu, 14 Apr 2011 00:05:04 GMT
I think that our estimation of whether this would work differs a bit.  In
the very high dimensional space that we are working in, proximities can be a
bit surprising.

For one thing, the bias term provides a mechanism by which a logistic
regression can attribute score to an "other" category.  This allows the
algorithm to implement something much like a score threshold.

For another, since the scores that come out of the SGD logistic regression
model always sum to 1, it is important to have an "other" category so that
the model can express "low general relevance" at all.
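
To make the two points above concrete, here is a minimal sketch in plain
Python. It does not use the Mahout API; the logit values, the helper name
softmax, and the choice of a bias-only "other" category with bias 0.0 are all
assumptions made up for illustration.

```python
import math

def softmax(logits):
    """Turn per-category logits into scores that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits (w . x + b) for an off-topic document:
# every trained category matches weakly.
weak_logits = [-2.0, -2.5, -2.2]

# Without an "other" category the scores are forced to sum to 1,
# so some trained category still ends up with a sizeable share.
without_other = softmax(weak_logits)

# Assumed "other" category driven only by its bias term (no feature weights).
other_bias = 0.0
with_other = softmax(weak_logits + [other_bias])

# Hypothetical on-topic document: the first category matches strongly.
strong_logits = [3.0, -2.5, -2.2]
strong_with_other = softmax(strong_logits + [other_bias])

print(without_other)
print(with_other)
print(strong_with_other)
```

In this sketch a trained category beats "other" only when its logit exceeds
the "other" bias, which is the sense in which the bias behaves much like a
score threshold.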

But, in this case as in many others, I would recommend you try it out.

On Wed, Apr 13, 2011 at 3:41 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> I guess the definition of the 'other' category is 'low relevance
> for everything yet trained' but not 'high relevance to some category
> "other"'.
>
> As such, I think it is implied by definition that training for that
> stuff is not possible, but perhaps some cut-off threshold on the
> regressed posterior for all categories would help. But that's
> surgery on the learner itself; I can't recollect whether it is exposed
> by the learner API.
>
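
The cut-off idea in the quoted message can be sketched as a post-processing
step on the model's output scores, which avoids touching the learner at all.
This is only an illustration in plain Python: the function name
classify_with_cutoff, the labels, and the cutoff value 0.6 are hypothetical,
and nothing here is taken from the Mahout learner API.

```python
def classify_with_cutoff(posteriors, labels, cutoff=0.6):
    """Return the best label, or "other" if no posterior reaches the cutoff.

    posteriors: per-category scores from a trained classifier (sum to 1).
    labels:     category names, in the same order as posteriors.
    cutoff:     assumed threshold; it would need tuning on held-out data.
    """
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    if posteriors[best] < cutoff:
        return "other"
    return labels[best]

labels = ["sports", "politics", "science"]  # hypothetical categories

# No category clearly dominates, so the document falls through to "other".
vague = classify_with_cutoff([0.41, 0.25, 0.34], labels)

# One category clearly dominates, so its label is returned.
clear = classify_with_cutoff([0.90, 0.05, 0.05], labels)

print(vague, clear)
```

A fixed cut-off like this has to be chosen by hand, whereas a trained "other"
category lets the model learn the equivalent threshold through its bias term.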
