mahout-user mailing list archives

From David Rahman <drahman1...@googlemail.com>
Subject Re: confidence values of one (or more) feature(s)
Date Thu, 03 Nov 2011 15:43:19 GMT
Hi Ted,

thank you for the explanation.
For example, imagine a term cloud in which terms are displayed. Some terms
are bigger than others because they are more likely than the other terms. I
need results like that for analysis. We want to compare different
ML algorithms and methods and/or combinations of them, and first I have to
gain some basic knowledge about Mahout.

For example, when I take the word 'social' as input, I'd like to get this
result:

social               1.0
social media         0.8
social networking    0.65
social news          0.6
facebook             0.5
...

(Ignore those values; they are not correct, but they show what I need.)

The 20Newsgroup example uses the summary(int n) method to show the most
likely categorisation of a term (--> the most important feature). I would
like a list that also contains the second, third, and further most important
features. I imagine that, while the features are computed, only the most
important ones are added to the list and the less important ones are rejected.
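
To make the request concrete, here is a minimal sketch of the kind of ranked
list I mean. It does not use Mahout's API: TopFeatures and topN are
hypothetical names, it assumes the feature weights have already been pulled
out of a model into a plain map, and the weights are the same made-up values
as above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: ranks feature weights that have
// already been extracted from a model into a plain map.
class TopFeatures {

    // Returns up to n entries, ordered by descending absolute weight.
    static List<Map.Entry<String, Double>> topN(Map<String, Double> weights, int n) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(weights.entrySet());
        ranked.sort(Comparator.comparingDouble(
                (Map.Entry<String, Double> e) -> Math.abs(e.getValue())).reversed());
        int limit = Math.max(0, Math.min(n, ranked.size()));
        return new ArrayList<>(ranked.subList(0, limit));
    }

    public static void main(String[] args) {
        // Placeholder weights for illustration only, not real model output.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("facebook", 0.5);
        weights.put("social news", 0.6);
        weights.put("social", 1.0);
        weights.put("social networking", 0.65);
        weights.put("social media", 0.8);

        // Print the three highest-ranked features with their weights.
        for (Map.Entry<String, Double> e : topN(weights, 3)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Ranking by absolute weight is an assumption on my side, so that strongly
negative features would also show up; sorting by the raw value would work the
same way.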

Thanks and regards,
David

2011/11/3 Ted Dunning <ted.dunning@gmail.com>

> There are no confidence values per se in the models computed by Mahout at
> this time.
>
> There are several issues here:
>
> 1) Naive Bayes doesn't have such a concept.  'Nuff said there.
>
> 2) SGD logistic regression could compute confidence intervals, but I am
> not quite sure how to do that with stochastic gradient descent.
>
> 3) in most uses of Mahout's logistic regression, the issues are data size
> and feature set size.  Confidence values are typically used for selecting
> features, which is typically not a viable strategy for problems with very
> large feature sets.  That is what the L1 regularization is all about.
>
> 4) with an extremely large number of features, the noise on confidence
> intervals makes them very hard to understand.
>
> 5) with hashed features and feature collisions it is hard enough to
> understand which feature is doing what, much less what the confidence
> interval means.
>
> Can you say more about your problem?  Is it small enough to use bayesglm in
> R?
>
> On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <drahman1985@googlemail.com>
> wrote:
>
> > Me again,
> >
> > can someone point me in the right direction? How can I access these
> > features? I looked into the summary(int n) method located in
> > org.apache.mahout.classifier.sgd.ModelDissector.java, but somehow I don't
> > understand how it works.
> >
> > Could someone explain to me how it works? As I understand it, it returns
> > just the max value of a feature.
> >
> > Thanks and regards,
> > David
> >
> > 2011/10/20 David Rahman <drahman1985@googlemail.com>
> >
> > > Hi,
> > >
> > > how can I access the confidence values of one (or more) feature(s) with
> > > its possibilities?
> > >
> > > In the 20Newsgroup example, there is the dissect method, which uses
> > > summary(int n) to return the n most important features with their
> > > weights. I also want the features that are placed second or third (or
> > > lower). How can I access those?
> > >
> > > Regards,
> > > David
> > >
> >
>
