mahout-user mailing list archives

From David Rahman <drahman1...@googlemail.com>
Subject Re: confidence values of one (or more) feature(s)
Date Thu, 03 Nov 2011 19:19:45 GMT
Hi Ted,

I want to have the model explain why it classified documents in a certain
way. That should be enough at first.

I want to classify documents; each document has a corresponding set of
keywords. The model should be able to classify unknown documents and
suggest a number of keywords for them. Later on, it should be possible
to build a search-term recommender for a search engine, using the
classified documents as a basis.
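
As a rough illustration of the "suggest a number of keywords" step, here is a minimal sketch in plain Java. It assumes the classifier has already produced one score per candidate keyword; the class name `KeywordSuggester`, the method `topN`, and the scores in `main` are hypothetical placeholders, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks candidate keywords by a per-keyword score.
// The scores are assumed to come from some classifier; none is called here.
class KeywordSuggester {

    /** Returns up to n keywords with the highest scores, best first. */
    static List<Map.Entry<String, Double>> topN(Map<String, Double> scores, int n) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        // Sort by score, highest first.
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        // Clamp n to the valid range so short maps and n <= 0 are handled.
        int limit = Math.max(0, Math.min(n, ranked.size()));
        return new ArrayList<>(ranked.subList(0, limit));
    }

    public static void main(String[] args) {
        // Made-up scores for illustration only.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("facebook", 0.5);
        scores.put("social", 1.0);
        scores.put("social news", 0.6);
        scores.put("social media", 0.8);

        for (Map.Entry<String, Double> e : topN(scores, 3)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Keeping the whole ranked list instead of only the single best entry is what makes second and third suggestions available.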

At first we wanted to use the Lucene data, but the existing data was built
with an older Lucene version, so for now the data is provided as XML. It's
like the Wikipedia example, only with more possible keywords.

Hope it's understandable.

Thanks for your patience, and regards,
David

2011/11/3 Ted Dunning <ted.dunning@gmail.com>

> I am sorry for being dense, but I don't really understand what you are
> trying to do.
>
> As I see it,
>
> - the input is documents
>
> - the output is a category
>
> You want one or more of the following,
>
> - to have the model explain why it classified documents a certain way
>
> or
>
> - to classify non-document phrases a certain way
>
> or
>
> - to have the model show its internal structure to you
>
> or
>
> - something else entirely
>
> Can you say what you want in these terms?
>
> On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <drahman1985@googlemail.com> wrote:
>
> > Hi Ted,
> >
> > thank you for the explanation.
> > For example, imagine a term cloud in which terms are presented. Some
> > terms are bigger than others because they are more likely than the
> > other terms. I would need those results for analysis. We want to compare
> > different ML algorithms and methods and/or combinations of them. And
> > first I have to gain some basic knowledge about Mahout.
> >
> > For example, when I take the word 'social' as input I'd like to have that
> > result:
> >
> > social               1.0
> > social media         0.8
> > social networking    0.65
> > social news          0.6
> > facebook             0.5
> > ...
> >
> > (ignore those values, they're not correct, but they should show what I need)
> >
> > The 20Newsgroup example shows, with the summary(int n) method, the most
> > likely categorisation of a term (--> the most important feature). I would
> > like to have a list with the second, third, and so on most important
> > features. I imagine that, while computing the features, only the most
> > important ones are added to the list and the less important features are
> > rejected.
> >
> > Thanks and regards,
> > David
> >
> > 2011/11/3 Ted Dunning <ted.dunning@gmail.com>
> >
> > > There are no confidence values per se in the models computed by Mahout
> > > at this time.
> > >
> > > There are several issues here,
> > >
> > > 1) Naive Bayes doesn't have such a concept.  'Nuff said there.
> > >
> > > 2) SGD logistic regression could compute confidence intervals, but I
> > > am not quite sure how to do that with stochastic gradient descent.
> > >
> > > 3) in most uses of Mahout's logistic regression, the issues are data
> > > size and feature set size.  Confidence values are typically used for
> > > selecting features, which is typically not a viable strategy for
> > > problems with very large feature sets.  That is what the L1
> > > regularization is all about.
> > >
> > > 4) with an extremely large number of features, the noise on confidence
> > > intervals makes them very hard to understand
> > >
> > > 5) with hashed features and feature collisions it is hard enough to
> > > understand which feature is doing what, much less what the confidence
> > > interval means.
> > >
> > > Can you say more about your problem?  Is it small enough to use
> > > bayesglm in R?
> > >
> > > On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <drahman1985@googlemail.com> wrote:
> > >
> > > > Me again,
> > > >
> > > > can someone point me in the right direction? How can I access these
> > > > features? I looked into the summary(int n) method located in
> > > > org.apache.mahout.classifier.sgd.ModelDissector.java, but somehow I
> > > > don't understand how it works.
> > > >
> > > > Could someone explain to me how it works? As I understand it, it
> > > > returns just the max-value of a feature.
> > > >
> > > > Thanks and regards,
> > > > David
> > > >
> > > > 2011/10/20 David Rahman <drahman1985@googlemail.com>
> > > >
> > > > > Hi,
> > > > >
> > > > > how can I access the confidence values of one (or more) feature(s)
> > > > > with its possibilities?
> > > > >
> > > > > In the 20Newsgroup example there is the dissect method, which uses
> > > > > summary(int n) to return the n most important features with their
> > > > > weights. I also want the features that are placed second or third
> > > > > (or lower). How can I access those?
> > > > >
> > > > > Regards,
> > > > > David
> > > > >
> > > >
> > >
> >
>
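
The question quoted above, about seeing more than the single maximum value for a feature, can be illustrated without any Mahout code. The sketch below assumes a linear model stored as a plain `double[category][feature]` weight matrix and lists every category for one feature, ordered by weight. The class `FeatureWeightRanking`, the method `rankCategories`, and the numbers are hypothetical; this is not how `ModelDissector` is implemented, and with hashed features a column may mix several original features.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: rank all categories for one feature by weight,
// instead of reporting only the category with the largest weight.
class FeatureWeightRanking {

    /** Returns category indices ordered by descending weight for the given feature. */
    static List<Integer> rankCategories(double[][] weights, int feature) {
        List<Integer> order = new ArrayList<>();
        for (int c = 0; c < weights.length; c++) {
            order.add(c);
        }
        // Compare the weights of two categories for the chosen feature column.
        order.sort((a, b) -> Double.compare(weights[b][feature], weights[a][feature]));
        return order;
    }

    public static void main(String[] args) {
        // Made-up weights: 3 categories (rows) x 2 features (columns).
        double[][] weights = {
            {0.2, 0.9},
            {1.5, 0.3},
            {0.7, -0.4},
        };
        for (int category : rankCategories(weights, 0)) {
            System.out.println("category " + category + "\t" + weights[category][0]);
        }
    }
}
```

These are raw model weights, not confidence values in the statistical sense discussed in the thread.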
