mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: confidence values of one (or more) feature(s)
Date Thu, 03 Nov 2011 20:24:49 GMT
OK.

So the simplest design in Mahout terms is a binary classifier for each
keyword (if the keywords are not mutually exclusive).  If you can define a
useful ordering for terms or have some logical entailment, you may want to
allow the presence of some terms to be features for certain other terms.
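
A minimal sketch of the one-binary-classifier-per-keyword design, in plain Java so that it runs without Mahout. The class name KeywordSuggester, its methods, and the plain SGD update are assumptions made for illustration; they are not Mahout APIs, and a real setup would use Mahout's own logistic regression and feature encoders instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration: one independent binary logistic model per keyword. */
class KeywordSuggester {
    private final Map<String, double[]> weights = new HashMap<>();
    private final int numFeatures;
    private final double learningRate;

    KeywordSuggester(int numFeatures, double learningRate) {
        this.numFeatures = numFeatures;
        this.learningRate = learningRate;
    }

    private static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Probability that the keyword applies to the document; 0.5 for a keyword never seen in training. */
    double score(String keyword, double[] x) {
        double[] w = weights.get(keyword);
        if (w == null) {
            return 0.5;
        }
        double z = 0;
        for (int i = 0; i < numFeatures; i++) {
            z += w[i] * x[i];
        }
        return sigmoid(z);
    }

    /** One SGD step for every known keyword: the target is 1 if the document carries it, else 0. */
    void train(double[] x, Set<String> docKeywords) {
        for (String k : docKeywords) {
            weights.computeIfAbsent(k, key -> new double[numFeatures]);
        }
        for (Map.Entry<String, double[]> e : weights.entrySet()) {
            double target = docKeywords.contains(e.getKey()) ? 1.0 : 0.0;
            double error = target - score(e.getKey(), x);
            double[] w = e.getValue();
            for (int i = 0; i < numFeatures; i++) {
                w[i] += learningRate * error * x[i];
            }
        }
    }

    /** Keywords whose model scores above the threshold, best first. */
    List<String> suggest(double[] x, double threshold) {
        List<String> out = new ArrayList<>();
        for (String k : weights.keySet()) {
            if (score(k, x) > threshold) {
                out.add(k);
            }
        }
        out.sort((a, b) -> Double.compare(score(b, x), score(a, x)));
        return out;
    }
}
```

Because the keywords are not mutually exclusive, each model is trained and scored on its own, so a document can come back with several suggested keywords or with none.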

So the question boils down to how to ask a binary logistic regression how
it came to its conclusion.

You are correct to look to the model dissector for the function you want,
but you will have to call it in a slightly unusual way because it is
really intended to describe a model rather than a single decision.  The
logistic regression functions in Mahout don't actually expose quite as much
information as you need for this, but if you add this method, you should
get the basic information you need:

  /**
   * Return the element-wise product of the feature vector versus each column
   * of the beta matrix.  This can then be used to extract the most interesting
   * features for a decision for each alternative output.
   * @param instance  A feature vector
   * @return   A matrix like beta but with each column multiplied by instance.
   */
  public Matrix explain(Vector instance) {
    regularize(instance);
    Matrix r = beta.like().assign(beta);
    for (int column = 0; column < r.columnSize(); column++) {
      r.viewColumn(column).assign(instance, Functions.MULT);
    }
    return r;
  }
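
To make the idea concrete, here is a plain-array sketch of the same element-wise product for a single binary model. The class ExplainSketch and the numbers are made up for illustration and are not part of Mahout; the point is only that the products add up to the linear score (before the link function), so each entry is that feature's share of the decision.

```java
import java.util.Arrays;

/** Hypothetical plain-array illustration of an element-wise "explain" step. */
final class ExplainSketch {
    /** Multiplies each weight by the matching feature value. */
    static double[] contributions(double[] beta, double[] instance) {
        if (beta.length != instance.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double[] r = new double[beta.length];
        for (int i = 0; i < beta.length; i++) {
            r[i] = beta[i] * instance[i];
        }
        return r;
    }

    public static void main(String[] args) {
        double[] beta = {2.0, -1.0, 0.5};  // made-up weights
        double[] x = {1.0, 3.0, 0.0};      // made-up feature vector
        // prints [2.0, -3.0, 0.0]
        System.out.println(Arrays.toString(contributions(beta, x)));
    }
}
```

In this made-up case the second feature pushes hardest against the positive class, and the third contributes nothing because it is absent from the instance.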


Then to explain your binary model, you probably want some code like this:

   Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
   Vector instance = encode(data, traceDictionary);
   Matrix b = model.explain(instance);

   ModelDissector md = new ModelDissector();
   // get positive terms
   md.update(b.getColumn(0), traceDictionary, model);
   // scan through the top terms
   ...

   md = new ModelDissector();
   // get negative terms by flipping the signs
   md.update(b.getColumn(0).assign(Functions.NEGATE), traceDictionary, model);
   // scan through the most negative terms
   ...

Note that all of this code is untested and I could be out to lunch here.
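
As a rough, Mahout-free sketch of what "scan through the top terms" could look like: given the per-location contributions for one instance and a trace dictionary mapping each term to the locations it was hashed to, sum the contributions per term and sort. The class TermRanker and its methods are hypothetical names for illustration. With hashed features, a location shared by two colliding terms is counted for both, so the ranking is only approximate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: rank terms by summed contribution over their hashed locations. */
final class TermRanker {
    /** Sums contribution[i] over every location a term was hashed to. */
    static Map<String, Double> termScores(double[] contribution,
                                          Map<String, Set<Integer>> traceDictionary) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Set<Integer>> e : traceDictionary.entrySet()) {
            double sum = 0;
            for (int where : e.getValue()) {
                sum += contribution[where];
            }
            scores.put(e.getKey(), sum);
        }
        return scores;
    }

    /** Top n terms, most positive first; pass negate=true for most negative first. */
    static List<String> top(Map<String, Double> scores, int n, boolean negate) {
        double sign = negate ? -1.0 : 1.0;
        List<String> terms = new ArrayList<>(scores.keySet());
        // Sorting ascending on -sign * score puts the largest signed score first.
        terms.sort(Comparator.comparingDouble((String t) -> -sign * scores.get(t)));
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

Calling top(scores, n, false) gives the terms arguing most strongly for the positive class, and top(scores, n, true) gives the ones arguing most strongly against it.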




On Thu, Nov 3, 2011 at 12:19 PM, David Rahman <drahman1985@googlemail.com> wrote:

> Hi Ted,
>
> I want to have the model explain why it classified documents in a certain
> way. That should be enough at first.
>
> I want to classify documents, each document has a corresponding set of
> keywords. The model should be able to classify unknown documents and
> provide a number of suggestions of keywords. Later on it should be possible
> to build a search term recommender for a search engine with classified
> documents as a basis.
>
> At first we wanted to use the Lucene data, but the existing data is built
> with an older Lucene version, so the data is provided in xml, for now. It's
> like the wikipedia example, only with more possible keywords.
>
> Hope it's understandable.
>
> Thanks for your endurance and regards,
> David
>
> 2011/11/3 Ted Dunning <ted.dunning@gmail.com>
>
> > I am sorry for being dense, but I don't really understand what you are
> > trying to do.
> >
> > As I see it,
> >
> > - the input is documents
> >
> > - the output is a category
> >
> > You want one or more of the following,
> >
> > - to have the model explain why it classified documents a certain way
> >
> > or
> >
> > - to classify non-document phrases a certain way
> >
> > or
> >
> > - to have the model show its internal structure to you
> >
> > or
> >
> > - something else entirely
> >
> > Can you say what you want in these terms?
> >
> > On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <drahman1985@googlemail.com> wrote:
> >
> > > Hi Ted,
> > >
> > > thank you for the explanation.
> > > For example imagine a term cloud, in which terms are presented. Some terms
> > > are bigger than others, because they are more likely than the other terms. I
> > > would need those results for analysis. We want to compare different
> > > ML-algorithms and methods and/or combinations of them. And first I have to
> > > gain some basic knowledge about Mahout.
> > >
> > > For example, when I take the word 'social' as input I'd like to have that
> > > result:
> > >
> > > social                    1.0
> > > social media           0.8
> > > social networking    0.65
> > > social news            0.6
> > > facebook                0.5
> > > ...
> > >
> > > (ignore those values, it's not correct, but it should show what I need)
> > >
> > > The 20Newsgroup-example shows with the summary(int n) method the most
> > > likely categorisation of a term (--> the most important feature). I would
> > > like to have a list with the second, third, and so on important feature. I
> > > imagine, while computing the features, only the most important ones are added
> > > to the list and the less important features are rejected.
> > >
> > > Thanks and regards,
> > > David
> > >
> > > 2011/11/3 Ted Dunning <ted.dunning@gmail.com>
> > >
> > > > There are no confidence values per se in the models computed by Mahout at
> > > > this time.
> > > >
> > > > There are several issues here,
> > > >
> > > > 1) Naive Bayes doesn't have such a concept.  'Nuff said there.
> > > >
> > > > 2) SGD logistic regression could compute confidence intervals, but I am
> > > > not quite sure how to do that with stochastic gradient descent.
> > > >
> > > > 3) in most uses of Mahout's logistic regression, the issues are data size
> > > > and feature set size.  Confidence values are typically used for selecting
> > > > features which is typically not a viable strategy for problems with very
> > > > large feature sets.  That is what the L1 regularization is all about.
> > > >
> > > > 4) with an extremely large number of features, the noise on confidence
> > > > intervals makes them very hard to understand
> > > >
> > > > 5) with hashed features and feature collisions it is hard enough to
> > > > understand which feature is doing what, much less what the confidence
> > > > interval means.
> > > >
> > > > Can you say more about your problem?  Is it small enough to use bayesglm in
> > > > R?
> > > >
> > > > On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <drahman1985@googlemail.com> wrote:
> > > >
> > > > > Me again,
> > > > >
> > > > > can someone point me in the right direction? How can I access these features?
> > > > > I looked into the summary(int n) method located in
> > > > > org.apache.mahout.classifier.sgd.ModelDissector.java, but somehow I don't
> > > > > understand how it works.
> > > > >
> > > > > Could someone explain to me how it works? As I understand it, it returns
> > > > > just the max-value of a feature.
> > > > >
> > > > > Thanks and regards,
> > > > > David
> > > > >
> > > > > 2011/10/20 David Rahman <drahman1985@googlemail.com>
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > how can I access the confidence values of one (or more) feature(s) with
> > > > > > its possibilities?
> > > > > >
> > > > > > In the 20Newsgroup-example, there is the dissect method, which uses
> > > > > > summary(int n) to return the n most important features with their
> > > > > > weights. I also want the features which are placed second or third (or
> > > > > > more). How can I access those?
> > > > > >
> > > > > > Regards,
> > > > > > David
> > > > > >
> > > > >
> > > >
> > >
> >
>
