mahout-user mailing list archives

From optimusfan <optimus...@yahoo.com>
Subject Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Date Mon, 02 Dec 2013 16:55:37 GMT
Ted-

Thanks for the response.  Just getting back after the holiday weekend and am catching up
on this.  Let me be more specific in what we're doing and what we're seeing in terms of results.
 Our goal was to created a classifier that could assign one or more of 46 categories to various
documents that it sees.  To accomplish this, we used AdaptiveLogisticRegression and trained
46 binary classification models.  Our approach has been to do an 80/20 split on the data,
holding the 20% back for cross-validation of the models we generate.
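[For readers of the archive: the setup described above, 46 one-vs-rest binary models over an 80/20 split, can be sketched roughly as follows. This is a generic illustration, not Mahout code; the class and method names are hypothetical.]

```java
import java.util.Arrays;
import java.util.List;

public class OneVsRestSplit {
    /** Index at which an 80/20 split cuts a shuffled corpus of size n. */
    public static int splitPoint(int n) {
        return (int) (n * 0.8);
    }

    /** Binary target for the one-vs-rest model of a single category. */
    public static int binaryTarget(List<Integer> docCategories, int category) {
        return docCategories.contains(category) ? 1 : 0;
    }

    public static void main(String[] args) {
        // A document tagged with categories 3 and 17 (out of 46) is a
        // positive example only for those two binary models.
        List<Integer> tags = Arrays.asList(3, 17);
        for (int cat = 0; cat < 46; cat++) {
            int target = binaryTarget(tags, cat);
            // model[cat].train(target, features);  // e.g. one OLR learner per category
        }
        System.out.println(splitPoint(1000)); // 800 docs train, 200 held out
    }
}
```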

We've been playing around with a number of different parameters, feature selection, etc. and
are able to achieve pretty good results in cross-validation.  We have a ton of different
metrics we're tracking on the results, most significant to this discussion is that it looks
like we're achieving very good precision (typically >0.85 or 0.9) and a good F1 score (typically
again >0.85 or 0.9).  However, when we then take the models generated and try to apply them
to some new documents, we're getting many more false positives than we would expect.  Documents
that should have 2 categories are testing positive for 16, which is well above what I'd expect.
 By my math I should expect the 2 true positives, plus maybe 4.4 additional false positives
(a 0.10 false-positive rate across the other 44 classes).
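[For readers of the archive: the back-of-the-envelope arithmetic above can be made explicit. A sketch assuming a roughly constant per-class false-positive rate; names are illustrative.]

```java
public class ExpectedPositives {
    /** Expected positive labels = true positives + fpRate * remaining classes. */
    public static double expected(int truePositives, double fpRate, int otherClasses) {
        return truePositives + fpRate * otherClasses;
    }

    public static void main(String[] args) {
        // 2 real categories plus a ~0.10 false-positive rate over the other 44
        // classes gives roughly 6.4 expected positives, far below the 16 observed.
        System.out.println(expected(2, 0.10, 44));
    }
}
```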

We suspected that perhaps our models were underfitting or overfitting, hence this post.  However,
I'll take any and all suggestions for anything else we should be looking at.

Thanks,
Ian



On Thursday, November 28, 2013 2:20 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
 
Yes.  Exactly.



On Thu, Nov 28, 2013 at 6:32 AM, Vishal Santoshi
<vishal.santoshi@gmail.com> wrote:

> Absolutely. I will read through.  The idea is to first fix the learning
> rate update equation in OLR.
> I think this code in OnlineLogisticRegression is the current equation?
>
> @Override
> public double currentLearningRate() {
>   return mu0 * Math.pow(decayFactor, getStep())
>       * Math.pow(getStep() + stepOffset, forgettingExponent);
> }
>
>
> I presume that you would like an AdaGrad-like solution to replace the above?
>
> On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> > vishal.santoshi@gmail.com> wrote:
> >
> > >
> > >
> > > Are we to assume that SGD is still a work in progress and
> > > implementations (Cross Fold, Online, Adaptive) are too flawed to be
> > > realistically used?
> > >
> >
> > They are too raw to be accepted uncritically, for sure.  They have been
> > used successfully in production.
> >
> >
> > > The evolutionary algorithm seems to be the core of
> > > OnlineLogisticRegression,
> > > which in turn builds up to Adaptive/Cross Fold.
> > >
> > > >>b) for truly on-line learning where no repeated passes through the
> > data..
> > >
> > > What would it take to get to an implementation? How can anyone help?
> > >
> >
> > Would you like to help on this?  The amount of work required to get a
> > distributed asynchronous learner up is moderate, but definitely not huge.
> >
> > I think that OnlineLogisticRegression is basically sound, but should get
> a
> > better learning rate update equation.  That would largely make the
> > Adaptive* stuff unnecessary, especially if OLR could be used in the
> > distributed asynchronous learner.
> >
>
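[For readers of the archive: the AdaGrad-like replacement discussed in the quoted thread keeps a per-feature sum of squared gradients and scales the step size by its inverse square root. A minimal generic sketch, not Mahout code; class and field names are illustrative.]

```java
public class AdaGradRate {
    private final double eta;          // base learning rate
    private final double eps = 1e-8;   // guards against division by zero
    private final double[] sumSqGrad;  // per-feature accumulated squared gradients

    public AdaGradRate(double eta, int numFeatures) {
        this.eta = eta;
        this.sumSqGrad = new double[numFeatures];
    }

    /** Record a gradient for feature i and return its current learning rate. */
    public double update(int i, double gradient) {
        sumSqGrad[i] += gradient * gradient;
        return eta / Math.sqrt(sumSqGrad[i] + eps);
    }
}
```

Unlike the single global decay in currentLearningRate(), this cools down frequently-updated features quickly while rarely-seen features keep a large step, which is why it is attractive for sparse text features.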