mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Date Mon, 30 Dec 2013 01:45:35 GMT
:-)

Many leaks are *very* subtle.

One leak that had me going for weeks was in a news wire corpus.  I couldn't
figure out why the cross validation was so good and running the classifier
on new data was soooo much worse.

The answer was that the training corpus had near-duplicate articles.  This
means that there was leakage between the training and test corpora.  This
wasn't quite a target leak, but it was a leak.
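One way to guard against this kind of near-duplicate leakage is to split by duplicate group rather than by individual document, so that near-copies never straddle the train/test boundary. Here is a minimal sketch; the normalization-and-hash scheme is purely illustrative (a real system would use shingling or MinHash), and all names here are hypothetical:

```python
import hashlib
import random

def near_dup_key(text):
    # Crude near-duplicate key: lowercase, keep only alphanumerics,
    # hash a prefix of the normalized text. Illustration only --
    # real deduplication would use shingling/MinHash.
    norm = "".join(ch for ch in text.lower() if ch.isalnum())
    return hashlib.sha1(norm[:200].encode()).hexdigest()

def group_split(docs, test_frac=0.2, seed=42):
    # Assign whole duplicate-groups to train or test, so near-copies
    # of the same article end up on the same side of the split.
    groups = {}
    for doc in docs:
        groups.setdefault(near_dup_key(doc), []).append(doc)
    rng = random.Random(seed)
    train, test = [], []
    for members in groups.values():
        (test if rng.random() < test_frac else train).extend(members)
    return train, test

docs = [
    "Stocks rallied on Monday as investors...",
    "STOCKS RALLIED ON MONDAY, as investors...",  # near-duplicate of above
    "A completely different science story.",
]
train, test = group_split(docs)
```

With a naive random split over documents, the two near-duplicates could land in different folds and inflate cross-validation scores exactly as described above.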

For target leaks, it is very common to have partial target leaks because you
learn more about positive cases after the moment you had to select which
cases to investigate.  Suppose, for instance, that you are targeting
potential customers based on very limited information.  If you make an
enticing offer to the people you target, then those who accept the offer
will buy something from you.  You will also learn particulars such as name
and address from those who buy from you.

Looking back retrospectively, it appears that you can target good customers
by selecting those whose names and addresses are not null.  Without a
snapshot of each customer record taken at exactly the time the targeting
was done, you cannot know that *all* customers had a null name and address
before you targeted them.  Time-machine leaks of this sort can be
enormously more subtle than this example.
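The name/address example can be made concrete with a toy sketch. The records and field names below are hypothetical; the point is only that a feature computed from today's table perfectly "predicts" the historical outcome, while the same feature computed from a decision-time snapshot carries no signal at all:

```python
# The customer table as it looks *today*, plus the historical outcome.
# Everyone had name=None at targeting time; buyers later filled it in.
customers_today = [
    {"name": "Alice", "bought": True},
    {"name": "Bob",   "bought": True},
    {"name": None,    "bought": False},
    {"name": None,    "bought": False},
]

# Leaky "feature" built from today's table: name-is-present.
leaky = [(c["name"] is not None) for c in customers_today]
labels = [c["bought"] for c in customers_today]
assert leaky == labels  # a perfect retrospective "predictor"

# Snapshot at decision time: every name was null, so the same
# feature is constant and carries zero information.
snapshot = [{"name": None, "bought": c["bought"]} for c in customers_today]
honest = [(c["name"] is not None) for c in snapshot]
assert len(set(honest)) == 1  # constant feature -> no signal
```

Any model trained on the leaky feature would look spectacular in cross-validation and be useless at targeting time.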



On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gkhncpn@gmail.com> wrote:

> Gokhan
>
>
> On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> > vishal.santoshi@gmail.com> wrote:
> >
> > >
> > >
> > > Are we to assume that SGD is still a work in progress and the
> > > implementations (Cross Fold, Online, Adaptive) are too flawed to be
> > > realistically used?
> > >
> >
> > They are too raw to be accepted uncritically, for sure.  They have been
> > used successfully in production.
> >
> >
> > > The evolutionary algorithm seems to be the core of
> > > OnlineLogisticRegression,
> > > which in turn builds up to Adaptive/Cross Fold.
> > >
> > > >>b) for truly on-line learning where no repeated passes through the
> > data..
> > >
> > > What would it take to get to an implementation ? How can any one help ?
> > >
> >
> > Would you like to help on this?  The amount of work required to get a
> > distributed asynchronous learner up is moderate, but definitely not huge.
> >
>
> Ted, do you describe a generic distributed learner for all kinds of online
> algorithms? Possibly zookeeper-coordinated and with #predict and
> #getFeedbackAndUpdateTheModel methods?
>
> >
> > I think that OnlineLogisticRegression is basically sound, but should get
> a
> > better learning rate update equation.  That would largely make the
> > Adaptive* stuff unnecessary, especially if OLR could be used in the
> > distributed asynchronous learner.
> >
>
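For readers wondering what an online logistic regression with an explicit learning-rate schedule looks like, here is a minimal sketch using a 1/sqrt(t) decay. This is an illustrative toy, not Mahout's OnlineLogisticRegression, and the class and parameter names are made up for the example:

```python
import math
import random

class TinyOLR:
    # Minimal online logistic regression with a 1/sqrt(t) learning-rate
    # decay -- a stand-in for the kind of update-equation tweak discussed
    # above, not Mahout's actual implementation.
    def __init__(self, dim, eta0=0.5):
        self.w = [0.0] * dim
        self.eta0 = eta0
        self.t = 0

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, x, y):
        # One SGD step on the log-loss gradient with decaying step size.
        self.t += 1
        eta = self.eta0 / math.sqrt(self.t)
        err = self.predict(x) - y
        self.w = [wi - eta * err * xi for wi, xi in zip(self.w, x)]

rng = random.Random(0)
model = TinyOLR(dim=2)
for _ in range(2000):
    x = [rng.uniform(-1, 1), 1.0]   # one feature plus a bias term
    y = 1 if x[0] > 0 else 0        # linearly separable toy target
    model.train(x, y)
```

Because each example is seen exactly once, a learner like this fits the truly-online, single-pass setting mentioned in the thread; the decaying rate is what lets the weights settle instead of oscillating.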
