mahout-user mailing list archives

From Benson Margulies <bimargul...@gmail.com>
Subject Re: Logistic Regression Tutorial
Date Fri, 29 Apr 2011 11:36:17 GMT
With some help from Ted (which I plan to turn into a checked-in tool
if he doesn't get there first), I'm running LR on my initial small
example.

I adapted Ted's rcv1 sample to digest a directory containing
subdirectories containing exemplars.
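That layout -- a directory of subdirectories, one per category, each holding exemplar files -- can be read with a walk along these lines (a sketch with invented names and paths, not the adapted rcv1 code):

```python
# Sketch of reading a directory whose subdirectories are categories and
# whose files are exemplars (all paths here are invented for the demo).
import os
import tempfile

def iter_examples(root):
    """Yield (label, text) pairs; subdirectory name = category label."""
    for label in sorted(os.listdir(root)):
        label_dir = os.path.join(root, label)
        if not os.path.isdir(label_dir):
            continue
        for name in sorted(os.listdir(label_dir)):
            path = os.path.join(label_dir, name)
            if os.path.isfile(path):
                with open(path, encoding="utf-8") as f:
                    yield label, f.read()

# Tiny demo corpus: one category, one document.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sports"))
with open(os.path.join(root, "sports", "doc1.txt"), "w", encoding="utf-8") as f:
    f.write("goal scored in extra time")

examples = list(iter_examples(root))
print(examples)   # [('sports', 'goal scored in extra time')]
```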

Ted's delightfully small program pushes all of the data into the model
n times (n is 10 in my current variation). It displays the best
learner's accuracy at each iteration.

The example is 1000 docs in 10 categories.

With 20k features, I note that the accuracy scores get worse on each
iteration of pushing the data into the model.

After the first pass, the model hasn't trained yet. After the second,
accuracy is 95.6%, and then it drifts gracefully downward with each
additional iteration, landing at 83%.

I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
but this pattern is not intuitive to me.
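The loop Benson describes can be sketched like this (a toy in plain Python with invented synthetic data, not the actual Mahout/rcv1 code): push all the data through an SGD logistic learner n times and score held-out accuracy after each pass.

```python
# Toy sketch of the multi-pass training loop described above.
import math
import random

random.seed(0)
DIM = 20

def make_doc(label):
    # Two noisy clusters standing in for two document categories.
    center = 1.0 if label == 1 else -1.0
    return [center + random.gauss(0, 1.5) for _ in range(DIM)], label

docs = [make_doc(random.randint(0, 1)) for _ in range(200)]
train, held_out = docs[:150], docs[150:]

w = [0.0] * DIM
bias = 0.0

def predict(x):
    z = bias + sum(wi * xi for wi, xi in zip(w, x))
    z = max(-30.0, min(30.0, z))      # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(split):
    hits = sum((predict(x) > 0.5) == (y == 1) for x, y in split)
    return hits / len(split)

history = []
for p in range(10):                    # n = 10 passes, as in the message
    rate = 0.5 / (1 + p)               # annealed learning rate
    for x, y in train:
        err = predict(x) - y           # gradient of the log loss
        for i in range(DIM):
            w[i] -= rate * err * x[i]
        bias -= rate * err
    history.append(accuracy(held_out))
    print(f"pass {p + 1}: held-out accuracy {history[-1]:.3f}")
```

On real text data, whether repeated passes help or hurt depends on the learning-rate schedule and regularization, which is exactly the kind of behavior worth instrumenting per pass as above.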

On Thu, Apr 28, 2011 at 5:59 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> And, of course, the current SGD learner handles the multinomial case.
>
> On Thu, Apr 28, 2011 at 2:52 PM, Mike Nute <mike.nute@gmail.com> wrote:
>
>> Once you do the vectorization, that becomes the feature vector for your
>> GLM.  The problem with doing multinomial logit is that if you have a
>> feature
>> vector of size K and N different categories, you end up with K*(N-1)
>> separate parameters to fit which can be nasty, though there are ways to get
>> around that by constraining them.  The N-way case is equivalent to doing
>> (N-1) separate binomial logits.
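The arithmetic in Mike's point above, written out in plain Python (K and N here are the thread's running example of 20k hashed features and 10 categories):

```python
# Back-of-envelope for the multinomial-logit parameter count.
K = 20_000          # feature vector size
N = 10              # categories

# Multinomial logit with one reference category: K weights for each of
# the other N - 1 categories...
multinomial_params = K * (N - 1)

# ...which is the same count as (N - 1) separate one-vs-reference
# binomial logits.
binomial_params = (N - 1) * K

print(multinomial_params)   # 180000
assert multinomial_params == binomial_params
```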
>>
>> Does that help with the connection between the vectorization process and
>> LR?
>>
>> MN
>>
>> On Thu, Apr 28, 2011 at 5:07 PM, Benson Margulies <bimargulies@gmail.com> wrote:
>>
>> > Thanks, all. I get frustrated really fast when trying to read a PDF.
>> > I guess I'm a fossil.
>> >
>> > On Thu, Apr 28, 2011 at 4:54 PM, Ted Dunning <ted.dunning@gmail.com>
>> > wrote:
>> > > The TrainNewsGroups class does this, not quite as nicely as is
>> > > possible (it avoids the TextValueEncoder).
>> > >
>> > > I will post a simplified example on github that I just worked up for RCV1.
>> > >
>> > >
>> > >
>> > > On Thu, Apr 28, 2011 at 1:32 PM, Chris Schilling <chris@cellixis.com> wrote:
>> > >
>> > >> Benson,
>> > >>
>> > >> Chapters 14 and 15 discuss the 20 newsgroups classification example
>> > >> using bag-of-words.  In this implementation of LR, you have to
>> > >> manually create the feature vectors when iterating through the files.
>> > >> The features are hashed into a vector of predetermined length.  The
>> > >> examples are very clear and easy to set up.  I can send you some code
>> > >> I wrote for a similar problem if it will help.
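The hashed encoding Chris describes can be sketched as follows (a toy, not Mahout's own encoders, which are more sophisticated; the vector size and tokens are invented): each token is hashed into a vector of predetermined length, so no vocabulary has to be enumerated up front.

```python
# Minimal feature-hashing ("hashing trick") sketch for bag-of-words.
import zlib

VECTOR_SIZE = 100   # predetermined length; real models use far more

def encode(tokens, size=VECTOR_SIZE):
    vec = [0.0] * size
    for tok in tokens:
        slot = zlib.crc32(tok.encode("utf-8")) % size
        vec[slot] += 1.0          # colliding tokens simply share a slot
    return vec

doc = "the quick brown fox jumps over the lazy dog".split()
vec = encode(doc)
print(sum(vec))   # 9.0 -- one count added per token
```

The trade-off: a fixed memory footprint and no dictionary pass, at the cost of occasional hash collisions, which a larger vector size makes rarer.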
>> > >>
>> > >> Chris
>> > >>
>> > >> On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:
>> > >>
>> > >> > Chris,
>> > >> >
>> > >> > I'm looking at a recently purchased copy of MIA.
>> > >> >
>> > >> > The LR example is all about the donut file, which has features
>> > >> > that don't look anything like, even remotely, a full-up
>> > >> > bag-of-words vector.
>> > >> >
>> > >> > I'm lacking the point of connection between the vectorization
>> > >> > process (with which we have some experience here, running
>> > >> > canopy/kmeans) and the LR example. It's probably some simple
>> > >> > principle that I'm failing to grasp.
>> > >> >
>> > >> > --benson
>> > >> >
>> > >> >
>> > >> > On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <chris@cellixis.com> wrote:
>> > >> >> Benson,
>> > >> >>
>> > >> >> The latest chapters in Mahout in Action cover document
>> > >> >> classification using LR very well.
>> > >> >>
>> > >> >> Chris
>> > >> >>
>> > >> >>
>> > >> >> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
>> > >> >>
>> > >> >>> Mike,
>> > >> >>>
>> > >> >>> in the time available for the experiment I want to perform, all
>> > >> >>> I can imagine doing is turning each document into a bag-of-words
>> > >> >>> feature vector. So, I want to run the pipeline of
>> > >> >>> lucene->vectors->... and train a model. I confess that I don't
>> > >> >>> have the time to try to absorb the underlying math; indeed, I
>> > >> >>> have some co-workers who can help me with that. My problem is
>> > >> >>> entirely plumbing at this point.
>> > >> >>>
>> > >> >>> --benson
>> > >> >>>
>> > >> >>>
>> > >> >>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mike.nute@gmail.com>
>> > >> wrote:
>> > >> >>>> Benson,
>> > >> >>>>
>> > >> >>>> Lecture 3 in this one is a good intro to the logit model:
>> > >> >>>>
>> > >> >>>>
>> > >> >>>> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>> > >> >>>>
>> > >> >>>> The lecture notes are pretty solid too, so that might be faster.
>> > >> >>>>
>> > >> >>>> The short version: Logistic Regression is a GLM with the
>> > >> >>>> inverse link f^-1(x) = 1/(1+e^(-xB)) and a Binomial likelihood
>> > >> >>>> function.  You can fit it with either batch or stochastic
>> > >> >>>> gradient descent.
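That GLM can be written out in a few lines (illustrative pure Python; in the usual sign convention the model mean is 1/(1 + e^(-xB)), and one stochastic gradient step on the log loss nudges B toward the observed label):

```python
# Logistic link and a single SGD step on the log loss.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(B, x, y, rate=0.1):
    p = sigmoid(sum(b * xi for b, xi in zip(B, x)))   # predicted mean
    return [b + rate * (y - p) * xi for b, xi in zip(B, x)]

B = [0.0, 0.0]
B = sgd_step(B, x=[1.0, 2.0], y=1)   # weights move toward the positive example
print(B)
```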
>> > >> >>>>
>> > >> >>>> I've never done document classification before, though, so I'm
>> > >> >>>> not much help with more complicated things like choosing the
>> > >> >>>> feature vector.
>> > >> >>>>
>> > >> >>>> Good Luck,
>> > >> >>>> Mike Nute
>> > >> >>>>
>> > >> >>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bimargulies@gmail.com> wrote:
>> > >> >>>>
>> > >> >>>>> Is there a logistic regression tutorial in the house? I've got
>> > >> >>>>> a stack of files (Arabic ones, no less) and I want to train and
>> > >> >>>>> score a classifier.
>> > >> >>>>>
>> > >> >>>>
>> > >> >>>>
>> > >> >>>>
>> > >> >>>> --
>> > >> >>>> Michael Nute
>> > >> >>>> Mike.Nute@gmail.com
>> > >> >>>>
>> > >> >>
>> > >> >>
>> > >>
>> > >>
>> > >
>> >
>>
>>
>>
>> --
>> Michael Nute
>> Mike.Nute@gmail.com
>>
>
