mahout-user mailing list archives

From: Chris Schilling <ch...@cellixis.com>
Subject: Re: Logistic Regression Tutorial
Date: Thu, 28 Apr 2011 20:32:23 GMT
Benson,

Chapters 14 and 15 discuss the 20 newsgroups classification example using
bag-of-words.  In this implementation of LR, you manually create the feature
vectors while iterating through the files; the features are hashed into a
vector of predetermined length.  The examples are very clear and easy to set
up.  I can send you some code I wrote for a similar problem if it will help.
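
In the meantime, here is a rough outline of that loop (a minimal sketch, not
my actual code: it assumes the MiA-era SGD and encoder classes, and
NUM_FEATURES, the 20-category count, the sample text, and the whitespace
"tokenizer" are all placeholders to swap for your own data):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
// encoder packages have moved between releases; adjust for your version
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedLrSketch {

  // predetermined vector length; every feature gets hashed into this space
  private static final int NUM_FEATURES = 10000;

  public static void main(String[] args) {
    // 20 categories as in the newsgroups example; use your own count
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(20, NUM_FEATURES, new L1())
            .learningRate(50)
            .lambda(1.0e-4);

    // one encoder for document words, one for the intercept term
    FeatureVectorEncoder words = new StaticWordValueEncoder("body");
    FeatureVectorEncoder bias = new ConstantValueEncoder("intercept");

    // you would iterate over your files here; actual = known category index
    String text = "example document text";
    int actual = 0;

    Vector v = new RandomAccessSparseVector(NUM_FEATURES);
    bias.addToVector("", 1, v);
    for (String token : text.split("\\s+")) {   // stand-in for a real tokenizer
      words.addToVector(token, 1, v);           // hashes the token into v
    }

    learner.train(actual, v);                   // one SGD step

    // after training, classifyFull() returns a probability per category
    Vector p = learner.classifyFull(v);
    System.out.println("p(category 0) = " + p.get(0));
  }
}

The same vector-building code gets reused at classification time, which is
part of what makes the hashed encoding convenient: there is no dictionary to
carry around.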

Chris

On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:

> Chris,
> 
> I'm looking at a recently-purchased copy of MIA.
> 
> The LR example is all about the donut file, whose features don't look
> even remotely like a full-up bag-of-words vector.
> 
> I'm missing the point of connection between the vectorization process
> (which we have some experience with here, from running canopy/kmeans)
> and the LR example. It's probably some simple principle that I'm failing
> to grasp.
> 
> --benson
> 
> 
> On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <chris@cellixis.com> wrote:
>> Benson,
>> 
>> The latest chapters in Mahout in Action cover document classification
>> using LR very well.
>> 
>> Chris
>> 
>> 
>> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
>> 
>>> Mike,
>>> 
>>> In the time available for the experiment I want to perform, all I can
>>> imagine doing is turning each document into a bag-of-words feature
>>> vector. So I want to run the lucene->vectors->... pipeline and
>>> train a model. I confess that I don't have the time to absorb the
>>> underlying math; indeed, I have some co-workers who can help me
>>> with that. My problem is entirely plumbing at this point.
>>> 
>>> --benson
>>> 
>>> 
>>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mike.nute@gmail.com> wrote:
>>>> Benson,
>>>> 
>>>> Lecture 3 in this one is a good intro to the logit model:
>>>> 
>>>> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>>>> 
>>>> The lecture notes are pretty solid too, so that might be faster.
>>>> 
>>>> The short version: Logistic Regression is a GLM with a Binomial likelihood
>>>> function and the logit link, whose inverse is p = 1/(1+e^(-xB)).  You can
>>>> fit it with either Batch or Stochastic Gradient Descent.
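>>>> 
>>>> (For concreteness, the update both methods compute: with p = 1/(1+e^(-xB))
>>>> and label y in {0,1}, a stochastic step for one example is
>>>> B <- B + eta*(y - p)*x, where eta is the learning rate; batch gradient
>>>> descent sums (y - p)*x over the whole training set before each step.)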
>>>> 
>>>> I've never done document classification before though, so I'm not much help
>>>> with more complicated things like choosing the feature vector.
>>>> 
>>>> Good Luck,
>>>> Mike Nute
>>>> 
>>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bimargulies@gmail.com> wrote:
>>>> 
>>>>> Is there a logistic regression tutorial in the house? I've got a stack
>>>>> of files (Arabic ones, no less) and I want to train and score a
>>>>> classifier.
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Michael Nute
>>>> Mike.Nute@gmail.com
>>>> 
>> 
>> 

