mahout-user mailing list archives

From Frank Wang <wangfan...@gmail.com>
Subject Re: Implementation for Linear Regression
Date Fri, 22 Oct 2010 08:08:24 GMT
Thanks Ted.

It's a very interesting solution. Currently, we need to account for
age-related terms when calculating the relevance ranking, and this is done
before display time. We will play around with our data and see if we can
model it to take advantage of the trick.
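
To check my understanding of the trick, here is a minimal sketch in plain
Java. Everything in it is an assumption for illustration: the class name, the
single age feature, and the weights are hypothetical, and none of it is Mahout
API. The age-independent part of the score is stored once in log-odds form,
and the age term is added only when the item is displayed.

```java
// Sketch only: hypothetical names and weights, not Mahout code.
final class DisplayTimeScore {
  // Add the age-related term to a stored, age-independent log-odds score.
  static double logOdds(double fixedScore, double ageWeight, double ageSeconds) {
    return fixedScore + ageWeight * ageSeconds;
  }

  // Map log-odds to a probability with the logistic function.
  static double probability(double logOdds) {
    return 1.0 / (1.0 + Math.exp(-logOdds));
  }
}
```

With this layout the stored score never has to be recomputed as an item ages;
only the cheap age term changes between displays.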

In terms of Linear Regression, I've attached the initial patch on
MAHOUT-529<https://issues.apache.org/jira/browse/MAHOUT-529>.
It's mainly the AbstractOnlineLinearRegression and OnlineLinearRegression
classes. Lemme know if the code makes sense.

I have 2 questions:

1.
The apply() function in DefaultGradient has:
    Vector r = v.like();
    if (actual != 0) {
      r.setQuick(actual - 1, 1);
    }

The code seems to work only for logistic regression. When actual is 0, r[0]
remains 0, and when actual is 1, r[0] gets set to 1. I'm not sure if I'm
understanding it correctly. For now, I've included DefaultGradientLinear in
the patch as a workaround. If you could give me some advice, that'd be
helpful.
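
To make the concern concrete, here is a standalone sketch using plain arrays
rather than Mahout's Vector API. The class and method names are made up for
illustration. The first method mirrors the quoted snippet as I read it:
category 0 is the all-zero reference, and category k > 0 sets slot k - 1. The
second is the residual I would expect to need for a real-valued target.

```java
// Sketch only: hypothetical names, plain arrays instead of Mahout vectors.
final class TargetEncodingSketch {
  // Encode a categorical target the way the quoted snippet appears to:
  // numCategories - 1 slots, category 0 leaves them all zero.
  static double[] oneOfN(int actual, int numCategories) {
    double[] r = new double[numCategories - 1];
    if (actual != 0) {
      r[actual - 1] = 1.0;
    }
    return r;
  }

  // For linear regression the target is a real number, so the error term is
  // simply the difference between the actual and predicted values.
  static double linearResidual(double actual, double predicted) {
    return actual - predicted;
  }
}
```

The first form only makes sense when actual is a category index, which is why
I don't see how to reuse it for an unbounded numeric target.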


2.
While working on the sample code TrainLinear, I was referring to the
TrainLogistic code. I'm confused by this line:
         int targetValue = csv.processLine(line, input);

The training file is:
"a","b","c","target"
3,1,10,1
2,1,10,1
1,0,2,0
...

But the output for processLine() is:
Line 1: targetValue = 0, input = {2:4.0, 1:10.0, 0:1.0}
Line 2: targetValue = 0, input = {2:3.0, 1:10.0, 0:1.0}
Line 3: targetValue = 1, input = {2:1.0, 1:2.0, 0:1.0}
...

It seems the target values are inverted, and some input values are
incremented. It'd be great if you could explain processLine() a little
bit.
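
One reading that is consistent with the output above, though it is a guess I
have not confirmed in the code: target labels get integer codes in order of
first appearance, so "1" (seen first) becomes 0 and "0" becomes 1. Also, slot
2 equals a + b on every line shown (3+1, 2+1, 1+0), slot 1 equals c, and slot
0 is always 1.0, which looks like two columns sharing a slot plus a constant
term. The sketch below only illustrates the first-appearance coding; the class
is hypothetical and not part of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a hypothetical dictionary that numbers labels in the order
// they are first seen.
final class FirstSeenDictionary {
  private final Map<String, Integer> codes = new LinkedHashMap<String, Integer>();

  // Return the existing code for a label, or assign the next unused one.
  int intern(String label) {
    Integer code = codes.get(label);
    if (code == null) {
      code = codes.size();
      codes.put(label, code);
    }
    return code;
  }
}
```

Feeding it the target column from the training file ("1", "1", "0") yields
0, 0, 1, which matches the targetValue sequence printed above.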

btw, is the mailing list a good place for implementation discussion, or
should it take place on the JIRA page?

Thanks

On Wed, Oct 20, 2010 at 9:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> You don't have to apply the age correction to old data until you display
> the data.  The trick is to store all of the fixed components of the rating
> in linear form and then add only the age related terms at display time.
> This allows you to penalize items that are unlikely to be relevant due to
> age and doesn't require any recomputation.
>
> On Wed, Oct 20, 2010 at 9:32 PM, Frank Wang <wangfanjie@gmail.com> wrote:
>
> > Hi Ted,
> >
> > I've created the JIRA issue at
> > https://issues.apache.org/jira/browse/MAHOUT-529, will attach what I have
> > soon.
> >
> > Do you mean using time as a feature in the logistic regression? I thought
> > about your suggestion the other day, but I'm not re-calculating the
> > probability on the old data. After training each night, we only apply the
> > coefficients on next day's new data. I'm not quite sure how the decay
> > function would work in this case. Do you have an example?
> >
> > Thanks
> >
> >
> > On Wed, Oct 20, 2010 at 8:48 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > Can you open a JIRA and attach a patch?
> > >
> > > Your approach seems reasonable so far for the regression.
> > >
> > > In terms of how it could be applied, it seems like you are trying to
> > > estimate a life-span for a posting to model relevance decay.
> > >
> > > My own preference there would be to try to estimate relevance (0 or 1)
> > > using logistic regression and then put in various decay functions as
> > > features.  The weighted sum of those decay functions is your time decay
> > > of relevance (in log-odds).
> > >
> > > My initial shot at decay functions would include age, square of age and
> > > log of age.  My guess is that direct age would suffice because of the
> > > logistic link function which looks like a logarithmic function where
> > > your models will probably live.
> > >
> > > On Wed, Oct 20, 2010 at 8:15 PM, Frank Wang <wangfanjie@gmail.com>
> > > wrote:
> > >
> > > > Hi Ted,
> > > >
> > > > thanks for your reply.
> > > > I'm trying a new model where I want to estimate the output as a
> > > > timespan quantified in number of seconds, which is not bounded. That's
> > > > why I think I'd use linear regression instead of logistic regression.
> > > > (lemme know if I'm wrong)
> > > >
> > > > I started on the code yesterday. The new AbstractOnlineLinearRegression
> > > > class is implementing the OnlineLearner interface. I updated the
> > > > classify() function to use a linear model. I tried to follow the format
> > > > for AbstractOnlineLogisticRegression.
> > > >
> > > > I think since linear regression can be implemented w/ sgd, the train()
> > > > and regularize() functions would look similar. I'm not sure if I'm on
> > > > the right path. Any advice would be helpful.
> > > >
> > > > Thanks
> > > >
> > > > On Wed, Oct 20, 2010 at 3:34 PM, Ted Dunning <ted.dunning@gmail.com>
> > > > wrote:
> > > >
> > > > > Frank,
> > > > >
> > > > > Sorry I didn't answer your previous email regarding this.
> > > > >
> > > > > It sounded to me like your application would actually be happier with
> > > > > a form of logistic regression.
> > > > >
> > > > > Perhaps we should talk some more about this on the list.
> > > > >
> > > > > If you want a normal linear regression, the current OnlineLearner
> > > > > interface isn't terribly appropriate since it assumes a 1 of n vector
> > > > > target variable.
> > > > >
> > > > > If you were to extend that interface to accept a vector form of
> > > > > target variable then linear regression would work (and some clever
> > > > > tricks would become possible for logistic regression).
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Oct 20, 2010 at 1:57 PM, Frank Wang <wangfanjie@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm interested in implementing Linear Regression in Mahout. Who
> > > > > > would be the point person for the algorithm? I'd love to discuss
> > > > > > the implementation details, or to help out if anyone is working on
> > > > > > it already :)
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > >
> > >
> >
>
