mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Should I be using OnlineLogisticRegression?
Date Fri, 07 Sep 2012 23:06:17 GMT
OK.  So several of these show pretty massive skew.  That will be a problem.
The thing to look for is a large difference between mean and median.

You also have massively different scales for different variables.  One
variable (v6) spans 1 to 5, another (v3) spans 0 to 5.8 x 10^10.
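
A rough way to put numbers on both observations is sketched below in Python,
using only the figures from the summary(x) output quoted further down.  The
score (mean - median) / IQR is an assumption chosen here for illustration
because it needs nothing beyond the printed summary; it is not a measure
named in this thread, and the function names are hypothetical.

```python
# Summary statistics copied from the summary(x) output quoted in this thread.
# Each tuple is (min, 1st quartile, median, mean, 3rd quartile, max).
SUMMARY = {
    "v1": (-55.0, 6.0, 62.0, 658.7, 391.0, 461311.0),
    "v2": (0.0, 0.0, 2.0, 25.4, 13.0, 21532.0),
    "v3": (0.0, 1.821e6, 1.268e7, 2.345e7, 3.364e7, 5.820e10),
    "v4": (3.0, 36.0, 47.0, 50.35, 62.0, 257.0),
    "v5": (0.0, 356.0, 623.0, 956.7, 1100.0, 33413.0),
    "v6": (1.0, 2.0, 3.0, 2.862, 4.0, 5.0),
}


def skew_indicator(stats):
    """Gap between mean and median, in units of the interquartile range.

    This is an illustrative score, not a standard skewness statistic.
    """
    _, q1, median, mean, q3, _ = stats
    return (mean - median) / (q3 - q1)


def value_range(stats):
    """Max minus min, to compare the scales of the variables."""
    return stats[5] - stats[0]


for name, stats in SUMMARY.items():
    print(f"{name}: skew~{skew_indicator(stats):6.2f}  "
          f"range={value_range(stats):.3g}")
```

A score near zero means the mean and median roughly agree; a score that is
large relative to the other variables suggests a long tail.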

Interestingly, v1 is not only skewed but also has negative values, so a
simple-minded log transform will not work on it.

One handy transform that I have used in the past is to simply use a
smoothed version of the empirical cumulative distribution function to
transform variables.  In R, the ecdf function can produce such a function.
That puts all of your variables into a uniform [0,1] distribution.  You
can further transform this to a normal distribution if you care to.
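
A minimal Python sketch of that idea follows, using only the standard
library.  The smoothing used here (mid-ranks divided by n + 1, so the output
never reaches exactly 0 or 1) is an assumption; the message does not say
which smoothing to use.  The toy sample reuses the v1 quartile values from
the quoted summary purely as illustrative input, and the function names are
hypothetical.

```python
from bisect import bisect_left, bisect_right
from statistics import NormalDist


def make_smoothed_ecdf(sample):
    """Return a function mapping a value to its smoothed ECDF in (0, 1)."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x):
        # Mid-rank treats ties symmetrically; adding 0.5 and dividing by
        # n + 1 keeps the result strictly between 0 and 1.
        rank = (bisect_left(ordered, x) + bisect_right(ordered, x)) / 2.0
        return (rank + 0.5) / (n + 1)

    return ecdf


def to_normal(u):
    """Map a value in (0, 1) to a standard normal score (inverse CDF)."""
    return NormalDist().inv_cdf(u)


# Toy input only: the five v1 quantiles from the quoted summary.
toy_v1 = [-55.0, 6.0, 62.0, 391.0, 461311.0]
ecdf_v1 = make_smoothed_ecdf(toy_v1)
uniform_v1 = [ecdf_v1(x) for x in toy_v1]
normal_v1 = [to_normal(u) for u in uniform_v1]
print(uniform_v1)
print(normal_v1)
```

Because the transform depends only on ranks, the negative minimum and the
huge maximum of v1 cause no trouble.  The ECDF would be fitted on training
data and then reused unchanged for new records.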

If that is too crazy, then try this:

xv1 = log(max(1e-6, v1+55))
xv2 = log(max(1e-6, v2))
xv3 = log(max(1e-6, v3))
xv4 = v4
xv5 = v5
xv6 = v6

Then normalize all of the xv variables to have unit variance and zero mean.
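
The recipe above can be sketched in Python roughly as follows.  The helper
names are hypothetical, the population standard deviation is an arbitrary
choice (the sample version differs only slightly on large data), and the toy
column reuses the v1 quartile values from the quoted summary as illustrative
input.

```python
import math
from statistics import fmean, pstdev


def shifted_log(value, shift=0.0, floor=1e-6):
    """log(max(floor, value + shift)), as in the xv1..xv3 lines above."""
    return math.log(max(floor, value + shift))


def standardize(values):
    """Rescale a column to zero mean and unit variance."""
    mean = fmean(values)
    sd = pstdev(values)
    if sd == 0.0:
        raise ValueError("constant column cannot be scaled to unit variance")
    return [(v - mean) / sd for v in values]


# Toy input only: the five v1 quantiles from the quoted summary.
toy_v1 = [-55.0, 6.0, 62.0, 391.0, 461311.0]

# v1 is shifted by 55 because its minimum is -55; v2 and v3 would use shift=0.
xv1 = standardize([shifted_log(v, shift=55.0) for v in toy_v1])
print(xv1)
```

The mean and standard deviation computed on the training data would need to
be stored and applied as-is to any records scored later.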

On Fri, Sep 7, 2012 at 8:23 AM, Mike Burba <mike.burba@gmail.com> wrote:

> Took some massaging to get into R.  Here is the output as requested
> for the 6 predictor variables:
>
> > summary(x)
>
>        v1                  v2
>  Min.   :   -55.0   Min.   :    0.0
>  1st Qu.:     6.0   1st Qu.:    0.0
>  Median :    62.0   Median :    2.0
>  Mean   :   658.7   Mean   :   25.4
>  3rd Qu.:   391.0   3rd Qu.:   13.0
>  Max.   :461311.0   Max.   :21532.0
>
>        v3                  v4
>  Min.   :0.000e+00   Min.   :  3.00
>  1st Qu.:1.821e+06   1st Qu.: 36.00
>  Median :1.268e+07   Median : 47.00
>  Mean   :2.345e+07   Mean   : 50.35
>  3rd Qu.:3.364e+07   3rd Qu.: 62.00
>  Max.   :5.820e+10   Max.   :257.00
>
>        v5                 v6
>  Min.   :    0.0   Min.   :1.000
>  1st Qu.:  356.0   1st Qu.:2.000
>  Median :  623.0   Median :3.000
>  Mean   :  956.7   Mean   :2.862
>  3rd Qu.: 1100.0   3rd Qu.:4.000
>  Max.   :33413.0   Max.   :5.000
>
> So now I am going through the process of transforming / scaling.  Any
> top-of-mind thoughts on the output above are welcome...to help me
> validate my thought process.
>
> Thanks for the hints, I will let you know how it turns out.
>
> Mike
>
> On Thu, Sep 6, 2012 at 8:14 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > Try transforming them as well, likely with a log if they are positive and
> > have heavily skewed values.
> >
> > Can you suck the data into R and paste in the results of summary(x)?
> > (assuming you put the data into the variable x).  This should look
> > something like:
> >
> > > summary(x)
> > >        v1                 v2                  v3
> > >  Min.   :-3.41939   Min.   :0.0002538   Min.   :1.188
> > >  1st Qu.:-0.66695   1st Qu.:0.3122501   1st Qu.:3.321
> > >  Median :-0.07277   Median :0.6830144   Median :3.972
> > >  Mean   :-0.05619   Mean   :1.0286261   Mean   :4.010
> > >  3rd Qu.: 0.56784   3rd Qu.:1.4619058   3rd Qu.:4.712
> > >  Max.   : 2.74271   Max.   :7.7754864   Max.   :7.252
> > > >
> >
> >
> > On Thu, Sep 6, 2012 at 4:58 PM, Diederik van Liere <
> > Diederik.vanLiere@rotman.utoronto.ca> wrote:
> >
> > >
> > > > - My (6) predictor variables are all numeric; some of the variables
> > > > range from 0...5, others range from 0...1,000,000.
> > > Have you tried rescaling your predictor variables so they have the same
> > > range?
> > >
> > > Diederik
>
