OK. Several of these show pretty massive skew, which will be a problem.
The thing to look for is a large gap between the mean and the median: v1,
for example, has a mean of 658.7 against a median of 62.
You also have a massively different scale for different variables. One
variable has a range of 5, another has a range of 5.8 x 10^10.
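
To make the two checks above concrete, here is a rough sketch in R. The data frame is synthetic stand-in data (your real x is not available to me), and the helper name diagnose is made up for illustration.

```r
# Synthetic stand-in data: one heavily skewed column, one small bounded column.
set.seed(1)
x <- data.frame(
  a = rlnorm(1000, meanlog = 4, sdlog = 2),  # heavily right-skewed
  b = sample(1:5, 1000, replace = TRUE)      # small, bounded range
)

# Hypothetical helper: per-column mean, median and range (max - min).
diagnose <- function(df) {
  data.frame(
    mean   = sapply(df, mean),
    median = sapply(df, median),
    range  = sapply(df, function(v) diff(range(v)))
  )
}

d <- diagnose(x)
d$mean_over_median <- d$mean / d$median  # far above 1 suggests right skew
print(d)
```

A mean far above the median flags right skew, and comparing the range column across rows shows how far apart the scales are.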
Interestingly, v1 is not only skewed but also has negative values, so a
simple-minded log transform will not work.
One handy transform that I have used in the past is to simply use a
smoothed version of the empirical cumulative distribution function to
transform variables. In R, the ecdf function can produce such a function.
That puts all of your variables into a uniform [0,1] distribution. You
can further transform this to a normal distribution if you care to.
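
A minimal sketch of the ECDF idea, on synthetic data. Note that ecdf() itself returns an unsmoothed step function; the n / (n + 1) rescaling below is my own simple stand-in for smoothing, chosen only so that the scores stay strictly inside (0, 1) and qnorm() never returns infinity.

```r
# Synthetic skewed variable with one negative value, as a stand-in for v1.
set.seed(42)
v <- c(rlnorm(500, meanlog = 5, sdlog = 2), -55)

n  <- length(v)
Fn <- ecdf(v)              # step function: fraction of values <= t
u  <- Fn(v) * n / (n + 1)  # rescale so scores lie strictly inside (0, 1)
z  <- qnorm(u)             # optional: map uniform scores to normal scores

summary(u)
summary(z)
```

The transform only depends on ranks, so negative values and extreme outliers need no special handling.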
If that is too crazy, then try this:
# pmax is the element-wise max; the 1e-6 floor keeps log() away from zero
xv1 = log(pmax(1e-6, v1 + 55))
xv2 = log(pmax(1e-6, v2))
xv3 = log(pmax(1e-6, v3))
xv4 = v4
xv5 = v5
xv6 = v6
Then normalize all of the xv variables to have unit variance and zero mean.
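
Put together, the log-and-normalize recipe might look like the sketch below. The data frame is synthetic (only the column names v1..v6 match yours), and I am assuming a floor of 1e-6 and using pmax(), the element-wise form of max(), so the floor is applied to each value separately.

```r
# Synthetic stand-in for x; only the column names match the real data.
set.seed(7)
x <- data.frame(
  v1 = round(rlnorm(200, 4, 2)) - 55,  # skewed, minimum can reach -55
  v2 = round(rlnorm(200, 1, 2)),       # skewed counts, many zeros
  v3 = rlnorm(200, 16, 1.5),           # skewed, very large scale
  v4 = rnorm(200, 50, 15),
  v5 = rlnorm(200, 6.5, 0.8),
  v6 = sample(1:5, 200, replace = TRUE)
)

# Shift v1 so it is non-negative, floor at 1e-6, then take logs.
xv <- with(x, data.frame(
  xv1 = log(pmax(1e-6, v1 + 55)),
  xv2 = log(pmax(1e-6, v2)),
  xv3 = log(pmax(1e-6, v3)),
  xv4 = v4,
  xv5 = v5,
  xv6 = v6
))

# scale() centers each column to mean 0 and rescales it to unit variance.
xs <- as.data.frame(scale(xv))
round(colMeans(xs), 10)
sapply(xs, sd)
```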
On Fri, Sep 7, 2012 at 8:23 AM, Mike Burba <mike.burba@gmail.com> wrote:
> Took some massaging to get into R. Here is the output as requested
> for the 6 predictor variables:
>
> > summary(x)
>
> v1 v2
> Min. : -55.0 Min. : 0.0
> 1st Qu.: 6.0 1st Qu.: 0.0
> Median : 62.0 Median : 2.0
> Mean : 658.7 Mean : 25.4
> 3rd Qu.: 391.0 3rd Qu.: 13.0
> Max. :461311.0 Max. :21532.0
>
> v3 v4
> Min. :0.000e+00 Min. : 3.00
> 1st Qu.:1.821e+06 1st Qu.: 36.00
> Median :1.268e+07 Median : 47.00
> Mean :2.345e+07 Mean : 50.35
> 3rd Qu.:3.364e+07 3rd Qu.: 62.00
> Max. :5.820e+10 Max. :257.00
>
> v5 v6
> Min. : 0.0 Min. :1.000
> 1st Qu.: 356.0 1st Qu.:2.000
> Median : 623.0 Median :3.000
> Mean : 956.7 Mean :2.862
> 3rd Qu.: 1100.0 3rd Qu.:4.000
> Max. :33413.0 Max. :5.000
>
> So now I am going through the process of transforming / scaling. Any
> top-of-mind thoughts on the output above are welcome...to help me
> validate my thought process.
>
> Thanks for the hints, I will let you know how it turns out.
>
> Mike
>
> On Thu, Sep 6, 2012 at 8:14 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > Try transforming them as well, likely with a log if they are positive and
> > have heavily skewed values.
> >
> > Can you suck the data into R and paste in the results of summary(x)?
> > (assuming you put the data into the variable x). This should look
> > something like:
> >
> > > summary(x)
> > > v1 v2 v3
> > > Min. :-3.41939 Min. :0.0002538 Min. :1.188
> > > 1st Qu.:-0.66695 1st Qu.:0.3122501 1st Qu.:3.321
> > > Median :-0.07277 Median :0.6830144 Median :3.972
> > > Mean :-0.05619 Mean :1.0286261 Mean :4.010
> > > 3rd Qu.: 0.56784 3rd Qu.:1.4619058 3rd Qu.:4.712
> > > Max. : 2.74271 Max. :7.7754864 Max. :7.252
> > > >
> >
> >
> > On Thu, Sep 6, 2012 at 4:58 PM, Diederik van Liere <
> > Diederik.vanLiere@rotman.utoronto.ca> wrote:
> >
> > >
> > > > My (6) predictor variables are all numeric; some of the variables range
> > > > from 0...5, others range from 0...1,000,000.
> > > Have you tried rescaling your predictor variables so they have the same
> > > range?
> > >
> > > Diederik
>
