spark-dev mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: A note about MLlib's StandardScaler
Date Mon, 09 Jan 2017 11:23:36 GMT
That would hold only if you knew you were going to scale just the input to
StandardScaler and nothing else. More typically, you fit the scaler on one
dataset and then scale some other data. The current behavior (dividing by
n-1) is therefore the sensible default, because the input is a sample of
some unknown larger population.

I think it doesn't matter much except for toy problems, because at any
realistic scale the difference between 1/n and 1/(n-1) is negligible, and
for most purposes for which the scaler is used (faster convergence of an
optimizer, for example) it won't matter anyway. I'm therefore neutral on
whether it's worth complicating the API to support both.
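
To illustrate the 1/n versus 1/(n-1) point, here is a minimal plain-Python
sketch. It is not Spark code: the `standardize` helper and its `ddof`
parameter are hypothetical names chosen for illustration. It reproduces the
first column of the two-row example quoted below under both conventions, and
then shows how quickly the gap between them closes as n grows.

```python
import math
import statistics


def standardize(column, ddof=1):
    """Center a column and divide by its standard deviation.

    ddof=1 divides the squared deviations by n-1 (sample std, the behavior
    discussed in this thread); ddof=0 divides by n (population std).
    This is an illustrative helper, not part of any Spark API.
    """
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / (n - ddof)
    std = math.sqrt(variance)
    return [(x - mean) / std for x in column]


# First column of the two-row example from the quoted message.
col = [-2.0, 3.8]
sample_scaled = standardize(col, ddof=1)
population_scaled = standardize(col, ddof=0)

print([round(v, 4) for v in sample_scaled])      # [-0.7071, 0.7071]
print([round(v, 4) for v in population_scaled])  # [-1.0, 1.0]

# Measured with the population formula, the sample-scaled column has
# std = sqrt((n-1)/n), which is sqrt(1/2) for n = 2.
print(round(statistics.pstdev(sample_scaled), 4))  # 0.7071

# The two conventions differ by a factor of sqrt(n / (n - 1)),
# which approaches 1 as n grows.
for n in (2, 3, 4, 100, 1_000_000):
    print(n, math.sqrt(n / (n - 1)))
```

For n = 2 the factor is about 1.41, but it shrinks toward 1 as n increases,
which is why the choice mostly shows up on toy inputs.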

On Mon, Jan 9, 2017 at 6:50 AM Liang-Chi Hsieh <viirya@gmail.com> wrote:

>
> Actually, I think it is possible that a user/developer needs the
> features standardized with the population mean and std in some cases. It
> would be better if StandardScaler could offer an option to do that.
>
>
>
> Holden Karau wrote
> > Hi Gilad,
> >
> > Spark uses the sample standard deviation inside of the StandardScaler (see
> > https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler
> > ), which I think would explain the results you are seeing. I
> > believe the scalers are intended to be used on larger datasets. You
> > can verify this yourself by doing the same computation in Python and
> > checking that scaling with the sample deviation gives the values you are
> > seeing from Spark.
> >
> > Cheers,
> >
> > Holden :)
> >
> >
> > On Sun, Jan 8, 2017 at 12:06 PM, Gilad Barkan <gilad.barkan@> wrote:
> >
> >> Hi
> >>
> >> It seems that the output of MLlib's *StandardScaler*(*withMean*=True,
> >> *withStd*=True) is not as expected.
> >>
> >> The above configuration is expected to do the following transformation:
> >>
> >> X -> Y = (X-Mean)/Std  - Eq.1
> >>
> >> This transformation (a.k.a. Standardization) should result in a
> >> "standardized" vector with unit-variance and zero-mean.
> >>
> >> I'll demonstrate my claim using the current documentation example:
> >>
> >> >>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
> >> >>> dataset = sc.parallelize(vs)
> >> >>> standardizer = StandardScaler(True, True)
> >> >>> model = standardizer.fit(dataset)
> >> >>> result = model.transform(dataset)
> >> >>> for r in result.collect(): print r
> >>     DenseVector([-0.7071, 0.7071, -0.7071])
> >>     DenseVector([0.7071, -0.7071, 0.7071])
> >>
> >> This results in std = sqrt(1/2) for each column instead of std = 1.
> >>
> >> Applying the standardization transformation (Eq.1) to the above 2 vectors
> >> results in the following output:
> >>
> >>     DenseVector([-1.0, 1.0, -1.0])    DenseVector([1.0, -1.0, 1.0])
> >>
> >>
> >> Another example:
> >>
> >> Adding another DenseVector([2.4, 0.8, 3.5]) to the above, we get 3 rows
> >> of DenseVectors:
> >> [DenseVector([-2.0, 2.3, 0.0]), DenseVector([3.8, 0.0, 1.9]),
> >> DenseVector([2.4, 0.8, 3.5])]
> >>
> >> The StandardScaler returns the following scaled vectors:
> >> [DenseVector([-1.12339, 1.084829, -1.02731]), DenseVector([0.792982,
> >> -0.88499, 0.057073]), DenseVector([0.330409, -0.19984, 0.970241])]
> >>
> >> This result has std = sqrt(2/3).
> >>
> >> Instead, it should have produced 3 other vectors that give std = 1 for
> >> each column.
> >>
> >> Adding another vector (4 total) results in 4 scaled vectors that form
> >> std= sqrt(3/4) instead of std=1
> >>
> >> I hope all the examples help to make my point clear.
> >>
> >> I hope I'm not missing something here.
> >>
> >> Thank you
> >>
> >> Gilad Barkan
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> > --
> > Cell : 425-233-8271
> > Twitter: https://twitter.com/holdenkarau
>
>
>
>
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
