spark-dev mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: A note about MLlib's StandardScaler
Date Mon, 09 Jan 2017 03:30:43 GMT
Hi Gilad,

Spark uses the sample standard deviation inside of the StandardScaler (see
https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler
), which I think would explain the results you are seeing. I
believe the scalers are intended to be used on larger datasets. You
can verify this yourself by doing the same computation in Python: scaling
with the sample standard deviation gives the values you are seeing from
Spark.
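
As a rough sketch of that check, the snippet below redoes the two-row
example from the quoted message with Python's standard library only. The
helper name standardize is just for illustration and is not part of Spark;
stdev divides by n - 1 (sample) and pstdev divides by n (population).

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# The two rows from the documentation example quoted below.
rows = [[-2.0, 2.3, 0.0], [3.8, 0.0, 1.9]]

def standardize(rows, std_fn):
    """Column-wise (x - mean) / std, with std computed by std_fn.

    Illustrative helper only; this is not Spark's implementation.
    """
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [std_fn(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

sample_scaled = standardize(rows, stdev)        # divides by n - 1
population_scaled = standardize(rows, pstdev)   # divides by n

for row in sample_scaled:
    print([round(v, 4) for v in row])
for row in population_scaled:
    print([round(v, 4) for v in row])

# Population std of the sample-scaled first column next to sqrt((n - 1) / n).
n = len(rows)
first_column = [row[0] for row in sample_scaled]
print(round(pstdev(first_column), 4), round(sqrt((n - 1) / n), 4))
```

The sample version should line up with the +/-0.7071 values Spark reports,
the population version with the +/-1.0 values expected from Eq.1, and the
last line shows where the sqrt((n - 1) / n) factor comes from.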

Cheers,

Holden :)


On Sun, Jan 8, 2017 at 12:06 PM, Gilad Barkan <gilad.barkan@gmail.com>
wrote:

> Hi
>
> It seems that the output of MLlib's *StandardScaler*(*withMean*=True,
> *withStd*=True) is not as expected.
>
> The above configuration is expected to do the following transformation:
>
> X -> Y = (X-Mean)/Std  - Eq.1
>
> This transformation (a.k.a. Standardization) should result in a
> "standardized" vector with unit-variance and zero-mean.
>
> I'll demonstrate my claim using the current documentation example:
>
> >>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
> >>> dataset = sc.parallelize(vs)
> >>> standardizer = StandardScaler(True, True)
> >>> model = standardizer.fit(dataset)
> >>> result = model.transform(dataset)
> >>> for r in result.collect(): print r
>     DenseVector([-0.7071, 0.7071, -0.7071])
>     DenseVector([0.7071, -0.7071, 0.7071])
>
> This results in std = sqrt(1/2) for each column instead of std = 1.
>
> Applying the standardization transformation of Eq.1 to the above 2 vectors results in the
> following output:
>
>     DenseVector([-1.0, 1.0, -1.0])    DenseVector([1.0, -1.0, 1.0])
>
>
> Another example:
>
> Adding another DenseVector([2.4, 0.8, 3.5]) to the above, we get 3 rows of DenseVectors:
> [DenseVector([-2.0, 2.3, 0.0]), DenseVector([3.8, 0.0, 1.9]), DenseVector([2.4, 0.8, 3.5])]
>
> The StandardScaler produces the following scaled vectors:
> [DenseVector([-1.12339, 1.084829, -1.02731]), DenseVector([0.792982, -0.88499, 0.057073]),
> DenseVector([0.330409, -0.19984, 0.970241])]
>
> This result has std = sqrt(2/3).
>
> Instead, it should have produced 3 other vectors with std = 1 for each column.
>
> Adding another vector (4 total) results in 4 scaled vectors with std = sqrt(3/4)
> instead of std = 1.
>
> I hope the examples make my point clear.
>
> I hope I am not missing something here.
>
> Thank you
>
> Gilad Barkan


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
