spark-dev mailing list archives

From Gilad Barkan <gilad.bar...@gmail.com>
Subject A note about MLlib's StandardScaler
Date Sun, 08 Jan 2017 20:06:55 GMT
Hi

It seems that the output of MLlib's *StandardScaler*(*withMean*=True,
*withStd*=True) is not as expected.

The above configuration is expected to do the following transformation:

X -> Y = (X-Mean)/Std  - Eq.1

This transformation (a.k.a. Standardization) should result in a
"standardized" vector with unit-variance and zero-mean.

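As a minimal sketch of Eq. 1 in plain Python (not MLlib code): the helper
name `standardize` and the sample column are hypothetical, and Std is assumed
to be the population standard deviation (dividing by N), which is what the
expected outputs in this message imply.

```python
from statistics import mean, pstdev

def standardize(column):
    """Apply Eq. 1 to one column: (x - Mean) / Std."""
    m = mean(column)
    s = pstdev(column)  # assumption: Std divides by N (population formula)
    return [(x - m) / s for x in column]

# Hypothetical column, not taken from the documentation example.
scaled = standardize([1.0, 2.0, 3.0, 4.0])

# Under this convention the result has zero mean and unit variance,
# up to floating-point rounding.
print(mean(scaled), pstdev(scaled))
```
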
I'll demonstrate my claim using the current documentation example:

>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): print r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])

This results in std = sqrt(1/2) for each column (computed with the 1/N
formula) instead of std = 1.

Applying the standardization transformation of Eq. 1 to the above 2 vectors
should instead result in the following output:

    DenseVector([-1.0, 1.0, -1.0])
    DenseVector([1.0, -1.0, 1.0])
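
The two outputs can be compared with a short plain-Python sketch (not MLlib
code; the helper name `scale` is hypothetical). It applies Eq. 1 to the two
documentation vectors twice, once with Std dividing by N and once with Std
dividing by N - 1. That the N - 1 variant matches the reported output is
offered here as a likely explanation, not as a confirmed reading of the
implementation.

```python
from statistics import mean, pstdev, stdev

rows = [[-2.0, 2.3, 0.0], [3.8, 0.0, 1.9]]

def scale(rows, std_fn):
    """Apply Eq. 1 column by column, using std_fn as the Std estimator."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [std_fn(c) for c in cols]
    return [[round((x - m) / s, 4) for x, m, s in zip(row, means, stds)]
            for row in rows]

population_scaled = scale(rows, pstdev)  # Std divides by N
sample_scaled = scale(rows, stdev)       # Std divides by N - 1

print(population_scaled)
print(sample_scaled)
```

With these two rows, the N variant gives the ±1.0 values expected above and
the N - 1 variant gives the ±0.7071 values that StandardScaler printed.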


Another example:

Adding another DenseVector([2.4, 0.8, 3.5]) to the above, we get 3
rows of DenseVectors:
[DenseVector([-2.0, 2.3, 0.0]), DenseVector([3.8, 0.0, 1.9]),
DenseVector([2.4, 0.8, 3.5])]

StandardScaler returns the following scaled vectors:
[DenseVector([-1.12339, 1.084829, -1.02731]),
DenseVector([0.792982, -0.88499, 0.057073]),
DenseVector([0.330409, -0.19984, 0.970241])]

This result has std = sqrt(2/3) for each column.

Instead, it should have produced 3 other vectors that give std = 1 for each column.

Adding another vector (4 in total) results in 4 scaled vectors that give
std = sqrt(3/4) instead of std = 1.
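
The sqrt(1/2), sqrt(2/3), sqrt(3/4) pattern can be checked with a small
plain-Python sketch (not MLlib code; the helper name is hypothetical). It
assumes the scaler divides by the N - 1 standard deviation while the std of
the result is measured with the N formula, in which case the measured value
is sqrt((N - 1)/N). Only the first column of the 2-row and 3-row examples is
used, since the fourth vector is not given in this message.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

def population_std_after_sample_scaling(column):
    """Scale with the N - 1 std, then measure the result with the N std."""
    m, s = mean(column), stdev(column)
    return pstdev([(x - m) / s for x in column])

first_column = [-2.0, 3.8, 2.4]  # first component of the three example rows

observed = {}
for n in (2, 3):
    observed[n] = population_std_after_sample_scaling(first_column[:n])
    # Print the measured std next to sqrt((n - 1) / n) for comparison.
    print(n, observed[n], sqrt((n - 1) / n))
```
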

I hope the examples help to make my point clear.

I hope I'm not missing something here.

Thank you

Gilad Barkan
