Hi Gilad,

Spark uses the sample standard deviation inside the StandardScaler (see https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler ), which I think explains the results you are seeing. I believe the scalers are intended to be used on larger datasets, where the difference between the sample and population deviation is negligible. You can verify this yourself by doing the same computation in Python: scaling with the sample deviation should produce the values you are seeing from Spark.
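
For example, here is a minimal sketch of that check in plain NumPy (no Spark involved), using the two vectors from the documentation example. The helper name `standardize` is just something made up for illustration, not a Spark API, and I am assuming NumPy's `ddof` argument is the right stand-in for the sample versus population choice:

```python
import numpy as np

# The two vectors from the documentation example, one row per vector.
X = np.array([[-2.0, 2.3, 0.0],
              [3.8, 0.0, 1.9]])


def standardize(X, ddof):
    """Center each column and divide by its standard deviation.

    ddof=1 divides by n - 1 (sample deviation),
    ddof=0 divides by n (population deviation).
    Hypothetical helper for illustration only.
    """
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=ddof)


sample_scaled = standardize(X, ddof=1)      # sample deviation
population_scaled = standardize(X, ddof=0)  # population deviation

print(np.round(sample_scaled, 4))
print(np.round(population_scaled, 4))

# Population std of the sample-scaled columns; for n rows this
# works out to sqrt((n - 1) / n), i.e. sqrt(1/2) for n = 2.
print(sample_scaled.std(axis=0, ddof=0))
```

With `ddof=1` the values should come out as +/-0.7071, matching Spark, and with `ddof=0` as +/-1.0, matching Eq. 1 in your message. The same factor sqrt((n - 1) / n) would account for the sqrt(2/3) and sqrt(3/4) you see with 3 and 4 vectors.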

Cheers,

Holden :)


On Sun, Jan 8, 2017 at 12:06 PM, Gilad Barkan <gilad.barkan@gmail.com> wrote:
Hi

It seems that the output of MLlib's StandardScaler(withMean=True, withStd=True) is not as expected.

The above configuration is expected to do the following transformation:

X -> Y = (X-Mean)/Std  - Eq.1

This transformation (a.k.a. Standardization) should result in a "standardized" vector with unit-variance and zero-mean.

I'll demonstrate my claim using the current documentation example:
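>>> from pyspark.mllib.feature import StandardScaler
>>> from pyspark.mllib.linalg import Vectors
>>> # sc is the SparkContext provided by the pyspark shell
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(withMean=True, withStd=True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): print(r)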

DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])

This results in std = sqrt(1/2) for each column instead of std = 1.
Applying the standardization transformation (Eq. 1) to the above 2 vectors should result in the following output:

DenseVector([-1.0, 1.0, -1.0])
DenseVector([1.0, -1.0, 1.0])

Another example:
Adding another DenseVector([2.4, 0.8, 3.5]) to the above, we get 3 rows of DenseVectors:
[DenseVector([-2.0, 2.3, 0.0]), DenseVector([3.8, 0.0, 1.9]), DenseVector([2.4, 0.8, 3.5])]

The StandardScaler returns the following scaled vectors:
[DenseVector([-1.12339, 1.084829, -1.02731]), DenseVector([0.792982, -0.88499, 0.057073]), DenseVector([0.330409, -0.19984, 0.970241])]
This result has std = sqrt(2/3) for each column.
Instead, it should have produced 3 other vectors with std = 1 for each column.

Adding another vector (4 in total) results in 4 scaled vectors with std = sqrt(3/4) instead of std = 1.

I hope these examples make my point clear.
I hope I am not missing something here.

Thank you
Gilad Barkan
 




