Hi Gilad,

Cheers,

Holden :)

On Sun, Jan 8, 2017 at 12:06 PM, Gilad Barkan <gilad.barkan@gmail.com> wrote:

Hi

It seems that the output of MLlib's StandardScaler(withMean=True, withStd=True) is not as expected.

The above configuration is expected to do the following transformation:

X -> Y = (X - Mean) / Std    (Eq. 1)

This transformation (a.k.a. standardization) should result in a "standardized" vector with unit variance and zero mean.

I'll demonstrate my claim using the current documentation example:

>>> from pyspark.mllib.feature import StandardScaler
>>> from pyspark.mllib.linalg import Vectors
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): print(r)

DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])

This results in std = sqrt(1/2) for each column instead of std = 1.

Applying the standardization transformation of Eq. 1 to the above 2 vectors results in the following output:

DenseVector([-1.0, 1.0, -1.0])
DenseVector([1.0, -1.0, 1.0])
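Both sets of numbers above can be reproduced outside Spark. The following is a minimal NumPy sketch (NumPy and the variable names are assumptions for illustration, they are not part of the original example). It suggests that the 0.7071 values are what Eq. 1 gives when Std is the sample standard deviation (N - 1 in the denominator), while the 1.0 values are what it gives when Std is the population standard deviation (N in the denominator). This would be consistent with the scaler using the sample standard deviation.

```python
import numpy as np

# The two vectors from the documentation example, one row per vector.
data = np.array([[-2.0, 2.3, 0.0],
                 [3.8, 0.0, 1.9]])

mean = data.mean(axis=0)

# Eq. 1 with the sample standard deviation (ddof=1, i.e. divide by N - 1).
scaled_sample = (data - mean) / data.std(axis=0, ddof=1)

# Eq. 1 with the population standard deviation (ddof=0, i.e. divide by N).
scaled_population = (data - mean) / data.std(axis=0, ddof=0)

print(np.round(scaled_sample, 4))      # the +/-0.7071 values
print(np.round(scaled_population, 4))  # the +/-1.0 values

# Standard deviation of the sample-scaled columns, measured both ways.
print(scaled_sample.std(axis=0, ddof=0))  # sqrt(1/2) per column
print(scaled_sample.std(axis=0, ddof=1))  # 1.0 per column
```

Under this reading, columns scaled with the sample standard deviation have unit variance only when the variance is also measured with N - 1. Measured with N, the standard deviation comes out as sqrt((N - 1)/N) for N rows.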

Another example:

Adding another DenseVector([2.4, 0.8, 3.5]) to the above, we get 3 rows of DenseVectors:

[DenseVector([-2.0, 2.3, 0.0]),
 DenseVector([3.8, 0.0, 1.9]),
 DenseVector([2.4, 0.8, 3.5])]

The StandardScaler returns the following scaled vectors:

[DenseVector([-1.12339, 1.084829, -1.02731]),
 DenseVector([0.792982, -0.88499, 0.057073]),
 DenseVector([0.330409, -0.19984, 0.970241])]

This result has std = sqrt(2/3). Instead, it should have produced 3 other vectors with std = 1 for each column.

Adding another vector (4 total) results in 4 scaled vectors with std = sqrt(3/4) instead of std = 1.

I hope all the examples help to make my point clear. I hope I am not missing something here.

Thank you

Gilad Barkan

Cell : 425-233-8271

Twitter: https://twitter.com/holdenkarau