spark-issues mailing list archives

From "zhengruifeng (Jira)" <>
Subject [jira] [Commented] (SPARK-30286) Some thoughts on new features for MLLIB
Date Tue, 17 Dec 2019 11:12:00 GMT


zhengruifeng commented on SPARK-30286:

 It seems that the last roadmap for mllib was for 2.0, and the community has
not discussed the future of mllib for a long time.

The list in the description is what I have been thinking about for some time. Among those items, I tend
to include three in ML: 1, *mini-batch KMeans*; 2, *vector validator*; 3, *Vectors enhancement*.

Friendly ping [~srowen]  [~viirya]: what do you think of this? Thanks.
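
To make the third item (Vectors enhancement, item 5 in the quoted description) more concrete, here is a minimal sketch in plain Python rather than Spark code. All function names below are hypothetical and only illustrate the idea: once dense and sparse vectors both expose an iterator of (index, value) pairs, a method can be written once against that iterator and reused without converting between vector types.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Entry = Tuple[int, float]  # one (index, value) slot of a vector


def dense_active(values: Sequence[float]) -> Iterator[Entry]:
    """Yield every slot of a dense vector, zeros included."""
    return ((i, float(v)) for i, v in enumerate(values))


def sparse_active(indices: Sequence[int], values: Sequence[float]) -> Iterator[Entry]:
    """Yield only the explicitly stored slots of a sparse vector."""
    return ((i, float(v)) for i, v in zip(indices, values))


def non_zero(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Drop entries whose value is exactly zero."""
    return ((i, v) for i, v in entries if v != 0.0)


def squared_norm(entries: Iterable[Entry]) -> float:
    """Written once against the iterator, not against a vector class."""
    return sum(v * v for _, v in entries)


def dot_with_weights(entries: Iterable[Entry], weights: List[float]) -> float:
    """Dot product of the iterated vector with a dense weight list."""
    return sum(v * weights[i] for i, v in entries)
```

The same `squared_norm` and `dot_with_weights` accept either representation, which is the reuse that the proposal is after.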


> Some thoughts on new features for MLLIB
> ---------------------------------------
>                 Key: SPARK-30286
>                 URL:
>             Project: Spark
>          Issue Type: Wish
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Minor
> Some thoughts on new features for ML:
> 1, clustering: *mini-batch KMeans*: KMeans may be one of the most widely used algorithms in
MLLIB; mini-batch KMeans is much faster than KMeans with [comparable results|];
in SKLearn it is a separate estimator, in MLLIB we may add it as one/two params in existing
> 2, classification & regression:
>  2.1 ExtraTrees (Extremely Randomized Trees): an even more randomized version of tree
ensemble; it has a lower variance than its sibling RandomForest, and it seems to be used more and more
in online contests. It seems that it can be easily implemented atop existing ensemble
>  2.2 Categorical Naive Bayes: a new NB just released in SKLearn 0.22; it should be easy
to implement it as a new modelType in MLLIB's NB;
> 3, features:
>  3.1 *vector validator*: a new UnaryTransformer that checks whether a vector column meets
some requirements, like non-NaN, non-negative, positive, all values are binary/int, all vectors
are dense/sparse, numFeatures. Currently some implementations deal with invalid values, but most do not.
For example, we first scale the input with MinMaxScaler; however, MinMaxScaler will ignore NaN
in training and keep the NaN in transformation. The scaled dataset is then fed into LinearRegression,
and at the end I obtain a LinearRegressionModel with NaN coefficients. In the whole pipeline,
no exception is thrown. With this validator, the pipeline can fail early.
>  3.2 inverse transform for models/transformers: we may add a new bool param HasInverseTransform;
>  3.3 non-linear transformation: quantile transforms and power transforms (including the well-known
Box-Cox method), which map data from any distribution to be as close as possible to another distribution
(mostly Gaussian); _I am working on this, since I needed this feature recently_;
>  3.4 similarity search: in my experience, Approximate Nearest Neighbors based on KMeans
provides more accurate results than LSH; can we follow some well-known libraries like Facebook-FAISS
to implement a new ANN?
> 4, warm start: initialize the model from a previous model, ONLY the coefficients are
used (the params related to the previous model are ignored), maybe a new string param HasInitialModelPath
can be added at first.
> 5, linalg: *Vectors* should support more methods, like *iterator*, *activeIterator*, *nonZeroIterator*,
so that we can implement some methods based on Iterator[(Int, Double)] instead of ml.Vector/mllib.Vector,
and reuse them on both sides without vector conversions.
> 6, parameter server: there were several tickets for it. It should be super useful and
would provide efficient gradient-based solvers for many algorithms. I also know there were some efforts
to implement it atop Spark, like Tencent-Angel & [Glint|]
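
The silent-NaN scenario from item 3.1 can be reproduced in a few lines. The following is a rough sketch in plain Python, not Spark code: `min_max_scale` only mimics the behavior described in the issue (NaN ignored when fitting, kept when transforming), and `validate_vector` with its parameters is a hypothetical stand-in for the proposed validator.

```python
import math
from typing import List, Optional, Sequence


def min_max_scale(column: Sequence[float]) -> List[float]:
    """Scale to [0, 1]: NaN is ignored when fitting and kept when transforming.

    Assumes the column holds at least two distinct non-NaN values.
    """
    finite = [x for x in column if not math.isnan(x)]
    lo, hi = min(finite), max(finite)
    # (NaN - lo) / (hi - lo) is still NaN, so no error is raised here.
    return [(x - lo) / (hi - lo) for x in column]


def validate_vector(
    vec: Sequence[float],
    *,
    allow_nan: bool = False,
    non_negative: bool = False,
    num_features: Optional[int] = None,
) -> Sequence[float]:
    """Raise ValueError on the first violated requirement, else return vec."""
    if num_features is not None and len(vec) != num_features:
        raise ValueError(f"expected {num_features} features, got {len(vec)}")
    for i, x in enumerate(vec):
        if not allow_nan and math.isnan(x):
            raise ValueError(f"NaN at index {i}")
        if non_negative and x < 0:
            raise ValueError(f"negative value {x} at index {i}")
    return vec


# Without a validator the NaN flows through and poisons a downstream statistic.
scaled = min_max_scale([1.0, float("nan"), 3.0])
downstream_mean = sum(scaled) / len(scaled)  # NaN, and nothing was thrown
```

Calling `validate_vector(scaled)` between the two stages raises a `ValueError` at index 1, so the pipeline fails before any model is fit on the bad data.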

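For item 1, the mini-batch idea can be sketched as follows. This is an illustrative plain-Python version of the commonly described mini-batch update (each center keeps a count and moves toward each assigned point with step size 1/count); it is not the SKLearn or MLLIB implementation, and the function name and the `batch_size`, `iterations`, and `seed` parameters are assumptions made for the example.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers: List[List[float]], p: Point) -> int:
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda j: squared_distance(centers[j], p))


def mini_batch_kmeans(
    points: List[Point],
    k: int,
    batch_size: int = 16,
    iterations: int = 100,
    seed: int = 0,
) -> List[List[float]]:
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(iterations):
        # Each iteration looks at a small random batch, not the full dataset.
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Assign the whole batch first, then move the centers.
        assigned = [nearest(centers, p) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center step size shrinks over time
            centers[j] = [(1.0 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers
```

Each iteration costs time proportional to `batch_size * k` instead of the full dataset size, which is where the speed-up over a full KMeans pass comes from.
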