spark-issues mailing list archives

From "zhengruifeng (Jira)" <>
Subject [jira] [Commented] (SPARK-30286) Some thoughts on new features for MLLIB
Date Wed, 18 Dec 2019 03:59:00 GMT


zhengruifeng commented on SPARK-30286:

[~srowen] Thanks for the reply. I will make them individual JIRAs for tracking.

> Some thoughts on new features for MLLIB
> ---------------------------------------
>                 Key: SPARK-30286
>                 URL:
>             Project: Spark
>          Issue Type: Wish
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Minor
> Some thoughts on new features for ML:
> 1, clustering: *mini-batch KMeans*: KMeans is maybe one of the most widely used algs in
> MLLIB; mini-batch KMeans is much faster than KMeans with [comparable results|];
> in SKLearn it is a separate estimator, while in MLLIB we may add it as one or two params in the existing KMeans.
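The mini-batch update can be illustrated with a minimal pure-Python sketch (not Spark code; function and parameter names are illustrative). Each iteration assigns a small random batch to the nearest centers and moves each center toward its batch points with a per-center learning rate that decays as that center accumulates samples, which is what makes it so much cheaper than full KMeans on large data:

```python
import random

def mini_batch_kmeans(points, k, batch_size, iters, seed=0):
    """Sketch of mini-batch KMeans: update centers from a small random
    sample each iteration instead of a full pass over the dataset."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # per-center sample counts, used as learning rates
    for _ in range(iters):
        batch = [rng.choice(points) for _ in range(batch_size)]
        for p in batch:
            # assign the point to its nearest center
            j = min(range(k),
                    key=lambda c: sum((p[d] - centers[c][d]) ** 2
                                      for d in range(len(p))))
            # move that center toward the point with a decaying step size
            counts[j] += 1
            eta = 1.0 / counts[j]
            for d in range(len(p)):
                centers[j][d] = (1 - eta) * centers[j][d] + eta * p[d]
    return centers
```

On two well-separated clusters the centers land near the cluster means after a few hundred cheap batch updates.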
> 2, classification & regression:
>  2.1 ExtraTrees (Extremely Randomized Trees): an even more randomized version of tree
> ensembles; it has a lower variance than its sibling RandomForest, and it seems that in online contests
> ExtraTrees are used more and more. It could likely be implemented easily atop the existing ensemble code.
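The key difference from RandomForest is the split rule, sketched below in plain Python (illustrative names, not the MLLIB tree internals): instead of searching every candidate threshold for the best impurity, ExtraTrees draws the threshold uniformly at random and only uses impurity to compare the few random candidates across features:

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def extra_trees_split(xs, ys, rng):
    """Sketch of the ExtraTrees split rule for one numeric feature:
    draw the threshold uniformly at random between the feature's min
    and max (RandomForest would instead search for the best one)."""
    lo, hi = min(xs), max(xs)
    thr = rng.uniform(lo, hi)
    left = [y for x, y in zip(xs, ys) if x < thr]
    right = [y for x, y in zip(xs, ys) if x >= thr]
    # impurity is still computed, but only to pick among the random
    # candidate splits of different features, not to tune the threshold
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
    return thr, score
```

Skipping the threshold search is what lowers both the training cost and (after averaging many trees) the variance.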
>  2.2 Categorical Naive Bayes: a new NB variant just released in SKLearn 0.22; it should be easy
> to implement as a new modelType in MLLIB's NB;
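The model itself is small, as this pure-Python sketch shows (names are illustrative; it mirrors the idea of scikit-learn's CategoricalNB rather than any MLLIB API): per class, estimate add-alpha-smoothed frequencies of each categorical feature value, then predict by the largest log-posterior:

```python
import math
from collections import defaultdict

def fit_categorical_nb(X, y, alpha=1.0):
    """Sketch of categorical Naive Bayes: each feature is a category
    index; learn smoothed per-class category frequencies."""
    classes = sorted(set(y))
    n_features = len(X[0])
    # categories seen per feature, needed for the smoothing denominator
    cats = [sorted({row[f] for row in X}) for f in range(n_features)]
    prior = {c: math.log(sum(1 for t in y if t == c) / len(y))
             for c in classes}
    logp = {}  # log P(feature f = v | class c), add-alpha smoothed
    for c in classes:
        rows = [row for row, t in zip(X, y) if t == c]
        for f in range(n_features):
            counts = defaultdict(int)
            for row in rows:
                counts[row[f]] += 1
            denom = len(rows) + alpha * len(cats[f])
            for v in cats[f]:
                logp[(c, f, v)] = math.log((counts[v] + alpha) / denom)
    return classes, prior, logp

def predict_categorical_nb(model, row):
    classes, prior, logp = model
    return max(classes,
               key=lambda c: prior[c] + sum(logp[(c, f, v)]
                                            for f, v in enumerate(row)))
```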
> 3, features:
>  3.1 *vector validator*: a new UnaryTransformer that checks whether a vector column meets
> some requirements, like non-NaN, non-negative, positive, all values binary/int, all vectors
> dense/sparse, numFeatures; currently some implementations deal with invalid values, but most do not.
> For example, suppose I first scale the input with MinMaxScaler: MinMaxScaler will ignore NaN
> in training and keep the NaN in transformation; the scaled dataset is then fed into LinearRegression,
> and in the end I obtain a LinearRegressionModel with NaN coefficients. In the whole pipeline,
> no exception is thrown. With this validator, the pipeline could fail early.
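The per-vector check itself could look like this minimal sketch (plain Python, not the proposed UnaryTransformer API; parameter names are assumptions): it raises at the first offending vector instead of letting NaN flow silently downstream.

```python
import math

def validate_vector(vec, allow_nan=False, require_non_negative=False,
                    num_features=None):
    """Sketch of the proposed vector validator: fail fast on NaN,
    negative values, or a wrong dimension, rather than producing a
    model with NaN coefficients at the end of the pipeline."""
    if num_features is not None and len(vec) != num_features:
        raise ValueError(f"expected {num_features} features, got {len(vec)}")
    for v in vec:
        if not allow_nan and math.isnan(v):
            raise ValueError("vector contains NaN")
        if require_non_negative and v < 0:
            raise ValueError("vector contains a negative value")
    return vec
```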
>  3.2 inverse transform for models/transformers: we may add a new bool param HasInverseTransform;
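What an inverse-capable transformer would look like can be sketched with a toy min-max scaler (plain Python; the class and method names are illustrative, not a proposed MLLIB API): `transform` maps into [0, 1] and `inverse_transform` undoes it exactly.

```python
class MinMaxScalerLike:
    """Sketch of a transformer that carries an inverse: fit stores the
    observed min/max, transform maps to [0, 1], and inverse_transform
    maps back (the capability a HasInverseTransform param would flag)."""
    def fit(self, xs):
        self.lo, self.hi = min(xs), max(xs)
        return self

    def transform(self, xs):
        span = self.hi - self.lo
        return [(x - self.lo) / span for x in xs]

    def inverse_transform(self, ys):
        span = self.hi - self.lo
        return [y * span + self.lo for y in ys]
```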
>  3.3 non-linear transformation: quantile transforms and power transforms (including the famous
> Box-Cox method), which map data from any distribution to be as close as possible to another
> distribution (mostly Gaussian); _I am working on this, since I recently needed this feature_;
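The Box-Cox transform and its parameter choice fit in a few lines; here is a pure-Python sketch (the grid search over lambda via the profile log-likelihood is my illustrative simplification of what power-transform implementations typically do):

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform for positive x:
    (x**lam - 1) / lam if lam != 0, else ln(x)."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive input")
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def best_lambda(xs, grid=None):
    """Pick lambda by maximizing the Box-Cox profile log-likelihood
    over a coarse grid (a sketch, not a production optimizer)."""
    if grid is None:
        grid = [i / 10.0 for i in range(-20, 21)]
    n = len(xs)
    log_sum = sum(math.log(x) for x in xs)

    def ll(lam):
        ys = [box_cox(x, lam) for x in xs]
        mean = sum(ys) / n
        var = sum((y - mean) ** 2 for y in ys) / n
        return -0.5 * n * math.log(var) + (lam - 1.0) * log_sum

    return max(grid, key=ll)
```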
>  3.4 similarity search: in my experience, Approximate Nearest Neighbors based on KMeans
> provides more accurate results than LSH; can we follow well-known libraries like Facebook's FAISS
> to implement a new ANN?
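The KMeans-based approach is essentially an inverted-file index; this pure-Python sketch shows the idea (class and parameter names are illustrative, modeled loosely on the IVF idea behind FAISS, not on its actual API): partition points by nearest center, then at query time scan only the few closest partitions.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class KMeansANN:
    """Sketch of an IVF-style ANN index: bucket points by their nearest
    KMeans center; a query scans only the n_probe nearest buckets
    instead of the whole dataset."""
    def __init__(self, points, centers, n_probe=1):
        self.centers = centers
        self.n_probe = n_probe
        self.cells = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda c: sq_dist(p, centers[c]))
            self.cells[j].append(p)

    def search(self, q):
        # rank cells by distance from the query to each center
        order = sorted(range(len(self.centers)),
                       key=lambda c: sq_dist(q, self.centers[c]))
        candidates = [p for c in order[:self.n_probe]
                      for p in self.cells[c]]
        return min(candidates, key=lambda p: sq_dist(q, p))
```

Raising `n_probe` trades query speed for recall, which is the usual accuracy knob in such indexes.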
> 4, warm start: initialize the model from a previous model; ONLY the coefficients are
> used (the params related to the previous model are ignored). Maybe a new string param HasInitialModelPath
> can be added at first.
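Since only the coefficients carry over, warm start reduces to choosing the optimizer's starting point, as in this toy gradient-descent sketch (plain Python, illustrative names): starting from a previous model's coefficients reaches the solution in far fewer iterations than starting from zeros.

```python
def gd_least_squares(X, y, coef, iters, lr=0.1):
    """Plain gradient descent on mean squared error; `coef` is the
    starting point, so passing a previous model's coefficients in is
    exactly a warm start."""
    n = len(X)
    for _ in range(iters):
        grad = [0.0] * len(coef)
        for row, t in zip(X, y):
            err = sum(c * x for c, x in zip(coef, row)) - t
            for j, x in enumerate(row):
                grad[j] += 2.0 * err * x / n
        coef = [c - lr * g for c, g in zip(coef, grad)]
    return coef
```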
> 5, linalg: *Vectors should support more methods, like* *iterator,* *activeIterator, nonZeroIterator*,
> so that we can implement methods based on Iterator[(Int, Double)] instead of ml.Vector/mllib.Vector
> and reuse them on both sides without vector conversions.
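The payoff of such iterators can be sketched in Python (the sparse `(size, indices, values)` tuple is an assumed stand-in for a sparse vector class; names are illustrative): both representations expose the same `(index, value)` stream, so downstream math is written once.

```python
def active_iterator(vec):
    """Yield (index, value) for a vector's stored entries. A dense
    vector (a plain list) stores every slot; a sparse vector, modeled
    here as (size, indices, values), stores only explicit entries."""
    if isinstance(vec, tuple):
        _, indices, values = vec
        yield from zip(indices, values)
    else:
        yield from enumerate(vec)

def non_zero_iterator(vec):
    """Like active_iterator, but also skips explicit zeros."""
    return ((i, v) for i, v in active_iterator(vec) if v != 0.0)

def dot(a, b_dense):
    """Example reuse: one dot product for both representations,
    written against the iterator instead of the concrete vector type."""
    return sum(v * b_dense[i] for i, v in non_zero_iterator(a))
```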
> 6, parameter server: there were several tickets for it. It should be super useful and
> would provide efficient gradient-based solvers for many algs. I also know there were some efforts
> to implement one atop Spark, like Tencent's Angel & [Glint|]

This message was sent by Atlassian Jira

