spark-issues mailing list archives

From "zhengruifeng (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-30286) Some thoughts on new features for MLLIB
Date Tue, 17 Dec 2019 11:15:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-30286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

zhengruifeng updated SPARK-30286:
---------------------------------
    Description: 
Some thoughts on new features for ML:

1, clustering: *mini-batch KMeans*: KMeans may be one of the most widely used algorithms in MLLIB;
mini-batch KMeans is much faster than KMeans with [comparable results|https://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#sphx-glr-auto-examples-cluster-plot-mini-batch-kmeans-py];
in SKLearn it is a separate estimator, while in MLLIB we may add it as one or two params on the existing
KMeans.
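To illustrate the idea, here is a minimal pure-Python sketch of the mini-batch update (per-center learning rate 1/count, in the style of Sculley's 2010 algorithm); the function name and signature are hypothetical, not MLlib's actual API:

```python
import random

def mini_batch_kmeans(points, k, batch_size, iters, seed=0):
    """Sketch of mini-batch KMeans: each iteration samples a small batch,
    assigns its points to the nearest center, and moves each center with a
    per-center learning rate 1/count instead of a full-data recomputation."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(iters):
        batch = [rng.choice(points) for _ in range(batch_size)]
        for p in batch:
            # assign to nearest center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            counts[j] += 1
            eta = 1.0 / counts[j]  # decaying per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers
```

Since each step touches only `batch_size` points, the cost per iteration is independent of the dataset size, which is where the speedup over full KMeans comes from.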

2, classification & regression:
 2.1 ExtraTrees (Extremely Randomized Trees): an even more randomized version of tree ensembles;
it has lower variance than its sibling RandomForest, and extra-trees seem to be used more and
more in online contests; it can likely be implemented easily atop the existing ensemble implementations;
 2.2 Categorical Naive Bayes: a new NB variant just released in SKLearn 0.22; it should be easy
to implement as a new modelType in MLLIB's NB;
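To show what makes ExtraTrees (2.1) different, here is a sketch of its split rule: for each candidate feature it draws one *random* threshold in the feature's range instead of searching all thresholds as RandomForest does (function and names are illustrative, not an MLlib API):

```python
import random

def pick_extra_random_split(X, y, n_features_to_try, rng):
    """Sketch of the ExtraTrees split rule for binary labels: draw ONE
    uniform-random threshold per candidate feature, then keep the split
    with the lowest weighted Gini impurity."""
    def gini(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        p1 = sum(labels) / n
        return 1.0 - p1 ** 2 - (1 - p1) ** 2

    n_features = len(X[0])
    best = None
    for f in rng.sample(range(n_features), n_features_to_try):
        lo = min(row[f] for row in X)
        hi = max(row[f] for row in X)
        if lo == hi:
            continue  # constant feature, nothing to split on
        t = rng.uniform(lo, hi)  # the "extremely randomized" part
        left = [y[i] for i, row in enumerate(X) if row[f] < t]
        right = [y[i] for i, row in enumerate(X) if row[f] >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, f, t)
    return best  # (weighted_gini, feature, threshold)
```

Skipping the exhaustive threshold search is also what makes each tree cheaper to grow, at the cost of a little extra bias that the larger ensemble averages out.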
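For 2.2, the core of Categorical NB (matching what scikit-learn 0.22's CategoricalNB computes) is just per-class, per-feature category counts with Laplace smoothing; a minimal self-contained sketch:

```python
import math
from collections import defaultdict

def fit_categorical_nb(X, y, alpha=1.0):
    """Sketch of Categorical Naive Bayes: P(x_f = v | c) is estimated as
    (count(v, f, c) + alpha) / (count(c) + alpha * n_categories(f))."""
    classes = sorted(set(y))
    n_features = len(X[0])
    counts = {c: [defaultdict(int) for _ in range(n_features)] for c in classes}
    class_count = defaultdict(int)
    cats = [set() for _ in range(n_features)]
    for row, c in zip(X, y):
        class_count[c] += 1
        for f, v in enumerate(row):
            counts[c][f][v] += 1
            cats[f].add(v)
    total = len(y)

    def predict(row):
        best_c, best_lp = None, -math.inf
        for c in classes:
            lp = math.log(class_count[c] / total)  # log prior
            for f, v in enumerate(row):
                num = counts[c][f][v] + alpha
                den = class_count[c] + alpha * len(cats[f])
                lp += math.log(num / den)
            if lp > best_lp:
                best_c, best_lp = c, lp
        return best_c

    return predict
```

Since the existing MLLIB NB already dispatches on modelType (multinomial/bernoulli/gaussian), the count-based structure above should slot in naturally.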

3, features:
 3.1 *vector validator*: a new UnaryTransformer that checks whether a vector column meets some
requirements, like non-NaN, non-negative, positive, all values binary/int, all vectors
dense/sparse, or a given numFeatures. Currently some implementations deal with invalid values, but most do not.
For example, I first scale the input by MinMaxScaler; however, MinMaxScaler ignores NaN
in training and keeps the NaN in transformation, then the scaled dataset is fed into LinearRegression,
and at the end I obtain a LinearRegressionModel with NaN coefficients. In the whole pipeline,
no exception is thrown. With this validator, the pipeline can fail early.
 3.2 inverse transform for models/transformers: we may add a new bool param HasInverseTransform;
 3.3 non-linear transformation: quantile transforms and power transforms (including the famous
Box-Cox method) map data from any distribution to be as close as possible to another distribution (mostly
Gaussian); _I am working on this, since I need this feature recently_;
 3.4 similarity search: in my experience, Approximate Nearest Neighbors based on KMeans provides
more accurate results than LSH; can we follow some famous libraries like Facebook's FAISS to
implement a new ANN?
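The validator in 3.1 amounts to a per-vector check that raises before the model ever sees bad data; a minimal sketch (names are hypothetical, not an MLlib API):

```python
import math

def validate_vector(vec, allow_nan=False, require_nonnegative=False,
                    num_features=None):
    """Sketch of the proposed vector validator: fail early with a clear
    error instead of silently producing NaN coefficients downstream."""
    if num_features is not None and len(vec) != num_features:
        raise ValueError(f"expected {num_features} features, got {len(vec)}")
    for i, v in enumerate(vec):
        if not allow_nan and math.isnan(v):
            raise ValueError(f"NaN at index {i}")
        if require_nonnegative and v < 0:
            raise ValueError(f"negative value {v} at index {i}")
    return vec
```

Placed right after MinMaxScaler in the pipeline from the example above, the NaN would be caught at transform time rather than surfacing as NaN coefficients after training.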
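For 3.3, the Box-Cox transform itself is a one-liner; the real work in a full estimator is fitting the exponent λ by maximum likelihood (as scikit-learn's PowerTransformer does, with Yeo-Johnson extending it to non-positive data). The transform:

```python
import math

def box_cox(x, lam):
    """The Box-Cox power transform:
    ((x ** lam) - 1) / lam   for lam != 0
    ln(x)                    for lam == 0
    Defined for strictly positive x only."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam
```

Note λ = 1 reduces to a shift (x - 1) and λ = 0 is the log transform, so the family interpolates smoothly between "do nothing" and "take logs".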
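The KMeans-based ANN in 3.4 is essentially a coarse quantizer with inverted lists (the structure behind FAISS's IVF indexes): assign every point to its nearest center, then at query time search only the lists of the few closest centers. A pure-Python sketch (names illustrative):

```python
def build_kmeans_index(points, centers):
    """Sketch of KMeans-based ANN: partition points into per-center
    "inverted lists", then probe only the nprobe closest lists per query."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    lists = {i: [] for i in range(len(centers))}
    for p in points:
        j = min(range(len(centers)), key=lambda i: d2(p, centers[i]))
        lists[j].append(p)

    def search(query, nprobe=1):
        order = sorted(range(len(centers)), key=lambda i: d2(query, centers[i]))
        candidates = [p for i in order[:nprobe] for p in lists[i]]
        return min(candidates, key=lambda p: d2(query, p))

    return search
```

`nprobe` trades recall for speed: probing more lists approaches exact search, while `nprobe=1` scans only one partition.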

4, warm start: initialize the model from a previous model; ONLY the coefficients are used
(the params related to the previous model are ignored). Maybe a new string param HasInitialModelPath
can be added at first.
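The effect of warm start is easy to see on a tiny least-squares solver: starting the optimizer at a previous model's coefficients (and nothing else from that model) means a model already near the solution needs almost no further iterations. A minimal sketch with hypothetical names:

```python
def gradient_descent(X, y, init_coef=None, lr=0.1, iters=100):
    """Sketch of warm start for a linear model: begin optimization at a
    previous model's coefficients instead of zeros; only the coefficients
    are carried over, not the previous model's params."""
    n = len(X[0])
    coef = list(init_coef) if init_coef is not None else [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for row, target in zip(X, y):
            err = sum(c * v for c, v in zip(coef, row)) - target
            for j in range(n):
                grad[j] += err * row[j]
        for j in range(n):
            coef[j] -= lr * grad[j] / len(X)
    return coef
```

In MLlib terms, the string param would hold a path to the saved previous model, whose coefficients become `init_coef`.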

5, linalg: *Vectors support more methods, like* *iterator, activeIterator, nonZeroIterator*;
so that we can implement methods based on Iterator[(Int, Double)] instead of ml.Vector/mllib.Vector,
and reuse them on both sides without vector conversions.
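A Python sketch of the pattern (the real proposal is Scala's Iterator[(Int, Double)] on ml/mllib Vectors; names here are illustrative): write the numeric kernel once against an (index, value) iterator, and both dense and sparse representations feed it without conversion.

```python
def nonzero_iterator(vec):
    """Sketch of an (index, value) iterator abstraction: yields only the
    non-zero entries, so one kernel serves both a dense array and a sparse
    (indices, values) pair without converting vector types."""
    if isinstance(vec, tuple):          # sparse: (indices, values)
        indices, values = vec
        for i, v in zip(indices, values):
            if v != 0.0:
                yield i, v
    else:                               # dense: a plain list of values
        for i, v in enumerate(vec):
            if v != 0.0:
                yield i, v

def dot_with_dense(vec, dense):
    """A method written once against the iterator, reused for both formats."""
    return sum(v * dense[i] for i, v in nonzero_iterator(vec))
```

This is exactly the conversion-free reuse the ticket asks for: ml.Vector and mllib.Vector would each expose the iterator, and shared code depends only on it.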

6, parameter server: there have been several tickets for it. It should be super useful and would
provide efficient gradient-based solvers for many algorithms. I also know there have been some efforts
to implement one atop Spark, like Tencent's Angel & [Glint|https://github.com/Angel-ML/angel]


> Some thoughts on new features for MLLIB
> ---------------------------------------
>
>                 Key: SPARK-30286
>                 URL: https://issues.apache.org/jira/browse/SPARK-30286
>             Project: Spark
>          Issue Type: Wish
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


