spark-user mailing list archives

From "Timsina, Prem" <prem.tims...@mssm.edu>
Subject Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm
Date Tue, 05 Sep 2017 14:56:25 GMT
Hi Yanbo,
Thank you, I very much appreciate your help.
For the current use case, the data can fit on a single node, so spark-sklearn seems to
be a good choice.

I have one question regarding this:
“If no, Spark MLlib provides CrossValidator, which can run multiple machine learning algorithms
in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation”
If I understand correctly, it can run the parameter search for cross-validation in parallel.
However, Spark currently does not support running multiple algorithms (like Naïve Bayes,
Random Forest, etc.) in parallel. Am I correct?
If not, could you please point me to some resources that show multiple algorithms being run
in parallel?
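
As a rough illustration of what running several algorithms in parallel could look like
from the driver side, here is a minimal Python sketch. It is not a Spark MLlib API:
`fit_all` and the two toy trainers are hypothetical names made up for this example, and
a thread pool stands in for the scheduling. Spark does let separate threads of one
application submit jobs concurrently, so in principle each callable could wrap a call
such as `estimator.fit(df)`, but that is an assumption about the setup and is not
exercised here.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_all(trainers, data, max_workers=4):
    """Run every trainer on the same data concurrently.

    `trainers` maps a name to a callable that takes the data and returns a
    fitted model. The callables here are plain Python functions; wrapping a
    Spark estimator's fit() in one is an assumption, not shown in this sketch.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in trainers.items()}
        # result() blocks until that trainer finishes and re-raises its errors.
        return {name: fut.result() for name, fut in futures.items()}


# Toy stand-ins for real algorithms (hypothetical, for illustration only).
def majority_class(rows):
    """'Model' that is just the most frequent label."""
    labels = [label for _, label in rows]
    return max(sorted(set(labels)), key=labels.count)


def mean_threshold(rows):
    """'Model' that is just the mean of the feature values."""
    values = [x for x, _ in rows]
    return sum(values) / len(values)


rows = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0)]
models = fit_all({"majority": majority_class, "threshold": mean_threshold}, rows)
print(models)
```

The fitted models come back keyed by name, which makes it straightforward to hand them
to a second-layer step afterwards.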

Thank you very much. It is a great help; I will try spark-sklearn.
Prem




From: Yanbo Liang <ybliang8@gmail.com>
Date: Tuesday, September 5, 2017 at 10:40 AM
To: Patrick McCarthy <pmccarthy@dstillery.com>
Cc: "Timsina, Prem" <prem.timsina@mssm.edu>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

Hi Prem,

How large is your dataset? Can it fit on a single node?
If not, Spark MLlib provides CrossValidator, which can run multiple machine learning algorithms
in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
If yes, you can also try spark-sklearn, which can distribute the training of multiple models
(single-node training with sklearn) across a cluster and do parameter search. FYI: https://github.com/databricks/spark-sklearn

Thanks
Yanbo

On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy <pmccarthy@dstillery.com> wrote:
You might benefit from watching this JIRA issue: https://issues.apache.org/jira/browse/SPARK-19071

On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem <prem.timsina@mssm.edu> wrote:
Is there a way to parallelize multiple ML algorithms in Spark? My use case is something like
this:
A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, etc.) in parallel.
   1) Validate each algorithm using 10-fold cross-validation.
B) Feed the output of step A) into a second-layer machine learning algorithm.
My questions are:
Can we run the multiple machine learning algorithms in step A) in parallel?
Can we do cross-validation in parallel? For example, run the 10 iterations of Naive Bayes
training in parallel?

I was not able to find any way to run the different algorithms in parallel. It also seems that
cross-validation cannot be done in parallel.
I would appreciate any suggestions for parallelizing this use case.
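
To make the cross-validation part of the question concrete, the following is a minimal
Python sketch of 10-fold cross-validation with the folds evaluated concurrently. It
uses only the standard library, not Spark; `k_fold_indices`, `cross_validate`,
`fit_majority`, and `accuracy` are hypothetical helpers written for this example, and
the round-robin fold assignment is an arbitrary choice. Threads show the structure
only: pure-Python training would not speed up this way, whereas folds that each launch
cluster work could.

```python
from concurrent.futures import ThreadPoolExecutor


def k_fold_indices(n_rows, k):
    """Split range(n_rows) into k (train, test) index pairs, round-robin.

    Assumes n_rows >= k so that no test fold is empty.
    """
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]


def cross_validate(fit, score, rows, k=10, max_workers=4):
    """Fit and score every fold concurrently and return the mean score."""
    def run_fold(split):
        train_idx, test_idx = split
        model = fit([rows[i] for i in train_idx])
        return score(model, [rows[i] for i in test_idx])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(run_fold, k_fold_indices(len(rows), k)))
    return sum(scores) / len(scores)


# Toy model and metric (hypothetical stand-ins for a real algorithm).
def fit_majority(train):
    """'Model' that is just the most frequent training label."""
    labels = [label for _, label in train]
    return max(sorted(set(labels)), key=labels.count)


def accuracy(model, test):
    """Fraction of test rows whose label equals the predicted label."""
    return sum(1 for _, label in test if label == model) / len(test)


rows = [(float(i), 1 if i % 5 == 0 else 0) for i in range(20)]
mean_accuracy = cross_validate(fit_majority, accuracy, rows, k=10)
print(mean_accuracy)
```

Calling `cross_validate` once per algorithm, with a different `fit` each time, would
cover step A) of the use case; those calls are independent of one another and could
themselves be submitted concurrently.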

Prem

