spark-user mailing list archives

From Yanbo Liang <yblia...@gmail.com>
Subject Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm
Date Tue, 05 Sep 2017 14:40:05 GMT
Hi Prem,

How large is your dataset? Does it fit on a single node?
If not, Spark MLlib provides CrossValidator, which can train multiple machine
learning algorithms in parallel on a distributed dataset and do parameter
search. FYI:
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
If so, you can also try spark-sklearn, which can distribute the training of
multiple models (single-node training with sklearn) across a cluster
and do parameter search. FYI: https://github.com/databricks/spark-sklearn
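
For the single-node case, the work splits naturally into independent
(algorithm, fold) training tasks. The following is only a minimal
standard-library sketch of that decomposition, not spark-sklearn's actual
API: the two toy "algorithms", the helper names, and the data are all
hypothetical, and a local thread pool stands in for the cluster (threads
illustrate the task structure only, not a real speed-up).

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def k_fold_indices(n, k):
    """Split range(n) into k (train, test) index pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]


# Two toy stand-ins for real algorithms (hypothetical, illustration only).
def fit_majority(xs, ys):
    # Always predicts the most frequent training label.
    label = max(set(ys), key=ys.count)
    return lambda x: label


def fit_threshold(xs, ys):
    # The midpoint between the two class means acts as a decision boundary.
    m0 = mean(x for x, y in zip(xs, ys) if y == 0)
    m1 = mean(x for x, y in zip(xs, ys) if y == 1)
    cut = (m0 + m1) / 2
    return lambda x: int(x > cut) if m1 > m0 else int(x < cut)


def run_task(task):
    """Train one algorithm on one fold and return (name, accuracy)."""
    name, fit, xs, ys, train, test = task
    model = fit([xs[i] for i in train], [ys[i] for i in train])
    hits = sum(model(xs[i]) == ys[i] for i in test)
    return name, hits / len(test)


def cross_validate_all(algorithms, xs, ys, k=10, workers=4):
    """Run every (algorithm, fold) pair as an independent task."""
    tasks = [
        (name, fit, xs, ys, train, test)
        for name, fit in algorithms.items()
        for train, test in k_fold_indices(len(xs), k)
    ]
    scores = {name: [] for name in algorithms}
    # On a cluster, each task would be shipped to a worker instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name, acc in pool.map(run_task, tasks):
            scores[name].append(acc)
    return {name: mean(accs) for name, accs in scores.items()}


# Toy 1-D dataset: label 0 for small values, label 1 for large ones.
xs = [float(i) for i in range(40)]
ys = [0] * 20 + [1] * 20
results = cross_validate_all(
    {"majority": fit_majority, "threshold": fit_threshold}, xs, ys
)
```

The per-algorithm mean accuracies in `results` are the kind of output that
could then be fed to a second-layer model.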

Thanks
Yanbo

On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy <pmccarthy@dstillery.com>
wrote:

> You might benefit from watching this JIRA issue -
> https://issues.apache.org/jira/browse/SPARK-19071
>
> On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem <prem.timsina@mssm.edu>
> wrote:
>
>> Is there a way to parallelize multiple ML algorithms in Spark? My use
>> case is something like this:
>>
>> A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random
>> Forest, etc.) in parallel.
>>
>> 1) Validate each algorithm using 10-fold cross-validation
>>
>> B) Feed the output of step A) into a second-layer machine learning algorithm.
>>
>> My question is:
>>
>> Can we run multiple machine learning algorithms in step A in parallel?
>>
>> Can we do cross-validation in parallel? Like, run 10 iterations of Naive
>> Bayes training in parallel?
>>
>>
>>
>> I was not able to find any way to run the different algorithms in
>> parallel. And it seems cross-validation also cannot be done in parallel.
>>
>> I would appreciate any suggestions for parallelizing this use case.
>>
>>
>>
>> Prem
>>
>
>
