spark-issues mailing list archives

From "Kyle Ellrott (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-1486) Support multi-model training in MLlib
Date Tue, 12 Aug 2014 21:06:11 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094673#comment-14094673 ]

Kyle Ellrott commented on SPARK-1486:
-------------------------------------

It would be helpful to get some feedback on whether the work being done for SPARK-2372 would
help with this issue.

> Support multi-model training in MLlib
> -------------------------------------
>
>                 Key: SPARK-1486
>                 URL: https://issues.apache.org/jira/browse/SPARK-1486
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> It is rare in practice to train just one model with a given set of parameters. Usually,
> this is done by training multiple models with different sets of parameters and then selecting
> the best based on their performance on the validation set. MLlib should provide native support
> for multi-model training/scoring. It requires decoupling concepts like problem, formulation,
> algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar
> concepts, which we can borrow. There are different approaches to multi-model training:
> 0) Keep one copy of the data, and train models one after another (or maybe in parallel,
> depending on the scheduler).
> 1) Keep one copy of the data, and train multiple models at the same time (similar to
> `runs` in KMeans).
> 2) Make multiple copies of the data (still stored distributively), and use more cores
> to distribute the work.
> 3) Collect the data, make the entire dataset available on workers, and train one or more
> models on each worker.
> Users should be able to choose which execution mode they want to use. Note that 3) could
> cover many use cases in practice when the training data is not huge, e.g., <1GB.
> This task will be divided into sub-tasks, and this JIRA is created to discuss the design
> and track the overall progress.
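The simplest of the execution modes listed above, 0), can be sketched in miniature. The following is an illustrative stand-alone sketch, not MLlib code: it trains one model per parameter setting sequentially and keeps the one with the lowest validation error. The model (a one-parameter closed-form ridge fit) and the function names are hypothetical, chosen only to make the selection loop concrete.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for y ~ w * x with L2 strength lam."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def validation_error(w, xs, ys):
    """Sum of squared errors of the model w on a held-out set."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def train_multi_model(train, valid, lambdas):
    """Mode 0: one copy of the data, one model after another;
    select the best parameter setting on the validation set."""
    xs, ys = train
    vx, vy = valid
    return min(
        ((lam, fit_ridge_1d(xs, ys, lam)) for lam in lambdas),
        key=lambda pair: validation_error(pair[1], vx, vy),
    )

train = ([1.0, 2.0, 3.0], [2.0, 4.1, 5.9])
valid = ([4.0, 5.0], [8.1, 9.9])
best_lam, best_w = train_multi_model(train, valid, [0.0, 0.1, 1.0, 10.0])
```

Modes 1)-3) change only where the loop over parameter settings runs (inside one pass over the data, across data copies, or per worker on a broadcast dataset), not the select-by-validation-score logic above.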



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

