spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-1212) Support sparse data in MLlib
Date Mon, 31 Mar 2014 08:20:14 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955023#comment-13955023 ]

Xiangrui Meng commented on SPARK-1212:
--------------------------------------

Part II adds sparse data support to GLMs and Naive Bayes.

PR: https://github.com/apache/spark/pull/245

> Support sparse data in MLlib
> ----------------------------
>
>                 Key: SPARK-1212
>                 URL: https://issues.apache.org/jira/browse/SPARK-1212
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 0.9.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> MLlib's NaiveBayes, SGD, and KMeans accept RDD[LabeledPoint] for training and RDD[Array[Double]]
> for prediction, where LabeledPoint is a wrapper of (Double, Array[Double]). Using Array[Double]
> could have good performance, but sparse data appears quite often in practice. So I created
> this JIRA to discuss the plan of adding sparse data support to MLlib and track its progress.
> The goal is to support sparse data for training and prediction in all existing algorithms
> in MLlib:
> * Gradient Descent
> * K-Means
> * Naive Bayes
> Previous discussions and pull requests:
> * https://github.com/mesos/spark/pull/736
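
To illustrate the motivation in the quoted description, here is a minimal sketch of a sparse labeled point in plain Python. It is not MLlib's API: the names SparseVector, LabeledSparsePoint, and dense_dot are hypothetical and chosen only for this example. It assumes a sparse vector is stored as a size plus parallel lists of nonzero indices and values, and shows that a dot product against a dense weight vector then touches only the nonzero entries.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SparseVector:
    """Hypothetical sparse vector: only nonzero entries are stored."""
    size: int                 # logical length of the vector
    indices: Sequence[int]    # increasing positions of the nonzero entries
    values: Sequence[float]   # values[k] is the entry at position indices[k]

    def dot(self, weights: Sequence[float]) -> float:
        # Work is proportional to the number of nonzeros, not to `size`.
        return sum(v * weights[i] for i, v in zip(self.indices, self.values))

    def to_dense(self) -> List[float]:
        # Expand to the dense layout, i.e. what Array[Double] would hold.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense


@dataclass(frozen=True)
class LabeledSparsePoint:
    """Hypothetical analogue of a (label, features) pair with sparse features."""
    label: float
    features: SparseVector


def dense_dot(x: Sequence[float], w: Sequence[float]) -> float:
    # Dense dot product: visits every position, including the zeros.
    return sum(a * b for a, b in zip(x, w))


# One training example with 6 features, of which 2 are nonzero.
point = LabeledSparsePoint(1.0, SparseVector(6, [1, 4], [2.0, 3.0]))
weights = [0.5, 1.0, 0.0, 0.0, 2.0, 0.0]

sparse_result = point.features.dot(weights)
dense_result = dense_dot(point.features.to_dense(), weights)
```

Both results agree, but the sparse version performs two multiplications instead of six. With very wide, mostly-zero feature vectors, that gap is the saving the issue is after.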



--
This message was sent by Atlassian JIRA
(v6.2#6252)
