spark-issues mailing list archives

From "Travis Galoppo (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
Date Wed, 04 Feb 2015 19:23:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305775#comment-14305775 ]

Travis Galoppo edited comment on SPARK-5021 at 2/4/15 7:23 PM:
---------------------------------------------------------------

For the vectorMean function, the resulting vector may well be considerably denser than
the input vectors (it is called only once, with a set of random vectors); however, the computed
means may become sparser with each iteration if the clusters concentrate their density
in different regions of the input vector.  Although this does have me thinking... since the
assignments are soft, it is likely that very few vector entries will become zero... I'm not
sure what the tolerance is for zero entries, but the soft nature of the assignments may undermine
the performance benefit of working with sparse vectors.
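
To make the concern about soft assignments concrete, here is a minimal sketch in plain
Python. It is not Spark code: the dict-based sparse vectors, the weighted_mean and prune
helpers, and the 0.05 tolerance are all assumptions made up for illustration. It shows that
a weighted mean with hard (0/1) weights keeps the sparsity of the assigned points, while
soft weights give the mean a non-zero entry wherever any input has one.

```python
# Hypothetical illustration only; none of these names come from Spark MLlib.
# A sparse vector is modelled as a dict mapping index -> non-zero value.

def weighted_mean(vectors, weights):
    """Weighted mean of sparse vectors; stores only indices that were touched."""
    total = sum(weights)
    acc = {}
    for vec, w in zip(vectors, weights):
        if w == 0.0:
            continue  # a zero-weight point contributes nothing to the mean
        for i, v in vec.items():
            acc[i] = acc.get(i, 0.0) + w * v
    return {i: v / total for i, v in acc.items()}

def prune(vec, tol):
    """Drop entries whose magnitude is below tol (tol is an assumed tolerance)."""
    return {i: v for i, v in vec.items() if abs(v) >= tol}

# Three sparse points in a 6-dimensional space, two non-zeros each.
points = [{0: 1.0, 1: 2.0}, {2: 3.0, 3: 1.0}, {4: 2.0, 5: 4.0}]

# Hard assignment: only the first point belongs to the cluster.
hard_mean = weighted_mean(points, [1.0, 0.0, 0.0])

# Soft assignment: the other points still carry a small responsibility.
soft_mean = weighted_mean(points, [0.98, 0.01, 0.01])

# With an assumed tolerance, the tiny soft contributions can be dropped again.
pruned_mean = prune(soft_mean, 0.05)

print(len(hard_mean), len(soft_mean), len(pruned_mean))
```

Under these assumptions, the soft mean has an entry at every index that appears in any
input, so any sparsity in the means would have to come from thresholding small values,
which is where the tolerance question above comes in.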




was (Author: tgaloppo):
For the vectorMean function, the resulting vector may well be considerably denser than
the input vectors; however, the computed means may become sparser with each iteration
if the clusters concentrate their density in different regions of the input vector.
Although this does have me thinking... since the assignments are soft, it is likely that
very few vector entries will become zero... I'm not sure what the tolerance is for zero entries,
but the soft nature of the assignments may undermine the performance benefit of working with
sparse vectors.



> GaussianMixtureEM should be faster for SparseVector input
> ---------------------------------------------------------
>
>                 Key: SPARK-5021
>                 URL: https://issues.apache.org/jira/browse/SPARK-5021
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would be nice if
> it were faster for SparseVectors (running in time linear in the number of non-zero values).
> However, this may not be too important since clustering should rarely be done in high
> dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

