spark-user mailing list archives

From "Evan R. Sparks" <>
Subject Re: How to run kmeans after pca?
Date Tue, 30 Sep 2014 16:35:42 GMT
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized, so that iteration is doing both the multiply and the first
pass of KMeans at once. To isolate which part is slow, you can call
cachedRows.numRows() to force materialization before you run KMeans.
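A minimal sketch of the suggestion above, assuming `data` is an existing `RowMatrix` and `bcPrincipalComponents` is a broadcast principal-components matrix (names borrowed from the quoted snippet; `k` and `maxIterations` values are placeholders):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.storage.StorageLevel

// Project onto the principal components, then cache the resulting rows.
val projected: RowMatrix = data.multiply(bcPrincipalComponents.value)
val cachedRows = projected.rows.persist(StorageLevel.MEMORY_ONLY)

// Force materialization so the cost of the multiply is paid here,
// not inside the first KMeans iteration.
cachedRows.count()  // or: new RowMatrix(cachedRows).numRows()

// Now the KMeans stage timings reflect only the clustering work.
val model = KMeans.train(cachedRows, 10, 20)

// Release the cache once clustering is done.
cachedRows.unpersist()
```

With this ordering, the first KMeans iteration reads already-materialized rows, so comparing its runtime against later iterations tells you whether the multiply or the clustering was the bottleneck.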

Also, KMeans is optimized to run quickly on both sparse and dense data. The
result of PCA is going to be dense, but if your input data has roughly as
many nonzeros as the PCA output has entries (#nnz ~= size(pca data)),
performance might be about the same. (I haven't actually verified this last
point.)
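The comparison above can be made concrete with a back-of-the-envelope sketch (plain Scala, hypothetical dimensions): with n rows, d sparse columns at a given density, and k components kept, PCA only shrinks the work KMeans does per pass when n * k is well below the original nonzero count:

```scala
// Hypothetical dimensions for illustration only.
val n = 100000       // rows
val d = 10000        // original (sparse) columns
val sparsity = 0.01  // fraction of entries that are nonzero
val k = 50           // principal components kept

// Values a sparsity-aware KMeans touches per pass over the sparse input.
val nnzOriginal = (n * d * sparsity).toLong

// Values it touches per pass over the dense PCA projection.
val densePcaSize = n.toLong * k

val pcaHelps = densePcaSize < nnzOriginal
println(s"sparse nnz = $nnzOriginal, dense PCA size = $densePcaSize, " +
  s"PCA reduces work per pass: $pcaHelps")
```

If the two quantities are of the same order, the projection buys little, and the extra multiply plus the loss of sparsity can make the PCA-first pipeline slower overall.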

Finally, speed partly depends on how much data you have relative to
scheduler overheads. If your input data is small, the cost of distributing
your tasks can exceed the time spent actually computing. This usually
manifests as the stages taking about the same amount of time even though
you're passing datasets of different dimensionality.

On Tue, Sep 30, 2014 at 9:00 AM, st553 <> wrote:

> Thanks for your response Burak, it was very helpful.
> I am noticing that if I run PCA before KMeans, the KMeans algorithm
> actually takes longer to run than if I had just run KMeans without PCA. I
> was hoping that using PCA first would actually speed up the KMeans
> algorithm.
> I have followed the steps you've outlined, but I'm wondering if I need to
> cache/persist the RDD[Vector] rows of the RowMatrix returned after
> multiplying. Something like:
> val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)
> val cachedRows = newData.rows.persist()
> cachedRows.unpersist()
> It doesn't seem intuitive to me that a smaller-dimensional version of my
> data set would take longer for KMeans... unless I'm missing something?
> Thanks!
