spark-dev mailing list archives

From Erik Erlandson <eerla...@redhat.com>
Subject Re: [MLlib] PCA Aggregator
Date Fri, 19 Oct 2018 14:05:49 GMT
Hi Matt!

There are a couple of ways to do this. If you want to submit it for inclusion
in Spark, you should start by filing a JIRA for it and then opening a pull
request. Another possibility is to publish it as your own third-party
library, which I have done for aggregators before.


On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <matt@saunders.net> wrote:

> I built an Aggregator that computes PCA on grouped datasets. I wanted to
> use the PCA functions provided by MLlib, but they only work on a full
> dataset, and I needed to do it on a grouped dataset (like a
> RelationalGroupedDataset).
>
> So I built a little Aggregator that can do that, here’s an example of how
> it’s called:
>
>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>
>     // For each grouping, compute a PCA matrix/vector
>     val pcaModels = inputData
>       .groupBy(keys:_*)
>       .agg(pcaAggregation.as(pcaOutput))
>
> I used the same algorithms under the hood as
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
> directly on Datasets without converting to an RDD first.
>
> I’ve seen others who wanted this ability (for example on Stack Overflow)
> so I’d like to contribute it if it would be a benefit to the larger
> community.
>
> So... is this something worth contributing to MLlib? And if so, what are
> the next steps to start the process?
>
> thanks!
>
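The key to an Aggregator like the PCAAggregator described above is a buffer that can be merged across partitions: accumulate the count, the column sums, and the Gramian (sum of outer products), then derive the covariance matrix, whose eigenvectors are the principal components, in the finish step. A minimal pure-Scala sketch of that mergeable buffer (hypothetical names, with the Spark Aggregator plumbing and the final eigendecomposition omitted):

```scala
// Mergeable buffer for per-group covariance: the core state of a PCA Aggregator.
// Accumulates count, column sums, and the Gramian; the sample covariance
// (whose eigenvectors are the principal components) falls out at the end.
case class CovBuffer(n: Long, sums: Array[Double], gram: Array[Array[Double]]) {

  // reduce step: fold one input vector into the buffer
  def add(x: Array[Double]): CovBuffer = {
    val s = sums.zip(x).map { case (a, b) => a + b }
    val g = gram.zipWithIndex.map { case (row, i) =>
      row.zipWithIndex.map { case (v, j) => v + x(i) * x(j) }
    }
    CovBuffer(n + 1, s, g)
  }

  // merge step: combine partial buffers computed on different partitions
  def merge(other: CovBuffer): CovBuffer = {
    val s = sums.zip(other.sums).map { case (a, b) => a + b }
    val g = gram.zipWithIndex.map { case (row, i) =>
      row.zipWithIndex.map { case (v, j) => v + other.gram(i)(j) }
    }
    CovBuffer(n + other.n, s, g)
  }

  // finish step: sample covariance Cov(i,j) = (G(i,j) - n*mean_i*mean_j) / (n-1)
  def covariance: Array[Array[Double]] = {
    val mean = sums.map(_ / n)
    gram.zipWithIndex.map { case (row, i) =>
      row.zipWithIndex.map { case (v, j) => (v - n * mean(i) * mean(j)) / (n - 1) }
    }
  }
}

object CovBuffer {
  // zero buffer for d-dimensional input
  def zero(d: Int): CovBuffer =
    CovBuffer(0L, Array.fill(d)(0.0), Array.fill(d, d)(0.0))
}
```

In a Spark Aggregator these pieces would map onto the `zero`, `reduce`, `merge`, and `finish` methods of `org.apache.spark.sql.expressions.Aggregator`; the finish step would additionally eigendecompose the covariance matrix (e.g. via Breeze) to produce the principal components and explained variance, mirroring what RowMatrix does on the full dataset.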
