spark-dev mailing list archives

From Erik Erlandson <eerla...@redhat.com>
Subject Re: [MLlib] PCA Aggregator
Date Fri, 19 Oct 2018 17:33:19 GMT
For 3rd-party libs, I have been publishing independently, for example at
isarn-sketches-spark or silex:
https://github.com/isarn/isarn-sketches-spark
https://github.com/radanalyticsio/silex

Either of these repos provides good working examples of publishing a
Spark UDAF or ML library for the JVM and PySpark.
(If anyone is interested in contributing new components to either of these,
feel free to reach out)

For people new to Spark library dev, Will Benton and I recently gave a
talk at SAI-EU on publishing Spark libraries:
https://databricks.com/session/apache-spark-for-library-developers-2
Cheers,
Erik

On Fri, Oct 19, 2018 at 9:40 AM Stephen Boesch <javadba@gmail.com> wrote:

> Erik - is there a current home for approved/recommended third-party
> additions? The spark-packages site seems to have been stale for years.
>
> On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <
> eerlands@redhat.com> wrote:
>
>> Hi Matt!
>>
>> There are a couple of ways to do this. If you want to submit it for
>> inclusion in Spark, you should start by filing a JIRA for it, followed by
>> a pull request. Another possibility is to publish it as your own
>> third-party library, which I have done for aggregators before.
>>
>>
>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <matt@saunders.net> wrote:
>>
>>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>>> use the PCA functions provided by MLlib, but they only work on a full
>>> dataset, and I needed to do it on a grouped dataset (like a
>>> RelationalGroupedDataset).
>>>
>>> So I built a little Aggregator that can do that, here’s an example of
>>> how it’s called:
>>>
>>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>
>>>     // For each grouping, compute a PCA matrix/vector
>>>     val pcaModels = inputData
>>>       .groupBy(keys:_*)
>>>       .agg(pcaAggregation.as(pcaOutput))
>>>
>>> I used the same algorithms under the hood as
>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>>> directly on Datasets without converting to RDD first.
>>>
>>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>>> so I’d like to contribute it if it would be a benefit to the larger
>>> community.
>>>
>>> So, is this something worth contributing to MLlib? And if so, what are
>>> the next steps to start the process?
>>>
>>> thanks!
>>>
>>
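
As an illustration of the grouped aggregation Matt describes above, here is a
minimal sketch in plain Scala (no Spark dependency) of a mergeable covariance
buffer that follows the zero/reduce/merge/finish shape of a Spark Aggregator.
The name CovBuffer and its layout are assumptions made for illustration and
are not taken from the actual PCAAggregator. The principal components would
then come from a decomposition of the resulting covariance matrix (for
example with a linear algebra library such as Breeze), which is omitted here.

```scala
// Hypothetical sketch: CovBuffer is an illustrative name, not the original
// PCAAggregator. It accumulates the count, the per-dimension sum, and the
// Gramian (sum of x * x^T) so that partial results can be merged.
final case class CovBuffer(n: Long, sum: Vector[Double], gram: Vector[Vector[Double]])

object CovBuffer {
  // Empty buffer for d-dimensional input.
  def zero(d: Int): CovBuffer =
    CovBuffer(0L, Vector.fill(d)(0.0), Vector.fill(d, d)(0.0))

  // Fold one observation into the buffer.
  def reduce(b: CovBuffer, x: Vector[Double]): CovBuffer =
    CovBuffer(
      b.n + 1,
      b.sum.zip(x).map { case (s, xi) => s + xi },
      b.gram.zip(x).map { case (row, xi) =>
        row.zip(x).map { case (g, xj) => g + xi * xj }
      }
    )

  // Combine two partial buffers (e.g. from different partitions of one group).
  def merge(a: CovBuffer, b: CovBuffer): CovBuffer =
    CovBuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (p, q) => p + q },
      a.gram.zip(b.gram).map { case (ra, rb) =>
        ra.zip(rb).map { case (p, q) => p + q }
      }
    )

  // Sample covariance: (G - n * mean * mean^T) / (n - 1).
  def finish(b: CovBuffer): Vector[Vector[Double]] = {
    require(b.n > 1, "need at least two observations")
    val mean = b.sum.map(_ / b.n)
    b.gram.zipWithIndex.map { case (row, i) =>
      row.zipWithIndex.map { case (g, j) =>
        (g - b.n * mean(i) * mean(j)) / (b.n - 1)
      }
    }
  }
}
```

Because merge is associative and commutative, each group's buffer can be
built from partial results in any order, which is what lets the same
computation run per key under groupBy instead of over the whole dataset.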
