spark-user mailing list archives

From Feynman Liang <fli...@databricks.com>
Subject Re: mllib on (key, Iterable[Vector])
Date Tue, 11 Aug 2015 21:07:30 GMT
You could try flatMap: if you have data: RDD[(key, Iterable[Vector])], then
data.flatMap(_._2) is an RDD[Vector], which you can run GMM on. Note that this
drops the keys and fits a single model over the vectors from all keys.
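
A minimal sketch of the flatMap route, assuming the RDD-based
org.apache.spark.mllib API discussed in this thread and a local SparkContext.
The object name PooledGmm, the helpers flatten and fitPooled, the example urls,
the ages, k = 2 and the seed are all made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object PooledGmm {
  // Drop the keys and pool every vector into a single RDD[Vector].
  def flatten[K](data: RDD[(K, Iterable[Vector])]): RDD[Vector] =
    data.flatMap(_._2)

  // Fit one mixture model over the pooled vectors (k and seed are illustrative).
  def fitPooled[K](data: RDD[(K, Iterable[Vector])], k: Int): GaussianMixtureModel =
    new GaussianMixture().setK(k).setSeed(42L).run(flatten(data))

  // Hypothetical helper: wrap each age in a one-dimensional vector.
  def ages(xs: Double*): Iterable[Vector] = xs.map(x => Vectors.dense(x))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pooled-gmm").setMaster("local[2]"))

    // Made-up (url, user ages) groups, standing in for a groupBy result.
    val grouped: Seq[(String, Iterable[Vector])] = Seq(
      ("a.example", ages(18.0, 19.5, 21.0, 23.0)),
      ("b.example", ages(45.0, 48.0, 52.0, 55.0))
    )
    val data: RDD[(String, Iterable[Vector])] = sc.parallelize(grouped)

    val model = fitPooled(data, k = 2)
    model.weights.zip(model.gaussians).foreach { case (w, g) =>
      println(s"weight=$w mean=${g.mu}")
    }
    sc.stop()
  }
}
```

The resulting model describes the pooled data only; it says nothing about how
each url's users are distributed.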

If you want to partition by url first, I would create one RDD per url
using `filter`, then run GMM on each of the filtered RDDs.
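
The per-url approach could look roughly like the sketch below. It assumes the
input is still in ungrouped (url, Vector) form, i.e. before any groupBy; an
already grouped RDD can be brought back to that shape with
flatMapValues(vs => vs). The names GmmPerKey and fitPerKey, the sample data,
k = 2 and the seed are hypothetical. Each distinct url triggers its own Spark
job on the driver, so this is probably only reasonable for a modest number of
urls:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object GmmPerKey {
  // Fit one GaussianMixtureModel per distinct key by filtering the parent RDD.
  def fitPerKey(data: RDD[(String, Vector)], k: Int): Map[String, GaussianMixtureModel] = {
    data.cache() // the parent RDD is scanned once per key
    val keys = data.keys.distinct().collect()
    val models = keys.map { key =>
      // A plain RDD[Vector] holding only this key's vectors.
      val subset: RDD[Vector] = data.filter(_._1 == key).values
      key -> new GaussianMixture().setK(k).setSeed(42L).run(subset)
    }.toMap
    data.unpersist()
    models
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("gmm-per-key").setMaster("local[2]"))

    // Made-up (url, user age) pairs.
    val pairs: Seq[(String, Vector)] =
      Seq(18.0, 19.5, 21.0, 40.0, 42.0, 44.5).map(a => ("a.example", Vectors.dense(a))) ++
      Seq(25.0, 27.0, 28.5, 60.0, 63.0, 65.5).map(a => ("b.example", Vectors.dense(a)))

    val models = fitPerKey(sc.parallelize(pairs), k = 2)
    models.foreach { case (url, m) =>
      println(s"$url -> means: ${m.gaussians.map(_.mu).mkString(", ")}")
    }
    sc.stop()
  }
}
```

The loop over keys runs on the driver, so no RDD is ever created inside another
RDD's transformation, which is what the mapValues attempt in the quoted message
would have required.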

On Tue, Aug 11, 2015 at 5:43 AM, Fabian Böhnlein <fabian.boehnlein@gmail.com> wrote:

> Hi everyone,
>
> I am trying to use mllib.clustering.GaussianMixture, but am blocked by the
> fact that the API only accepts RDD[Vector].
>
> Broadly speaking, I need to run the clustering on an RDD[(key,
> Iterable[Vector])], e.g. (fabricated):
>
> val WebsiteUserAgeRDD : RDD[(url, userAgeVector)]
>
> val ageClusterByUrl =
> WebsiteUserAgeRDD.groupBy(_.url).mapValues(GaussianMixture.setK(x).run)
>
> This obviously does not work, since the function passed to mapValues
> receives an Iterable[Vector] but run requires an RDD[Vector].
> As I see it, parallelizing this Iterable is not possible, since it would
> result in an RDD of RDDs?
>
> Does anyone have an idea how to cluster an RDD of (key, Iterable[Vector]),
> like the groupBy result above?
>
> Many thanks,
> Fabian
>
