spark-user mailing list archives

From Fabian Böhnlein <fabian.boehnl...@gmail.com>
Subject mllib on (key, Iterable[Vector])
Date Tue, 11 Aug 2015 12:43:29 GMT
Hi everyone,

I am trying to use mllib.clustering.GaussianMixture, but am blocked by the
fact that the API only accepts RDD[Vector].

Broadly speaking, I need to run the clustering on an RDD[(key,
Iterable[Vector])], e.g. (fabricated):

val websiteUserAgeRDD: RDD[(String, Vector)]  // (url, userAgeVector)

val ageClusterByUrl =
  websiteUserAgeRDD.groupByKey().mapValues(new GaussianMixture().setK(x).run)

This obviously does not work, as the function passed to mapValues receives
an Iterable[Vector], whereas run requires an RDD[Vector].
As I see it, parallelizing this Iterable is not possible, since it would
result in an RDD of RDDs?

Does anyone have an idea how to cluster an RDD of (key, Iterable[Vector]),
as in the groupBy result above?
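
For reference, one possible direction is sketched below. It is only a
minimal sketch, not a tested solution: it assumes the number of distinct
URLs is small enough to collect on the driver, and the names PerKeyGmm,
clusterPerKey, data and k are hypothetical. It skips the groupBy entirely
and filters the ungrouped (url, vector) RDD once per key, so that each call
to run gets a real RDD[Vector]:

```scala
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object PerKeyGmm {
  // Hypothetical helper: fits one Gaussian mixture model per key.
  // Assumption: there are few distinct keys, since they are collected to
  // the driver and one Spark job sequence is launched per key.
  def clusterPerKey(data: RDD[(String, Vector)], k: Int): Map[String, GaussianMixtureModel] = {
    data.cache() // the same RDD is scanned once per key
    val keys = data.keys.distinct().collect()
    keys.map { key =>
      // Filtering yields a proper RDD[Vector] for this key only.
      val vectorsForKey: RDD[Vector] = data.filter(_._1 == key).values
      key -> new GaussianMixture().setK(k).run(vectorsForKey)
    }.toMap
  }
}
```

With many keys this becomes slow, because the models are fitted one after
another on the driver's schedule rather than in parallel across keys.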

Many thanks,
Fabian
