spark-user mailing list archives

From Greg <g...@zooniverse.org>
Subject understanding use of "filter" function in Spark
Date Thu, 31 Jul 2014 11:00:15 GMT
Hi, suppose I have some data of the form
k,(x,y)
which are all numbers. For each key value (k) I want to do kmeans clustering
on all of its corresponding (x,y) points. Each key has few enough points that
I'm happy to use a traditional (non-MapReduce) kmeans implementation. The
challenge is that I have a lot of different keys, so I want to use
Hadoop/Spark to split the clustering up over multiple computers. With Hadoop
streaming and Python the reducer code would be easy:
import sys

pts = []
current_k = None
# Hadoop-streaming reducer: lines arrive sorted by key, one "k,x,y" per line
for line in sys.stdin:
  k, x, y = line.strip().split(',')
  if k == current_k:
    pts.append((float(x), float(y)))
  else:
    if current_k is not None:
      run_kmeans(pts)  # placeholder for a traditional kmeans call on this key's points
    current_k = k
    pts = [(float(x), float(y))]

(and obviously run kmeans for the final key as well)
How do I express this in Spark? The function f for filter is just a
predicate, and the one for reduceByKey needs to be an associative combiner
(all of the examples I've seen just add values), so neither seems to fit.
Ideally I'd like to be able to run this iteratively, changing the number of
clusters for kmeans (which is why Spark would be nice). Given how easy this
is to do in Hadoop, I feel like it should be easy in Spark as well, but I
can't figure out a way.
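Something like the sketch below is what I imagine (assuming groupByKey
gathers all of a key's points into one iterable, run_kmeans is the same local
clustering helper as above rather than a Spark API, and the file paths are
made up), but I don't know if this is the right way to use Spark:

from pyspark import SparkContext

sc = SparkContext(appName="per-key-kmeans")

# parse "k,x,y" lines into (k, (x, y)) pairs
pairs = sc.textFile("points.csv") \
          .map(lambda line: line.strip().split(',')) \
          .map(lambda t: (t[0], (float(t[1]), float(t[2]))))

# groupByKey collects every (x, y) for a key into one iterable, so a
# traditional (non-distributed) kmeans can run on it via mapValues
clusters = pairs.groupByKey() \
                .mapValues(lambda pts: run_kmeans(list(pts)))

clusters.saveAsTextFile("clusters_out")

Is groupByKey the right tool here, or is there something better when each
key's points comfortably fit in memory on one worker?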

thanks :)



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/understanding-use-of-filter-function-in-Spark-tp11037.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
