spark-user mailing list archives

From Luis Guerra <luispelay...@gmail.com>
Subject Apply function to all elements along each key
Date Tue, 20 Jan 2015 16:24:36 GMT
Hi all,

I would like to apply a function to all the elements of each key (assuming
a key-value RDD). For instance, imagine I have:

import numpy as np
a = np.array([[1, 'hola', 'adios'], [2, 'hi', 'bye'], [2, 'hello', 'goodbye']])
a = sc.parallelize(a)

Then I want to create a key-value RDD, using the first element of each row
as the key:

b = a.groupBy(lambda x: x[0])

And finally, I want to keep only those groups where the second element is
the same for every value under a key (or where the key has only one value).
So, for key 1, there is only one value ('hola'), whereas key 2 has two
different values ('hi', 'hello'). Therefore, only the values associated
with key 1 should be returned:

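def test(group):
    # Return the whole group if every row shares the same second element,
    # otherwise return an empty list so flatMap emits nothing for this key.
    x = group[0][1]
    for g in group[1:]:
        if g[1] != x:
            return []
    return group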

# kv is a (key, values) pair; list() materializes the grouped values
c = b.flatMap(lambda kv: test(list(kv[1])))
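
To make the expected output concrete, here is a minimal plain-Python sketch of the same per-key filter. It does not use Spark at all; `keep_uniform_groups` is only an illustrative name (not a Spark API), and the rows are assumed to be plain lists rather than NumPy arrays:

```python
from collections import defaultdict

def keep_uniform_groups(rows):
    """Return the rows whose key (row[0]) has a single distinct second element."""
    # Group rows by their first element, preserving first-seen key order.
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    kept = []
    for group in groups.values():
        # Keep the whole group only if all rows agree on the second element.
        if len({row[1] for row in group}) == 1:
            kept.extend(group)
    return kept

rows = [[1, 'hola', 'adios'], [2, 'hi', 'bye'], [2, 'hello', 'goodbye']]
result = keep_uniform_groups(rows)
print(result)
```

With the sample data above, only the row under key 1 should survive, since key 2 has two different second elements.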

Is there a more efficient way to do this?

Many thanks in advance,

Best
