spark-user mailing list archives

From eahlberg <>
Subject Optimizing cartesian product using keys
Date Mon, 29 Feb 2016 16:56:23 GMT

To avoid computing all possible combinations, I'm trying to group values
according to a certain key, and then compute the cartesian product of the
values for each key, i.e.:

Input:
[(k1, [v1]), (k1, [v2]), (k2, [v3])]

Desired output:
[(v1, v1), (v1, v2), (v2, v2), (v2, v1), (v3, v3)]
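For reference, this is a minimal local sketch (plain Python, no Spark) of the grouping-then-product I'm after; the dict grouping stands in for the keyed aggregation, and the small `data` list is just an illustration:

```python
import itertools
from collections import defaultdict

# Group values by key, then take the cartesian product within each group.
data = [('k1', 'v1'), ('k1', 'v2'), ('k2', 'v3')]
groups = defaultdict(list)
for k, v in data:
    groups[k].append(v)

pairs = [p for vals in groups.values()
         for p in itertools.product(vals, vals)]
# pairs: [('v1', 'v1'), ('v1', 'v2'), ('v2', 'v1'), ('v2', 'v2'), ('v3', 'v3')]
```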

Currently I'm doing it as follows (product is from Python itertools):

import itertools

input = sc.textFile('data.csv')
# Wrap each record in a singleton list, keyed by its key.
rdd = input.map(lambda x: (x.key, [x]))
# Concatenate the singleton lists so each key holds all of its values.
rdd2 = rdd.reduceByKey(lambda x, y: x + y)
# Cartesian product of each key's value list with itself.
rdd3 = rdd2.flatMapValues(lambda x: itertools.product(x, x))
# Drop the key, keep only the pairs.
result = rdd3.map(lambda x: x[1])

This works fine for very small files, but once a key's value list reaches
~1000 elements the computation freezes completely.
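The slowdown is at least consistent with the quadratic blowup of the per-key product: a single key with ~1000 values already expands to a million pairs, which a quick local check confirms:

```python
import itertools

# A single key with 1000 values expands to 1000 * 1000 = 1,000,000 pairs.
vals = list(range(1000))
n_pairs = sum(1 for _ in itertools.product(vals, vals))
print(n_pairs)  # 1000000
```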

Thanks in advance!
