spark-user mailing list archives

From Pengcheng YIN <pcyin1...@gmail.com>
Subject merge elements in a Spark RDD under custom condition
Date Mon, 01 Dec 2014 08:05:58 GMT
Hi Pro,
I want to merge elements in a Spark RDD when two elements satisfy a certain condition.

Suppose there is an RDD[Seq[Int]], where some of the Seq[Int] in this RDD contain overlapping elements. The task is to merge all overlapping Seq[Int] in this RDD and store the result into a new RDD.

For example, suppose RDD[Seq[Int]] = [[1,2,3], [2,4,5], [1,2], [7,8,9]], the result should
be [[1,2,3,4,5], [7,8,9]].

Since the RDD[Seq[Int]] is very large, I cannot do this in the driver program. Is it possible to get it done using distributed groupBy/map/reduce, etc.?
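(To pin down the merge semantics I have in mind: two sequences belong in the same group if they share any element, directly or transitively — i.e. connected components where each Seq[Int] links its elements together. A minimal local sketch in plain Python, ignoring Spark entirely; `merge_overlapping` is an illustrative name, not an existing API:)

```python
def merge_overlapping(seqs):
    """Merge lists that share at least one element (transitively),
    using union-find over the distinct elements."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for seq in seqs:
        for x in seq:
            parent.setdefault(x, x)
        # link every element of the sequence to its first element
        for x in seq[1:]:
            union(seq[0], x)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

print(merge_overlapping([[1, 2, 3], [2, 4, 5], [1, 2], [7, 8, 9]]))
# → [[1, 2, 3, 4, 5], [7, 8, 9]]
```

In Spark terms this is a connected-components problem, so presumably it needs an iterative job (e.g. repeatedly keying each group by its smallest element and reducing with `reduceByKey` until a fixpoint) rather than a single groupBy.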

Thanks in advance,

Pengcheng