spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From RJ Nowling <rnowl...@gmail.com>
Subject Grouping runs of elements in a RDD
Date Tue, 30 Jun 2015 18:01:57 GMT
Hi all,

I have a problem where I have a RDD of elements:

Item1 Item2 Item3 Item4 Item5 Item6 ...

and I want to run a function over them to decide which runs of elements to
group together:

[Item1 Item2] [Item3] [Item4 Item5 Item6] ...

Technically, I could use aggregate to do this, but I would have to use a
List of List of T which would produce a very large collection in memory.

Is there an easy way to accomplish this?  e.g.,, it would be nice to have a
version of aggregate where the combination function can return a complete
group that is added to the new RDD and an incomplete group which is passed
to the next call of the reduce function.

Thanks,
RJ

Mime
View raw message