spark-dev mailing list archives

From "Abhishek R. Singh" <abhis...@tetrationanalytics.com>
Subject Re: Grouping runs of elements in a RDD
Date Tue, 30 Jun 2015 19:07:20 GMT
Could you use a custom partitioner to preserve boundaries, such that all related tuples
end up on the same partition?
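
For instance (a rough sketch, assuming an earlier pass has already keyed each tuple
by a non-negative group id; the names and partition count are made up):

import org.apache.spark.Partitioner

// Routes records by a precomputed group id so that every element of a run
// lands in the same partition.
class GroupPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int =
    (key.asInstanceOf[Long] % numPartitions).toInt
}

// e.g. rddKeyedByGroupId.partitionBy(new GroupPartitioner(8))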

On Jun 30, 2015, at 12:00 PM, RJ Nowling <rnowling@gmail.com> wrote:

> Thanks, Reynold.  I still need to handle incomplete groups that straddle partition
> boundaries, so I need a two-pass approach. I came up with a somewhat hacky way to
> handle those, using partition indices and key-value pairs in a second pass after the first.
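> 
> Roughly, that looks like the following (a sketch with made-up names; the groupRuns
> helper is sketched a bit further down):
> 
> // Pass 1: group runs within each partition and key each run by
> // (partition index, position in partition), so that the first and last run
> // of every partition -- the potentially incomplete ones -- can be found later.
> val keyedRuns = rdd.mapPartitionsWithIndex { (idx, iter) =>
>   groupRuns(iter)(break).zipWithIndex.map { case (run, i) => ((idx, i), run) }
> }
> 
> // Pass 2: pull out the first and last run of each partition, merge the ones
> // that actually continue across a partition boundary, and union the merged
> // runs back with the untouched interior runs.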
> 
> OCaml's std library provides a function called group() that takes a break function that
> operates on pairs of successive elements.  It seems a similar approach could be used in
> Spark and would be more efficient than my approach with key-value pairs, since you know
> the ordering of the partitions.
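> 
> In Scala, the break-function idea might look like this (a sketch; break(prev, next)
> returns true when a new group should start between two successive elements):
> 
> // Groups consecutive elements of an iterator, starting a new group
> // whenever break(previous, current) returns true.
> def groupRuns[T](iter: Iterator[T])(break: (T, T) => Boolean): Iterator[List[T]] =
>   new Iterator[List[T]] {
>     private val in = iter.buffered
>     def hasNext: Boolean = in.hasNext
>     def next(): List[T] = {
>       var prev = in.next()
>       val run = collection.mutable.ListBuffer(prev)
>       while (in.hasNext && !break(prev, in.head)) {
>         prev = in.next()
>         run += prev
>       }
>       run.toList
>     }
>   }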
> 
> Has this need been expressed by others?  
> 
> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <rxin@databricks.com> wrote:
> Try mapPartitions, which gives you an iterator, and you can produce an iterator back.
> 
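> For example, using a groupRuns helper like the one sketched above (this alone
> still splits runs that cross partition boundaries, hence the second pass):
> 
> // break is the made-up predicate deciding where one run ends.
> val runs = rdd.mapPartitions(iter => groupRuns(iter)(break))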
> 
> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rnowling@gmail.com> wrote:
> Hi all,
> 
> I have a problem where I have an RDD of elements:
> 
> Item1 Item2 Item3 Item4 Item5 Item6 ...
> 
> and I want to run a function over them to decide which runs of elements to group together:
> 
> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
> 
> Technically, I could use aggregate to do this, but I would have to use a List of List
> of T, which would produce a very large collection in memory.
> 
> Is there an easy way to accomplish this?  E.g., it would be nice to have a version of
> aggregate where the combination function can return a complete group that is added to
> the new RDD and an incomplete group that is passed to the next call of the reduce function.
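> 
> Hypothetically, such a primitive might have a shape like this (a made-up
> signature, not an existing Spark API):
> 
> // On RDD[T]: seqOp consumes one element, emitting any groups it completes
> // plus the still-open group; mergeBoundary joins open groups across
> // partition boundaries.
> def aggregateRuns[G](zero: G)(seqOp: (G, T) => (Seq[G], G),
>                               mergeBoundary: (G, G) => G): RDD[G]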
> 
> Thanks,
> RJ
> 
> 

