spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From RJ Nowling <rnowl...@gmail.com>
Subject Re: Grouping runs of elements in a RDD
Date Tue, 30 Jun 2015 19:21:14 GMT
That's an interesting idea!  I hadn't considered that.  However, looking at
the Partitioner interface, I would need to know from looking at a single
key which doesn't fit my case, unfortunately.  For my case, I need to
compare successive pairs of keys.  (I'm trying to re-join lines that were
split prematurely.)

On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh <
abhishsi@tetrationanalytics.com> wrote:

> could you use a custom partitioner to preserve boundaries such that all
> related tuples end up on the same partition?
>
> On Jun 30, 2015, at 12:00 PM, RJ Nowling <rnowling@gmail.com> wrote:
>
> Thanks, Reynold.  I still need to handle incomplete groups that fall
> between partition boundaries. So, I need a two-pass approach. I came up
> with a somewhat hacky way to handle those using the partition indices and
> key-value pairs as a second pass after the first.
>
> OCaml's std library provides a function called group() that takes a break
> function that operators on pairs of successive elements.  It seems a
> similar approach could be used in Spark and would be more efficient than my
> approach with key-value pairs since you know the ordering of the partitions.
>
> Has this need been expressed by others?
>
> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <rxin@databricks.com> wrote:
>
>> Try mapPartitions, which gives you an iterator, and you can produce an
>> iterator back.
>>
>>
>> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rnowling@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have a problem where I have a RDD of elements:
>>>
>>> Item1 Item2 Item3 Item4 Item5 Item6 ...
>>>
>>> and I want to run a function over them to decide which runs of elements
>>> to group together:
>>>
>>> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>>>
>>> Technically, I could use aggregate to do this, but I would have to use a
>>> List of List of T which would produce a very large collection in memory.
>>>
>>> Is there an easy way to accomplish this?  e.g.,, it would be nice to
>>> have a version of aggregate where the combination function can return a
>>> complete group that is added to the new RDD and an incomplete group which
>>> is passed to the next call of the reduce function.
>>>
>>> Thanks,
>>> RJ
>>>
>>
>>
>
>

Mime
View raw message