spark-user mailing list archives

From Akshat Aranya <aara...@gmail.com>
Subject Re: partitioned groupBy
Date Wed, 17 Sep 2014 20:04:09 GMT
Patrick,

If I understand this correctly, I won't be able to do this in the closure
provided to mapPartitions() because that's going to be stateless, in the
sense that a hash map that I create within the closure would only be useful
for one call of MapPartitionsRDD.compute().  I guess I would need to
override mapPartitions() directly within my RDD.  Right?
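
For concreteness, here is a rough, untested sketch of the closure-based
version I mean (the toy data, object name, and local[2] master are just
illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

object InvertWithinPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("invert-within-partitions").setMaster("local[2]"))

    // Toy data laid out like the example in my original message below:
    // two partitions of key -> values pairs.
    val data = Seq(
      ("K1", Seq("V1", "V2")), ("K2", Seq("V2")),  // partition 0
      ("K3", Seq("V1")), ("K4", Seq("V3")))        // partition 1
    val rdd = sc.parallelize(data, numSlices = 2)

    // Invert key -> values to value -> keys within each partition only.
    // The hash map is created inside the closure, so it exists for exactly
    // one call to compute() on one partition; no shuffle is involved.
    val inverted = rdd.mapPartitions { iter =>
      val inverse = mutable.HashMap.empty[String, mutable.ListBuffer[String]]
      for ((k, vs) <- iter; v <- vs)
        inverse.getOrElseUpdate(v, mutable.ListBuffer.empty[String]) += k
      inverse.iterator.map { case (v, ks) => (v, ks.toList) }
    }

    inverted.glom().collect().foreach(part => println(part.mkString(", ")))
    sc.stop()
  }
}

This is the stateless version: the hash map only lives for one compute() of
one partition, which is the limitation I'm describing above.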

On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell <pwendell@gmail.com> wrote:

> If each partition can fit in memory, you can do this using
> mapPartitions and then building an inverse mapping within each
> partition. You'd need to construct a hash map within each partition
> yourself.
>
> On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya <aaranya@gmail.com> wrote:
> > I have a use case where my RDD is set up like this:
> >
> > Partition 0:
> > K1 -> [V1, V2]
> > K2 -> [V2]
> >
> > Partition 1:
> > K3 -> [V1]
> > K4 -> [V3]
> >
> > I want to invert this RDD, but only within a partition, so that the
> > operation does not require a shuffle.  It doesn't matter if the partitions
> > of the inverted RDD have non-unique keys across the partitions, for example:
> >
> > Partition 0:
> > V1 -> [K1]
> > V2 -> [K1, K2]
> >
> > Partition 1:
> > V1 -> [K3]
> > V3 -> [K4]
> >
> > Is there a way to do only a per-partition groupBy, instead of shuffling the
> > entire data?
> >
>
