spark-user mailing list archives

From Rajat Kumar <rajatkumar10...@gmail.com>
Subject Re: spark rdd grouping
Date Wed, 02 Dec 2015 07:44:46 GMT
What if I don't need an aggregate function, only something like
groupByKeyLocally() followed by a map transformation?

Will reduceByKeyLocally help here? Or is there a workaround, given that
groupByKey is not local and works globally across all partitions?

Thanks
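
One possible workaround for the question above, as a minimal sketch rather than a confirmed answer from this thread: group the records inside each partition with mapPartitions and pass preservesPartitioning = true, so no shuffle stage is introduced. The helper name groupWithinPartition is hypothetical. The sketch assumes that all records for a key already sit in one partition, as described in the original question, and that a single partition's groups fit in memory.

```scala
import scala.collection.mutable

// Hypothetical helper: groups the records of ONE partition by key, in memory.
// Assumes every record for a given key is already in this partition.
def groupWithinPartition[K, V](records: Iterator[(K, V)]): Iterator[(K, Seq[V])] = {
  // LinkedHashMap keeps keys in first-seen order, so the output order is predictable.
  val groups = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
  records.foreach { case (k, v) =>
    groups.getOrElseUpdate(k, mutable.ArrayBuffer.empty[V]) += v
  }
  groups.iterator.map { case (k, vs) => (k, vs.toSeq) }
}

// With Spark on the classpath and a pair RDD named `rdd` (an assumption here),
// the helper would be applied per partition, without a shuffle:
//   val grouped = rdd.mapPartitions(it => groupWithinPartition(it),
//                                   preservesPartitioning = true)
```

If the same key can appear in more than one partition, this produces one group per partition for that key, so it is only equivalent to groupByKey when the co-location assumption holds.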

On Tue, Dec 1, 2015 at 5:20 PM, ayan guha <guha.ayan@gmail.com> wrote:

> I believe reduceByKeyLocally was introduced for this purpose.
>
> On Tue, Dec 1, 2015 at 10:21 PM, Jacek Laskowski <jacek@japila.pl> wrote:
>
>> Hi Rajat,
>>
>> My quick test showed that groupBy preserves the partitions:
>>
>> scala> sc.parallelize(Seq(0,0,0,0,1,1,1,1), 2)
>>   .map((_, 1))
>>   .mapPartitionsWithIndex { case (idx, iter) =>
>>     val s = iter.toSeq
>>     println(idx + " with " + s.size + " elements: " + s)
>>     s.toIterator
>>   }
>>   .groupBy(_._1)
>>   .mapPartitionsWithIndex { case (idx, iter) =>
>>     val s = iter.toSeq
>>     println(idx + " with " + s.size + " elements: " + s)
>>     s.toIterator
>>   }
>>   .collect
>>
>> 1 with 4 elements: Stream((1,1), (1,1), (1,1), (1,1))
>> 0 with 4 elements: Stream((0,1), (0,1), (0,1), (0,1))
>>
>> 0 with 1 elements: Stream((0,CompactBuffer((0,1), (0,1), (0,1), (0,1))))
>> 1 with 1 elements: Stream((1,CompactBuffer((1,1), (1,1), (1,1), (1,1))))
>>
>> Am I missing anything?
>>
>> Regards,
>> Jacek
>>
>> --
>> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
>> http://blog.jaceklaskowski.pl
>> Mastering Spark
>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
>> Follow me at https://twitter.com/jaceklaskowski
>> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>>
>>
>> On Tue, Dec 1, 2015 at 2:46 AM, Rajat Kumar <rajatkumar10885@gmail.com> wrote:
>> > Hi,
>> >
>> > I have a JavaPairRDD<K,V> rdd1. I want to group rdd1 by key but preserve
>> > the partitions of the original RDD to avoid a shuffle, since I know that
>> > all records with the same key are already in the same partition.
>> >
>> > The pair RDD is constructed using the Kafka streaming low-level consumer,
>> > which already puts all records with the same key in the same partition.
>> > Can I group them together while avoiding a shuffle?
>> >
>> > Thanks
>> >
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
