spark-user mailing list archives
From Jacek Laskowski <ja...@japila.pl>
Subject Re: spark rdd grouping
Date Tue, 01 Dec 2015 11:21:40 GMT
Hi Rajat,

My quick test showed that groupBy preserves the partitions:

scala> sc.parallelize(Seq(0, 0, 0, 0, 1, 1, 1, 1), 2).map((_, 1)).
         mapPartitionsWithIndex { case (idx, iter) =>
           val s = iter.toSeq; println(idx + " with " + s.size + " elements: " + s); s.toIterator
         }.groupBy(_._1).
         mapPartitionsWithIndex { case (idx, iter) =>
           val s = iter.toSeq; println(idx + " with " + s.size + " elements: " + s); s.toIterator
         }.collect

1 with 4 elements: Stream((1,1), (1,1), (1,1), (1,1))
0 with 4 elements: Stream((0,1), (0,1), (0,1), (0,1))

0 with 1 elements: Stream((0,CompactBuffer((0,1), (0,1), (0,1), (0,1))))
1 with 1 elements: Stream((1,CompactBuffer((1,1), (1,1), (1,1), (1,1))))
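
As a side note, one way to check what groupBy actually did is to inspect
the result's partitioner and lineage (a quick REPL sketch; output elided):

scala> val grouped = sc.parallelize(Seq(0, 0, 0, 0, 1, 1, 1, 1), 2).map((_, 1)).groupBy(_._1)
scala> grouped.partitioner   // Some(HashPartitioner) -- groupBy hash-partitions by key
scala> grouped.toDebugString // the lineage shows a ShuffledRDD for the groupBy step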

Am I missing anything?
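
If the goal is to avoid the shuffle entirely, grouping can also be done
inside each partition with mapPartitions. A minimal sketch, assuming (as in
your setup) that every key's records live in exactly one partition;
groupedNoShuffle is just an illustrative name:

scala> val rdd1 = sc.parallelize(Seq(0, 0, 0, 0, 1, 1, 1, 1), 2).map((_, 1))

scala> val groupedNoShuffle = rdd1.mapPartitions(
         // group (key, value) pairs within this partition only -- no shuffle
         iter => iter.toList.groupBy(_._1).iterator.map { case (k, kvs) => (k, kvs.map(_._2)) },
         preservesPartitioning = true) // keeps any existing partitioner on the result

scala> groupedNoShuffle.collect // both groups come back, with no ShuffledRDD in the lineage

Note this buffers each partition in memory while grouping, which is fine as
long as a single partition fits on one executor.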

Regards,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Tue, Dec 1, 2015 at 2:46 AM, Rajat Kumar <rajatkumar10885@gmail.com> wrote:
> Hi
>
> I have a JavaPairRDD<K,V> rdd1. I want to group rdd1 by key but preserve
> the partitions of the original RDD, to avoid a shuffle, since I know all
> records with the same key are already in the same partition.
>
> The PairRDD is constructed using the Kafka streaming low-level consumer,
> which puts all records with the same key in the same partition. Can I
> group them together while avoiding a shuffle?
>
> Thanks
>

