spark-user mailing list archives

From Shushant Arora <shushantaror...@gmail.com>
Subject Re: spark rdd grouping
Date Fri, 25 Dec 2015 08:29:39 GMT
Hi

I have created a JIRA issue for this feature:
https://issues.apache.org/jira/browse/SPARK-12524
Please vote for it if you think it is needed. I would like to implement this
feature.

Thanks
Shushant

On Wed, Dec 2, 2015 at 1:14 PM, Rajat Kumar <rajatkumar10885@gmail.com>
wrote:

> What if I don't need an aggregate function at all, only something like
> groupByKeyLocally() followed by a map transformation?
>
> Will reduceByKeyLocally help here? Or is there any workaround, given that
> groupByKey is not local and works globally across all partitions?
>
> Thanks
>
> On Tue, Dec 1, 2015 at 5:20 PM, ayan guha <guha.ayan@gmail.com> wrote:
>
>> I believe reduceByKeyLocally was introduced for this purpose.
>>
>> On Tue, Dec 1, 2015 at 10:21 PM, Jacek Laskowski <jacek@japila.pl> wrote:
>>
>>> Hi Rajat,
>>>
>>> My quick test showed that groupBy preserves the partitions:
>>>
>>> scala>
>>> sc.parallelize(Seq(0,0,0,0,1,1,1,1), 2).map((_, 1)).mapPartitionsWithIndex { case (idx, iter) =>
>>>   val s = iter.toSeq
>>>   println(idx + " with " + s.size + " elements: " + s)
>>>   s.toIterator
>>> }.groupBy(_._1).mapPartitionsWithIndex { case (idx, iter) =>
>>>   val s = iter.toSeq
>>>   println(idx + " with " + s.size + " elements: " + s)
>>>   s.toIterator
>>> }.collect
>>>
>>> 1 with 4 elements: Stream((1,1), (1,1), (1,1), (1,1))
>>> 0 with 4 elements: Stream((0,1), (0,1), (0,1), (0,1))
>>>
>>> 0 with 1 elements: Stream((0,CompactBuffer((0,1), (0,1), (0,1), (0,1))))
>>> 1 with 1 elements: Stream((1,CompactBuffer((1,1), (1,1), (1,1), (1,1))))
>>>
>>> Am I missing anything?
>>>
>>> Regards,
>>> Jacek
>>>
>>> --
>>> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
>>> http://blog.jaceklaskowski.pl
>>> Mastering Spark
>>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
>>> Follow me at https://twitter.com/jaceklaskowski
>>> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>>>
>>>
>>> On Tue, Dec 1, 2015 at 2:46 AM, Rajat Kumar <rajatkumar10885@gmail.com>
>>> wrote:
>>> > Hi
>>> >
>>> > I have a JavaPairRDD<K,V> rdd1. I want to group rdd1 by key but preserve
>>> > the partitions of the original RDD to avoid a shuffle, since I know that
>>> > all records with the same key are already in the same partition.
>>> >
>>> > The pair RDD is constructed using the Kafka streaming low-level consumer,
>>> > which already has all records with the same key in the same partition.
>>> > Can I group them together while avoiding a shuffle?
>>> >
>>> > Thanks
>>> >
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
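
For the question raised in this thread (grouping records by key inside each
partition without a shuffle), one possible approach is to do the grouping
inside mapPartitions. The sketch below is only an illustration under stated
assumptions: groupWithinPartition is a hypothetical helper, not a Spark API,
and it buffers a whole partition in memory. To stay runnable without a
cluster, it uses a plain Seq of iterators as a stand-in for an RDD's
partitions; the corresponding Spark call is shown in a comment. Note also that
reduceByKeyLocally returns its result to the driver as a Map rather than
producing a per-partition RDD.

```scala
import scala.collection.mutable

// Hypothetical helper (not part of Spark): groups the records of ONE partition by key.
// Assumes all records for a given key already live in this partition and that the
// partition fits in memory, because every value is buffered before anything is emitted.
def groupWithinPartition[K, V](records: Iterator[(K, V)]): Iterator[(K, Vector[V])] = {
  // LinkedHashMap keeps keys in first-seen order, which makes the output deterministic.
  val groups = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
  records.foreach { case (k, v) =>
    groups.getOrElseUpdate(k, mutable.ArrayBuffer.empty[V]) += v
  }
  groups.iterator.map { case (k, vs) => (k, vs.toVector) }
}

// Stand-in for an RDD with two partitions (illustrative data only).
val partitions = Seq(
  Iterator((0, "a"), (0, "b"), (2, "c")),
  Iterator((1, "d"), (1, "e"))
)

// Each partition is grouped independently; no records move between partitions.
val grouped = partitions.map(p => groupWithinPartition(p).toList)

// With a real pair RDD the equivalent call would be:
//   rdd.mapPartitions(it => groupWithinPartition(it), preservesPartitioning = true)
```

Because the grouping never looks outside a single partition, the result is
only correct when the upstream source really does keep each key in one
partition, as the original poster says the Kafka consumer does.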
