spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Is shuffle "stable"?
Date Sat, 14 Jun 2014 22:55:02 GMT
Actually, the order is not guaranteed; the only guarantee is which keys end up in each partition. Reducers may
fetch data from map tasks in an arbitrary order, depending on which ones are available first.
If you’d like a specific order, you should sort each partition. Here you might be getting
it because each partition only ends up having one element, and collect() does return the partitions
in order.
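
A minimal sketch of "sort each partition", assuming Spark 1.3 or later (where pair-RDD methods such as partitionBy are available without extra imports) and a local master. The object name SortWithinPartitions and the helper sortPartition are made up for illustration and are not part of Spark; the example reuses the input from the question below and sorts by key, then by value.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SortWithinPartitions {
  // Hypothetical helper: sort the pairs of one partition by key, then by value.
  // It materializes the whole partition in memory, so it only suits partitions that fit.
  def sortPartition(pairs: Iterator[(Int, Int)]): Iterator[(Int, Int)] =
    pairs.toVector.sorted.iterator

  def run(sc: SparkContext): Array[(Int, Int)] =
    sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1), 3)
      .partitionBy(new HashPartitioner(3))
      // Sorting never moves a key to another partition, so the partitioner can be kept.
      .mapPartitions(sortPartition, preservesPartitioning = true)
      .collect()

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 3 threads, purely for demonstration.
    val conf = new SparkConf().setAppName("sort-within-partitions").setMaster("local[3]")
    val sc = new SparkContext(conf)
    try println(run(sc).mkString(", "))
    finally sc.stop()
  }
}
```

All three records share key 0 and therefore land in the same partition, so after the explicit sort the result should be Array((0,1), (0,2), (0,3)) no matter in which order the reducer fetched the map outputs.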

Matei

On Jun 14, 2014, at 12:14 PM, Daniel Darabos <daniel.darabos@lynxanalytics.com> wrote:

> What I mean is, let's say I run this:
> 
> sc.parallelize(Seq(0->3, 0->2, 0->1), 3).partitionBy(new HashPartitioner(3)).collect
> 
> Will the result always be Array((0,3), (0,2), (0,1))? Or could I possibly get a different
> order?
> 
> I'm pretty sure the shuffle files are taken in the order of the source partitions...
> But after much search, and the discussion on http://stackoverflow.com/questions/24206660/does-groupbykey-in-spark-preserve-the-original-order
> I still can't find the code that does this.
> 
> Thanks!

