spark-user mailing list archives

From Sung Hwan Chung <coded...@cs.stanford.edu>
Subject coalesce with shuffle or repartition is not necessarily fault-tolerant
Date Wed, 08 Oct 2014 22:42:29 GMT
I noticed that repartition can result in a non-deterministic lineage,
because the order of the rows it produces can change between computations.

So for instance, if you do things like:

val data = read(...)
val k = data.repartition(5)
val h = k.repartition(5)

It seems that this results in a different ordering of rows for 'k' each
time it is computed.
And because of this different ordering, 'h' can even end up with different
partitions, because 'repartition' distributes rows through a random number
generator seeded with the partition 'index', so a recomputed partition of
'h' (for example after a failure) may not contain the same rows as before.
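
To make the ordering dependence concrete, here is a minimal sketch in plain
Scala, with no Spark involved. It is a simplified model of the behaviour
described above, not Spark's actual code: 'RepartitionSketch' and 'assign'
are hypothetical names, and the assumption is that each row's target
partition is a start position (drawn from a generator seeded with the input
partition's index) plus the row's position in the iterator.

```scala
import scala.util.Random

object RepartitionSketch {
  // Hypothetical helper, not Spark's API. Assigns every row of one input
  // partition to a target partition. The start position comes from a
  // generator seeded with the input partition's index and then advances by
  // one per row, so a row's target depends on where it sits in the iterator.
  def assign[T](index: Int, rows: Seq[T], numPartitions: Int): Map[Int, Seq[T]] = {
    val start = new Random(index).nextInt(numPartitions)
    rows.zipWithIndex
      .map { case (row, i) => ((start + 1 + i) % numPartitions, row) }
      .groupBy(_._1)
      .map { case (target, pairs) => target -> pairs.map(_._2) }
  }

  def main(args: Array[String]): Unit = {
    val firstRun   = Seq("a", "b", "c", "d", "e")
    val recomputed = Seq("b", "a", "c", "d", "e") // same rows, different order

    // Same seed and same rows, but "a" and "b" swap target partitions.
    println(assign(0, firstRun, 5))
    println(assign(0, recomputed, 5))
  }
}
```

Under this model the assignment is fully deterministic for a fixed input
order, which is why the problem only shows up when the upstream order of
'k' changes between computations.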
