spark-user mailing list archives

From Sung Hwan Chung <coded...@cs.stanford.edu>
Subject Re: coalesce with shuffle or repartition is not necessarily fault-tolerant
Date Thu, 09 Oct 2014 07:11:49 GMT
Are there a large number of non-deterministic lineage operators?

This seems like a pretty big caveat, particularly for casual programmers
who expect consistent semantics between Spark and Scala.

E.g., making sure that there's no randomness whatsoever in RDD
transformations seems critical. Additionally, shuffling operators would
usually change the order of rows, etc.

These are very easy errors to make, and if you tend to cache things, some
errors won't be detected until fault-tolerance is triggered. It would be
very helpful for programmers to have a big warning list of not-to-dos
within RDD transformations.
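
To make the concern concrete, here is a minimal sketch in plain Scala (no
Spark dependency). It models the behavior described in this thread: a random
number generator seeded with the source partition index, drawing one target
partition per row in iteration order. The object name, `assignTargets`, and
the sample rows are hypothetical, and this is a simplified model rather than
Spark's actual implementation, which may differ in detail.

```scala
import scala.util.Random

object RepartitionOrderSketch {
  // Hypothetical model: one RNG per source partition, seeded with the
  // partition index, picks a target partition for each row as it is iterated.
  def assignTargets(sourceIndex: Int, rows: Seq[String], numTargets: Int): Map[String, Int] = {
    val rng = new Random(sourceIndex)
    rows.map(row => row -> rng.nextInt(numTargets)).toMap
  }

  def main(args: Array[String]): Unit = {
    // Same rows, same seed; only the iteration order differs, as it might
    // after a lost partition is recomputed.
    val firstRun = assignTargets(0, Seq("A", "C", "D"), 5)
    val recomputed = assignTargets(0, Seq("C", "A", "D"), 5)
    println(s"first run:  $firstRun")
    println(s"recomputed: $recomputed")
  }
}
```

The seed fixes the sequence of draws, not which row receives which draw, so a
row can land in a different target partition when the input order changes.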

On Wed, Oct 8, 2014 at 11:57 PM, Sean Owen <sowen@cloudera.com> wrote:

> Yes, I think this is another operation that is not deterministic even for
> the same RDD. If a partition is lost and recalculated the ordering can
> be different in the partition. Sorting the RDD makes the ordering
> deterministic.
>
> On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung
> <codedeft@cs.stanford.edu> wrote:
> > Let's say you have some rows in a dataset (say X partitions initially).
> >
> > A
> > B
> > C
> > D
> > E
> > .
> > .
> > .
> > .
> >
> >
> > You repartition to Y > X, then it seems that any of the following could be
> > valid:
> >
> > partition 1             partition 2
> > --------------------------
> > A                          B
> > C                          E
> > D                          .
> > --------------------------
> > C                          E
> > A                          B
> > D                          .
> > --------------------------
> > D                          B
> > C                          E
> > A
> >
> > etc. etc.
> >
> > I.e., although each partition will have the same unordered set, the rows'
> > orders will change from call to call.
> >
> > Now, because row ordering can change from call to call, if you do any
> > operation that depends on the order of items you saw, then lineage is no
> > longer deterministic. For example, it seems that the repartition call itself
> > is a row-order dependent call, because it creates a random number generator
> > with the partition index as the seed, and then calls nextInt as you go
> > through the rows.
> >
> >
> > On Wed, Oct 8, 2014 at 10:14 PM, Patrick Wendell <pwendell@gmail.com> wrote:
> >>
> >> IIRC - the random is seeded with the index, so it will always produce
> >> the same result for the same index. Maybe I don't totally follow
> >> though. Could you give a small example of how this might change the
> >> RDD ordering in a way that you don't expect? In general repartition()
> >> will not preserve the ordering of an RDD.
> >>
> >> On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung
> >> <codedeft@cs.stanford.edu> wrote:
> >> > I noticed that repartition will result in non-deterministic lineage because
> >> > it'll result in changed orders for rows.
> >> >
> >> > So for instance, if you do things like:
> >> >
> >> > val data = read(...)
> >> > val k = data.repartition(5)
> >> > val h = k.repartition(5)
> >> >
> >> > It seems that this results in a different ordering of rows for 'k' each
> >> > time you call it.
> >> > And because of this different ordering, 'h' will even result in different
> >> > partitions, because 'repartition' distributes rows through a random number
> >> > generator with the 'index' as the seed.
> >
> >
>
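
Sean Owen's point above, that sorting makes the ordering deterministic, can
be sketched with the same kind of simplified plain-Scala model (no Spark
dependency). The object name and `assignSorted` are hypothetical, and the
per-row `nextInt` draw follows the description in this thread rather than
verified Spark source.

```scala
import scala.util.Random

object SortedAssignmentSketch {
  // Hypothetical helper: sort the rows before drawing, so each random draw
  // is paired with the same row no matter how the input happened to arrive.
  def assignSorted(sourceIndex: Int, rows: Seq[String], numTargets: Int): Map[String, Int] = {
    val rng = new Random(sourceIndex)
    rows.sorted.map(row => row -> rng.nextInt(numTargets)).toMap
  }

  def main(args: Array[String]): Unit = {
    val firstRun = assignSorted(0, Seq("A", "C", "D"), 5)
    val recomputed = assignSorted(0, Seq("C", "A", "D"), 5)
    println(firstRun == recomputed) // prints "true"
  }
}
```

The trade-off is the cost of the sort, and it only helps if the sort key
gives a total order over the rows; rows that compare equal can still swap.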
