spark-user mailing list archives

From Sung Hwan Chung <>
Subject Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?
Date Wed, 08 Oct 2014 22:32:02 GMT
One thing I didn't mention is that we actually do data.repartition
beforehand, with a shuffle.

I found that this can actually introduce randomness into the lineage steps:
the data get shuffled to different partitions, which leads to inconsistent
behavior if your algorithm depends on the order in which the data rows
appear, because on recomputation the rows can appear in a different order.

If you want to guarantee fault-tolerance, you can't have any randomness
whatsoever in lineage steps, and repartition violates that (depending on
what you do with the data).
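
A minimal sketch of one way to remove that order dependence, assuming each
row carries a stable, unique key (here the row value itself, purely for
illustration): partition by key with a HashPartitioner instead of calling
repartition, then sort inside each partition so that a recomputed partition
yields its rows in the same order. The object name, the helper stableOrder,
and the sample data are hypothetical; this is not what the job in this thread
does.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions

object DeterministicOrder {
  // Hypothetical helper: sort the rows of one partition by key so the order
  // does not depend on the order in which shuffle blocks were fetched.
  // Note: this materializes the whole partition in memory.
  def stableOrder(rows: Iterator[(Int, Int)]): Iterator[(Int, Int)] =
    rows.toArray.sortBy(_._1).iterator

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("deterministic-order").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the real input; the key is assumed to be unique per row.
      val keyed = sc.parallelize(1 to 1000, 4).map(x => (x, x * 2))

      // The target partition now depends only on the key's hash.
      val partitioned = keyed.partitionBy(new HashPartitioner(8))

      // Fix the row order inside each partition.
      val ordered = partitioned.mapPartitions(stableOrder, preservesPartitioning = true)

      println(ordered.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Whether this is enough depends on what the later steps do with the data; any
other source of randomness in the lineage would still need a fixed seed or a
checkpoint.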

On Wed, Oct 8, 2014 at 12:24 PM, Sung Hwan Chung <> wrote:

> There is no circular dependency. It's simply dropping references to
> previous RDDs because there is no need for them.
> I wonder if that messes things up internally for Spark, though, due to
> losing references to intermediate RDDs.
> On Oct 8, 2014, at 12:13 PM, Akshat Aranya <> wrote:
> Using a var for RDDs in this way is not going to work. In this example,
> would create an RDD that depends on tx2, but then soon after
> that, you change what tx2 means, so you would end up having a circular
> dependency.
> On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung <> wrote:
>> My job is not fault-tolerant (e.g., when there's a fetch failure or
>> something).
>> The lineage of the RDDs is constantly updated every iteration. However, I
>> think that when there's a failure, the lineage information is not being
>> correctly reapplied.
>> It goes something like this:
>> val rawRDD = read(...)
>> val repartRDD = rawRDD.repartition(X)
>> val tx1 =
>> var tx2 =
>> while (...) {
>>   tx2 =
>> }
>> Is there any way to monitor an RDD's lineage, maybe even including? I want
>> to make sure that there are no unexpected things happening.
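
On the lineage question quoted above: RDD.toDebugString prints an RDD's
lineage, and RDD.dependencies exposes the parent RDDs, so the chain can be
walked by hand. A minimal sketch, assuming a loop that reassigns a var the way
the quoted code does. The map step, the sample data, and the helper
lineageDepth are hypothetical stand-ins, since the real transformations are
not shown in the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageInspection {
  // Hypothetical helper: length of the longest dependency chain ending at rdd.
  // Plain recursion, so a very deep lineage could overflow the stack.
  def lineageDepth(rdd: RDD[_]): Int = {
    val parents = rdd.dependencies.map(_.rdd)
    if (parents.isEmpty) 1 else 1 + parents.map(p => lineageDepth(p)).max
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lineage-inspection").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the initial value of tx2 in the quoted code.
      var tx2: RDD[Int] = sc.parallelize(1 to 100, 4)
      for (i <- 1 to 3) {
        // Hypothetical transformation; the new RDD keeps a reference to the
        // previous one even though the var no longer points at it.
        tx2 = tx2.map(_ + 1)
        println(s"iteration $i: depth ${lineageDepth(tx2)}")
        println(tx2.toDebugString)
      }
    } finally {
      sc.stop()
    }
  }
}
```

In this sketch the depth grows by one per iteration, which suggests that
reassigning the var extends the chain rather than creating a cycle.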
