spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrian Tanase <atan...@adobe.com>
Subject Re: Lost leader exception in Kafka Direct for Streaming
Date Thu, 01 Oct 2015 14:42:27 GMT
This also happened to me in extreme recovery scenarios – e.g. Killing 4 out of a  7 machine
cluster.

I’d put my money on recovering from an out of sync replica, although I haven’t done extensive
testing around it.

-adrian

From: Cody Koeninger
Date: Thursday, October 1, 2015 at 5:18 PM
To: swetha
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Lost leader exception in Kafka Direct for Streaming

Did you check you kafka broker logs to see what was going on during that time?

The direct stream will handle normal leader loss / rebalance by retrying tasks.

But the exception you got indicates that something with kafka was wrong, such that offsets
were being re-used.

ie. your job already processed up through beginning offset 15027734702

but when asking kafka for the highest available offsets, it returns ending offset 15027725493

which is lower, in other words kafka lost messages.  This might happen because you lost a
leader and recovered from a replica that wasn't in sync, or someone manually screwed up a
topic, or ... ?

If you really want to just blindly "recover" from this situation (even though something is
probably wrong with your data), the most straightforward thing to do is monitor and restart
your job.




On Wed, Sep 30, 2015 at 4:31 PM, swetha <swethakasireddy@gmail.com<mailto:swethakasireddy@gmail.com>>
wrote:

Hi,

I see this sometimes in our Kafka Direct approach in our Streaming job. How
do we make sure that the job recovers from such errors and works normally
thereafter?

15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition
19,  sleeping for 200ms
15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition
5,  sleeping for 200ms

Followed by every task failing with something like this:

15/09/30 05:26:20 ERROR Executor: Exception in task 4.0 in stage 84281.0
(TID 818804)
kafka.common.NotLeaderForPartitionException

And:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
in stage 84958.0 failed 4 times, most recent failure: Lost task 15.3 in
stage 84958.0 (TID 819461, 10.227.68.102): java.lang.AssertionError:
assertion failed: Beginning offset 15027734702 is after the ending offset
15027725493 for topic hubble_stream partition 12. You either provided an
invalid fromOffset, or the Kafka topic has been damaged


Thanks,
Swetha




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lost-leader-exception-in-Kafka-Direct-for-Streaming-tp24891.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org<mailto:user-unsubscribe@spark.apache.org>
For additional commands, e-mail: user-help@spark.apache.org<mailto:user-help@spark.apache.org>


Mime
View raw message