kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hargett, Phil" <phil.harg...@mirror-image.com>
Subject RE: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress_
Date Mon, 05 Aug 2013 14:28:05 GMT
And today I've attached a better patch (KAFKA-989-failed-to-find-leader-patch2.patch) to KAFKA-989.

When in doubt, add another lock. ;)

https://issues.apache.org/jira/browse/KAFKA-989?focusedCommentId=13729537&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13729537
________________________________________
From: Hargett, Phil
Sent: Friday, August 02, 2013 1:36 PM
To: Jun Rao
Cc: users@kafka.apache.org
Subject: RE: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level
consumer fails to make progress_

I've attached a patch to KAFKA-989:

https://issues.apache.org/jira/browse/KAFKA-989

Does this patch seem valid?  I'm not sure about any locking constraints / order in Kafka code,
but this seemed like a benign change that might help the cause.

This change should eliminate dangerous races between startConnections / stopConnections in
ConsumerFetcherManager *and* the doWork method of the inner LeaderFinderThread class.  Basically,
with this change it now should not be possible for any of these 3 pieces of code to execute
their dangerous sections without holding the common lock--thus, each should be able to take
into account exactly the state left behind by the prior holder of the lock, whether starting
or stopping or doing work.

Not guaranteeing its perfect, but on the surface it seems better.

________________________________________
From: Jun Rao [junrao@gmail.com]
Sent: Wednesday, July 31, 2013 12:16 AM
To: Hargett, Phil
Cc: users@kafka.apache.org
Subject: Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level
consumer fails to make progress_

Hmm, that's a good theory. My understanding is that you have one thread that first shuts down
the consumer connector and then creates new streams on the same connector. Is that right?
If so, I don't think the race condition can happen. When we shutdown the consumer connector,
it waits until the leaderFinder thread is stopped. So, if the leaderFinder thread is still
being started, shutdown will block.

Thanks,

Jun


On Tue, Jul 30, 2013 at 9:34 AM, Hargett, Phil <phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>
wrote:
Hmmm...is there a reason that stopConnections in ConsumerFetcherManager does not grab a lock
before shutting down the leaderFinderThread?

I don't see what prevents startConnections/stopConnections from causing a race in certain
conditions and if called on separate threads.

Given there are no locks, its seems even possible that the ZookeeperConsumerConnector could
get all the way through its shutdown (including successfully calling stopConnections on ConsumerFetcherManager)
before the leaderFinderThread has been able to startup. In that scenario, the leaderFinderThread
would startup and immediately fail, because the ZkClient has already been closed.

That is the behavior I am seeing: leaderFinderThread fails because ZkClient is hitting an
NullPointerException, presumably because the ZkClient is already closed.

Don't know if that is the cause but it could be.

:)

On Jul 30, 2013, at 12:01 PM, "Jun Rao" <junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>
wrote:

What's the revision of the 0.8 branch that you used? If that's older than the beta1 release,
I recommend that you upgrade.

Thanks,

Jun


On Tue, Jul 30, 2013 at 3:09 AM, Hargett, Phil <phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>
wrote:
No, sorry, it didn't take 90 seconds to connect to ZK (at least I hope not). I had my consumer
open for 90 secs in this case before shutting it down and disposing of it—hence any races
caused by fast startup/shutdown should not have been relevant.

I build from source off of the 0.8 branch, so isn't that pretty close to beta 1?

:)

On Jul 30, 2013, at 12:22 AM, "Jun Rao" <junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>>
wrote:

Hmm, it takes 90 secs to connect to ZK? That seems way too long. Is your ZK healthy.

Also, are you on the 0.8 beta1 release? If not, could you try that one? It may not be related,
but we did fix some consumer side deadlock issues there.

Thanks,

Jun


On Mon, Jul 29, 2013 at 9:02 AM, Hargett, Phil <phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>>
wrote:
I think we have 3 different classes in play here:

 * kafka.consumer.ZookeeperConsumerConnector
 * kafka.utils.ShutdownableThread
 * kafka.consumer.ConsumerFetcherManager

The actual consumer is the first one, and it does a fair amount of work *before* stopping
the fetcher—which then results in shutting down the leader thread

If the initial connectZk method in ZookeeperConsumerConnector takes a long time (more than
90 seconds in 1 case this morning), then I could see the fetcher's stopConnections method
not getting called, because there isn't a ConsumerFetcherManager instance yet.

In other words, we could be shutting down the consumer before it is fully initialized—but
that doesn't seem correct, as the code in ZookeeperConsumerConnector is synchronous—my application
wouldn't have a reference to a partially initialized consumer.

Odd.

:)

On Jul 29, 2013, at 11:22 AM, "Jun Rao" <junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>>>
wrote:

There seems to be two separate issues.

1. Why do you see NullPointerException in the leaderFinder thread? I am not sure what's causing
this. In the normal path, when a consumer connector is shut down (this is when the pointer
is set to null), it first waits for the leaderFinder thread to shut down. Do you think that
you can provide a test case that reproduces this and attach it to the jira?

2. It seems that you have lots of consumer rebalances. This is good to avoid since it can
slow down the consumption.

Thanks,

Jun


On Mon, Jul 29, 2013 at 6:21 AM, Hargett, Phil <phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>>>
wrote:
Why would a consumer that has been shutdown still be rebalancing?

Zookeeper session timeout (zookeeper.session.timeout.ms<http://zookeeper.session.timeout.ms><http://zookeeper.session.timeout.ms><http://zookeeper.session.timeout.ms><http://zookeeper.session.timeout.ms>)
is 1000 and sync time (zookeeper.sync.timeout.ms<http://zookeeper.sync.timeout.ms><http://zookeeper.sync.timeout.ms><http://zookeeper.sync.timeout.ms><http://zookeeper.sync.timeout.ms>)
is 500.

Also, the timeout for the thread that looks for the leader is left at the default 200 milliseconds
(refresh.leader.backoff.ms<http://refresh.leader.backoff.ms><http://refresh.leader.backoff.ms><http://refresh.leader.backoff.ms><http://refresh.leader.backoff.ms>).
That's why we see these messages so often in our logs.

I can imagine I need to tune some of these settings for load...but the issue, I think, is
that the consumer has been shutdown, so the ZkClient for the leader finder thread no longer
has a connection—and won't.

:)

On Jul 28, 2013, at 11:21 PM, "Jun Rao" <junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>>>>
wrote:

Ok. So, it seems that the issue is there are lots of rebalances in the consumer. How long
did you set the zk session expiration time? A typical reason for many rebalances is the consumer
side GC. If so, you will see Zookeeper session expirations in the consumer log (grep for Expired).
Occasional rebalances are fine. Too many rebalances can slow down the consumption and one
will need to tune the java GC setting.

Thanks,

Jun


On Sat, Jul 27, 2013 at 9:33 AM, Hargett, Phil <phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>>>>
wrote:
All bugs are relative, aren't they? :)

Well, since this thread attempts to rebalance every 200 milliseconds, these messages REALLY
fill up a log and fast.

Because this error results in so much log output, it makes it difficult to find other actionable
error messages in the log.

Yes, I could suppress messages from that class (we use log4j after all) but I am uncomfortable
1) hiding a thread leak, 2) hiding other possible  errors from the same class.

I filed this as KAFKA 989 (IIRC), as I did not see an obvious bug that covers it.

This error also happens in less than 1 day of use: most of our systems in this category are
up for 2-3 months before a software upgrade or other event causes us to cycle the process.

I'm sure you have uptime and scaling requirements far beyond ours. So I hope these reasons
don't seem too petty. :)


On Jul 27, 2013, at 12:24 AM, "Jun Rao" <junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>>>>>
wrote:

Other than those exceptions, what issues do you see in your consumer?

Thanks,

Jun


On Fri, Jul 26, 2013 at 9:24 AM, Hargett, Phil <phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>>>>>
wrote:
This NOT a harmless race.

Now my QA teammate is encountering this issue under load. The result of it is a background
thread that is spinning in a loop that always hits a NullPointerException.

I have implemented a variety of assurances in my application code to ensure that the high-level
consumer I'm spinning up in Java stays alive for at least 10 seconds before being asked to
shutdown.  Yet the issue still persists.

Suggestions?
________________________________________
From: Jun Rao [junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>>>>]
Sent: Tuesday, June 25, 2013 11:58 PM
To: users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>>>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>>>>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>>>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org><mailto:users@kafka.apache.org<mailto:users@kafka.apache.org>>>>>>
Subject: Re: 0.8 throwing exception "Failed to find leader" and high-level consumer fails
to make progress

The exception is likely due to a race condition btw the logic in ZK watcher
and the closing of ZK connection. It's harmless, except for the weird
exception.

Thanks,

Jun


On Tue, Jun 25, 2013 at 10:07 AM, Hargett, Phil <
phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>>>>>
wrote:

> Possibly.
>
> I see evidence that its being stopped / started every 30 seconds in same
> cases (due to my code). It's entirely possible that I have a race, too, in
> that 2 separate pieces of code could be triggering such a stop / start.
>
> Gives me something to track down. Thank you!!
>
> On Jun 25, 2013, at 12:45 PM, "Jun Rao" <junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>><mailto:junrao@gmail.com<mailto:junrao@gmail.com><mailto:junrao@gmail.com<mailto:junrao@gmail.com>>>>>>>
wrote:
>
> > This typically only happens when the consumerConnector is shut down. Are
> > you restarting the consumerConnector often?
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Tue, Jun 25, 2013 at 9:40 AM, Hargett, Phil <
> > phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com><mailto:phil.hargett@mirror-image.com<mailto:phil.hargett@mirror-image.com>>>>>>>
wrote:
> >
> >> Seeing this exception a LOT (3-4 times per second, same log topic).
> >>
> >> I'm using external code to feed data to about 50 different log topics
> over
> >> a cluster of 3 Kafka 0.8 brokers.  There are 3 ZooKeeper instances as
> well,
> >> all of this is running on EC2.  My application creates a high-level
> >> consumer (1 per topic) to consumer data from each and do further
> processing.
> >>
> >> The problem is this exception is in the high-level consumer, so my code
> >> has no way of knowing that it's become stuck.
> >>
> >> This exception does not always appear, but as far as I can tell, once
> this
> >> happens, the only cure is to restart my application's process.
> >>
> >> I saw this in 0.8 built from source about 1 week ago, and also am seeing
> >> it today after pulling the latest 0.8 sources and rebuilding Kafka.
> >>
> >> Thoughts?
> >>
> >> Failed to find leader for Set([topic6,0]):
> java.lang.NullPointerException
> >>        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
> >>        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
> >>        at
> >> org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
> >>        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
> >>        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
> >>        at
> >> kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438)
> >>        at kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75)
> >>        at
> >>
> kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63)
> >>        at
> kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> >>
>





Mime
View raw message