kafka-users mailing list archives

From Ewen Cheslack-Postava <e...@confluent.io>
Subject Re: Producer connect timeouts
Date Sun, 18 Dec 2016 00:55:18 GMT
Without having dug back into the code to check, this sounds right.
Connection management just fires off a connect attempt, and subsequent
poll() calls handle any successful or failed connections. Request timeouts
are handled somewhat differently: the connection attempt isn't explicitly
tied to the request that triggered it, so when the latter times out, we
don't follow up and time out the connection attempt as well.
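To make the pattern concrete, here is a minimal sketch of non-blocking connect semantics (illustrative only, not Kafka's actual NetworkClient code; the loopback "broker" socket is a stand-in): the connect call just initiates the TCP handshake, and the outcome is only observed on a later select/poll pass, which is why nothing ties the connect attempt back to the request that triggered it.

```python
import selectors
import socket

# Stand-in "broker": a listening socket on an ephemeral loopback port.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

sel = selectors.DefaultSelector()
client = socket.socket()
client.setblocking(False)

# Fire off the connect; this returns immediately (EINPROGRESS-style),
# without waiting for the handshake to complete.
client.connect_ex(server.getsockname())
sel.register(client, selectors.EVENT_WRITE)

# The "poll loop": only here do we learn whether the connect succeeded.
# If the peer never answers, we simply sit here until some timeout fires.
events = sel.select(timeout=5)
for key, _ in events:
    err = key.fileobj.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    print("connected" if err == 0 else "connect failed: %d" % err)

client.close()
sel.close()
server.close()
```

Because the initiating request is gone by the time the poll loop runs, expiring that request does nothing to the in-flight handshake.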

So yes, connection attempts are currently bounded only by the underlying
TCP connection timeout. This tends to be much more of a problem in public
clouds, where the handshake packets are silently dropped due to firewall
rules.

Lowering metadata.max.age.ms is a workable solution, but agreed that it's
not great. If possible, reducing the default TCP connection timeout isn't
unreasonable either -- the defaults are set for WAN connections (and
arguably for the WAN connections of long ago), so much more aggressive
timeouts are reasonable for Kafka clusters.
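As a rough sketch of the config-side workaround (the value here is illustrative, not a recommendation):

```properties
# Producer config: refresh metadata every 30s instead of the default
# 5 minutes, so stale broker addresses are replaced before a low-volume
# producer tries to connect to an already-terminated instance.
metadata.max.age.ms=30000
```

On the OS side, Linux exposes the connect timeout indirectly via `net.ipv4.tcp_syn_retries`; lowering it makes a connect to a dead host fail in seconds rather than minutes, at the cost of more aggressive behavior for every outbound connection on the box.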

-Ewen

On Fri, Dec 16, 2016 at 1:41 PM, Luke Steensen
<luke.steensen@braintreepayments.com> wrote:

> Hello,
>
> Is it correct that producers do not fail new connection establishment when
> it exceeds the request timeout?
>
> Running on AWS, we've encountered a problem where certain very low volume
> producers end up with metadata that's sufficiently stale that they attempt
> to establish a connection to a broker instance that has already been
> terminated as part of a maintenance operation. I would expect this to fail
> and be retried normally, but it appears to hang until the system-level TCP
> connection timeout is reached (2-3 minutes), with the writes themselves
> being expired before even a single attempt is made to send them.
>
> We've worked around the issue by setting `metadata.max.age.ms` extremely
> low, such that these producers are requesting new metadata much faster than
> our maintenance operations are terminating instances. While this does work,
> it seems like an unfortunate workaround for some very surprising behavior.
>
> Thanks,
> Luke
>
