nifi-users mailing list archives

From Curtis Ruck <curtis.r...@gmail.com>
Subject Re: 1.8.0 Clustering Timeouts
Date Tue, 21 May 2019 20:11:31 GMT
After more analysis of the log, it appears the issue is likely in Jetty, as
the stack trace shows the relevant threads sitting in sun's EpollSelector in
native code for the duration of the 30-second timeout.
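
(How I'm looking at this: jstack against the NiFi PID during the stall,
which is where the EpollSelector frames show up.  The same view is
available programmatically; a minimal sketch using only the JDK, in case
anyone wants to script it:)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // true, true => include locked monitors and ownable synchronizers
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            // ThreadInfo.toString() truncates very deep stacks, but it's
            // enough to see threads parked in epoll native code
            System.out.print(info);
        }
    }
}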

I've dug as deep as I can.  The problem can be reliably recreated in
nifi-1.8.0 and in nifi-1.9.2 in my specific environment, though it can't be
recreated outside of that environment.  The JVM isn't eating 100% CPU during
the 30-second timeout; it seems to just lose track of the connection/request
until the client timeout closes the connection.  I've tried increasing the
timeout to 3 minutes with no success.  I've tried different JDK versions:
Oracle 8u202, 8u212, and 8u65, and OpenJDK (Red Hat build) 8u212.

I've boiled the essence of the log down to this...  if anyone has any
suggestions, I'm all ears.  I do wish NIFI-5581 documented the symptoms
associated with the issue instead of just the solution.

2019-05-21 03:05:30,416 The client request is initiated on thread "NiFi Web
Server-32".
2019-05-21 03:05:30,494 The request makes its way to "NiFi Web Server-209"
(how?) and initiates request replication.
2019-05-21 03:05:30,496 We see the new replicated connection get accepted;
it lands on "NiFi Web Server-24" and "NiFi Web Server-210".
2019-05-21 03:05:30,567 The full replicated request headers are logged.
2019-05-21 03:05:30,569 The threads start waiting in ManagedSelector
(3663b227).
30 seconds go by with the JVM doing some heartbeating between cluster nodes.
2019-05-21 03:06:00,563 The ManagedSelector (3663b227) wakes up because the
socket was closed.
2019-05-21 03:06:00,570 The client gets a timeout exception from okhttp.
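
To make the client side of that last line concrete: the request replication
here goes through okhttp, and what the caller sees is a plain read timeout.
A minimal standalone sketch (the URL is a placeholder; a real call to a
secured NiFi would also need client certificates configured):

import java.util.concurrent.TimeUnit;

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class ReplicationTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // 30 s read timeout, matching nifi.cluster.node.read.timeout
        OkHttpClient client = new OkHttpClient.Builder()
                .readTimeout(30, TimeUnit.SECONDS)
                .build();

        Request request = new Request.Builder()
                .url("https://node1.example.com:8443/nifi-api/controller/cluster")
                .build();

        // If the server accepts the connection but never writes a response,
        // execute() throws java.net.SocketTimeoutException: timeout -- the
        // exact exception in the timeline above
        try (Response response = client.newCall(request).execute()) {
            System.out.println(response.code());
        }
    }
}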

https://gist.github.com/ruckc/61e2ef68f084e859e225319e0ba6c458

--
Curtis Ruck


On Mon, May 20, 2019 at 11:58 PM Curtis Ruck <curtis.ruck@gmail.com> wrote:

> Alright, round two occurred today, and after much frustration, I believe I
> know exactly what is causing the problem, and the solution is probably just
> upgrading to nifi-1.9.2, which comes with an upgraded Jetty.
>
> Java 8u202 - NiFi 1.8.0 - Jetty 9.4.11 (the same Jetty version ships in
> NiFi 1.9.2)
>
> BLUF: a single request to node1/nifi-api/controller/cluster (on thread
> NiFi Web Server-209) is replicated (on thread Replicate Request Thread-4)
> to itself (node1/nifi-api/controller/cluster) with additional headers
> indicating it's a replicated request.  The replicated request is received
> by thread NiFi Web Server-210.  That thread parks "waiting": the request to
> controller/cluster wants to run in a ReservedThreadExecutor.  The
> replicated request stays stuck "waiting" for 30 seconds until OkHttp times
> out in EpollSelector land.
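>
> For anyone who hasn't bumped into it, here's a minimal sketch of Jetty's
> ReservedThreadExecutor semantics as I understand them (the pool size and
> reserved capacity below are illustrative, not NiFi's actual wiring):
>
> import org.eclipse.jetty.util.thread.QueuedThreadPool;
> import org.eclipse.jetty.util.thread.ReservedThreadExecutor;
>
> public class ReservedThreadSketch {
>     public static void main(String[] args) throws Exception {
>         // a worker pool like Jetty's, sized like nifi.web.jetty.threads=200
>         QueuedThreadPool pool = new QueuedThreadPool(200);
>         pool.start();
>
>         // Jetty holds back a small set of "reserved" threads for
>         // latency-sensitive work such as selector processing
>         ReservedThreadExecutor reserved = new ReservedThreadExecutor(pool, 4);
>         reserved.start();
>
>         // Unlike a plain executor, tryExecute never queues: it returns
>         // false when no reserved thread is free, and the caller has to
>         // fall back to running the task some other way
>         boolean accepted = reserved.tryExecute(
>                 () -> System.out.println("ran on a reserved thread"));
>         System.out.println("accepted: " + accepted);
>
>         reserved.stop();
>         pool.stop();
>     }
> }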
>
> My nifi.properties has 200 threads assigned to nifi.web.jetty.threads.  I
> also have nifi.cluster.node.read.timeout set to 30 sec.
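>
> For reference, the relevant nifi.properties entries as described above:
>
> nifi.web.jetty.threads=200
> nifi.cluster.node.read.timeout=30 sec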
>
> ~6500 lines of debug logging are here:
>
> https://gist.githubusercontent.com/ruckc/df947e04ae4fb55bc37ecc116f747848/raw/06625f0b6169acf272edc648a7fbd2abd043e0a4/node1.log
>
> Wish I knew where to go from here...
>
> --
> Curtis Ruck
>
>
> On Mon, May 20, 2019 at 10:06 AM Curtis Ruck <curtis.ruck@gmail.com>
> wrote:
>
>> I have a new 3-node cluster I'm trying to stand up.
>>
>> On most of the API calls (to create the proxy user) and others for
>> debugging, I get a "java.net.SocketTimeoutException: timeout" returned.  In
>> debugging the issue, the timeout occurs when the API replicates the request
>> to the same instance that received it, i.e. when node1 replicates the
>> request to node1.
>>
>> I've attached a pruned debug logfile showing the /controller/config
>> request with the current setting "nifi.cluster.node.read.timeout=10 sec".
>> From what I can tell in the attached log, thread "NiFi Web Server-24"
>> gets a request for /nifi-api/controller/cluster and replicates the request
>> to itself, which arrives 10 secs (the request timeout) later and comes in
>> on thread "NiFi Web Server-161" at roughly the same moment (5-6 ms after)
>> okhttp3 times out the request.  When thread 161 goes to respond, the
>> connection is already closed.  So I think what's happening is that node1
>> locks/blocks something that the replicated request needs to read from, but
>> can't until the lock is released when the request times out.  So why is the
>> request getting replicated to itself?
>>
>> In troubleshooting, I tried hitting the relevant REST endpoints and got
>> these results:
>>
>> /controller/config -> java.net.SocketTimeoutException: timeout
>> /flow/status ->  java.net.SocketTimeoutException: timeout
>>
>> /flow/cluster/summary ->  {
>>   clusterSummary => {
>>     connectedNodes => "3 / 3",
>>     connectedNodeCount => 3,
>>     totalNodeCount => 3,
>>     connectedToCluster => true,
>>     clustered => true
>>   }
>> }
>>
>> /controller/cluster -> {
>>   cluster => {
>>     nodes => [
>>       {
>>         nodeId => "8d066e95-b254-4527-a214-79ca5359f078",
>>         address => "node2.example.com",
>>         apiPort => 8443,
>>         status => "CONNECTED",
>>         heartbeat => "05/20/2019 02:19:13 UTC",
>>         roles => [
>>           "Cluster Coordinator"
>>         ],
>>         activeThreadCount => 0,
>>         queued => "0 / 0 bytes",
>>         events => [
>>           {
>>             timestamp => "05/20/2019 02:17:35 UTC",
>>             category => "INFO",
>>             message => "Received first heartbeat from connecting node.
>> Node connected."
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:20 UTC",
>>             category => "INFO",
>>             message => "Connection requested from existing node. Setting
>> status to connecting."
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:20 UTC",
>>             category => "INFO",
>>             message => "Connection requested from existing node. Setting
>> status to connecting."
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:14 UTC",
>>             category => "INFO",
>>             message => "Requesting that node connect to cluster"
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:14 UTC",
>>             category => "INFO",
>>             message => "Received heartbeat from node previously
>> disconnected due to Has Not Yet Connected
>>           }
>>         ],
>>         nodeStartTime => "05/20/2019 02:17:08 UTC"
>>       },
>>       {
>>         nodeId => "e5785deb-5646-421f-b858-e35342195988",
>>         address => "node3.example.com",
>>         apiPort => 8443,
>>         status => "CONNECTED",
>>         heartbeat => "05/20/2019 02:19:10 UTC",
>>         roles => [
>>           "Primary Node"
>>         ],
>>         activeThreadCount => 0,
>>         queued => "0 / 0 bytes",
>>         events => [
>>           {
>>             timestamp => "05/20/2019 02:17:30 UTC",
>>             category => "INFO",
>>             message => "Received first heartbeat from connecting node.
>> Node connected."
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:17 UTC",
>>             category => "INFO",
>>             message => "Connection requested from existing node. Setting
>> status to connecting."
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:17 UTC",
>>             category => "INFO",
>>             message => "Connection requested from existing node. Setting
>> status to connecting."
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:14 UTC",
>>             category => "INFO",
>>             message => "Received heartbeat from node previously
>> disconnected due to Has Not Yet Connected
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:14 UTC",
>>             category => "INFO",
>>             message => "Requesting that node connect to cluster"
>>           }
>>         ],
>>         nodeStartTime => "05/20/2019 02:17:06 UTC"
>>       },
>>       {
>>         nodeId => "334f49b5-d714-4013-8180-0ffca95aa829",
>>         address => " node1.example.com  ",
>>         apiPort => 8443,
>>         status => "CONNECTED",
>>         heartbeat => "05/20/2019 02:19:10 UTC",
>>         roles => [],
>>         activeThreadCount => 0,
>>         queued => "0 / 0 bytes",
>>         events => [
>>           {
>>             timestamp => "05/20/2019 02:17:35 UTC",
>>             category => "INFO",
>>             message => "Received first heartbeat from connecting node.
>> Node connected."
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:23 UTC",
>>             category => "INFO",
>>             message => "Connection requested from existing node. Setting
>> status to connecting."
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:17 UTC",
>>             category => "INFO",
>>             message => "Connection requested from existing node. Setting
>> status to connecting."
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:15 UTC",
>>             category => "INFO",
>>             message => "Received heartbeat from node previously
>> disconnected due to Has Not Yet Connected
>>           },
>>           {
>>             timestamp => "05/20/2019 02:17:15 UTC",
>>             category => "INFO",
>>             message => "Requesting that node connect to cluster"
>>           }
>>         ],
>>         nodeStartTime => "05/20/2019 02:17:06 UTC"
>>       }
>>     ],
>>     generated => "02:19:15 UTC"
>>   }
>> }
>>
>>
>> --
>> Curtis Ruck
>>
>
