nifi-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Curtis Ruck <curtis.r...@gmail.com>
Subject Re: 1.8.0 Clustering Timeouts
Date Tue, 21 May 2019 03:58:31 GMT
Alright, round two occurred today, and after much frustration, I believe I
know what exactly is causing the problem, and the solution is probably just
upgrade to nifi-1.9.2 which comes with an upgraded jetty.

Java 8u202 - Nifi 1.8.0 - Jetty 9.4.11 (same Jetty version is in Nifi 1.9.2)

BLUF, a single request to node1/nifi-api/controller/cluster (on thread NiFi
Web Server-209) replicates (thread Replicate Request Thread-4) to itself
(node1/nifi-api/controller/cluster) with additional headers indicating it's
a replicated request.  The replicated request is received by thread NiFi
Web Server-210.  The thread parks "waiting" This request to
controller/cluster wants to run in a ReservedThreadExecutor.  The
replicated request is stuck "waiting" for 30 seconds until OkHttp timeouts
in EpollSelector land.

My nifi.properties has 200 threads assigned to nifi.web.jetty.threads.  I
also have read.timeout set to 30 sec.

~6500 lines of debugging is here.
https://gist.githubusercontent.com/ruckc/df947e04ae4fb55bc37ecc116f747848/raw/06625f0b6169acf272edc648a7fbd2abd043e0a4/node1.log

Wish I knew where to go from here...

--
Curtis Ruck


On Mon, May 20, 2019 at 10:06 AM Curtis Ruck <curtis.ruck@gmail.com> wrote:

> I have a new 3 node cluster i'm trying to stand up.
>
> On most of the API calls (to create proxy the proxy user) and others for
> debugging, i get a "java.net.SocketTimeoutException: timeout" returned.  In
> debugging the issue, the timeout occurs when the API replicates the request
> to the same instance receiving the request.  I.e when node1 replicates the
> request to node1.
>
> I've attached a pruned debug logfile showing the /controller/config
> request with the current request "nifi.cluster.node.read.timeout=10 sec".
> From what I can tell in the attached log is, Thread "Nifi Web Server-24"
> gets a request for /nifi-api/controller/cluster" it replicates the request
> to itself, which arrives 10 secs (the request timeout) later and comes in
> on thread "NiFi Web Server-161" at roughly the same moment (5-6 ms after)
> okhttp3 timeouts the request.  When 161 goes to respond the connection is
> already closed.  So I think what's happening is that node1 locks/blocks
> something that the replicated request needs to read from, but can't until
> the lock is released when the request timeouts.  So why is the request
> getting replicated to itself?
>
> In troubleshooting I tried finding relevant REST endpoints and i get these
> results:
>
> /controller/config -> java.net.SocketTimeoutException: timeout
> /flow/status ->  java.net.SocketTimeoutException: timeout
>
> /flow/cluster/summary ->  {
>   clusterSummary => {
>     connectedNodes => "3 / 3",
>     connectedNodeCount => 3,
>     totalNodeCount => 3,
>     connectedToCluster => true,
>     clustered => true
>   }
> }
>
> /controller/cluster -> {
>   cluster => {
>     nodes => [
>       {
>         nodeId => "8d066e95-b254-4527-a214-79ca5359f078",
>         address => "node2.example.com",
>         apiPort => 8443,
>         status => "CONNECTED",
>         heartbeat => "05/20/2019 02:19:13 UTC",
>         roles => [
>           "Cluster Coordinator"
>         ],
>         activeThreadCount => 0,
>         queued => "0 / 0 bytes",
>         events => [
>           {
>             timestamp => "05/20/2019 02:17:35 UTC",
>             category => "INFO",
>             message => "Received first heartbeat from connecting node.
> Node connected."
>           },
>           {
>             timestamp => "05/20/2019 02:17:20 UTC",
>             category => "INFO",
>             message => "Connection requested from existing node. Setting
> status to connecting."
>           },
>           {
>             timestamp => "05/20/2019 02:17:20 UTC",
>             category => "INFO",
>             message => "Connection requested from existing node. Setting
> status to connecting."
>           },
>           {
>             timestamp => "05/20/2019 02:17:14 UTC",
>             category => "INFO",
>             message => "Requesting that node connect to cluster"
>           },
>           {
>             timestamp => "05/20/2019 02:17:14 UTC",
>             category => "INFO",
>             message => "Received heartbeat from node previously
> disconnected due to Has Not Yet Connected
>           }
>         ],
>         nodeStartTime => "05/20/2019 02:17:08 UTC"
>       },
>       {
>         nodeId => "e5785deb-5646-421f-b858-e35342195988",
>         address => "node3.example.com",
>         apiPort => 8443,
>         status => "CONNECTED",
>         heartbeat => "05/20/2019 02:19:10 UTC",
>         roles => [
>           "Primary Node"
>         ],
>         activeThreadCount => 0,
>         queued => "0 / 0 bytes",
>         events => [
>           {
>             timestamp => "05/20/2019 02:17:30 UTC",
>             category => "INFO",
>             message => "Received first heartbeat from connecting node.
> Node connected."
>           },
>           {
>             timestamp => "05/20/2019 02:17:17 UTC",
>             category => "INFO",
>             message => "Connection requested from existing node. Setting
> status to connecting."
>           },
>           {
>             timestamp => "05/20/2019 02:17:17 UTC",
>             category => "INFO",
>             message => "Connection requested from existing node. Setting
> status to connecting."
>           },
>           {
>             timestamp => "05/20/2019 02:17:14 UTC",
>             category => "INFO",
>             message => "Received heartbeat from node previously
> disconnected due to Has Not Yet Connected
>           },
>           {
>             timestamp => "05/20/2019 02:17:14 UTC",
>             category => "INFO",
>             message => "Requesting that node connect to cluster"
>           }
>         ],
>         nodeStartTime => "05/20/2019 02:17:06 UTC"
>       },
>       {
>         nodeId => "334f49b5-d714-4013-8180-0ffca95aa829",
>         address => " node1.example.com  ",
>         apiPort => 8443,
>         status => "CONNECTED",
>         heartbeat => "05/20/2019 02:19:10 UTC",
>         roles => [],
>         activeThreadCount => 0,
>         queued => "0 / 0 bytes",
>         events => [
>           {
>             timestamp => "05/20/2019 02:17:35 UTC",
>             category => "INFO",
>             message => "Received first heartbeat from connecting node.
> Node connected."
>           },
>           {
>             timestamp => "05/20/2019 02:17:23 UTC",
>             category => "INFO",
>             message => "Connection requested from existing node. Setting
> status to connecting."
>           },
>           {
>             timestamp => "05/20/2019 02:17:17 UTC",
>             category => "INFO",
>             message => "Connection requested from existing node. Setting
> status to connecting."
>           },
>           {
>             timestamp => "05/20/2019 02:17:15 UTC",
>             category => "INFO",
>             message => "Received heartbeat from node previously
> disconnected due to Has Not Yet Connected
>           },
>           {
>             timestamp => "05/20/2019 02:17:15 UTC",
>             category => "INFO",
>             message => "Requesting that node connect to cluster"
>           }
>         ],
>         nodeStartTime => "05/20/2019 02:17:06 UTC"
>       }
>     ],
>     generated => "02:19:15 UTC"
>   }
> }
>
>
> --
> Curtis Ruck
>

Mime
View raw message