Alright, round two occurred today, and after much frustration, I believe I know what exactly is causing the problem, and the solution is probably just upgrade to nifi-1.9.2 which comes with an upgraded jetty.

Java 8u202 - Nifi 1.8.0 - Jetty 9.4.11 (same Jetty version is in Nifi 1.9.2)

BLUF, a single request to node1/nifi-api/controller/cluster (on thread NiFi Web Server-209) replicates (thread Replicate Request Thread-4) to itself (node1/nifi-api/controller/cluster) with additional headers indicating it's a replicated request.  The replicated request is received by thread NiFi Web Server-210.  The thread parks "waiting" This request to controller/cluster wants to run in a ReservedThreadExecutor.  The replicated request is stuck "waiting" for 30 seconds until OkHttp timeouts in EpollSelector land.

My nifi.properties has 200 threads assigned to nifi.web.jetty.threads.  I also have read.timeout set to 30 sec.

~6500 lines of debugging is here.  
https://gist.githubusercontent.com/ruckc/df947e04ae4fb55bc37ecc116f747848/raw/06625f0b6169acf272edc648a7fbd2abd043e0a4/node1.log

Wish I knew where to go from here...

--
Curtis Ruck


On Mon, May 20, 2019 at 10:06 AM Curtis Ruck <curtis.ruck@gmail.com> wrote:
I have a new 3 node cluster i'm trying to stand up.

On most of the API calls (to create proxy the proxy user) and others for debugging, i get a "java.net.SocketTimeoutException: timeout" returned.  In debugging the issue, the timeout occurs when the API replicates the request to the same instance receiving the request.  I.e when node1 replicates the request to node1.

I've attached a pruned debug logfile showing the /controller/config request with the current request "nifi.cluster.node.read.timeout=10 sec".  From what I can tell in the attached log is, Thread "Nifi Web Server-24" gets a request for /nifi-api/controller/cluster" it replicates the request to itself, which arrives 10 secs (the request timeout) later and comes in on thread "NiFi Web Server-161" at roughly the same moment (5-6 ms after) okhttp3 timeouts the request.  When 161 goes to respond the connection is already closed.  So I think what's happening is that node1 locks/blocks something that the replicated request needs to read from, but can't until the lock is released when the request timeouts.  So why is the request getting replicated to itself?

In troubleshooting I tried finding relevant REST endpoints and i get these results:

/controller/config -> java.net.SocketTimeoutException: timeout 
/flow/status ->  java.net.SocketTimeoutException: timeout

/flow/cluster/summary ->  {
  clusterSummary => {
    connectedNodes => "3 / 3",
    connectedNodeCount => 3,
    totalNodeCount => 3,
    connectedToCluster => true,
    clustered => true
  }
}

/controller/cluster -> {
  cluster => {
    nodes => [
      {
        nodeId => "8d066e95-b254-4527-a214-79ca5359f078",
        address => "node2.example.com",
        apiPort => 8443,
        status => "CONNECTED",
        heartbeat => "05/20/2019 02:19:13 UTC",
        roles => [
          "Cluster Coordinator"
        ],
        activeThreadCount => 0,
        queued => "0 / 0 bytes",
        events => [
          {
            timestamp => "05/20/2019 02:17:35 UTC",
            category => "INFO",
            message => "Received first heartbeat from connecting node. Node connected."
          },
          {
            timestamp => "05/20/2019 02:17:20 UTC",
            category => "INFO",
            message => "Connection requested from existing node. Setting status to connecting."
          },
          {
            timestamp => "05/20/2019 02:17:20 UTC",
            category => "INFO",
            message => "Connection requested from existing node. Setting status to connecting."
          },
          {
            timestamp => "05/20/2019 02:17:14 UTC",
            category => "INFO",
            message => "Requesting that node connect to cluster"
          },
          {
            timestamp => "05/20/2019 02:17:14 UTC",
            category => "INFO",
            message => "Received heartbeat from node previously disconnected due to Has Not Yet Connected
          }
        ],
        nodeStartTime => "05/20/2019 02:17:08 UTC"
      },
      {
        nodeId => "e5785deb-5646-421f-b858-e35342195988",
        address => "node3.example.com",
        apiPort => 8443,
        status => "CONNECTED",
        heartbeat => "05/20/2019 02:19:10 UTC",
        roles => [
          "Primary Node"
        ],
        activeThreadCount => 0,
        queued => "0 / 0 bytes",
        events => [
          {
            timestamp => "05/20/2019 02:17:30 UTC",
            category => "INFO",
            message => "Received first heartbeat from connecting node. Node connected."
          },
          {
            timestamp => "05/20/2019 02:17:17 UTC",
            category => "INFO",
            message => "Connection requested from existing node. Setting status to connecting."
          },
          {
            timestamp => "05/20/2019 02:17:17 UTC",
            category => "INFO",
            message => "Connection requested from existing node. Setting status to connecting."
          },
          {
            timestamp => "05/20/2019 02:17:14 UTC",
            category => "INFO",
            message => "Received heartbeat from node previously disconnected due to Has Not Yet Connected
          },
          {
            timestamp => "05/20/2019 02:17:14 UTC",
            category => "INFO",
            message => "Requesting that node connect to cluster"
          }
        ],
        nodeStartTime => "05/20/2019 02:17:06 UTC"
      },
      {
        nodeId => "334f49b5-d714-4013-8180-0ffca95aa829",
        address => " node1.example.com  ",
        apiPort => 8443,
        status => "CONNECTED",
        heartbeat => "05/20/2019 02:19:10 UTC",
        roles => [],
        activeThreadCount => 0,
        queued => "0 / 0 bytes",
        events => [
          {
            timestamp => "05/20/2019 02:17:35 UTC",
            category => "INFO",
            message => "Received first heartbeat from connecting node. Node connected."
          },
          {
            timestamp => "05/20/2019 02:17:23 UTC",
            category => "INFO",
            message => "Connection requested from existing node. Setting status to connecting."
          },
          {
            timestamp => "05/20/2019 02:17:17 UTC",
            category => "INFO",
            message => "Connection requested from existing node. Setting status to connecting."
          },
          {
            timestamp => "05/20/2019 02:17:15 UTC",
            category => "INFO",
            message => "Received heartbeat from node previously disconnected due to Has Not Yet Connected
          },
          {
            timestamp => "05/20/2019 02:17:15 UTC",
            category => "INFO",
            message => "Requesting that node connect to cluster"
          }
        ],
        nodeStartTime => "05/20/2019 02:17:06 UTC"
      }
    ],
    generated => "02:19:15 UTC"
  }
}


--
Curtis Ruck