nifi-users mailing list archives

From Mark Payne <marka...@hotmail.com>
Subject Re: flowfiles stuck in load balanced queue; nifi 1.8
Date Fri, 18 Jan 2019 00:17:40 GMT
Hey Dan,

This can happen even within a process group; it is just much more likely when the destination
of the connection is a Port or a Funnel, because those components don't really do any work.
They just push the FlowFile to the next connection, and that makes them super fast.

There are a few different PRs awaiting review (unrelated to this) that I'd like
to see merged very soon, and then I think it's probably time to start talking about a
1.9.0 release. There are several bug fixes, especially related to the load-balanced connections,
and enough new features that I think it's worth considering a release soon.

Sent from my iPhone

On Jan 17, 2019, at 6:59 PM, dan young <danoyoung@gmail.com> wrote:

Hello Mark,

We're seeing "stuck" flow files again, this time within a PG...see attached screen shots :(



On Fri, Dec 28, 2018 at 8:43 AM Mark Payne <markap14@hotmail.com> wrote:
Dan, et al,

Great news! I was finally able to replicate this issue, by creating a load-balanced connection
between two Process Groups/Ports instead of between two Processors. The fact that it's between
two Ports does not, in and of itself, matter. But there is a race condition, and Ports do no actual
processing of the FlowFile (they simply pull it from one queue and transfer it to another). As
a result, because that is extremely fast, it is more likely to trigger the race condition.

So I created a JIRA [1] and have submitted a PR for it.

Interestingly, while there is no fool-proof workaround until this fix is in and released, you could
choose to update your flow so that the connection between Process Groups is not load balanced
and instead the connection between the Input Port and the first Processor is load balanced. Again,
this is not fool-proof, because the issue can affect a load-balanced connection even if it is
connected to a Processor, but it is less likely to do so, so you would likely see the issue occur
far less often.

Thank you so much for sticking with us all as we diagnose this and figure it all out - I would
not have been able to figure it out without you spending the time to debug the issue!

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-5919


On Dec 26, 2018, at 10:31 PM, dan young <danoyoung@gmail.com> wrote:

Hello Mark,

I just stopped the destination processor, and then disconnected the node in question (nifi1-1).
Once I disconnected the node, the flow file in the load-balanced connection disappeared from
the queue.  After that, I reconnected the node (with the downstream processor still stopped)
and once the node successfully rejoined the cluster, the flowfile showed up in the queue again.
After this, I started the connected downstream processor, but the flowfile stayed in the queue.
The only way to clear the queue is to actually restart the node.  If I disconnect the node,
and then restart that node, the flowfile is no longer present in the queue.

Regards,

Dano


On Wed, Dec 26, 2018 at 6:13 PM Mark Payne <markap14@hotmail.com> wrote:
Ok, I just wanted to confirm that when you said “once it rejoins the cluster that flow file
is gone” that you mean “the flowfile did not exist on the system” and NOT “the queue
size was 0 by the time that I looked at the UI.” I.e., is it possible that the FlowFile
did exist, was restored, and then was processed before you looked at the UI? Or the FlowFile
definitely did not exist after the node was restarted? That’s why I was suggesting that
you restart with the connection’s source and destination stopped. Just to make sure that
the FlowFile didn’t just get processed quickly on restart.

Sent from my iPhone

On Dec 26, 2018, at 7:55 PM, dan young <danoyoung@gmail.com> wrote:

Heya Mark,

If we restart the node, that "stuck" flowfile will disappear. This is the only way so far
to clear out the flowfile. I usually disconnect the node, then once it's disconnected I restart
NiFi, and then once it rejoins the cluster that flow file is gone. If we try to empty the
queue, it will just say that there are no flow files in the queue.


On Wed, Dec 26, 2018, 5:22 PM Mark Payne <markap14@hotmail.com> wrote:
Hey Dan,

Thanks, this is super useful! So, the following section is the damning part of the JSON:

          {
            "totalFlowFileCount": 1,
            "totalByteCount": 975890,
            "nodeIdentifier": "nifi1-1:9443",
            "localQueuePartition": {
              "totalFlowFileCount": 0,
              "totalByteCount": 0,
              "activeQueueFlowFileCount": 0,
              "activeQueueByteCount": 0,
              "swapFlowFileCount": 0,
              "swapByteCount": 0,
              "swapFiles": 0,
              "inFlightFlowFileCount": 0,
              "inFlightByteCount": 0,
              "allActiveQueueFlowFilesPenalized": false,
              "anyActiveQueueFlowFilesPenalized": false
            },
            "remoteQueuePartitions": [
              {
                "totalFlowFileCount": 0,
                "totalByteCount": 0,
                "activeQueueFlowFileCount": 0,
                "activeQueueByteCount": 0,
                "swapFlowFileCount": 0,
                "swapByteCount": 0,
                "swapFiles": 0,
                "inFlightFlowFileCount": 0,
                "inFlightByteCount": 0,
                "nodeIdentifier": "nifi2-1:9443"
              },
              {
                "totalFlowFileCount": 0,
                "totalByteCount": 0,
                "activeQueueFlowFileCount": 0,
                "activeQueueByteCount": 0,
                "swapFlowFileCount": 0,
                "swapByteCount": 0,
                "swapFiles": 0,
                "inFlightFlowFileCount": 0,
                "inFlightByteCount": 0,
                "nodeIdentifier": "nifi3-1:9443"
              }
            ]
          }

It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 bytes. But it
also shows that the FlowFile is not in the "local partition" or either of the two "remote
partitions." So that leaves us with two possibilities:

1) The Queue's Count is wrong, because it somehow did not get decremented (perhaps a threading
bug?)

Or

2) The Count is correct and the FlowFile exists, but somehow the reference to the FlowFile
was lost by the FlowFile Queue (again, perhaps a threading bug?)

If possible, I would like for you to stop both the source and destination of that connection and
then restart node nifi1-1. Once it has restarted, check if the FlowFile is still in the connection.
That will tell us which of the two above scenarios is taking place. If the FlowFile exists
upon restart, then the Queue somehow lost the handle to it. If the FlowFile does not exist
in the connection upon restart (I'm guessing this will be the case), then it indicates that
somehow the count is incorrect.
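
As an aside, if you want to pull just that per-partition summary out of the full diagnostics JSON, a jq filter along these lines should do it. This is only a sketch, assuming you have jq available and point it at the attached diag.json; it is not part of any NiFi tooling:

    jq '.. | objects | select(has("localQueuePartition"))
        | {node: .nodeIdentifier,
           total: .totalFlowFileCount,
           local: .localQueuePartition.totalFlowFileCount,
           remote: [.remoteQueuePartitions[].totalFlowFileCount]}' diag.json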

Many thanks
-Mark

________________________________
From: dan young <danoyoung@gmail.com>
Sent: Wednesday, December 26, 2018 9:18 AM
To: NiFi Mailing List
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8

Heya Mark,

So I added a LogAttribute processor and routed the connection that had the "stuck" flowfile
to it.  I ran a diagnostics request against the LogAttribute processor before I started it, and then
ran another one after I started it.  The flowfile stayed in the load-balanced connection/queue.
I've attached both files.  Please LMK if this helps.

Regards,

Dano


On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <markap14@hotmail.com> wrote:
Dan,

You would want to get diagnostics for the processor that is the source/destination of the
connection - not the FlowFile. But if your connection is connecting 2 process groups, then both
its source and destination are Ports, not Processors. So the easiest thing to do would be
to drop a “dummy processor” into the flow between the 2 groups, drag the Connection to
that processor, get diagnostics for the processor, and then drag it back to where it was.
Does that make sense? Sorry for the hassle.

Thanks
-Mark

Sent from my iPhone

On Dec 24, 2018, at 11:40 AM, dan young <danoyoung@gmail.com> wrote:

Hello Bryan,

Thank you, that was the ticket!

Mark, I was able to run the diagnostics for a processor that's downstream from the connection
where the flowfile appears to be "stuck". I'm not sure which processor is the source of this
particular "stuck" flowfile since we have a number of upstream process groups (PG) that
feed into a funnel.  This funnel is then connected to a downstream PG. It is the connection
between the funnel and the downstream PG where the flowfile is stuck. I might reduce the upstream
"load balanced connections" between the various PGs to just one so I can narrow down where we need
to run diagnostics....  If this isn't the correct processor to be gathering diagnostics on, please
LMK where else I should look or other diagnostics to run...

I've also attached the output of the GET on nifi-api/connections/{id} for the connection where
the flowfile appears to be "stuck".

On Sun, Dec 23, 2018 at 8:36 PM Bryan Bende <bbende@gmail.com> wrote:
You’ll need to get the token that was obtained when you logged in to the SSO and submit
it on the curl requests the same way the UI is doing on all requests.

You should be able to open the Chrome dev tools while in the UI, look at one of the requests/responses,
and copy the value of the 'Authorization' header, which should be in the form 'Bearer <token>'.

Then send this on the curl command by specifying a header of -H 'Authorization: Bearer <token>'
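
For example, something along these lines should work. This is only a sketch: the host, port, and processor id are placeholders for your own node, and -k is there just to skip server certificate validation if you need to:

    curl -k \
      -H 'Authorization: Bearer <token>' \
      'https://<node-host>:<port>/nifi-api/processors/<processor-id>/diagnostics'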

On Sun, Dec 23, 2018 at 6:28 PM dan young <danoyoung@gmail.com> wrote:
I forgot to mention that we're using OpenID Connect SSO.  Is there a way to run these
commands via curl when we have the cluster configured this way?  If so, would anyone be able
to provide some insight/examples?

Happy Holidays!

Regards,

Dano

On Sun, Dec 23, 2018 at 3:53 PM dan young <danoyoung@gmail.com> wrote:
This is what I'm seeing in the logs when I try to access nifi-api/flow/about, for example...

2018-12-23 22:51:45,579 INFO [NiFi Web Server-24201] o.a.n.w.s.NiFiAuthenticationFilter Authentication
success for dan@looker.com
2018-12-23 22:52:01,375 INFO [NiFi Web Server-24136] o.a.n.w.a.c.AccessDeniedExceptionMapper
identity[anonymous], groups[none] does not have permission to access the requested resource.
Unknown user with identity 'anonymous'. Returning Unauthorized response.

On Sun, Dec 23, 2018 at 3:50 PM dan young <danoyoung@gmail.com> wrote:
Hello Mark,

I have a queue with a "stuck/phantom" flowfile again.  When I try to call nifi-api/processors/<processor-id>/diagnostics
against a processor in the UI after I authenticate, I get an "Unknown user with identity 'anonymous'.
Contact the system administrator." error. We're running a secure 3-node cluster. I tried this via
the browser and also via the command line with curl on one of the nodes. One clarification
point: which processor id should I be trying to gather the diagnostics on? The queue is
between two process groups.

Maybe the issue with the Unknown User has to do with some policy I don't have set correctly?

Happy Holidays!

Regards,
Dano




On Wed, Dec 19, 2018 at 6:51 AM Mark Payne <markap14@hotmail.com> wrote:
Hey Josef, Dano,

Firstly, let me assure you that while I may be the only one from the NiFi side who's been
engaging on debugging
this, I am far from the only one who cares about it! :) This is a pretty big new feature that
was added to the latest
release, so understandably there are probably not yet a lot of people who understand the code
well enough to
debug. I have tried replicating the issue, but have not been successful. I have a 3-node cluster
that ran for well over
a month without a restart, and I've also tried restarting it every few hours for a couple
of days. It has about 8 different
load-balanced connections, with varying data sizes and volumes. I've not been able to get
into this situation, though,
unfortunately.

But yes, I think that we've seen this issue arise from each of the two of you and one other
on the mailing list, so it
is certainly something that we need to nail down ASAP. Unfortunately, an issue that involves
communication between multiple nodes is often difficult to fully understand, so it may not be
a trivial task to debug.

Dano, if you are able to get to the diagnostics, as Josef mentioned, that is likely to be
pretty helpful. Off the top of my head,
there are a few possibilities that are coming to mind, as to what kind of bug could cause
such behavior:

1) Perhaps there really is no flowfile in the queue, but we somehow miscalculated the size
of the queue. The diagnostics
info would tell us whether or not this is the case. It will look into the queues themselves
to determine how many FlowFiles are
destined for each node in the cluster, rather than just returning the pre-calculated count.
Failing that, you could also stop the source
and destination of the queue, restart the node, and then see if the FlowFile is entirely gone
from the queue on restart, or if it remains
in the queue. If it is gone, then that likely indicates that the pre-computed count is somehow
off.

2) We are having trouble communicating with the node that we are trying to send the data to.
I would expect some sort of ERROR
log messages in this case.

3) The node is properly sending the FlowFile to where it needs to go, but for some reason
the receiving node is then re-distributing it
to another node in the cluster, which then re-distributes it again, so that it never ends up
in the correct destination. I think this is unlikely
and would be easy to verify by looking at the "Summary" table [1] and doing the "Cluster view"
and constantly refreshing for a few seconds
to see if the queue changes on any node in the cluster.

4) For some entirely unknown reason, there exists a bug that causes the node to simply see
the FlowFile and just skip over it
entirely.

For additional logging, we can enable DEBUG logging on
org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask by adding
the following to conf/logback.xml:

    <logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />

With that DEBUG logging turned on, it may or may not generate a lot of DEBUG logs. If it does
not, then that in and of itself tells us something.
If it does generate a lot of DEBUG logs, then it would be good to see what it's dumping out
in the logs.
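
If it helps, with NiFi's default logback configuration (which is re-scanned periodically, so a restart shouldn't be needed) that logger's output lands in the application log, and you can watch for the new messages with something like the following - the log path is the default and may differ in your install:

    tail -f logs/nifi-app.log | grep NioAsyncLoadBalanceClientTask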

And a big Thank You to you guys for staying engaged on this and your willingness to dig in!

Thanks!
-Mark

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page


On Dec 19, 2018, at 2:18 AM, <Josef.Zahner1@swisscom.com> wrote:

Hi Dano

Seems that the problem has been seen by a few people, but until now nobody from the NiFi team
really cared about it - except Mark Payne. He mentioned the part below with the diagnostics;
however, in my case this doesn't even work (I tried it on a standalone unsecured cluster as well
as on a secured cluster)! Can you get the diagnostics on your cluster?

I guess in the end we will have to open a Jira ticket to narrow it down.

Cheers Josef


One thing that I would recommend, to get more information, is to go to the REST endpoint (in
your browser is fine)
/nifi-api/processors/<processor id>/diagnostics

Where <processor id> is the UUID of either the source or the destination of the Connection
in question. This gives us a lot of information about the internals of the Connection. The
easiest way to get that Processor
ID is to just click on the
processor on the canvas and look at the Operate palette on the left-hand side. You can copy
& paste from there. If you
then send the diagnostics information to us, we can analyze that to help understand what's
happening.



From: dan young <danoyoung@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Wednesday, 19 December 2018 at 05:28
To: NiFi Mailing List <users@nifi.apache.org>
Subject: flowfiles stuck in load balanced queue; nifi 1.8

We're seeing this more frequently where flowfiles seem to be stuck in a load balanced queue.
 The only resolution is to disconnect the node and then restart that node.  After this, the
flowfile disappears from the queue.  Any ideas on what might be going on here or what additional
information I might be able to provide to debug this?

I've attached another thread dump and some screen shots....


Regards,

Dano

--
Sent from Gmail Mobile
<Screen Shot 2018-12-24 at 9.12.31 AM.png>
<diag.json>
<conn.json>

<Screen Shot 2019-01-17 at 4.45.51 PM.png>
<Screen Shot 2019-01-17 at 4.46.06 PM.png>