nifi-users mailing list archives

From Mark Payne <marka...@hotmail.com>
Subject Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files
Date Fri, 04 Jan 2019 16:54:16 GMT
Josef,

OK, thanks for confirming. My suspicion is that the Load-Balancing bug is what is biting you, and that when you tried to replicate with the CompressContent in a simple case, you may have just been experiencing the "cleanup lag" related to the way that the repositories interact with one another.

Custom Processors should not be an issue. You should not be able to cause any FlowFile to stay in the Repository.
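For reference, the session contract the framework enforces looks roughly like this (a sketch against the NiFi processor API; the class and relationship names here are illustrative, not from any particular processor). The key point is that every FlowFile obtained from the session must be transferred, removed, or rolled back before onTrigger returns — that is what releases its claim on the content:

```java
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class MyCustomProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();
    static final Relationship REL_FAILURE = new Relationship.Builder().name("failure").build();

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        try {
            // ... read/modify content via session.read()/session.write() ...

            // Every FlowFile taken from the session must be transferred (or
            // removed) before onTrigger returns; this releases the claim on
            // its content so the repo can reclaim it at the next checkpoint.
            session.transfer(flowFile, REL_SUCCESS);
        } catch (final ProcessException e) {
            // Route to failure (or call session.remove()/session.rollback());
            // never simply drop the reference and return.
            session.transfer(flowFile, REL_FAILURE);
        }
    }
}
```

The framework commits or rolls back the session around onTrigger, so as long as each FlowFile ends in a transfer or remove, content claims cannot leak.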

Thanks
-Mark


On Jan 4, 2019, at 11:48 AM, Josef.Zahner1@swisscom.com wrote:

Mark,

Yes, we are using the Load Balancing capability, and we do that right after the ListSFTP processor, so yes, we load-balance 0-byte files. It seems that we are probably facing your bug here.

Thanks a lot for explaining in detail what happens regarding the flowfile/content repo in NiFi.

Additionally, we have several custom processors; could it be that one of them is causing it? Can someone share a (Java) code snippet which ensures that a custom processor doesn’t keep FlowFiles in the content repo?

Cheers Josef

From: Mark Payne <markap14@hotmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 14:48
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

Josef,

Thanks for the info! There are a few things to consider here. Firstly, you said that you are using NiFi 1.8.0. Are you using the new Load Balancing capability? I.e., do you have any Connections configured to balance load across your cluster? And if so, are you load-balancing any 0-byte files? If so, then you may be getting bitten by [1]. That can result in data staying in the Content Repo and not getting cleaned up until restart.

The second thing that is important to consider is the interaction between the FlowFile Repository and the Content Repository. At a high level, the Content Repository stores the FlowFiles' content/payload. The FlowFile Repository stores the FlowFiles' attributes, which queue each one is in, and some other metadata. Once a FlowFile completes its processing and is no longer part of the flow, we cannot simply delete the content claim from the Content Repository. If we did so, we could have a condition where the node is restarted and the FlowFile Repository has not yet been fully flushed to disk (NiFi may have already written to the file, but the Operating System may be caching that without having flushed/"fsync'ed" to disk). In such a case, we want the transaction to be "rolled back" and reprocessed. So, if we deleted the Content Claim from the Content Repository immediately when it is no longer needed, and then restarted, we could have a case where the FlowFile repo wasn't flushed to disk and as a result points to a Content Claim that has been deleted, and this would result in data loss.

So, to avoid the above scenario, what we do instead is keep track of how many "claims" there are for a Content Claim and then, when the FlowFile repo performs a checkpoint (every 2 minutes by default), we go through and delete any Content Claims that have a claim count of 0. This means that any Content Claim that has been accessed in the past 2 minutes (or however long the checkpoint time is) will be considered "active" and will not be cleaned up.
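The claim-count-plus-checkpoint behavior described above can be sketched as a few lines of self-contained Java. This is a toy model, not NiFi's actual implementation — `ClaimTracker`, `addClaimant`, and the other names are illustrative:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Toy model of deferred content cleanup: content claims are reference-counted,
// and claims whose count has dropped to 0 are only deleted at checkpoint time,
// never immediately when the count hits 0.
public class ClaimTracker {
    private final Map<String, Integer> claimCounts = new HashMap<>();

    public void addClaimant(String claimId) {
        claimCounts.merge(claimId, 1, Integer::sum);
    }

    public void removeClaimant(String claimId) {
        claimCounts.merge(claimId, -1, Integer::sum);
        // NOTE: no deletion here -- the content stays on disk until checkpoint.
    }

    // Called when the FlowFile repo checkpoints: only now is it safe to delete
    // unreferenced content, because the repo state that "forgot" the claim has
    // been durably flushed. Returns the number of claims reclaimed.
    public int checkpoint() {
        int removed = 0;
        Iterator<Map.Entry<String, Integer>> it = claimCounts.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= 0) {
                it.remove(); // this is where the backing file would be deleted
                removed++;
            }
        }
        return removed;
    }

    public boolean isHeld(String claimId) {
        return claimCounts.containsKey(claimId);
    }

    public static void main(String[] args) {
        ClaimTracker tracker = new ClaimTracker();
        tracker.addClaimant("claim-1");
        tracker.removeClaimant("claim-1");             // FlowFile finished processing
        System.out.println(tracker.isHeld("claim-1")); // true: content still on disk
        tracker.checkpoint();                          // FlowFile repo checkpointed
        System.out.println(tracker.isHeld("claim-1")); // false: content reclaimed
    }
}
```

This is why a content claim can linger on disk for up to one checkpoint interval after its last FlowFile finishes, which looks like "cleanup lag" when watching `du` on the repo.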

I hope this helps to explain some of the behavior, but if not, let's please investigate further!

Thanks
-Mark



[1] https://issues.apache.org/jira/browse/NIFI-5771



On Jan 4, 2019, at 7:41 AM, Josef.Zahner1@swisscom.com wrote:

Hi Joe

We use NiFi 1.8.0. Yes, we have a different partition for each repo; you can see the partitions below.

[nifi@nifi-12 ~]$ df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/disk1-root             100G  2.0G   99G   2% /
devtmpfs                           126G     0  126G   0% /dev
tmpfs                              126G     0  126G   0% /dev/shm
tmpfs                              126G  3.1G  123G   3% /run
tmpfs                              126G     0  126G   0% /sys/fs/cgroup
/dev/sda1                         1014M  188M  827M  19% /boot
/dev/mapper/disk1-home              30G   34M   30G   1% /home
/dev/mapper/disk1-var              100G  1.1G   99G   2% /var
/dev/mapper/disk1-opt               50G  5.9G   45G  12% /opt
/dev/mapper/disk1-database_repo   1014M   35M  980M   4% /database_repo
/dev/mapper/disk1-provenance_repo  4.0G   33M  4.0G   1% /provenance_repo
/dev/mapper/disk1-flowfile_repo    530G   34M  530G   1% /flowfile_repo
/dev/mapper/disk2-content_repo     850G   64G  786G   8% /content_repo
tmpfs                               26G     0   26G   0% /run/user/2000


Cheers Josef


From: Joe Witt <joe.witt@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 13:29
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

Josef

Not looping for that proc for sure makes sense. NiFi dying in the middle of a process/transaction is no problem... it will restart the transaction.

But we do need to find out what is filling the repo. You have flowfile, content, and prov in different disk volumes or partitions, right? What version of NiFi?

Let's definitely figure this out. You should see clean behavior of the repos and you should never have to restart.

thanks

thanks

On Fri, Jan 4, 2019, 7:16 AM Mike Thomsen <mikerthomsen@gmail.com> wrote:
I agree with Pierre's take on the failure relationship. Corrupted compressed files are also going to be nearly impossible to recover in most cases, so your best bet is to simply log the file name and other relevant attributes and establish a process to notify the source system that they sent you corrupt data.
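The "detect corruption and report it rather than retry" idea can be shown with plain `java.util.zip`, outside NiFi (a self-contained sketch; `decompressOrNull` is an illustrative name). A full decompression pass is the only reliable corruption check, since the CRC trailer is only verified at end of stream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCheck {
    // Try to fully decompress; return null if the stream is corrupt.
    // The caller would then log the filename/attributes and quarantine
    // the file instead of retrying.
    public static byte[] decompressOrNull(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // Bad header, bad deflate data, or CRC/length trailer mismatch.
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a valid gzip payload, then corrupt its trailer.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzOut = new GZIPOutputStream(bos)) {
            gzOut.write("hello".getBytes());
        }
        byte[] good = bos.toByteArray();
        byte[] corrupt = good.clone();
        corrupt[corrupt.length - 3] ^= 0x55; // flip bits in the trailer

        System.out.println(decompressOrNull(good) != null);    // true
        System.out.println(decompressOrNull(corrupt) == null); // true
    }
}
```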

On Fri, Jan 4, 2019 at 6:48 AM, Josef.Zahner1@swisscom.com wrote:
Hi Arpad

I’m doing it (hopefully) gracefully:

  *   /opt/nifi/bin/nifi.sh stop
  *   /opt/nifi/bin/nifi.sh restart


But what I see in the case of our cluster is that it takes more than a few seconds until the service is stopped. I have never checked the log during shutdown; I guess I should do that to check whether it was really graceful or not. Do you mean that it depends on the shutdown into which queue the file goes?


“What still doesn’t make sense for me is why NiFi doesn’t release the content_repo disk space after a failure without archiving enabled?”

It gives you the possibility to roll back failed operations.

--> Sure, yes. So does this mean that we can’t get rid of content processed by a “failure” queue until we do a NiFi restart? And to go further: if we weren’t using the VolatileProvenanceRepository, would it be impossible to get rid of it at all until we deleted the files manually via the command line?

Cheers Josef


From: Arpad Boda <aboda@hortonworks.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 12:14
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

“but maybe I’m wrong and it goes back to the success queue before the “CompressContent” processor?”

How do you shut it down? Some graceful way, or do you just kill it?

“What still doesn’t make sense for me is why NiFi doesn’t release the content_repo disk space after a failure without archiving enabled?”

It gives you the possibility to roll back failed operations.

From: Josef.Zahner1@swisscom.com
Reply-To: users@nifi.apache.org
Date: Friday, 4 January 2019 at 12:10
To: users@nifi.apache.org
Subject: Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

Hi Pierre

Thanks for your feedback. You are right, it doesn’t make much sense to connect the “failure” relationship as a loop on this specific processor. We have to restart NiFi regularly, so my thought was that if we have to decompress a huge file and NiFi does a hard restart while processing the file, it goes into the failure queue – but maybe I’m wrong and it goes back to the success queue before the “CompressContent” processor? That’s the only reason why I’ve connected “failure” as a loop…

Thanks for the tip regarding the failure relationship; we will do what you suggested.

What still doesn’t make sense for me is why NiFi doesn’t release the content_repo disk space after a failure without archiving enabled?

Cheers Josef

From: Pierre Villard <pierre.villard.fr@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 11:50
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

Hi Josef,

I don't think it's a good idea to use the failure relationship as a self-loop on the processor. If the decompression failed, it's *very* likely that it will fail again and again. Usually, when developing a processor, the best practice is to have a 'retry' relationship to handle errors that could be resolved a few seconds later. You have such a relationship on a few processors.

The failure relationship gives you the possibility to handle errors the way you want. For instance, in your case, you could move the file to a 'quarantine' folder and send an email to notify that such an error occurred, so that it can be processed manually if needed.
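In NiFi terms the quarantine pattern would typically be failure routed to PutFile (and PutEmail for the notification); stripped of the NiFi machinery, the quarantine step itself is just a move into a side directory. A minimal self-contained sketch, with illustrative names:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class Quarantine {
    // Move a corrupt file into a quarantine directory, keeping its filename
    // so the source system can be told exactly which file was bad.
    public static Path quarantine(Path file, Path quarantineDir) throws IOException {
        Files.createDirectories(quarantineDir);
        Path target = quarantineDir.resolve(file.getFileName());
        return Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("incoming");
        Path bad = Files.writeString(dir.resolve("data.gz"), "corrupt");
        Path moved = quarantine(bad, dir.resolve("quarantine"));
        System.out.println(Files.exists(moved)); // true: file is quarantined
        System.out.println(Files.exists(bad));   // false: removed from the flow
    }
}
```

The important property is that the bad file leaves the flow entirely, so nothing upstream keeps retrying it.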

What we could do is penalize the flow file when it goes to the failure relationship, but that's not necessarily a good idea when the relationship is not a self-loop.

Hope it makes sense.
Pierre


On Fri, Jan 4, 2019 at 10:21 AM, Josef.Zahner1@swisscom.com wrote:
Hi guys

It has already happened twice that our production 8-node cluster, with 800 GB of storage per node, ran out of disk space for the content_repository – in the end everything stopped working within a few minutes, which under normal circumstances isn’t possible, as at peak we process no more than 40 GB/5 min. So I investigated and found that the culprit was the NiFi “Compress Content” processor, which we use to decompress GZIP files, together with a few small (a few MBs) corrupt GZIPs. After a restart of NiFi the whole content_repository was emptied again (I had deleted the corrupt GZIPs from the queue before the restart).

Today I made a small test with the “Compress Content” processor on a standalone NiFi VM. I used a corrupt 10 MB GZIP file and let NiFi decompress it; at the same time I observed the size of the content_repository, and I was shocked how fast the disk space was eaten up – it took 8 s to generate 3 GB of content_repository space from this 10 MB file…! To free up the space I had to restart NiFi.

For us this is a major issue, as in the case of a corrupt GZIP, NiFi stops really fast due to the lack of disk space. The workaround for now is to connect the “failure” relationship to another terminated processor (or terminate it directly on the “Compress Content” processor). However, that’s not the intended use of the “failure” connection under normal circumstances.

I don’t think this is normal behavior, right? Why does the NiFi processor not free up the content_repo in case of a failure in the decompression of a file?

Btw., we use “org.apache.nifi.provenance.VolatileProvenanceRepository” and we have “nifi.content.repository.archive.enabled” disabled.

Below some additional outputs/pictures to explain everything.

Cheers Josef





nifi.properties
nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=false


NiFi Canvas
<image001.png>


Disk Consumption content_repo
[user@nifi nifi]$ for i in {1..10}; do du -sh * | grep content; date ; sleep 2; done
82M     content_repository
Fri Jan  4 09:35:00 CET 2019
82M     content_repository
Fri Jan  4 09:35:02 CET 2019   -> Start “Compress Content”
532M    content_repository
Fri Jan  4 09:35:04 CET 2019
1.2G    content_repository
Fri Jan  4 09:35:06 CET 2019
1.8G    content_repository
Fri Jan  4 09:35:08 CET 2019
2.5G    content_repository
Fri Jan  4 09:35:10 CET 2019
3.2G    content_repository
Fri Jan  4 09:35:12 CET 2019
3.5G    content_repository
Fri Jan  4 09:35:14 CET 2019   -> Stop “Compress Content”
3.5G    content_repository
Fri Jan  4 09:35:16 CET 2019
3.5G    content_repository
Fri Jan  4 09:35:18 CET 2019


NiFi Error Message
CompressContent[id=01681002-a64b-17f3-7f52-3e6a2ff7bc02] Unable to decompress StandardFlowFileRecord[uuid=578b1e61-914a-4b4c-9b82-82de4cb51265,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1546590880942-1, container=default, section=1], offset=0, length=11471455],offset=0,name=name_dhcp_1_log.1545816464.gz,size=11471455] using gzip compression format due to IOException thrown from CompressContent[id=01681002-a64b-17f3-7f52-3e6a2ff7bc02]: java.io.IOException: Gzip-compressed data is corrupt; routing to failure: org.apache.nifi.processor.exception.ProcessException: IOException thrown from CompressContent[id=01681002-a64b-17f3-7f52-3e6a2ff7bc02]: java.io.IOException: Gzip-compressed data is corrupt
