nifi-users mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: stream one large file, only once
Date Mon, 14 Nov 2016 13:23:58 GMT
The pattern you want for this is:

1) GetFile or (ListFile + FetchFile)
2) RouteText
3) PublishKafka
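Under stated assumptions, the three-step flow above behaves roughly like this stream-oriented sketch, with the Kafka publish step stubbed out as a callback (`topic_for` and `publish` are illustrative names, not NiFi APIs):

```python
# Hedged sketch of the GetFile/FetchFile -> RouteText -> PublishKafka flow.
# The file is consumed as a stream, line by line, so it is read exactly once
# and is never held whole in memory.
import tempfile

def run_flow(path, topic_for, publish):
    """topic_for(line) -> topic name; publish(topic, line) sends one record."""
    with open(path) as f:              # fetch: open a stream, not the contents
        for line in f:                 # route: one decision per line
            publish(topic_for(line), line.rstrip("\n"))

# Usage: route ERROR lines to one topic, everything else to another.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("ERROR disk full\nINFO all good\n")
sent = []
run_flow(tmp.name,
         lambda line: "errors" if line.startswith("ERROR") else "other",
         lambda topic, line: sent.append((topic, line)))
```

In the real flow, `publish` corresponds to PublishKafka and the routing predicate to a RouteText property; only the streaming shape is the point here.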

As Andrew points out, GetFile and FetchFile do *not* read the file
contents into memory.  The whole point of NiFi's design is to take
advantage of the content repository rather than forcing components to
hold things in memory.  Components can elect to hold content in memory,
but they don't have to: the repository allows reading from and writing
to streams, all within a unit-of-work transactional model.  There is a
lot more to say on that topic, but you can see a good bit about it in
the docs.
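A minimal sketch of that idea (not NiFi's actual API): content lives in a repository and components touch it only through streams, so a transform holds at most one chunk in memory at a time, whatever the payload size.

```python
# Illustrative stream-to-stream transform: only chunk_size bytes are
# resident at any moment, regardless of how large the input is.
import io

def transform_stream(src, dst, chunk_size=8192):
    """Copy src to dst in fixed-size chunks, transforming each chunk."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk.upper())       # example per-chunk transformation

src = io.BytesIO(b"log data read as a stream, never loaded whole")
dst = io.BytesIO()
transform_stream(src, dst)
```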

RouteText is the way to avoid the SplitText memory scenario, where
there are so many lines that even holding pointers/metadata about
those lines becomes problematic.  You can also do as Andrew suggests
and split in chunks, which works well too.  RouteText will likely
yield higher overall performance, though, if it works for your case.
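The chunked-split alternative can be sketched as two passes (the processor names and sizes below are illustrative, mirroring the two-SplitText idea): first cut the stream into fixed-size chunks, then split each chunk into lines, so no single step tracks metadata for every line at once.

```python
# Two-pass split sketch: peak per-step bookkeeping is bounded by chunk_size,
# not by the total number of lines in the file.
def split_chunks(lines, chunk_size):
    """First pass: yield lists of up to chunk_size lines."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def split_lines(chunks):
    """Second pass: yield each chunk's lines individually."""
    for chunk in chunks:
        yield from chunk

lines = [f"line {i}" for i in range(7)]
chunks = list(split_chunks(iter(lines), chunk_size=3))
flat = list(split_lines(chunks))
```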

Thanks
Joe

On Mon, Nov 14, 2016 at 8:11 AM, Andrew Grande <aperepel@gmail.com> wrote:
> Neither GetFile nor FetchFile reads the file into memory; they only deal with
> the file handle and pass the contents via a handle to the content repository
> (NiFi writes data in as a stream and reads it back as a stream).
>
> What you will face, however, is an issue with SplitText when you try to
> split the file in one transaction. This might fail depending on the JVM heap
> allocated and the file size. A recommended best practice in this case is to
> introduce a series of two SplitText processors: the first pass splits into,
> e.g., 10,000-line chunks, and the second splits those into individual lines.
> Adjust for your expected file sizes and available memory.
>
> HTH,
> Andrew
>
> On Mon, Nov 14, 2016 at 7:23 AM Raf Huys <raf.huys@gmail.com> wrote:
>>
>> I would like to read in a large (several gigs) log file and route every
>> line to a (potentially different) Kafka topic.
>>
>> - I don't want this file to be in memory
>> - I want it to be read once, not more
>>
>> Using `GetFile` takes the whole file into memory. Same with `FetchFile`, as
>> far as I can see.
>>
>> I also used an `ExecuteProcess` processor in which the file is `cat`ed and
>> which splits off a flowfile every millisecond. This looked like a somewhat
>> streaming approach to the problem, but this processor runs continuously (or
>> cron-based), and as a consequence the logfile is re-injected all the time.
>>
>> What's the typical NiFi approach for this? Tx
>>
>> Raf Huys
