Many thanks, Bryan, for the quick feedback. I will look into these options.

On Tue, Aug 6, 2019 at 9:24 AM Bryan Bende <bbende@gmail.com> wrote:
Ok makes sense, there are basically two options to make it efficient...

A) You can use ListenSyslog with batching, followed by ValidateRecord
with one of the syslog record readers [1][2].

B) You can use ListenTCPRecord with a syslog record reader.

A will probably work better for a larger number of TCP connections, B
would work better for a smaller number of connections.

One challenge with both of them is that there isn't a syslog record
writer, so you would probably have to use the
FreeFormTextRecordSetWriter [3] with some expression that rewrites the
message using the record fields, like "${hostname} ${body}" if you
wanted to rewrite each message with the hostname and body.
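
To make the intended effect of that expression concrete, here is a rough Python sketch of the validate-then-rewrite step. It is only an illustration, not how NiFi does it: the regex is a simplified RFC 3164-style pattern I made up for the example, `rewrite_batch` is a hypothetical helper, and Python's `string.Template` merely happens to share the `${name}` placeholder syntax with the expression above. Only the field names `hostname` and `body` come from the suggestion itself.

```python
import re
from string import Template

# Simplified RFC 3164-style pattern; an assumption for illustration,
# not the grammar NiFi's syslog record readers use.
SYSLOG_RE = re.compile(
    r"^<(?P<priority>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<hostname>\S+) "
    r"(?P<body>.*)$"
)

# Same placeholder text as the expression suggested above.
LINE_TEMPLATE = Template("${hostname} ${body}")


def rewrite_batch(raw_messages):
    """Split a batch into (rewritten, invalid).

    Messages that match the pattern are rendered through the template;
    messages that do not match are returned unchanged in `invalid`.
    """
    rewritten, invalid = [], []
    for raw in raw_messages:
        match = SYSLOG_RE.match(raw)
        if match is None:
            invalid.append(raw)
        else:
            # Extra fields (priority, timestamp) are simply ignored.
            rewritten.append(LINE_TEMPLATE.substitute(match.groupdict()))
    return rewritten, invalid
```

The point of the sketch is that a whole batch is handled in one pass, with invalid messages separated out and each valid message rewritten from its parsed fields, which is roughly the role ValidateRecord plus a free-form writer would play in the flow.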

[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.9.2/org.apache.nifi.syslog.SyslogReader/index.html
[2] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.9.2/org.apache.nifi.syslog.Syslog5424Reader/index.html
[3] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.9.2/org.apache.nifi.text.FreeFormTextRecordSetWriter/index.html

On Tue, Aug 6, 2019 at 10:08 AM Clay Teahouse <clayteahouse@gmail.com> wrote:
>
> Hello Bryan,
>
> I am ingesting millions of syslog records from various data sources. I need to make sure the format is valid and then prefix each message with the host name (from syslog header) and some other meta data and push the records to various consumers.
>
> thanks
> Clay
>
> On Tue, Aug 6, 2019 at 6:26 AM Bryan Bende <bbende@gmail.com> wrote:
>>
>> Can you describe what you want to do with each message?
>>
>> Right now I’m not following why you need to parse them.
>>
>> On Tue, Aug 6, 2019 at 6:40 AM Clay Teahouse <clayteahouse@gmail.com> wrote:
>>>
>>> Bryan,
>>> Understood, but wouldn't this processor then be inefficient when dealing with a very large number of syslog messages, if you don't have the batching option? I suppose we could have had the option of parsing each syslog record in a batch and then writing the syslog message along with the syslog headers to the flowfile content.
>>> thanks
>>> Clay
>>>
>>> On Mon, Aug 5, 2019 at 12:12 PM Bryan Bende <bbende@gmail.com> wrote:
>>>>
>>>> Clay,
>>>>
>>>> You can only parse when it's 1 message per flow file because parsing
>>>> adds all the field/value pairs as flow file attributes, which wouldn't
>>>> really make sense when you have say 1k messages with all different
>>>> values for those fields.
>>>>
>>>> -Bryan
>>>>
>>>> On Mon, Aug 5, 2019 at 11:25 AM Clay Teahouse <clayteahouse@gmail.com> wrote:
>>>> >
>>>> > Hi Edward, Bryan
>>>> > One more question regarding ListenSyslog. Is it possible to set batch size > 1 with parse set to true? I am ingesting a very high volume of syslog records and want to avoid flowfiles containing only one record but at the same time, I want to be able to parse the records. Is there a way around this?
>>>> >
>>>> > thanks
>>>> > Clay
>>>> >
>>>> > On Fri, Aug 2, 2019 at 8:50 AM Edward Armes <edward.armes@gmail.com> wrote:
>>>> >>
>>>> >> Hi Clay,
>>>> >>
>>>> >> So as Bryan has said, the actual connections are managed by a selector, which simply goes through each connection. Once a connection has data to receive, the selector hands it over to a thread in the TCP receiving thread pool, which does some basic TCP processing and puts the data into a buffer for the associated ListenSyslog processor instance to process when the framework next executes that processor.
>>>> >>
>>>> >> Just so you're aware, setting the maximum number of connections does create a thread pool of 4,000 threads, but in reality these threads don't exist until the selector creates one to run on the pool. So in short, unless a single NiFi server gets 4,000 syslog messages in a very short space of time (< 1 microsecond), I can't see it being an issue.
>>>> >>
>>>> >> Edward
>>>> >>
>>>> >> On Fri, Aug 2, 2019 at 2:06 PM Bryan Bende <bbende@gmail.com> wrote:
>>>> >>>
>>>> >>> The actual connections themselves are managed with a selector, so if
>>>> >>> all the connections are idle there should only be one thread for the
>>>> >>> socket.
>>>> >>>
>>>> >>> As soon as a connection has something available to read then a thread
>>>> >>> is spawned to start reading the connection until either no more data is
>>>> >>> available, or it is closed.
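
The selector-plus-thread-pool pattern described in the quoted message can be sketched generically as follows. This is a minimal Python illustration of the pattern only, not NiFi's implementation (NiFi is Java); `serve`, `drain`, `rearm`, and `on_data` are hypothetical names, and the pool size and timeout are arbitrary. One thread watches every socket, and pool threads are used only for connections that currently have data.

```python
import queue
import selectors
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


def serve(server_sock, on_data, stop, max_workers=4):
    """Watch all sockets from one thread; read ready ones on a pool."""
    sel = selectors.DefaultSelector()
    rearm = queue.Queue()  # drained connections waiting to be watched again
    server_sock.setblocking(False)
    sel.register(server_sock, selectors.EVENT_READ)

    def drain(conn):
        # Read until no more data is available or the connection closes.
        while True:
            try:
                chunk = conn.recv(4096)
            except BlockingIOError:
                rearm.put(conn)  # idle again: hand back to the selector
                return
            except OSError:
                conn.close()
                return
            if not chunk:  # peer closed the connection
                conn.close()
                return
            on_data(chunk)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while not stop.is_set():
            # Re-register connections that workers have finished draining.
            while True:
                try:
                    sel.register(rearm.get_nowait(), selectors.EVENT_READ)
                except queue.Empty:
                    break
            for key, _events in sel.select(timeout=0.05):
                if key.fileobj is server_sock:
                    try:
                        conn, _addr = server_sock.accept()
                    except BlockingIOError:
                        continue
                    conn.setblocking(False)
                    sel.register(conn, selectors.EVENT_READ)
                else:
                    # Stop watching while a worker owns the connection.
                    sel.unregister(key.fileobj)
                    pool.submit(drain, key.fileobj)

    # Shutdown: close whatever is still open, but leave the listener
    # to the caller.
    while not rearm.empty():
        rearm.get().close()
    for key in list(sel.get_map().values()):
        if key.fileobj is not server_sock:
            key.fileobj.close()
    sel.close()
```

In this sketch an idle connection costs a selector registration rather than a thread, so thousands of mostly idle connections only occupy pool threads while they actually have data to read.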
>>>> >>>
>>>> >>> On Fri, Aug 2, 2019 at 7:18 AM Clay Teahouse <clayteahouse@gmail.com> wrote:
>>>> >>> >
>>>> >>> > Hello Edward,
>>>> >>> > So, if I have to listen to 32,000 TCP connections and I have only 80 cores, and I configure each ListenSyslog instance for 4,000 connections, doesn't each spawn 4,000 threads behind the scenes? The TCP connections will be idle most of the time.
>>>> >>> >
>>>> >>> > thanks
>>>> >>> > Clay
>>>> >>> >
>>>> >>> >
>>>> >>> > On Fri, Aug 2, 2019 at 6:10 AM Edward Armes <edward.armes@gmail.com> wrote:
>>>> >>> >>
>>>> >>> >> Hi Clay,
>>>> >>> >>
>>>> >>> >> Because NiFi uses a thread pool for its own threading underneath, and each processor instance runs in its own thread, I don't see any reason why not. One thing to note is that the ListenTCP processor appears to have been written such that it gets all the requests that have been received on that socket and processes them until either it has no more requests left to process or that instance of the processor is no longer scheduled to run.
>>>> >>> >>
>>>> >>> >> Hope that helps
>>>> >>> >>
>>>> >>> >> Edward
>>>> >>> >>
>>>> >>> >> On Fri, Aug 2, 2019 at 11:28 AM Clay Teahouse <clayteahouse@gmail.com> wrote:
>>>> >>> >>>
>>>> >>> >>> Hello All,
>>>> >>> >>>
>>>> >>> >>> I need to listen to and process thousands of persistent TCP connections. I have 10 nodes, each having 8 cores.
>>>> >>> >>> My understanding is that with existing NiFi listening processors, such as ListenSyslog, a thread is utilized for each TCP connection. Does this scale? Do I need to write a custom processor that utilizes a thread pool for reading the data from the socket and processing it?
>>>> >>> >>>
>>>> >>> >>> thanks
>>>> >>> >>> Clay