nifi-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bryan Bende <bbe...@gmail.com>
Subject Re: Spark or custom processor?
Date Thu, 02 Jun 2016 14:15:13 GMT
It will be ok to have PutSyslog running on every node, because each node
will have a portion of the log messages, either through load balancer in
front of NiFi, or by doing the RPG, each case divides the data across the
nodes so you won't get duplicates.

Ryan, Great point about the rebind capability! I was just reading some
posts about that.

On Thu, Jun 2, 2016 at 10:11 AM, Ryan Ward <ryan.ward2@gmail.com> wrote:

> Conrad,
>
> Depending on the number of clients and endpoints you have you can load
> balance the TCP connections with haproxy it wouldn't load balance the data
> just the connections. If you are using rsyslog you can tell it to rebind
> every x number of messages to better load balance the data.
>
> http://blog.afkham.org/2014/10/tcp-load-balancing-with-haproxy.html
>
> Ryan
>
> On Thu, Jun 2, 2016 at 10:09 AM, Conrad Crampton <
> conrad.crampton@secdata.com> wrote:
>
>> Mark,
>>
>> Ah, never considered the ScanAttribute processor before – looks like I
>> could coerce it to work asis for my use case – with a few chained together
>> (and more likely routeonattribute processor) for all the criteria.
>>
>>
>>
>> Quick follow up question on PutSyslog though – as the flows run on all
>> nodes in cluster, for PutSyslog, do I also have to run on single node
>> otherwise doesn’t the putting of message get executed on all nodes (and
>> therefore I get duplicate syslog messages x number of nodes)?
>>
>>
>>
>> Thanks
>>
>> Conrad
>>
>>
>>
>> *From: *Mark Payne <markap14@hotmail.com>
>> *Reply-To: *"users@nifi.apache.org" <users@nifi.apache.org>
>> *Date: *Thursday, 2 June 2016 at 14:58
>>
>> *To: *"users@nifi.apache.org" <users@nifi.apache.org>
>> *Subject: *Re: Spark or custom processor?
>>
>>
>>
>> Conrad,
>>
>>
>>
>> Excellent - I think this is a great use case, as well. This is similar to
>> the enrichment case, as you are operating
>>
>> on each piece of data in conjunction with some 'reference dataset' (bad
>> domains, etc.) which would likely be
>>
>> some file, etc. that is configured in the Processor. This is actually
>> similar to the ScanContent / ScanAttribute
>>
>> Processors I think. You may want to review those for 'inspiration' for
>> your processor.
>>
>>
>>
>> At a high level, the way that they work is that they are configured with
>> a file that is a dictionary of terms to look
>>
>> for in the FlowFile. As each FlowFile comes through, it checks if its
>> attributes (or content, depending on the processor)
>>
>> match any of the terms in the dictionary and routes each FlowFile to
>> either 'matched' or 'unmatched'. The Processor
>>
>> will periodically check the dataset file and reload the dictionary if the
>> file has changed. Typically, GetSFTP or GetHTTP
>>
>> or something like that would be used to fetch new versions of the
>> dictionary and then PutFile would be used to write
>>
>> the file to a directory. This allows the Scan* processors not to have to
>> worry about fetching the data and allows the
>>
>> data to come from wherever.
>>
>>
>>
>> Hope this is helpful!
>>
>>
>>
>> Thanks
>>
>> -Mark
>>
>>
>>
>>
>>
>> On Jun 2, 2016, at 9:51 AM, Conrad Crampton <conrad.crampton@SecData.com>
>> wrote:
>>
>>
>>
>> Mark,
>>
>> A very helpful explanation and distinction on appropriate use for NiFi. I
>> think my particular use case currently (probably) falls into the Simple
>> Event Processing. I say ‘probably’ because I am bringing in some other data
>> to compare the data against (bad domains and maybe others), but certainly
>> isn’t doing anything clever at the moment in terms of windowing/
>> aggregation with previously seen data etc.
>>
>>
>>
>> Thanks for the advice, very helpful.
>>
>> Conrad
>>
>>
>>
>> *From: *Mark Payne <markap14@hotmail.com>
>> *Reply-To: *"users@nifi.apache.org" <users@nifi.apache.org>
>> *Date: *Thursday, 2 June 2016 at 14:42
>> *To: *"users@nifi.apache.org" <users@nifi.apache.org>
>> *Subject: *Re: Spark or custom processor?
>>
>>
>>
>> Conrad,
>>
>>
>>
>> Typically, the way that we like to think about using NiFi vs. something
>> like Spark or Storm is whether
>>
>> the processing is Simple Event Processing or Complex Event Processing.
>> Simple Event Processing
>>
>> encapsulates those tasks where you are able to operate on a single piece
>> of data by itself (or in correlation
>>
>> with an Enrichment Dataset). So tasks like enrichment, splitting, and
>> transformation are squarely within
>>
>> the wheelhouse of NiFi.
>>
>>
>>
>> When we talk about doing Complex Event Processing, we are generally
>> talking about either processing data
>>
>> from multiple streams together (think JOIN operations) or analyzing data
>> across time windows (think calculating
>>
>> norms, standard deviation, etc. over the last 30 minutes). The idea here
>> is to derive a single new "insight" from
>>
>> windows of data or joined streams of data - not to transform or enrich
>> individual pieces of data. For this, we would
>>
>> recommend something like Spark, Storm, Flink, etc.
>>
>>
>>
>> In terms of scalability, NiFi certainly was not designed to scale outward
>> in the way that Spark was. With Spark you
>>
>> may be scaling to thousands of nodes, but with NiFi you would get a
>> pretty poor user experience because each change
>>
>> in the UI must be replicated to all of those nodes. That being said, NiFi
>> does scale up very well to take full advantage
>>
>> of however much CPU and disks you have available. We typically see
>> processing of several terabytes of data per day
>>
>> on a single node, so we have generally not needed to scale out to
>> hundreds or thousands of nodes.
>>
>>
>>
>> I hope this helps to clarify when/where to use each one. If there are
>> things that are still unclear or if you have more
>>
>> questions, as always, don't hesitate to shoot another email!
>>
>>
>>
>> Thanks
>>
>> -Mark
>>
>>
>>
>>
>>
>> On Jun 2, 2016, at 9:28 AM, Conrad Crampton <conrad.crampton@SecData.com>
>> wrote:
>>
>>
>>
>> Hi,
>>
>> ListenSyslog (using the approach that is being discussed currently in
>> another thread – ListenSyslog running on primary node as RGP, all other
>> nodes connecting to the port that the RPG exposes).
>>
>> Various enrichment, routing on attributes etc. and finally into HDFS as
>> Avro.
>>
>> I want to branch off at an appropriate point in the flow and do some
>> further realtime analysis – got the output to port feeding to Spark process
>> working fine (notwithstanding the issue that you have been so kind to help
>> with previously with the SSLContext), just thinking about if this is most
>> appropriate solution.
>>
>>
>>
>> I have dabbled with a custom processor (for enriching url splitting/
>> enriching etc. – probably could have done with ExecuteScript processor in
>> hindsight) so am comfortable with going this route if that is deemed more
>> appropriate.
>>
>>
>>
>> Thanks
>>
>> Conrad
>>
>>
>>
>> *From: *Bryan Bende <bbende@gmail.com>
>> *Reply-To: *"users@nifi.apache.org" <users@nifi.apache.org>
>> *Date: *Thursday, 2 June 2016 at 13:12
>> *To: *"users@nifi.apache.org" <users@nifi.apache.org>
>> *Subject: *Re: Spark or custom processor?
>>
>>
>>
>> Conrad,
>>
>>
>>
>> I would think that you could do this all in NiFi.
>>
>>
>>
>> How do the log files come into NiFi? TailFile, ListenUDP/ListenTCP,
>> List+FetchFile?
>>
>>
>>
>> -Bryan
>>
>>
>>
>>
>>
>> On Thu, Jun 2, 2016 at 6:41 AM, Conrad Crampton <
>> conrad.crampton@secdata.com> wrote:
>>
>> Hi,
>>
>> Any advice on ‘best’ architectural approach whereby some processing
>> function has to be applied to every flow file in a dataflow with some
>> (possible) output based on flowfile content.
>>
>> e.g. inspect log files for specific ip then send message to syslog
>>
>>
>>
>> approach 1
>>
>> Spark
>>
>> Output port from NiFi -> Spark listens to that stream -> processes and
>> outputs accordingly
>>
>> Advantages – scale spark job on Yarn, decoupled (reusable) from NiFi
>>
>> Disadvantages – adds complexity, decoupled from NiFi.
>>
>>
>>
>> Approach 2
>>
>> NiFi
>>
>> Custom processor -> PutSyslog
>>
>> Advantages – reuse existing NiFi processors/ capability, obvious flow
>> (design intent)
>>
>> Disadvantages – scale??
>>
>>
>>
>> Any comments/ advice/ experience of either approaches?
>>
>>
>>
>> Thanks
>>
>> Conrad
>>
>>
>>
>>
>>
>>
>>
>> SecureData, combating cyber threats
>>
>>
>> ------------------------------
>>
>> The information contained in this message or any of its attachments may
>> be privileged and confidential and intended for the exclusive use of the
>> intended recipient. If you are not the intended recipient any disclosure,
>> reproduction, distribution or other dissemination or use of this
>> communications is strictly prohibited. The views expressed in this email
>> are those of the individual and not necessarily of SecureData Europe Ltd.
>> Any prices quoted are only valid if followed up by a formal written quote.
>>
>> SecureData Europe Limited. Registered in England & Wales 04365896.
>> Registered Address: SecureData House, Hermitage Court, Hermitage Lane,
>> Maidstone, Kent, ME16 9NT
>>
>>
>>
>>
>>
>> ***This email originated outside SecureData***
>>
>> Click here
>> <https://www.mailcontrol.com/sr/0Yez0Z9rJiDGX2PQPOmvUr11KAWLA5a39FXrkhyyO4eQg2DXa9Xl!rwzg+4hlLPKdufvfzcRzpTaNxM9hG2QrA==>
>>  to report this email as spam.
>>
>>
>>
>>
>>
>
>

Mime
View raw message