nifi-users mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: Nifi cluster features - Questions
Date Sun, 10 Jan 2016 20:44:49 GMT
Chakri,

Would love to hear what you've learned and how that differed from the
docs themselves.  Site-to-site has proven difficult to set up, so we're
clearly not there yet in having the right operator/admin experience.

Thanks
Joe

On Sun, Jan 10, 2016 at 3:41 PM, Chakrader Dewaragatla
<Chakrader.Dewaragatla@lifelock.com> wrote:
> I was able to get site-to-site to work.
> I tried to follow your instructions to distribute data across the
> nodes.
>
> GenerateFlowFile (On Primary) —> RPG
> RPG —> Input Port   —> Putfile (Time driven scheduling)
>
> However, data is only written to one slave (the secondary slave). The primary
> slave has no data.
>
> Image screenshot :
> http://tinyurl.com/jjvjtmq
>
> From: Chakrader Dewaragatla <chakrader.dewaragatla@lifelock.com>
> Date: Sunday, January 10, 2016 at 11:26 AM
>
> To: "users@nifi.apache.org" <users@nifi.apache.org>
> Subject: Re: Nifi cluster features - Questions
>
> Bryan – Thanks – I am trying to setup site-to-site.
> I have two slaves and one NCM.
>
> My properties as follows :
>
> On both Slaves:
>
> nifi.remote.input.socket.port=10880
> nifi.remote.input.secure=false
>
> On NCM:
> nifi.remote.input.socket.port=10880
> nifi.remote.input.secure=false
>
> When I try to drop a remote process group (with http://<NCM IP>:8080/nifi), I
> see the following error for the two nodes.
>
> [<Slave1 ip>:8080] - Remote instance is not allowed for Site to Site
> communication
> [<Slave2 ip>:8080] - Remote instance is not allowed for Site to Site
> communication
>
> Do you have insight into why it is trying to connect to 8080 on the slaves?
> When does port 10880 come into the picture? I remember setting up site-to-site
> a few months back and succeeding.
>
> Thanks,
> -Chakri
>
>
>
> From: Bryan Bende <bbende@gmail.com>
> Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
> Date: Saturday, January 9, 2016 at 11:22 AM
> To: "users@nifi.apache.org" <users@nifi.apache.org>
> Subject: Re: Nifi cluster features - Questions
>
> The sending node (where the remote process group is) will distribute the
> data evenly across the two nodes, so an individual file will only be sent to
> one of the nodes. You could think of it as if a separate NiFi instance was
> sending directly to a two node cluster, it would be evenly distributing the
> data across the two nodes. In this case it just so happens to all be within
> the same cluster.
>
> The most common use case for this scenario is the List and Fetch processors
> like HDFS. You can perform the listing on primary node, and then distribute
> the results so the fetching takes place on all nodes.
>
> On Saturday, January 9, 2016, Chakrader Dewaragatla
> <Chakrader.Dewaragatla@lifelock.com> wrote:
>>
>> Bryan – Thanks, how do the nodes distribute the load for an input port? As
>> the port is open and listening on both nodes, does it copy the same files to
>> both nodes?
>> I need to try this setup to see the results. Appreciate your help.
>> I need to try this setup to see the results, appreciate your help.
>>
>> Thanks,
>> -Chakri
>>
>> From: Bryan Bende <bbende@gmail.com>
>> Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
>> Date: Friday, January 8, 2016 at 3:44 PM
>> To: "users@nifi.apache.org" <users@nifi.apache.org>
>> Subject: Re: Nifi cluster features - Questions
>>
>> Hi Chakri,
>>
>> I believe the DistributeLoad processor is more for load balancing when
>> sending to downstream systems. For example, if you had two HTTP endpoints,
>> you could have the first relationship from DistributeLoad going to a
>> PostHTTP that posts to endpoint #1, and the second relationship going to a
>> second PostHTTP that goes to endpoint #2.
>>
>> If you want to distribute the data within the cluster, then you need to
>> use site-to-site. The way you do this is the following...
>>
>> - Add an Input Port connected to your PutFile.
>> - Add GenerateFlowFile scheduled on primary node only, connected to a
>> Remote Process Group. The Remote Process Group should be connected to the
>> Input Port from the previous step.
>>
>> So both nodes have an input port listening for data, but only the primary
>> node produces a FlowFile and sends it to the RPG which then re-distributes
>> it back to one of the Input Ports.
>>
>> In order for this to work you need to set nifi.remote.input.socket.port in
>> nifi.properties to some available port, and you probably want
>> nifi.remote.input.secure=false for testing.
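>>
>> As a concrete sketch, the relevant entries in conf/nifi.properties on every
>> node (and on the NCM) would look like the following; port 10880 matches the
>> value used elsewhere in this thread, but any free port works:
>>
>> # Site-to-Site input settings in conf/nifi.properties
>> nifi.remote.input.socket.port=10880
>> # leave security off while testing; secure this before production use
>> nifi.remote.input.secure=false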
>>
>> -Bryan
>>
>>
>> On Fri, Jan 8, 2016 at 6:27 PM, Chakrader Dewaragatla
>> <Chakrader.Dewaragatla@lifelock.com> wrote:
>>>
>>> Mark – I have set up a two node cluster and tried the following:
>>>  GenerateFlowFile processor (Run only on primary node) —> DistributeLoad
>>> processor (RoundRobin)   —> PutFile
>>>
>>> >> The GetFile/PutFile will run on all nodes (unless you schedule it to
>>> >> run on primary node only).
>>> From your above comment, it should put files on both nodes, but it puts
>>> files on the primary node only. Any thoughts?
>>>
>>> Thanks,
>>> -Chakri
>>>
>>> From: Mark Payne <markap14@hotmail.com>
>>> Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
>>> Date: Wednesday, October 7, 2015 at 11:28 AM
>>>
>>> To: "users@nifi.apache.org" <users@nifi.apache.org>
>>> Subject: Re: Nifi cluster features - Questions
>>>
>>> Chakri,
>>>
>>> Correct - when NiFi instances are clustered, they do not transfer data
>>> between the nodes. This is very different
>>> than you might expect from something like Storm or Spark, as the key
>>> goals and design are quite different.
>>> We have discussed providing the ability to allow the user to indicate
>>> that they want to have the framework
>>> do load balancing for specific connections in the background, but it's
>>> still in more of a discussion phase.
>>>
>>> Site-to-Site is simply the capability that we have developed to transfer
>>> data between one instance of
>>> NiFi and another instance of NiFi. So currently, if we want to do load
>>> balancing across the cluster, we would
>>> create a site-to-site connection (by dragging a Remote Process Group onto
>>> the graph) and give that
>>> site-to-site connection the URL of our cluster. That way, you can push
>>> data to your own cluster, effectively
>>> providing a load balancing capability.
>>>
>>> If you were to just run ListenHTTP without setting it to Primary Node,
>>> then every node in the cluster will be listening
>>> for incoming HTTP connections. So you could then use a simple load
>>> balancer in front of NiFi to distribute the load
>>> across your cluster.
>>>
>>> Does this help? If you have any more questions we're happy to help!
>>>
>>> Thanks
>>> -Mark
>>>
>>>
>>> On Oct 7, 2015, at 2:32 PM, Chakrader Dewaragatla
>>> <Chakrader.Dewaragatla@lifelock.com> wrote:
>>>
>>> Mark - Thanks for the notes.
>>>
>>> >> The other option would be to have a ListenHTTP processor run on
>>> >> Primary Node only and then use Site-to-Site to distribute the data to
>>> >> other nodes.
>>> Let's say I have a 5 node cluster and a ListenHTTP processor on the primary
>>> node; data collected on the primary node is not transferred to the other
>>> nodes by default for processing, even though all the nodes are part of one
>>> cluster?
>>> If the ListenHTTP processor is running as the default (without the explicit
>>> setting to run on the primary node), how does the data get transferred to
>>> the rest of the nodes? Does site-to-site come into play when I make one
>>> processor run on the primary node?
>>>
>>> Thanks,
>>> -Chakri
>>>
>>> From: Mark Payne <markap14@hotmail.com>
>>> Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
>>> Date: Wednesday, October 7, 2015 at 7:00 AM
>>> To: "users@nifi.apache.org" <users@nifi.apache.org>
>>> Subject: Re: Nifi cluster features - Questions
>>>
>>> Hello Chakro,
>>>
>>> When you create a cluster of NiFi instances, each node in the cluster is
>>> acting independently and in exactly
>>> the same way. I.e., if you have 5 nodes, all 5 nodes will run exactly the
>>> same flow. However, they will be
>>> pulling in different data and therefore operating on different data.
>>>
>>> So if you pull in 10 1-gig files from S3, each of those files will be
>>> processed on the node that pulled the data
>>> in. NiFi does not currently shuffle data around between nodes in the
>>> cluster (you can use site-to-site to do
>>> this if you want to, but it won't happen automatically). If you set the
>>> number of Concurrent Tasks to 5, then
>>> you will have up to 5 threads running for that processor on each node.
>>>
>>> The only exception to this is the Primary Node. You can schedule a
>>> Processor to run only on the Primary Node
>>> by right-clicking on the Processor, and going to the Configure menu. In
>>> the Scheduling tab, you can change
>>> the Scheduling Strategy to Primary Node Only. In this case, that
>>> Processor will only be triggered to run on
>>> whichever node is elected the Primary Node (this can be changed in the
>>> Cluster management screen by clicking
>>> the appropriate icon in the top-right corner of the UI).
>>>
>>> The GetFile/PutFile will run on all nodes (unless you schedule it to run
>>> on primary node only).
>>>
>>> If you are attempting to have a single input running HTTP and then push
>>> that out across the entire cluster to
>>> process the data, you would have a few options. First, you could just use
>>> an HTTP Load Balancer in front of NiFi.
>>> The other option would be to have a ListenHTTP processor run on Primary
>>> Node only and then use Site-to-Site
>>> to distribute the data to other nodes.
>>>
>>> For more info on site-to-site, you can see the Site-to-Site section of
>>> the User Guide at
>>> http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site
>>>
>>> If you have any more questions, let us know!
>>>
>>> Thanks
>>> -Mark
>>>
>>> On Oct 7, 2015, at 2:33 AM, Chakrader Dewaragatla
>>> <Chakrader.Dewaragatla@lifelock.com> wrote:
>>>
>>> Nifi Team – I would like to understand the advantages of Nifi clustering
>>> setup.
>>>
>>> Questions :
>>>
>>>  - How does a workflow work on multiple nodes? Does it share resources
>>> across the nodes?
>>> Let's say I need to pull 10 1-gig files from S3; how does the workload get
>>> distributed? Setting concurrent tasks to 5, does it spawn 5 tasks per node?
>>>
>>>  - How do I “isolate” a processor to the master node (or one node)?
>>>
>>> - GetFile/PutFile processors in a cluster setup: do they get/put on the
>>> primary node? How do I force a processor to look at one of the slave nodes?
>>>
>>> - How can we have a workflow where on the input side we receive requests
>>> (HTTP) and then the rest of the pipeline runs in parallel on all the nodes?
>>>
>>> Thanks,
>>> -Chakro
>>>
>>> ________________________________
>>> The information contained in this transmission may contain privileged and
>>> confidential information. It is intended only for the use of the person(s)
>>> named above. If you are not the intended recipient, you are hereby notified
>>> that any review, dissemination, distribution or duplication of this
>>> communication is strictly prohibited. If you are not the intended recipient,
>>> please contact the sender by reply email and destroy all copies of the
>>> original message.
>>> ________________________________
>>>
>>>
>>> ________________________________
>>> The information contained in this transmission may contain privileged and
>>> confidential information. It is intended only for the use of the person(s)
>>> named above. If you are not the intended recipient, you are hereby notified
>>> that any review, dissemination, distribution or duplication of this
>>> communication is strictly prohibited. If you are not the intended recipient,
>>> please contact the sender by reply email and destroy all copies of the
>>> original message.
>>> ________________________________
>>>
>>>
>>
>>
>
>
>
> --
> Sent from Gmail Mobile
