nifi-users mailing list archives

From Srikanth <srikanth...@gmail.com>
Subject Re: Solr Cloud support
Date Fri, 04 Sep 2015 02:25:54 GMT
Bryan,

That was my first thought too, but then I felt I was over-complicating it ;-)
I didn't realize such a pattern was used in other processors.

If you don't mind, can you elaborate more on this...

> *"In a cluster you would likely set this up by having
> DistributeSolrCommand send to a Remote Process Group that points back to an
> input port of itself, and the input port feeds into ExecuteSolrCommand."*


Srikanth

On Thu, Sep 3, 2015 at 9:32 AM, Bryan Bende <bbende@gmail.com> wrote:

> Srikanth/Joe,
>
> I think I understand the scenario a little better now, and to Joe's points
> - it will probably be clearer how to do this in a more generic way as we
> work towards the High-Availability NCM.
>
> Thinking out loud here given the current state of things, I'm wondering if
> the desired functionality could be achieved by doing something similar to
> ListHDFS and FetchHDFS... what if there was a DistributeSolrCommand and
> ExecuteSolrCommand?
>
> DistributeSolrCommand would be set to run on the Primary Node and would be
> configured with similar properties to what GetSolr has now (ZK hosts, a
> query, timestamp field, distrib=false, etc.). It would query ZooKeeper and
> produce a FlowFile for each shard, and the FlowFile would use either the
> attributes or the payload to capture all of the processor property values
> plus the shard information, basically producing a command for a downstream
> processor to run.
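>
> To make that a bit more concrete, the "command" FlowFile coming out of
> DistributeSolrCommand might carry attributes along these lines (nothing is
> implemented, so the attribute names and values are purely illustrative):
>
>   solr.zk.hosts   = zk1:2181,zk2:2181,zk3:2181
>   solr.collection = mycollection
>   solr.shard.url  = http://solr1:8983/solr/mycollection_shard1_replica1
>   solr.query      = *:*
>   solr.date.field = last_modified
>   solr.distrib    = false
>
> One FlowFile per shard, so a 10-shard collection would produce 10 of these
> each time the processor runs.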
>
> ExecuteSolrCommand would be running on every node and would be responsible
> for interpreting the incoming FlowFile and executing whatever operation was
> being specified, and then passing on the results.
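>
> Neither processor exists yet, so purely as a sketch, ExecuteSolrCommand's
> onTrigger might look roughly like this (imports, relationships, and property
> descriptors omitted; it just reads the hypothetical attributes above and
> queries the single shard with plain SolrJ):
>
>   @Override
>   public void onTrigger(final ProcessContext context, final ProcessSession session)
>           throws ProcessException {
>       FlowFile flowFile = session.get();
>       if (flowFile == null) {
>           return;
>       }
>
>       // hypothetical attribute names written by DistributeSolrCommand
>       final String shardUrl = flowFile.getAttribute("solr.shard.url");
>       final String queryStr = flowFile.getAttribute("solr.query");
>
>       final SolrQuery query = new SolrQuery(queryStr);
>       query.set("distrib", false); // query only this shard, no distributed fan-out
>
>       try (HttpSolrClient solr = new HttpSolrClient(shardUrl)) {
>           final QueryResponse response = solr.query(query);
>           // write the shard's results into the FlowFile content
>           // (a real processor would pick a proper serialization format)
>           final byte[] results = response.getResults().toString().getBytes(StandardCharsets.UTF_8);
>           flowFile = session.write(flowFile, out -> out.write(results));
>           session.transfer(flowFile, REL_SUCCESS);
>       } catch (SolrServerException | IOException e) {
>           session.transfer(session.penalize(flowFile), REL_FAILURE);
>       }
>   }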
>
> In a cluster you would likely set this up by having DistributeSolrCommand
> send to a Remote Process Group that points back to an input port of itself,
> and the input port feeds into ExecuteSolrCommand.
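> Roughly, the wiring on the graph would look something like this (just a
> sketch of the loop-back pattern, port name made up):
>
>   DistributeSolrCommand        (scheduled On Primary Node only)
>        |
>        v
>   Remote Process Group         (pointing back at this same cluster)
>        |
>        v                       (site-to-site spreads the FlowFiles across all nodes)
>   Input Port "solr-commands"
>        |
>        v
>   ExecuteSolrCommand           (running on every node)
>
> The Remote Process Group is what distributes the per-shard command FlowFiles
> across the cluster, so each node ends up executing a subset of the commands.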
> This would give you the automatic querying of each shard, but you would
> still have DistributeSolrCommand running on one node and needing to be
> manually failed over, until we address the HA stuff.
>
> This would be a fair amount of work, but food for thought.
>
> -Bryan
>
>
> On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.witt@gmail.com> wrote:
>
>> <- general commentary not specific to the solr case ->
>>
>> This concept of being able to have nodes share information about
>> 'which partition' they should be responsible for is a generically
>> useful and very powerful thing.  We need to support it.  It isn't
>> immediately obvious to me how best to do this as a generic and useful
>> thing, but a controller service on the NCM could potentially assign
>> 'partitions' to the nodes.  ZooKeeper could be an important part.  I
>> think we need to tackle the HA NCM construct we talked about months
>> ago before we can do this one nicely.
>>
>> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth.ht@gmail.com> wrote:
>> > Bryan,
>> >
>> > <Bryan> --> "I'm still a little bit unclear about the use case for
>> > querying the shards individually... is the reason to do this because of a
>> > performance/failover concern?"
>> > <Srikanth> --> The reason to do this is to achieve better performance
>> > with the convenience of automatic failover.
>> > In the current mode, we do get very good failover offered by Solr.
>> > Failover is seamless.
>> > At the same time, we are not getting the best performance. I guess it's
>> > clear to us why having each NiFi process query each shard with
>> > distrib=false will give better performance.
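>> > For example, hitting one shard's core directly with something like
>> >
>> >   http://solr1:8983/solr/mycollection_shard1_replica1/select?q=*:*&distrib=false
>> >
>> > (host and core names made up, distrib=false is a standard Solr parameter)
>> > returns only that shard's documents, instead of one node fanning the query
>> > out to every shard and merging the results before responding.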
>> >
>> > Now, the question is how we achieve this. Making the user configure one
>> > NiFi processor for each Solr node is one way to go.
>> > I'm afraid this will make failover a tricky process. It may even need
>> > human intervention.
>> >
>> > Another approach is to have the cluster master in NiFi talk to ZK and
>> > decide which shards to query, then divide these shards among the slave
>> > nodes.
>> > My understanding is that the NiFi cluster master is not intended for such
>> > a purpose.
>> > I'm not sure if this is even possible.
>> >
>> > Hope I'm a bit more clear now.
>> >
>> > Srikanth
>> >
>> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbende@gmail.com> wrote:
>> >>
>> >> Srikanth,
>> >>
>> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to
>> >> both the dev and users list now :)
>> >>
>> >> I'm still a little bit unclear about the use case for querying the
>> >> shards individually... is the reason to do this because of a
>> >> performance/failover concern? Or is it something specific about how the
>> >> data is shared?
>> >>
>> >> Let's say you have your Solr cluster with 10 shards, each on their own
>> >> node for simplicity, and then your ZooKeeper cluster.
>> >> Then you also have a NiFi cluster with 3 nodes, each with their own NiFi
>> >> instance, the first node designated as the primary, and a fourth node as
>> >> the cluster manager.
>> >>
>> >> Now if you want to extract data from your Solr cluster, you would do
>> >> the following...
>> >> - Drag GetSolr on to the graph
>> >> - Set type to "cloud"
>> >> - Set the Solr Location to the ZK hosts string
>> >> - Set the scheduling to "Primary Node"
>> >>
>> >> When you start the processor it is now only running on the first NiFi
>> >> node, and it is extracting data from all your shards at the same time.
>> >> If a Solr shard/node fails, this would be handled for us by the SolrJ
>> >> CloudSolrClient, which uses ZooKeeper to know about the state of things,
>> >> and would choose a healthy replica of the shard if it existed.
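>> >> For reference, what the cloud path boils down to on the SolrJ side is
>> >> roughly this (a simplified sketch; the ZK string and collection name are
>> >> made up, and exception handling is left out):
>> >>
>> >>   import org.apache.solr.client.solrj.SolrQuery;
>> >>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>> >>   import org.apache.solr.client.solrj.response.QueryResponse;
>> >>
>> >>   // the client watches cluster state in ZooKeeper and routes each request
>> >>   // to a live replica, which is why shard/node failover is handled for us
>> >>   CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
>> >>   solr.setDefaultCollection("mycollection");
>> >>   solr.connect();
>> >>
>> >>   QueryResponse response = solr.query(new SolrQuery("*:*"));
>> >>   solr.close();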
>> >> If the primary NiFi node failed, you would manually elect a new primary
>> >> node and the extraction would resume on that node (this will get better
>> >> in the future).
>> >>
>> >> I think if we expose distrib=false it would allow you to query shards
>> >> individually, either by having a NiFi instance with a GetSolr processor
>> >> per shard, or several mini-NiFis each with a single GetSolr, but
>> >> I'm not sure if we could achieve the dynamic assignment you are thinking
>> >> of.
>> >>
>> >> Let me know if I'm not making sense, happy to keep discussing and
>> >> trying to figure out what else can be done.
>> >>
>> >> -Bryan
>> >>
>> >> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth.ht@gmail.com> wrote:
>> >>>
>> >>>
>> >>> Bryan,
>> >>>
>> >>> That is correct, having the ability to query nodes with "distrib=false"
>> >>> is what I was talking about.
>> >>>
>> >>> Instead of the user having to configure each Solr node in a separate
>> >>> NiFi processor, can we provide a single configuration?
>> >>> It would be great if we can take just the ZooKeeper (ZK) host as input
>> >>> from the user and
>> >>>   i) determine all nodes for a collection from ZK
>> >>>   ii) let each NiFi processor take ownership of querying a node with
>> >>> "distrib=false"
>> >>>
>> >>> From what I understand, NiFi slaves in a cluster can't talk to each
>> >>> other.
>> >>> Will it be possible to do the ZK query part in the cluster master and
>> >>> have the individual Solr nodes propagated to each slave?
>> >>> I don't know how we can achieve this in NiFi, if at all.
>> >>>
>> >>> This will make the Solr interface to NiFi much simpler. The user needs
>> >>> to provide just ZK.
>> >>> We'll be able to take care of the rest, including failing over to an
>> >>> alternate Solr node when the current one fails.
>> >>>
>> >>> Let me know your thoughts.
>> >>>
>> >>> Rgds,
>> >>> Srikanth
>> >>>
>> >>> P.S.: I had subscribed only to the digest and didn't receive your
>> >>> original reply. Had to pull this up from the mail archive.
>> >>> Only the dev list is on Nabble!
>> >>>
>> >>>
>> >>>
>> >>> ***************************************************************************
>> >>>
>> >>> Hi Srikanth,
>> >>>
>> >>> You are correct that in a NiFi cluster the intent would be to schedule
>> >>> GetSolr on the primary node only (on the scheduling tab) so that only
>> >>> one node in your cluster was extracting data.
>> >>>
>> >>> GetSolr determines which SolrJ client to use based on the "Solr Type"
>> >>> property, so if you select "Cloud" it will use CloudSolrClient. It
>> >>> would send the query to one node based on the cluster state from
>> >>> ZooKeeper, and then that Solr node performs the distributed query.
>> >>>
>> >>> Did you have a specific use case where you wanted to query each shard
>> >>> individually?
>> >>>
>> >>> I think it would be straightforward to expose something on GetSolr
>> >>> that would set "distrib=false" on the query so that Solr would not
>> >>> execute a distributed query. You would then most likely create separate
>> >>> instances of GetSolr and configure them as Standard type pointing at
>> >>> the respective shards. Let us know if that is something you are
>> >>> interested in.
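>> >>>
>> >>> As a rough sketch of that per-shard setup ("Solr Type" and "Solr
>> >>> Location" are existing GetSolr properties; the distrib toggle is the
>> >>> hypothetical new piece, and host/core names are made up):
>> >>>
>> >>>   GetSolr (one instance per shard)
>> >>>     Solr Type:         Standard
>> >>>     Solr Location:     http://solr1:8983/solr/mycollection_shard1_replica1
>> >>>     Distributed Query: false   <-- would add distrib=false to the query
>> >>>
>> >>> with a second instance pointed at the shard2 core, and so on.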
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Bryan
>> >>>
>> >>>
>> >>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth.ht@gmail.com> wrote:
>> >>>
>> >>> > Hello,
>> >>> >
>> >>> > I started to explore the NiFi project a few days back. I'm still
>> >>> > trying it out.
>> >>> >
>> >>> > I have a few basic questions on GetSolr.
>> >>> >
>> >>> > Should GetSolr be run as an Isolated Processor?
>> >>> >
>> >>> > If I have SolrCloud with 4 shards/nodes and a NiFi cluster with 4
>> >>> > nodes, will GetSolr be able to query each shard from one specific
>> >>> > NiFi node? I'm guessing it doesn't work that way.
>> >>> >
>> >>> >
>> >>> > Thanks,
>> >>> > Srikanth
>> >>> >
>> >>> >
>> >>
>> >>
>> >
>>
>
>
