lucene-solr-user mailing list archives

From Ganesh Sethuraman <ganeshmail...@gmail.com>
Subject Re: how to get high-availability for Solr csv update handler?
Date Mon, 25 Feb 2019 21:31:42 GMT
Thanks for the details and updates. We are looking at load balancers not
for the small improvement in performance, but for high availability. The
alternative is that, if the update fails on one server using curl, we call
another Solr server on error. I was also looking for a way to get the
working leader from ZooKeeper before the update; is there a way to query
ZooKeeper for that? I understand there is no guarantee that the leader
won't change during a large CSV file update, but it would at least give
some protection during planned server restarts.
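
The "on error, call another Solr server" fallback described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name, the `opener` parameter, and the host names are hypothetical, and the URL and header simply mirror the curl command quoted further down.

```python
import urllib.error
import urllib.request


def post_csv_with_failover(base_urls, collection, csv_bytes,
                           opener=urllib.request.urlopen, timeout=60):
    """Try each Solr node in turn; return the base URL that accepted the update.

    base_urls is a hypothetical list such as ["http://solr1:8983", ...].
    opener is injectable so the logic can be exercised without a live server.
    """
    errors = []
    for base in base_urls:
        url = f"{base}/solr/{collection}/update?commit=true"
        req = urllib.request.Request(
            url,
            data=csv_bytes,
            headers={"Content-type": "application/csv"},
            method="POST",
        )
        try:
            with opener(req, timeout=timeout) as resp:
                if resp.status == 200:
                    return base
                errors.append((base, f"HTTP {resp.status}"))
        except (urllib.error.URLError, OSError) as exc:
            # Node down, connection refused, timeout, or HTTP error: try the next one.
            errors.append((base, str(exc)))
    raise RuntimeError(f"all Solr nodes failed: {errors}")
```

A failed attempt may have indexed part of the file before the error, so the retry resends the whole file to the next node; this sketch assumes that resending is acceptable for the collection in question.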

Regarding the SolrJ option, it certainly seems to be the best option. Is
there a Python Solr client that is leader-aware, the way the SolrJ (Java)
client is?
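
Short of a full ZooKeeper-aware client, one possible approach from Python is to ask any live node for the cluster state through the Collections API `CLUSTERSTATUS` action and pick out the leader replicas. The sketch below is an assumption-laden illustration, not a tested recipe: the function names are hypothetical, and the response layout (`cluster` → `collections` → `shards` → `replicas`, with `leader`, `state`, and `base_url` fields) is what I understand Solr 7.x to return, so verify it against your own cluster.

```python
import json
import urllib.request


def find_leaders(cluster_status, collection):
    """Map shard name -> base_url of its active leader replica.

    Assumes the CLUSTERSTATUS JSON layout described in the lead-in.
    """
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    leaders = {}
    for shard_name, shard in shards.items():
        for replica in shard["replicas"].values():
            # The leader flag is assumed to be the string "true"; accept a bool too.
            is_leader = str(replica.get("leader", "")).lower() == "true"
            if is_leader and replica.get("state") == "active":
                leaders[shard_name] = replica["base_url"]
    return leaders


def fetch_cluster_status(any_node, collection, timeout=10):
    """Fetch CLUSTERSTATUS from any live node, e.g. "http://solr2:8983" (hypothetical)."""
    url = (f"{any_node}/solr/admin/collections"
           f"?action=CLUSTERSTATUS&collection={collection}&wt=json")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

The lookup is only a snapshot: the leader can still change between the lookup and the end of a long CSV upload, so it complements rather than replaces retrying on error.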

Regards,
Ganesh

On Mon, Feb 25, 2019 at 3:00 PM Shawn Heisey <apache@elyograg.org> wrote:

> On 2/25/2019 11:15 AM, Ganesh Sethuraman wrote:
> > We are using Solr Cloud 7.2.1. We are using Solr CSV update handler to do
> > bulk update (several Millions of docs) in to multiple collections. When we
> > make a call to the CSV update handler using curl command line (as below),
> > we are pointing to single server in Solr. During the problem time, when one
> > of the Solr server goes down this approach could fail. Is there any way
> > that we do this to send the write to the leader, like how the solrj does,
> > through the simple curl command(s) line?
>
> The SolrJ client named CloudSolrClient is able to do this because it is
> a full ZooKeeper client that has instant access to the clusterstate
> maintained by your Solr servers.
>
> To get that capability in any other client would require that the client
> is aware of the ZooKeeper ensemble in the same way.  Curl cannot do this.
>
> >
> > In the request below for some reason, if the SOLR1-SERVER is down, the
> > request will fail, even though the new leader say SOLR2-SERVER is up.
> >
> > curl 'http://<<SOLR1-SERVER>>:8983/solr/my_collection/update?commit=true' \
> >   --data-binary @example/exampledocs/books.csv \
> >   -H 'Content-type:application/csv'
> >
> > 1. I can create a load balancer / ALB in front of Solr, but that still
> > may not identify the leader for efficiency.
>
> A load balancer won't be able to identify the leader unless it is
> capable of talking to ZooKeeper and knows how Solr represents data in
> ZK.  Have you measured the efficiency improvement that comes from
> sending to the leader?  If that improvement is small, it's probably not
> worth implementing something that talks to ZooKeeper.  I know there are
> people achieving very fast indexing rates who don't try to send to
> leaders ... I suspect that the improvement obtained by sending to
> leaders is relatively small.
>
> > 2. I can write a SolrJ client to do the update, but I am not sure whether
> > I will get the efficiency of bulk update, or the simplicity of curl.
>
> SolrJ is probably more efficient than something like curl, because it
> utilizes a compact binary format for data transfer in both directions,
> called javabin.  With curl, you would most likely be using a text format
> like json, xml, or csv.
>
> SolrJ clients are fully thread-safe.  Which means you can use a single
> instance to send updates in parallel with multiple threads.  That is the
> best way to achieve good indexing performance with Solr.
>
> Thanks,
> Shawn
>
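
The advice above about sending updates in parallel from multiple threads is specific to SolrJ, but the same idea can be approximated from Python by splitting a large CSV file into chunks and posting them concurrently. This is a rough sketch only: the helper names are hypothetical, `send` stands in for whatever function posts one chunk to Solr, and the line-based split assumes no quoted field contains an embedded newline.

```python
from concurrent.futures import ThreadPoolExecutor


def split_csv(lines, rows_per_chunk):
    """Yield CSV chunks as strings, each repeating the header line.

    lines is a list of lines with their line endings kept (as from readlines()).
    """
    header, rows = lines[0], lines[1:]
    for start in range(0, len(rows), rows_per_chunk):
        yield "".join([header] + rows[start:start + rows_per_chunk])


def send_in_parallel(chunks, send, workers=4):
    """Call send(chunk) for every chunk on a thread pool; results keep chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, chunks))
```

With this layout it probably makes sense to leave `commit=true` off the individual chunk requests and commit once at the end, rather than committing per chunk.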
