lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: how to get high-availability for Solr csv update handler?
Date Mon, 25 Feb 2019 19:59:57 GMT
On 2/25/2019 11:15 AM, Ganesh Sethuraman wrote:
> We are using Solr Cloud 7.2.1. We are using Solr CSV update handler to do
> bulk update (several Millions of docs) in to multiple collections. When we
> make a call to the CSV update handler using curl command line (as below),
> we are pointing to single server in Solr. During the problem time, when one
> of the Solr server goes down this approach could fail. Is there any way
> that we do this to send the write to the leader, like how the solrj does,
> through the simple curl command(s) line?

The SolrJ client named CloudSolrClient is able to do this because it is 
a full ZooKeeper client that has instant access to the clusterstate 
maintained by your Solr servers.

Getting that capability in any other client would require the client to 
be aware of the ZooKeeper ensemble in the same way.  Curl cannot do this.

> 
> In the request below for some reason, if the SOLR1-SERVER is down, the
> request will fail, even though the new leader say SOLR2-SERVER is up.
> 
> curl 'http://<<SOLR1-SERVER>>:8983/solr/my_collection/update?commit=true'
> --data-binary @example/exampledocs/books.csv -H
> 'Content-type:application/csv'
> 
> 1. I can create load balancer / ALB infront of solr, but that may not still
> identify the Leader for efficiency.

A load balancer won't be able to identify the leader unless it is 
capable of talking to ZooKeeper and knows how Solr represents data in 
ZK.  Have you measured the efficiency improvement that comes from 
sending to the leader?  If that improvement is small, it's probably not 
worth implementing something that talks to ZooKeeper.  I know there are 
people who achieve very fast indexing rates without trying to send to 
leaders ... I suspect that the improvement obtained by sending to 
leaders is relatively small.
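
If sending to the leader turns out not to matter much, the client only 
needs to find a node that is up.  Below is a minimal sketch of that idea 
in plain Java 11+, not a tested tool.  It relies on SolrCloud forwarding 
an update received by any node to the correct shard leader.  The class 
name, the Sender interface, and the method names are hypothetical and 
exist only for this illustration; the update URL follows the curl 
command quoted above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;

// Hypothetical helper: none of these names come from Solr or SolrJ.
final class CsvFailoverUpload {

    /** Sends one request to one base URL and returns the HTTP status code. */
    interface Sender {
        int send(String baseUrl) throws IOException, InterruptedException;
    }

    /**
     * Tries each node in order and returns the first base URL that answers
     * with HTTP 200. Unreachable nodes and error responses are skipped.
     */
    static String sendToFirstLiveNode(List<String> baseUrls, Sender sender)
            throws IOException, InterruptedException {
        IOException lastFailure = null;
        for (String baseUrl : baseUrls) {
            try {
                int status = sender.send(baseUrl);
                if (status == 200) {
                    return baseUrl;
                }
                lastFailure = new IOException(baseUrl + " answered HTTP " + status);
            } catch (IOException e) {
                lastFailure = e; // node down or unreachable: try the next one
            }
        }
        throw new IOException("no Solr node accepted the update", lastFailure);
    }

    /** Real sender: POSTs a CSV file to the collection's update handler. */
    static Sender csvSender(HttpClient http, String collection, Path csvFile) {
        return baseUrl -> {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(baseUrl + "/solr/" + collection + "/update?commit=true"))
                    .timeout(Duration.ofMinutes(10)) // assumed limit for a large file
                    .header("Content-Type", "application/csv")
                    .POST(HttpRequest.BodyPublishers.ofFile(csvFile))
                    .build();
            return http.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
        };
    }
}
```

A caller would pass the list of node URLs plus 
csvSender(HttpClient.newHttpClient(), "my_collection", path).  Note that 
a retry re-sends the whole file, so this is only reasonable when the 
documents carry a uniqueKey and repeated documents simply overwrite 
earlier copies.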

> 2. I can write a solrj client to update, but i am not sure if i will get
> the efficiency of  bulk update? not sure about the simplicity of the curl
> as well.

SolrJ is probably more efficient than something like curl, because it 
uses a compact binary format called javabin for data transfer in both 
directions.  With curl, you would most likely be using a text format 
like JSON, XML, or CSV.

SolrJ clients are fully thread-safe, which means you can use a single 
instance to send updates in parallel with multiple threads.  That is the 
best way to achieve good indexing performance with Solr.
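
The pattern described above (one shared client, several worker threads, 
each sending a batch) can be sketched as follows.  To stay free of 
dependencies, the sketch does not use SolrJ itself: the sharedSender 
parameter is a stand-in for a call on the single shared client, for 
example a SolrJ add of one batch of documents.  The class and method 
names, the batch size, and the thread count are all assumptions made 
for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper: not part of SolrJ.
final class ParallelIndexer {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /**
     * Sends every batch through the one shared sender using a fixed pool of
     * worker threads, and returns the number of items sent.
     */
    static <T> int indexInParallel(List<T> docs, int batchSize, int threads,
                                   Consumer<List<T>> sharedSender) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                pending.add(pool.submit(() -> {
                    sharedSender.accept(batch); // stand-in for a call on the shared client
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> result : pending) {
                sent += result.get(); // a worker failure surfaces here as ExecutionException
            }
            return sent;
        } finally {
            pool.shutdown();
        }
    }
}
```

The sender must itself be safe to call from several threads at once, 
which is the property the SolrJ clients provide.  Good values for the 
batch size and thread count depend on the hardware and the documents, 
so they are worth measuring rather than guessing.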

Thanks,
Shawn
