lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: SolrCloud failover behavior
Date Sun, 04 Nov 2012 02:51:42 GMT
SolrCloud doesn't work unless every shard has at least one server that is
up and running.

I _think_ you might be killing both nodes that host one of the shards. The
admin page has a link showing you the state of your cluster. So when this
happens, does that page show both nodes for that shard being down?
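
To make the "every shard needs at least one live server" rule concrete, here is a minimal sketch in Python. It is only a model of the rule, not Solr code. The shard names and the port-to-shard mapping are assumptions based on a typical two-shard, two-replica layout, so check the cluster state page for the real assignment.

```python
# Hypothetical layout: which nodes (identified by port) host each shard.
# This mapping is an assumption for illustration; the real assignment is
# whatever the cluster state page on the admin UI reports.
SHARD_REPLICAS = {
    "shard1": {"8983", "8900"},
    "shard2": {"7475", "7500"},
}


def shards_without_live_replica(shard_replicas, live_nodes):
    """Return the sorted names of shards with no live node hosting them."""
    live = set(live_nodes)
    return sorted(
        shard for shard, nodes in shard_replicas.items() if not (nodes & live)
    )


def can_serve_queries(shard_replicas, live_nodes):
    """A distributed query needs at least one live node for every shard."""
    return not shards_without_live_replica(shard_replicas, live_nodes)


# Killing only 7500 leaves 7475 to answer for shard2.
print(can_serve_queries(SHARD_REPLICAS, {"8983", "8900", "7475"}))

# Killing both 7500 and 7475 leaves shard2 with nobody hosting it.
print(shards_without_live_replica(SHARD_REPLICAS, {"8983", "8900"}))
```

Under that assumed layout, losing one node of a shard is survivable, while losing both nodes of the same shard matches the "no servers hosting shard" situation.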

And yeah, SolrCloud requires a quorum of ZK nodes up. So with only one ZK
node, killing that will bring down the whole cluster. Which is why the usual
recommendation is to run ZK externally, usually with an odd number of ZK
nodes (three or more).
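
The quorum arithmetic behind the odd-number recommendation can be sketched in a few lines of Python. A ZooKeeper ensemble needs a strict majority of its nodes up; the function names below are made up for this illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of ZK nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many ZK nodes can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


# Print ensemble size, quorum size, and tolerated failures for small ensembles.
for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

A single ZK node tolerates no failures, and neither does a pair. Three nodes tolerate one failure, and a fourth node adds nothing over three, which is why odd ensemble sizes are the usual choice.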

Anyone can create a login and edit the Wiki, so any clarifications are
welcome!

Best
Erick


On Sat, Nov 3, 2012 at 12:17 PM, Nick Chase <nchase@earthlink.net> wrote:

> I think there's a change in the behavior of SolrCloud vs. what's in the
> wiki, but I was hoping someone could confirm for me.  I checked JIRA and
> there were a couple of issues requesting partial results if one server
> comes down, but that doesn't seem to be the issue here.  I also checked
> CHANGES.txt and don't see anything that seems to apply.
>
> I'm running "Example B: Simple two shard cluster with shard replicas" from
> the wiki at https://wiki.apache.org/solr/SolrCloud and
> everything starts out as expected.  However, the part about failover
> behavior is where things get a little wonky.
>
> I added data to the shard running on 7475.  If I kill 7500, a query to any
> of the other servers works fine.  But if I kill 7475, rather than getting
> zero results on a search to 8983 or 8900, I get a 503 error:
>
> <response>
>    <lst name="responseHeader">
>       <int name="status">503</int>
>       <int name="QTime">5</int>
>       <lst name="params">
>          <str name="q">*:*</str>
>       </lst>
>    </lst>
>    <lst name="error">
>       <str name="msg">no servers hosting shard:</str>
>       <int name="code">503</int>
>    </lst>
> </response>
>
> I don't see any errors in the consoles.
>
> Also, if I kill 8983, which includes the Zookeeper server, everything
> dies, rather than just staying in a steady state; the other servers
> continually show:
>
> Nov 03, 2012 11:39:34 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect
> INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
> Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread run
> WARNING: Session 0x13ac6cf87890002 for server null, unexpected error,
> closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused: no further information
>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>        at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1143)
>
> Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect
>
> over and over again, and a call to any of the servers shows a connection
> error to 8983.
>
> This is the current 4.0.0 release, running on Windows 7.
>
> If this is the proper behavior and the wiki needs updating, fine; I just
> need to know.  Otherwise if anybody has any clues as to what I may be
> missing, I'd be grateful. :)
>
> Thanks...
>
> ---  Nick
>
