lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: solr cloud does not start with many collections
Date Mon, 02 Mar 2015 16:33:21 GMT
On 3/2/2015 12:54 AM, Damien Kamerman wrote:
> I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
> collections from scratch and then attempted to stop/start the cloud.
>
> node1:
> WARN  - 2015-03-02 18:09:02.371;
> org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed
> out waiting to see all nodes published as DOWN in our cluster state.
> WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still
> seeing conflicting information about the leader of shard shard1 for
> collection DDDDDD-3219 after 30 seconds; our state says
> http://host:8002/solr/DDDDDD-3219_shard1_replica1/, but ZooKeeper says
> http://host:8000/solr/DDDDDD-3219_shard1_replica2/
>
> node2:
> WARN  - 2015-03-02 18:09:01.871;
> org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN  - 2015-03-02 18:17:04.458;
> org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
> but Solr cannot talk to ZK
> stop/start
> WARN  - 2015-03-02 18:53:12.725;
> org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still
> seeing conflicting information about the leader of shard shard1 for
> collection DDDDDD-3581 after 30 seconds; our state says
> http://host:8001/solr/DDDDDD-3581_shard1_replica2/, but ZooKeeper says
> http://host:8002/solr/DDDDDD-3581_shard1_replica1/
>
> node3:
> WARN  - 2015-03-02 18:09:03.022;
> org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed
> out waiting to see all nodes published as DOWN in our cluster state.
> WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still
> seeing conflicting information about the leader of shard shard1 for
> collection DDDDDD-2707 after 30 seconds; our state says
> http://host:8002/solr/DDDDDD-2707_shard1_replica2/, but ZooKeeper says
> http://host:8000/solr/DDDDDD-2707_shard1_replica1/

I'm sorry to hear that 5.0 didn't fix the problem.  I really hoped that
it would.

There is one other thing I'd like to try before you file a bug --
increasing zkClientTimeout to 40 seconds, to see whether it changes the
point at which it fails (or allows it to succeed).  With the
default tickTime (2 seconds), the maximum time you can set
zkClientTimeout to is 40 seconds ... which in normal circumstances is a
VERY long time.  In your situation, at least with the code in its
current state, 30 seconds (I'm pretty sure this is the default in 5.0)
may simply not be enough.
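
For reference, the 40-second ceiling comes from ZooKeeper itself: unless
minSessionTimeout/maxSessionTimeout are overridden in zoo.cfg, the server
clamps any requested session timeout to between 2 and 20 times tickTime.
A small sketch of that arithmetic follows. The function names are made up
for illustration and are not part of Solr or ZooKeeper, and it assumes the
ZooKeeper defaults are in effect.

```python
def zk_session_timeout_bounds(tick_time_ms=2000):
    """Return the (min, max) session timeout in ms that a ZooKeeper
    server grants, assuming minSessionTimeout and maxSessionTimeout
    are left at their defaults of 2x and 20x tickTime."""
    return 2 * tick_time_ms, 20 * tick_time_ms


def negotiated_timeout(requested_ms, tick_time_ms=2000):
    """Clamp a client's requested timeout (for Solr, zkClientTimeout)
    into the range the server will accept."""
    lowest, highest = zk_session_timeout_bounds(tick_time_ms)
    return max(lowest, min(requested_ms, highest))


if __name__ == "__main__":
    print(zk_session_timeout_bounds())   # bounds for tickTime=2000
    print(negotiated_timeout(30000))     # a 30-second request
    print(negotiated_timeout(60000))     # a request above the ceiling
```

With tickTime at 2000 ms, asking for more than 40000 ms gains nothing,
because the server grants 40000 at most.  Going higher would mean raising
tickTime or maxSessionTimeout on the ZooKeeper side.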

https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters

I think filing a bug, even if 40 seconds allows this to succeed, is a
good idea ... but you might want to wait for some of the cloud experts
to look at your logs to see if they have anything to add.

Thanks,
Shawn

