lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: SolrCloud Nodes going to recovery state during indexing
Date Wed, 03 Jan 2018 14:28:31 GMT
Do you have deletes by query while indexing, or is it an append-only index?
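
(By "deletes by query" I mean requests hitting /update with a body like the sketch below, as opposed to plain adds or deletes by id; the host, collection and field name are placeholders:)

    curl 'http://localhost:8983/solr/mycollection/update' \
      -H 'Content-Type: application/json' \
      -d '{"delete": {"query": "status_s:expired"}}'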

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Jan 2018, at 12:16, sravan <sravan@caavo.com> wrote:
> 
> SolrCloud Nodes going to recovery state during indexing
> 
> 
> We have a SolrCloud setup with the settings shared below. We have a collection with 3 shards and a replica for each of them.
> 
> Normal state (as soon as the whole cluster is restarted):
>     - Status of all the shards is UP.
>     - Each bulk update request of 50 documents takes < 100 ms (request format sketched below).
>     - 6-10 simultaneous bulk updates.
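> 
> (For reference, a bulk update here is a single JSON add request carrying 50 documents, roughly like the sketch below; the host, collection and field names are illustrative, and a real request has 50 documents in the array:)
> 
>     curl 'http://localhost:8983/solr/mycollection/update' \
>       -H 'Content-Type: application/json' \
>       -d '[{"id":"1","title_t":"..."},{"id":"2","title_t":"..."}]'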
> 
> Nodes go into recovery state after 15-30 mins of updates:
>     - Some shards start logging the following ERRORs:
>         - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
>         - o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException: Read timed out
>     - The following error is seen on the shard which goes into recovery state (see the note after this list):
>         - too many updates received since start - startingUpdates no longer overlaps with our currentUpdates.
>     - Sometimes the same shard even goes to DOWN state and needs a node restart to come back.
>     - A bulk update request of 50 documents takes more than 5 seconds, sometimes even >120 secs. This is seen for all requests if at least one node in the whole cluster is in recovery state.
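> 
> (For context on the "too many updates received since start" error: it comes from PeerSync — the recovering replica asks its leader for the updates it missed, but the leader's transaction-log window, 100 records by default, no longer covers them, so the replica falls back to full index replication. The window is set in the updateLog section of solrconfig.xml; the numbers below are illustrative, not our actual config:)
> 
>     <updateLog>
>       <str name="dir">${solr.ulog.dir:}</str>
>       <int name="numRecordsToKeep">1000</int>
>       <int name="maxNumLogsToKeep">100</int>
>     </updateLog>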
> 
> We have a standalone setup with the same collection schema which is able to take the update & query load without any errors.
> 
> 
> We have the following SolrCloud setup:
>     - setup in AWS.
> 
>     - ZooKeeper setup:
>         - number of nodes: 3
>         - AWS instance type: t2.small
>         - instance memory: 2 GB
> 
>     - Solr setup:
>         - Solr version: 6.6.0
>         - number of nodes: 3
>         - AWS instance type: m5.xlarge
>         - instance memory: 16 GB
>         - number of cores: 4
>         - Java heap: 8 GB
>         - Java version: Oracle Java 1.8.0_151
>         - GC settings: default CMS.
> 
>         collection settings:
>             - number of shards: 3
>             - replication factor: 2
>             - total 6 replicas.
>             - total number of documents in the collection: 12 million
>             - total number of documents in each shard: 4 million
>             - Each document has around 25 fields, 12 of them with text analyzers & filters.
>             - Commit strategy (see the solrconfig.xml sketch after this list):
>                 - No explicit commits from application code.
>                 - Hard commit every 15 secs with openSearcher=false.
>                 - Soft commit every 10 mins.
>             - Cache strategy:
>                 - filterCache:
>                     - size: 512
>                     - autowarmCount: 100
>                 - all other caches:
>                     - size: 512
>                     - autowarmCount: 0
>             - maxWarmingSearchers: 2
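> 
> (A minimal solrconfig.xml sketch of the commit and cache settings above; the cache classes shown are the stock ones and may differ from our actual config:)
> 
>     <autoCommit>
>       <maxTime>15000</maxTime>            <!-- hard commit every 15 secs -->
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>     <autoSoftCommit>
>       <maxTime>600000</maxTime>           <!-- soft commit every 10 mins -->
>     </autoSoftCommit>
> 
>     <filterCache class="solr.FastLRUCache" size="512" autowarmCount="100"/>
>     <queryResultCache class="solr.LRUCache" size="512" autowarmCount="0"/>
>     <documentCache class="solr.LRUCache" size="512" autowarmCount="0"/>
> 
>     <maxWarmingSearchers>2</maxWarmingSearchers>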
> 
> 
> - We tried the following:
>     - commit strategy:
>         - hard commit: 150 secs
>         - soft commit: 5 mins
>     - the G1 garbage collector, based on https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr (GC settings sketched below):
>         - the nodes go into recovery state in less than a minute.
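> 
> (The GC_TUNE block we used in solr.in.sh was along these lines; flags quoted from memory of that wiki page, so treat them as approximate:)
> 
>     GC_TUNE=" \
>       -XX:+UseG1GC \
>       -XX:+ParallelRefProcEnabled \
>       -XX:G1HeapRegionSize=8m \
>       -XX:MaxGCPauseMillis=250 \
>       -XX:InitiatingHeapOccupancyPercent=75 \
>       -XX:+AggressiveOpts \
>     "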
> 
> The issue is seen even when the leaders are balanced across the three nodes.
> 
> Can you help us find the solution to this problem?

