lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: SolrCloud Nodes going to recovery state during indexing
Date Wed, 03 Jan 2018 15:11:10 GMT
Hi Sravan,
Delete-by-query (DBQ) does not play well with indexing - it causes indexing to be completely blocked
on replicas while it is running. It is highly likely the root cause of your issues. If you
can change your indexing logic to avoid it, you can quickly test this. As a workaround, you
can query for the IDs that need to be deleted and execute a bulk delete by ID - that will not
cause the issues that DBQ does.
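A minimal sketch of that workaround, assuming an `id` uniqueKey field and the JSON /update endpoint (field names, batch size, and the query are illustrative, adjust to your schema):

```python
import json

def delete_by_id_payloads(ids, batch_size=500):
    """Split a list of document IDs into JSON bodies for Solr's /update
    endpoint, e.g. {"delete": ["id1", "id2", ...]} - one POST per batch,
    instead of a single blocking delete-by-query."""
    payloads = []
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        payloads.append(json.dumps({"delete": batch}))
    return payloads

# Against a live cluster (not executed here):
#   1. ids = IDs returned by querying last_updated:[* TO NOW-1DAY]
#      (use cursorMark paging for large result sets)
#   2. for body in delete_by_id_payloads(ids):
#          POST http://host:8983/solr/<collection>/update with that body
```

Batching the deletes keeps each request small, and delete-by-id does not force the replica-side reordering that DBQ does.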

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Jan 2018, at 16:04, Sravan Kumar <sravan@caavo.com> wrote:
> 
> Emir,
>    Yes there is a delete_by_query on every bulk insert.
>    This delete_by_query deletes all documents updated less than a day
> before the current time.
>    Is bulk delete_by_query the reason?
> 
> On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <
> emir.arnautovic@sematext.com> wrote:
> 
>> Do you have deletes by query while indexing, or is it an append-only index?
>> 
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 3 Jan 2018, at 12:16, sravan <sravan@caavo.com> wrote:
>>> 
>>> SolrCloud Nodes going to recovery state during indexing
>>> 
>>> 
>>> We have solr cloud setup with the settings shared below. We have a
>> collection with 3 shards and a replica for each of them.
>>> 
>>> Normal State(As soon as the whole cluster is restarted):
>>>    - Status of all the shards is UP.
>>>    - a bulk update request of 50 documents each takes < 100ms.
>>>    - 6-10 simultaneous bulk updates.
>>> 
>>> Nodes go to recovery state after 15-30 mins of updates:
>>>    - Some shards start giving the following ERRORs:
>>>        - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.
>> DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async
>> exception during distributed update: Read timed out
>>>        - o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException:
>> Read timed out
>>>    - the following error is seen on the shard which goes to recovery
>> state.
>>>        - too many updates received since start - startingUpdates no
>> longer overlaps with our currentUpdates.
>>>    - Sometimes, the same shard even goes to DOWN state and needs a node
>> restart to come back.
>>>    - a bulk update request of 50 documents takes more than 5 seconds.
>> Sometimes even >120 secs. This is seen for all the requests if at least one
>> node is in recovery state in the whole cluster.
>>> 
>>> We have a standalone setup with the same collection schema which is able
>> to take update & query load without any errors.
>>> 
>>> 
>>> We have the following solrcloud setup.
>>>    - setup in AWS.
>>> 
>>>    - Zookeeper Setup:
>>>        - number of nodes: 3
>>>        - aws instance type: t2.small
>>>        - instance memory: 2gb
>>> 
>>>    - Solr Setup:
>>>        - Solr version: 6.6.0
>>>        - number of nodes: 3
>>>        - aws instance type: m5.xlarge
>>>        - instance memory: 16gb
>>>        - number of cores: 4
>>>        - JAVA HEAP: 8gb
>>>        - JAVA VERSION: oracle java version "1.8.0_151"
>>>        - GC settings: default CMS.
>>> 
>>>        collection settings:
>>>            - number of shards: 3
>>>            - replication factor: 2
>>>            - total 6 replicas.
>>>            - total number of documents in the collection: 12 million
>>>            - total number of documents in each shard: 4 million
>>>            - Each document has around 25 fields with 12 of them
>> containing textual analysers & filters.
>>>            - Commit Strategy:
>>>                - No explicit commits from application code.
>>>                - Hard commit of 15 secs with OpenSearcher as false.
>>>                - Soft commit of 10 mins.
>>>            - Cache Strategy:
>>>                - filter queries
>>>                    - number: 512
>>>                    - autowarmCount: 100
>>>                - all other caches
>>>                    - number: 512
>>>                    - autowarmCount: 0
>>>            - maxWarmingSearchers: 2
>>> 
>>> 
>>> - We tried the following
>>>    - commit strategy
>>>        - hard commit - 150 secs
>>>        - soft commit - 5 mins
>>>    - with the G1 garbage collector based on https://wiki.apache.org/solr/
>> ShawnHeisey#Java_8_recommendation_for_Solr:
>>>        - the nodes go to recovery state in less than a minute.
>>> 
>>> The issue is seen even when the leaders are balanced across the three
>> nodes.
>>> 
>>> Can you help us find the solution to this problem?
>> 
>> 
> 
> 
> -- 
> Regards,
> Sravan

