lucene-solr-user mailing list archives

From Rick Dig <teram...@gmail.com>
Subject Re: SolrCloud 6.6 stability challenges
Date Sat, 04 Nov 2017 13:12:37 GMT
> hi Emir,
> thanks for the response -
> a) we see this error occasionally when a node goes down, nothing at other
> times.
> ERROR - 2017-10-02 12:19:07.222; [c:rbconfig s:shard1 r:core_node4 x:rbconfig_shard1_replica4] org.apache.solr.common.SolrException; org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
> at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:972)
> at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1911)
> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:78)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
>
> b) GC log attached; it looks OK as far as I can tell.
>
> c) we were actually running the default of 2 maxWarmingSearchers and only
> experimented with 5; it fails in both cases.
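> For context, the setting in question lives in solrconfig.xml; a minimal
> sketch (the value shown is the Solr 6.x default we reverted to, not a
> recommendation):
>
> ```xml
> <!-- solrconfig.xml: caps how many searchers may warm concurrently;
>      Solr 6.x ships with a default of 2 -->
> <maxWarmingSearchers>2</maxWarmingSearchers>
> ```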
>
> d) we are making sure we don't commit on every bulk request.
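> Instead of per-bulk commits we rely on autoCommit; a sketch of the
> relevant solrconfig.xml fragment, assuming the 5-minute interval mentioned
> in the thread:
>
> ```xml
> <autoCommit>
>   <!-- hard commit every 5 minutes (300000 ms), flushing to disk -->
>   <maxTime>300000</maxTime>
>   <!-- do not open a new searcher on hard commit; visibility of new
>        documents is handled separately (e.g. by soft commits) -->
>   <openSearcher>false</openSearcher>
> </autoCommit>
> ```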
>
> e) we are able to index 500 documents in around 20 seconds.
>
> f) indexing without queries works just fine, and queries by themselves
> work fine as well; only the two together cause nodes to go down.
>
>
> On Sat, Nov 4, 2017 at 2:47 PM, Emir Arnautović <
> emir.arnautovic@sematext.com> wrote:
>
>> Hi Rick,
>> Do you see any errors in the logs? Do you have any monitoring tool? Maybe
>> you can check heap and GC metrics around the time the incident happened. It
>> is not a large heap, but a major GC could cause a pause long enough to
>> trigger a snowball effect and end up with the node in a recovery state.
>> What indexing rate do you observe? Why do you have max warming searchers
>> set to 5 (did you mean this with autowarmingsearchers?) when you commit
>> every 5 min? Why did you increase it - did you see errors with the default
>> of 2? Maybe you commit on every bulk?
>> Do you see similar behaviour when you just do indexing without queries?
>>
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 4 Nov 2017, at 05:15, Rick Dig <teramera@gmail.com> wrote:
>> >
>> > hello all,
>> > we are trying to run SolrCloud 6.6 in a production setting.
>> > here's our config and issue
>> > 1) 3 nodes, 1 shard, replication factor 3
>> > 2) all nodes are 16GB RAM, 4 core
>> > 3) Our production load is about 2000 requests per minute
>> > 4) index is fairly small, around 400 MB with 300k documents
>> > 5) autocommit is currently set to 5 minutes (even though ideally we
>> > would like a smaller interval).
>> > 6) the JVM runs with 8 GB Xms and Xmx and the CMS GC.
>> > 7) all of this runs perfectly OK when indexing isn't happening. as soon
>> > as we start "nrt" indexing, one of the follower nodes goes down within
>> > 10 to 20 minutes. from that point on the nodes never recover unless we
>> > stop indexing. the master is usually the last one to fall.
>> > 8) there are maybe 5 to 7 processes indexing at the same time, with
>> > document batch sizes of 500.
>> > 9) maxRamBufferSizeMB is 100, autowarmingsearchers is 5.
>> > 10) no CPU and/or OOM issues that we can see.
>> > 11) CPU load does go fairly high, 15 to 20, at times.
>> > any help or pointers appreciated
>> >
>> > thanks
>> > rick
>>
>>
>
