lucene-solr-user mailing list archives

From Shawn Heisey
Subject Re: Solr Cloud 6.5.0 Replicas go down while indexing
Date Tue, 04 Apr 2017 14:02:40 GMT
On 4/3/2017 7:52 AM, Salih Sen wrote:
> We have a three server set up with each server having 756G ram, 48
> cores, 4SSDs (each having tree solr instances on them) and a dedicated
> mechanical disk for zookeeper (3 zk instances total). Each Solr
> instances have 31G of heap space allocated to them. In total we have
> 36 Solr Instances and 3 Zookeeper instances (with 1G heapspace). Also
> servers 10Gig network between them.

You haven't described your index(es).  How many collections in the
cloud?  How many shards for each?  How many replicas for each shard? 
How many docs in each collection?  How much *total* index data is on
each of those systems?  To determine this, add up the size of the solr
home in all of the Solr instances that exist on that server.  With this
information, we can make an educated guess about whether the setup you
have engineered is reasonably correct for the scale of your data.

It sounds like you have twelve Solr instances per server, with each one
using a 31GB heap.  That's 372GB of memory JUST for Solr heaps.  Unless
you're dealing with terabytes of index data and hundreds of millions (or
billions) of documents, I cannot imagine needing that many Solr
instances per server or that much heap memory.
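
To make the arithmetic above concrete, here is a small Python sketch. The instance count, heap size, and RAM figure come from the quoted message; the function name is made up for illustration, and the calculation ignores JVM overhead beyond the heap, so real memory usage will be somewhat higher.

```python
def heap_budget_gb(instances_per_server, heap_gb_per_instance, ram_gb):
    """Return (total heap, memory left over for everything else) in GB."""
    total_heap = instances_per_server * heap_gb_per_instance
    # Whatever is left must cover the OS, ZooKeeper, and the disk cache
    # that Solr relies on for index data.
    remaining = ram_gb - total_heap
    return total_heap, remaining


# Figures from the quoted message: 12 instances, 31GB heap each, 756GB RAM.
total, remaining = heap_budget_gb(12, 31, 756)
print(f"Solr heaps: {total}GB, left for OS and disk cache: {remaining}GB")
```

Whether the leftover memory is enough depends on the total index size on each server, which is why that number matters.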

Have you increased the maximum number of processes that the user
running Solr can have?  Twelve instances of Solr will create a LOT of
threads, and on most operating systems, each thread counts against the
user process limit.  Some operating systems might have a separate
configuration for thread limits, but I do know that Linux does not, and
counts threads against the per-user process limit (ulimit -u).
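
One way to check the limit from the account that runs Solr is sketched below in Python. It assumes Linux (RLIMIT_NPROC is not available on every platform), and the per-instance thread count in the example is a hypothetical placeholder, not a measured figure; substitute what you actually observe.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) process/thread limit for the current user."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def has_headroom(soft_limit, expected_threads):
    """True if the soft limit can accommodate the expected thread count."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # "unlimited"
    return soft_limit >= expected_threads


soft, hard = nproc_limits()
# Hypothetical estimate: 12 Solr instances at 500 threads each.
expected = 12 * 500
print(f"soft={soft} hard={hard} enough for {expected} threads: "
      f"{has_headroom(soft, expected)}")
```

Run it as the same user that starts Solr, since the limit is per user.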

> We set Auto hardcommit time to 15sec and 10000 docs, and soft commit
> to 60000 sec and 5000 seconds in order to avoid soft committing too
> much and avoiding indexing bottlenecks. We also
> set DzkClientTimeout=90000.

Side issue: It's generally preferable to use only one of maxDocs or
maxTime, and maxTime will usually result in more predictable behavior,
so I recommend removing the maxDocs settings on autoCommit and
autoSoftCommit.  I doubt this will have any effect on the problem you're
experiencing; it's just something I noticed.  I recommend a maxTime of
60000 (one minute) for autoCommit, with openSearcher set to false, and a
maxTime of at least 120000 (two minutes) for autoSoftCommit.  If these
seem excessively high to you, go with 30000 and 60000.
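
As an alternative to editing solrconfig.xml by hand, these values can also be set through Solr's Config API. The Python sketch below only builds the JSON request body and sends nothing. The property names are the ones the Config API is understood to accept for set-property; verify them against the reference guide for your Solr version. The function name and the URL placeholders in the comment are illustrative assumptions. Note that this would not remove maxDocs entries already present in solrconfig.xml.

```python
import json


def commit_settings_payload(hard_commit_ms=60000, soft_commit_ms=120000):
    """Build a Config API 'set-property' body for the commit settings."""
    return {
        "set-property": {
            # Hard commit: flush to disk, but do not open a new searcher.
            "updateHandler.autoCommit.maxTime": hard_commit_ms,
            "updateHandler.autoCommit.openSearcher": False,
            # Soft commit: controls when new documents become visible.
            "updateHandler.autoSoftCommit.maxTime": soft_commit_ms,
        }
    }


body = json.dumps(commit_settings_payload(), indent=2)
print(body)
# The body would be POSTed to http://<host>:<port>/solr/<collection>/config
# (host, port, and collection are placeholders).
```

For the lower pair of values, call commit_settings_payload(30000, 60000).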

On zkClientTimeout, unless you have increased the ZK server tickTime,
you'll find that you can't actually define a zkClientTimeout that high. 
By default, the maximum is 20*tickTime.  A typical tickTime value is
2000, which means that the usual maximum value for zkClientTimeout is
40 seconds.  The error you've reported doesn't look related to
zkClientTimeout, so increasing that beyond 30 seconds is probably
unnecessary.  The default values for ZooKeeper server tuning have been
worked on by the ZK developers for years.  I wouldn't mess with tickTime
without a REALLY good reason.
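
The limit described above can be illustrated with a short Python sketch. It assumes the ZooKeeper server still uses its default session bounds (minimum 2*tickTime, maximum 20*tickTime) and that neither has been overridden in zoo.cfg; the function name is made up for illustration.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000):
    """Clamp a requested session timeout to the server's default bounds."""
    min_timeout = 2 * tick_time_ms
    max_timeout = 20 * tick_time_ms
    return max(min_timeout, min(requested_ms, max_timeout))


# The 90000 ms requested via -DzkClientTimeout, with a tickTime of 2000.
print(negotiated_session_timeout(90000))
```

With these defaults, a 90 second request ends up as a 40 second session, while a 30 second request is honored as-is.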

Another side issue: Putting ZooKeeper data on a mechanical disk when
there are SSDs available seems like a mistake to me.  ZooKeeper is even
more sensitive to disk performance than Solr is.

> But it seems replicas still randomly go down while indexing. Do you
> have any suggestions to prevent this situation?
> Caused by: Read timed out

This error says that a TCP connection (http on port 9132) from one Solr
server to another hit the socket timeout -- there was no activity on the
connection for whatever the timeout is set to.  Usually a problem like
this has one of two causes:

1) A *serious* performance issue with Solr resulting in an incredibly
long processing time.  Most performance issues are memory-related.
2) The socket timeout has been set to a very low value.

In a later message on the thread, you indicated that the configured
socket timeout is ten minutes.  This should be plenty, which makes me
think option number one above is what we are dealing with.  The
information I asked for in the first paragraph of this reply is required
for any deeper insight.

Are there other errors in the Solr logfile that you haven't included? 
It seems likely that this is not the only problem Solr has encountered.

