lucene-solr-user mailing list archives

From Hendrik Haddorp <hendrik.hadd...@gmx.net>
Subject Re: Recovery Issue - Solr 6.6.1 and HDFS
Date Tue, 21 Nov 2017 18:34:40 GMT
Hi,

I also see the write.lock issue when Solr has not been stopped 
gracefully. The write.lock files are then left behind in HDFS, because 
they are not removed automatically when the client disconnects, the way 
an ephemeral node in ZooKeeper would be. Unfortunately, Solr also does 
not realize that it should own the lock, even though the state stored in 
ZooKeeper marks it as the owner, and it is not willing to retry, which is 
why you need to restart the whole Solr instance after the cleanup. I added 
some logic to my Solr start-up script that scans the lock files in HDFS, 
compares them with the state in ZooKeeper, and then deletes all lock 
files that belong to the node that I'm starting.
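
The selection step of such a start-up check could look roughly like the 
sketch below. This is not my actual script, only an illustration under 
assumptions: it assumes the collection state from ZooKeeper is available 
as state.json text in which each replica carries "node_name" and 
"dataDir" entries, and that the list of paths comes from a recursive HDFS 
listing done elsewhere. The function names are hypothetical, and fetching 
the state and deleting the files are deliberately left out.

```python
import json
from typing import Iterable, List


def data_dirs_for_node(state_json: str, node_name: str) -> List[str]:
    """Return the data directories of all replicas assigned to node_name.

    Assumes the state.json layout collection -> "shards" -> "replicas",
    with "node_name" and "dataDir" on each replica (an assumption here).
    """
    state = json.loads(state_json)
    dirs = []
    for collection in state.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                if replica.get("node_name") == node_name and "dataDir" in replica:
                    # Normalize to exactly one trailing slash so that
                    # ".../core_node1/" never matches ".../core_node10/".
                    dirs.append(replica["dataDir"].rstrip("/") + "/")
    return dirs


def stale_locks_for_node(
    hdfs_paths: Iterable[str], state_json: str, node_name: str
) -> List[str]:
    """Pick the write.lock files that live under this node's data directories."""
    dirs = data_dirs_for_node(state_json, node_name)
    return [
        path
        for path in hdfs_paths
        if path.endswith("/write.lock") and any(path.startswith(d) for d in dirs)
    ]
```

The returned paths are the candidates to delete before the node is 
started. Lock files under directories that ZooKeeper assigns to other 
nodes are left alone, since those nodes may still be running.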

regards,
Hendrik

On 21.11.2017 14:07, Joe Obernberger wrote:
> Hi All - we have a system with 45 physical boxes running solr 6.6.1 
> using HDFS as the index.  The current index size is about 31TBytes. 
> With 3x replication that takes up 93TBytes of disk. Our main 
> collection is split across 100 shards with 3 replicas each.  The issue 
> that we're running into is when restarting the solr6 cluster.  The 
> shards go into recovery and start to utilize nearly all of their 
> network interfaces.  If we start too many of the nodes at once, the 
> shards will go into a recovery, fail, and retry loop and never come 
> up.  The errors are related to HDFS not responding fast enough and 
> warnings from the DFSClient.  If we stop a node when this is 
> happening, the script will force a stop (180 second timeout) and upon 
> restart, we have lock files (write.lock) inside of HDFS.
>
> The process at this point is to start one node, find the lock 
> files, wait for it to come up completely (hours), stop it, delete the 
> write.lock files, and restart.  Usually this second restart is faster, 
> but it can still take 20-60 minutes.
>
> The smaller indexes recover much faster (less than 5 minutes). Should 
> we not have used so many replicas with HDFS?  Is there a better way we 
> should have built the solr6 cluster?
>
> Thank you for any insight!
>
> -Joe
>

