lucene-solr-user mailing list archives

From Joe Obernberger <joseph.obernber...@gmail.com>
Subject Recovery Issue - Solr 6.6.1 and HDFS
Date Tue, 21 Nov 2017 13:07:58 GMT
Hi All - we have a system with 45 physical boxes running Solr 6.6.1, 
with the index stored in HDFS.  The current index size is about 31 TB; 
with 3x replication that takes up 93 TB of disk.  Our main collection 
is split across 100 shards with 3 replicas each.  The issue we're 
running into occurs when restarting the Solr 6 cluster.  The shards go 
into recovery and start to use nearly all of the available network 
bandwidth.  If we start too many of the nodes at once, the shards get 
stuck in a recover, fail, and retry loop and never come up.  The errors 
are related to HDFS not responding fast enough, along with warnings from 
the DFSClient.  If we stop a node while this is happening, the stop 
script forces a stop (180-second timeout), and upon restart we find 
leftover lock files (write.lock) inside HDFS.

Our process at this point is to start one node, find the lock files, 
wait for the node to come up completely (hours), stop it, delete the 
write.lock files, and restart.  Usually this second restart is faster, 
but it can still take 20-60 minutes.
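
For reference, the "find the lock files" step could be scripted roughly 
as below.  This is only a sketch, not what we actually run: it assumes 
the standard 8-column output of `hdfs dfs -ls -R`, and the helper names 
and the /solr index root are placeholders for illustration.

```python
import subprocess


def find_write_locks(ls_output):
    """Return paths ending in /write.lock from `hdfs dfs -ls -R` output.

    Assumes the usual 8-column listing: permissions, replication, owner,
    group, size, date, time, path.
    """
    locks = []
    for line in ls_output.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].endswith("/write.lock"):
            locks.append(fields[7])
    return locks


def list_locks(index_root):
    """Run a recursive HDFS listing under index_root and return lock paths.

    index_root is a placeholder; pass wherever the Solr index lives in HDFS.
    """
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", index_root],
        check=True,
        capture_output=True,
        text=True,
    )
    return find_write_locks(result.stdout)
```

Calling list_locks("/solr") (placeholder path) would return the lock file 
paths, which can then be removed with `hdfs dfs -rm <path>` once the node 
that owns those cores is stopped, as in the process described above.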

The smaller indexes recover much faster (less than 5 minutes).  Should we 
not have used so many replicas with HDFS?  Is there a better way we 
should have built the Solr 6 cluster?

Thank you for any insight!

-Joe

